Bioinformatics Advance Access originally published online on February 24, 2005
Bioinformatics 2005 21(10):2456-2462; doi:10.1093/bioinformatics/bti352
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Haplotype reconstruction from SNP fragments by minimum error correction
1Academy of Mathematics and Systems Science, Chinese Academy of Sciences Beijing 100080, China
2Beijing Materials Institute Beijing 101149, China
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments is to infer a pair of haplotypes from localized polymorphism data gathered through short genome fragment assembly. An important computational model of this problem is the minimum error correction (MEC) model, which has been mentioned in several literatures. The model retrieves a pair of haplotypes by correcting minimum number of SNPs in given genome fragments coming from an individual's DNA.
Results: In the first part of this paper, an exact algorithm for the MEC model is presented. Owing to the NP-hardness of the MEC model, we also design a genetic algorithm (GA). The designed GA is intended to solve large size problems and has very good performance. The strength and weakness of the MEC model are shown using experimental results on real data and simulation data. In the second part of this paper, to improve the MEC model for haplotype reconstruction, a new computational model is proposed, which simultaneously employs genotype information of an individual in the process of SNP correction, and is called MEC with genotype information (shortly, MEC/GI). Computational results on extensive datasets show that the new model has much higher accuracy in haplotype reconstruction than the pure MEC model.
Contact: wangrsh{at}amss.ac.cn
| 1 INTRODUCTION |
|---|
|
|
|---|
With complete genome sequences for humans now available, the investigation on genetic differences will be one of the main topics in genomics. It is generally accepted that all humans share
99% identity at the DNA level and only few regions of differences in DNA sequences are responsible for genetic diseases (Terwilliger and Weiss, 1998; Hoehe et al., 2000). Among various genetic variations, single nucleotide polymorphism (SNP), a single DNA base varying from one individual to another, is believed to be the most frequent form to address genetic differences (Chakravarti, 1998) and has fundamental importance for drug-design and medical applications. Many researches have been carried out for determining the SNP sites or designing a detailed SNP map for human genome (Altshuler et al., 2000; Helmuth, 2001). An SNP is a base pair position in genomic DNA where different nucleotide variants exist in some populations and each variant is called an allele. Almost all SNPs have two different alleles, which we denote by 0 (wild type) and 1 (mutant type). In diploid organisms, genomes are organized into pairs of chromosomes (a paternal copy and a maternal copy). The SNP sequence information on each copy of a pair of chromosomes is called a haplotype, which is a string over {0, 1}. A genotype is the conflation of two haplotypes on the homologous chromosomes. When a pair of alleles at an SNP site is made up of two identical values, this SNP site is called homozygous site, otherwise it is called heterozygous site.
A recent work (Stephens et al., 2001) indicates that haplotypes generally have more information content than individual SNPs in disease association studies, but it is substantially more difficult to determine haplotypes than to determine genotypes or individual SNPs through experiments. Hence, computational methods that can reduce the cost of determining haplotypes become attractive alternatives. There are generally two classes of computational methods. One class concerns with inferring haplotypes from the genotype samples in a population. There are several models on this line based on different assumptions on the biological system under consideration (Clark, 1990; Gusfield, 2002; Wang and Xu, 2003; Halperin and Eskin, 2004). The second class, called single individual haplotyping or haplotype assembly, is based on the data and methodology of shotgun sequence assembly (Lancia et al., 2001; Lippert et al., 2002). The input data are short genome fragments with SNPs coming from DNA shotgun sequencing or generated by a resequencing effort for the purpose of large-scale haplotyping. When we focus on SNP positions, these short genome fragments are actually aligned SNP fragments. Such methods are concerned with the partitioning of the aligned SNP fragments into two sets according to the SNP states, with each set determining a haplotype. DNA sequencing errors and the diploidy of human genome make the problem complex. This paper falls into this category.
While emphasizing different types of errors, Lancia et al. (2001) have given two models for the haplotype assembly problem, the minimum fragment removal (MFR) model and the minimum SNP removal (MSR) model. Practical algorithms for these two models were proposed by Rizzi et al. (2002). Later, a statistical version of haplotype reconstruction based on SNP fragments was studied by Li et al. (2004). Another important computational model for the haplotype assembly problem, the minimum error correction (MEC) model, was proposed and proved to be NP-hard by Lippert et al. (2002). This model assumes that all the fragments come from one organism but there are sequencing errors to be corrected (this case is very practical). Designing practical algorithms for the MEC model is still an interesting and unsolved problem as indicated by review papers (Bonizzoni et al., 2003; Greenberg et al., 2004).
In the first part (Sections 24) of this paper, we present an exact algorithm based on branch-and-bound method for the MEC model. Owing to the NP-hardness of the MEC model, a heuristic method based on genetic algorithm (GA) is also designed to solve this model. The designed GA is intended to solve large size problems and has very good performance. It can always find optimal solutions for most instances. Experimental results on real data and simulation data show the strength and weakness of the MEC model. In order to improve the MEC model, a new computational model for haplotype reconstruction from SNP fragments, which combines genotype information of an individual into the MEC model, is proposed in the second part (Section 5) of this paper. We call it MEC with genotype information, shortly MEC/GI. Computational results on extensive datasets show that the MEC/GI model has a much higher accuracy in haplotype reconstruction than the MEC model.
| 2 FORMULATION AND PROBLEM |
|---|
|
|
|---|
Suppose that there are m SNP fragments from a pair of chromosomes and the length of the corresponding haplotypes is n. Define an m x n SNP matrix M = (mij), whose every entry mij has value 0, 1 or (missing or skipped base, we call it gap). Each row corresponds to an SNP fragment and each column corresponds to an SNP site. Since the given SNP fragments may have different lengths, generally less than n, we assign valueto the uncovered elements in a row.
![]() | (1) |
. If HD(mi,mk) > 0, we say two fragments mi and mk are in conflict, otherwise we call them compatible. HD(mi,mk) > 0 indicates that either mi or mk is not from the same chromosome copy or there are errors in the data. The distance between a fragment and a haplotype is similarly defined. If there are no errors in the data, the rows of M can be divided into two disjoint sets M1, M2 of pairwise compatible fragments and we can infer two haplotypes by fragment overlap. At this time we say M is feasible, otherwise infeasible. So, the problem of haplotype reconstruction from SNP fragments is to find a minimum number of operations (e.g. remove fragments, SNPs or errors) on an SNP matrix such that it becomes feasible. Specifically:
MEC Given an SNP matrix M = (mij), correct a minimum number of elements (0 into 1 and vice versa) so that the resulting matrix is feasible, i.e. the corrected SNP fragments can be divided into two disjoint sets of pairwise compatible fragments, with each set determining a haplotype.
| 3 METHODS |
|---|
|
|
|---|
In this section, we first present a branch-and-bound algorithm to get exact optimal solutions to the MEC model, then a heuristic method based on GA is designed for large size instances. First, we present some pivotal components for the algorithms in this paper.
The concept of haplotype assembly from SNP fragments focuses on how aligned SNP fragments can be partitioned into two sets according to the SNP states, with each set determining a haplotype. Let (h1, h2) denote a pair of haplotypes from a pair of chromosomes and M represents an SNP matrix corresponding to a set of SNP fragments from this pair of chromosomes. A partition P = (C1,C2) of M divides the rows of M into two disjoint sets that represent a classification of the SNP fragments. For a partition P = (C1,C2), we need to infer two haplotypes by fragment overlap. Let F denote a subset of the rows of M and
,
denote the number of 0s, 1s, respectively, in column j when we focus on the rows in F. The haplotypes h1, h2 formed from P are determined by
![]() | (2) |
) or P = (
,C2) means that all fragments come from the same copy of chromosome and we can only infer one haplotype using (2). It is easy to see that for a fixed partition, generating haplotypes by this method requires fewer error corrections than by any other means.
For a partition P = (C1, C2) of M, we define an error function:
![]() | (3) |
E(P) for any partition P, then we say P* is an optimal solution for the MEC model. Given the error function, the goal of the MEC model is to find the best partition of M.
3.1 An exact algorithm
The aim of the branch-and-bound algorithm (Koontz et al., 1975) for the MEC model is to consider all possible classifications of SNP fragments, i.e. to search the tree of all possible solutions and find the best one. When the error value of a partial classification (i.e. only a fraction of SNP fragments are classified) is greater than current upper bound, then we need not consider all the complete classifications developed upon this partial classification. That is to say, we can abort any branch of the tree that will not lead to a better solution. Therefore, such a branch-and-bound algorithm can always find an exact optimal partition.
There are several ways to find a feasible solution acting as initial upper bound, e.g. we can randomly assign a class-membership to all SNP fragments. Although the running time of the algorithm is theoretically exponential in terms of the input size (the number of fragments), it can be used to illustrate the biological rationality of the MEC model, i.e. under certain conditions (error rate, gap rate, etc.), the haplotypes generated using a minimum number of error corrections are indeed the real haplotypes. This will be further discussed in Section 4. The details of the branch-and-bound algorithm are illustrated in Table 1.
|
Our algorithm is in fact to search a path in a binary tree, in which the node on the j-th level denotes the j-th fragment and the branch on the path connecting its child denotes its corresponding class-membership. Owing to the symmetry of complete binary trees, we only need to search half of the tree for all possible solutions. To speed up the algorithm, we can replace E(c1,c2,...,ck) with a tighter lower-bound similar to the one used by Koontz et al. (1975) when the search process arrives at a certain depth. It is easy to verify E(c1,c2,...,cm)
E(c1,c2,...,ck) + E(ck+1,ck+2,...,cm) and
, where
is the optimal partition of fragments k + 1,k + 2,...,m. Therefore,
could be a better lower bound for E(c1,c2,...,cm) than E(c1,c2,...,ck). When k is relatively large,
can be quickly computed.
3.2 A heuristic method
Given the NP-hardness of the MEC problem, we cannot depend on exact algorithms such as branch-and-bound algorithm to solve large size problems within an acceptable time period. This motivated us to design a heuristic method that is likely to perform well in practice. A GA (Goldberg, 1989), which has a random search mechanism and does not require to consider every feasible solution similar to the branch-and-bound algorithm, is designed here.
3.2.1 Construction of the hypothesis space
We use a binary string to express an individual code which represents a classification of SNP fragments (a feasible solution to the MEC model). The length of individuals in the hypothesis space is the number of SNP fragments. After labeling SNP fragments by 1,2,...,m, the value 0 or 1 on the i-th position of an individual characterizes the class-membership of the i-th SNP fragment. Thus, all of the binary strings having length of m constitute the hypothesis space: H = {(x1,x2,...,xm)|xi
{0,1}, i = 1,2,...,m}.
3.2.2 Designation of the fitness function
For every individual in a population, we need to assign an evaluation. According to the goal of the MEC model, the goodness or badness fit an individual is dependent on the number of error corrections needed for the corresponding classification. Hence, the following fitness function was designed:
![]() | (4) |
1 and the fitness of an individual is reversely proportional to the corresponding error correction number. It is easy to see that an SNP matrix is feasible if and only if there exists an individual {x1,x2,...,xm} in the hypothesis space H such that f(x1,x2,...,xm) = 1.
3.2.3 Genetic operators
There are several kinds of selection operators, e.g. tournament selection, rank selection and roulette wheel selection, etc. among which roulette wheel selection is very popular. However, in order to avoid the crowding phenomenon and to yield a more diverse population, we adopt both tournament selection and roulette wheel selection in different parts of our GA. A combination of single-point crossover and uniform crossover is adopted. In addition, according to the principle of the fittest survives, we use roulette wheel selection operator to select individuals to crossover and generate offsprings. Similarly, we adopted the combination of single-point mutation and swap mutation in our GA.
3.2.4 Designation of GA
The details of our implementation of GA are described as follows:
- Step 0: Give proper parameter settings, e.g. population size popsize, crossover rate pc, mutation rate pm and the maximum number of population generation gnumber.
- Step 1: Randomly generate popsize individuals as an initial population P0, k = 0.
- Step 2: Evaluate Pk, i.e. compute the fitness of every individual in Pk using the formula (4) and retain the individual with the highest fitness.
- Step 3: If k > gnumber, stop, return the individual with the highest fitness in the history, otherwise create a new generation Pk+1 using the following method:
- Select (1 pc)popsize members of Pk and add them to Pk+1 using tournament selection operator.
- Probabilistically select
pairs of individuals from Pk using roulette wheel selection operator. For each pair (s1,s2), produce two offspring by randomly applying single-point crossover operator and uniform crossover operator. Add all offspring to Pk+1.
- Select pm percentage of the individuals in Pk+1 with uniform probability. For each, invert the value at a randomly selected position, or swap the values at two randomly selected positions.
- Step 4: k = k + 1, go to Step 2.
- Step 1: Randomly generate popsize individuals as an initial population P0, k = 0.
| 4 EXPERIMENT RESULTS |
|---|
|
|
|---|
In this section, we will use both real data and simulation data to test the MEC model and two algorithms for haplotype reconstruction. The algorithms are implemented on a 2.26 GHz Pentium 4 PC using Microsoft Visual C++ compiler 6.
In our experiments, we use reconstruction rate (RR), the similarity degree between the original haplotypes and the reconstructed haplotypes, to measure performance of an algorithm or a model. Assuming h = (h1, h2) to be the original haplotypes, and
= (
1,
2) to be the reconstructed haplotypes. We define RR as follows:
![]() | (5) |
j), i = 1,2, j = 1,2. For a partition P = (C1, C2), the corresponding error correction number is defined by the formula (3).
4.1 Experiment on angiotensin converting enzyme (ACE)
ACE has a key function in the reninangiotensin system, hence several association studies have been performed with DCP1 (encode ACE). Rieder et al. (1999) completed the genomic sequencing of the DCP1 gene from 11 individuals and reported 78 SNP sites in 22 chromosomes. Out of the 78 varying sites, 52 were non-unique polymorphic sites.
Among these 11 individuals, there are two identical genotypes. We omit one of them. In addition, we omit the genotypes with no more than one heterozygous site whose haplotypes can be inferred immediately. Now each of the 8 pairs of haplotypes were used to generate 12 instances, in which SNP fragments are randomly generated according to different parameter settings: the number of SNP fragments m = 20; the gap rate of fragments g: 0.25, 0.5 and 0.75; the error rate of fragments e: from 0.1 to 0.4. The reconstruction rates of the MEC model averaged on eight individuals by the branch-and-bound algorithm are illustrated in Figure 1a. The branch-and-bound algorithm solves each of these instances in no more than one second.
|
From Figure 1a, when error rate and gap rate are relatively low, the reconstructed haplotypes by using the MEC model are almost the same as the real haplotypes. It shows that the MEC model is effective under these cases. When error rate and gap rate are high, the reconstructed haplotypes are not the same as the real haplotypes. Moreover, the greater the error rate, the greater the difference between them. This indicates that the MEC model cannot reconstruct haplotypes with high accuracy when error rate of SNP fragments is high, even if an exact algorithm is employed.
4.2 Experiment on simulation data
In this subsection, we use simulation data to evaluate the MEC model for haplotype reconstruction. First, 10 pairs of haplotypes with 50 SNP sites were randomly generated according to a parameter similarity rate s (denotes the similarity degree between two haplotypes in one pair). Then, each of the 10 pairs of haplotypes were used to generate 12 instances, in which SNP fragments are randomly generated according to: s = 0.5; m = 20; g: 0.25, 0.5 and 0.75; e: from 0.1 to 0.4. The reconstruction rates of the MEC model averaged over 10 pairs of haplotypes by the branch-and-bound algorithm are illustrated in Figure 1b, which shows that the results of the MEC model on simulation data under s = 0.5 are similar to those in the last subsection. Another 120 instances were generated according to the above parameter settings except that s = 0, i.e. every SNP site in the genotypes corresponding to 10 pairs of haplotypes is a heterozygous site (this is an extreme case). The results of the MEC model on simulation data under s = 0 are summarized in Figure 1c. From Figure 1, we can see that the MEC model has lower reconstruction rate in the extreme case with more heterozygous sites, when error rate of SNP fragments is high (e.g. e = 0.4). The branch-and-bound algorithm solves each of these instances in no more than 1 s.
4.3 Experiment on data from chromosome 5q31
We performed simulations using the data from public Daly set (Daly et al., 2001). They reported a high-resolution analysis of a haplotype structure across 500 kb on chromosome 5q31 using 103 SNPs in a European-derived population which consists of 129 trios. The haplotypes of 129 children from the trios can be inferred from the genotypes of their parents through pedigree information and the non-transmitted chromosomes as an extra 129 (pseudo) haplotype pairs. Markers for which both alleles could not be inferred are marked as missing. From the resulting set of 258 haplotype pairs, the ones with more than 20% missing alleles are removed, leaving 147 haplotype pairs. From among these pairs, 18 genotypes with no more than one heterozygous site were omitted, leaving 129 pairs of haplotypes as the test set.
The computational time complexity of the branch-and-bound algorithm for the MEC problem is exponential with respect to the number of SNP fragments and has less relevance to the number of SNP sites, because the size of searching tree is exponential in the number of SNP fragments. Of course, the number of SNP sites affects the number of SNP fragments for fixed coverage degree, i.e. the number of fragments that cover the same position of the haplotypes. In addition, the running time of the algorithm heavily depends on the error rate of fragments, which is obvious in Table 2. Generally, the branch-and-bound algorithm can quickly process instances within 30 SNP fragments and 50 SNP sites under various error rates. Beyond this range, the time taken by the algorithm increases rapidly, especially when the error rate of fragments is high. Therefore, in order to solve large size instances, it is better to adopt the designed genetic algorithm.
|
First, we will use some instances to illustrate that, though the designed GA is a heuristic algorithm, its performance is quite comparable with that of the exact algorithm for the MEC problem. For example, we use ID 27, a pair of haplotypes with a length of 95 after removing 8 missing sites, to generate 9 instances, in which SNP fragments are randomly generated according to: m = 40; g: 0.25, 0.5 and 0.75; e: from 0.1 to 0.3. For the cases in this section, we set popsize = 400, pc = 0.8, pm = 0.2 and gnumber = 150. The comparitive results of two algorithms for the MEC model are summarized in Table 2, where the results of the GA are averaged over 10 runs.
From Table 2 we can see that GA for the MEC problem has very good performance. In most cases, it can find optimal solutions. Where it cannot find optimal solutions, it still finds solutions that are very close to optimal solutions. In addition, GA has a very comparable reconstruction rate with the branch-and-bound algorithm. Although GA has a random mechanism, it is very robust for the MEC model. It can return identical results in 10 runs for most instances. We also see that even for the same number of SNP fragments, the running time of the branch-and-bound algorithm varies from several seconds to several hours when error rate of SNP fragments becomes higher from 0.1 to 0.3. However, the running time of the designed GA is only several seconds even for instances with high error rate.
In order to further illustrate the designed GA for the MEC model. Each of 129 pairs of haplotypes is used to generate 12 instances, in which SNP fragments are randomly generated according to: m = 40; g: 0.25, 0.5 and 0.75; e: from 0.1 to 0.4. The reconstruction rates of the MEC model averaged on 129 pairs of haplotypes are illustrated in Figure 2. Each data point is the average of 10 runs of the GA. Figure 2 again indicates that the MEC model is only effective in the case of low error rate of SNP fragments.
|
| 5 AN EXTENSION TO THE MEC MODEL |
|---|
|
|
|---|
For a genotype g = (g1,g2,...,gn), when the j-th SNP site is wild-type homozygous, gj = 0; when it is mutant type homozygous, gj = 1; when it is heterozygous, gj = 2. A pair of haplotypes h1 = (h11,h12,...,h1n) and h2 = (h21,h22,...,h2n) is said to be compatible with a genotype g if the following conditions hold: for each SNP site j where gj
2, h1j = h2j = gj; for each SNP site j where gj = 2, h1j = 0, h2j = 1 or h1j = 1, h2j = 0. As mentioned previously in the Introduction section, genotype data can be obtained more easily than haplotypes. In order to improve the MEC model for haplotype reconstruction, we propose a new MEC type computational model in this section, which employs GI of an individual in the process of SNP correction. MEC/GI Given a SNP matrix M = (mij) and a genotype g, correct minimum number of elements (0 into 1 and vice versa) so that the resulting matrix is feasible and g-compatible, i.e. the corrected SNP fragments can be divided into two disjoint sets of pairwise compatible fragments that determine a pair of haplotypes that is compatible with g.
We first provide the core of the algorithms for solving the MEC/GI model, i.e. how to generate a feasible solution from a partition P = (C1,C2) of M for the MEC/GI model. If gj
2, then let h1j = h2j = gj. If gj = 2, h1j, h2j are determined by the formula (2). If h1j
h2j, then h1, h2 are compatible with g at the j-th SNP site. Otherwise, h1, h2 are not compatible with g at the j-th SNP site and we must modify h1j or h2j. For h1j = h2j = 1, if
, then let h1j = 0, otherwise let h2j = 0. For h1j = h2j = 0, if
, then let h1j = 1, otherwise let h2j = 1. After modification, h1, h2 are compatible with g at every SNP site. It is also easy to see that for a fixed partition, a pair of haplotypes compatible with a given genotype generated by this method has fewer number of conflicts with the given fragments than haplotypes generated by any other means. After determining a pair of haplotypes from a partition by the above method; an error function is defined similarly as for formula (3). The goal of the MEC/GI model is again to find a best partition of M. By using the modified error function, the algorithms for solving the MEC model can be straightforwardly used to solve the MEC/GI model.
The comparative results of two models by the branch-and-bound algorithm on ACE are illustrated in Figure 3ac. From these figures, it is clear that the MEC/GI model, which employs genotype information, has a much higher reconstruction rate at various gap rates. Since genotype information can be obtained more easily than haplotype information, the MEC/GI model is more effective for haplotype reconstruction than the MEC model.
|
The comparative results of two models on simulation data under s = 0.5 are similar to the results on ACE (data not shown). The comparative results of two models averaged over 10 pairs of haplotypes on simulation data under s = 0 (every SNP site in the genotype is a heterozygous site) by the branch-and-bound algorithm are illustrated in Figure 3df. From these figures, it is clear that the MEC/GI model has a higher reconstruction rate than the MEC model even without employing homozygous site information. This indicates that the MEC/GI model is not trivial, i.e. the higher reconstruction rate does not only depend on the homozygous site information. Heterozygous site information contributes greatly towards ensuring the accuracy of the reconstructed haplotypes.
The comparative results of two models on Daly set by using GA are illustrated in Figure 3gi. The figures show that the MEC/GI model is quite effective and the designed GA also performs very well for the MEC/GI model.
| 6 CONCLUSIONS |
|---|
|
|
|---|
In this paper, we first present an exact algorithm based on branch-and-bound method for a kind of haplotype reconstruction from SNP fragmentsthe MEC problem. Then a heuristic method based on GA is designed and intended to solve large size problems. The designed GA has very good performance. It can always find optimal solutions to most instances. Furthermore, it requires much less time than the branch-and-bound algorithm. How well the MEC model solves the haplotype reconstruction problem is illustrated through experiments on real data and simulation data. To improve the MEC model, we also propose a new computational model as an extension to the MEC model, which employs genotype information of an individual in the process of SNP correction. Extensive computational results show that the new model has a much higher accuracy in haplotype reconstruction than the MEC model.
In addition, two algorithms in the paper can be generalized with minor modification to the weighted MEC problem, in which every element in the SNP matrix has a weight denoting the confidence level of the element value read (Greenberg et al., 2004). The algorithms can also be adapted to the k-MEC problem (k
2), which is suited for haplotype reconstruction from SNP fragments in multi-ploid (k-ploid) organisms.
| Acknowledgments |
|---|
The authors are grateful to the anonymous referees for comments and help in improving the presentation of the earlier version of the paper. This work is supported by the National Natural Science Foundation of China under grant # 10471141, the National Postdoctoral Foundation of China and the Center of Bioinformatics, Academy of Mathematics & Systems Science, CAS, China.
Received on November 23, 2004; revised on January 31, 2005; accepted on February 22, 2005
| REFERENCES |
|---|
|
|
|---|
Altshuler, D., et al. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407, 513519[CrossRef][Medline].
Bonizzoni, P., et al. (2003) The haplotyping problem: an overview of computational models and solutions. J. Comp. Sci. Technol., 18, 675688.
Chakravarti, A. (1998) It's raining, hallelujah? Nat. Genet., 19, 216217[CrossRef][ISI][Medline].
Clark, A.G. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol., 7, 111122[Abstract].
Daly, M.J., et al. (2001) High-resolution haplotype structure in the human genome. Nat. Genet., 29, 229232[CrossRef][ISI][Medline].
Goldberg, D.E. Genetic Algorithms in Search, Optimization and Machine Learning, (1989) , Reading, MA Addison-Wesley.
Greenberg, H.J., et al. (2004) Opportunities for combinatorial optimization in computational biology. INFORMS J. Comput., 14, 211231.
Gusfield, D. (2002) Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. Proceedings of the 6th Annual International Conference on Research in Computational Molecular Biology (RECOMB) , NY ACM Press, pp. 166175.
Halperin, E. and Eskin, E. (2004) Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics, 20, 18421849
Helmuth, L. (2001) Genome research: map of human genome 3.0. Science, 293, 583585.
Hoehe, M., et al. (2000) Sequence variability and candidate gene analysis in complex disease: association of µ opioid receptor gene variation with substance dependence. Hum. Mol. Gene., 9, 28952908
Koontz, W.L.G., et al. (1975) A branch and bound clustering algorithm. IEEE Trans. Comp., 24, 908915.
Lancia, G., et al. (2001) SNPs problems, complexity and algorithms. LNCS, 2161, 182193.
Li, L.M., et al. (2004) Haplotype reconstructon from SNP alignment. J. Comp. Biol., 11, 507518.
Lippert, R., et al. (2002) Algorithmic strategies for the SNPs haplotype assembly problem. Brief. Bioinformatics, 3, 2331
Rieder, M., et al. (1999) Sequence variation in the human angiotensim converting enzyme. Nat. Genet., 22, 5962[CrossRef][ISI][Medline].
Rizzi, R., et al. (2002) Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. LNCS, 2452, 2943.
Stephens, J.C., et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science, 293, 489493
Terwilliger, J. and Weiss, K. (1998) Linkage disquilibrium mapping of complex disease: fantasy and reality? Curr. Opin. Biotechnol., 9, 579594.
Wang, L.S. and Xu, Y. (2003) Haploptye inference by maximum parsimony. Bioinformatics, 19, 17731780
This article has been cited by other articles:
![]() |
V. Bansal and V. Bafna HapCUT: an efficient and accurate algorithm for the haplotype assembly problem Bioinformatics, August 15, 2008; 24(16): i153 - i159. [Abstract] [PDF] |
||||
![]() |
M. Xie, J. Wang, and J. Chen A model of higher accuracy for the individual haplotyping problem based on weighted SNP fragments and genotype with errors Bioinformatics, July 1, 2008; 24(13): i105 - i113. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, et al. Machine learning in bioinformatics Brief Bioinform, March 1, 2006; 7(1): 86 - 112. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Li, W. Zhou, X.-S. Zhang, and L. Chen A parsimonious tree-grow method for haplotype inference Bioinformatics, September 1, 2005; 21(17): 3475 - 3481. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||









