Bioinformatics Advance Access originally published online on July 7, 2005
Bioinformatics 2005 21(17):3475-3481; doi:10.1093/bioinformatics/bti572
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A parsimonious tree-grow method for haplotype inference
1Beijing Materials Institute Beijing 101149, China
2Chinese Academy of Sciences Beijing 100080, China
3Osaka Sangyo University Osaka 574-8530, Japan
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Haplotype information has become increasingly important in analyzing fine-scale molecular genetics data, such as disease genes mapping and drug design. Parsimony haplotyping is one of haplotyping problems belonging to NP-hard class.
Results: In this paper, we aim to develop a novel algorithm for the haplotype inference problem with the parsimony criterion, based on a parsimonious tree-grow method (PTG). PTG is a heuristic algorithm that can find the minimum number of distinct haplotypes based on the criterion of keeping all genotypes resolved during tree-grow process. In addition, a block-partitioning method is also proposed to improve the computational efficiency. We show that the proposed approach is not only effective with a high accuracy, but also very efficient with the computational complexity in the order of O(m2n) time for n single nucleotide polymorphism sites in m individual genotypes.
Availability: The software is available upon request from the authors, or from http://zhangroup.aporc.org/bioinfo/ptg/
Contact: chen{at}elec.osaka-sandai.ac.jp
Supplementary information: Supporting materials is available from http://zhangroup.aporc.org/bioinfo/ptg/bti572supplementary.pdf
| 1 INTRODUCTION |
|---|
|
|
|---|
Single nucleotide polymorphisms (SNPs) characterize most of genomic variation in human populations. A haplotype is a SNP sequence from each of the two copies of a given chromosome in a diploid genome. In contrast, a genotype is a description of the mixture information of the two haplotypes in a given chromosome. Recently, haplotype information has become increasingly important in analyzing fine-scale molecular genetics data for a variety of purposes, such as disease genes mapping and drug design. However, current sequencing technology typically determines genotypes rather than haplotypes owing to the requirement of tedious and costly experiments. Such restriction makes in silico haplotyping attractive.
So far, many inference and statistical method have been proposed for haplotyping, such as Clark method (Clark, 1990), parsimony approaches (Gusfield, 2001; Lancia et al., 2001; Wang et al., 2005), maximum-likelihood methods (Excoffier and Slatkin, 1995;Hawley and Kidd, 1995), phylogeny-based approaches (Gusfield, 2002; Chung and Gusfield, 2003; Halperin and Eskin, 2004) and Bayesian methods (Stephens et al., 2001; Niu et al., 2002). In particular, the parsimony criterion that seeks the minimum number of haplotypes to explain a given set of genotypes has been widely investigated owing to its intuitive simplicity and biological implication. Recently both Wang and Xu (2003) and Brown and Harrower (2004) developed an exact algorithm to solve the haplotype inference problem based on the parsimony condition, by the branch-and-bound method and by integer programming method respectively. However, the pure parsimony haplotype inference problem is NP-hard (Gusfield, 2001). Any exact algorithm generally suffers from the curse of dimensionality, which impedes the application for analyzing large-scale genomic data.
In this paper, we aim to develop a novel algorithm for the haplotype inference problem with the parsimony criterion, based on a parsimonious tree-grow method (PTG). We show that the proposed approach is not only effective with a high accuracy but also very efficient with the computational complexity in the order of O(m2n) time for n SNP sites in m individual genotypes. The rest of this paper is organized as follows. Section 2 gives a formal definition of the haplotype inference problem. In Section 3, we explore PTG to develop a new algorithm for the haplotype inference problem and further analyze its computational complexity and optimality. Several numerical experiments are provided in Section 4 to demonstrate the proposed algorithm. In Section 5, we provide an algorithm to reduce the genotype matrix to a smaller block matrix so as to improve the efficiency of PTG algorithm. Finally, we give several general remarks to conclude the paper in Section 6.
| 2 NOTATION |
|---|
|
|
|---|
In this paper, we restrict ourselves to biallelic SNPs. Without loss of generality, assume that the values of the two involved alleles of each SNP are always 0 and 1, which represent the common allele and the rare alleles, respectively. Since the SNPs are located sequentially on a chromosome, a haplotype with length n is a vector over {0,1}n, where each position i is also called a site or a locus. However, a genotype vector, or simply a genotype, represents two haplotypes as a sequence of unordered pairs over the set {0,1}. Each pair represents the nucleotides in a given site. Since the pairs are unordered, we are not able to determine the two haplotypes from the genotype alone. For example, two haplotypes with length 3 are (0,1,1) and (1,0,1) that are combined into the genotype [(0,1),(0,1),(1,1)].
Whenever a pair is made of two identical values, the SNP site is homozygous, otherwise it is heterozygous. Clearly, by the assumption on the values of the alleles, the pair for a homozygous site is (0,0) or (1,1), whereas the pair for a heterozygous site is (0,1). In contrast to the unordered pairs, the genotype can also be represented by a compact form, i.e. a compact representation of the genotype consists of a vector over the alphabet {0,1,2}, where one of the first two symbols is used if the site is homozygous, and a 2 encodes a heterozygous site. For example, the compact representation of the genotype [(0,1),(0,1),(1,1)] is therefore (2,2,1). Next, we only use the compact form of genotype in this paper. If a genotype has no heterozygous site, then we call it homozygote; otherwise we call it heterozygote.
Given a genotype g = (g1,..., gn)
{0,1,2}n, then a resolution of g is a pair
h,k
of haplotypes, where h = (h1,..., hn) and k = (k1,..., kn) are defined in such a way that hi = ki = gi if gi
2; and hi, ki
{0,1} with hi
ki if gi = 2. When the above conditions hold, we also say that
h,k
resolves g, which is denoted by h
k = g. Next, we give several definitions as well as a basic result which is used in the proposed algorithm.
DEFINITION 1.
Letbe a set of m genotypes where gi = gi1
gin is the expression of the i-th genotype, and gij
{0,1,2} is the j-th SNP value in the i-th genotype. Then
is called a genotype matrix with m genotypes in n SNP sites.
(1)
![]() | (2) |
gkj, which is called a genotype fragment.
Pure parsimony haplotype inference problem for a given input set of m genotype vectors is to find a set of m pairs of haplotypes, one for each genotype vector, such that the number of distinct haplotypes is minimum. For convenience, let
(G) be a solution set of a haplotype inference problem for a given genotype matrix G, and
*(G) be the optimal haplotype solution set with the parsimony criterion. Then,
*(G) is the solution with the smallest number of distinct haplotypes that can resolve the genotypes G. Clearly |
*(G)|
|
(G)|, where | · | means the number of elements in the set.
PROPOSITION 1.
For any 1j
n 1, we have
| 3 PARSIMONIOUS TREE-GROWMETHOD (PTG) |
|---|
|
|
|---|
In this section, we propose a novel algorithm, called parsimonious tree-grow method (PTG), to solve the pure parsimony haplotype inference problem for a given genotype matrix G. It is a heuristic algorithm to find the minimal number of distinct haplotypes based on the criterion of keeping all genotypes (or genotype fragments) resolved during a tree-grow process.
3.1 Main idea of PTG
If a genotype matrix G has only one column, we can easily resolve all genotypes in G by no more than two distinct haplotypes of length 1. If a genotype matrix (or submatrix) has k columns and we have resolved the genotype submatrix G[1, k1] by a haplotype fragment set
(G[1, k1]) of length k 1, then we can resolve the genotype matrix (or submatrix) G[1,k] by a haplotype set
(G[1,k]) of length k with every haplotype obtained by adding one SNP value 0 or 1 to one of the haplotype fragments in
(G[1,k 1]). In other words, we can resolve the genotype matrix columns one by one. The resolving process is executed by a growing tree with minimal branching principle, which we called as PTG. In the growing tree, successive layers of the tree correspond to the successive columns, from the left to the right of G. We denote a submatrix of G as G[1,j], and call the rows gk[1, j], or simply gk(j) of G[1, j] as genotype fragments or row fragments.
Each column of G is resolved one by one in a consecutive way. Suppose that G[1, j] has been resolved and the tree has grown to the j-th layer. In the process of resolving G[1, j + 1], the tree grows or extends to a new layer, i.e. the (j + 1)-th layer, where every node in the (j + 1)-th layer corresponds to a distinct haplotype fragment of length j + 1 that can be used to resolve some row fragments in G[1, j + 1]. All nodes in the (j + 1)-th layer correspond to all distinct haplotype fragments that resolve all row fragments in G[1, j + 1]. When all the columns of G are resolved, each node in the final layer corresponds to a unique haplotype, and thereby we can obtain the parsimony haplotype solution set corresponding to the genotype matrix G.
Before we describe the algorithm of PTG in detail, we introduce several definitions. Let the genotype matrix G in Equation (1) have m rows and n columns, and v01 = {1, ..., m} be the index set of rows, which is also the index set of genotypes.
DEFINITION 2.
For a given genotype matrix G, if there is an l for 1l
j such that gil = 2, then i is called a divided index in the j-th layer. Otherwise, i is called an undivided index in the j-th layer. To distinguish them, we use a boolean variable f(i) to denote whether the index i is divided at each iteration of PTG program.
3.2 Algorithm of PTG
We first give the detail procedure of PTG, and then use an illustrative example to demonstrate the performance of the algorithm.
ALGORITHM 1.
PTG. Initialization: Input an m x n genotype matrix G. Set a root node denoted by v01, which represents the index set {1,...,m}, i.e. v01 = {1,...,m}. Set f(i) = false, for every i = 1, ..., m. Let j = 0, and go to step 1.
Step 1 Resolve submatrix G[1, j + 1].
Suppose that there are p nodes vj1, ..., vjk, ..., vjp in the j-th layer of the growing tree, which corresponds to p distinct haplotype fragments resolving G[1, j]. The nodes vj1, ..., vjp also represent their corresponding index sets. Do substeps 1.1 and 1.2 depicted below.
Substep 1.1 For each 1
k
p, and each i, (1
i
m), if i
vjk, resolve the i-th genotype fragment in G[1, j] when i satisfies either of the following two conditions:
- Condition 1: gi,j+1
2;
- Condition 2: gi,j+1 = 2, and f(i) = false.
- Condition 2: gi,j+1 = 2, and f(i) = false.
Otherwise, record the i in a set I(j); and record vjk in a node set Tij, where Tij is a set of the j-th layer nodes that include node i.
- If gi,j+1 = 0, then add a branch of type 0 to the node vjk when there is no branch of type 0 growing from node vjk; add i to the index set of the node v(j+1.), which is connected to the node vjk by the existing branch or the just added branch of type 0.
- If gi,j+1 = 1, then add a branch of type 1 to the node vjk when there is no branch of type 1 growing from node vjk; add i to the index set of the node v(j+1.), which is connected to the node vjk by the existing or just added branch of type 1.
- If gi,j+1 = 2 and f(i) = false, then add a branch of type 0 (a branch of type 1) or both branches (one of type 0 and the other of type 1) or nothing to the node vjk according to the following cases: only one type of branch exists, no branch exists or two types of branches exist. Add i into both index sets of the (j + 1)-th layer nodes, which are connected to node vjk, set f(i) = true.
Substep 1.2 For i
I(j), suppose Tij = {vjk1, vjk2}, i.e. i belongs to vjk1 and vjk2. Check whether there are two different types of branches growing separately from node vjk1 and vjk2.
- If there are no two such different types of branches, then add a proper type of branch to node vjk1 or vjk2, or add two different types of branches, one to node vjk1 and the other to node vjk2.
- Choose (or randomly choose) a pair of different type branches, one growing from node vjk1, the other growing from node vjk2. Add i into both node index sets of the (j + 1)-th layer which are connected to vjk1 or vjk2 by one of the chosen branches.
Step 2 If j + 1 < n, set j
j + 1 and return to Step 1. Otherwise, assemble haplotypes as follows.
Trace each path from node v01 to every node in the n-th layer of the growing tree. The sequence of branch type indices (0 or 1) of the path gives a haplotype, which can be used to resolve the genotypes whose indices belong to the corresponding node in the n-th layer. All the haplotypes corresponding to the n-th layer nodes consist of
(G).
3.3 An illustrative example
To demonstrate the algorithm, we resolve a genotype matrix G as follows:
![]() |
3.3.1 Resolve submatrix G[1,1] (or the first column of G)
We first resolve genotype fragment g1[1,1]. Since g11 = 2, we must use two distinct haplotype fragments (0 and 1) to resolve g1[1,1], which results in a branch of type 0 and a branch of type 1 growing from node v01. Denote the two new nodes in the first layer by v11 and v12, and set v11 = {1} and v12 = {1}. Because index 1 is now a divided index, set f(1) = true; then we resolve g1[1,1].
Next, resolve the second genotype fragment g2[1, 1]. Despite g21 = 2, since both the branch of type 0 and the branch of type 1 have grown from node v01, we add index 2 into both v11 and v12, i.e. v11 = {1,2}, v12 = {1,2}. Set f(2) = true. Then, we resolve g2[1,1].
Finally, resolve the third genotype fragment g3[1,1]. Since g31 = 0, and there is already a branch of type 0 growing from node v01, we add index 3 into v11. Therefore, v11 = {1,2,3} and v12 = {1,2}, which resolve g3[1,1]. The result is shown in the first layer of tree in Figure 1.
|
3.3.2 Resolve submatrix G[1,2] (or the second column of G)
First check all indices in v11. Since 1
v11 with f(1) = true and g12 = 2, we record the node 1 in a list I(1) to be treated later, and record v11 in T11, i.e. T11 = {v11}. Because 2
v11 with g22 = 0 and there is no branch of type 0 growing from node v11 as well, we add a branch of type 0 to node v11 and denote the new node by v21 and let v21 = {2}. Since 3
v11 with f(3) = false and g32 = 2 and there is no branch of type 1 growing from node v11, we add a branch of type 1 to node v11, denote the corresponding node attached to it by v22, and further add index 3 to v21 and v22, i.e. v21 = {2,3}, v22 = {3}. Set f(3) = true. Now g3[1,2] is resolved.
In the same manner, we can check all indices in v12. Since 1
v12 with f(1) = true and g12 = 2, we record v12 in T11, i.e. T11 = {v11,v12}. Since 2
v12 with g22 = 0 and there is no branch of type 0 growing from node v12 as well, we add a branch of type 0 to node v12 and denote the corresponding node by v23 and let v23 = {2}. Now, g2[1,2] is resolved.
According to substep 1.2 of the algorithm, now we consider the indices in I(1). Since 1
I(1) and T11 = {v11,v12}, there are three branches growing from nodes v11 and v12. Hence, we choose the branch of type 1 growing from v11 and the branch of type 0 growing from v12 to resolve g12 = 2, add index 1 in v22 and v23. Now g1[1,2] is resolved, and
![]() |
3.3.3 Resolve submatrix G[1,3] (or the third column of G)
In the same manner as the above two iterations, we obtain the index sets for the third layer nodes:
![]() |
The final tree is depicted in Figure 1, which has three nodes in the last layer corresponding to three distinct haplotypes. By tracing all paths, we obtain three haplotypes for resolving all genotypes in G,
![]() |
v31 and 1
v32, the haplotypes 010 (corresponding to v31) and 100 (corresponding to v32) resolve the first genotype g1 (220). Clearly, according to the index sets of (v31, v32, v33), each haplotype can be used to resolve two genotypes of G, e.g. the haplotype corresponding to v31 (i.e. 010) can be used to resolve genotypes g1 and g3 (since 1
v31 and 3
v31).
3.4 Computational complexity
There is a bound for the number of haplotypes by PTG. As proven in Proposition (4) in Appendix 1 of Supporting material, if the genotype matrix G has m rows and n columns, then the resolution set of haplotype inference problem obtained by Algorithm 1 must satisfy the following inequality
![]() |
| 4 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
In this section, we use both real data and simulation data to demonstrate the performance of PTG. To improve the computational efficiency, input data are preprocessed according to Algorithm 2, which is described in detail in Section 5. CPU times in this section are the total amount for Algorithms 1 and 2. The program is implemented on a 1.8 GHz 512 MB RAM Pentium 4 Processor PC using Borland Delphi 5.0 by Pascal, and is available upon request or from the website. Throughout our experiment, to measure the performance of PTG, we use error rate, a commonly used criterion in haplotype inference problem (Stephens et al., 2001; Niu et al., 2002; Wang and Xu, 2003). The error rate is the proportion of genotypes whose original haplotype pairs are incorrectly inferred by the program.
4.1 Experiment on ß2AR gene data
ß2-Adrenergic Receptors (ß2AR) are G protein-coupled receptors that mediate the actions of catecholamines in multiple issues. There are 13 variable sites within a span of 1.6 kb in the human ß2AR gene. Among 121 individuals, there are 18 distinct genotypes, but only 10 haplotypes resolve all the genotypes. Those 10 haplotypes and 18 genotypes are illustrated in Table 1 (Wang and Xu, 2003).
|
We ran PTG on ß2AR gene data 100 times; within 80 times of the run, we found 10 distinct haplotypes to resolve all 18 genotypes, where 9 of the 10 haplotypes correctly resolve 17 genotypes. The average error rate in 100 runs was 0.056. In particular, in 10 of 100 runs, we found all 10 correct haplotypes to resolve all 18 genotypes. The average running time is
0.016 s, which is considered to be very efficient in contrast to HAPER (>1 min) and PHASE (>10 min). Detail computation process for PTG is described in Appendix 2 of Supporting material.
4.2 Angiotensin converting enzyme gene data
Angiotensin converting enzyme (ACE) is encoded by the gene DCP1. Complete data for the genomic sequencing of DCP1 from 11 individuals in 22 chromosomes are available (Rieder et al., 1999). There are 52 SNP sites and 11 genotypes, which are resolved by 13 distinct haplotypes (Rieder et al., 1999; Wang and Xu, 2003). We obtained 13 haplotypes with 9 correct haplotypes that resolve 9 of the 11 genotypes correctly with an error rate of 2/11 = 0.182. Such a performance is better than or at least equal to widely used existing programs, i.e. HAPAR with an error rate of 0.273, Haplotyper with an error rate of 0.182, HAPINFERX with an error rate of 0.273 and PHASE with an error rate of 0.273 (Wang and Xu, 2003). The relatively low accuracy is mainly because of the small sample size. In these experiments, the average CPU timeis 0.320 s.
4.3 Maize dataset
The maize dataset is used as one of the benchmarks to evaluate accuracy of haplotype programs (Wang and Xu, 2003). Acetyl-CoA C-acyltransferase which is an enzyme and catalyses the final step of fatty acid oxidation, has 18 SNP sites and 4 haplotypes with frequencies of 0.03, 0.47, 0.23 and 0.26 in the maize dataset, as shown in Table 2. We follow the same procedure as Wang and Xu (2003) to generate a sample of n genotypes by randomly picking two haplotypes according to their frequencies and conflating them. Table 3 is the simulation results (Wang and Xu, 2003) for five programs. The error rates are average values for 100 random samples. Clearly, PTG correctly resolves all genotypes for sample sizes from 4 to 10, and behaves best among five programs in terms of accuracy. We alsoconducted simulations for Adh1 in the maize dataset for different sample sizes, which has 6 haplotypes and 14 SNP sites with frequencies of 0.031, 0.031, 0.125, 0.25, 0.25 and 0.312. The simulation results are almost the same as those of Table 3, and PTG correctly resolves all genotypes.
|
|
4.4 Experiments on simulation data
The haplotype generator, ms, in Hudson (2002) is a well-known standard program based on the coalescent model of SNP sequence evolution. In this subsection, we use the software (ms) to generate 2m haplotypes, each with n SNP sites, and then randomly pair them to obtain m genotypes, which are used as input for the PTG program.
4.4.1 Coalescence-based simulations without recombination
In this section, the number of SNPs is fixed as 10, 50, 200, and 100 replications were made for each sample size. When generating haplotypes, we specify recombination parameter to be 0. The CPU times and error rates of PTG are illustrated in Tables 47, where m denotes the number of genotype matrix rows, and n is the number of genotype matrix columns.
|
|
Comparing with Figure 4 of Wang and Xu (2003), the computation in Tables 47 is fairly efficient in terms of both CPU time and accuracy (Halperin and Eskin, 2004), even for large size of genomic data, in contrast to PHASE (Stephens et al., 2001), HAPLOTYPER (Niu et al., 2002) and other methods. The results indicate the superiority of our algorithm over the conventional approaches. Both the numbers of genotypes and SNP sites affect the computational cost, which is also proved in Section 3.4 or Appendix 1 of Supporting material.
4.4.2 Coalescence-based simulations with recombination
In this section, we introduce recombination into the model when generating simulated haplotypes. We set recombination parameter
to be 100.0 when generating haplotypes by the software ms. The simulation results are illustrated in Table 8.
|
Comparing with figure 5 of Wang and Xu (2003), the error rate results in Table 8 are similar to those obtained by the existing haplotype softwares. However, in contrast to the cases without recombination shown in Table 4, the error rates are high. The reason resulting in relatively high error rate is that in the simulation data with recombination, the number of correct distinct haplotypes resolving all genotypes is often not the minimum one. For example, in our simulation of 30 genotypes with 10 SNP sites, the number of the correct distinct haplotypes resolving all 30 genotypes is 24. However, by PTG, we can find a solution of 19 distinct haplotypes resolving all 30 genotypes. Since PTG can almost always find the minimum number of haplotypes to resolve all genotypes, the error rate may not be low when recombination rate is high. This is also the reason why the error rate of other programs is also high. Such a fact implies that parsimony approach may not be suitable for the data with a high recombination rate, or needs to be modified to handle such problems by further considering the characteristics of recombination.
To study the bottleneck effect, we do simulation on large scale of data without recombination, as shown in Table 7. For a sample of 1000 individuals, PTG can currently handle 200 SNPs in no more than 10 min, which is better than HAPLOTYPER (handling 50 SNPs of 1000 individuals); For a sample of 300 individuals, PTG can handle 400 SNPs, which is also efficient in contrast to HAPLOTYPER (handling 256 SNPs of 100 individuals). PTG can even resolve problems with a much large size of data if there is sufficient capacity of computer RAM (>512 MB).
| 5 IMPROVING EFFICIENCY OF PTG |
|---|
|
|
|---|
Usually in genotype matrix derived from human haplotypes, many columns corresponding to SNP sites are identical. Indeed, as noted in Patil et al. (2001), the number of identical columns in real data is considerably large. It is common to keep only one column out of several identical columns since they are assumed not to carry any additional information (Patil et al., 2001). Thus we can improve the performance (in both CPU times and memory requirements) by reducing the number of columns of genotype matrix. This can be executed by dividing the genotype matrix into blocks, as a precomputation process of Algorithm 1.
DEFINITION 3.
Given a genotype gk = gk1gkn and a fragment gk[i,j] = gki
gkj, if gkt = 0 (or 1) for each i
t
j, then the fragment gk[i,j] is called an identical homozygous genotype fragment of type 0 (or 1). If gkt = 2 for each i
t
j, then gk[i,j] is called an identical heterozygous genotype fragment.
DEFINITION 4.
Given a genotype submatrix G[i,j], if every row of G[i,j] is either an identical homozygous genotype fragment or an identical heterozygous genotype fragment, i.e. every row of G[i,j] is one of the following three types:then G[i,j] is called a block. A block is called a homozygous block of type 0 (or type 1) if every row is an identical homozygous genotype fragment of type 0 (or type 1). Otherwise, it is called a heterozygous block.
Clearly, a block is a submatrix of the genotype matrix with all the columns identical.
DEFINITION 5.
Given a haplotype hk = hk1hkn where hkl
{0,1} for 1
l
n, a haplotype fragment hk[i,j] = hki
hkj is called an identical homozygous haplotype fragment of type 0 (or 1) if hkt = 0 (or hkt = 1) holds for any i
t
j.
With the preparation above, we have a basic proposition below to simplify the computation in PTG algorithm.
PROPOSITION 2.
All the genotype fragments in a homozygous block of type 0 (type 1) can be resolved by one (or two identical) identical homozygous haplotype fragment of type 0 (type 1). All the genotype fragments in a heterozygous block can be resolved by two different types of identical homozygous haplotype fragments in the spirit of parsimony.
For example, all genotype fragments in the homozygous block
![]() |
![]() |
ALGORITHM 2.
Dividing genotype matrix G into blocks.
- Initialization: input a genotype matrix G with m individual genotypes and n SNP sites, and let k
1, j
1, ik
j.
- Step 1. If
then j
j + 1, go to Step 2; otherwise go to Step 3.
- Step 2. If j = n, go to Step 3; otherwise go to Step 1.
- Step 3. Gk = G[ik,j], which is defined in Equation (2). If j = n, stop and output all blocks Gk of G; otherwise, let k
k + 1, j
j + 1, ik
j, and go to Step 1.
- Step 4. Combine all blocks into a block matrix B where each column represents a block.
It can be easily shown that Algorithm can divide the genotype matrix G into blocks in no more than O(mn) arithmetic operations. Since all columns in a block are identical, we can use one of them to represent the block. After doing this to all blocks of G, we obtain a matrix B, which is called a block matrix of G. Obviously, each block comprises consecutive identical columns of a genotype matrix. Clearly, the algorithm can be easily extended to find an extended block matrix with the minimum number of blocks, where each extended block is composed by all the identical columns in G.
PROPOSITION 3.
Given a genotype matrix G and its corresponding block matrix B, let*(G) and
*(B), respectively be the optimal haplotype solution sets of the haplotype inference problem with G and B. Then |
*(G)| = |
*(B)|, and the haplotypes in
*(G) can be obtained from those in
*(B) by adding the corresponding SNPs. However, the haplotypes in
*(B) can be obtained from those in
*(G) by deleting the SNPs according to the rule of dividing G into blocks.
EXAMPLE 1.
Given a genotype matrixthe corresponding block matrix is
according to Algorithm.
An optimal solution for the block matrix is
*(B) = {001,010,100} by PTG algorithm, which means
*(G) = {00011, 00100, 11000} by Proposition (3). Clearly, every haplotype in
*(B) is a compression of the haplotype in
*(G), and every haplotype in
*(G) is an extension of the haplotype in
*(B).
Therefore, instead of the original genotype matrix G, we can use the block matrix B = (b1
bt) that has less columns, to improve the computational efficiency in the initialization of Algorithm 1, in particular, for large-scale data. Generally, such a compression not only reduces the combination of possible haplotypes, but also loses no information of genotype data. Hence, given a genotype matrix G, we first use Algorithm to reduce the number of columns in G and obtain the block matrix B, then apply Algorithm 1 (PTG) to the block matrix B. After resolving the block matrix B, we recover haplotypes of full length to resolve the genotype matrix G. The program in this paper is coded by both Algorithms 1 and 2.
Using this modified PTG algorithm, we obtained an optimal solution of ß2AR gene data, the corresponding growing tree and the index sets are depicted in Appendix 2 of Supporting material.
| 6 FURTHER DISCUSSION OFPTG ALGORITHM |
|---|
|
|
|---|
In this paper, we proposed a novel graphic algorithmPTG, which not only is very efficient with polynomial arithmetic operations, but also has high accuracy for the haplotype inference problem. In particular, the computational cost is very low even for large scale genomic data as indicated in Table 7 and proven in Theorem (1) in Appendix of Supporting material. In contrast to Clark's method, there is no restriction for the given genotypes, i.e. PTG can resolve the case that every genotype has more than one heterozygous site, such as the illustrative example and in the simulation tests in Section 4.4.
Although PTG is very efficient, it is based on the parsimony criterion, which generally does not directly take the count information of genotypes into consideration, as indicated in Section 4. To alleviate such a disadvantage, instead of pure parsimony, a modified parsimony criterion may be required, such as by adding weighting parameters to approximately incorporate frequency information of genotypes in the model. In addition, PTG algorithm currently has no function to handle gaps in the genotype matrix. As a future topic, we will improve the PTG algorithm to incorporate missing data in optimization.
|
|
| Acknowledgments |
|---|
We are grateful to Wang and Xu for kindly giving us haplotype datasets and related information. This work is partially supported by the National Natural Science Foundation of China under grant No.10471141, and the Stress Foundation of Beijing Materials Institute, Beijing, China. X.Z.'s is partly supported by Informatics Research Center for Development of Knowledge Society Infrastructure, Graduate School of Informatics, Kyoto University, Japan.
Conflict of Interest: none declared.
Received on May 24, 2005; revised on June 30, 2005; accepted on July 4, 2005
| REFERENCES |
|---|
|
|
|---|
Bondy, J.A. and Murty, U.S.R. Graph Theory With Applications, (1976) , London Macmillan.
Brown, D.G. and Harrower, I.M. (2004) A new integer programming formulation for the pure parsimony problem in haplotype analysis. In Jonassen, I. and Kim, J. (Eds.). Algorithms in Bioinformatics, 4th International Workshop (WABI) Springer, pp. 254265.
Ching, A., et al. (2002) SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines. BMC Genet., 3, 19[CrossRef][Medline].
Chung, R.H. and Gusfield, D. (2003) Perfect phylogeny haplotyper: Haplotype inferral using a tree model. Bioinformatics, 19, 780781
Clark, A.G. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7, 111122[Abstract].
Excoffier, L. and Slatkin, M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921927[Abstract].
Greenberg, H., Hart, W.E., Lancia, G. (2002) Opportunities for combinatorial optimization in computational biology. Technical report, University of Colorado at Denver, Mathematics Department, Denver, CO.
Gusfield, D. (2001) Inference of haplotypes from samples of diploid populations: Complexity and algorithms. J. Comput. Biol., 8, 305324[CrossRef][ISI][Medline].
Gusfield, D. (2002) Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. Proceedings of RECOMB 2002: The sixth Annual International Conference on Computational Biology , pp. 166175.
Halperin, E. and Eskin, E. (2004) Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics, 20, 18421849
Hawley, M. and Kidd, K. (1995) Haplo: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Heredity, 86, 409411
Hudson, R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18, 337338
Lancia, G., Bafna, V., Istrail, S., Lippert, R., Schwartz, R. (2001) SNPs problems, complexity and algorithms. Proceedings of Annual European Symposium on Algorithms (ESA) Lecture Notes in Computer Science Springer 2161, , pp. 182193.
Lin, S., et al. (2002) Haplotype inference in random population samples. American Journal of Human Genetics, 71, 11291137[CrossRef][ISI][Medline].
Lincia, G. and Perlin, M. (1998) Genotyping of pooled microsatellite markers by combinatorial optimization techniques. Discrete Applied Mathematics, 88, 291314[CrossRef].
Niu, T., et al. (2002) Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70, 157169[CrossRef][ISI][Medline].
Patil, N., et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 171923.
Qian, D. and Beckmann, L. (2002) Minimum-recombinant haplotyping in pedigrees. American Journal of Human Genetics, 70, 14341445[CrossRef][ISI][Medline].
Rieder, M., et al. (1999) Sequence variation in the human angiotensin converting enzyme. Nat. Genet., 22, 5962[CrossRef][ISI][Medline].
Stephens, M., et al. (2001) A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68, 978989[CrossRef][ISI][Medline].
Wang, L.S. and Xu, Y. (2003) Haplotype inference by maximum parsimony. Bioinformatics, 19, 17731780
Wang, R., et al. (2005) Haplotype Reconstruction from SNP Fragments by Minimum Error Correction. Bioinformatics, 21, 24562462
Xu, C.F., et al. (2002) Effectiveness of computational method in haplotype prediction. Human Genetics, 110, 148156[CrossRef][ISI][Medline].
Zhang, X., et al. (2006) Models and Algorithms for Haplotyping Problem. Current Bioinformatics, In press.
This article has been cited by other articles:
![]() |
K. Allen-Brady, L. A. Cannon-Albright, S. L. Neuhausen, and N. J. Camp A Role for XRCC4 in Age at Diagnosis and Breast Cancer Risk. Cancer Epidemiol. Biomarkers Prev., July 1, 2006; 15(7): 1306 - 1310. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
be a set of m genotypes where gi = gi1 














