Skip Navigation


Bioinformatics Advance Access originally published online on January 10, 2006
Bioinformatics 2006 22(6):685-691; doi:10.1093/bioinformatics/btk035
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/6/685    most recent
btk035v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chang, C.-J.
Right arrow Articles by Chao, K.-M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chang, C.-J.
Right arrow Articles by Chao, K.-M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A greedier approach for finding tag SNPs

Chia-Jung Chang 1, Yao-Ting Huang 1 and Kun-Mao Chao 1,2,*

1Department of Computer Science and Information Engineering, National Taiwan University Taipei, Taiwan
2Graduate Institute of Networking and Multimedia, National Taiwan University Taipei, Taiwan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTAL RESULTS
 4 DISCUSSION
 5 CONCLUSION
 REFERENCES
 

Motivation: Recent studies have shown that a small subset of Single Nucleotide Polymorphisms (SNPs) (called tag SNPs) is sufficient to capture the haplotype patterns in a high linkage disequilibrium region. To find the minimum set of tag SNPs, exact algorithms for finding the optimal solution could take exponential time. On the other hand, approximation algorithms are more efficient but may fail to find the optimal solution.

Results: We propose a hybrid method that combines the ideas of the branch-and-bound method and the greedy algorithm. This method explores larger solution space to obtain a better solution than a traditional greedy algorithm. It also allows the user to adjust the efficiency of the program and quality of solutions. This algorithm has been implemented and tested on a variety of simulated and biological data. The experimental results indicate that our program can find better solutions than previous methods. This approach is quite general since it can be used to adapt other greedy algorithms to solve their corresponding problems.

Availability: The program is available upon request.

Contact: kmchao{at}csie.ntu.edu.tw


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTAL RESULTS
 4 DISCUSSION
 5 CONCLUSION
 REFERENCES
 
The genetic variations in DNA sequences have a major impact on genetic diseases and phenotypic differences. Among various genetic variations, the single nucleotide polymorphism (SNP) is the most frequent form which has fundamental importance for disease association and drug design. A SNP is a genetic variation when a single nucleotide (i.e. A, C, G, or T) in the DNA sequence is altered and kept through heredity thereafter. A set of linked SNPs on one chromosome is called a haplotype. Millions of SNPs have been identified and these data are now publicly available (Helmuth, 2001; Hinds et al., 2005; Altshuler et al., 2005).

In recent years, the patterns of linkage disequilibrium (LD) observed in the human population reveal a block-like structure (Bafna et al., 2003; Daly et al., 2001; Patil et al., 2001; Hinds et al., 2005; Zhang et al., 2004a). The entire chromosome can be partitioned into high LD regions interspersed by low LD regions. The high LD regions are usually called ‘haplotype blocks’ and the low LD ones are referred to as ‘recombination hotspots.’ Within a haplotype block, there is little or no recombination that occurs and the SNPs are highly correlated. Consequently, a small subset of SNPs (called tag SNPs or haplotype tagging SNPs) is sufficient to capture the haplotype pattern of the block. Using tag SNPs for association studies can greatly reduce the genotyping cost since it does not require genotyping all SNPs.

A number of methods have been proposed to find tag SNPs. These methods are mainly based on the following three models. The methods based on the first model try to identify a minimun set of LD bins such that SNPs within a bin are in high LD with each other (e.g. r2 ≥ 0.8) (Carlson et al., 2004). The second model assumes that the haplotype blocks have been delimited in advance, and these methods find a minimum set of SNPs which is able to distinguish each pair of haplotypes in a block (Patil et al., 2001; Zhang et al., 2002). The methods based on the third model usually assume that the number of tag SNPs is specified as an input parameter, and they identify tag SNPs which can reconstruct the haplotype of an unknown sample with high accuracy (Halperin et al., 2005; He et al., 2005).

We would like to note that methods based on the third model aim to find a set of SNPs which can predict the haplotype of an unknown sample with high accuracy. On the other hand, LD-based and block-based methods both focus on minimizing the number of tag SNPs for whole genome association studies (Crawfod and Nickerson, 2005). The LD-based methods identify tag SNPs that can represent other SNPs which are distant apart. The tag SNPs found by block-based methods are mainly used to represent SNPs in a continuous region (Hinds et al., 2005). However, the tag SNPs found by the LD-based methods may fail to distinguish all haplotypes in an LD bin1. But the tag SNPs found by block-based methods can distinguish all haplotypes in a block.

This paper studies the block-based model. In a large-scale study of human Chromosome 21, Patil et al. (2001) developed a greedy algorithm to partition the haplotypes into 4135 blocks with 4563 tag SNPs. Zhang et al. (2002, 2003, 2004a) used a dynamic programming approach to reduce the numbers of blocks and tag SNPs to 2575 and 3562, respectively. To avoid the influence of missing data, Huang et al. (2005) showed that there exists a set of SNPs, called robust tag SNPs, which is able to tolerate a fixed number of missing data. The problem of finding robust tag SNPs is the generalized version of the problem of finding tag SNPs. Both problems are known to be NP-hard (Garey and Johnson, 1979; Zhang et al., 2002; Huang et al., 2005).

The brute force method or the branch-and-bound method can find the optimal solution of these problems but may take infeasible time on large datasets (Huang et al., 2005; Zhao et al., 2005). On the other hand, approximation algorithms such as the greedy algorithm are fast on all datasets but may fail to find the optimal solution (Huang et al., 2005; Zhao et al., 2005; Zhang et al., 2004b). Here we briefly summarize the main ideas of the branch-and-bound and the greedy methods (Cormen et al., 2001).

  • The concept of the branch-and-bound algorithm is to divide the feasible region of a problem to create smaller subproblems, cut out impossible ones and recursively solve the subproblems to get an optimal solution.
  • The concept of the greedy method is to make a choice that appears to be the best at each step, and repeat this process until a feasible solution is found.

In this paper, we propose a hybrid method of the above two algorithms to solve the problem of finding robust tag SNPs. Our method adopts the idea of the branch-and-bound method to partition the original problem into a number of subproblems. Each subproblem is created by including or excluding some choices made by the greedy algorithm. The subproblems are then solved by a greedy method. By our method, we can explore larger solution space to obtain a better solution than a traditional greedy algorithm within a reasonable period of time. The trade-off between the efficiency and solutions of the algorithm can be explicitly adjusted by users.

We have implemented the algorithm and tested it on a variety of simulated and biological data. The experimental results indicate that this algorithm is quite efficient and the solutions are better than those of previous methods. Furthermore, our algorithm only spends few seconds to find optimal solutions in many cases. In other cases, our solutions can be very close to the optimal solutions without sacrificing the efficiency too much.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTAL RESULTS
 4 DISCUSSION
 5 CONCLUSION
 REFERENCES
 
We are given a haplotype block containing n SNPs and p haplotype patterns. Each SNP can distinguish some haplotype patterns from the others [see Fig. 1a and b]. To tolerate m missing SNPs, the problem of finding robust tag SNPs asks for a minimum set of SNPs such that each pair of haplotype patterns is distinguished by at least m + 1 SNPs (Huang et al., 2005). The problem of finding minimum tag SNPs is a special case of finding robust tag SNPs when setting m = 0.


Figure 1
View larger version (57K):
[in this window]
[in a new window]
 
Fig. 1 (a) Four haplotype patterns with four SNPs. There are Formula pairs of haplotype patterns which are (P1, P2), (P1, P3), (P1, P4), (P2, P3) and (P2, P4), (P3, P4). (b) Each SNP can distinguish some haplotype patterns from the others. For example, the SNP S1 can distinguish between P1 and P2, P1 and P3, P2 and P4, and P3 and P4. (c) The relation between the SNPs and the haplotype patterns.

 
The relation between the SNPs and the haplotype patterns can be represented as an n-sized set of sets D and a Formula-sized set E (Fig. 1c). Each element of E represents a pair of haplotype patterns. Each Di isin D is a subset of E and stands for pairs of patterns distinguished by the i-th SNP. For example, D1 = {E1, E2, E5, E6} in Figure 1c. We say that Ej is covered by Di if Ej isin Di. The problem of finding robust tag SNPs is finding a minimum set C {subseteq} D where every element of E must be covered by C at least m + 1 times. Here is the formal definition.

Given a set of sets D, a set Formula Di and an integer m, choose a C {subseteq} D with minimum size that

Formula
where

Formula
One previous approach to solve this problem is a greedy algorithm (Huang et al., 2005). It always selects a SNP which distinguishes most pairs of haplotypes at each step. Formally speaking, the greedy algorithm adds Dk isin D to C at each step where |Dk| = maxDi isin D |Di|. In the following section, we introduce a novel concept of Greedy-Partition-Tree (GPT) for improving the previous greedy algorithm.

2.1 Greedy-Partition-Tree
A GPT is a complete binary tree whose height is specified by a parameter L. Figure 2 illustrates an example of a GPT with L = 3. The GPT is constructed using the greedy algorithm described above. In the GPT, each internal node except the root stands for the inclusion or exclusion of a SNP selected by this greedy algorithm. Each path of length l from the root to an internal node represents a set of inclusion and exclusion of l SNPs. By including or excluding some SNPs, the GPT partitions the original problem into small ones. In the following, we describe the detailed steps for constructing a GPT. Note that initially the GPT has only one root node, which stands for the original problem.

  • Step 1. For each leaf of the GPT, run the greedy algorithm to find a SNP Si according to the constraint based on the path to the root. For example, when the height of the GPT is 2 in Figure 2, for the gray S2 node we use the greedy algorithm to find a SNP (i.e. S4) while S1 and S2 must be selected as tag SNPs. For the white S2 node we find a SNP (i.e. S5) according to the constraint where S1 must and S2 must not be selected as tag SNPs.
  • Step 2. Branch two child nodes from each leaf. One child node represents the selection of Si and the other means the de-selection of it. For the above example, we find SNP S4 and branch two child nodes (the gray S4 node and the white S4 node) from the gray S2 node. Both the child nodes inherit D and E from their parent except Di is deleted. For the child node which represents the selection of Si, some Ejs are covered by Di and Ej is deleted if Ej is covered m + 1 times.
  • Step 3. If the height of the GPT is still less than L, go to Step 1.


Figure 2
View larger version (34K):
[in this window]
[in a new window]
 
Fig. 2 A GPT with L = 3. Each gray node represents inclusion of a SNP while its sibling (white) represents exclusion of the same SNP. The black triangles represent subproblems partitioned by the GPT. The solution of each subproblem can be used to infer a feasible solution of the original problem.

 
After constructing the GPT, each leaf stands for a subproblem partitioned by the GPT. For example, P1 in Figure 2 represents a subproblem where S1, S2 and S4 have already been selected as part of the tag SNPs. A solution of P1 plus S1, S2 and S4 is a feasible solution of the original problem. On the other hand, a solution of P2 plus S1 and S2 (without S4) is a feasible solution of the original problem. We run the greedy algorithm to solve each subproblem Pi and obtain a set of feasible solutions of the original problem. The best solution among them is chosen as the output. The following is the pseudo code of our algorithm. The inputs E, D, m are as defined in the beginning of this section and are parameters of the adopted greedy algorithm. The parameter L is used to define the height of the GPT.

The solution of the GPT is at least as good as that of a greedy algorithm because the solution of the lef most subproblem (i.e. P1 in Fig. 2) plus the SNPs along the path (i.e. S1, S2 and S4) to the root is exactly a greedy solution. The greedy algorithm is able to find a solution within a factor of ln((m + 1)|E|) of the optimal solution (Huang et al., 2005). Hence, the solution found by GPT is also guaranteed to have the same approximation ratio.

The GPT provides the flexibility for the user to balance the efficiency of the program and the quality of the solution by adjusting L. When L = 0, the algorithm is exactly the same as a greedy algorithm. By increasing L, the solution of GPT can be improved but the running time may also increase. When L = |S|, GPT becomes a branch-and-bound algorithm and the solution found is the optimal solution. From our empirical studies, the solution of GPT is significantly improved by increasing L and the optimal solution can even be found in many datasets with small L.

2.2 Improvement of efficiency
To improve the efficiency of our algorithm, two considerations are undertaken. One is to accelerate the greedy algorithm and the other is to prune unnecessary branches. The running time of the greedy algorithm can be reduced by adopting a heap structure (Cormen et al., 2001). The heap structure is used to speed up the step of selecting a SNP from Formula to Formula(log n). The greedy algorithm runs in time Formula(|E|n log n.) and the complexity of GPT is thus Formula(2L.|E|n log n).

Just as the branch-and-bound algorithm cuts out impossible subproblems, we can also prune unnecessary branches from the GPT to skip some impossible solution space. To prune unnecessary branches, we compute the lower bound of a subproblem by modifying the approach of Zhao et al. (2005). The lower bound is compared with the current best solution. The subproblem will not be solved further if its lower bound is not better than the current best solution.

2.3 Improvement of solutions
When building a GPT, there could be many redundant SNPs which distinguish the same pairs of haplotype patterns. If two nodes in distinct subtrees of the GPT include the same type of redundant SNPs, they may lead to the same subproblem. For example, if m = 0 and S1 and S3 in Figure 2 distinguish the same pairs of haplotype patterns, S2 is equal to S6 because we use the same greedy algorithm under the same constraint. Therefore, the solution found in P1 plus S4 must be equal to the solution found in P5. It takes place easily because redundant SNPs have the same priority to be selected by the greedy algorithm. Therefore, before constructing the GPT, we group SNPs that distinguish the same pairs of haplotypes. In each group of SNPs, the number of redundant SNPs selected as robust tag SNPs is at most m + 1. The extra SNPs could be deleted from each group without affecting the optimal solution. The deletion of these SNPs not only increases the efficiency, but also increases the opportunity to find a better solution.


    3 EXPERIMENTAL RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTAL RESULTS
 4 DISCUSSION
 5 CONCLUSION
 REFERENCES
 
We have implemented the GPT in C and run the program on a PC with a 2.4 GHz CPU and 256 MB memory. The data we tested include a variety of simulated and biological data. We compared the results of our program with those of a greedy algorithm, an LP-relaxation algorithm and a brute force algorithm for searching the optimal solution proposed by Huang et al. (2005). These algorithms are referred to as ‘greedy,’ ‘LP’ and ‘OPT’ respectively in the following.

3.1 Experimental results on simulated data
The first set of simulated data is generated by randomly assigning the major or minor alleles to each SNP on a haplotype. This simulation considers the bottleneck situation where all SNPs reach complete linkage equilibrium. We generate 100 datasets and each of them contains 10 haplotypes with 40 SNPs. The results of ‘greedy,’ ‘LP’ and ‘OPT’ are compared with our GPT of L = 10. Figure 3 plots the average numbers of robust tag SNPs found by each algorithm for tolerating different number of missing SNPs.


Figure 3
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 3 Experimental results on random data. The x-axis stands for the number of missing SNPs to be tolerated (i.e. m). The y-axis stands for the average numbers of robust tag SNPs found by different algorithms, ‘greedy,’ ‘LP,’ ‘GPT’ and ‘OPT’ over 100 datasets.

 
The optimal solutions for m > 1 cannot be found by OPT within a reasonable period of time and are not shown in this figure. In this experiment, the GPT only takes 6 seconds to find the optimal solutions when m = 1. As m increases, GPT significantly outperforms all other algorithms and the solutions can be found in seconds.

The second set of simulated data is generated by Hudson's program (Hudson, 2002) which can simulate a set of haplotypes under the assumption of neutral evolution and uniformly distributed recombination rate using the coalescent model. We use Hudson's program to generate 100 datasets and each of them contains 10 haplotypes with 40 SNPs. The results of this experiment are shown in Figure 4. The parameter L of the GPT is also set to 10 in this experiment.


Figure 4
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 4 Experimental results on Hudson's data. The x-axis stands for the number of missing SNPs to be tolerated (i.e. m). The y-axis stands for the average numbers of robust tag SNPs found by different algorithms, ‘greedy,’ ‘LP,’ ‘GPT,’ and ‘OPT’ over 100 datasets.

 
The optimal solutions for m > 1 again cannot be found within a reasonable period of time. The experimental result indicates that GPT is the only algorithm which finds the optimal solution when m = 1. When m increases, the GPT still outperforms other algorithms. It only takes several seconds for GPT to output a solution. The numbers of robust tag SNPs found by each algorithm are more than those of randomly generated data. It is because the coalescent haplotypes generated by Hudson's program are similar to each other. A SNP can distinguish fewer haplotypes on average. Thus, we need more SNPs to construct a feasible solution. In this experiment, the numbers of robust tag SNPs found by all algorithms are not too much different. It is because that most SNPs in these datasets distinguish similar haplotypes. The space of improvement is relatively small since the amount of distinct SNPs that can be chosen by GPT is insufficient.

3.2 Experimental results on biological data
We first test these algorithms on public haplotype data from Patil et al. (2001). Patil's data include 20 haplotypes of 24 047 SNPs spanning over ~32.4 MB on human Chromosome 21, which are partitioned into 4135 haplotype blocks. Each block contains 2–7 haplotypes and each haplotype has 1–114 SNPs. The blocks with only two haplotypes are not included in our experiments because any SNP could be the tag SNP and the optimal solution can be easily found by any algorithm. Therefore, there are 612 haplotype blocks tested in our experiments. We apply the GPT of L=12 and other algorithms on these haplotype blocks. The numbers of total robust tag SNPs found by these algorithms are listed in Table 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Comparison of the numbers of tag SNPs produced by four algorithms on Patil's data

 
The total number of robust tag SNPs decreases as m goes large because many blocks do not have enough SNPs for tolerating large missing data. The optimal solutions for m > 2 cannot be found within a reasonable period of time and are not listed in the table. In this experiment, the GPT is the only algorithm that finds the optimal solutions for m ≤ 2. The solutions of the GPT are still better than the other two algorithms as m goes large. However, the outperformance of the GPT is not obvious because of numerous small blocks in Patil's data. The optimal solutions of these short blocks can be easily found by all algorithms. As a result, the GPT fails to find better solutions in these short blocks and the improvement is not significant.

We then test these programs on the whole-genome haplotype data from Hinds et al. (2005). Hinds et al. (2005) genotyped 1 586 383 SNPs in 71 Americans of European, African and Asian ancestry and inferred haplotypes from these diploid genotype data. The inferred haplotypes were partitioned into blocks separately for each of the three population samples. Here we choose the sample of African-American as our experimental target. There are 22 chromosomes plus the X and Y chromosomes partitioned into 235 771 haplotype blocks and we apply the GPT of L = 3 on the haplotype blocks of each chromosome. We list the number of tag SNPs (i.e. m = 0) of all chromosome found by different algorithms in Table 2. We can see that the the solutions found by GPT are better than those of greedy and LP-relaxation algorithms. In fact, GPT finds optimal solutions of all chromosomes when L is only set to 3. In addition, the GPT takes only 15 s to find solutions of all chromosomes.


View this table:
[in this window]
[in a new window]
 
Table 2 Comparison of the numbers of tag SNPs produced by four algorithms on Hinds's data when m = 0

 
In all experiments, the GPT can nearly find the optimal solutions when m ≤ 2 and only takes a few seconds to finish execution. In comparison with the greedy algorithm, the GPT spends more time to explore larger solution space and thus can find better solutions than the traditional greedy algorithm.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTAL RESULTS
 4 DISCUSSION
 5 CONCLUSION
 REFERENCES
 
4.1 Trade-off between efficiency and solutions of GPT
In this subsection, we discuss the trade-off between efficiency and solutions of GPT. By increasing L, the solutions of GPT can be improved but the efficiency may be sacrificed. The efficiency of GPT is measured by the elapsed time to run datasets with different L. The solution improved by different L is measured by improved ratio, which is defined as (I0Ik)/I0 * 100% where Ik is the number of robust tag SNPs found by GPT with L = k.

Figure 5a plots the improved ratio with respect to L for the experiment on 100 randomly generated datasets (i.e. the datasets used in Fig. 3). As L = 10, all the curves start to converge, because the solutions found by the GPT are quite close to the optimal solution. For example, the improved ratio of the curve with m = 1 stops growing at 15.68% because the optimal solution has been found. This phenomenon indicates that setting L = 10 can gain the most improvement in solutions for the GPT. The solution of GPT is not significantly improved as L > 10.


Figure 5
View larger version (15K):
[in this window]
[in a new window]
 
Fig. 5 Experimental results for different L values. (a) Improved ratio on random data. (b) Elapsed time on random data. (c) Improved ratio on Hudson's data. (d) Elapsed time on Hudson's data. (e) Improved ratio on Patil's data. (f) Elapsed time on Patil's data. In (a), (c) and (e), the x-axis stands for the parameter L. The y axis stands for the improved ratio of solutions of related L. Let the number of SNPs found by GPT with L = 0 be I0 and the number of SNPs found by GPT with L = k be Ik. The improved ratio is computed by (I0Ik)/I0 * 100%. In (b), (d) and (f), the x-axis stands for L and the y-axis stands for the elapsed time needed to run 100 datasets.

 
We then plot the efficiency deterioration of increasing L on Figure 5b. The x-axis stands for L and the y-axis stands for the total elapsed time of GPT running on these 100 datasets. It can be observed that when L increases by 1, the elapsed time is ~1.6–1.8 times of the original running time. Furthermore, the GPT only takes <50 s to run these datasets with different m when L ≤ 10. As a result, the parameter L of GPT is best set to around 10 to obtain the best improvement in solutions, which still keeps the running time within a reasonable period of time.

Figure 5c plots the improved ratio of GPT on the 100 datasets generated by Hudson's program, and Figure 5d plots the corresponding elapsed time of GPT with different L on these datasets. As L fixes, the improved ratio of GPT on Hudson's data is less than that on random data. In Figure 5c, most curves converges around L = 12. Although the GPT has already found the optimal solution, its improved ratio is only 12.5%, which is less than that of previous experiment. The reason is that the space for improvement is not much on Hudson's datasets (see Section 3.1). On the other hand, we find that the GPT is more efficient when running on Hudson's datasets than on random datasets. For example, it only takes <30 s to run the datasets with different m when L = 10. It is because the running time of GPT is sensitive to the number of haplotypes distinguished by each SNP (see Section 2.2). The SNPs generated by Hudson's program can distinguish only a few haplotypes. Consequently, the GPT is quite efficient in this experiment. This implies that we can set larger L for GPT to obtain better improvement of solutions on Hudson's datasets.

Figure 5e depicts the improved ratio of GPT on Patil's data and Figure 5f plots the corresponding elapsed time for L ranging from 0 to 14. In Figure 5e, all curves start converging around L = 8. The improved ratio for m = 1 and L = 14 is only 5.53%, which is much less than the above two experiments. As mentioned in Section 3.2, there are many short blocks in Patil's data and all algorithms nearly find the optimal solution in short blocks. The improvement of GPT is amortized among numerous short blocks and thus the improved ratio is smaller. However, in Figure 5f, we observe that the elapsed time of GPT for every curve is <1 min, even when L = 14. This owes to the fact that the common haplotypes in Patil's haplotype blocks are rare. Most blocks contain less than five common haplotypes. Since the time complexity of GPT is relative to the number of haplotypes in the block, it is quite fast when running on Patil's datasets. This phenomenon suggests that we can set larger L for GPT on Patil's data without sacrificing the efficiency too much.

In the experiment of Hinds's data, the GPT finds the optimal solutions for all chromosomes when L = 3. Table 3 lists the total number of tag SNPs found by GPT on all chromosomes for different L. The optimal solutions can be found with small L because most blocks contain <10 SNPs. However, owing to the huge amount of blocks across all chromosomes, the GPT spends most of the time on I/O instead of real computation. As a result, the elapsed time of GPT is completely dominated by the I/O time instead of computation time. Therefore, when L increases from 0 to 3, the elapsed time of GPT does not reveal the exponential growth as expected.


View this table:
[in this window]
[in a new window]
 
Table 3 Experimental results for different L values on Hinds's data

 
In all experiments, the improved ratio of the GPT decreases as m becomes large. It is because a larger amount of SNPs are required to tolerate more missing data. However, in each haplotype block, the number of SNPs is fixed. When m becomes large, any feasible solution has to use more SNPs in the block, which implies that there are fewer SNPs that remain. Usually, the remaining unselected SNPs can only distinguish fewer haplotypes than the selected ones. Thus, the GPT has less space for improvement over any feasible solution.

On the other hand, the elapsed time also increases as m becomes large. One intuitive reason is that the GPT has to find more SNPs to tolerate more missing data. In addition, the running time of GPT is also proportional to the frequency of the adjustment of the internal heap (see Section 2.2). As m becomes larger, the frequency of the adjustment of the heap also increases. This is also a common phenomenon that happened to each algorithm. For example, the OPT program fails to output the optimal solution for m > 2 in most experiments. A feasible set of robust tag SNPs for tolerating missing data is usually more difficult to be found as m increases.

Since the GPT provides the flexibility to adjust L to balance efficiency and solutions, choosing a proper L is another important issue. From our empirical study, the proper choice of parameter L for GPT heavily depends on the types of datasets. In the experiment of randomly generated data, if the running time of GPT is up to 1 min, the L can be best set to 10 and we can gain significant improvement of solutions. On the other hand, for the datasets on Hudson's and Patil's data, the GPT is quite efficient in these experiments. Therefore, we can set larger L (e.g. L = 18) for GPT to gain sufficient improvement of solutions. As for the haplotype blocks with only a few SNPs, the GPT is able to find the optimal solution with small L (e.g. L is only set to 3 in Hinds' datasets).

4.2 Comparison with MLR-tagging
This subsection compares GPT with a method based on the third model. The method we choose is called MLR-Tagging which selects a set of SNPs to predict the haplotype of an unknown sample with high accuracy. The MLR-Tagging program is based on multivariate least square prediction and is superior to its previous version which adopts the linear reduction method (He et al., 2005). The program was downloaded from http://alla.cs.gsu.edu/~software/tagging/tagging.html. The MLR-tagging program requires that the number of tag SNPs is specified as an input parameter and the output is a set of tag SNPs with the specified size. We set the parameter ‘1’ and run the program iteratively by increasing the parameter by one at a time until the selected tag SNPs can distinguish all haplotype patterns. The final parameter is treated as the number of tag SNPs produced by MLR-Tagging when m = 0.

We test GPT and MLR-Tagging on simulated datasets used in Section 3. In the experiments on the 100 random datasets, the total numbers of tag SNPs found by MLR-Tagging and GPT are 579 and 400, respectively. In the experiments on the 100 Hudson's datasets, the MLR-Tagging requires 762 tag SNPs and the GPT only needs 443 tag SNPs. However, it should be noted that the objectives of these two methods are quite different. MLR-Tagging aims to find tag SNPs for predicting unknown haplotypes and to minimize the prediction error. On the other hand, GPT focuses on finding a minimum set of tag SNPs which can distinguish all haplotype patterns. Therefore, this comparison is not completely fair. If the number of tag SNPs or the genotyping cost is the primary concern, GPT is a more suitable approach than MLR-Tagging.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTAL RESULTS
 4 DISCUSSION
 5 CONCLUSION
 REFERENCES
 
In this paper, we propose a hybrid method called GPT to solve the problem of finding robust tag SNPs by combining the ideas of the branch-and-bound method and the greedy algorithm. The original problem is partitioned into a fixed number of subproblems, and each subproblem is then efficiently solved by a greedy algorithm. The GPT explores larger solution space than a traditional greedy algorithm and thus can obtain a better solution. In addition, the GPT offers the flexibility for the user to adjust the running time to approach the optimal solution as close as possible. The experimental results indicate that the GPT outperforms several existing methods and can find solutions quite close to the optimal solutions in feasible time. The GPT can also benefit from the parallel computation because the partitioned subproblems can be solved independently. In fact, the GPT is a general idea that can be used to adapt different greedy algorithms to solve their corresponding problems. The solutions found by GPT is guaranteed to be at least as good as the original greedy algorithms for the problem.


View this table:
[in this window]
[in a new window]
 
 


    Acknowledgments
 
We thank the reviewers for their helpful comments. C.-J.C., Y.-T.H. and K.-M.C. were supported in part by NSC grants 93-2213-E-002-029 and 94-2213-E-002-091 from the National Science Council, Taiwan.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

1In practice, the tag SNPs found by LD-based methods may fail to represent other SNPs since the LD between two SNPs are usually set to a relative high but not to a perfect threshold (e.g. r2 is 0.8 instead of 1.0). Back

Received on November 4, 2005; revised on December 10, 2005; accepted on December 30, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTAL RESULTS
 4 DISCUSSION
 5 CONCLUSION
 REFERENCES
 

    Altshuler, D., et al. (2005) A haplotype map of the human genome. Nature, 437, 1299–1320[CrossRef][Medline].

    Bafna, V., Halldórsson, B.V., Schwartz, R., Clark, A.G., Istrail, S. (2003) Haplotypes and Informative SNP Selection Algorithms: Don't Block Out Information. Proceedings of the Research in Computational Molecular Biology (RECOMB)Berlin, Germany , pp. 19–27.

    Carlson, C.S., et al. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J Hum. Genet, . 74, 106–120[CrossRef][Web of Science][Medline].

    Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C. Introduction to Algorithms, (2001) , Cambridge, MA MIT Press.

    Crawfod, D.C. and Nickerson, D.A. (2005) Definition and clinical importance of haplotypes. Annu. Rev. Med, . 56, 303–320[CrossRef][Web of Science][Medline].

    Daly, M.J., et al. (2001) High-resolution haplotype structure in the human genome. Nat. Genet, . 29, 229–232[CrossRef][Web of Science][Medline].

    Garey, M.R. and Johnson, D.S. Computers and Intractability: A Guide to the Theory of NP-completeness, (1979) , New York, NY WH Freeman and Company.

    Halperin, E., et al. (2005) Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics, . 21, Suppl. 1, i195–i203[Abstract].

    He, J., et al. (2005) Linear reduction methods for tag SNP selection. Int. J. Bioinformatics Res. Appl, . 1, 249–260.

    Helmuth, L. (2001) Genome research: map of the human genome 3.0. Science, 293, 583–585.

    Hinds, D.A., et al. (2005) Whole-genome patterns of common DNA variation in three human populations. Science, 307, 1072–1079[Abstract/Free Full Text].

    Huang, Y.T., et al. (2005) Selecting additional tag SNPs for tolerating missing data in genotyping. BMC Bioinformatics, 6, 263[Medline].

    Hudson, R.R. (2002) Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338[Abstract/Free Full Text].

    Patil, N., et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719–1723[Abstract/Free Full Text].

    Zhang, K., et al. (2002) A dynamic programming algorithm for haplotype block partitioning. Proc. Natl Acad. Sci. USA, 99, 7335–7339[Abstract/Free Full Text].

    Zhang, K., et al. (2003) Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data. Am. J. Hum. Genet, . 73, 63–73[CrossRef][Web of Science][Medline].

    Zhang, K., et al. (2004a) Haplotype block partition and tag SNP selection using genotype data and their applications to association studies. Genome Res, . 14, 908–916[Abstract/Free Full Text].

    Zhang, P., et al. (2004b) A double classification tree search algorithm for index SNP selection. BMC Bioinformatics, 5, 89[CrossRef][Medline].

    Zhao, W., et al. (2005) Efficient RNAi-based gene family knockdown via set cover optimization. Artif. Intell. Med, . 35, 61–73[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/6/685    most recent
btk035v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chang, C.-J.
Right arrow Articles by Chao, K.-M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chang, C.-J.
Right arrow Articles by Chao, K.-M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?