Bioinformatics Advance Access originally published online on November 15, 2007
Bioinformatics 2007 23(23):3178-3184; doi:10.1093/bioinformatics/btm496
Genome-wide selection of tag SNPs using multiple-marker correlation
Algorithm and Data Analysis, Affymetrix, Inc., 3420 Central Expressway, Santa Clara, California, USA
To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: The tag SNP approach is a valuable tool in whole genome association studies, and a variety of algorithms have been proposed to identify the optimal tag SNP set. Currently, most tag SNP selection is based on two-marker (pair-wise) linkage disequilibrium (LD). Recent literature has shown that multiple-marker LD also contains useful information that can further increase the genetic coverage of the tag SNP set. Thus, tag SNP selection methods that incorporate multiple-marker LD are expected to have advantages in terms of genetic coverage and statistical power.
Results: We propose a novel algorithm to select tag SNPs in an iterative procedure. In each iteration, the SNP that captures the most neighboring SNPs (through pair-wise and multiple-marker LD) is selected as a tag SNP. We optimize the algorithm and computer program to make our approach feasible on today's typical workstations. Benchmarked using HapMap release 21, our algorithm outperforms the standard pair-wise LD approach in several aspects. (i) It improves genetic coverage (e.g. by 7.2% for 200 K tag SNPs in HapMap CEU) compared to its conventional pair-wise counterpart, when conditioning on a fixed tag SNP number. (ii) It saves genotyping costs substantially when conditioning on fixed genetic coverage (e.g. 34.1% saving in HapMap CEU at 90% coverage). (iii) Tag SNPs identified using multiple-marker LD have good portability across closely related ethnic groups and (iv) show higher statistical power in association tests than those selected using conventional methods.
Availability: A computer software suite, multiTag, has been developed based on this novel algorithm. The program is freely available by written request to the author at ke_hao{at}merck.com
Contact: ke_hao{at}163.com
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
It has been estimated that the human genome harbors >5 million common SNPs with minor allele frequency (MAF) of at least 10% (Carlson et al., 2003; Gonzalez-Neira et al., 2006; Kruglyak and Nickerson, 2001), and 7.5 million common SNPs with MAF of at least 5% (Barrett and Cardon, 2006). These polymorphisms may explain a portion of the heritable risk for many diseases. There are two common strategies for constructing the contents of SNP genotyping panels: (1) SNPs chosen approximately randomly across the genome, ignoring linkage disequilibrium (LD) patterns, and (2) LD-based tag SNPs chosen to maximize genetic coverage (Barrett and Cardon, 2006; Peer et al., 2006). Here, genetic coverage is defined as the fraction of all common (MAF ≥ 5%) SNPs exceeding some correlation threshold with at least one SNP typed by the array. The tag SNP approach takes advantage of our recent understanding of the human genome's fine LD structure and reduces genotyping costs (Carlson et al., 2003, 2004; Gonzalez-Neira et al., 2006). Driven by such a large potential benefit, a variety of algorithms have been proposed to efficiently identify tag SNPs, which is essentially a feature selection problem from the machine-learning viewpoint. The SNP tagging strategy is tightly linked to the downstream testing methods for genetic association. If the selection starts from phased haplotype data and tag SNPs are picked to maximize the haplotypes they can distinguish, the downstream association studies might be more powerful when employing haplotype-based tests (Hao et al., 2005; Howie et al., 2006; Sebastiani et al., 2003). If the selection starts from diploid genotypes and tag SNP panels are developed to maximize genetic coverage through pair-wise LD (e.g. $r^2$), single-locus association testing could be more appropriate (de Bakker et al., 2005; Peer et al., 2006). Extending pair-wise LD, $r^2$ among multiple markers has been proposed to further increase the genetic coverage of SNP panels (de Bakker et al., 2005; Hao et al., 2006; Peer et al., 2006). For example, using combinations of two genotyped SNPs, the additional coverage gain is more than 10% for the Illumina HumanHap300 K panel in Caucasians (Peer et al., 2006). Software packages have also become available to quickly compute multiple-marker $r^2$ and achieve good coverage at a genome-wide scale (Barrett et al., 2005; Hao et al., 2006). It is noteworthy that such additional coverage gain is achieved on tag SNP panels that were developed solely using pair-wise $r^2$. Could tag SNPs instead be selected by incorporating multiple-marker LD information directly? In this article, we propose an extension of Carlson's greedy algorithm (Carlson et al., 2004).
Our new method identifies tag SNPs by simultaneously considering their pair-wise and multiple-marker LD with nearby neighbors. Furthermore, we evaluate the (1) gain in genetic coverage, (2) saving in genotyping costs, (3) portability of the tag SNPs and (4) statistical power in association studies.
| 2 METHODS |
|---|
2.1 Data
The HapMap release 21 (The International HapMap Consortium, 2003) samples comprise 270 individuals from four populations: (1) 30 trios from the Yoruba in Ibadan, Nigeria; (2) 30 trios from the CEPH collection (Utah residents with ancestry from Northern and Western Europe); (3) 45 unrelated Han Chinese individuals from Beijing, China and (4) 45 unrelated individuals from Tokyo, Japan. The Han Chinese and Japanese are often considered as a single East Asian sample (Gonzalez-Neira et al., 2006). The HapMap Project genotyped more than 4 million SNPs, among which about 2.2 million SNPs are common (MAF ≥ 5%), and this number varies depending on the ethnic group.
2.2 Multiple-marker correlation
2.2.1 A SNP's correlation with another marker ($r^2_{\mathrm{one}}$)
Consider SNP2 and its neighbor (SNP1) within a specified distance (e.g. 100 kb). We term their pair-wise $r^2$ as $r^2_{\mathrm{one}}$ because only a single SNP is used as the predictor in tag SNP selection and in the downstream association test. The notation $r^2_{\mathrm{one}}$ is also consistent with the multiple-marker $r^2$ notations.
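As a minimal sketch of the pair-wise quantity, the snippet below computes $r^2$ between two bi-allelic SNPs from phased haplotypes. The function name `pairwise_r2`, the 0/1 allele coding and the choice to return 0 for a monomorphic marker are illustrative assumptions, not part of multiTag.

```python
from typing import Sequence


def pairwise_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """r^2 between two bi-allelic markers coded 0/1 on the same phased haplotypes.

    Hypothetical helper for illustration; each position in the two sequences
    refers to the same chromosome.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equally long")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at marker a
    p_b = sum(hap_b) / n                      # frequency of allele 1 at marker b
    p_ab = sum(1 for x, y in zip(hap_a, hap_b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # assumption: treat a monomorphic marker as uninformative
    d = p_ab - p_a * p_b                      # LD coefficient D
    return d * d / denom


# Toy haplotypes (made-up data): identical, independent and partially correlated SNPs.
same = pairwise_r2([1, 1, 0, 0], [1, 1, 0, 0])
independent = pairwise_r2([1, 1, 0, 0], [1, 0, 1, 0])
partial = pairwise_r2([1, 1, 1, 0], [1, 1, 0, 0])
```

With a threshold such as 0.8, only the first toy pair would count as captured.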
2.2.2 A SNP's correlation with another two markers ($r^2_{\mathrm{two}}$)
Herein, we implement a previously proposed method for computing multiple-SNP $r^2$ (de Bakker et al., 2005; Hao et al., 2006; Peer et al., 2006). Let us consider SNP3 and its two neighbors (SNP1 and SNP2) within a certain distance. Each SNP carries two possible alleles (SNP1 carries alleles A and a, SNP2 carries B and b, and SNP3 carries C and c). A multiple-marker $r^2$ can be used to quantify the correlation between SNP3 and the combination of SNP1 and SNP2. This combination of SNP1 and SNP2 may form four possible haplotypes (AB, Ab, aB and ab). Therefore, this SNP combination can be treated as a multi-allelic marker, which carries four alleles, denoted as AB, Ab, aB and ab. Pooling {Ab, aB and ab}, we transform this multi-allelic marker into a bi-allelic marker, which carries alleles AB and non-AB. We compute the pair-wise $r^2$ between this new bi-allelic marker and SNP3, and record the result as $r^2_{AB}$. Similarly, we calculate $r^2_{Ab}$ by pooling {AB, aB and ab}, and likewise $r^2_{aB}$ and $r^2_{ab}$. Finally, we define $r^2_{\mathrm{two}}$ between SNP3 and the combination of SNP1 and SNP2 as $\max\{r^2_{AB}, r^2_{Ab}, r^2_{aB}, r^2_{ab}\}$.
2.2.3 A SNP's correlation with another three markers ($r^2_{\mathrm{three}}$)
There are four SNPs (SNP1, SNP2, SNP3 and SNP4), and we are interested in $r^2_{\mathrm{three}}$ of SNP4 with its three neighbors (SNP1, SNP2 and SNP3). SNP1, SNP2 and SNP3 form $2^3 = 8$ possible haplotypes. Again, we construct a novel bi-allelic marker by pooling seven haplotypes together, and obtain $r^2_{\mathrm{three}}$ as the maximum after eight iterations. Similarly, $r^2_{\mathrm{four}}$ or even higher orders of LD can be computed in the same framework.
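The pooling procedure of Sections 2.2.2 and 2.2.3 can be sketched for any number of predictor SNPs as follows. This is an illustrative sketch under the assumption of phased 0/1 haplotypes; the names `pairwise_r2` and `multi_marker_r2` are hypothetical and do not come from multiTag or LdCompare.

```python
from itertools import product
from typing import Sequence


def pairwise_r2(a: Sequence[int], b: Sequence[int]) -> float:
    """r^2 between two 0/1-coded markers on the same phased haplotypes."""
    n = len(a)
    p_a, p_b = sum(a) / n, sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else (p_ab - p_a * p_b) ** 2 / denom


def multi_marker_r2(target: Sequence[int],
                    predictors: Sequence[Sequence[int]]) -> float:
    """Maximum r^2 between `target` and any single predictor haplotype.

    For k predictors there are 2**k possible haplotypes. Each one is turned
    into a bi-allelic indicator (that haplotype versus all others pooled),
    and the best pair-wise r^2 against the target is returned.
    """
    n = len(target)
    best = 0.0
    for hap in product((0, 1), repeat=len(predictors)):
        indicator = [
            int(all(p[i] == allele for p, allele in zip(predictors, hap)))
            for i in range(n)
        ]
        best = max(best, pairwise_r2(indicator, target))
    return best


# Toy data (made up): snp3 is carried only on haplotypes that carry both
# snp1 and snp2, so neither SNP alone predicts it well but the pair does.
snp1 = [1, 1, 0, 0]
snp2 = [1, 0, 1, 0]
snp3 = [1, 0, 0, 0]
r2_one = pairwise_r2(snp1, snp3)
r2_two = multi_marker_r2(snp3, [snp1, snp2])
```

Passing three predictor SNPs evaluates the eight haplotypes described for the three-marker case.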
2.3 Algorithm for selecting tag SNPs using multiple-marker LD (multiTag)
Notation:
- Scandidate: the set of candidate SNPs, from which tag SNPs will be selected. At the beginning of tag SNP selection, all SNPs belong to Scandidate.
- StagSNP: the set of tag SNPs. At the beginning of the selection, StagSNP is empty, and it increases by one during each selection loop.
- Scaptured: the set of SNPs already captured by StagSNP. At the beginning of the selection, Scaptured is empty, and it increases during the tag SNP selection procedure.
Step 1, Initialization:
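- (1) We compute pair-wise $r^2$ ($r^2_{\mathrm{one}}$) between every two SNPs in Scandidate that are within a certain distance, L (L defines the sliding window size; in practice, we usually set L = 100 000 or 200 000 bp). In Scandidate, every SNP's ability to capture its neighboring SNPs (by single-marker LD, $r^2_{\mathrm{one}}$) is quantified with the SNP Capture Score (SCS):

$$\mathrm{SCS}(\mathrm{candidate}_j) = \sum_{k \in N_j} I\left[ r^2_{\mathrm{one}}(\mathrm{candidate}_j, \mathrm{candidate}_k) \ge T_{\mathrm{one}} \right] \qquad (1)$$

where $N_j$ denotes the members of Scandidate within distance L of candidatej (including candidatej itself), $I[\cdot]$ equals 1 when its condition holds and 0 otherwise, and $T_{\mathrm{one}}$ is the single-marker threshold.
- (2) From Scandidate, we move the SNP with the largest SCS to StagSNP, and denote it as tagSNP(1), since it is the first member of StagSNP.
- (3) From Scandidate, we move all captured SNPs [we define a SNP as captured if it has $r^2_{\mathrm{one}} \ge T_{\mathrm{one}}$ with tagSNP(1)] to Scaptured.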
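Step 2, Iteration:
- (1) We update the SCS for all remaining SNPs in Scandidate. For example, for the jth SNP in Scandidate, denoted as candidatej, the SCS can be calculated as follows (the formula is limited to $r^2_{\mathrm{two}}$ for illustration purposes, but the algorithm can readily accommodate $r^2_{\mathrm{three}}$):

$$\mathrm{SCS}(\mathrm{candidate}_j) = \sum_{k \in N_j} I\left[ r^2_{\mathrm{one}}(\mathrm{candidate}_j, \mathrm{candidate}_k) \ge T_{\mathrm{one}} \right] + \sum_{k \in N_j} I\left[ r^2_{\mathrm{one}}(\mathrm{candidate}_j, \mathrm{candidate}_k) < T_{\mathrm{one}} \ \text{and} \ \max_{t \in S_{\mathrm{tagSNP}}} r^2_{\mathrm{two}}(\mathrm{candidate}_k;\ \mathrm{candidate}_j, t) \ge T_{\mathrm{two}} \right] \qquad (2)$$

Here $N_j$ denotes the members of Scandidate within distance L of candidatej, and $I[\cdot]$ equals 1 when its condition holds and 0 otherwise. In the first term, we count how many members of Scandidate (within candidatej's neighborhood, including itself) are captured by candidatej through $r^2_{\mathrm{one}}$. In the second term, we count additional members of Scandidate that are not covered by candidatej through $r^2_{\mathrm{one}}$ but are covered by a combination of candidatej and a neighboring StagSNP member through $r^2_{\mathrm{two}}$.
- (2) From Scandidate, we move the SNP with the largest SCS to StagSNP, and denote it as tagSNP(i) if it is the ith member of StagSNP. We record the combination of tagSNP(i) and other StagSNP members if this combination contributes to tagSNP(i)'s SCS. This is an important part of our algorithm because those recorded combinations are used in downstream portability evaluation and association tests.
- (3) From Scandidate, we move all captured SNPs to Scaptured.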
Step 3, Termination:
We continue the iteration until (1) Scandidate becomes empty, or (2) StagSNP reaches a prespecified size (e.g. 100 000 SNPs) or (3) the coverage value reaches a prespecified level. Herein, the coverage can be easily calculated using the size of Scandidate, StagSNP and Scaptured.
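$$\mathrm{coverage} = \frac{|S_{\mathrm{tagSNP}}| + |S_{\mathrm{captured}}|}{|S_{\mathrm{candidate}}| + |S_{\mathrm{tagSNP}}| + |S_{\mathrm{captured}}|} \qquad (3)$$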
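The three steps above can be sketched in a few lines of Python. This is a simplified illustration restricted to single-marker and two-marker LD, not the multiTag implementation: the function names, the dictionary-based inputs, the brute-force neighbor search and the deterministic tie-break (smallest SNP identifier) are all assumptions made for the example.

```python
from itertools import product


def r2(a, b):
    """Pair-wise r^2 between two 0/1-coded markers on phased haplotypes."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else (pab - pa * pb) ** 2 / denom


def r2_two(target, p1, p2):
    """Best r^2 between target and any one haplotype of predictors p1, p2."""
    best = 0.0
    for h1, h2 in product((0, 1), repeat=2):
        indicator = [int(x == h1 and y == h2) for x, y in zip(p1, p2)]
        best = max(best, r2(indicator, target))
    return best


def select_tags(haps, pos, window=100_000, t_one=0.8, t_two=0.9,
                max_tags=None, target_coverage=1.0):
    """Greedy tag selection; haps maps SNP id -> 0/1 list, pos maps id -> bp."""
    candidate = set(haps)
    total = len(haps)
    tags, captured, recorded = [], set(), []

    def near(a, b):
        return abs(pos[a] - pos[b]) <= window

    def score(j):
        """SNPs captured by j alone, or by j combined with an existing tag."""
        hits, combos = {j}, set()
        for k in candidate - {j}:
            if not near(j, k):
                continue
            if r2(haps[j], haps[k]) >= t_one:
                hits.add(k)
                continue
            for t in tags:
                if near(j, t) and r2_two(haps[k], haps[j], haps[t]) >= t_two:
                    hits.add(k)
                    combos.add((j, t))  # remember the contributing combination
                    break
        return hits, combos

    while candidate:
        best = None
        for j in sorted(candidate):  # assumption: ties go to the smallest id
            hits, combos = score(j)
            if best is None or len(hits) > len(best[1]):
                best = (j, hits, combos)
        j, hits, combos = best
        tags.append(j)
        recorded.extend(sorted(combos))
        candidate -= hits
        captured |= hits - {j}
        coverage = (len(tags) + len(captured)) / total
        if (max_tags and len(tags) >= max_tags) or coverage >= target_coverage:
            break
    return tags, recorded, (len(tags) + len(captured)) / total


# Toy data (made up): C is carried only on haplotypes carrying both A and B.
haps = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "A2": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 0, 0, 1, 1, 0, 0],
    "C": [1, 1, 0, 0, 0, 0, 0, 0],
}
pos = {"A": 1000, "A2": 2000, "B": 3000, "C": 4000}
tags, recorded, coverage = select_tags(haps, pos)
```

In this toy example, two tags (A and B) suffice because the recorded combination of B with A captures C; disabling the two-marker term (for instance with `t_two=2.0`) requires a third tag. The sketch recomputes every $r^2$ value in each iteration, which is far too slow for genome-wide data.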
2.4 Evaluation of genetic coverage and portability for tag SNPs
The genetic coverage on a training sample itself can be easily computed using formula 3, as presented in Figure 1. Recently, the portability of tag SNPs has attracted great interest, especially for populations from the same ethnic category (e.g. Caucasian). For example, how well does StagSNP developed in HapMap CEU subjects collected from Utah perform on a Caucasian cohort collected in Europe? In this article, we look at portability among the HapMap CHB and JPT cohorts. We identify tag SNPs in JPT (based on $r^2_{\mathrm{one}}$ and $r^2_{\mathrm{two}}$), and record the two-SNP combinations of StagSNP members that contribute to SCS. In CHB (an independent validation sample set), we calculate the fraction of SNPs that are captured by StagSNP (i.e. StagSNP's genetic coverage in CHB) either by $r^2_{\mathrm{one}} \ge T_{\mathrm{one}}$ with a StagSNP member or by $r^2_{\mathrm{two}} \ge T_{\mathrm{two}}$ with a recorded combination of StagSNP members.
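The portability check described above can be sketched as a function that scores a fixed tag set, plus its recorded two-SNP combinations, on a validation sample. The function name `coverage_in_validation`, the input layout and the toy haplotypes are hypothetical; thresholds default to the values used in this article.

```python
from itertools import product


def r2(a, b):
    """Pair-wise r^2 between two 0/1-coded markers on phased haplotypes."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else (pab - pa * pb) ** 2 / denom


def r2_two(target, p1, p2):
    """Best r^2 between target and any one haplotype of predictors p1, p2."""
    best = 0.0
    for h1, h2 in product((0, 1), repeat=2):
        indicator = [int(x == h1 and y == h2) for x, y in zip(p1, p2)]
        best = max(best, r2(indicator, target))
    return best


def coverage_in_validation(haps, pos, tags, recorded,
                           window=100_000, t_one=0.8, t_two=0.9):
    """Fraction of validation SNPs captured by tags or recorded combinations."""
    tag_set = set(tags)
    covered = 0
    for k in haps:
        if k in tag_set:
            covered += 1  # a tag SNP is genotyped directly
            continue
        by_single = any(
            abs(pos[k] - pos[t]) <= window and r2(haps[k], haps[t]) >= t_one
            for t in tags
        )
        by_pair = any(
            abs(pos[k] - pos[a]) <= window
            and abs(pos[k] - pos[b]) <= window
            and r2_two(haps[k], haps[a], haps[b]) >= t_two
            for a, b in recorded
        )
        covered += int(by_single or by_pair)
    return covered / len(haps)


# Made-up training and validation haplotypes; in the validation sample the
# LD between C and the (B, A) combination is weaker.
pos = {"A": 1000, "A2": 2000, "B": 3000, "C": 4000}
training = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "A2": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 0, 0, 1, 1, 0, 0],
    "C": [1, 1, 0, 0, 0, 0, 0, 0],
}
validation = dict(training, C=[1, 0, 0, 0, 0, 0, 0, 0])
cov_train = coverage_in_validation(training, pos, ["A", "B"], [("B", "A")])
cov_valid = coverage_in_validation(validation, pos, ["A", "B"], [("B", "A")])
```

The difference between the two coverage values corresponds to the portability loss discussed in the Results.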
| 3 RESULTS |
|---|
To evaluate this novel multiple-marker SNP tagging approach, we apply it to HapMap release 21, which contains more than 4 million SNPs, a portion of which (e.g. ~2.1 million in CEU) are common SNPs, defined as MAF ≥ 5%. In this study, we focus only on these common SNPs (Barrett and Cardon, 2006; de Bakker et al., 2006). Figure 1 illustrates the tag SNP selection procedure with the thresholds Tone = 0.8, Ttwo = 0.9 and Tthree = 0.95. For any StagSNP size, the tag SNPs selected using the multiple-marker approach (the solid line and dotted line) have higher coverage than
276 K in CEU, the single-marker curve becomes linear, indicating that we have captured all LD bins and have started genotyping singleton SNPs. The inflection point comes earlier for multiple-marker tag SNPs. Conditioning on a fixed SNP number, multiple-marker tags offer higher coverage. From another viewpoint, the multiple-marker approach reduces genotyping costs for a given genetic coverage (Table 1). If we target 90% coverage in CEU, the single-marker algorithm requires 356.4 K SNPs, whereas the two-marker algorithm requires only 234.7 K tag SNPs, which translates into a 34.1% saving. Again, we observe only minor additional savings when extending to
Tag SNPs optimized on a training dataset may not perform equally well on an independent study cohort that was not used for tag SNP selection. Such a phenomenon is often referred to as portability loss, describing the decrease in genetic coverage when tag SNPs are applied to an independent sample set. In this article, we examined the portability of multiple-marker tag SNPs on two closely related ethnic populations (HapMap CHB and JPT). For example, in Figure 2A, we identified StagSNP using the single-marker and two-marker approaches in JPT, and then evaluated StagSNP's genetic coverage in JPT and CHB. It is noteworthy that, during two-marker tag SNP selection, we recorded the marker combinations that contribute to the tag SNP's SCS. In the coverage calculation, only the tag SNPs themselves and the recorded combinations were evaluated. By these means, the number of hypothesis tests increases only moderately when applying two-marker tag SNPs in an association study. Our portability experiments yield a few interesting observations: (i) StagSNP shows lower genetic coverage in the validation samples (e.g. CHB in Fig. 2A) than in the training samples (e.g. JPT in Fig. 2A), and this coverage decrease is essentially the portability loss; (ii) more importantly, StagSNP identified using either approach (single-marker or two-marker) has similar portability; (iii) in both the training samples and the validation samples, two-marker StagSNP offers higher genetic coverage than its single-marker counterpart and (iv) furthermore, two-marker StagSNP has even higher genetic coverage in the validation sample than single-marker StagSNP has in the training sample.
In the context of HapMap release 21 (2.2 million common SNPs) and a ±100 kb window size, about $10^8$ calculations of $r^2$ are required for
~100 neighbors within the window. For
~100 and
~300 h, respectively, to finish a large chromosome (e.g. Chromosome 2). Fortunately, tag SNP selection on each chromosome can be run in parallel on a Linux cluster. If terminated prematurely (e.g. a Linux cluster node crashes for an unknown reason), multiTag is able to pick up partial results and resume the computation, which appears to be a valuable feature when running the program for a long period. Incorporating
| 4 DISCUSSION |
|---|
Whole genome association study (WGAS) using tag SNPs is a powerful approach for elucidating the genetic basis of common human diseases such as hypertension, type 2 diabetes mellitus and osteoporosis. A variety of techniques have been proposed for tag SNP selection (Barrett et al., 2005; Carlson et al., 2004; de Bakker et al., 2006; Halperin et al., 2005; Hao et al., 2005; Howie et al., 2006; Qin et al., 2006; Sebastiani et al., 2003; Stram et al., 2003), but many of them have only been tested on relatively small chromosomal regions. Potentially, they can be extended to the genome-wide scale, although substantial modification is necessary to make them computationally feasible in terms of memory usage and CPU run time. Because choosing tag SNPs is essentially a feature selection problem, several established feature selection algorithms have been applied (Halperin et al., 2005; Horne and Camp, 2004; Lin and Altman, 2004; Phuong et al., 2005). However, these methods are still computationally complex, although they do not require exponential search time. As a result, they can only be used on gene regions or small chromosomes. An alternative is to focus on haplotype blocks, but the blocks are not always straightforward to define. Moreover, some feature selection methods (e.g. principal component analysis) derive mathematical abstractions, and mapping them to SNPs introduces one more level of complexity. Set theory has also been used (Sebastiani et al., 2003), but it only identifies perfect tag SNP sets (with 100% prediction power) and does not scale up to the entire genome. Currently, the block-free tag SNP selection strategy (Carlson et al., 2004) is employed by Illumina in developing whole-genome SNP arrays (Barrett and Cardon, 2006; Peer et al., 2006). These arrays are designed to optimize genetic coverage based on pair-wise $r^2$ (
Selecting a set of tag SNPs by exhaustively searching all possible combinations is computationally intensive, and becomes impractical at the genome-wide scale even when limited to $r^2_{\mathrm{one}}$ (Carlson et al., 2004; Hao et al., 2005; Qin et al., 2006; Sebastiani et al., 2003). When extending to the orders of $r^2_{\mathrm{two}}$ and $r^2_{\mathrm{three}}$, computation time and memory use become critical issues in algorithm development. Carlson's greedy approach greatly reduces the search space (Hao et al., 2005) and is therefore fast and memory efficient. The identified tag SNP set is fairly close to the minimum size, although without a mathematical guarantee (Carlson et al., 2004; Howie et al., 2006; Qin et al., 2006). More importantly, a tag SNP set containing a certain degree of redundancy offers better portability than the mathematically minimal set (data not shown). Based on the above rationale, we extend Carlson's greedy method to incorporate higher order $r^2$ (e.g. $r^2_{\mathrm{two}}$ and $r^2_{\mathrm{three}}$). In each iteration, we only consider multiple-SNP $r^2$ formed by one candidate SNP and its neighbors in StagSNP; by these means, the search space is further reduced and the algorithm becomes computationally feasible. As shown in Figure 1, at the early phase of tag SNP selection, our approach (solid line and dotted line) is similar to Carlson's method (dashed line), because StagSNP is small and $r^2_{\mathrm{two}}$ and $r^2_{\mathrm{three}}$ contribute little to SCS. As StagSNP becomes larger, there are SNP combinations formed by a Scandidate member and its neighbors in StagSNP that give high $r^2_{\mathrm{two}}$ or $r^2_{\mathrm{three}}$. Hence, the genetic coverage of the multiple-marker algorithm starts to exceed that of Carlson's method. It should be noted that, when StagSNP gets larger, more SNP combinations need to be evaluated in terms of $r^2_{\mathrm{two}}$ and $r^2_{\mathrm{three}}$, and the computational complexity grows quickly. Generally, there are two strategies for handling the large number of SNP combinations. (1) We could pre-compute $r^2_{\mathrm{two}}$ and $r^2_{\mathrm{three}}$ for all possible combinations using currently available software. These $r^2$ values are stored either in memory or on hard disk, and then used in the SCS calculation during tag SNP selection. The drawback is the large memory requirement (if $r^2$ is stored in memory) or the heavy file IO demand (if $r^2$ is stored on hard disk). (2) Alternatively, we can compute a given SNP combination's $r^2_{\mathrm{two}}$ and $r^2_{\mathrm{three}}$ on-the-fly. This strategy has obvious advantages in memory usage and/or file IO demand; however, more $r^2$ computation is required (because a given SNP combination's $r^2$ value may be used in several SCS calculations). The current version of multiTag employs the latter strategy and can therefore run on a typical workstation with 512 MB of memory. The computation of $r^2_{\mathrm{two}}$ and $r^2_{\mathrm{three}}$ needs 3- and 4-SNP haplotype data, respectively. In our study, we directly used haplotypes (HapMap release 21) as input, which were reconstructed using the program PHASE (Marchini et al., 2006; Stephens et al., 2001). In principle, the algorithm can accommodate diploid data and reconstruct 3- or 4-SNP haplotypes on-the-fly; however, this strategy could be time consuming and potentially less accurate (Hao et al., 2006). As a result, researchers are recommended to first apply PHASE or other methods (Marchini et al., 2006) to accurately generate haplotypes, and then select tag SNPs using multiTag (the current version of multiTag only accommodates haplotype input).
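The trade-off between pre-computing and on-the-fly computation can be illustrated with a bounded memoization cache, which sits between the two strategies: values are computed on demand, and only a limited number are kept in memory for reuse. This middle ground is an illustration and is not described as part of multiTag; `HAPS`, `_r2_sorted` and `r2_on_the_fly` are hypothetical names, and the haplotypes are made-up toy data.

```python
from functools import lru_cache

# Hypothetical toy haplotypes (0/1 alleles per phased chromosome).
HAPS = {
    "snp1": (1, 1, 1, 1, 0, 0, 0, 0),
    "snp2": (1, 1, 1, 0, 0, 0, 0, 0),
    "snp3": (1, 1, 0, 0, 1, 1, 0, 0),
}


@lru_cache(maxsize=100_000)  # bounded: old entries are evicted when full
def _r2_sorted(a: str, b: str) -> float:
    """Pair-wise r^2 for SNP ids a, b; computed once per cached pair."""
    x, y = HAPS[a], HAPS[b]
    n = len(x)
    pa, pb = sum(x) / n, sum(y) / n
    pab = sum(i & j for i, j in zip(x, y)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else (pab - pa * pb) ** 2 / denom


def r2_on_the_fly(a: str, b: str) -> float:
    """Order-independent lookup so (a, b) and (b, a) share one cache entry."""
    return _r2_sorted(*sorted((a, b)))


first = r2_on_the_fly("snp1", "snp2")   # computed
again = r2_on_the_fly("snp2", "snp1")   # served from the cache
info = _r2_sorted.cache_info()
```

Setting `maxsize=None` approaches the pre-compute-everything end of the spectrum, while a small `maxsize` approaches pure on-the-fly computation.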
In this study, we applied Tone = 0.8, Ttwo = 0.9 and Tthree = 0.95 (Material and Methods section, Formulae 1 and 2). Certainly, we can choose different values for Ttwo and Tthree, e.g. a uniform Tmultiple (e.g. Ttwo = Tthree = 0.9), which will not bias against three-marker tag SNPs. In the multiTag algorithm and computer software, these three threshold values (Tone, Ttwo and Tthree) can be flexibly tuned to achieve (1) differently sized StagSNP (e.g. StagSNP tends to be larger when higher T values are applied); (2) various ratios between single-marker tag SNPs and multiple-marker SNPs and (3) various portability of resulting tag SNPs.
During SNP tagging, two or more candidate SNPs sometimes have equal SCS. In this situation, we randomly pick one of the best choices and continue the selection. Alternatively, we can modify formula 1 to
Haploview has also implemented a multiple-marker tag SNP selection method (Barrett et al., 2005; de Bakker et al., 2006), but in a rather ad hoc manner. This algorithm works in two phases: (1) tag SNP selection based on pair-wise $r^2$, which is equivalent to Carlson's greedy approach; (2) a search for specific multi-marker (haplotype) tests to improve tagging efficiency (de Bakker et al., 2006). Phase (2) is done by iteratively dropping tag SNPs, one by one, and replacing them with a specific multi-marker predictor (using any of the remaining tag SNPs). That predictor is accepted only if it can capture the alleles originally captured by the discarded tag SNP; otherwise, the provisionally dropped tag is considered indispensable and kept (de Bakker et al., 2006). This algorithm will miss some good two-marker predictors. For example, suppose SNP1 is a single-marker tag for an LD bin and is therefore recruited into StagSNP by Haploview in phase (1). SNP2 by itself is a singleton, but the combination of SNP1 and SNP2 predicts a few other SNPs. The Haploview algorithm will miss such a combination. To date, Haploview's multiple-marker tag SNP selection mode handles only about 10 000 SNPs (or a ~10 Mb chromosome segment for HapMap release 21) in one run, and does not work at a chromosome-wide scale. Therefore, we did not conduct a head-to-head comparison between Haploview and multiTag.
Multiple testing remains the primary challenge in WGAS. Many correction approaches have been proposed (Bender and Lange, 2001; Chen et al., 2006; Hao et al., 2004; Herbert et al., 2006; Peer et al., 2006; Rosenberg et al., 2006; Wen et al., 2006). There are two strategies for dealing with multiple testing. (1) The statistical significance level should be adjusted by correction methods, and which method to apply depends on the nature of the SNPs being genotyped. For example, if the genotyped SNPs have weak LD among each other, Bonferroni correction would be adequate. (2) The number of hypothesis tests in WGAS should be carefully controlled. If we test all two- or three-marker combinations for genetic association with the study trait, the multiple comparison penalties may quickly diminish statistical power. In this study, we record the marker combinations that contribute to genetic coverage (SCS) during tag SNP selection, and only these recorded combinations are tested for association in WGAS. By these means, we keep the number of tests in check. For example, in CEU, the multiple testing burden increases by ~60% for 300 K two-marker tag SNPs compared to 300 K single-marker tag SNPs (Fig. 3).
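To make the size of this burden concrete, the following worked example uses a Bonferroni correction. This is an illustration only: the nominal level $\alpha = 0.05$ is an assumption, and the power comparison in this article is based on FDR rather than Bonferroni thresholds.

```latex
\alpha_{\text{single}} = \frac{0.05}{300\,000} \approx 1.67 \times 10^{-7},
\qquad
\alpha_{\text{two}} = \frac{0.05}{1.6 \times 300\,000}
                    = \frac{0.05}{480\,000} \approx 1.04 \times 10^{-7}
```

Under these assumptions, a ~60% increase in the number of tests tightens the per-test threshold by a factor of 1.6. By contrast, testing every pair among 300 K SNPs would mean on the order of $4.5 \times 10^{10}$ tests.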
In terms of statistical power, we investigate whether tag SNPs selected using multiTag (e.g.
$\sigma^2$), where
$\sigma^2$ = 1 and µ differs among genotypes: µAA = –3, µAa = 0 and µaa = 3. These parameters are chosen to place the power and FDR in a range convenient for comparison. (3) A Kruskal–Wallis test is conducted between the trait and each tag SNP (as well as the recorded tag SNP combinations). (4) We permute the trait values and repeat step (3) in order to derive the FDR. In total, 10 000 simulation loops are run, and we compare the relative power at FDR = 5 and 10% (Fig. 4). Clearly, 200 K tag SNPs are more powerful than 100 K tag SNPs. More importantly, at a fixed tag SNP number (or fixed genotyping cost), the multiTag approach offers extra power even after adjusting for multiple testing. For example, at 10% FDR, 200 K tag SNPs derived in
Taken together, the tag SNP strategy is based on our recent understanding of the fine LD structure in the human genome. However, at the current stage, only the pair-wise LD information (e.g.
| ACKNOWLEDGEMENTS |
|---|
The author wants to thank Dr. Joshua Millstein for insightful discussion and comments on the manuscript. The author also feels grateful to the reviewers for their valuable suggestions, which strengthen the paper greatly.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Present address: Rosetta Inpharmatics, a wholly owned subsidiary of Merck and Co. Inc., 401 Terry Ave. N., Seattle, WA, USA.
Received on May 21, 2007; revised on September 8, 2007; accepted on September 28, 2007
| REFERENCES |
|---|
The International HapMap Consortium. The International HapMap Project. Nature (2003) 426:789–796.
Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat. Genet (2006) 38:659–662.
Barrett JC, et al. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics (2005) 21:263–265.
Bender R, Lange S. Adjusting for multiple testing–when and how? J. Clin. Epidemiol (2001) 54:343–349.
Carlson CS, et al. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet (2003) 33:518–521.
Carlson CS, et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet (2004) 74:106–120.
Chen BE, et al. Resampling-based multiple hypothesis testing procedures for genetic case-control association studies. Genet. Epidemiol (2006) 30:495–507.
de Bakker PI, et al. Efficiency and power in genetic association studies. Nat. Genet (2005) 37:1217–1223.
de Bakker PI, et al. Transferability of tag SNPs to capture common genetic variation in DNA repair genes across multiple populations. Pac. Symp. Biocomput (2006) 11:478–486.
Gonzalez-Neira A, et al. The portability of tagSNPs across populations: a worldwide survey. Genome Res (2006) 16:323–330.
Halperin E, et al. Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics (2005) 21(Suppl. 1):i195–i203.
Hao K, et al. Power estimation of multiple SNP association test of case-control study and application. Genet. Epidemiol (2004) 26:22–30.
Hao K, et al. A sparse marker extension tree algorithm for selecting the best set of haplotype tagging single nucleotide polymorphisms. Genet. Epidemiol (2005) 29:336–352.
Hao K, et al. LdCompare: rapid computation of single- and multiple-marker r2 and genetic coverage. Bioinformatics (2006) 23:252–254.
Herbert A, et al. A common genetic variant is associated with adult and childhood obesity. Science (2006) 312:279–283.
Horne BD, Camp NJ. Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation. Genet. Epidemiol (2004) 26:11–21.
Howie BN, et al. Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum. Genet (2006) 120:58–68.
Kruglyak L, Nickerson DA. Variation is the spice of life. Nat. Genet (2001) 27:234–236.
Lin Z, Altman RB. Finding haplotype tagging SNPs by use of principal components analysis. Am. J. Hum. Genet (2004) 75:850–861.
Marchini J, et al. A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet (2006) 78:437–450.
Peer I, et al. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat. Genet (2006) 38:663–667.
Phuong MZ, et al. Choosing SNPs using feature selection. (2005) Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference. 301–309. doi: 10.1109/CSB.2005.22.
Qin ZS, et al. An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium criteria. Bioinformatics (2006) 22:220–225.
Rosenberg PS, et al. Multiple hypothesis testing strategies for genetic case-control association studies. Stat. Med (2006) 25:3134–3149.
Sebastiani P, et al. Minimal haplotype tagging. Proc. Natl Acad. Sci. USA (2003) 100:9900–9905.
Stephens M, et al. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet (2001) 68:978–989.
Stram DO, et al. Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum. Hered (2003) 55:27–36.
Wen SH, et al. A two-stage design for multiple testing in large-scale association studies. J. Hum. Genet (2006) 51:523–532.



