Bioinformatics Advance Access originally published online on November 15, 2007
Bioinformatics 2007 23(23):3178-3184; doi:10.1093/bioinformatics/btm496

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Genome-wide selection of tag SNPs using multiple-marker correlation

K. Hao

Algorithm and Data Analysis, Affymetrix, Inc., 3420 Central Expressway, Santa Clara, California, USA

To whom correspondence should be addressed.


    ABSTRACT

Motivations: The tag SNP approach is a valuable tool in whole genome association studies, and a variety of algorithms have been proposed to identify the optimal tag SNP set. Currently, most tag SNP selection is based on two-marker (pairwise) linkage disequilibrium (LD). Recent literature has shown that multiple-marker LD also contains useful information that can further increase the genetic coverage of the tag SNP set. Thus, tag SNP selection methods that incorporate multiple-marker LD are expected to have advantages in terms of genetic coverage and statistical power.

Results: We propose a novel algorithm to select tag SNPs in an iterative procedure. In each iteration, the SNP that captures the most neighboring SNPs (through pair-wise and multiple-marker LD) is selected as a tag SNP. We optimize the algorithm and computer program to make our approach feasible on today's typical workstations. Benchmarked using HapMap release 21, our algorithm outperforms the standard pair-wise LD approach in several aspects. (i) It improves genetic coverage (e.g. by 7.2% for 200 K tag SNPs in HapMap CEU) compared to its conventional pair-wise counterpart, when conditioning on a fixed tag SNP number. (ii) It saves genotyping costs substantially when conditioning on fixed genetic coverage (e.g. 34.1% saving in HapMap CEU at 90% coverage). (iii) Tag SNPs identified using multiple-marker LD have good portability across closely related ethnic groups and (iv) show higher statistical power in association tests than those selected using conventional methods.

Availability: A computer software suite, multiTag, has been developed based on this novel algorithm. The program is freely available by written request to the author at ke_hao{at}merck.com

Contact: ke_hao{at}163.com

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
It has been estimated that the human genome harbors >5 million common SNPs with minor allele frequency (MAF) of at least 10% (Carlson et al., 2003; Gonzalez-Neira et al., 2006; Kruglyak and Nickerson, 2001), and 7.5 million common SNPs with MAF of at least 5% (Barrett and Cardon, 2006). These polymorphisms explain a portion of the heritable risk for perhaps many diseases. There are two common strategies for constructing the contents of SNP genotyping panels: (1) SNPs chosen approximately randomly across the genome, ignoring linkage disequilibrium (LD) patterns, and (2) LD-based tag SNPs chosen to maximize genetic coverage (Barrett and Cardon, 2006; Pe’er et al., 2006). Here, the genetic coverage is defined as the fraction of all common (MAF ≥ 5%) SNPs exceeding some correlation threshold with at least one SNP typed by the array. The tag SNP approach takes advantage of our recent understanding of the fine LD structure of the human genome and reduces genotyping costs (Carlson et al., 2003, 2004; Gonzalez-Neira et al., 2006). Driven by such a large potential benefit, a variety of algorithms have been proposed to efficiently identify tag SNPs, which is essentially a feature selection problem from the machine-learning viewpoint.

The SNP tagging strategy is tightly linked to the downstream testing methods for genetic association. If the selection starts from phased haplotype data and tag SNPs are picked to maximize the haplotypes they can distinguish, the downstream association studies might be more powerful when employing haplotype-based tests (Hao et al., 2005; Howie et al., 2006; Sebastiani et al., 2003). If the selection starts from diploid genotypes and tag SNP panels are developed to maximize genetic coverage through pair-wise LD (e.g. $r^2$), single-locus association testing could be more appropriate (de Bakker et al., 2005; Pe’er et al., 2006). Extending pair-wise LD, $r^2$ among multiple markers has been proposed to further increase the genetic coverage of SNP panels (de Bakker et al., 2005; Hao et al., 2006; Pe’er et al., 2006). For example, using combinations of two genotyped SNPs, the additional coverage gain is more than 10% for the Illumina HumanHap300 K panel in Caucasians (Pe’er et al., 2006). Software packages have also become available to quickly compute multiple-marker $r^2$ and achieve good coverage at a genome-wide scale (Barrett et al., 2005; Hao et al., 2006). It is noteworthy that such additional coverage gain is achieved on tag SNP panels that were developed solely using pair-wise $r^2$. Could tag SNPs instead be selected by incorporating multiple-marker LD information from the start? In this article, we propose an extension of Carlson's greedy algorithm (Carlson et al., 2004). Our new method identifies tag SNPs by simultaneously considering their pair-wise and multiple-marker LD with nearby neighbors. Furthermore, we evaluate (1) the gain in genetic coverage, (2) the saving in genotyping costs, (3) the portability of the tag SNPs and (4) the statistical power in association studies.


    2 METHODS
2.1 Data
The HapMap release 21 (The International HapMap Consortium, 2003) samples comprise 270 individuals from four populations: (1) 30 trios from the Yoruba in Ibadan, Nigeria; (2) 30 trios from the CEPH collection (Utah residents with ancestry from Northern and Western Europe); (3) 45 unrelated Han Chinese individuals from Beijing, China and (4) 45 unrelated individuals from Tokyo, Japan. The Han Chinese and Japanese are often considered as a single East Asian sample (Gonzalez-Neira et al., 2006). The HapMap Project genotyped more than 4 million SNPs, among which about 2.2 million SNPs are common (MAF ≥ 5%); this number varies depending on the ethnic group.

2.2 Multiple-marker correlation
2.2.1 A SNP's correlation with another marker ($r^2_{\text{one}}$)
Consider SNP2 and its neighbor (SNP1) within a specified distance (e.g. 100 kb). We term their pair-wise $r^2$ as $r^2_{\text{one}}$ because only a single SNP is used as the predictor in tag SNP selection and the downstream association test. The notation $r^2_{\text{one}}$ is also consistent with the multiple-marker $r^2$ notations.

2.2.2 A SNP's correlation with another two markers ($r^2_{\text{two}}$)
Herein, we implement a previously proposed method for computing multiple-SNP $r^2$ (de Bakker et al., 2005; Hao et al., 2006; Pe’er et al., 2006). Let us consider SNP3 and its two neighbors (SNP1 and SNP2) within a certain distance. Each SNP carries two possible alleles (SNP1 carries alleles A and a, SNP2 carries B and b, and SNP3 carries C and c). A multiple-marker $r^2$ can be used to quantify the correlation between SNP3 and the combination of SNP1 and SNP2. This combination of SNP1 and SNP2 may form four possible haplotypes (AB, Ab, aB and ab). Therefore, this SNP combination can be treated as a multi-allelic marker, which carries four alleles, denoted as AB, Ab, aB and ab. Pooling {Ab, aB and ab}, we transform this multi-allelic marker to a bi-allelic marker, which carries alleles AB and non-AB. We compute the pair-wise $r^2$ between this new bi-allelic marker and SNP3, and record the result as $r^2_{AB}$. Similarly, we calculate $r^2_{Ab}$ by pooling {AB, aB and ab}. The same applies to $r^2_{aB}$ and $r^2_{ab}$. Finally, we define $r^2_{\text{two}}$ between SNP3 and the combination of SNP1 and SNP2 as $\max\{r^2_{AB}, r^2_{Ab}, r^2_{aB}, r^2_{ab}\}$.

2.2.3 A SNP's correlation with another three markers ($r^2_{\text{three}}$)
There are four SNPs (SNP1, SNP2, SNP3 and SNP4), and we are interested in the $r^2_{\text{three}}$ of SNP4 with its three neighbors (SNP1, SNP2 and SNP3). SNP1, SNP2 and SNP3 form $2^3 = 8$ possible haplotypes. Again, we construct a novel bi-allelic marker by pooling seven haplotypes together, and obtain $r^2_{\text{three}}$ after eight iterations. Similarly, $r^2_{\text{four}}$ or even higher orders of LD can be computed in the same framework.
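
The pooling procedure of Sections 2.2.2 and 2.2.3 can be sketched in a few lines of Python. This is only an illustrative sketch and not the multiTag implementation (which is written in C++). It assumes phased haplotypes coded as lists of 0/1 alleles, one entry per chromosome, and the function names `pairwise_r2` and `multi_marker_r2` are hypothetical.

```python
from itertools import product


def pairwise_r2(x, y):
    """Squared correlation between two bi-allelic markers coded 0/1 per chromosome."""
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    if px in (0.0, 1.0) or py in (0.0, 1.0):
        return 0.0  # a monomorphic marker carries no LD information
    pxy = sum(a * b for a, b in zip(x, y)) / n
    d = pxy - px * py
    return d * d / (px * (1 - px) * py * (1 - py))


def multi_marker_r2(predictors, target):
    """Maximum r^2 between `target` and every 'one haplotype versus the rest'
    marker that can be formed from the predictor SNPs (2^k haplotypes for k SNPs)."""
    best = 0.0
    for hap in product((0, 1), repeat=len(predictors)):
        # Pool all other haplotypes: 1 if the chromosome carries `hap`, else 0.
        pooled = [int(tuple(alleles) == hap) for alleles in zip(*predictors)]
        best = max(best, pairwise_r2(pooled, target))
    return best
```

With two predictor SNPs the loop visits the four pooled markers (AB, Ab, aB, ab) and with three predictors it visits eight, mirroring the descriptions above. Enumerating all $2^k$ haplotypes is affordable only for small k, which is the regime considered in this article.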

2.3 Algorithm for selecting tag SNPs using multiple-marker LD (multiTag)
Notation:

  1. $S_{\text{candidate}}$: the set of candidate SNPs, from which tag SNPs will be selected. At the beginning of tag SNP selection, all SNPs belong to $S_{\text{candidate}}$.
  2. $S_{\text{tagSNP}}$: the set of tag SNPs. At the beginning of the selection, $S_{\text{tagSNP}}$ is empty, and it increases by one during each selection loop.
  3. $S_{\text{captured}}$: the set of SNPs already captured by $S_{\text{tagSNP}}$. At the beginning of the selection, $S_{\text{captured}}$ is empty, and it increases during the tag SNP selection procedure.

Step 1, Initialization:

  1. We compute the pair-wise $r^2$ ($r^2_{\text{one}}$) between every two SNPs in $S_{\text{candidate}}$ that are within a certain distance, L (L defines the sliding window size; in practice, we usually set L = 100 000 or 200 000 bp). For every SNP in $S_{\text{candidate}}$, the ability to capture its neighboring SNPs (by single-marker LD, $r^2_{\text{one}}$) is quantified with the SNP Capture Score (SCS):


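$$\mathrm{SCS}_i = \sum_{j \in S_{\text{candidate}},\ |x_j - x_i| \le L} I\left(r^2_{\text{one}}(i,j) \ge T_{\text{one}}\right) \qquad (1)$$

where $r^2_{\text{one}}(i,j)$ is the single-marker $r^2$ between SNP i and SNP j, $x_i$ is the chromosomal position of SNP i, and $T_{\text{one}}$ denotes the single-marker LD threshold (e.g. $T_{\text{one}} = 0.8$). $I(\cdot)$ is an indicator variable. It takes the value of 1 when $r^2_{\text{one}}(i,j) \ge T_{\text{one}}$ is true, otherwise 0. All SNPs inside the sliding window (including the SNP itself) are considered. For example, a singleton SNP (a SNP that has no neighboring markers in strong LD) will have SCS = 1.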

  2. From $S_{\text{candidate}}$, we move the SNP with the largest SCS to $S_{\text{tagSNP}}$, and denote it as tagSNP(1), since it is the first member of $S_{\text{tagSNP}}$.
  3. From $S_{\text{candidate}}$, we move all captured SNPs [we define a SNP as captured if it has $r^2_{\text{one}} \ge T_{\text{one}}$ with tagSNP(1)] to $S_{\text{captured}}$.
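
As an illustration of the single-marker SCS used in Step 1, the following Python sketch counts, for each SNP, the SNPs inside its sliding window (including itself) that it captures. It is a toy version under stated assumptions: haplotypes are 0/1 lists, `positions` are base-pair coordinates, and the names `r2` and `capture_scores` are hypothetical. The quadratic scan over all SNP pairs is for clarity only and would not scale to a genome.

```python
def r2(x, y):
    """Pair-wise r^2 between two 0/1 allele vectors (one entry per chromosome)."""
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    if px in (0.0, 1.0) or py in (0.0, 1.0):
        return 0.0
    d = sum(a * b for a, b in zip(x, y)) / n - px * py
    return d * d / (px * (1 - px) * py * (1 - py))


def capture_scores(positions, haplotypes, window=100_000, t_one=0.8):
    """Single-marker SNP Capture Score for every SNP."""
    scores = []
    for i, pos_i in enumerate(positions):
        score = 0
        for j, pos_j in enumerate(positions):
            if abs(pos_j - pos_i) > window:
                continue  # outside the sliding window
            # A SNP always captures itself; neighbors need r^2 >= t_one.
            if i == j or r2(haplotypes[i], haplotypes[j]) >= t_one:
                score += 1
        scores.append(score)
    return scores
```

A singleton SNP receives a score of 1, in agreement with the example given for formula 1.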

Step 2, Iteration:

  1. We update the SCS for all remaining SNPs in $S_{\text{candidate}}$. For example, the SCS of the jth SNP in $S_{\text{candidate}}$, denoted as $\text{candidate}_j$, can be calculated as follows (the formula is limited to $r^2_{\text{two}}$ for illustration purposes, but the algorithm can readily accommodate $r^2_{\text{three}}$):


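$$\mathrm{SCS}(\text{candidate}_j) = \sum_{k} I\left(r^2_{\text{one}}(j,k) \ge T_{\text{one}}\right) + \sum_{k} I\left(r^2_{\text{one}}(j,k) < T_{\text{one}}\right) I\left(\max_{t \in S_{\text{tagSNP}}} r^2_{\text{two}}(k;\, j, t) \ge T_{\text{multiple}}\right) \qquad (2)$$

Both sums run over the members k of $S_{\text{candidate}}$ within the neighborhood of $\text{candidate}_j$, and $r^2_{\text{two}}(k;\, j, t)$ denotes the two-marker $r^2$ between SNP k and the combination of $\text{candidate}_j$ with a neighboring tag SNP t. In the first Σ, we count how many members of $S_{\text{candidate}}$ (within the neighborhood of $\text{candidate}_j$, including itself) are captured by $\text{candidate}_j$ through $r^2_{\text{one}}$. This computation is similar to that in Step 1, although with a smaller $S_{\text{candidate}}$. In the second Σ, we count additional members of $S_{\text{candidate}}$ that are not covered by $\text{candidate}_j$ through $r^2_{\text{one}}$ but are captured by combining $\text{candidate}_j$ and members of $S_{\text{tagSNP}}$ through $r^2_{\text{two}}$. Here, $T_{\text{multiple}}$ is the threshold for ‘useful’ multiple-marker LD (e.g. $T_{\text{multiple}} = 0.9$).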

  2. From $S_{\text{candidate}}$, we move the SNP with the largest SCS to $S_{\text{tagSNP}}$, and denote it as tagSNP(i) if it is the ith member of $S_{\text{tagSNP}}$. We record the combination of tagSNP(i) and other $S_{\text{tagSNP}}$ members if this combination contributes to the SCS of tagSNP(i). This is an important part of our algorithm because those recorded combinations are used in downstream portability evaluation and association tests.
  3. From $S_{\text{candidate}}$, we move all captured SNPs to $S_{\text{captured}}$.

Step 3, Termination:

We continue the iteration until (1) $S_{\text{candidate}}$ becomes empty, (2) $S_{\text{tagSNP}}$ reaches a prespecified size (e.g. 100 000 SNPs) or (3) the coverage reaches a prespecified level. The coverage can be easily calculated from the sizes of $S_{\text{candidate}}$, $S_{\text{tagSNP}}$ and $S_{\text{captured}}$.


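$$\text{coverage} = \frac{|S_{\text{tagSNP}}| + |S_{\text{captured}}|}{|S_{\text{candidate}}| + |S_{\text{tagSNP}}| + |S_{\text{captured}}|} \qquad (3)$$

For example, when $S_{\text{candidate}}$ becomes empty the coverage is 100%.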
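
Steps 1 to 3 can be put together in a compact Python sketch. It is restricted to single-marker and two-marker LD, works on toy in-memory data, and makes several assumptions that are not part of the description above: haplotypes are 0/1 lists, ties in SCS are broken by the lowest SNP index (rather than randomly), a tag SNP used in a combination must lie within the window of both the candidate and the SNP being captured, and all function names are hypothetical. It is meant to show the control flow, not to reproduce multiTag.

```python
from itertools import product


def r2(x, y):
    """Pair-wise r^2 between two 0/1 allele vectors."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    if px in (0.0, 1.0) or py in (0.0, 1.0):
        return 0.0
    d = sum(a * b for a, b in zip(x, y)) / n - px * py
    return d * d / (px * (1 - px) * py * (1 - py))


def r2_two(p1, p2, target):
    """Two-marker r^2: best 'one haplotype versus the rest' marker from p1 and p2."""
    best = 0.0
    for hap in product((0, 1), repeat=2):
        pooled = [int((a, b) == hap) for a, b in zip(p1, p2)]
        best = max(best, r2(pooled, target))
    return best


def select_tags(pos, hap, window=100_000, t_one=0.8, t_multiple=0.9,
                max_tags=None, target_coverage=1.0):
    candidate = set(range(len(pos)))
    tags, captured, recorded = [], set(), []
    coverage = 0.0

    def near(i, j):
        return abs(pos[i] - pos[j]) <= window

    def capture_set(j):
        """Candidates captured by j alone, or by j combined with an existing tag."""
        got, combos = {j}, set()
        for k in candidate:
            if k == j or not near(j, k):
                continue
            if r2(hap[j], hap[k]) >= t_one:
                got.add(k)
                continue
            for t in tags:
                if near(j, t) and near(k, t) and \
                        r2_two(hap[j], hap[t], hap[k]) >= t_multiple:
                    got.add(k)
                    combos.add((t, j))  # record the contributing combination
                    break
        return got, combos

    while candidate:
        # Largest SCS wins; sorted() makes the lowest index win ties.
        best = max(sorted(candidate), key=lambda j: len(capture_set(j)[0]))
        got, combos = capture_set(best)
        tags.append(best)
        recorded.extend(sorted(combos))
        candidate -= got
        captured |= got - {best}
        coverage = (len(tags) + len(captured)) / len(pos)  # formula 3
        if (max_tags and len(tags) >= max_tags) or coverage >= target_coverage:
            break
    return tags, recorded, coverage
```

On a toy region where one SNP is perfectly predicted only by the haplotype of two other SNPs, this sketch needs two tags when two-marker LD is allowed and three tags when it is disabled (for instance by setting `t_multiple` above 1), which is the kind of saving the multiple-marker approach aims for.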

2.4 Evaluation of genetic coverage and portability for tag SNPs
The genetic coverage on a training sample itself can be easily computed using formula 3, as presented in Figure 1. Recently, the portability of tag SNPs has attracted great interest, especially for populations from the same ethnic categories (e.g. Caucasian). For example, how well does $S_{\text{tagSNP}}$ developed in HapMap CEU subjects collected from Utah perform on a Caucasian cohort collected in Europe? In this article, we look at portability among HapMap CHB and JPT cohorts. We identify tag SNPs in JPT (based on $r^2_{\text{one}}$ and $r^2_{\text{two}}$), and record the two-SNP combinations of $S_{\text{tagSNP}}$ members that contribute to SCS. In CHB (an independent validation sample set), we calculate the fraction of SNPs that are captured by $S_{\text{tagSNP}}$ (i.e. the genetic coverage of $S_{\text{tagSNP}}$ in CHB) either through $r^2_{\text{one}}$ or, for the recorded combinations of $S_{\text{tagSNP}}$ members, through $r^2_{\text{two}}$.


Fig. 1. Genetic coverage of tag SNPs selected using (1) $r^2_{\text{one}}$ only (dashed line), (2) incorporating $r^2_{\text{two}}$ (solid line) and (3) further incorporating $r^2_{\text{three}}$ (dotted line). The tag SNP selection and coverage calculation were conducted on HapMap release 21 data using a sliding window width of 100 kb.

 

    3 RESULTS
To evaluate this novel multiple-marker SNP tagging approach, we apply it to HapMap release 21, which contains more than 4 million SNPs, a portion of which (e.g. ~2.1 million in CEU) are common SNPs, defined as MAF ≥ 5%. In this study, we only focus on these common SNPs (Barrett and Cardon, 2006; de Bakker et al., 2006). Figure 1 illustrates the tag SNP selection procedure with the thresholds $T_{\text{one}} = 0.8$, $T_{\text{two}} = 0.9$ and $T_{\text{three}} = 0.95$. For any $S_{\text{tagSNP}}$ size, the tag SNPs selected using the multiple-marker approach (the solid line and dotted line) have higher coverage than their $r^2_{\text{one}}$ counterparts. The coverage improvement from $r^2_{\text{one}}$ to $r^2_{\text{two}}$ is sizable (e.g. 7.2% for 200 K tag SNPs in CEU). However, we observe limited additional gain when extending to $r^2_{\text{three}}$, suggesting that combinations of three SNPs are less likely to be good surrogates of neighboring markers, at least for the MAF and $T_{\text{three}}$ we considered. At the early phase of selection (when $S_{\text{tagSNP}}$ is small), the slopes of all three curves are substantial because tag SNPs are capturing large LD bins (Fig. 1). As the number of tag SNPs increases, e.g. with the size of $S_{\text{tagSNP}}$ in the interval $(1 \times 10^5, 2 \times 10^5)$, we observe a gradually decreasing slope because smaller LD bins are being tagged. Furthermore, when $S_{\text{tagSNP}}$ reaches ~276 K in CEU, the single-marker curve becomes linear, indicating that we have captured all LD bins and have started genotyping singleton SNPs. This transition point comes earlier for multiple-marker tag SNPs. Conditioning on a fixed SNP number, multiple-marker tags offer higher coverage. From another viewpoint, the multiple-marker approach reduces genotyping costs for a given genetic coverage (Table 1). If we target 90% coverage in CEU, the single-marker algorithm requires 356.4 K SNPs, whereas the two-marker algorithm requires only 234.7 K tag SNPs, which translates into a 34.1% saving. Again, we observe only minor additional savings when extending to $r^2_{\text{three}}$.
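
The quoted saving follows directly from the two tag SNP counts at 90% coverage in CEU. The short Python check below reproduces the arithmetic; the helper name `relative_saving` is hypothetical.

```python
def relative_saving(n_single, n_multi):
    """Fraction of tag SNPs saved by the multiple-marker panel."""
    return (n_single - n_multi) / n_single


# Tag SNP counts (in thousands) quoted in the text for 90% coverage in CEU.
saving = relative_saving(356.4, 234.7)
print(f"{saving:.1%}")  # prints 34.1%
```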


Table 1. Number of tag SNPs ($\times 10^3$) needed to achieve genetic coverage thresholds

 
Tag SNPs optimized on a training dataset may not perform equally well on an independent study cohort that was not used for tag SNP selection. Such a phenomenon is often referred to as portability loss, describing the decrease in genetic coverage when applying tag SNPs to an independent sample set. In this article, we examined the portability of multiple-marker tag SNPs on two closely related ethnic populations (HapMap CHB and JPT). For example, in Figure 2A, we identified $S_{\text{tagSNP}}$ using single-marker and two-marker approaches in JPT, and then evaluated the genetic coverage of $S_{\text{tagSNP}}$ in JPT and CHB. It is noteworthy that, during two-marker tag SNP selection, we recorded the marker combinations that contribute to the SCS of each tag SNP. In the coverage calculation, only the tag SNPs themselves and the recorded combinations were evaluated. By these means, the number of hypothesis tests only moderately increases when applying two-marker tag SNPs in an association study. Our portability experiments yield a few interesting observations: (i) $S_{\text{tagSNP}}$ shows lower genetic coverage in the validation samples (e.g. CHB in Fig. 2A) than in the training samples (e.g. JPT in Fig. 2A), and such coverage decrease is essentially the portability loss; (ii) more importantly, $S_{\text{tagSNP}}$ identified using either approach (single-marker or two-marker) has similar portability; (iii) in both the training samples and the validation samples, two-marker $S_{\text{tagSNP}}$ offers higher genetic coverage than its single-marker counterpart and (iv) furthermore, two-marker $S_{\text{tagSNP}}$ has even higher genetic coverage in the validation sample than single-marker $S_{\text{tagSNP}}$ in the training sample.


Fig. 2. Summary of genetic coverage in portability experiments at the $r^2 \ge 0.8$ cutoff threshold. In the upper panel, we identify various numbers of tag SNPs in HapMap JPT data (i.e. the training dataset) using either the single-marker approach or the two-marker approach. The genetic coverage of these tag SNPs is then evaluated in the training dataset itself (subject to over-fitting) and in an independent validation dataset (HapMap CHB). In the lower panel, we switch the training and validation datasets, and observe consistent results.

 
In the context of HapMap release 21 (2.2 million common SNPs) and a ±100 kb window size, about $10^8$ calculations of $r^2$ are required for $r^2_{\text{one}}$ mode, because a given SNP has n ≈ 100 neighbors within the window. For $r^2_{\text{two}}$ mode, because we examine all pairs of neighboring SNPs, the complexity goes from $O(n)$ to $O(n^2)$ (which translates to 100 times more calculations). The complexity goes even higher, to $O(n^3)$, for $r^2_{\text{three}}$ mode (Hao et al., 2006). Fortunately, the number of calculations increases less dramatically in reality, since many SNPs are captured in $r^2_{\text{one}}$ or $r^2_{\text{two}}$ mode and there is therefore no need to extend to a higher order. Expertise in software development and programming is critical in its implementation. A GNU-licensed program suite, multiTag, is published along with this article and is freely available. Written in ANSI C++, multiTag strongly emphasizes speed and scalability, and has been successfully tested on Windows XP, Linux and Sun Solaris platforms at a chromosome-wide scale. It is able to run in $r^2_{\text{one}}$ (equivalent to Carlson's greedy method), $r^2_{\text{two}}$ or $r^2_{\text{three}}$ modes. Using HapMap release 21 CEU data and typical workstations (Intel Xeon 2.80 GHz CPU and 512 MB memory) as the test-bed, the $r^2_{\text{one}}$ mode finishes the entire genome within 1 h; in contrast, the $r^2_{\text{two}}$ and $r^2_{\text{three}}$ modes require ~100 and ~300 h, respectively, to finish a large chromosome (e.g. Chromosome 2). Fortunately, tag SNP selection on each chromosome can be run in parallel on a Linux cluster. If terminated prematurely (e.g. a Linux cluster node crashes for an unknown reason), multiTag is able to pick up partial results and resume the computation, which appears to be a valuable feature when running the program for a long period. Incorporating $r^2_{\text{three}}$ is very computationally intensive, but may only yield a small gain in genetic coverage (Fig. 1) or savings in genotyping (Table 1), suggesting that tag SNP selection in $r^2_{\text{two}}$ order is more cost-effective.
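
The orders of magnitude quoted above can be checked with a few lines of Python. The sketch simply multiplies the number of common SNPs by the approximate neighborhood size once per LD order, so the figures are rough upper bounds under that assumption and ignore the early termination that reduces the real workload.

```python
common_snps = 2_200_000   # common SNPs in HapMap release 21 (from the text)
neighbors = 100           # approximate number of neighbors within the window

one_marker = common_snps * neighbors      # pair-wise r^2 evaluations
two_marker = one_marker * neighbors       # all pairs of neighboring SNPs
three_marker = two_marker * neighbors     # all triples of neighboring SNPs

for name, count in [("one", one_marker), ("two", two_marker),
                    ("three", three_marker)]:
    print(f"{name}-marker mode: ~{count:.1e} r^2 evaluations")
```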


    4 DISCUSSION
Whole genome association study (WGAS) using tag SNPs is a powerful approach for elucidating the genetic basis of common human diseases such as hypertension, type 2 diabetes mellitus and osteoporosis. A variety of techniques have been proposed for tag SNP selection (Barrett et al., 2005; Carlson et al., 2004; de Bakker et al., 2006; Halperin et al., 2005; Hao et al., 2005; Howie et al., 2006; Qin et al., 2006; Sebastiani et al., 2003; Stram et al., 2003), but many of them have only been tested on relatively small chromosomal regions. Potentially, they can be extended to the genome-wide scale, although substantial modification is necessary to make them computationally feasible in terms of memory usage and CPU run time. Because choosing tag SNPs is literally a feature selection problem, several established feature selection algorithms have been applied (Halperin et al., 2005; Horne and Camp, 2004; Lin and Altman, 2004; Phuong et al., 2005). However, these methods are still computationally complex, although they do not require exponential search time. As a result, they can only be used on gene regions or small chromosomes. An alternative is to focus on haplotype blocks, but the blocks are not always straightforward to define. Moreover, some feature selection methods (e.g. principal component analysis) derive mathematical abstractions, and mapping them to SNPs introduces one more level of complexity. Set theory has also been used (Sebastiani et al., 2003), but it only identifies perfect tag SNP sets (with 100% prediction power) and does not scale up to the entire genome. Currently, the block-free tag SNP selection strategy (Carlson et al., 2004) is employed by Illumina in developing whole-genome SNP arrays (Barrett and Cardon, 2006; Pe’er et al., 2006). These arrays are designed to optimize genetic coverage based on pair-wise $r^2$ ($r^2_{\text{one}}$).
Interestingly, Pe’er and colleagues showed that by incorporating multiple-marker $r^2$ into the calculation, these SNPs offer more than 10% higher genetic coverage, which may also boost the statistical power of an association study (Hao et al., 2006; Pe’er et al., 2006). It is natural to ask, ‘could further gains in genetic coverage be achieved by incorporating multiple-marker $r^2$ as early as in the tag SNP selection step?’ In this article, we propose a novel algorithm for tag SNP selection using multiple-marker $r^2$, and we systematically benchmark its performance. Our algorithm outperforms the standard Carlson approach in several aspects. (i) It improves genetic coverage (e.g. by 7.2% for 200 K tag SNPs in CEU) compared to the single-marker method, when conditioning on a fixed tag SNP number. (ii) It reduces genotyping costs when conditioning on a fixed genetic coverage (e.g. 34.1% savings in CEU at 90% coverage). (iii) Tag SNPs identified using our approach have portability similar to that of Carlson's approach across closely related ethnic groups (e.g. HapMap CHB and JPT collected in East Asia), and we believe this result can be generalized to cohorts of European ancestry.

Selecting a set of tag SNPs by exhaustive searching of all possible combinations is computationally intensive, and becomes impractical at the genome-wide scale even when limited to $r^2_{\text{one}}$ (Carlson et al., 2004; Hao et al., 2005; Qin et al., 2006; Sebastiani et al., 2003). When extending to the orders of $r^2_{\text{two}}$ and $r^2_{\text{three}}$, computation time and memory use become critical issues in algorithm development. Carlson's greedy approach greatly reduces the search space (Hao et al., 2005), and is therefore fast and memory efficient. The identified tag SNP set is fairly close to the minimum size, although without a mathematical guarantee (Carlson et al., 2004; Howie et al., 2006; Qin et al., 2006). More importantly, a tag SNP set containing a certain degree of redundancy offers better portability than the mathematically minimal set (data not shown). Based on the above rationale, we extend Carlson's greedy method and incorporate higher order $r^2$ (e.g. $r^2_{\text{two}}$ and $r^2_{\text{three}}$). In each iteration, we only consider multiple-SNP $r^2$ formed by one candidate SNP and its neighbors in $S_{\text{tagSNP}}$; by these means, the search space is further reduced and the algorithm becomes computationally feasible. As shown in Figure 1, at the early phase of tag SNP selection, our approach (solid line and dotted line) is similar to Carlson's method (dashed line), because $S_{\text{tagSNP}}$ is small and $r^2_{\text{two}}$ and $r^2_{\text{three}}$ make little contribution to SCS. As $S_{\text{tagSNP}}$ becomes larger, there are SNP combinations formed by a $S_{\text{candidate}}$ member and its neighbors in $S_{\text{tagSNP}}$ that give high $r^2_{\text{two}}$ or $r^2_{\text{three}}$. Hence, the genetic coverage of the multiple-marker algorithm starts to exceed that of Carlson's method. It should be noted that, when $S_{\text{tagSNP}}$ gets larger, more SNP combinations need to be evaluated in terms of $r^2_{\text{two}}$ and $r^2_{\text{three}}$, and the computational complexity grows quickly. Generally, there are two strategies for handling the large number of SNP combinations.
(1) We could pre-compute $r^2_{\text{two}}$ and $r^2_{\text{three}}$ for all possible combinations using currently available software. These $r^2$ values are stored either in memory or on hard disk, and then used in the SCS calculation during tag SNP selection. The drawback is the large memory requirement (if $r^2$ is stored in memory) or heavy file IO demand (if $r^2$ is stored on hard disk). (2) Alternatively, we can compute the $r^2_{\text{two}}$ and $r^2_{\text{three}}$ of a given SNP combination on-the-fly. This strategy obviously has advantages in memory usage and/or file IO demand; however, more $r^2$ computation is required (because the $r^2$ value of a certain SNP combination may be used in several SCS calculations). The current version of multiTag employs the latter strategy, and can therefore run on a typical workstation with 512 MB memory. The computation of $r^2_{\text{two}}$ and $r^2_{\text{three}}$ needs 3- and 4-SNP haplotype data, respectively. In our study, we directly used haplotypes (HapMap release 21) as input, which were reconstructed using the program PHASE (Marchini et al., 2006; Stephens et al., 2001). Our algorithm can accommodate diploid data and reconstruct 3- or 4-SNP haplotypes on-the-fly; however, this strategy could be time consuming and potentially less accurate (Hao et al., 2006). As a result, researchers are recommended to first apply PHASE or other methods (Marchini et al., 2006) to accurately generate haplotypes, and then select tag SNPs using multiTag (the current version of multiTag only accommodates haplotype input).

In this study, we applied $T_{\text{one}} = 0.8$, $T_{\text{two}} = 0.9$ and $T_{\text{three}} = 0.95$ (Methods section, formulae 1 and 2). Certainly, we can choose different values for $T_{\text{two}}$ and $T_{\text{three}}$, e.g. a uniform $T_{\text{multiple}}$ (e.g. $T_{\text{two}} = T_{\text{three}} = 0.9$), which will not bias against three-marker tag SNPs. In the multiTag algorithm and computer software, these three threshold values ($T_{\text{one}}$, $T_{\text{two}}$ and $T_{\text{three}}$) can be flexibly tuned to achieve (1) differently sized $S_{\text{tagSNP}}$ (e.g. $S_{\text{tagSNP}}$ tends to be larger when higher T values are applied); (2) various ratios between single-marker tag SNPs and multiple-marker tag SNPs and (3) various portability of the resulting tag SNPs.

During SNP tagging, sometimes two or more candidate SNPs have equal SCS. In this situation, we randomly pick one of the best choices and continue the selection. Alternatively, we can modify formula 1 to

$$\mathrm{SCS}_i = \sum_{j} I\left(r^2_{\text{one}}(i,j) \ge T_{\text{one}}\right) f\left(r^2_{\text{one}}(i,j)\right)$$

where $f(r^2_{\text{one}})$ is a certain function of $r^2_{\text{one}}$, e.g. [Formula] or [Formula]. Because $f(r^2_{\text{one}})$ is a real number between 0 and 1, it is unlikely to observe an SCS tie when the training data have a reasonably large sample size. More importantly, such a modification biases towards SNPs in tighter LD with their neighbors, hence it will further improve the average $r^2$ and portability of the tag SNPs.

Haploview has also implemented a multiple-marker tag SNP selection method (Barrett et al., 2005; de Bakker et al., 2006), but in a rather ad hoc manner. This algorithm works in two phases: (1) tag SNP selection based on pair-wise $r^2$, which is equivalent to Carlson's greedy approach; (2) searching for specific multi-marker (haplotype) tests to improve tagging efficiency (de Bakker et al., 2006). Phase (2) is done by iteratively dropping tag SNPs, one by one, and replacing them with a specific multi-marker predictor (using any of the remaining tag SNPs). That predictor is accepted only if it can capture the alleles originally captured by the discarded tag SNP; otherwise, the provisionally dropped tag is considered indispensable and kept (de Bakker et al., 2006). This algorithm will miss some good two-marker predictors. For example, suppose SNP1 is a single-marker tag for an LD bin and is therefore recruited into $S_{\text{tagSNP}}$ by Haploview in phase (1), and SNP2 by itself is a singleton, but the combination of SNP1 and SNP2 predicts a few other SNPs. The Haploview algorithm will miss such a combination. To date, Haploview's multiple-marker tag SNP selection mode handles only about 10 000 SNPs (or a ~10 Mb chromosome segment for HapMap release 21) in one run, and does not work at a chromosome-wide scale. Therefore, we did not conduct a head-to-head comparison between Haploview and multiTag.

Multiple testing remains the primary challenge in WGAS. Many correction approaches have been proposed (Bender and Lange, 2001; Chen et al., 2006; Hao et al., 2004; Herbert et al., 2006; Pe’er et al., 2006; Rosenberg et al., 2006; Wen et al., 2006). There are two strategies for dealing with multiple testing. (1) The statistical significance level should be adjusted by correction methods, and which method to apply depends on the nature of the SNPs being genotyped. For example, if the genotyped SNPs are in weak LD with each other, Bonferroni correction would be adequate. (2) The number of hypothesis tests in WGAS should be carefully controlled. If we test all two- or three-marker combinations for genetic association with the study trait, the multiple comparison penalties may quickly diminish statistical power. In this study, we record the marker combinations that contribute to genetic coverage (SCS) during tag SNP selection, and only these recorded combinations are tested for association in WGAS. By these means, we keep the number of tests in check. For example, in CEU, the multiple testing burden increases ~60% for 300 K two-marker tag SNPs compared to 300 K single-marker tag SNPs (Fig. 3).
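
To give a feel for what a ~60% increase in the number of tests means, the following Python sketch compares Bonferroni-adjusted significance thresholds for the two panels. Bonferroni correction is used here purely as an illustration (it is conservative when tests are correlated, and the power analysis in this article conditions on FDR instead); the 0.05 family-wise level and the helper name are assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


single_marker_tests = 300_000
# ~60% more tests once the recorded two-SNP combinations are included.
two_marker_tests = single_marker_tests + single_marker_tests * 60 // 100

for label, n in [("single-marker", single_marker_tests),
                 ("two-marker", two_marker_tests)]:
    print(f"{label}: {n} tests, threshold {bonferroni_threshold(0.05, n):.2e}")
```

The per-test threshold tightens by a factor of 1.6, far less than it would if all possible two-SNP combinations were tested.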


Fig. 3. During tag SNP selection, we recorded the SNP combinations that contribute to the SCS of each tag SNP, and such combinations are tested for association with phenotypes in downstream studies. For single-marker tag SNPs, the number of association tests is essentially the number of tag SNPs; therefore, we observe a 45 degree line (solid line) in this scenario. For the two-marker tag SNP scenario, the number of tests = (number of tag SNPs) + (number of recorded two-SNP combinations). For the three-marker tag SNP scenario, the number of tests = (number of tag SNPs) + (number of recorded two-SNP combinations) + (number of recorded three-SNP combinations).

 
In terms of statistical power, we investigate whether tag SNPs selected using multiTag (e.g. Formula mode) outperform those selected using the traditional approach (e.g. Formula mode). In WGAS, it is difficult to calculate the absolute power directly. Fortunately, conditional on the false discovery rate (FDR), the number of discoveries reflects the relative power. Therefore, we can compare the power of tag SNP sets by looking at the number of discoveries at a fixed FDR (5% and 10% in this article). The 90 HapMap Asian individuals are employed in our power analysis: we use 45 subjects (training sample) for tag SNP selection and the remaining 45 subjects (testing sample) for association testing. The multiTag program is applied to HapMap release 21 and identifies 100 K and 200 K tag SNPs, using Formula and Formula mode (Fig. 4). For Formula mode, the informative tag SNP combinations are recorded. Afterwards, we compare the relative power through simulation. In each simulation loop, (1) one SNP (e.g. carrying genotypes AA, Aa and aa) is randomly picked to be causative, and then (2) we simulate the trait value for the testing sample. In detail, we assume the quantitative trait follows a normal distribution N(µ, σ²), where σ² = 1 and µ differs among genotypes: µAA = –3, µAa = 0 and µaa = 3. These parameters are chosen to keep the power and FDR in a range convenient for comparison. (3) A Kruskal–Wallis test is conducted between the trait and each tag SNP (as well as each recorded tag SNP combination). (4) We permute the trait values and repeat step (3) in order to derive the FDR. In total, 10 000 simulation loops are run, and we compare the relative power at the FDR = 5% and 10% levels (Fig. 4). Clearly, 200 K tag SNPs are more powerful than 100 K tag SNPs. More importantly, at a fixed number of tag SNPs (or fixed genotyping cost), the multiTag approach offers extra power even after adjusting for multiple testing. For example, at 10% FDR, the 200 K tag SNPs derived in Formula mode show 6.5% higher power than their counterparts in Formula mode.
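Steps (1) to (4) of the simulation loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the multiTag implementation: genotypes are coded 0, 1 and 2 for AA, Aa and aa; a recorded tag SNP combination is assumed to be supplied as one more genotype-like vector; the Kruskal–Wallis H statistic is used directly (without tie correction) instead of a P-value; and the FDR at a threshold is estimated as the ratio of permuted to observed statistics exceeding it, which is one common permutation estimator and may differ from the one used in the article. All function names are hypothetical.

```python
import random
from collections import defaultdict

# Trait means from the text: muAA = -3, muAa = 0, muaa = 3 (coding 0/1/2 is an assumption).
GENOTYPE_MEANS = {0: -3.0, 1: 0.0, 2: 3.0}


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(trait, genotypes):
    """Kruskal-Wallis H statistic of the trait grouped by genotype (no tie correction)."""
    n = len(trait)
    groups = defaultdict(list)
    for rank, genotype in zip(average_ranks(trait), genotypes):
        groups[genotype].append(rank)
    if len(groups) < 2:
        return 0.0  # a monomorphic marker carries no information
    s = sum(sum(rs) ** 2 / len(rs) for rs in groups.values())
    return 12.0 / (n * (n + 1)) * s - 3.0 * (n + 1)


def simulate_trait(causal_genotypes, rng, sigma=1.0):
    """Step (2): draw a trait value from N(mu_genotype, sigma^2) for each subject."""
    return [rng.gauss(GENOTYPE_MEANS[g], sigma) for g in causal_genotypes]


def one_loop(tag_genotypes, causal_genotypes, rng):
    """Steps (2)-(4) for one causal SNP: observed and permuted test statistics."""
    trait = simulate_trait(causal_genotypes, rng)
    observed = [kruskal_wallis_h(trait, g) for g in tag_genotypes]
    permuted = trait[:]
    rng.shuffle(permuted)  # step (4): break the trait-genotype link
    null = [kruskal_wallis_h(permuted, g) for g in tag_genotypes]
    return observed, null


def discoveries_at_fdr(observed, null, target_fdr):
    """Largest number of discoveries whose estimated FDR stays at or below target_fdr."""
    best = 0
    for threshold in sorted(set(observed), reverse=True):
        n_obs = sum(1 for x in observed if x >= threshold)
        n_null = sum(1 for x in null if x >= threshold)
        if n_null / n_obs <= target_fdr:
            best = max(best, n_obs)
    return best
```

In use, step (1) would pick `causal_genotypes` at random from the testing sample, `one_loop` would be called once per loop, and the pooled observed and permuted statistics would be passed to `discoveries_at_fdr` at 0.05 and 0.10 for each tag SNP set being compared.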


Fig. 4. At a fixed FDR, the number of discoveries reflects the relative statistical power. Using this strategy, we illustrate that tag SNPs selected with multiTag in multiple-marker mode are uniformly more powerful than those selected in the traditional pairwise mode.

 
Taken together, the tag SNP strategy is based on our recent understanding of the fine LD structure in the human genome. However, at the current stage, only the pairwise LD information (e.g. Formula) is extracted. Herein, we present a novel approach that utilizes multiple-SNP LD in tag SNP selection. This approach efficiently reduces the search space and is therefore computationally feasible. Applied to HapMap release 21, multiTag uniformly outperforms traditional approaches in terms of both genetic coverage and statistical power, and we believe it will facilitate future genetic association studies.


    ACKNOWLEDGEMENTS
The author thanks Dr. Joshua Millstein for insightful discussion and comments on the manuscript. The author is also grateful to the reviewers for their valuable suggestions, which strengthened the paper greatly.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Present address: Rosetta Inpharmatics, a wholly owned subsidiary of Merck and Co. Inc., 401 Terry Ave. N., Seattle, WA, USA.

Received on May 21, 2007; revised on September 8, 2007; accepted on September 28, 2007

    REFERENCES

    The International HapMap Consortium. The International HapMap Project. Nature (2003) 426:789–796.

    Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat. Genet (2006) 38:659–662.

    Barrett JC, et al. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics (2005) 21:263–265.

    Bender R, Lange S. Adjusting for multiple testing–when and how? J. Clin. Epidemiol (2001) 54:343–349.

    Carlson CS, et al. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet (2003) 33:518–521.

    Carlson CS, et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet (2004) 74:106–120.

    Chen BE, et al. Resampling-based multiple hypothesis testing procedures for genetic case-control association studies. Genet. Epidemiol (2006) 30:495–507.

    de Bakker PI, et al. Efficiency and power in genetic association studies. Nat. Genet (2005) 37:1217–1223.

    de Bakker PI, et al. Transferability of tag SNPs to capture common genetic variation in DNA repair genes across multiple populations. Pac. Symp. Biocomput (2006) 11:478–486.

    Gonzalez-Neira A, et al. The portability of tagSNPs across populations: a worldwide survey. Genome Res (2006) 16:323–330.

    Halperin E, et al. Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics (2005) 21(Suppl. 1):i195–i203.

    Hao K, et al. Power estimation of multiple SNP association test of case-control study and application. Genet. Epidemiol (2004) 26:22–30.

    Hao K, et al. A sparse marker extension tree algorithm for selecting the best set of haplotype tagging single nucleotide polymorphisms. Genet. Epidemiol (2005) 29:336–352.

    Hao K, et al. LdCompare: rapid computation of single- and multiple-marker r2 and genetic coverage. Bioinformatics (2006) 23:252–254.

    Herbert A, et al. A common genetic variant is associated with adult and childhood obesity. Science (2006) 312:279–283.

    Horne BD, Camp NJ. Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation. Genet. Epidemiol (2004) 26:11–21.

    Howie BN, et al. Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum. Genet (2006) 120:58–68.

    Kruglyak L, Nickerson DA. Variation is the spice of life. Nat. Genet (2001) 27:234–236.

    Lin Z, Altman RB. Finding haplotype tagging SNPs by use of principal components analysis. Am. J. Hum. Genet (2004) 75:850–861.

    Marchini J, et al. A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet (2006) 78:437–450.

    Pe’er I, et al. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat. Genet (2006) 38:663–667.

    Phuong MZ, et al. Choosing SNPs using feature selection. (2005) Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference. 301–309. doi: 10.1109/CSB.2005.22.

    Qin ZS, et al. An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium criteria. Bioinformatics (2006) 22:220–225.

    Rosenberg PS, et al. Multiple hypothesis testing strategies for genetic case-control association studies. Stat. Med (2006) 25:3134–3149.

    Sebastiani P, et al. Minimal haplotype tagging. Proc. Natl Acad. Sci. USA (2003) 100:9900–9905.

    Stephens M, et al. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet (2001) 68:978–989.

    Stram DO, et al. Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum. Hered (2003) 55:27–36.

    Wen SH, et al. A two-stage design for multiple testing in large-scale association studies. J. Hum. Genet (2006) 51:523–532.

