Bioinformatics Advance Access originally published online on February 23, 2008
Bioinformatics 2008 24(7):972-978; doi:10.1093/bioinformatics/btn071
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Disease association tests by inferring ancestral haplotypes using a hidden Markov model

Shu-Yi Su, David J. Balding and Lachlan J.M. Coin*

Department of Epidemiology and Public Health, Imperial College, London W2 1PG, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATION STUDY
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Most genome-wide association studies rely on single nucleotide polymorphism (SNP) analyses to identify causal loci. The increased stringency required for genome-wide analyses (with per-SNP significance threshold typically ≈ 10^-7) means that many real signals will be missed. Thus it is still highly relevant to develop methods with improved power at low type I error. Haplotype-based methods provide a promising approach; however, they suffer from statistical problems such as abundance of rare haplotypes and ambiguity in defining haplotype block boundaries.

Results: We have developed an ancestral haplotype clustering (AncesHC) association method which addresses many of these problems. It can be applied to biallelic or multiallelic markers typed in haploid, diploid or multiploid organisms, and also handles missing genotypes. Our model is free from the assumption of a rigid block structure but recognizes a block-like structure if it exists in the data. We employ a Hidden Markov Model (HMM) to cluster the haplotypes into groups of predicted common ancestral origin. We then test each cluster for association with disease by comparing the numbers of cases and controls with 0, 1 and 2 chromosomes in the cluster. We demonstrate the power of this approach by simulation of case-control status under a range of disease models for 1500 outcrossed mice originating from eight inbred lines. Our results suggest that AncesHC has substantially more power than single-SNP analyses to detect disease association, and is also more powerful than the cladistic haplotype clustering method CLADHC.

Availability: The software can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin

Contact: l.coin@imperial.ac.uk

Supplementary Information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Genome-wide association studies using high-density genotyping in unrelated cases and controls have been shown to be a successful approach for identifying common causal variants of moderate effect (Sladek et al., 2007; WTCCC, 2007). Single-SNP association tests, such as the Fisher exact or Armitage trend tests, are the most convenient and popular way to analyse such datasets. One downside of single-SNP tests is that they do not exploit correlations between neighbouring SNPs in order to better capture the effects of untyped causal SNPs. In cases where the causal SNP is in weak Linkage Disequilibrium (LD) with surrounding typed SNPs, this limits the power of single-SNP analyses.

Haplotypes, the combination of ordered and closely-linked SNP alleles on a chromosome, can capture much of the correlation structure among the SNPs and can also encapsulate multiple tightly-linked causal variants (Schaid, 2004). Haplotype diversity within a small interval typically arises from mutations rather than from recombination events. Thus, using clusters of haplotypes in association studies is useful because cases tend to have similar haplotypes in the region surrounding a causal SNP.

One potential problem with haplotype-based analysis is that multiple rare haplotypes can lead to loss of power due to the many degrees of freedom. As a result, single-SNP tests sometimes can be more powerful than haplotype-based analysis (Clayton et al., 2004). One solution is to group haplotypes based on similarity and perform statistical tests on these clusters (Durrant et al., 2004; Li and Jiang, 2005; Molitor et al., 2003; Seltman et al., 2003; Tachmazidou et al., 2007; Tzeng et al., 2006; Waldron et al., 2006). These approaches are motivated by the expectation that haplotypes within a cluster are derived from the same ancestral haplotype and hence carry similar risks. They employ various scores to assess haplotype similarity. More recently, Liu et al. (2007) have proposed a weighted haplotype cladistic analysis which weights the contribution of each SNP to the measure of haplotype similarity based on CLADHC (Durrant et al., 2004) by the single-locus p-value of that SNP. Haplotype clustering methods typically use pre-defined windows of SNPs in which to define haplotypes. Ideally, the windows would correspond to haplotype blocks with strong LD between SNPs, and so their size should vary with the background pattern of LD among SNPs. In practice, to apply clustering on a genome-wide scale the windows are often of fixed size (number of SNPs) and ‘slide’ along the genome.

Clustering approaches are motivated by the idea that haplotype clusters reflect aspects of the evolutionary history of case and control chromosomes. Coalescent-based approaches more explicitly model this evolutionary history. Some methods based on the coalescent have been proposed for fine mapping, and essentially ignore recombination (Morris et al., 2002; Rannala and Reeve, 2001; Zollner and Pritchard, 2005). Recently, Minichiello and Durbin (2006) have developed an approach that approximates the coalescent-with-recombination model and which can be applied to larger regions. However, applying these coalescent-based methods to a genome-wide association study is extremely computationally demanding.

Here, we propose a haplotype-clustering approach based on a hidden Markov model (HMM) that is more computationally efficient than coalescent-based approaches but without resorting to a sliding-window scheme or fixed haplotype blocks. We also propose a procedure for performing association tests on the haplotype clusters, using a permutation strategy to obtain significance levels under the null distribution. In this way, each haplotype cluster is represented by an inferred ancestral haplotype, from which all haplotypes in the cluster are assumed to have descended. We aim to identify the ancestral haplotype on which a risk-enhancing mutant arose, so that cases are over-represented among individuals with one or two chromosomes in the cluster. Figure 1 illustrates the motivation behind our approach, for simplicity considering the haploid case. The genealogical tree in Figure 1A is not directly observed, but if we can identify and test Cluster 1 (haplotypes H5–H8), consisting of haplotypes that share the causal mutant allele, then we will have much greater power to detect the association than would be possible from single-SNP tests. Specifically, based on the hypothetical haplotype counts in Figure 1B, the smallest Fisher exact test P-value from any of the typed SNPs is 0.002, whereas testing Cluster 1 against all other haplotypes gives P ≈ 10^-11.


Fig. 1. A genealogical tree (A) and eight haplotype sequences from the leaves of the part of this genealogical tree indicated by solid lines. Mutations are marked by black dots, each with a number specifying its marker position. The untyped causal allele is marked by ‘C’. We have delineated five ancestral clusters. The rows #Cases and #Control specify the numbers of cases and controls among these clusters. The count of each haplotype by cases and controls is given in the table (B). The goal of our method is to group haplotypes H5, H6, H7 and H8 into the same cluster, and subsequently test association of this cluster with case/control status. The Fisher exact test P-values from single-SNP analysis are 0.004 (for each of SNP1–8), 0.209 (SNP9) and 0.002 (SNP10), whereas the P-value for haplotype Cluster 1 is 1.4 × 10^-11.

 
One challenge with our strategy for diploid species is that the haplotypes in the population are not observed. We overcome this problem by treating pairs of ancestral haplotypes as the hidden states in the HMM, using an algorithm similar to that used in HINT (Kimmel and Shamir, 2005) and fastPhase (Scheet and Stephens, 2006). Our algorithm first learns the haplotype structure from the data, and then infers phased pairs of ancestral haplotypes for each individual.

The transition probabilities in our HMM strongly favour staying in the same ancestral haplotype from one marker position to the next, but also allow switching with a small probability which is learned from the data. The inferred ancestral haplotype for each individual is able to switch at different positions, and thus the block boundaries are flexible. However, if a strong block-like structure exists in a dataset, our model will (after the expectation-maximization learning procedure) learn the block structure and preferentially place transitions at block boundaries.

A final challenge for our algorithm is that we do not know the appropriate number of ancestral haplotypes to include in our model in order to capture the principal causal variant in a single ancestral haplotype. To allow for uncertainty in the correct number of ancestral haplotypes, we consider association statistics for 2, 4, 6 and 8 ancestral haplotypes, and retain the maximum statistic over all 20 possibilities for the single ancestral haplotype that best captures the principal causal variant. For computational reasons we do not consider more than eight ancestral haplotypes. A larger number of ancestral haplotypes will be appropriate if the causal variant is rare, having arisen on a lower branch of the genealogical tree.

We apply our ancestral haplotype cluster (AncesHC) approach to a mouse dataset consisting of 1904 outbred mice descended from eight genotyped inbred founder strains (Valdar et al., 2006), having been maintained under random mating for more than 50 generations, thus generating mosaic-like chromosomes. This dataset is particularly amenable to ancestral haplotype modelling, as it is in reality derived from eight founder strains. In addition, because the founder strains are also genotyped, the dataset allows us to accurately assess the success of AncesHC in inferring the ancestral haplotypes.

We simulate disease status for this mouse dataset on the basis of a range of disease models and a range of causal loci (which are subsequently removed for inference). The results suggest that AncesHC has substantially more power than the single-SNP Armitage trend test to detect association with disease, and more power than haplotype clustering analysis using CLADHC for almost all scenarios considered.


    2 METHODS
2.1 Haplotype clustering based on HMM
For each individual we observe genotypes g = (g_1, g_2, ..., g_M) at M SNPs along a chromosome. For generality, let N denote the number of chromosome copies per individual (so that N = 2 for diploid organisms). The genotype of an individual at marker m is thought of as an unordered list of alleles, g_m = {g_m1, ..., g_mN}, and each allele is assumed to have been derived from an ancestral haplotype, drawn from a fixed pool of z distinct ancestral haplotypes. The unordered list of ancestral haplotypes at marker m is denoted by s_m = {s_m1, ..., s_mN}. We will also denote by [s_m1, ..., s_mN] the ordered list of ancestral haplotypes; by π(s_m) = [s_mπ(1), ..., s_mπ(N)] a permutation of this ordered list, and by Π(s_m) the set of all such distinct permutations. Thus, for example, if s_m = [1, 2] there are two permutations, namely [2, 1] and [1, 2], whereas if s_m = [1, 1] there is only one permutation. The sequence s = (s_1, s_2, ..., s_M) forms a Markov chain on the space of unordered lists of ancestral haplotypes.

Recombination events are captured by transitions in the HMM, at which an individual's haplotype can switch the ancestral haplotype from which it is considered to have descended. Transitions are allowed to occur continuously along the sequence, and occur between markers m–1 and m with probability given by


Formula 1

(1)
in which


Formula 2

(2)
where J_m is the probability of a jump occurring at marker m–1. The probability that this jump results in haplotype l_π(n) is α_{m l_π(n)}, irrespective of the current haplotype k_n, which means that ‘transitions’ can occur that do not change the ancestral haplotype. Recall that k_n and l_π(n) are indices of ancestral haplotypes, and hence α_m is a probability vector with z elements.
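A form of Equations (1) and (2) consistent with the description above can be written as follows. This is a reconstruction from the surrounding definitions, not necessarily the authors' exact notation; the per-chromosome transition symbol $t_m$ and the indicator $\mathbb{1}[\cdot]$ are names introduced here for illustration.

```latex
\begin{align}
P\bigl(s_m = \{l_1,\dots,l_N\} \mid s_{m-1} = \{k_1,\dots,k_N\}\bigr)
  &= \sum_{\pi \in \Pi(s_m)} \prod_{n=1}^{N} t_m\bigl(k_n \to l_{\pi(n)}\bigr), \tag{1}\\
t_m\bigl(k_n \to l_{\pi(n)}\bigr)
  &= (1 - J_m)\,\mathbb{1}\bigl[k_n = l_{\pi(n)}\bigr] + J_m\,\alpha_{m l_{\pi(n)}}. \tag{2}
\end{align}
```

Under this form, each chromosome either stays on its current ancestral haplotype (probability $1 - J_m$) or jumps to a haplotype drawn from $\alpha_m$, which may be the one it was already on.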

Also note in Equation (1) that we sum over permutations of the destination list of ancestral haplotypes. To understand this intuitively, consider first transitions between ordered lists of ancestral haplotypes:


Formula 3

(3)
We can consider an unordered list of ancestral haplotypes as the collection of all ordered lists of ancestral haplotypes which are equivalent to each other under permutation (in other words, the permutation operator defines equivalence classes on the set of ordered lists of ancestral haplotypes). The transition probability from an ordered list of ancestral haplotypes [k_1, ..., k_N] to the unordered list {l_1, ..., l_N} should therefore equal the sum of the transition probabilities from the ordered list to each of the ordered lists comprising this equivalence class, which is just the sum over all permutations given in Equation (1). Finally, this sum is the same for each ordered list [k_π(1), ..., k_π(N)], and hence we are able to use it as the transition probability from the unordered list of ancestral haplotypes {k_1, ..., k_N}.

As a consequence, in the case where N = 2 and l_1 = l_2 there is only one permutation of s_m and hence only one term in the sum. On the other hand, whenever N = 2 and l_1 ≠ l_2 there are two terms in the sum, in contrast with the model proposed in Scheet and Stephens (2006), which has only one term when k_1 = k_2.
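The permutation sum can be checked numerically with a minimal Python sketch. It assumes the per-chromosome rule described in the text (stay with probability 1 - J, otherwise jump to a haplotype drawn from α); the function names and the values of `jump` and `alpha` are hypothetical and not taken from the AncesHC software.

```python
from itertools import combinations_with_replacement, permutations


def single_step(k, l, jump, alpha):
    """Per-chromosome transition: stay with prob 1 - jump, else land on l with prob alpha[l]."""
    return (1.0 - jump) * (1.0 if k == l else 0.0) + jump * alpha[l]


def unordered_transition(src, dst, jump, alpha):
    """Transition probability from unordered list src to unordered list dst.

    Sums the ordered-list probability over the distinct permutations of dst.
    """
    total = 0.0
    for perm in set(permutations(dst)):
        p = 1.0
        for k, l in zip(src, perm):
            p *= single_step(k, l, jump, alpha)
        total += p
    return total


# Hypothetical values: z = 3 ancestral haplotypes, diploid organism (N = 2).
alpha = [0.5, 0.3, 0.2]
jump = 0.1
states = list(combinations_with_replacement(range(3), 2))

# Each row of the transition matrix over unordered pairs should sum to one.
row_sums = {s: sum(unordered_transition(s, d, jump, alpha) for d in states)
            for s in states}
```

With N = 2, `set(permutations((1, 1)))` has one element and `set(permutations((0, 1)))` has two, matching the one-term and two-term cases described above.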

The relation between hidden ancestral haplotypes and observed genotype data is modelled by emission probabilities. For generality, we assume multiallelic markers with alleles h ∈ {0, ..., H}; the special case H = 1 applies to biallelic SNPs. The emission probability of a genotype at marker m given an unordered list of ancestral haplotypes s_m is given by


Formula 4

(4)
where we denote by θ_hml the emission probability of allele h at marker m from ancestral haplotype l. If g_m is missing for an individual, we treat the emission probability as a constant (since we maximize the likelihood, the value of the constant is arbitrary).
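As an illustration of how such an emission probability can be evaluated for unordered genotypes, the sketch below sums over the distinct orderings of the observed alleles. This is one natural form under the definitions given here and is an assumption, not necessarily the authors' exact Equation (4); the table `theta_m` and its values are hypothetical.

```python
from itertools import combinations_with_replacement, permutations


def emission(genotype, states, theta_m):
    """P(unordered genotype | unordered ancestral haplotypes) at one marker.

    theta_m[l][h] is the probability that ancestral haplotype l emits allele h.
    The sum runs over the distinct orderings of the observed alleles.
    """
    total = 0.0
    for alleles in set(permutations(genotype)):
        p = 1.0
        for h, l in zip(alleles, states):
            p *= theta_m[l][h]
        total += p
    return total


# Hypothetical biallelic marker (alleles 0 and 1) and two ancestral haplotypes.
theta_m = [[0.95, 0.05],   # haplotype 0 mostly carries allele 0
           [0.10, 0.90]]   # haplotype 1 mostly carries allele 1
genotypes = list(combinations_with_replacement([0, 1], 2))  # (0,0), (0,1), (1,1)
probs = {g: emission(g, (0, 1), theta_m) for g in genotypes}
```

Summing over orderings of the alleles keeps the probabilities of all unordered genotypes normalised to one for any pair of ancestral haplotypes, including the case where both chromosomes descend from the same haplotype.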

The haplotype of an observed individual is not an exact copy of the ancestral haplotype from which it has descended, because of evolutionary processes and imperfect inferences. Thus the θ_hml are generally different from zero and one, but they should typically be close to one of these values.

We use Dirichlet priors on all of our parameters. Namely, for scalars u{theta} > 0 and u{alpha} > 0, we let Formula where Formula is the uniform vector with each element Formula and Formula where Formula is the uniform vector with each element Formula . We also let Formula where uJ > 0, and Formula is the distance between markers Formula and Formula . We take Formula , reflecting the background probability of recombination between consecutive bases. The Formula parameters measure the strength of the prior information, so that large Formula implies sampling more tightly around Formula . To initialize our model we use Formula ; subsequently we set Formula for model fitting.

Although our HMM has many parameters, J_m at each marker and α_ml and θ_hml at each ancestral haplotype and each marker, the Baum–Welch algorithm provides a convenient way to find maximum-likelihood estimates of these model parameters. Because of the problem of finding a suboptimal local maximum of the likelihood, we run the Baum–Welch training algorithm 10 times and choose the model with the largest log likelihood. For the hidden sequence inference, we use the Viterbi algorithm to find the most probable ancestral haplotype sequence given the parameterized model.
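For readers unfamiliar with the decoding step, the following is a generic log-space Viterbi sketch in Python. It is a textbook formulation rather than the AncesHC implementation; the argument layout and the toy two-state example are assumptions made for illustration, with each state standing for one unordered list of ancestral haplotypes.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most probable hidden-state path.

    log_init[i]        : log P(state i at the first marker)
    log_trans[m][i][j] : log P(state j at marker m+1 | state i at marker m)
    log_emit[m][i]     : log P(observation at marker m | state i)
    """
    n_states = len(log_init)
    score = [log_init[i] + log_emit[0][i] for i in range(n_states)]
    back = []
    for m in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at this marker.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[m - 1][i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[m - 1][best_i][j]
                             + log_emit[m][j])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy example: two states, four markers, transitions that favour staying put.
log = math.log
init = [log(0.5), log(0.5)]
trans = [[[log(0.9), log(0.1)], [log(0.1), log(0.9)]]] * 3
emit = [[log(0.8), log(0.2)], [log(0.8), log(0.2)],
        [log(0.2), log(0.8)], [log(0.2), log(0.8)]]
path = viterbi(init, trans, emit)
```

In the toy example the decoded path switches state once, between the second and third markers, which is the kind of single switch that the model would interpret as an ancestral recombination.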

One of the features of the mouse dataset is that the founders have been genotyped and so we can accurately infer the founder haplotypes: we will refer to these as ‘known’. The performance of tests using these known ancestral haplotypes provides a good benchmark for AncesHC, which must usually infer the ancestral haplotypes without information about the founder strains.

2.2 Association tests
In this section we will assume a diploid organism (N = 2). The previous section described how we transform a genotype (unordered pair of alleles) for each individual at each marker into an unordered pair of ancestral haplotypes. Since multiple similar haplotypes underlying the observed genotypes can descend from the same ancestral haplotype, this transformation corresponds to data reduction via haplotype clustering. In this section we describe how we use this transformed dataset to conduct association tests. Note that the ancestral haplotypes may remain unchanged over a set of consecutive SNPs for all individuals in the dataset. In this case the test statistic will be invariant over the set of SNPs, and hence we can perform a single test. We will call a set of SNPs a ‘haplotype block’ if there is no change of ancestral haplotype for any individual, whether using 2, 4, 6 or 8 distinct ancestral haplotypes (supplementary Fig. 1).

Since there are z ancestral haplotypes, we compute τ, the association test statistic, z times in each block, treating each ancestral haplotype in turn as the high-risk haplotype. We repeat this process for z = 2, 4, 6 and 8, and retain the maximum test statistic over all 20 possibilities for the high-risk ancestral haplotype. Here, we use the Armitage trend test as our test statistic (τ), but we could equally use any other test statistic based on the 2 × 3 table representing the numbers of cases and controls with 0, 1 and 2 predicted copies of the putative high-risk ancestral haplotype. We explain the full procedure underlying AncesHC step by step in Figure 2. Type I error is assessed by repeating the (computationally fast) steps (D) to (F) under random permutations of case-control labels: since the computationally demanding steps (A) through (C) do not involve phenotype labels, they do not vary over such permutations, making our permutation procedure computationally efficient. Similarly, for the calculations using known ancestral haplotypes, the result of step (C) is regarded as known.
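The per-block computation can be sketched in Python as follows. The trend statistic uses the standard Cochran–Armitage chi-square form with scores 0, 1 and 2; the function names and the example counts are hypothetical and are not taken from the AncesHC software.

```python
def armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 d.f.) for a 2 x 3 table.

    cases[i] and controls[i] are the numbers of cases and controls carrying
    i copies of the putative high-risk ancestral haplotype.
    """
    n = [r + s for r, s in zip(cases, controls)]
    total = sum(n)
    n_cases = sum(cases)
    n_controls = total - n_cases
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * c for w, c in zip(weights, n))
    sum_w2n = sum(w * w * c for w, c in zip(weights, n))
    denom = n_cases * n_controls * (total * sum_w2n - sum_wn ** 2)
    if denom == 0:
        return 0.0  # no variation in status or in copy number
    return total * (total * sum_wr - n_cases * sum_wn) ** 2 / denom


def max_statistic(tables):
    """Maximum trend statistic over candidate high-risk haplotypes.

    tables is an iterable of (cases, controls) pairs, one per ancestral
    haplotype and per value of z considered.
    """
    return max(armitage_trend(c, k) for c, k in tables)


# Hypothetical counts for two candidate ancestral haplotypes in one block.
tables = [((50, 40, 10), (50, 40, 10)),   # no association
          ((20, 50, 30), (50, 40, 10))]   # cases enriched for the haplotype
tau = max_statistic(tables)
n_candidates = sum([2, 4, 6, 8])  # the 20 possibilities mentioned in the text
```

Because the maximum is taken over many correlated candidates, its null distribution is not a simple chi-square, which is why significance is assessed by permutation.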


Fig. 2. This graph explains the process of our method in the case of z = 4 ancestral haplotypes. (A) Genotype data are coded as 0, 1 and 2. (B) Illustration of possible jumps from ancestral haplotype 1 at SNP 1 to the other four ancestral haplotypes at SNP 2. (C) The inferred ancestral haplotypes where we can observe the block-like structure of these sequences. The target region in our simulation study is defined as the causal block and its two flanking blocks. (D) The contingency table for one (putative high risk) ancestral haplotype in one block. (E) The formula for the Armitage trend test statistic. (F) The maximum test statistic over the four ancestral haplotypes.

 

    3 SIMULATION STUDY
We compare the performance of AncesHC with single-SNP tests and the cladistic haplotype clustering method CLADHC (Durrant et al., 2004). The reasons for choosing CLADHC among the many haplotype-based methods available are that it is computationally feasible for a large scale dataset, and that Bardel et al. (2005) find that it performs generally well against other similar methods.

CLADHC uses SNP haplotypes, which we obtained for our mouse dataset using fastPhase (Scheet and Stephens, 2006). One CLADHC input parameter is the number of SNPs in a haplotype window: we considered windows of size 4, 6, 8 and 10, as used by Durrant et al. (2004), and found that the maximum window size of 10 gave the best results, which we report.

We simulate disease status on the mouse data using 1500 mice with low levels of missing data and all 946 SNPs on chromosome 1 after removing monomorphic SNPs. We used two disease models, each assuming a disease prevalence of 12% and a single high-risk minor allele, with genotype relative risks (GRR) in the ratios 1:2:4 and 1:3:3, respectively. Thus, 180 cases are expected from the 1500 mice, and we randomly subsample to obtain 150 cases and 150 controls. For each disease model we randomly chose 24 causal SNPs, eight in each of three disease allele frequency (DAF) bands: 3–6%; 7–10% and 15–17%. We analysed 100 datasets generated under each of these 48 scenarios, and averaged the results across SNPs within each disease model and DAF band. We removed the 24 causal loci from all simulated datasets, as well as 49 SNPs in strong LD with any of them (r^2 ≥ 0.8), when performing ancestral haplotype inferences, phasing for CLADHC by fastPhase, and all association tests. We do not remove any SNPs when inferring transition probabilities with known ancestral haplotypes, in order to have more precise estimation, but we remove these 73 SNPs when performing association tests using the known ancestral haplotypes.
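One way such a simulation could be set up is sketched below in Python. The text does not spell out how penetrances were derived, so scaling the baseline risk to match the target prevalence on the observed genotype counts is an assumption, as are the function names and the example genotype counts.

```python
import random


def penetrances(genotypes, grr, prevalence):
    """Scale the genotype relative risks so that mean risk equals the prevalence.

    genotypes holds the number of high-risk alleles (0, 1 or 2) per individual.
    """
    mean_rr = sum(grr[g] for g in genotypes) / len(genotypes)
    baseline = prevalence / mean_rr
    return tuple(baseline * r for r in grr)


def simulate_status(genotypes, pen, rng):
    """Draw case (1) or control (0) status for each individual."""
    return [1 if rng.random() < pen[g] else 0 for g in genotypes]


def subsample(status, n_per_group, rng):
    """Pick equal numbers of cases and controls (fewer if not enough are available)."""
    cases = [i for i, s in enumerate(status) if s == 1]
    controls = [i for i, s in enumerate(status) if s == 0]
    n = min(n_per_group, len(cases), len(controls))
    return rng.sample(cases, n), rng.sample(controls, n)


# Hypothetical causal SNP in 1500 individuals, with a 1:2:4 risk model.
genotypes = [0] * 1200 + [1] * 270 + [2] * 30
pen = penetrances(genotypes, (1, 2, 4), 0.12)
expected_prevalence = sum(pen[g] for g in genotypes) / len(genotypes)

rng = random.Random(1)
status = simulate_status(genotypes, pen, rng)
case_idx, control_idx = subsample(status, 150, rng)
```

With a 12% prevalence the expected number of cases in 1500 individuals is 180, so a target of 150 cases will usually, though not always, be attainable in a single simulated dataset.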

For each of our 48 scenarios, we generated 10 000 permutation datasets by randomly choosing 150 cases and 150 controls from all 1500 mice. From each permutation dataset we calculate the statistics for each block from AncesHC, and each SNP from single-SNP Armitage trend tests and CLADHC. This generates an empirical null distribution of each statistic, which we use to compute approximate p-values for the statistics calculated in each of the 100 datasets for that scenario.
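The permutation scheme described above can be sketched in Python as follows. The add-one correction in the P-value and the simple mean-difference statistic are assumptions chosen for illustration; the study itself uses the block-wise and SNP-wise statistics described in the text.

```python
import random


def empirical_p_value(observed, null_stats):
    """Proportion of null statistics at least as large as the observed one.

    Uses the common add-one correction so that the estimate is never zero.
    """
    exceed = sum(1 for s in null_stats if s >= observed)
    return (1 + exceed) / (1 + len(null_stats))


def permutation_null(values, n_cases, n_controls, stat_fn, n_perm, rng):
    """Build a null distribution by drawing random case and control sets."""
    null_stats = []
    indices = range(len(values))
    for _ in range(n_perm):
        chosen = rng.sample(indices, n_cases + n_controls)
        cases = [values[i] for i in chosen[:n_cases]]
        controls = [values[i] for i in chosen[n_cases:]]
        null_stats.append(stat_fn(cases, controls))
    return null_stats


def mean_difference(cases, controls):
    """Toy statistic: absolute difference in mean haplotype copy number."""
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


# Hypothetical copy numbers (0, 1 or 2) of one ancestral haplotype in 1500 mice.
rng = random.Random(0)
copies = [rng.choice((0, 1, 2)) for _ in range(1500)]
null = permutation_null(copies, 150, 150, mean_difference, 200, rng)
p_value = empirical_p_value(0.5, null)
```

The ancestral haplotype assignments are computed once and reused in every permutation; only the case and control labels change, which is what keeps 10 000 permutations affordable.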

To assess the power of each method, we regard a significant value of the test statistic to be a true positive if it arises in the block that includes the causal SNP, or in either of its two flanking blocks. For each method, we calculate the minimum p-value (minP) over these three blocks for each of the 100 datasets and 48 scenarios.

We also attempted to compare AncesHC with Margarita (Minichiello and Durbin, 2006), but found it to be too computationally demanding for our simulation study. We found that Margarita required 10–11 h for a single dataset consisting of 873 SNPs and 300 mice (using 10 000 permutations to assess significance and 100 ARGs). AncesHC is much more efficient for the simulation study because the computationally intensive steps, (A) to (C) in Figure 2, do not depend on the case-control labels and hence need be employed only once for the entire simulation study. For the ancestral haplotype inferring step, the computing time of AncesHC with 873 SNPs and 1500 mice is approximately 3 h, 2 h, 30 min and 10 min for 8, 6, 4 and 2 ancestral haplotypes, respectively, for each run (we perform 10 runs in this study).


    4 RESULTS
We first compared the LD structure for our mouse chromosome 1 data to CEU ENCODE data on chromosome 2 (a 500 kb segment). For both datasets, we exclude the SNPs which have minor allele frequency <10% when calculating r^2. We observe that randomly selected pairs of SNPs in the mouse dataset at a distance of ~3 Mb have an average r^2 of 0.2, corresponding to a distance of 50 kb in the ENCODE dataset. Thus, the average intermarker spacing of 190 kb in our mouse dataset roughly translates to a 3 kb spacing in a human dataset.

Figure 3 presents the power of each test in our simulation study as a function of significance level (type I error) for the six combinations of disease model and DAF band. The power of AncesHC with inferred ancestral haplotypes is comparable to that obtained with known ancestral haplotypes, except for rare variants (DAF 3–6%). This suggests that our HMM clustering algorithm in general performs well. AncesHC outperforms the single-SNP analysis for both disease models in all three DAF bands. It also has greater power than CLADHC for almost all scenarios. Supplementary Table 1 shows the empirical power at significance level of 10^-3 in Figure 3. Underlying each power estimate are 800 simulated datasets (100 replicates for each of eight causative SNPs), so the SD of the estimate is about 1.7% when power ≈ 50%.
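The quoted standard deviation follows from the binomial variance of a proportion. As a worked check, treating the 800 datasets as independent draws (an approximation, since replicates sharing a causal SNP are likely correlated):

```latex
\[
\mathrm{SD}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
  = \sqrt{\frac{0.5 \times 0.5}{800}}
  \approx 0.0177 ,
\]
```

that is, a little under 2 percentage points, in line with the approximate figure quoted above.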


Fig. 3. Empirical power estimate via a permutation strategy of AncesHC using known ancestral haplotypes, AncesHC using inferred ancestral haplotypes, single-SNP tests and CLADHC (Durrant et al., 2004) for two disease models and three DAF bands, after the removal of all 24 causal SNPs and 49 SNPs in strong LD with causal loci.

 
In Figure 4, we report the corresponding results when the 49 SNPs in high LD with one of the 24 causal SNPs are also included in the marker set. For both disease models, the power of AncesHC is similar to that obtained without the high-LD SNPs (Fig. 3), whereas the performance of the single-SNP analysis is greatly improved when these SNPs are included, especially for rare and intermediate variants (DAF 3–9%). AncesHC remains more powerful overall, but this analysis illustrates that its principal advantage over single-SNP tests arises when the causal SNP is not well-tagged by the marker SNPs. Supplementary Table 2 shows the empirical power at significance level of 10^-3 in Figure 4. Including the 49 high-LD SNPs improves the power of CLADHC, only slightly when the window size is 4 (results not shown), but greatly when the window size is 10 and disease variants are common (Fig. 4).


Fig. 4. As for Figure 3, but including 49 SNPs in strong LD with causal loci (still excluding 24 causal SNPs).

 
In supplementary Figure 2, we show the power of AncesHC when restricted to a fixed number (z = 2, 4, 6 or 8) of distinct ancestral haplotypes, as well as when taking the maximum over these cases. For z fixed, the case z = 8 is usually close to optimal, but taking the maximum over z in most scenarios gives a further noticeable improvement. Since computational time is proportional to the square of the number of haplotypes, considering smaller values of z is computationally cheap relative to the cost of implementing z = 8.


    5 DISCUSSION
Single-SNP analyses are widely used in large-scale genome-wide association studies (WTCCC, 2007; Sladek et al., 2007). However, this approach is unable to capture the effects of LD between neighbouring marker SNPs. Multi-SNP analyses, including haplotype-based approaches, can improve power with low type I error by exploiting local SNP correlations to better capture untyped variants. We have presented AncesHC, based on haplotype clustering using an HMM to predict shared ancestral haplotypes. This approach overcomes the problems of unphased genotypes and occasional missing genotypes, haplotype diversity and block boundary definition that can limit the performance of other haplotype analyses. AncesHC also does not assume a diploid organism, and works with multiallelic states.

Our clustering model does not require pre-defined blocks or a sliding window scheme to define haplotype boundaries. Kimmel and Shamir (2005) and Scheet and Stephens (2006) exploited similar models to infer missing genotypes and haplotypic phase; Huang et al. (2007) applied a similar HMM to find haplotype blocks and to investigate association between haplotypes and quantitative traits. Here, we use the ancestral haplotypes inferred by our HMM to investigate associations with disease status.

We have applied AncesHC to a mouse dataset and developed a simulation scheme to generate disease status for this dataset under various scenarios. AncesHC greatly outperforms single-SNP analysis for both disease models in all three DAF bands, and has greater power than CLADHC in almost all scenarios. We employed the Armitage trend test applied to the ancestral haplotypes instead of genotypes, but it is possible to apply other statistics. It is also possible to include the ancestral haplotypes in linear regression models, and to calculate Bayes factors from these models, which is an avenue we intend to explore in future.

In our clustering model, we maximize over four values (2, 4, 6 and 8) for the number of distinct ancestral haplotypes. We found that this gave improved power over using a fixed number of ancestral haplotypes. We assumed a single causal SNP for our simulation study, but AncesHC could be extended to allow for multiple causal SNPs with different frequencies of disease alleles.

The mice underlying the data used for our simulation study are descended from eight inbred strains crossed over 50 generations. An advantage of this dataset is that it provided the possibility of comparing results from the model with inferred ancestral haplotypes to one in which ancestral haplotypes are known, because the founders were genotyped. Further, there is much recent interest in this type of data as a means to identify quantitative trait loci (Shifman et al., 2006; Valdar et al., 2006). Many datasets derived from outcrossed individuals descended from inbred ancestors are becoming available, for example pure-bred dogs which have interbred to produce cross-breds and mongrels (Lindblad-Toh et al., 2005), and a recombinant-inbred lines dataset from an eight-way cross by sibling mating (Broman, 2005; Peters et al., 2007). AncesHC is particularly well suited to such datasets, as we have illustrated using the mouse data. However, we expect our model should also provide increased power for the analysis of data from outcrossed populations, such as humans, and this is an avenue we are currently exploring. Application of our method to a genome-wide scan of 500 000 SNPs in 2000 individuals would currently require ~2000 h per run for eight ancestral haplotypes. Although computationally demanding, use of AncesHC for such analysis is feasible with a multiprocessor computing cluster.
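The order of magnitude of this estimate can be checked against the run time reported in the simulation study (about 3 h per run for 873 SNPs and 1500 mice with eight ancestral haplotypes). Assuming, purely for illustration, that run time scales linearly in both the number of SNPs and the number of individuals:

```latex
\[
3\,\mathrm{h} \times \frac{500\,000}{873} \times \frac{2000}{1500}
  \approx 3\,\mathrm{h} \times 572.7 \times 1.33
  \approx 2.3 \times 10^{3}\,\mathrm{h} ,
\]
```

which is of the same order as the ~2000 h quoted above. The linear-scaling assumption is not stated in the text and is used here only as a rough consistency check.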


    ACKNOWLEDGEMENTS
 
We would like to thank Michael Buckley for advice, particularly with regard to transition probabilities for paired HMMs, and Richard Mott for providing us with the mouse genotype dataset.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on January 9, 2008; revised on February 5, 2008; accepted on February 11, 2008

    REFERENCES
 

    Bardel C, et al. On the use of haplotype phylogeny to detect disease susceptibility loci. BMC Genetics (2005) 6:24.

    Broman KW. The genomes of recombinant inbred lines. Genetics (2005) 169:1133–1146.

    Clayton D, et al. Use of unphased multilocus genotype data in indirect association studies. Genet. Epidemiol (2004) 27:415–428.

    Durrant C, et al. Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am. J. Hum. Genet (2004) 75:35–43.

    Huang JC, et al. Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations. Bioinformatics (2007) 23:212–221.

    Kimmel G, Shamir R. A block-free hidden Markov model for genotypes and its application to disease association. J. Comput. Biol (2005) 12:1243–1260.

    Li J, Jiang T. Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics (2005) 21:4384–4393.

    Lindblad-Toh K, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature (2005) 438:803–819.

    Liu J, et al. Incorporating single-locus tests into haplotype cladistic analysis in case-control studies. PLoS Genetics (2007) 3:0421–0430.

    Minichiello M, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am. J. Hum. Genet (2006) 79:910–922.

    Molitor J, et al. Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques. Am. J. Hum. Genet (2003) 73:1368–1384.

    Morris AP, et al. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am. J. Hum. Genet (2002) 70:686–707.

    Peters LL, et al. The mouse as a model for human biology: a resource guide for complex trait analysis. Nat. Rev. Genet (2007) 8:58–69.

    Rannala B, Reeve J. High resolution multipoint linkage disequilibrium mapping in the context of a human genome sequence. Am. J. Hum. Genet (2001) 69:159–178.

    Schaid DJ. Evaluating associations of haplotypes with traits. Genet. Epidemiol (2004) 27:348–364.

    Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet (2006) 78:629–644.

    Seltman H, et al. Evolutionary-based association analysis using haplotype data. Genet. Epidemiol (2003) 25:48–58.

    Shifman S, et al. A high-resolution single nucleotide polymorphism genetic map of the mouse genome. PLoS Biology (2006) 4:2227–2237.

    Sladek R, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature (2007) 445:881–885.

    Tachmazidou I, et al. Genetic association mapping via evolution-based clustering of haplotypes. PLoS Genetics (2007) 3:1163–1177.

    Tzeng J-Y, et al. Regression-based association analysis with clustered haplotypes through use of genotypes. Am. J. Hum. Genet (2006) 78:231–242.

    Valdar W, et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat. Genet (2006) 38:879–887.

    Waldron ERB, et al. Fine mapping of disease genes via haplotype clustering. Genet. Epidemiol (2006) 30:170–179.

    WTCCC. Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature (2007) 447:661–678.

    Zollner S, Pritchard JK. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics (2005) 169:1071–1092.





