Bioinformatics Advance Access originally published online on October 25, 2005
Bioinformatics 2005 21(24):4384-4393; doi:10.1093/bioinformatics/bti732
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Haplotype-based linkage disequilibrium mapping via direct data mining
1Electrical Engineering and Computer Science Department, Case Western Reserve University Cleveland, OH 44106, USA
2Department of Computer Science and Engineering, University of California Riverside, CA 92521, USA
3Center for Advanced Study, Tsinghua University Beijing, China
4Shanghai Center for Bioinformatics Technology Shanghai, China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: With the availability of large-scale, high-density single-nucleotide polymorphism markers and information on haplotype structures and frequencies, a great challenge is how to take advantage of haplotype information in the association mapping of complex diseases in casecontrol studies.
Results: We present a novel approach for association mapping based on directly mining haplotypes (i.e. phased genotype pairs) produced from casecontrol data or caseparent data via a density-based clustering algorithm, which can be applied to whole-genome screens as well as candidate-gene studies in small genomic regions. The method directly explores the sharing of haplotype segments in affected individuals that are rarely present in normal individuals. The measure of sharing between two haplotypes is defined by a new similarity metric that combines the length of the shared segments and the number of common alleles around any marker position of the haplotypes, which is robust against recent mutations/genotype errors and recombination events. The effectiveness of the approach is demonstrated by using both simulated datasets and real datasets. The results show that the algorithm is accurate for different population models and for different disease models, even for genes with small effects, and it outperforms some recently developed methods.
Availability: The software, HapMiner, and Supplementary materials are available on the authors' website at http://vorlon.case.edu/~jxl175/HapMiner.html
Contact: jingli{at}eecs.case.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Disease gene mapping refers to the localization of heritable mutations that contribute to the risk of diseases and has been the primary focus in genetic epidemiology for decades. Many statistical methods have been developed for gene mapping with the aid of molecular markers and have been successfully applied to the identification of a substantial number of Mendelian diseases. However, most common diseases are complex diseases and the power of many existing methods (e.g. linkage analysis) is low since each gene may only have a small effect. Risch and Merikangas (1996) proposed that for genes with moderate or small effect, the association studies may provide higher power than linkage analysis. With the advance of technology, and dramatically decreasing genotyping cost, large-scale whole-genome association studies using single-nucleotide polymorphisms (SNPs) are now feasible. However, simple association analysis using
2 on each SNP for casecontrol data might not be able to give a reliable result. Recent experimental studies (Daly et al., 2001; Gabriel et al., 2002) over large genomic regions have shown that the human genome contains long segments with high linkage disequilibrium and limited haplotype diversity, suggesting the use of haplotype information for association studies. As a matter of fact, some new statistical methods, e.g. McPeek and Strahs (1999); Tzeng et al. (2003) among others, have already been proposed to take advantage of the haplotype information directly. But these model-based methods were mainly for candidate gene studies. Intensive computational demands prohibit them from whole-genome association analyses. Furthermore, for complex disease, the disease mutations only increase the risk of being affected, but not every individual carrying the disease mutations will be affected (low or moderate penetrances). Moreover, not every affected individual carries the disease susceptibility (DS) genes/alleles (known as phenocopies). For a casecontrol study, neither the degree of penetrance nor the rate of phenocopy is known in advance. While most methods assume incomplete penetrance, very few existing methods could deal with data of high phenocopies. We address the problem of gene association mapping of complex diseases and develop a novel algorithmic approach using haplotypes (i.e. phased genotype pair of each individual). The key assumption underlying haplotype mapping is the non-random association of alleles in disease haplotypes around the disease genes. The haplotypes from cases are expected to be more similar than haplotypes from controls in regions near the disease genes. Several recent papers have proposed to use clustering techniques for haplotype mapping. Liu et al. (2001) assigned haplotypes into clusters representing allele heterogeneity (i.e. multiple functional alleles that might be from different ancestral haplotypes) and employed the Markov chain Monte Carlo method (McMC) for parameter estimations within a Bayesian framework, but their method could not incorporate locus heterogeneity. Molitor et al. (2003) modeled haplotype risks using clusters and employed a probit model, but their method does not take phenocopies into consideration. Both methods were developed mainly for haplotype fine mapping and could not scale up for whole-genome screens very well. Durrant et al. (2004) adopted a logistic-regression model applicable to whole-genome screens using sliding windows, but they had to assume HardyWeinberg equilibrium and a multiplicative disease model for convincing the likelihood calculation. The effects of violations of these assumptions are unpredictable in general. Inspired by data mining techniques, Toivonen et al. (2000) proposed a non-parametric method for haplotype mapping called HPM (haplotype pattern mining). The authors examined the haplotype patterns in cases and in controls and utilized the pattern frequencies as the prediction of disease gene locations. As a model-free method, HPM has the appealing properties that it does not require any assumption on the inheritance patterns and has good localization power, even when the number of phenocopies is large. However, methods based on HPM also have some limitations. First, by allowing don't care symbols in a haplotype pattern, many haplotypes have been counted multiple times. The effect of this duplicate counting is unknown. Second, the frequency of identified haplotype patterns is closely related to the sample size, and the statistical significance of the predicted gene location using such frequency information cannot be assessed. Finally, in the experimental results, Toivonen et al. (2000) showed that the prediction accuracy may deteriorate with dense (e.g. SNP) markers, which is undesirable and greatly limits the utility of the method. We reason that disease susceptibility gene-embedded haplotypes, especially mutants of recent origin, tend to be close to each other due to linkage disequilibrium, while other haplotypes can be regarded as random noise sampled from the haplotype space. As illustrated in Figure 1, around the DS gene region, it is expected that haplotypes from affected individuals should share segments from the common ancestral haplotype where the mutation occurred in the past. Based on this logic, a new algorithmic approach for haplotype mapping is proposed in this paper that utilizes a clustering algorithm. The effectiveness of the approach depends on the similarity measure of haplotype fragments used in the clustering algorithm. We propose a new haplotype similarity measure that is a generalization of several haplotype similarity measures currently used in the literature. It captures the sharing of haplotype segments due to historical recombination events as demonstrated by Figure 1 and also incorporates the recent mutations/genotype errors. Extensive experimental results on real data as well as on simulated data show the algorithm are robust with respect to disease models and population history and outperforms some recently developed methods. The rest of this paper is organized as follows. The details of the algorithm are presented in the next section, followed by the test results. We conclude the paper with some discussion of possible future directions.
|
| 2 METHODS |
|---|
|
|
|---|
The proposed approach works as follows. The input to the algorithm consists of phased genotypes (haplotype pairs) for each individual. Such information can be inferred computationally from genotypes based on information of family members (Kruglyak and Lander, 1998; Li and Jiang, 2004) for caseparent data, or some population models (Stephens et al., 2001; Niu et al., 2002) for casecontrol data. For casecontrol data, we use the disease status of each individual to label both of its haplotypes. For caseparent data, transmitted haplotypes can be labelled as case haplotypes and untransmitted haplotypes as controls. The algorithm scans each marker one by one. For each marker position, a haplotype segment with certain length centered at the position will be considered. The segment length is an input parameter defined by the user based on marker interval distances, and should not be determined before hand. Clusters are identified based on some similarity/distance measure via a density-based clustering algorithm. The Pearson
2-statistic or Z-score, which are equivalent as shown in Fienberg (1977), based on a contingency table derived from the numbers of case haplotypes and control haplotypes in a cluster can be used as an indicator of the degree of association between the cluster and disease. Both measures can also be used as association/independence test statistics, properly adjusted (e.g. using Bonferroni correction) for multiple tests. A statistical significance threshold can be chosen independent of the sample size and all findings that exceed the threshold will be reported. The algorithm is summarized in Figure 2 with the details to follow shortly.
|
2.1 A general haplotype (dis)similarity measure
The effectiveness of the algorithm depends on the similarity measure of haplotype fragments used in the clustering algorithm. We propose a new haplotype similarity measure that generalizes several haplotype similarity measures in the literature. The similarity of two haplotype segments (consisting of markers that are not necessarily SNPs) is defined with respect to a particular marker locus. Suppose that we focus on a marker at locus 0, with loci 1, 2, ..., r on one side and 1, 2, ..., l on the other side. Assume that the genetic/physical distance from any locus to locus 0 is known and denoted as xk, where l
k
r. A haplotype h spanning this region is just an (l + 1 + r)-dimensional vector and the k-th dimension of h, denoted as h(k), is the allele at locus k. For a pair of haplotypes hi, hj, we define the similarity score of hi, hj with respect to locus 0 as:
![]() | (1) |
The first summation in Equation (1) is a weighted measure of the number of alleles in common between haplotypes hi and hj in the region, which can be thought of as Hamming similarity. The remaining summations form a weighted measure of the longest continuous interval of matching alleles around locus 0, which has some resemblance to the notion of a longest common substring (Gusfield, 1997). Our definition is quite flexible and generalizes several similarity measures used in the literature (Tzeng et al., 2003; Molitor et al., 2003). For instance, by setting w1 = 1 and w2 = 0, the measure becomes the counting measure described in Tzeng et al. (2003). The length measure in the same article can be achieved by setting w1 = 0 and w2 = 1. This definition of haplotype similarity is more powerful than the above two specialized measures and can be used for different types of markers by choosing appropriate weighting functions. It has the strengths of both specialized measures mentioned above. That is, it is robust against recent marker mutations and genotyping/haplotyping errors, and it also apprehends partial sharing from a common ancestral haplotype due to historical recombination events.
The requirement for both weighting functions w1 and w2 is that they must be non-increasing functions. It can be exponentially, quadratically, or linearly decreasing, or constant. It can also be a discrete function with its values defined only at marker positions. The user has the freedom of choosing the weighting function depending on the marker density of the input data. Noticing that si,i = sj,j, a distance metric between haplotypes hi and hj at marker locus 0 can be defined as follows:
![]() | (2) |
|
The distance definition can be further extended. For instance, missing alleles can be handled directly in the calculation of the similarity measure by taking all the missing alleles as a new distinct allele. To consider genegene interactions, a distance for two loci can be defined as the average of pairwise distances at each locus. This way the proposed algorithm could automatically consider multiple DS loci simultaneously.
2.2 A density-based clustering algorithm
Clustering is a powerful tool for mining massive data. Traditional clustering algorithms fall into two categories: partitioning clustering or hierarchical clustering methods (Han and Kamber, 2000). For large datasets with high level noise, there is an increasing interest in a third type of clustering algorithms called density-based algorithms. The density-based algorithms are based on the notion of local density. High density areas form clusters and low density areas may be due to random noise. In the haplotype association mapping setup, we are interested in identifying haplotype clusters that are strongly associated with the disease under study. The goal is not to partition all the haplotypes into certain clusters. Neither do we try to build a cladogram because of the difficulty of reconstructing the evolutionary relationship for all haplotypes. Instead, we believe that haplotypes from affected individuals are expected to be more similar at the disease gene location than those from controls which are assumed to be random samples. We do not expect control haplotypes to form any clusters except by chance. A difficulty lies in the fact that, due to the existence of allele and/or locus heterogeneity and phenocopies, some haplotypes from affected individuals do not necessarily form a cluster. This is also a main reason why a gene mapping method using casecontrol data would likely fail in reality if it assumes, explicitly or implicitly, that all or at least most affected individuals do have the same disease mutations. We take the problem of finding strongly disease associated haplotype clusters as the problem of finding clusters from data with noisy background. It has been shown that density-based clustering algorithms are capable of and effective in identifying meaning clusters from large datasets with high level of noise in many domains (Ester et al., 1996; Hinneburg and Keim, 1998; Ankerst et al., 1999). So we use the concept of density-based clusters and adopt an algorithm called DBSCAN (Ester et al., 1996) with minor modifications. The idea of assigning haplotypes to clusters for gene mapping is promising and has been explored by many researchers (Liu et al., 2001; Molitor et al., 2003; Durrant et al., 2004). Both Liu et al. (2001) and Molitor et al. (2003) partitioned haplotypes into clusters. Each cluster corresponds to a founder haplotype or is associated with a particular risk. Statistical models were then built to infer the parameters. Durrant et al. (2004) built a cladogram for all haplotypes using a hierarchical clustering algorithm and employed a logistic-regression model to find the most significant partition. Our algorithm is much flexible and much efficient, and it is capable of dealing with large datasets with high level of phenocopies. Comprehensive comparisons with the method in Durrant et al. (2004) on all diallelic datasets, as well as some other methods (Toivonen et al., 2000; Liu et al., 2001; Molitor et al., 2003), will be presented in Section 3.
In order to keep the paper self-contained, we briefly introduce the DBSCAN algorithm in the context of haplotype mapping. There are two input parameters for DBSCAN. One is the radius of the interested neighborhood
and the other is a density threshold MinPts. A haplotype is called a core haplotype if there are more than MinPts haplotypes in its
neighborhood. The haplotypes in the
neighborhood are directly reachable from the core haplotype and a haplotype is reachable from a core haplotype if there is a chain of core haplotypes between these two haplotypes where each is directly reachable from the preceding one. Two haplotypes are density-connected if there is a core haplotype such that both haplotypes are reachable from it. A density-based cluster of haplotypes is a set of density-connected haplotypes with maximal density-reachability. All the above definitions are with respect to the two parameters
and MinPts. DBSCAN examines every haplotype and starts to construct a cluster once a core haplotype is found. It then iteratively collects directly reachable haplotypes from a core haplotype, merging clusters when necessary. The process terminates when all haplotypes have been examined. The clusters are then output and the haplotypes that do not belong to any cluster are regarded as noise. More details can be found in Ester et al. (1996).
2.3 Score of the degree of association
For each marker position, our algorithm will take the haplotype segments around the marker and calculate the pairwise haplotype distances according to the distance measure. The DBSCAN algorithm is then applied on the distance matrix to identify clusters. A score for each marker will be calculated as follows. We measure the degree of association between a haplotype cluster and the disease of interest using Z-score (or
2-value). Suppose that we are given m case haplotypes and n control haplotypes. Let m' and n' denote the numbers of case and control haplotypes in a cluster, respectively. A 2 x 2 contingency table can be constructed and the Z-score of the cluster is defined as follows:
![]() | (3) |
2 based on the table to indicate the degree of association. At each marker position, there may be multiple clusters, possibly due to allele heterogeneity. In general, the cluster with the highest score is taken as the prediction for each marker. The algorithm can be naturally modeled by taking multiple clusters at each position if their scores are significant. The score is regarded as the point estimation of each marker locus and a consensus haplotype pattern (by taking the majority allele at each position) or a haplotype profile (the distribution of alleles at each position) based on the cluster can be used as disease-associated pattern centered at the locus.
2.4 Permutation tests
The significant level of the prediction can be measured by the P-value of the
2 or Z-score, properly adjusted using Bonferroni correction for multiple tests. As a model-free method, it is more appropriate to obtain an empirical P-value using a permutation test. A permutation test can be easily performed by shuffling the phenotypes among all the haplotypes. By randomly shuffling the disease labels, it is expected that associations between haplotypes and the trait are broken. The association mapping analysis will be performed on each shuffled dataset and the values of the interested statistic will be recorded. Then, the process will be iterated for a sufficiently large number of times to mimic the distribution of the original data. The proportion of the datasets whose statistic values are equal to or better than the statistic of the original dataset is regarded as the experimental P-value. The proposed method using permutation test is so computationally efficient that it can be done for whole-genome screens.
2.5 False positives due to population structure
It is well known that for casecontrol data, association can be due to some factors other than linkage such as population substructures. Special care must be taken when recruiting samples for such casecontrol studies. However, our algorithm can also be applied to data with family members, such as the caseparent design, for which the population structure is not a problem. When population structure is problematic, one can also combine the Genomic Control method by Devlin and Roeder (1999) with the proposed algorithm and use the variance inflation factor
to adjust the statistic score. See Devlin and Roeder (1999) for details.
2.6 HapMiner, the computer program
The overall time complexity of the algorithm is O(MN2), where M is the total number of marker loci and N is the sample size which is at most thousands in most real datasets. We have implemented the algorithm in a computer program called HapMiner. The executable code on Windows and Linux can be downloaded from the authors' website. Extensive tests have been performed using HapMiner on simulated datasets as well as real datasets and will be discussed below.
| 3 RESULTS |
|---|
|
|
|---|
3.1 The test datasets
Two different sets of simulated data were tested. The first dataset (dataset I) was generated in a previous paper by Toivonen et al. (2000) in their studies of the HPM method, and the second dataset (dataset II) was generated by us using an approach similar to that in a recent paper by Zollner and Pritchard (2005). The two simulated datasets differ in many ways. Dataset I mimicked an isolated population with an exponential growth rate while dataset II had a constant population size. The scope of interested regions was different. One was at the whole-genome level and the other was a candidate gene study. The disease models and the sampling strategies were different. Dataset I simulated a dominant disease but with high phenocopy rates and dataset II simulated a disease with incomplete penetrance. Due to the page limitation, results on dataset II are provided as supplementary materials.
More specifically, dataset I corresponds to a recently founded, relatively isolated founder subpopulation that grew from the initial size of 300 to
100 000 individuals in 500 years. The region considered was at the chromosome level with genetic length of 100 cM. Both microsatellite markers and SNP markers were simulated. Markers were evenly spaced along the chromosome with interval lengths of 1 and 1/3 cM for microsatellite markers and SNP markers, respectively. A dominant disease was modeled, with a large number of phenocopies. The proportion of mutation-carrying chromosomes from all the case chromosomes, denoted by A, is 2.5, 5.0, 7.5 or 10.0% corresponding to overall relative risks (of first-degree relatives)
= 1.2, 1.7, 2.7, 4.1, respectively. Mutations were not modeled directly but compensated by introducing missing alleles randomly. A detailed description of the simulation procedure can be found in the paper by Toivonen et al. (2000).
In addition, we tested HapMiner on three real datasets concerning different types of diseases. The first real dataset was originally reported by Kerem et al. (1989) in the study of the fine-mapping of Cystic Fibrosis (CF) gene and has been used by many investigators in testing their methods. The second real dataset concerning the localization of Friedreich Ataxia (FA) gene was reported in Liu et al. (2001) and reanalyzed by Molitor et al. (2003). The third dataset consisting of affected sib-pair families with type 1 diabetes (T1D) is from Herr et al. (2000). Detailed information and analytical results on the three real datasets will be discussed shortly.
3.2 Comparisons with other algorithms
We have made comprehensive comparisons with the program CLADHC, a most recently developed method by Durrant et al. (2004) that also uses a clustering algorithm, and the
2-test using single marker. Due to the limitation of CLADHC, only datasets with diallelic markers were used for the comparison. Independent datasets (dataset I) from Toivonen et al. (2000) were taken to minimize the bias in evaluating different methods. Unfortunately, the program HPM from the same paper is not available to us. We only compared our results with those by HPM in their original paper. Prediction results on real datasets were also compared with those by different methods (Liu et al., 2001; Molitor et al., 2003; Kerem et al., 1989).
3.3 Results on dataset I: whole genome screen
3.3.1 HapMiner parameters
There are five parameters that need be specified by the user, namely, the haplotype segment length and two weighting functions used in the calculation of pairwise haplotype segment similarities, and the radius
and density threshold MinPts required by the DBSCAN algorithm. Figure 3a and b show a typical Z-score distribution map for a dataset with two different haplotype segment lengths, i.e. 5 and 7 markers, respectively. The x-coordinate represents the marker positions and the y-coordinate represents the corresponding score for each marker. The vertical line indicates the location of the functional allele, while in this case it is halfway between markers 5 and 6. The predicted gene location is at marker 5 for both length parameters, with Z-scores of 4.63 and 3.86, respectively. As expected, with the increase of the segment length, the score profile tends to be smoother. But the scores near the signal region were rather strong no matter which value we took and only noise was averaged out. The numbers of the haplotypes in the identified clusters are 24 and 18, respectively, for the two parameter values, which are close to the number of true case haplotypes (i.e. haplotype with mutated alleles) since there are 200 haplotypes that were labeled as case and the fraction of mutation-carrying chromosomes, denoted as A, is 10%. Such information on phenocopies was not known to HapMiner in advance. Most of the haplotypes in the clusters are core haplotypes, which means that the haplotypes in the clusters are very similar to each other. The consensus patterns are the same in the overlapped region for the two different values of haplotype segment length, which also implies the robustness of HapMiner with respect to this parameter. For the remaining tests on dataset I, we took the lengths of haplotype segments the same as those in Toivonen et al. (2000), which were 7 and 21 markers for microsatellite markers and SNP markers, respectively. We took simple linear functions with flat tails for both weights since the markers were evenly spaced. (The functions are depicted in Supplementary Figure 1.) There are two ways to set the radius
in HapMiner. The first method is to specify its value directly. Since the pairwise distances are within the range [0,1], one can specify the radius to be any value from this range. The other way to set
is to choose a percentile according to the distribution of all pairwise distances. We took the first method in this study. To set the value of MinPts, we first calculated the number of neighbors for every haplotype based on
and chose MinPts based on the user-specified percentile parameter. Experiments on three different
values (i.e. 0.1, 0.2 and 0.3) and three different MinPts values (i.e. 15, 25 and 35%) indicated that HapMiner performed consistently well around the DS locus across different parameters (data not shown). We thus fixed
to be 0.2 and the percentile for MinPts to be 25% for the remaining tests.
|
3.3.2 Prediction accuracy
Figure 3c shows the predicted locations (y-axis) and true locations (x-axis) on 100 datasets with 200 case haplotypes and 200 control haplotypes for each dataset. All the parameters were using their default values as specified in the previous subsection. The accuracy was high for most datasets even though
90% of case haplotypes did not contain the mutation allele. The success was mainly due to the concept of density-based clustering algorithms, which allow noisy inputs. Traditional partitioning algorithms like k-means could not correctly identify the cluster associated with the disease given such a noisy dataset (data not shown). And it is almost impossible for any method to correctly reconstruct the genealogy of the samples (which is the goal of hierarchical algorithms), given the complexity of the evolutionary history.
We further investigated the power of HapMiner under different phenocopy rates (1 A), different sample sizes, and increasing marker density, and with missing values. We compared our results with those reported by Toivonen et al. (2000) using their HPM program. The results are illustrated in Figure 4ad. In the figure, the x-coordinate represents the distance from the true gene position and the y-coordinate represents the average fraction (power) of the predictions that were within the distance on 100 datasets. As expected, the prediction accuracy increased with the increasing of A and the increasing of sample sizes. For a sample size of 200 cases and 200 controls (Fig. 4a), the prediction errors were small for A = 10%, 7.5% (i.e. relative risk
= 4.1, 2.7. But the errors increased rapidly when A = 5% (
= 1.7) and neither methods (HapMiner and HPM) could successfully predict gene locations when A dropped to 2.5% (
= 1.2). With a sample size of 400 cases and 400 controls (Fig. 4b), the accuracy was greatly improved for all the values of A. For instance, with A = 10%, all the prediction errors were <4.5 cM. Even with A = 5%, >80% of the prediction errors were within 4 cM. The results were better than those by HPM. Only
85% of the HPM results achieve the same accuracy for A = 10%, as shown in Figure 2b of Toivonen et al. (2000), and the performance of HPM did not necessarily improve when the value of A increased, as illustrated in Figures 2 and 4 of Toivonen et al. (2000). HapMiner demonstrated great advantage in dealing with phenocopies.
|
With the advance of genotyping technology, more SNP markers will be available for whole-genome association studies of common diseases using casecontrol data in the near future. For any gene mapping method, it is desirable to see the performance of the method improve with denser markers. Indeed, HapMiner performed better on SNP markers than it on microsatellite markers. For instance, with A = 10%, 98% of the predicted errors were <5 cM and 81% of the predicted errors were >2 cM for SNP markers (Fig. 4c) and the results for microsatellite markers (Fig. 4a) were 94 and 73%, respectively. Another factor that affects the accuracy is the number of missing alleles of the input data. In reality, most datasets contain a substantial number of missing alleles. There are also ambiguities while inferring haplotypes from genotypes using computational approaches. To test how HapMiner performs under such a realistic situation, we examined HapMiner on the SNP datasets by randomly removing 12.5% alleles that counted for missing and phase-unknown positions. The missing alleles were imputed simply based on allele frequencies before running HapMiner. The results (Fig. 4d) were quite satisfactory considering the small sample size (200) and high number of phenocopies. For example, with A = 10%, >80% of the predictions had errors <5 cM.
3.3.3 Significance of the predictions
To assess the significance of a prediction, a permutation test was performed for 1000 iterations. Figure 4e and f illustrates the permutation test results on every marker position using the same two datasets as in Figure 3 of Toivonen et al. (2000). The solid black line represents the predicted Z-scores for all the markers and the dashed black line underneath shows the empirical P-values of the predictions. The predicted gene location for the first dataset was within 0.2 cM of the true gene location represented by the vertical line in Figure 4e and the empirical P-value <0.001. For the second dataset, where the signal was much weaker, the predicted error was 1 cM and the empirical P-value was 0.019. While it is a common approach to use the permutation test as a way to assess the significance of the prediction, it seems inappropriate to take the position with the minimum empirical P-value as the predicted gene location itself, as in Toivonen et al. (2000). Unlike the normal P-value of a statistic, the position with the minimum empirical P-value might not have the highest statistic (Z-score, or
2-value in our case). In such a case, it is not clear why one should take the position with the minimum P-value as the prediction.
3.3.4 HapMiner, CLADHC and
2-test
We further compared HapMiner, CLADHC and the simple
2 on the SNP dataset with different levels of phenocopies (A = 10%, 7.5%, 5%,
= 1.7, 2.7,4.1 in terms of relative risks. The case A = 2.5% was dropped since no methods were significantly better than random guesses). Both haplotype-based approaches achieved much higher power (defined as the proportion of the predicted locations are within one half of the segment length from the true locations, used by CLADHC) and returned more accurate results than the simple
2-test (Fig. 4gi, Table 2). HapMiner is more robust against noisy data than CLADHC. For A = 10% (Fig. 4g), HapMiner and CLADHC reported similar results for two different values of haplotype segment lengths tested, and HapMiner was slightly better than CLADHC in terms of the root mean squared error rate (the square root of the average squared errors across 100 runs, Table 2). With the increase of the phenocopies, HapMiner achieved much higher power than CLADHC using the same segment lengths (Fig. 4h for A = 7.5%, Fig. 4i for A = 5% and Table 2). The two haplotype segment lengths were taken since CLADHC could not deal with segment lengths longer than 10 markers. All other parameters for both programs took their default values.
|
3.4 Results on real datasets
3.4.1 The CF dataset
We applied the algorithm to a widely studied real dataset originally reported by Kerem et al. (1989) in the study of the fine-mapping of the CF gene. The dataset contains 94 affected haplotypes and 92 normal haplotypes with 23 RFLP markers each. It is known that a certain founder mutation
F508 between markers 17 and 18,
0.88 cM away from the first marker, accounts for 67% of the disease chromosomes. The result of our prediction is illustrated in Figure 5a. For comparisons, all three methods use the adjusted significant levels. The x-coordinate represents the marker positions and the y-coordinate represents the significant levels. The overall significant level P for the whole region was 0.05. The value (yi,j) at marker i for method j was defined as yi,j = (log(pi,j)) (log(p)/nj), where pi,j is the significant value of method j at position i (both HapMiner and the single SNP method use the
2-test with 1 df; CLADHC uses the likelihood ratio test) and nj is the number of total multiple tests over the region for method j (CLADHC has two levels of multiple tests, so the number is different from the number of SNPs). So in Figure 5a, a marker with a positive value means it is significant at the 0.05 level. Our predicted disease location is at marker 18 (0.89 cM away from the first marker) with much higher significant level than it by the simple
2-test. CLADHC with the same segment length 7 output two markers with very similar significant levels while the distance between the two markers is 0.125 cM. In terms of point estimation, our prediction is better than the point estimation (0.8698 cM away from the first marker) in Molitor et al. (2003) which took the mode of posterior distribution as the disease gene location. No point estimation was given by Liu et al. (2001) and they only reported the 95% confidence interval was around [0.82, 0.93]. The cluster identified by HapMiner consists of 63 haplotypes and 60 of them were from the 94 disease chromosomes, which is very close to the total number of disease chromosomes that had the DS mutation. The majority of the two sets overlapped. The haplotype segment length parameter was set to be 7 markers in Figure 5a, and the exactly same set of chromosomes and similar profile were obtained by HapMiner when using segment length of 5 (data not shown). For the analysis on the three real datasets, all other parameters took the default values as those in the dataset II, namely, w1 = w2 = e10x,
= 0.2, and the percentile for MinPts is 0.25. CLADHC could not handle multiallelic data so we only output the results by HapMiner for the remaining two datasets using their Z-score profiles.
|
3.4.2 The FA dataset
We further applied the algorithm to the second real dataset concerning the localization of Friedreich Ataxia (FA) gene reported in Liu et al. (2001) and reanalyzed by Molitor et al. (2003). Our data contains 54 disease haplotypes and 69 control haplotypes with 12 microsatellite markers spanning a region of 15 cM. The gene is located between the fifth and sixth markers. More details about the data can be found in Liu et al. (2001). HapMiner predicted the gene position on the fifth marker as shown in Figure 5b, with a Z-score of 6.03. The haplotype segment length parameter was set to be 7 and a similar result was obtained for segment length 5. The most informative cluster identified consists of 25 disease haplotypes where the biggest cluster identified in Liu et al. (2001) contained 33 haplotypes. We obtained three other small clusters as found in Liu et al. (2001) that may be due to allele heterogeneity. The sizes of our clusters were smaller than those of the clusters in Liu et al. (2001) mainly because our parameters were chosen in such a way that the algorithm could detect phenocopies more effectively. No point estimation or confident interval were given in Liu et al. (2001). Again, the point estimation was much better than the results in Molitor et al. (2003) where the prediction was 2 markers (0.25 cM) away from the true location.
3.4.3 The T1D dataset
We have also tested HapMiner on the third real dataset, consisting of affected sib-pair families with type 1 diabetes (T1D) obtained from Herr et al. (2000). The T1D dataset consists of 385 affected sib-pair families each with 2 parents and 2 affected children. There are a total of 25 microsatellite markers spanning a 14 Mb region on chromosome 6 including the entire HLA complex, with known type 1 diabetes-susceptibility locus. The haplotypes were inferred from the genotype data using the integer linear programming (ILP) algorithm of the PedPhase program by Li and Jiang (2004). Only 89 families were taken from all 385 families since the other families missed the genotypes of all members in at least one locus. For each family, a haplotype from the four parental haplotypes was assigned as a case haplotype if it appears in any of the two affected children. Otherwise it was selected as a control haplotype. There were totally 213 case haplotypes and 143 control haplotypes. The length of a haplotype segment was set to be 5. The results (Fig. 5c) show that HapMiner could find the DS gene location at marker D6S2444 with a Z-score of 3.72. The location is the same as those identified by TDT (Transmission Disequilibrium Test) type of tests in Herr et al. (2000), while HapMiner only used a much smaller subset of the total data. The associated cluster has 32 haplotypes and only 3 are from control haplotypes. The number of core haplotypes is 27 and the consensus haplotype pattern is 61429.
4 DISCUSSION
We have described a model-free haplotype association mapping method and proposed a new haplotype similarity measure. The program, HapMiner, is well suited for gene fine mapping and efficient for whole-genome screens. Results on two simulated datasets and three real datasets have illustrated that HapMiner could predict DS gene locations with high accuracy under various situations with realistic sample sizes, and it has a better performance than some recently developed approaches. Simulations based on the dataset from the literature show that it is effective even for data containing a high rate of phenocopies (corresponding to small relative risks). We have tested HapMiner under two evolutionary models and it performed consistently well regardless of the population history. The simulations and the real datasets consisted of dominant, recessive and complex diseases, and HapMiner was able to successfully identify the DS gene locations for all the cases. It requires no prior information about the evolutionary history (genealogy of haplotypes) or inheritance patterns of the diseases. Extensive tests have also demonstrated the robustness of HapMiner on the selection of different parameters.
The framework can easily handle diseases with multiple founder mutations per locus since HapMiner could report all clusters that are significant at each marker locus as we did on the FA dataset. It can also handle diseases with multiple genes and genegene interactions by modifying the similarity measure to account for different haplotype segments. The significance level of the prediction is evaluated by carrying out permutation tests. The properties of the proposed statistics (Z-score or
2) under different assumptions will be investigated. It is also possible to incorporate statistical techniques for studying false discovery rates (Storey and Tibshirani, 2003) into our genome-wide association mapping studies. For false positive due to population structure, one can also incorporate the genomic control method to the proposed framework.
The method presented here assumes that the haplotype pair of each individual is available, which in general can be inferred by computational methods based on genotype data. A possible extension is to take into consideration the ambiguity of the inferred haplotypes as well as the dependence of the two haplotypes from the same individual. An alternative to the use of inferred haplotypes is to calculate similarity/distance based on genotype vectors directly. For instance, similarity of two genotype vectors can be measured based on the number of identical alleles at each marker. But our preliminary results on genotype vectors have shown it cannot provide accurate predictions in most cases. We will systematically investigate how the predictions will be affected while using inferred haplotypes from various sources by different algorithms.
The notion of density-based clusters is crucial to the prediction accuracy when data contain high level noise such as phenocopies and incomplete penetrances. It also alleviates the mislabeling problem of haplotypes, i.e. for casecontrol data, it is possible that only one of the two case haplotypes from an affected individual contains the disease mutation, while we label both of them as case haplotypes. A limitation of the DBSCAN algorithm is that it could not automatically determine the density of the input data. In the current implementation, we have to rely on the user to supply the parameters. Although we have shown that the algorithm is robust against a broad range of parameters, it is still difficult to argue what is the optimal value for each parameter. The two parameters (
and MinPts) depend on the level and the patten of LD near the disease locus, which is generated by many forces such as recombination rate and distribution, population structure, the age of the disease mutation, genetic drift, etc. Detail characterization of the relationship between the parameters and those factors is difficult. Nevertheless, one possible extension to the current framework is to automatically estimate the density of the input data. For example, the density around a data point can be evaluated by looking at the distance distribution from this data point to all other data points. Another possible direction is to incorporate model-based clustering algorithms, which assumes that different clusters correspond to different distributions (Fraley and Raftery, 2002). Then the problem of finding appropriate threshold values is just the standard model selection and parameter estimation problem.
In addition to complex diseases, many continuously distributed quantitative traits are of primary clinical and health significance. Examples of such quantitative traits are blood pressure, cholesterol level, obesity, and bone mineral density, etc. In many cases, the disease status of an individual is actually defined based on some threshold value of a particular quantitative trait. Quantitative values can actually provide much more detailed information than the disease status only. The approach proposed here can be extended to quantitative trait association mapping by defining a new score. The idea is to identify clusters first according to haplotype similarities and to evaluate the mean differences of the trait values of those in a cluster and those not in the cluster. Additional work will be done to investigate the feasibility of quantitative trait mapping using the framework.
In summary, results on datasets from various sources demonstrate the high accuracy and great flexibility of the proposed method. HapMiner will be a useful tool for LD mapping and complement the existing model-based statistical methods.
| Acknowledgments |
|---|
We are grateful to H. T. Toivonen for sharing the dataset I, J. S. Liu for sharing the CF and FA datasets and M. Herr for sharing the T1D data with us. We thank C. Durrant for providing her program CLADHC. We thank the four anonymous reviewers for their many constructive suggestions and valuable comments. J. L.'s research is supported in part by NSF grants CCR-0311548 and ITR-0085910, and the start-up fund from Case Western Research University. T. J.'s research is supported in part by NSF Grants ITR-0085910 and CCR-0309902, Chinese National Key Project for Basic Research (973 #2002CB512801), and a fellowship from Center for Advanced Study, Tsinghua University, Beijing, China.
Conflict of Interest: none declared.
Received on August 17, 2005; revised on October 4, 2005; accepted on October 19, 2005
| REFERENCES |
|---|
|
|
|---|
Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J. (1999) OPTICS: ordering points to identify the clustering structure. Proc SIGMOD'99Philadelphia, PA , pp. 4960.
Daly, M.J., et al. (2001) High-resolution haplotype structure in the human genome. Nat. Genet, . 29, 229232[CrossRef][Web of Science][Medline].
Devlin, B. and Roeder, K. (1999) Genomic control for association stdies. Biometrics, 55, 9971004[CrossRef][Web of Science][Medline].
Durrant, C., et al. (2004) Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am. J. Hum. Genet, 75, 3543[CrossRef][Web of Science][Medline].
Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Proc KDD'96Portland, OR , pp. 226231.
Fienberg, S.E. The Analysis of Cross-Classified Categorical Data, (1977) , Cambridge, MA MIT Press.
Fraley, C. and Raftery, A.E. (2002) Model-based clustering, discriminant analysis, and density estimation. J. Amer. Stat. Assoc, . 97, 611631[CrossRef].
Gabriel, S.B., et al. (2002) The structure of haplotype blocks in the human genome. Science, 296, 22252229
Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, (1997) , Cambridge, UK Cambridge University Press, pp. 125.
Han, J. and Kamber, M. Data Mining: Concepts and Techniques, (2000) , San Francisco, CA Morgan Kaufmann Publishers.
Herr, M., et al. (2000) Evaluation of fine mapping strategies for a multifactorial disease locus: systematic linkage and association analysis of IDDM1 in the HLA region on chromosome 6p21. Hum. Mol. Genet, . 9, 12911301
Hinneburg, A. and Keim, D.A. (1998) An efficient approach to clustering in large multimedia databases with noise. Proc KDD'98New York, NY , pp. 224228.
Kerem, B.S., et al. (1989) DNA marker haplotype association with pancreatic sufficiency in cystic fibrosis. Am. J. Hum. Genet, . 44, 827834[Web of Science][Medline].
Kruglyak, L. and Lander, E.S. (1998) Faster multipoint linkage analysis using Fourier transforms. J. Comput. Biol, . 5, 17[Web of Science][Medline].
Li, J. and Jiang, T. (2004) An exact solution for finding minimum recombinant haplotype configurations on pedigrees with missing data by integer linear programming. Proc RECOMB'04San Diego, CA , pp. 2029.
Liu, J.S., et al. (2001) Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res, . 11, 17161724
McPeek, M.S. and Strahs, A. (1999) Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet, . 65, 858875[CrossRef][Web of Science][Medline].
Molitor, J., et al. (2003) Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques. Am. J. Hum. Genet, . 73, 13681384[CrossRef][Web of Science][Medline].
Niu, T., et al. (2002) Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet, . 70, 157169[CrossRef][Web of Science][Medline].
Risch, N. and Merikangas, K. (1996) The future of genetic studies of complex human diseases. Science, 273, 15161517
Storey, J. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA, 100, 94409445
Stephens, M., et al. (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet, . 68, 978989[CrossRef][Web of Science][Medline].
Toivonen, H.T., et al. (2000) Data mining applied to linkage disequilibrium mapping. Am. J. Hum. Genet, . 67, 133145[CrossRef][Web of Science][Medline].
Tzeng, J.Y., et al. (2003) On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet, . 72, 891902[CrossRef][Web of Science][Medline].
Zollner, S. and Pritchard, J.K. (2005) Coalescent-based association mapping and fine mapping of complex trait Loci. Genetics, 169, 10711092
This article has been cited by other articles:
![]() |
M. Li, K. Wang, S. F. A. Grant, H. Hakonarson, and C. Li ATOM: a powerful gene-based association test by combining optimally weighted markers Bioinformatics, February 15, 2009; 25(4): 497 - 503. [Abstract] [Full Text] [PDF] |
||||
![]() |
S.-Y. Su, D. J. Balding, and L. J.M. Coin Disease association tests by inferring ancestral haplotypes using a hidden markov model Bioinformatics, April 1, 2008; 24(7): 972 - 978. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Montana Statistical methods in genetics. Brief Bioinform, September 1, 2006; 7(3): 297 - 308. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








