Bioinformatics Advance Access originally published online on July 21, 2007
Bioinformatics 2007 23(18):2399-2406; doi:10.1093/bioinformatics/btm371
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Haplotype inference for present–absent genotype data using previously identified haplotypes and haplotype patterns
1Department of Biostatistics, 2Department of Medicine and 3Department of Epidemiology, University of Alabama at Birmingham, Birmingham, AL 35294, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Killer immunoglobulin-like receptor (KIR) genes vary considerably in their presence or absence on a specific regional haplotype. Because presence or absence of these genes is largely detected using locus-specific genotyping technology, the distinction between homozygosity and hemizygosity is often ambiguous. The performance of methods for haplotype inference (e.g. PL-EM, PHASE) for KIR genes may be compromised due to the large portion of ambiguous data. At the same time, many haplotypes or partial haplotype patterns have been previously identified and can be incorporated to facilitate haplotype inference for unphased genotype data. To accommodate the increased ambiguity of present–absent genotyping of KIR genes, we developed a hybrid approach combining a greedy algorithm with the Expectation-Maximization (EM) method for haplotype inference based on previously identified haplotypes and haplotype patterns.
Results: We implemented this algorithm in a software package named HAPLO-IHP (Haplotype inference using identified haplotype patterns) and compared its performance with that of HAPLORE and PHASE on simulated KIR genotypes. We compared five measures in order to evaluate the reliability of haplotype assignments and the accuracy in estimating haplotype frequency. Our method outperformed the two existing techniques by all five measures when either 60 % or 25 % of previously identified haplotypes were incorporated into the analyses.
Availability: The HAPLO-IHP is available at http://www.soph.uab.edu/Statgenetics/People/KZhang/HAPLO-IHP/index.html
Contact: KZhang{at}ms.soph.uab.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Population-based association studies for both genome-wide mapping and fine mapping of complex disease genes have become increasingly popular as they offer potentially more cost-effective and powerful approaches than linkage analysis (Ardlie et al., 2002; Botstein and Risch, 2003). There are two general types of methods for analyzing the unphased genotype data available from association studies: single marker-based methods and haplotype-based methods. The later can provide additional power in defining effects associated with multiple disease-related alleles within a single gene (Morris and Kaplan, 2002) or when a single marker test fails to capture local complexity of linkage disequilibrium (LD) between multiple markers (Akey et al., 2001). Haplotypes could also be better units of analysis for understanding LD patterns in the human genome since they reflect preservation or change in the combined alleles at multiple loci throughout the evolutionary process (Clark, 2004).
Practically, however, assembly of haplotypes within individuals of a study population is not always straightforward. Current laboratory techniques (e.g. Douglas et al., 2001; Yan et al., 2000) can be used to determine local haplotypes experimentally, but these approaches are often too expensive and too cumbersome to be used effectively for large-scale studies. Therefore, most association studies have relied on unphased genotyping data, on which statistical and computational methods have been used to infer haplotypes through estimation of haplotype frequencies and resolution of haplotype pairs within individuals. Accordingly, effective and accurate methods for haplotype inference in various situations are highly valuable. Here, haplotype inference refers to reconstruction of a haplotype pair in an individual and estimation of haplotype frequencies in a population.
The methods for haplotype inference from unrelated individuals can be divided into two major categories. Some are based on likelihood computations (e.g. Excoffier and Slatkin, 1995; Niu et al., 2002; Qin et al., 2002; Stephens et al., 2001), whereas others rely on optimization algorithms to obtain a solution (e.g. Clark, 1990; Gusfield, 2001; Huang et al., 2005). The use of advanced techniques, such as the partition–ligation and the Markov chain Monte Carlo technique and the incorporation of LD pattern and population models, have made the haplotype inference for unrelated individuals more accurate with up to several hundred SNP markers (Fallin and Schork, 2000; Stephens and Donnelly, 2003). However, none of them have been specifically tailored for analyses of the family of killer immunoglobulin-like receptor (KIR) genes as detected with locus-specific technology.
The human KIR gene family comprises a cluster of genes located at 19q13.4, which is a highly polymorphic region of the human genome (Hsu et al., 2002; Marsh et al., 2003; Martin et al., 2004; Middleton et al., 2005). KIR genes are known to regulate cytolytic activity of human natural killer cells, but the exact mechanisms are not fully understood (Middleton et al., 2005). To date, 17 genes and pseudogenes have been assigned to the KIR gene family (Marsh et al., 2003). Variation in the number and type of KIR genes across individuals is well documented (Hsu et al., 2002; Middleton et al., 2005; Uhrberg et al., 2002; Whang et al., 2005), whereas discovery of allelic variants within each KIR gene is still at an early stage (Hsu et al., 2002).
Current KIR genotyping is largely restricted to locus-specific PCR, which detects the presence or absence of target genes. Ambiguity arises from inability to resolve the copy number (one or two) of specific KIR genes (Hsu et al., 2002; Middleton et al., 2005). This ambiguity leaves a large portion of data potentially missing. For example, we analyzed KIR genotype data reported for a population of unrelated Caucasians after excluding four anchor genes that are invariably present and converting the KIR gene designations to reflect a more recent categorization scheme (Hsu et al., 2002). By our computation, 66.5 % were typed as present across all KIR genes; i.e. 66.5 % of the information on homozygosity or hemizygosity or 33.3 % of diploid genotype data were missing. Furthermore, the observed systematic (non-random) loss of data might be expected to compromise the performance of the conventional EM algorithm based haplotype inference method for resolving missing data.
For such present–absent genotype data of KIR genes, haplotypes defining gene content could be inferred using some simple logic rules if there is enough familial information for each individual (Hsu, 2002; Middleton, 2005). Studies so far have revealed 42 different haplotypes involving different combinations of 17 KIR genes (Khakoo and Carrington, 2006), including the four anchor genes (3DL3, 3DP1, 2DL4 and 3DL2) present on all KIR haplotypes. In addition, certain KIR genes (e.g. 2DS2 and 2DL2) always occur together, suggesting strong or complete LD between them. These known haplotypes and various haplotype patterns can be incorporated into inference routines to eliminate unlikely haplotypes and emphasize more likely candidate haplotypes as possible resolutions for individual genotypes.
We have developed a hybrid method for haplotype inference. This hybrid consists of a staged application of the greedy algorithm (Black, 2005; Weiss, 1997) followed by the EM algorithm (Excoffier and Slatkin, 1995; Qin et al., 2002), which can incorporate information on previously identified haplotypes and haplotype patterns. To evaluate the performance of our method, we conducted simulation studies based on KIR gene data from two Caucasian populations (Hsu et al., 2002; Middleton et al., 2004). Our method of inferring haplotypes formed from present–absent genotype data compared favorably with results using other available methods (Stephens et al., 2001; Zhang et al., 2005).
| 2 METHODS |
|---|
|
|
|---|
2.1 Notations
We introduce the following notations in our computation. Let L = {l1, ..., lN} be a set of N loci; for simplicity they are indexed from 1 to N. Let X = {G1, ..., GM} denote the genotype of M unrelated individuals, where
Furthermore, let H = {h1, ..., hK}denote a set of K haplotypes across N loci, where hi is an N-dimensional vector and its element can take only status 1 (present) or 0 (absent). The frequencies of haplotypes in H are denoted as
= {
1, ...,
K}, where
K is the frequency of the haplotype hk (k = 1, ..., K) in the population. A pair of haplotypes hi and hj is said to be compatible with a genotype Gm if two alleles at each locus from the haplotype pair are compatible with the genotype at each locus. Note that we only allow missing data in the genotype Gm but not in the haplotype pair of hi and hj. The missing data are compatible with any specific allele. For example, if we consider two haplotypes across three loci such as h1 = (0,1,1) and h2 = (1,1,1) then h1 and h2 are compatible with the genotype G1 = {(1,0), (1,1), (1,1)} and G2 = {(1,?), (1,?), (1,1)} but not with G3 = {(1,?), (1,0), (1,1)} because the two alleles at the second locus on h1 and h2 are both 1 but two alleles for G3 are 1 and 0.
We denote H0 as the initial set of identified haplotypes and S as the set of haplotype patterns. For the KIR genes, several haplotype patterns have been observed from the empirical data: some genes always appear on every haplotype, some pairs of genes always appear to be both absent or both present (completely positive LD), while some pairs of genes never appear together (completely negative LD). We also emphasize that each haplotype in H0 is compatible with the haplotype patterns in S.
2.2 The hybrid algorithm
Our hybrid algorithm first uses the greedy algorithm, drawing upon an initial set of haplotypes and haplotype patterns, to construct a minimum set of haplotypes for resolving all genotypes. The EM algorithm is then employed to estimate haplotype frequencies and reconstructs haplotype pairs for each individual according to their posterior probabilities.
2.3 The greedy algorithm
A greedy algorithm is the one that can find the optimal solution by choosing a local optimum at each stage and advancing to the next stage until the selection process is exhausted (Black, 2005; Weiss, 1997). The global optimum may not be found by a greedy algorithm, but at least a good approximation can be expected. Generally, greedy algorithms are faster than exhaustive algorithms that try to find the globally optimal solution in complicated situations. We developed a greedy algorithm to identify a minimum set of gene haplotypes that can resolve the maximum number of individuals. We consider an individual resolved for a set of haplotypes if the genotype is compatible with at least one pair of haplotypes in the set of haplotypes. It has been shown that this objective to find the minimum set of haplotypes to resolve the maximum number of individuals in a sample is an NP-hard problem, meaning that there is no polynomial time algorithm that can guarantee such minimum set of haplotypes (Gusfield, 2001; Huang et al., 2005). Among the various algorithms for this purpose (e.g. Gusfield, 2001; Huang et al., 2005; Wang and Xu, 2003), none have relied on previously identified haplotypes to resolve haplotypic relationships.
Starting from a set of previously identified haplotypes (H0) and a set of haplotype patterns (S), first a subset of individuals in the sample may be resolved. Then, a set of extended haplotypes (E0) is constructed to serve as candidate haplotypes for the unresolved individuals based on H0 and S. From this extended haplotype set, the greedy algorithm finds additional haplotype(s) at each iteration to resolve more individuals until it reaches the greatest possible resolution of the maximum number individuals. Some individuals may not be resolved if all compatible paired haplotypes are contradictory to given haplotype patterns in S. The maximum number of individuals to be resolved is denoted as Mmax and it could be less than the number of total individuals used in the study. The execution of the algorithm proceeds in the following steps:
- Identify those individuals who can be resolved by the initial set of haplotypes (H0). Denote the size of these individuals as M0.
- For individuals unresolved by H0, construct first extended set Ea as a set of all haplotypes that can resolve more individuals as a pair with a haplotype in H0, i.e. (h1,h2) for h1
H0 and h2
Ea and do not violate the haplotype patterns in S. If there remain individuals unresolved by H0
Ea, construct second extended set Eb as a set of all haplotypes that can resolve more individuals as a pair within this second extended set, i.e. (h1,h2) for h1,h2
Eb and do not violate the haplotype patterns in S. The final extended set E0 is obtained by E0 = Ea
Eb. Note that the number of individuals resolved by H0
E0, is Mmax.
- For each h
E0, obtain the number of individuals who can be resolved by H0
{h}; that number is denoted as M0,h. Choose h0 such that
is the biggest among all h
E0. Construct the new set of haplotypes H1 as H1 = H0
{h0}and E1 by E1 = E0\H1. If multiple haplotypes yield a tie with the highest resolution
at this step, then include all of them in H1 and exclude all of them from E1.
- For t = 1,2, ..., construct Ht+1 and Et+1 based on Ht and Et as the same way in (3) by obtaining Mt,h as the size of individuals resolved by Ht
{h} for all h
Et. Repeat this iteration until no additional individual can be resolved.
- Suppose the iteration stops at Tth iteration resulting in the haplotype sets HT and ET. If the number of resolved individuals by HT is Mmax, stop the algorithm and denote the final haplotype set as H. Otherwise, for each pair of (h1, h2) where (h1,h2)
Ek, h1
h2, obtain the number of individuals that can be resolved by HK
{h1, h2}, which is denoted as
. Choose
such that
is the biggest for all (h1, h2)
ET, h1
h2. Construct the new set of haplotypes,
and ET+1 = ET/HT+1. Then go to step (4) and repeat the iteration until the number of individuals that can be resolved by the final haplotype set H is Mmax.
Although the final haplotype set obtained as H from this greedy algorithm may not be a minimum set of haplotypes that resolves the maximum number of individuals, our greedy algorithm guarantees that H can resolve maximum number of individuals.
2.4 The EM algorithm
After a final set of haplotypes, H = {h1, ..., hK} has been identified, the EM algorithm estimates haplotype frequencies and assigns compatible haplotype pairs with their posterior probabilities for each individual. The EM algorithm method has been widely used for haplotype inference, and a detailed description is available elsewhere (Excoffier and Slatkin, 1995; Qin et al., 2002). The only difference between previous applications and ours is that we use only the final set of haplotypes obtained from the greedy algorithm. The final output of our algorithm consists of two lists: one is the list of haplotypes with estimated frequencies greater than the threshold (e.g. 1 x 10–5), and the other is the list of compatible haplotype pairs assigned for each individual with their posterior probabilities. We have implemented our algorithms in a program named HAPLO-IHP, which is available at:http://www.soph.uab.edu/Statgenetics/People/KZhang/HAPLO-IHP/index.html.
2.5 Data simulations
We compared the performance of HAPLO-IHP with other two programs for haplotype inference, PHASE (Stephens et al., 2001) and HAPLORE (Zhang, 2005) based on simulations. The 17 KIR haplotypes and their frequencies reported by Middleton et al. (2005) and Hsu et al. (2002) for Caucasian population were used to generate individual genotypes. The original haplotypes consist of combinations of 17 KIR genes, but only 14 of them were used to simulate present–absent KIR data. Two pseudogenes (2DP1 and 3DP1) were excluded due to their lesser importance in functionality, and 2DS4 (allele-specific genotyping available) was excluded due to irrelevance for the purpose of testing the performance for present–absent format genotype data. Since the frequency distribution of 17 KIR haplotypes obtained using two studies (model 1) is unusual in that one major haplotype has the frequency of more than 50 %, we constructed another haplotype frequency distribution model (model 2) so that the highest haplotype frequency is set at 30 % and others are modified proportionally to their relative frequencies in the original frequency distribution. The 17 haplotypes and their frequencies in models 1 and 2 simulations are presented in Table 1.
|
To generate the data under the assumption of Hardy–Weinberg Equilibrium (HWE), two haplotypes were randomly selected according to their frequencies and paired to form the genotype of each individual. Then the genotype was converted to the present–absent format. Original haplotype configuration for each individual was stored separately to assess the performance of methods for haplotype inference. The data were generated with different sample sizes (100, 200, 400) and two haplotype frequency models. In each situation, 100 data replications were simulated.
Similarly, the data were also generated assuming a departure from HWE by modifying the proportions of heterozygous haplotype pairs and homozygous haplotype pairs in the following way. Let x be the homozygosity parameter and y be the heterozygosity parameter. Shom is the sum of frequencies for all homozygous haplotype pairs and Shet is the sum of frequencies for all heterozygous haplotype pairs under HWE. We can obtain
that satisfies
(xShom + yShet) = 1 for given x and y. With the HWE assumption, x and y are set to be equal to 1. In addition, we can set x > y to represent the excessive homozygosity and set x < y to represent the excessive heterozygosity. The new frequency of each haplotype pair with a departure from HWE is obtained by multiplying
x for homozygous haplotype pairs or
y for heterozygous haplotype pairs. We generated data sets assuming x = 1 and y = 2 (excessive heterozygosity) and x = 2 and y = 1 (excessive homozygosity).
2.6 Comparison of methods of haplotype inference
For each simulated data set, we estimated haplotype frequencies and assigned the haplotype pair with the maximum posterior probability to each individual as its true haplotype pair using HAPLO-IHP, HAPLORE and PHASE. For HAPLO-IHP, we obtained the results with different initial sets of identified haplotypes, which account for 100, 60 and 25 % of 17 original KIR haplotypes that were used in our simulations. Haplotypes 1, 5, 6, 7, 8, 9, 11, 12, 16 and 17 are selected for the 60 % initial set and haplotypes 1, 5, 8 and 9 are selected for the 25 % initial set. For frequency model 1, the 60 and 25 % sets of initial haplotypes include the 10 and 4 most frequent haplotypes, accounting for 96 and 83 % of haplotype frequency, respectively. For frequency model 2, the 60 and 25 % sets of haplotypes account for 93 and 73 % of haplotype frequency, respectively.
In our simulations, we used three types of haplotype patterns derived from the observations in KIR haplotype studies (Hsu et al., 2002; Martin et al., 2004; Middleton et al., 2005): (1) three anchor genes, 2DL4, 3DL2 and 3DL3 always present in all haplotypes; (2) two genes, 2DS2 and 2DL2, always either present or absent together in all haplotypes and (3) two pairs of genes, (3DS1, 3DL1) and (2DL2, 2DL3), in complete negative LD, i.e. when one gene in each pair is present in a haplotype, the other gene in the same pair is absent. To evaluate the performance of the HAPLO-IHP when only the haplotype patterns were used, we inferred the haplotypes without using any previously identified haplotypes. Also, we modified HAPLORE to incorporate these three haplotype patterns in the haplotype inference and compared the results with those obtained from HAPLO-IHP. The results of HAPLO-IHP with the initial set of identified haplotypes but no haplotype pattern were also obtained.
To quantify the performance of different methods for haplotype inference, we calculated the following five measures for each replicate and the average of these measures over 100 replicates: IH (Excoffier and Slatkin, 1995), sum of absolute differences between estimated and true frequencies (SAD; Falling and Schork, 2000), sum of squared errors between estimated and true frequencies (SSE; Falling and Schork, 2000), individual error rate (IE; Niu et al., 2002) and similarity error rate (SE; Stephens and Donnelly, 2003). These five measures have been selected because they have been extensively used by researchers to evaluate the performance of methods for haplotype inference and each of them provides different aspects of such evaluation. IH, SAD and SSE are computed from the estimated haplotypes and their frequencies, while IE and SE are computed from haplotype pairs assigned to each individual with their true haplotypes. Specifically, IH is defined as:
|
| (1) |
|
| (2) |
|
| (3) |
| 3. Results |
|---|
|
|
|---|
3.1 Comparisons of performance of HAPLO-IHP with PHASE and HAPLORE
Inferences performed by HAPLO-IHP, PHASE and HAPLORE on haplotypes from simulated data were evaluated by all five measures (IH, SAD, SSE, IE and SE). Results of those evaluations include the average of five measures and their 95 % confidence intervals (CIs) over 100 replicates for frequency models 1 and 2 with sample size of 200 under HWE from: (1) the results of HAPLO-IHP using the 100, 60 and 25 % initial sets of true haplotypes along with the haplotype patterns; and (2) the results of HAPLORE and PHASE, computed for each replicate (Table 2 and Supplementary Table S1). The average IH values for HAPLO-IHP were greater than 0.95 for the 100 % initial set and at least greater than 0.75 for the 25 % initial set for both frequency models. The average IH values for HAPLORE were between 0.50 and 0.56 and for PHASE around 0.16, indicating these methods identified many more haplotypes that are not present in the sample due to the high portion of missing data. Furthermore, the number of haplotypes identified from HAPLO-IHP is much closer to the true number of haplotypes than those obtained from HAPLORE and PHASE.
|
HAPLO-IHP showed consistently smaller SAD, SSE, IE and SE values than HAPLORE and PHASE, also indicating enhanced performance by HAPLO-IHP. PHASE performed least well compared with HAPLO-IHP applied to different initial sets and compared with HAPLORE. Under frequency models 1 and 2,
65 and 81 % of individuals in simulated data were assigned incorrect haplotype pairs by PHASE, whereas only 17 and 25 % of individuals with simulated data were assigned incorrect haplotype pairs by HAPLO-IHP with the 25 % initial sets of true haplotypes. Again under frequency models 1 and 2, the results of SAD, the sum of differences between true and estimated haplotype frequencies by HAPLO-IHP, ranged from 0.08 to 0.17 for two simulated data sets. These values were much lower than those obtained by HAPLORE and PHASE.
3.2 The effect of sample size
We assessed the performance of HAPLO-IHP, HAPLORE and PHASE with different sample sizes. Figure 1 shows the average values for IH and SAD measures with sample sizes of 100, 200 and 400 (please refer to Supplementary Table S2 for the average values and their 95 % CIs of five measures). HAPLO-IHP outperformed HAPLORE and PHASE for all sample sizes. Using the different initial sets of haplotypes, HAPLO-IHP and HAPLORE consistently showed improved performance with increased sample sizes. For PHASE, only SAD and SSE decreased with increased sample sizes. For HAPLO-IHP, the improvement in accuracy with increasing sample size from 100 to 400 was smaller. For HAPLORE, the contribution of increased sample size was greater. With a sample size of 400, HAPLORE performed worse compared to HAPLO-IHP with the 25 % initial set and a sample size of 100.
|
3.3 The effect of Hardy–Weinberg equilibrium assumption
We investigated the effect of departure from HWE on the performance of the three algorithms on data simulated to display HWE, excessive heterozygosity and excessive homozygosity (Table 3 and Supplementary Table S3). For HAPLO-IHP with different initial sets, the average IH values increased for excessive homozygosity model (HOM) and decreased for excessive heterozygosity model (HET) compared with the results of the data generated with the HWE assumption. SAD measures were slightly higher (worse performance) for excessive homozygosity model compared with the HWE model for HAPLO-IHP, and they were even bigger for the excessive heterozygosity model. For SSE, IE and SE measures (lower values mean better performance), the excessive homozygosity model performed better than the HWE model for all inference methods except the HAPLO-IHP with the 25 % initial set (exceptions occurred on SSE and IE measures), and the excessive heterozygosity model performed worse than the HWE model.
|
For HAPLORE, the average IH value decreased slightly for the excessive homozygosity model and more for the excessive heterozygosity model. The other four measures (SAD, SSE, IE and SE) performed better for the excessive homozygosity model (smaller values) and worse for the excessive heterozygosity model (bigger values) relative to the HWE model. For PHASE, all five measures performed better for the excessive homozygosity model and worse for the excessive heterozygosity model relative to the HWE model.
This generally better performance for the excessive homozygosity model and worse performance for the excessive heterozygosity model was expected due to the reduced haplotype ambiguity in data with excessive homozygosity. In all of these evaluations, HAPLO-IHP consistently performed better than HAPLORE and PHASE.
3.4 The effect of the use of identified haplotypes and haplotype patterns
We also investigated the effect of using only previously identified haplotypes and the effect of using only the haplotype patterns. We performed the haplotype inference with (1) HAPLO-IHP using only the haplotype patterns, (2) modified HAPLORE incorporating such haplotype patterns and (3) HAPLO-IHP using only the initial sets of identified haplotypes that account for 60 and 25 % original haplotypes (Table 4 and Supplementary Table S4). HAPLO-IHP and modified HAPLORE using only the haplotype patterns performed better than the original HAPLORE and PHASE (see Table 2 for the result of original HAPLORE and PHASE). Interestingly, most of the measures showed slightly better results for HAPLO-IHP or modified HAPLORE using only the haplotype patterns compared with those obtained for HAPLO-IHP using both the initial set of 25 % original haplotypes and the haplotype patterns. Using only the initial set of 60 and 25 % haplotypes, HAPLO-IHP performed better than the original HAPLORE and PHASE for all five measures. HAPLO-IHP showed slightly better results when using only the initial set of 60 % original haplotypes than when using both the initial set of 25 % original haplotypes and the haplotype patterns. In other words, combining the haplotype pattern with the initial set of 25 % original haplotypes did not improve the performance appreciably over using just the initial set of 60 % original haplotypes.
|
| 4 DISCUSSION |
|---|
|
|
|---|
Methods for inferring haplotypes from unrelated individuals have been valuable and widely used in genetic studies. Although the existing methods can be directly applied to the analyses of present–absent genotype results, the large portion of data missing from such genotyping effort could compromise the inference. In addition, none of these methods have considered the incorporation of known haplotypes and/or haplotype patterns. Our new algorithm aims to incorporate available information such as previously reported haplotypes and patterns observed in haplotypes, in order to ensure accuracy for haplotype inference. Through various simulations of data in the present–absent format, our new program (HAPLO-IHP) showed consistent superiority in performance when compared with two publicly available programs, HAPLORE and PHASE. Specifically, PHASE assigns haplotype pairs with
60–80 % individual error rates (IE), and its predicted haplotype frequency differed from true frequencies (SAD) that was more than four times the SAD error in the HAPLO-IHP frequency estimation (25 % initial set). This poor performance of PHASE may be due to the fact that the assumptions and methods used in PHASE such as the coalescent model and the similarity between haplotypes are not appropriate for our simulated data based on KIR genes. We also assessed the isolated effects of previously reported haplotypes and of haplotype patterns to initiate the greedy algorithm by using only one or the other of them in HAPLO-IHP. When only one of them is used, the overall performance of HAPLO-IHP was still better than HAPLORE and PHASE. The performance of HAPLO-IHP was better using haplotype patterns with the 25 % initial set than the performance using only haplotype patterns. The improved overall performance of HAPLO-IHP using both information sets makes it preferable for the inference of KIR gene haplotypes. However, if one of the information sets we incorporated in these algorithms is not obtainable, using only one can still improve the accuracy of haplotype inference.
A number of methods have been developed to find a minimum set of haplotypes that can resolve all individuals under study (Clark, 1990; Gusfield, 2001; Huang et al., 2005; Wang and Xu, 2003). The original Clark algorithm begins by obtaining resolved haplotypes from individuals who were heterozygous at only one locus or no locus (unambiguous genotypes) and tries to resolve other individuals who were heterozygous at more than one locus, one by one, by adding new haplotypes that can consist of a compatible pair of haplotypes for unambiguous individuals. The algorithm is repeated until no further resolution occurred for additional individuals. The resulting set of haplotypes depends on the order of the search of individual genotypes and only provides a local optimal solution. Subsequent studies recast the problem and highlighted the need for more powerful algorithms to find a global optimum. HAPLO-IHP is not only an attempt to solve the problem to find a optimum set of haplotypes but also uses the EM algorithm and available information from previous studies to improve the accuracy for haplotype inference.
Our program incorporated two kinds of information to improve accuracy in haplotype inference. It included data on KIR haplotypes and their patterns as identified in general populations. As typing for these genes increases, the certainty and frequency of those observed haplotypes will be better established, and novel haplotypes will be detected. The incorporation of newly discovered haplotypes into our program will inevitably further improve its capacity for haplotype inference. Their inclusion may prove especially advantageous for deciphering present–absent genotyping results since many individuals may have numerous compatible haplotype pairs due to the large portion of missing/ambiguous data. In addition, confirmation of the haplotype patterns already observed in populations can help eliminate less likely haplotypes and enhance the efficiency and accuracy of the algorithms.
Although we illustrated our method in the analysis of KIR genes, HAPLO-IHP is applicable to any bi-allelic markers, including single nucleotide polymorphisms (SNP). However, the portions of missing SNP data are generally much smaller, and it is sometimes difficult to start with known information (identified haplotypes and/or haplotypes patterns) for SNPs under investigation. In this situation, some commonly used programs for haplotype inference (e.g. PHASE) may be used without a priori assumptions. Nonetheless, we performed additional simulations to illustrate the general usefulness of our program. We simulated the data sets based on known haplotypes from 12 SNP markers in a study of the β2-adrenergic receptor (ADRB2) gene (Drysdale et al., 2000) with two missing data scenarios: (1) 5 % of the data missing overall, (2) 50 % of SNP 5, 6 and 7 data missing. For the second scenario, haplotype patterns of SNP 5, 6 and 7 identified in the original haplotypes were used as a known haplotype pattern set for HAPLO-IHP. We found that all three methods had similar performance for the first missing data scenario but HAPLO-IHP showed slightly better performance for the second missing data scenario (details are shown in Supplementary Tables S5 and S6).
Another factor that may affect the performance of HAPLO-IHP is the number of markers (genes) involved in the analysis. To assess how the number of the genes will affect the performance of HAPLO-IHP, we conducted additional simulations based on 7 and 10 KIR genes and found HAPLO-IHP generally performed slightly better for the lower number of genes (Supplementary Table S7). This is expected because an individual with more genes would generally have more compatible haplotype pairs, making the haplotype inference more difficult, especially for data sets with a large portion of missing data. However, we did not observe more individuals with unresolved haplotypes in the greedy algorithm when more genes were involved because haplotypes for an individual remain unresolved if and only if none of the compatible haplotype pairs are consistent with the given haplotype pattern.
One drawback of applying these constraints of haplotype patterns is that unusual haplotype pairs that are incompatible with these constraints may be falsely rejected. Therefore, flexibility in the selection of those pattern constraints may be warranted in order to resolve haplotypes in as many individuals as possible. Finally, accuracy of haplotype inference may be further improved by adaptation of more effective algorithms in finding optimal (minimal) set of haplotypes required for reliable haplotype inference. In this regard, several existing techniques (e.g. Huang et al., 2005; Wang and Xu, 2003) might be helpful after modifications based on previously identified haplotypes and specific patterns of linkage disequilibrium.
We also recognized that the EM algorithm often has computational problems with longer haplotypes. The optimal number of genetic markers for the EM algorithm ranges from 20 to 30 and computation time for the greedy algorithm implemented in HAPLO-IHP can increase substantially when more than 30 markers with a large portion of missing data are tested. The implementation of the popular partition–ligation technique (Qin et al., 2002) proved valuable to HAPLO-IHP. To assess the computational efficiency of HAPLO-IHP using the partition–ligation technique, we conducted simulations with 28, 42 and 56 genes, respectively. The average computational time for 14, 28, 42 and 56 genes were 4, 22, 53 and 107 s. These results clearly show that the use of the partition-ligation strategy reduces the computational time with the comparable performance (Supplementary Table S8).
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This work was supported in part by several R01 grants (AI041951, AI041530 and AI064060) and a contract (N01-AI40068) from the National Institute of Allergy and Infectious Diseases (NIAID). Additional funding came from R01-GM074913 (National Institute of General Medical Sciences).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Received on February 28, 2007; revised on June 22, 2007; accepted on July 11, 2007
| REFERENCES |
|---|
|
|
|---|
Akey J, et al. Haplotypes vs. single marker linkage disequilibrium test: what do we gain? Eur. J. Hum. Genet (2001) 9:291–300.[CrossRef][Web of Science][Medline]
Ardlie K, et al. Patterns of linkage disequilibrium in the human genome. Nat. Rev. Genet (2002) 3:299–309.[CrossRef][Web of Science][Medline]
Black PE. Dictionary of Algorithms and Data Structures [online]. Black PE, ed. (2005) NIST.
Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nat. Genet (2003) 33:228–237.[CrossRef][Web of Science][Medline]
Clark AG. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol (1990) 7:111–122.[Abstract]
Clark AG. The role of haplotypes in candidate gene studies. Genet. Epidemiol (2004) 27:321–333.[CrossRef][Web of Science][Medline]
Drysdale CM, et al. Complex promoter and coding region β (2)-adrenergic receptor haplotypes alter receptor expression and predict in vivoresponsiveness. Proc. Natl Acad. Sci. USA (2000) 97:10483–10488.
Douglas JA, et al. Experimentally derived haplotypes substantially increase the efficiency of linkage disequilibrium studies. Nat. Genet (2001) 28:361–364.[CrossRef][Web of Science][Medline]
Du FX, et al. Haplotype construction of sires with progeny genotypes based on an exact likelihood. J. Dairy Sci (1998) 81:1462–1468.[Abstract]
Excoffier L, Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol (1995) 12:921–927.[Abstract]
Falling D, Schork NJ. Accuracy of haplotype frequency estimation for biallelic loci, via expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet (2002) 67:947–959.
Gusfield D. Inference of haplotypes from samples of diploid populations: complexity and algorithm. J. Comput. Biol (2001) 8:305–323.[CrossRef][Web of Science][Medline]
Hsu KC, et al. The killer cell immunoglobin-like receptor (KIR) genomic region:gene-order, haplotypes and allelic polymorphism. Immunol. Rev (2002) 190:40–52.[CrossRef][Web of Science][Medline]
Huang YT, et al. An approximation algorithm for haplotype inference by maximum parsimony. J. Comput. Biol (2005) 12:1261–1274.[CrossRef][Web of Science][Medline]
Khakoo SI, Carrington M. KIR and disease: a model system or system of models? Immunol. Rev (2006) 214:186–201.[CrossRef][Web of Science][Medline]
Marsh SGE, et al. Killer-cell immunoglobulin-like receptor (KIR) nomenclature report, 2002. Hum. Immunol (2003) 64:648–654.[Web of Science][Medline]
Martin AM, et al. Comparative genomic analysis, diversity and evolution of two KIR haplotypes A and B. Gene (2004) 335:121–131.[CrossRef][Web of Science][Medline]
Middleton D, et al. KIR genes. Transpl. Immunol (2005) 14:135–142.[CrossRef][Web of Science][Medline]
Morris RW, Kaplan NL. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet. Epidemiol (2002) 23:221–233.[CrossRef][Web of Science][Medline]
Niu T, et al. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet (2002) 70:157–159.[CrossRef][Web of Science][Medline]
Qin ZS, et al. Partial-ligation-expectation-maximization for haplotype inference with single nucleotide polymorphisms. Am. J. Hum. Genet (2002) 71:1242–1247.[CrossRef][Web of Science][Medline]
Stephens M, Donnelly P. A comparison of Bayesian methods for haplotype reconstruction from population data. Am. J. Hum. Genet (2003) 68:978–989.[CrossRef]
Stephens M, et al. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet (2001) 68:978–989.[CrossRef][Web of Science][Medline]
Uhrberg M, et al. Definition of gene content for nine common group B haplotypes of the Caucasoid population: KIR haplotypes contain between seven and eleven KIR genes. Immunogenetics (2002) 54:221–229.[CrossRef][Web of Science][Medline]
Wang L, Xu Y. Haplotype inference by maximum parsimony. Bioinformatics (2003) 29:1773–1780.
Weiss MA. Data Structures and Algorithm Analysis in C. (1997) 2nd. Reading, MA: Addison-Wesley.
Whang DH, et al. Haplotype analysis of killer cell immunoglobulin-like receptor genes in 77 Korean families. Hum. Immunol (2005) 66:146–154.[CrossRef][Web of Science][Medline]
Yan H, et al. Conversion of diploidy to haploidy. Nature (2002) 403:723–724.
Zhang K, et al. HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics (2005) 21:90–103.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
