Skip Navigation


Bioinformatics Advance Access originally published online on June 23, 2008
Bioinformatics 2008 24(17):1942-1948; doi:10.1093/bioinformatics/btn324
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/17/1942    most recent
btn324v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhang, H.
Right arrow Articles by Yang, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhang, H.
Right arrow Articles by Yang, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

PoooL: an efficient method for estimating haplotype frequencies from large DNA pools

Han Zhang 1,*, Hsin-Chou Yang 2 and Yaning Yang 1,*

1Department of Statistics and Finance, University of Science and Technology of China, Anhui 230026 and 2Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 

Motivation: Pooling DNA is a cost-effective alternative to individual genotyping method. It is often used for initial screening in genome-wide association analysis. In some studies, large pools with sizes up to several hundreds were applied in order to significantly reduce genotyping cost. However, method for estimating haplotype frequencies from large DNA pools has not been available due to computational complexity involved.

Methods: We propose a novel constrained EM algorithm, PoooL, to estimate frequencies of single-nucleotide polymorphism (SNP) haplotypes from DNA pools. A quantity called importance factor is introduced to measure the contribution of a haplotype to the likelihood. Under the assumption of asymptotic normality of the estimated allele frequencies and a system of linear constraints on haplotype frequencies the importance factor remains a constant in the iterative maximization process. The maximization problem in the EM algorithm is then formulated into a constrained maximum entropy model and solved by the improved iterative scaling method.

Results: Simulation study shows that our algorithm can efficiently estimate haplotype frequencies from DNA pools with arbitrarily large sizes. The algorithm works equally well for large pools with sizes up to hundreds or thousands and for pools with sizes as small as one or two individuals. The computational complexity of the PoooL algorithm is independent of pool sizes, and the computational efficiency for large pools is thus substantially improved over existing estimating methods. Simulation results also show that the proposed method is robust to genotype errors and population admixture.

Availability: http://staff.ustc.edu.cn/~ynyang/poool

Contact: zhanghan{at}mail.ustc.edu.cn; ynyang{at}ustc.edu.cn


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 
Pooling DNA is a cost-effective design for gene mapping (Norton et al., 2004; Risch and Teng, 1998; Sham et al., 2002). It is often used as a tool of initial genomic screening for informative alleles (Barcellos et al., 1997; Pearson et al., 2007; Zuo et al., 2006). Since haplotype is regarded as an information-rich carrier of linkage disequilibrium (LD) information across different single-nucleotide polymorphism (SNP) loci, estimating haplotype frequencies from pooled DNA has been investigated by many researchers (Ito et al., 2003; Kirkpatrick et al., 2007; Wang et al., 2003; Yang et al., 2003). A complete review of methods for haplotype inference for pooled and individual DNA data can be found in Niu, (2004).

To save genotyping cost, pooling DNA from large numbers of individuals is now a common strategy in multi-stage experiments (Kirkpatrick et al., 2007; Pearson et al., 2007). The main purpose of pooling large numbers of individuals is to screen alleles by allelotyping. But estimating haplotype frequencies or LD information is also of general interest with the pooled data. The results will provide information for multilocus association analysis, which was only investigated by a limited number of studies for pooled DNA (Pearson et al., 2007; Yang et al., 2006).

Almost all haplotype inference algorithms need to enumerate haplotypes that are compatible to the observed pooled genotypes. For a large pool size (say, pools with tens or hundreds of individuals), enumerating haplotypes, if not impossible, is computationally intensive, and is not viable on most current computer facilities. To overcome this difficulty, we propose a constrained EM algorithm, PoooL, to improve computational efficiency of the original EM algorithm. We introduce a system of linear constraints on allele frequencies and pairwise LD coefficients (or pairwise haplotype frequencies) by considering the mean and variance–covariance matrix of the observed genotypes in order to simplify the computation of the conditional expectation in the EM algorithm. Using the linear constraints, the EM algorithm can be transformed into a constrained maximum entropy model and solved by the improved iterative scaling method (IIS). The PoooL algorithm converges much faster than the original EM algorithm without losing much accuracy in estimating haplotype frequencies. The most prominent feature of the proposed method is that the computational load does not depend on pool sizes. Consequently, haplotype analysis of DNA pools with arbitrarily large sizes becomes possible with this approach.

While developing our method, we observed that, if the number of loci is small compared with the pool sizes, the haplotype information can be partially recovered from pairwise LD or pairwise haplotype information, which can be efficiently estimated from the variance–covariance matrix of pooled genotypes. We therefore compute the conditional expectation of the number of haplotypes subject to linear constraints on allele frequencies and pairwise haplotype frequencies (or equivalently, LD coefficients) by the mean and variance–covariance matrix of the pooled genotypes. In fact, for q SNP loci, there are q+q(q–1)/2=q(q+1)/2 such linear constraints for the 2q possible haplotypes. The means and variance–covariances capture the major part of the haplotype frequencies. Another characteristic of the PoooL algorithm is the use of the ‘importance factor’, which is introduced to measure the contribution of a specific haplotype to the likelihood. This factor, under the normality assumption for the estimated allele frequencies from pooled DNA, is shown to be determined by the linear constraints and remains a constant throughout the iterative updating procedure. Therefore, with the linear constraints, the importance factor does not need to be evaluated repeatedly. This property allows us to transform the maximization problem into a constrained maximum entropy model and to solve the optimization by an improved iterative scaling method.

A common problem for pooled DNA data is its higher percentage of missing values and various sources of measurement errors (Barratt et al., 2002). In our method, missing data are imputed by conditional probabilities of missing data given the observed data. Simulation results show that the imputation method works well for missing rate up to 20%. Results also show that measurement error rate up to 10% does not have significant influence on haplotype frequency estimation. Since population stratification is a common issue in genetic studies, we also investigate the influence of population admixture on the PoooL algorithm. The rest of this article is organized as follows. We first introduce our method in the next section. Then we evaluate the proposed method by simulation studies. The last section provides some discussion of the proposed ethod.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 
In this section, we will first introduce some notations and the normality assumption, then we briefly review the EM algorithm and describe our method. We will show that the conditional expectation of the number of a specific haplotype in the EM algorithm can be expressed as its frequency multiplied by a factor, which is defined as the importance factor of the haplotype. Under the normality assumption for the pooled genotypes and some linear constraints on the allele frequencies and pairwise haplotype frequencies, the importance factor is shown to be a constant and we then reformulate the maximization part of the EM algorithm into a constrained maximum entropy model and solve it by an improved iterative scaling method.

For q SNP loci, the total number of possible haplotypes is r =2q. Denote the two alleles at each locus by 1 and 0. Denote all the possible haplotypes by vectors h1, h2, ..., hr, where hj =(h1j, ..., hqj)' is the binary sequence expansion of number j–1. Let the frequency of haplotype hj be pj and denote p =(p1, ..., pr)'.

In this study, we use the counts of allele 1 as the measurement on pooled DNA, which can be converted from the estimated allele frequencies. Hereafter, we will refer to the counts of allele 1 as pooled genotype. Suppose that we have T such pools each with size N. For the i-th pool, let Ai=(Ai1,...,Aiq)' be the pooled genotypes on q loci (q≥2). Denote all the genotype observations of the T pools by A=(A1, ..., AT)'. Although we assume the pool sizes to be the same in the following derivations, our method can be applied to the situations that different pools have different sizes.

Let {omega} = ({omega}1,..., {omega}q)' be the vector of allele frequencies for allele 1's and {Sigma}0 = ({sigma}st)1≤s,t,≤q be the matrix of LD coefficients (or variance–covariance matrix) for the q loci. An appropriate estimate of {omega} is the sample mean Formula and the estimate of {Sigma}0 is the sample variance–covariance matrix Formula Our objective is to estimate the frequencies p of all r haplotypes from pooled genotypes A. In what follows, we will approximate the distribution of the pooled genotypes by a multivariate normal distribution


Formula 1

(1)
where µp = 2N{omega}, {Sigma}p = 2N{Sigma}0. If the pools are large in sizes, this normality approximation is guaranteed by the Central Limit Theorem. But, as we will see subsequently, this also provides a good approximation of the calculations even if pool sizes are small (see also Supplementary Material for some illustrations).

2.1 The EM algorithm
Before stating our method, we first review the classical EM algorithm for haplotype estimation. Let mij be the number of the j-th haplotype in pool i and let M={(m11,...,m1r), ...,(mT1,...,mTr)} be the complete dataset. Note that the complete data are not observable. The EM algorithm works through updating iteratively their conditional expectation on the pooled genotype data A. The complete log-likelihood function of M under the assumption of Hardy–Weinberg equilibrium is given by


Formula

By definition, {sum}Formulamij=2N. The EM algorithm works by iteratively updating haplotype frequency estimates. Denote the estimate in step t by Formula . In E-step, the expected log-likelihood given observations and current parameter estimate Formula is


Formula 2

(2)
In M-step, the maximizer of the above expected log-likelihood is


Formula 3

(3)
In computing the conditional expectation in (3), the EM algorithm enumerates the haplotype configurations that are compatible with the observed pooled genotypes A. This process is very computationally intensive for large pools and EM algorithm can only work when pool sizes are not large (e.g. N<10) due to current computer running-time limitations. Thus, an efficient algorithm for calculating the conditional expectation is needed. Surprisingly, when pool size N is large, there is a simple representation for the conditional expectation in (3), as shown below.

2.2 Importance factor
Calculation of the expected number of haplotypes that are compatible with the pooled genotypes, is the most time-consuming part of the EM algorithm. To calculate the conditional expectation more efficiently, we introduce a quantity called importance factor and factorize the conditional expectation into the product of haplotype frequency and the importance factor.

Let {delta}ijk=1 if the k-th haplotype in pool i is haplotype j, and {delta}ijk=0 otherwise. Define the importance factor for the j-th haplotype by


Formula 4

(4)
Under the normality assumption in (1), we show in Appendix A that the importance factor Rh depends on p only through {omega} and {Sigma}0. This property will be shown to be critical to our algorithm. We shall denote Rh=Rh({omega}, {Sigma}0).

Then the conditional expectation in the EM algorithm (3) can be rewritten as


Formula 5

(5)
where Formula and Formula are the sample mean and variance–covariance matrix based on haplotype frequencies Formula .

From the definition, the importance factor Rhj may be regarded as a measure for the contribution of the j-th haplotype hj to the observed data. In Appendix B, we show that the importance factor, Rh=Rh({omega}, {Sigma}0), has a simple approximation


Formula

when N is large and q is relatively small compared with N. Therefore, 1–Rh is the weighted distance of haplotype h from centroid {omega}.

2.3 The PoooL algorithm
Recognizing the computational complexity in assessing the conditional expectation in (2) and (3) of the EM algorithm, we propose to maximize (2) under some linear constraints on allele frequencies and pairwise haplotype frequencies by using the mean and variance–covariance matrix of pooled genotypes. Addition of the linear constraints is based on the following observations. First, the frequency of an allele can be obtained by summing up frequencies of all the haplotypes that contain the allele, and consequently there are q such linear equations for q SNP loci (Sham et al., 2002). Second, the joint probability of every pair of alleles is the sum of frequencies of all haplotypes that contain the two alleles. Since the allele frequencies can be estimated by marginal means for each locus and the pairwise joint allele frequencies can be estimated from pairwise covariances, we can naturally add constraints on the haplotype frequencies by the allele frequencies and pairwise covariances. Namely, the frequency of allele 1 at locus l is given by {omega}l = {sum}Formulahljpj, and the probability of the two-locus haplotype formed by alleles ‘1’ at locus l1 and l2 is Formula . Using a matrix representation, these facts are


Formula

where {Lambda}p=p and H=(h1, ... hr)'. The matrix of pairwise covariances or LD coefficients can be represented as {Sigma}0={eta}-{omega}{omega}'= H({Lambda}p-pp')H'.

These observations lead to the following constraints on p in maximizing (2)


Formula 6

(6)
where vector 1 is the vector with all components being 1 and Formula . For q loci, we have q linear constraints on the haplotype frequencies by the allele frequencies {omega} and q(q–1)/2 linear constraints by the pairwise joint probabilities {eta}.

Since the importance factor Rh depends on p only through {omega} and {Sigma}0, or equivalently, {omega} and {eta}, it is a constant under the constraints in (2), and the computational complexity can thus be significantly reduced. We denote its value by Formula under the constraints and the expected log-likelihood in (2) can be rewritten as


Formula 7

(7)
Notice that the converged solution, Formula , to (7) under the constraints (6) is also the solution to the following convex optimization problem


Formula 8

(8)
Let Formula , {varphi}=({varphi}1, ..., {varphi}r)', Formula . From (5) and the fact that {sum}i,jmij=2NT, we can see that {sum}RFormulajpj=1 and therefore Formula since Formula as N is large enough. Therefore {varphi} can be treated as a probability distribution. Denote the new version of {Omega} under {varphi}scale by {Omega}{varphi}. Then (8) is equivalent to the following maximization problem


Formula 9

(9)
which is thus a typical constrained maximum entropy (ME) model (Jaynes, 1957), the well-known convex optimization problem in natural language processing (NLP) and other relevant fields. The ME model can be solved by the generalized iterative scaling algorithm (GIS, Csiszár, 1975; Csiszár, 1989; Darroch and Ratcliff, 1972) or the improved iterative scaling algorithm (Berger et al., 1996; Della Pietra et al., 1997). In this article, we use the IIS algorithm to find Formula , thereby getting a solution of model (8) by setting Formula .

The IIS algorithm is an efficient algorithm in NLP and it can maximize an entropy function with numbers of constraints up to hundreds of thousands [e.g. Equation (6) for a moderate q] in a endurable time (Della Pietra et al., 1997). Furthermore, as a convex optimization problem, the IIS solution always converges to the unique solution in the constrained space (Della Pietra et al., 1997). Thus, our PoooL algorithm would not be trapped in local maxima and can be used for moderate numbers of loci (e.g. q~ 10).


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 
We demonstrate our method by simulating haplotypes using haplotype frequencies from three published datasets. We generate 2N haplotypes in each pool independently and combine pairs of haplotypes randomly to make genotypes for N individuals. Then, T pools are constructed and haplotype frequencies are estimated from the pooled data. We first show results for pooled data without missing data or genotype error, then missing data mechanism and genotype errors are introduced in simulations. In what follows, unless otherwise stated, the number of simulation replicates is 2000.

We first illustrate the advantage of our method by analyzing large pools formed from the FUSION data (the Finland–United States Investigation of NIDDM Genetics Study), namely the genotype data of chromosome 22 from the FUSION case-control study (Lin and Zeng, 2006; Valle et al., 1998; Zhang et al., 2007). There are five SNPs and 32 haplotypes for this data. Frequencies of non-rare haplotypes are estimated, respectively, within the case and control groups by the standard EM algorithm (the two columns entitled p in Table 1). In simulations, we randomly generate haplotypes from this distribution and form T =30 pools each with a size N =500 and estimate the haplotype frequencies using the PoooL program. Results are shown in Table 1. Note that in Table 1 and subsequent tables, haplotypes with true frequencies less than 0.01 are not listed.

Second, to examine our method for the case of large number of SNPs, we generate haplotypes according to the 10-locus haplotype frequencies (Table 2) of AGT gene considered in Yang et al., (2003). Similarly, we use these frequencies to generate T pools of size N. Haplotype frequency estimates from the PoooL program are shown in Table 2. Both Tables 1 and 2 show that our method can produce satisfactory estimates of the haplotype frequencis when the pool size is large, but there are slight biases (up to 0.04) in haplotype frequency estimates. As show below, when N or T increases or the number of loci decreases, the biases tend to become smaller.


View this table:
[in this window]
[in a new window]

 
Table 1. Haplotype frequency estimation for large DNA pools from the FUSION data (SD: standard error)

 
Third, we compare our method and the EM algorithm by using the 3-locus haplotype frequencies obtained from another dataset (Jain et al., 2002). This dataset contains 135 unrelated individuals genotyped at three SNPs. The haplotype frequencies are estimated by using an EM algorithm based on the individual genotyping data, and the estimates are regarded as the true haplotype frequencies. We generate 30 large pools using these frequencies with different large pool sizes N=10, 50, 100, 500 and small pool sizes N=2,3,4,5. The simulation procedure is the same as those in the previous examples. Table 3 illustrates the results. It can be found that the haplotype estimates by PoooL are reasonably close to the true values. As pools size N increases, the SDs of the estimates decrease. Nevertheless, if N ≥ 50, the SDs appear to be rather stable. Table 3 also lists the estimates from the EM algorithm for small pools N≤5. Even for the case of N=500, SDs of the estimates from the PoooL algorithm are greater than those of the EM algorithm. We think this as the price that one must pay in order to estimate haplotype frequencies from large pools. A surprising result is that PoooL can work equally well even when a pool size is as small as 2, for which the normal approximation may not be accurate enough.


View this table:
[in this window]
[in a new window]

 
Table 2. Haplotype estimates of 10-locus haplotype frequencies

 


View this table:
[in this window]
[in a new window]

 
Table 3. Haplotype frequency estimates from the PoooL algorithm and EM program for Jain's data (SDs are in parentheses)

 
Figure 1a displays estimation precision of the PoooL algorithm for a variety of pool sizes (10 to 10 000) under different numbers of DNA pools, in which the estimation precision is defined as the averaged Euclidean distance between the estimated haplotype frequencies and their true values, i.e. the precision is characterized by Formula . Given a fixed number of DNA pools, the estimation accuracy increases with the pool sizes because the normal approximation is more accurate as the pool size increases. However, when the pool size exceeds some threshold, the increment of precision becomes negligible. The Figure 1b compares the PoooL method and the EM method by using the Jain's data. Since the EM algorithm can only handle the situation of small pools, we make the comparison using T=30 pools with sizes N=1,2,...,6. From this figure, we can find that estimation precision of the EM algorithm slightly decreases with pool size N. The EM algorithm has better estimation accuracy than the PoooL algorithm but the discrepancies are small when N ≥ 2.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Estimation precision of the PoooL algorithm (a), x-axis is on a logarithmic scale with base 10) and comparison of the PoooL algorithm and the EM algorithm in terms of estimation precision (b) based on Jain's data.

 
We have shown the performance of our method by simulating DNA pools without missing data and measurement errors. However, in allelotyping pooled DNA, allele frequencies may not be estimated properly in some practical situations and the data are consequently missing or have high measurement errors. In our algorithm, missing data are imputed based on the observed pooled genotypes. Table 4 shows the results under two different settings of missing rates and measurement error rates. The missing rate, {alpha}, is the proportion of genotype data that are missing at random. The error rate is defined as the proportion of variance inflation due to measurement error, i.e. β=({sigma}Formula{sigma}Formula)/{sigma}Formula, where {sigma}Formula is the variance of allele frequency estimates when genotype errors are present and {sigma}Formula={omega}(1–{omega}) is the variance of allele frequency estimates when there is no error. The first set of parameters is {alpha}=10% and β=5%. The second set is {alpha}=20% and β=10%, which is taken to be larger than the typical values in real applications. Even for the latter case, simulation results show that these two factors do not affect the estimation procedure significantly (Table 4). For examples, the estimates for the cases of N=30 and N=50 are very similar.

Another common issue in genetic studies is population stratification. As Pe'er and Beckmann, (2003) demonstrated, different populations with different haplotype distributions may have the same allele frequencies. Therefore, individuals with similar allele frequencies may be from different populations. We examine this issue by mixing individuals from the main population A (Jain's data in Table 3) and a minor population B (Column 3 in Table 5). These two populations have the same allele frequencies (minor allele frequencies are 0.47, 0.11, 0.08) but different haplotpye frequencies (Columns 2–3 in Table 5). The total number of pools is taken to be T=30. Each pool has N=5 individuals. We take the proportion, {tau}B, of population B to be 0.0, 0.1 and 0.2. Table 5 illustrates that the biases of haplotype frequency estimates are generally small when the proportion of population B is small. The maximum biases are, respectively, 0.022, 0.029, 0.037 for {tau}B=0.0, 0.1 and 0.2 in Table 5. More results can be found in Supplementary Material.


View this table:
[in this window]
[in a new window]

 
Table 4. Haplotype frequency estimates by using PoooL under different missing rates and error rates based on Jain's data (SDs of the estimates are in parentheses)

 


View this table:
[in this window]
[in a new window]

 
Table 5. Haplotype frequency estimates when a proportion, {tau}B, of individuals from population B is admixed in population A

 
The advantage of our method is its computational efficiency in estimating haplotype frequencies from large pools. For example, for the FUSION data with 5 loci, 30 pools of sizes 500 and 2000 replicates of simulations, the PoooL program takes about 1 h on an Intel(R) 2.2 GHz notebook PC with 2 GB RAM. The pool size is irrelevant in our algorithm and it is order N! more efficient than the classical EM algorithm for large pools. The computing time of PoooL algorithm is not affected by pool size N and is linearly dependent on the number of pools T.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 
Pooling a large number of individual DNA samples is now a frequently used design in genome-wide association studies, principally used as an initial allelic screening method. With the pooled DNA data available, many researchers may also want to estimate haplotype frequencies and to perform LD analysis. However, estimation of haplotype frequencies from such data was not possible for pools with more than 10 individuals. It is thus challenging to develop methods appropriate for haplotype inference from multi-locus large pooled data. In this study, we propose an efficient algorithm, PoooL, to solve the problem of estimating SNP haplotype frequencies from pooled DNA data. The method allows for missing data and is shown to be quite robust to measurement error.

The basic assumption of our approach is the normality assumption of the observed pooled genotypes. This is guaranteed by the Central Limit Theorem in statistical theory. Simulation studies show that this assumption provides a satisfactory description of the observed data (Supplementary Material). We also introduce a quantity, importance factor, to characterize the importance of a specific haplotype in pooled DNA data, which turns out to be the key to our method. Although we propose the algorithm for large pools, it is shown that it also works well for small pools. For small pools, the haplotype frequency estimates from our method are comparable to those estimated from the standard EM algorithm but with slightly larger biases and variances. This is because the normality approximation of pooled genotypes is not accurate enough when pool sizes are small.

By constraining the allele frequencies and pairwise haplotype frequencies, the importance factor does not rely on the unknown parameters of haplotype frequencies. We are hence able to translate the constrained EM model into the framework of the standard constrained maximum entropy model, which can be solved efficiently by the IIS algorithm. As a classical convex optimization method for an ME model, the IIS algorithm and thus our PoooL algorithm produces the global optimal solution in the constrained space.

The outstanding feature of the PoooL algorithm is its computational efficiency for large pools. Pool size does not affect the computing load of our method and therefore one can estimate haplotype frequencies from pools with arbitrarily large sizes. In practice, however, pooling too many individual DNAs (e.g. more than 1000) is not feasible because of the limitation of the instrument sensitivity and operational difficulties. But, pooling of tens to hundreds of individuals is a common practice in biomedical studies. Simulation results show that our method can produce satisfactory estimates for pools with relatively large sizes and reasonable error rates. Due to its computing efficiency, our method can be readily applied to genome-wide association analysis with large DNA pools.

Similar to most of other haplotype estimation methods, a major limitation of the PoooL algorithm is that it cannot work properly when the number of loci is large. However, the sliding window method (Yang et al., 2006) can be easily incorporated into the PoooL algorithm when the number of loci is large. Another possible method is the partition–ligation method (Niu et al., 2002), which can be used to weld together pieces of short haplotypes. In this study, we consider haplotype frequency estimation from pooled genotype data. In addition to loss of individual genotype information , another disadvantage of DNA pooling is its relatively high measurement error (Kirkpatrick et al., 2007). In practice, pooled genotypes are obtained from intensity measurements and which are known to be subject to preferential amplification/hybridization effect and various sources of variation such as error in PCR (Polymerase Chain Reaction), pool formation and allele frequency estimation (Barratt et al., 2002). We do not differentiate these different sources of error and assume an error rate that is proportional to pool sizes in our simulation studies. However, different types of error may have different effects on haplotype frequency estimation. We will explore these issues in future studies.


    APPENDIX A
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 
CALCULATION OF THE IMPORTANCE FACTORS UNDER NORMAL APPROXIMATION
Notice that P(Ai|{delta}ij1=1, p) is the conditional probability of observation Ai given the first haplotype in pool i is haplotype j, which is equivalent to the non-conditional probability of artificial data AFormula=Ai-hj. Thus, as N{infty}, we have


Formula

where Formula and Formula Since {omega}p/2N and {Sigma}0={Sigma}p/2N, then Rh can be expressed as


Formula

Therefore Rh depends on p only through {omega} and {Sigma}0.


    APPENDIX B
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 
CHARACTERIZATION OF THE IMPORTANCE FACTORS UNDER NORMAL APPROXIMATION
Applying Taylor expansion, we have


Formula 10

(B1)
as N->{infty}. When N is large enough, from largesample theory, we know Formula and Formula so that the third term on the right side of (B1) disappears and the last term is approximately equal to Formula when N is large compared with q. Hence Rh can be approximated by


Formula

when pool sizes are relatively large. Furthermore, we can see that


Formula

The difference measures which of hi and hj is closer to the centroid {omega}. The following proposition shows that, if the loci are in linkage equilibrium and have identical allele frequencies, the difference of importance factors for two haplotypes is proportional to the difference of their frequencies. This implies that, for the special situation, the importance factor measures the importance of a haplotype in term of its frequency.

Proposition.
If all the qloci are independent and have an identical allele frequency {omega}0, then


Formula

where {Delta}ij is the number of different loci of haplotypes hi and hj.

Proof.
It is easy to see that if the conditions hold, {Sigma}0={sigma}FormulaI, where {sigma}Formula={omega}0(1–{omega}0). Then


Formula

where


Formula

and


Formula

Denote {Gamma}Formula={t1, ..., tl}, {Gamma}Formula={s1,..., sk}, then


Formula

Using the following identity


Formula

it can be shown that


Formula



Formula

These two equations imply that


Formula

which completes the proof.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 
We thank Dr Anthony Y.C. Kuk for his helpful discussion and suggestions and Jianming Wu for his technical supports in programming. We are indebted to two anonymous referees for valuable comments which have substantially improved the quality of this work.

Funding: H.Z. and Y.Y. are supported by China NSF Grant and Chinese Academy of Science Grant. H.C.Y. is partially supported by a National Science Council grant of Taiwan.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on January 19, 2008; revised on June 20, 2008; accepted on June 20, 2008

    References
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 APPENDIX B
 ACKNOWLEDGEMENTS
 References
 

    Barcellos LF, et al. g of disease loci, by use of a pooled DNA genomic screen. Am. J. Hum. Genet. (1997) 61:734–747.[Web of Science][Medline]

    Barratt BJ, et al. Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann. Hum. Genet. (2002) 66:393–405.[CrossRef][Web of Science][Medline]

    Berger A, et al. A maximum entropy approach to natural language processing. Comput. Lingui. (1996) 22:39–71.[CrossRef]

    Csisaár I. I-divergence geometry of probability distributions and minimization problems. Ann. Prob. (1975) 3:146–158.[CrossRef]

    Csiszár I. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. Ann. Stat. (1989) 17:1409–1413.[CrossRef]

    Darroch JN, Ratcliff D. Generalized iterative scaling for log-linear models. Ann. Math. Statist. (1972) 43:1470–1480.[CrossRef]

    Della Pietra S, et al. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell. (1997) 19:1–13.

    Ito T, et al. Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am. J. Hum. Genet. (2003) 72:384–398.[CrossRef][Web of Science][Medline]

    Jain S, et al. Angiotensinogen gene polymorphism at -217 affects basal promoter activity and is associated with hypertension in African–Americans. J. Biol. Chem. (2002) 277:36889–36896.[Abstract/Free Full Text]

    Jaynes ET. Information theory and statistical mechanics. Phys. Rev. (1957) 106:620–630.[CrossRef][Web of Science]

    Kirkpatrick B. HaploPool: improving haplotype frequency estimation through DNA pools and phylogenetic modeling. Bioinformatics (2007) 23:3048–3055.[Abstract/Free Full Text]

    Lin D, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. J. Am. Stat. Assoc. (2006) 101:89–104.[CrossRef][Web of Science]

    Niu T, et al. Bayesian haplotype inference for multiple linked single–nucleotide polymorphisms. Am. J. Hum. Genet. (2002) 70:157–169.[CrossRef][Web of Science][Medline]

    Niu T. Algorithms for inferring haplotypes. Genet. Epidemiol. (2004) 27:334–347.[CrossRef][Web of Science][Medline]

    Norton N, et al. DNA pooling as a tool for large-scale association studies in complex traits. Ann. Med. (2004) 36:146–152.[CrossRef][Web of Science][Medline]

    Pearson JV, et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am. J. Hum. Genet. (2007) 80:126–139.[CrossRef][Web of Science][Medline]

    Pe'er I, Beckmann JS. Resolution of haplotypes and haplotype frequencies from SNP genotypes of pooled samples. In: Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB2003) (2003) Berlin: ACM Press. 237–246.

    Risch N, Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res. (1998) 8:1273–1288.[Abstract/Free Full Text]

    Sham P, et al. DNA pooling: a tool for large-scale association studies. Nat. Rev. Genet. (2002) 3:862–871.[CrossRef][Web of Science][Medline]

    Valle T, et al. Mapping genes for NIDDM: design of the Finland-United States Investigation of NIDDM Genetics (FUSION) study. Diabetes Care (1998) 21:949–958.[Abstract]

    Wang S, et al. On the use of DNA pooling to estimate haplotype frequencies. Genet. Epidemiol. (2003) 24:74–82.[CrossRef][Web of Science][Medline]

    Yang HC, et al. PDA: pooled DNA analyzer. BMC Bioinformatics (2006) 7:233.[CrossRef][Medline]

    Yang Y, et al. Efficiency of SNP haplotype estimation from pooled DNA. Proc. Natl. Acad. Sci. USA (2003) 100:7225–7230.[Abstract/Free Full Text]

    Zhang H, et al. Statistical methods for haplotype-based matched case-control association studies. Genet. Epidemiol. (2007) 31:316–326.[CrossRef][Web of Science][Medline]

    Zuo Y, et al. Two-stage designs in case-control association analysis. Genetics. (2006) 173:1747–1760.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Genome ResHome page
J. E. Craig, A. W. Hewitt, A. E. McMellon, A. K. Henders, L. Ma, L. Wallace, S. Sharma, K. P. Burdon, P. M. Visscher, G. W. Montgomery, et al.
Rapid inexpensive genome-wide association using pooled whole blood
Genome Res., November 1, 2009; 19(11): 2075 - 2080.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/17/1942    most recent
btn324v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhang, H.
Right arrow Articles by Yang, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhang, H.
Right arrow Articles by Yang, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?