Skip Navigation


Bioinformatics Advance Access originally published online on January 12, 2006
Bioinformatics 2006 22(6):771-772; doi:10.1093/bioinformatics/btk049
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/6/771    most recent
btk049v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Thomas, A.
Right arrow Articles by Camp, N. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Thomas, A.
Right arrow Articles by Camp, N. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Maximum likelihood estimates of allele frequencies and error rates from samples of related individuals by gene counting

Alun Thomas * and Nicola J. Camp

Department of Medical Informatics, University of Utah 391 Chipeta Way Suite D, Salt Lake City, UT 84108, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 REFERENCES
 

Summary: Graphical modeling is used to extend the gene counting method to compute maximum likelihood estimates of allele frequencies for samples of individuals related in extended pedigrees. Genotypes may be missing or partially observed, and error rates can be simultaneously estimated.

Availability: The Java classes and Javadocs pages for Formulacan be obtained from bioinformatics.med.utah.edu/~alun, which also has more information on its use and file formats.

Contact: alun{at}genepi.med.utah.edu

Few ideas from statistical genetics have been as widely accepted and adapted as C. A. B. Smith's gene counting method (Ceppellini et al., 1955; Smith, 1957). It was originally devised as an iterative procedure for estimating multinomial frequencies when only combined class counts are observable, the canonical example being the estimation of the frequencies of the alleles of the AB0 blood group locus from observed counts of blood types. Now more often known as the expectation and maximization, or EM, algorithm (Dempster et al., 1977), it has been applied to such genetic problems as estimating evolutionary trees (Thompson, 1975) and reconstructing haplotypes from diploid data (Thomas, 2003). More broadly it has been used in all manner of statistical problems in which the observed data are, at least conceptually, incomplete. The application described here is a very direct extension: estimating allele frequencies based on samples of genotypes from related individuals.

Although the naive approach of ignoring relationships and using raw sample frequencies gives unbiased estimates, for individuals related in a small number of extended pedigrees results can be unreliable with large variance because of correlations between genotypes due to common ancestry. This is particularly an issue when performing linkage or association analysis in pedigrees. These are sensitive to frequency errors: a rare allele shared by a group of relatives also sharing a genetic trait is evidence of association of the marker and trait loci. Overestimating the frequency of the allele will obscure this association. Conversely, an under estimate of an allele's frequency will cause false positive results. Reliable allele frequency estimates from the pedigrees under study are necessary. McPeek et al. (2004) showed that maximum likelihood estimation is best for this problem when it is tractable, and the method of Boehnke (1991) can be used to do this. The approach outlined here, however, has modeling and computational advantages as shown below.

We begin with initial estimates of the allele frequencies that may be the naive unbiased estimates. Given these frequencies, the pedigree structure and observed genotypes for some of the individuals, we can compute the marginal distributions of the genotypes of the founders of the pedigree. This is the E step of the EM algorithm and is done by constructing a graphical model and performing the usual forward–backward algorithm to obtain the marginal distributions of genotypes in the maximal cliques of the graph. From these the marginals for the individual founder genotypes are easily calculated. A clique is a subgraph in which all the vertices representing variables are joined to each other, and a decomposable graphical model has the property that the joint distribution on all variables is determined by the clique marginals (Lauritzen and Spiegelhalter, 1988). The M step is to find the maximum likelihood estimates of the allele frequencies by summing the weighted counts for the founder alleles. There is an implicit assumption that the founder alleles are randomly sampled from a wider population. The E and M steps are then repeated until convergence, which, in this implementation, is determined when the largest absolute between iteration difference for an allele frequency is less than a specified value, which is by default 0.000001.

This approach is, therefore, entirely analogous to the the original gene counting method but with the simple partition of allele counts for a phenotype according to the currently implied genotype frequencies being replaced by the more involved graphical modeling computation. For data on unrelated individuals, the method reduces to simple gene counting. The functions used in the implementing program, called Formula, are shared in large part with the error checking program Formula (Thomas, 2005), and like that application, the computational time and storage needed to process a locus with a observed alleles in a pedigree of n individuals with no loops is of order o(na4). Formula will also handle looped pedigrees, but the resources required will grow as at least o(a6) and faster for highly looped complex pedigrees. Thus, the program should be able to obtain estimates for loci with small numbers of alleles, and in particular single nucleotide polymorphisms, or SNPs, even on moderately looped pedigrees. For multi allelic loci in inbred pedigrees, the best linear unbiased estimator of McPeek et al. (2004) should provide a good alternative.

The ability to model errors is an advantage of the graphical modeling framework, and this is again particularly useful for SNPs as the large number of loci available and the comparatively low information at any single locus make detecting and correcting errors difficult. We can follow the method used by Formula and introduce explicit variables indicating the presence or absence of errors for each genotype call and incorporate a prior probability for error. As posterior marginals for founder genotypes allow re-estimation of population allele frequencies, so posterior marginals for error states can give updated estimates of the error rate. Thus, tolerance to errors and maximum likelihood estimation of error rates can very naturally be built in to the procedure. For a large number of alleles, however, there is a considerable computational cost for this because we are no longer able constrain the state space of possible genotypes to match the observed data. We have, therefore, arbitrarily limited error modeling to loci with 5 or fewer alleles. For other loci, any errors in the data that leave no possible consistent set of inheritances will cause the program to skip the locus, so checking with Formula or a similar program should be done as a preliminary step. The prior probability of error is set at 1%. It is assumed that genotyping errors do not depend on the value of the genotype or on occurrences of other errors. While this may not be truly realistic, experience with Formula shows that this is a sufficiently powerful and robust assumption.

One criticism of EM methods is that standard errors of estimates are not always available. However, as noted by C. A. B. Smith in the discussion of the paper by Dempster et al. (Smith, 1977), both the standard errors of the estimates and the rate of convergence of the algorithm are determined by the curvature of the likelihood surface. Thus, if Formula is the estimate of the multinomial probability p at the ith iteration, and Formula is the final estimate, we can track

Formula
the ratio of successive deviations from the eventual maximum, which the general theory of the EM method implies converges fairly quickly to some value {lambda}. From this we obtain an inflation factor for the usual standard error to give

Formula
where n is the number of founder alleles, or twice the number of founders of the pedigree.

Formula, is written in Java and used as follows:

Formula
where Formula and Formula are LINKAGE parameter and pedigree data files which together specify the locus parameters, pedigree structure and observed genotypes. Details of these file formats are given on the Internet at http://linkage.rockefeller.edu

The output file Formula has exactly the same format and information as Formula except that the allele frequencies are replaced by their new estimates. Thus, Formula can directly replace Formula in further analysis. Initial estimates, final estimates and standard errors are output to the screen as the program analyzes each locus.

The results given by this method match those given by the method of Boehnke (1991) as both maximize the same likelihood. The program Formula that implements that method uses the Elston and Stewart (1971) algorithm which is a special case of the forward part of the forward–backward algorithm used by Formula Formula. Formula, therefore, uses generic numerical techniques for optimization and evaluation of curvature. The combination of the forward–backward method, EM algorithm, and curvature from convergence rate make Formula more computationally efficient.

To give an indication of the performance of the program, it was run on a dataset of 59 individuals connected in a single extended pedigree with no loops and 20 founders. A microsatellite marker loci with 19 alleles, 14 of them occurring in the pedigree, was analyzed in 13.5 s on the author's laptop computer which has a 2.8 GHz CPU. The number of EM iterations was 13. In contrast Formula made 22 function evaluations in 215 s. The results are compared in Table 1. The estimates made by the two programs corresponded to 3 decimal places. There were, however, some discrepancies in the standard errors probably because of the sensitivity of the problem of computing a second derivative.


View this table:
[in this window]
[in a new window]
 
Table 1 Allele frequency estimates from Formula and Formula for a pedigree of 59 individuals and a locus with 19 alleles

 


    Acknowledgments
 
This work was supported by NIH NIGMS grant R21 GM070710 to A.T. and NIH NCI grants R03 CA099844 and K07 CA098364 to N.C.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on November 7, 2005; revised on January 6, 2006; accepted on January 9, 2006

    REFERENCES
 TOP
 ABSTRACT
 REFERENCES
 

    Boehnke, M. (1991) Allele frequency estimation from data on relatives. Am. J. Hum. Genet, . 48, 22–25[ISI][Medline].

    Ceppellini, R., et al. (1955) The estimation of gene frequencies in a random-mating population. Ann. Hum. Genet, . 20, 97–115[Medline].

    Dempster, A.P., et al. (1977) Maximum likelihood estimation from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39, 1–38.

    Elston, R.C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered, . 21, 523–542[CrossRef][ISI][Medline].

    Lauritzen, S.L. and Spiegelhalter, D.J. (1988) Local computations with probabilities on graphical structures and their applications to expert systems. J. Roy. Stat. Soc., B, 50, 157–224.

    McPeek, M.S., et al. (2004) Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics, 60, 359–367[CrossRef][Medline].

    Smith, C.A. (1957) Counting methods in genetical statistics. Ann. Hum. Genet, . 21, 254–276[ISI][Medline].

    Smith, C.A. (1977) Contribution to the discussion of the paper by professor Dempster et al. J. Roy. Stat. Soc. B, 39, 24–25.

    Thomas, A. (2003) GCHap: fast MLEs for haplotype frequencies by gene counting. Bioinformatics, 19, 2002–2003[Abstract/Free Full Text].

    Thomas, A. (2005) GMCheck: Bayesian error checking for pedigree genotypes and phenotypes. Bioinformatics, 21, 3187–3188[Abstract/Free Full Text].

    Thompson, E.A. Human Evolutionary Trees, (1975) , Cambridge Cambridge University Press.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/6/771    most recent
btk049v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Thomas, A.
Right arrow Articles by Camp, N. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Thomas, A.
Right arrow Articles by Camp, N. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?