Bioinformatics Advance Access originally published online on January 12, 2006
Bioinformatics 2006 22(6):771-772; doi:10.1093/bioinformatics/btk049
Maximum likelihood estimates of allele frequencies and error rates from samples of related individuals by gene counting
Department of Medical Informatics, University of Utah 391 Chipeta Way Suite D, Salt Lake City, UT 84108, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: Graphical modeling is used to extend the gene counting method to compute maximum likelihood estimates of allele frequencies for samples of individuals related in extended pedigrees. Genotypes may be missing or partially observed, and error rates can be simultaneously estimated.
Availability: The Java classes and Javadocs pages for
can be obtained from bioinformatics.med.utah.edu/~alun, which also has more information on its use and file formats.
Contact: alun{at}genepi.med.utah.edu
Few ideas from statistical genetics have been as widely accepted and adapted as C. A. B. Smith's gene counting method (Ceppellini et al., 1955; Smith, 1957). It was originally devised as an iterative procedure for estimating multinomial frequencies when only combined class counts are observable, the canonical example being the estimation of the frequencies of the alleles of the AB0 blood group locus from observed counts of blood types. Now more often known as the expectation and maximization, or EM, algorithm (Dempster et al., 1977), it has been applied to such genetic problems as estimating evolutionary trees (Thompson, 1975) and reconstructing haplotypes from diploid data (Thomas, 2003). More broadly it has been used in all manner of statistical problems in which the observed data are, at least conceptually, incomplete. The application described here is a very direct extension: estimating allele frequencies based on samples of genotypes from related individuals.
Although the naive approach of ignoring relationships and using raw sample frequencies gives unbiased estimates, for individuals related in a small number of extended pedigrees results can be unreliable with large variance because of correlations between genotypes due to common ancestry. This is particularly an issue when performing linkage or association analysis in pedigrees. These are sensitive to frequency errors: a rare allele shared by a group of relatives also sharing a genetic trait is evidence of association of the marker and trait loci. Overestimating the frequency of the allele will obscure this association. Conversely, an under estimate of an allele's frequency will cause false positive results. Reliable allele frequency estimates from the pedigrees under study are necessary. McPeek et al. (2004) showed that maximum likelihood estimation is best for this problem when it is tractable, and the method of Boehnke (1991) can be used to do this. The approach outlined here, however, has modeling and computational advantages as shown below.
We begin with initial estimates of the allele frequencies that may be the naive unbiased estimates. Given these frequencies, the pedigree structure and observed genotypes for some of the individuals, we can compute the marginal distributions of the genotypes of the founders of the pedigree. This is the E step of the EM algorithm and is done by constructing a graphical model and performing the usual forwardbackward algorithm to obtain the marginal distributions of genotypes in the maximal cliques of the graph. From these the marginals for the individual founder genotypes are easily calculated. A clique is a subgraph in which all the vertices representing variables are joined to each other, and a decomposable graphical model has the property that the joint distribution on all variables is determined by the clique marginals (Lauritzen and Spiegelhalter, 1988). The M step is to find the maximum likelihood estimates of the allele frequencies by summing the weighted counts for the founder alleles. There is an implicit assumption that the founder alleles are randomly sampled from a wider population. The E and M steps are then repeated until convergence, which, in this implementation, is determined when the largest absolute between iteration difference for an allele frequency is less than a specified value, which is by default 0.000001.
This approach is, therefore, entirely analogous to the the original gene counting method but with the simple partition of allele counts for a phenotype according to the currently implied genotype frequencies being replaced by the more involved graphical modeling computation. For data on unrelated individuals, the method reduces to simple gene counting. The functions used in the implementing program, called
, are shared in large part with the error checking program
(Thomas, 2005), and like that application, the computational time and storage needed to process a locus with a observed alleles in a pedigree of n individuals with no loops is of order o(na4).
will also handle looped pedigrees, but the resources required will grow as at least o(a6) and faster for highly looped complex pedigrees. Thus, the program should be able to obtain estimates for loci with small numbers of alleles, and in particular single nucleotide polymorphisms, or SNPs, even on moderately looped pedigrees. For multi allelic loci in inbred pedigrees, the best linear unbiased estimator of McPeek et al. (2004) should provide a good alternative.
The ability to model errors is an advantage of the graphical modeling framework, and this is again particularly useful for SNPs as the large number of loci available and the comparatively low information at any single locus make detecting and correcting errors difficult. We can follow the method used by
and introduce explicit variables indicating the presence or absence of errors for each genotype call and incorporate a prior probability for error. As posterior marginals for founder genotypes allow re-estimation of population allele frequencies, so posterior marginals for error states can give updated estimates of the error rate. Thus, tolerance to errors and maximum likelihood estimation of error rates can very naturally be built in to the procedure. For a large number of alleles, however, there is a considerable computational cost for this because we are no longer able constrain the state space of possible genotypes to match the observed data. We have, therefore, arbitrarily limited error modeling to loci with 5 or fewer alleles. For other loci, any errors in the data that leave no possible consistent set of inheritances will cause the program to skip the locus, so checking with
or a similar program should be done as a preliminary step. The prior probability of error is set at 1%. It is assumed that genotyping errors do not depend on the value of the genotype or on occurrences of other errors. While this may not be truly realistic, experience with
shows that this is a sufficiently powerful and robust assumption.
One criticism of EM methods is that standard errors of estimates are not always available. However, as noted by C. A. B. Smith in the discussion of the paper by Dempster et al. (Smith, 1977), both the standard errors of the estimates and the rate of convergence of the algorithm are determined by the curvature of the likelihood surface. Thus, if
is the estimate of the multinomial probability p at the ith iteration, and
is the final estimate, we can track
![]() |
. From this we obtain an inflation factor for the usual standard error to give
![]() |
, is written in Java and used as follows:
![]() |
and
are LINKAGE parameter and pedigree data files which together specify the locus parameters, pedigree structure and observed genotypes. Details of these file formats are given on the Internet at http://linkage.rockefeller.edu
The output file
has exactly the same format and information as
except that the allele frequencies are replaced by their new estimates. Thus,
can directly replace
in further analysis. Initial estimates, final estimates and standard errors are output to the screen as the program analyzes each locus.
The results given by this method match those given by the method of Boehnke (1991) as both maximize the same likelihood. The program
that implements that method uses the Elston and Stewart (1971) algorithm which is a special case of the forward part of the forwardbackward algorithm used by
.
, therefore, uses generic numerical techniques for optimization and evaluation of curvature. The combination of the forwardbackward method, EM algorithm, and curvature from convergence rate make
more computationally efficient.
To give an indication of the performance of the program, it was run on a dataset of 59 individuals connected in a single extended pedigree with no loops and 20 founders. A microsatellite marker loci with 19 alleles, 14 of them occurring in the pedigree, was analyzed in 13.5 s on the author's laptop computer which has a 2.8 GHz CPU. The number of EM iterations was 13. In contrast
made 22 function evaluations in 215 s. The results are compared in Table 1. The estimates made by the two programs corresponded to 3 decimal places. There were, however, some discrepancies in the standard errors probably because of the sensitivity of the problem of computing a second derivative.
|
| Acknowledgments |
|---|
This work was supported by NIH NIGMS grant R21 GM070710 to A.T. and NIH NCI grants R03 CA099844 and K07 CA098364 to N.C.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on November 7, 2005; revised on January 6, 2006; accepted on January 9, 2006
| REFERENCES |
|---|
|
|
|---|
Boehnke, M. (1991) Allele frequency estimation from data on relatives. Am. J. Hum. Genet, . 48, 2225[Web of Science][Medline].
Ceppellini, R., et al. (1955) The estimation of gene frequencies in a random-mating population. Ann. Hum. Genet, . 20, 97115[Medline].
Dempster, A.P., et al. (1977) Maximum likelihood estimation from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39, 138.
Elston, R.C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered, . 21, 523542[CrossRef][Web of Science][Medline].
Lauritzen, S.L. and Spiegelhalter, D.J. (1988) Local computations with probabilities on graphical structures and their applications to expert systems. J. Roy. Stat. Soc., B, 50, 157224.
McPeek, M.S., et al. (2004) Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics, 60, 359367[CrossRef][Medline].
Smith, C.A. (1957) Counting methods in genetical statistics. Ann. Hum. Genet, . 21, 254276[Web of Science][Medline].
Smith, C.A. (1977) Contribution to the discussion of the paper by professor Dempster et al. J. Roy. Stat. Soc. B, 39, 2425.
Thomas, A. (2003) GCHap: fast MLEs for haplotype frequencies by gene counting. Bioinformatics, 19, 20022003
Thomas, A. (2005) GMCheck: Bayesian error checking for pedigree genotypes and phenotypes. Bioinformatics, 21, 31873188
Thompson, E.A. Human Evolutionary Trees, (1975) , Cambridge Cambridge University Press.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


