Bioinformatics Advance Access originally published online on May 6, 2005
Bioinformatics 2005 21(14):3187-3188; doi:10.1093/bioinformatics/bti485

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

GMCheck: Bayesian error checking for pedigree genotypes and phenotypes

Alun Thomas

Department of Medical Informatics and Center for High Performance Computing, University of Utah, 391 Chipeta Way, Suite D, Salt Lake City, UT 84108, USA


    Abstract

Summary: GMCheck uses graphical modeling to find the posterior probabilities of data errors given genotypes or phenotypes in a specified pedigree structure.

Availability: The Java classes and Javadocs pages for GMCheck can be obtained from bioinformatics.med.utah.edu/~alun, which also has information on use, parameter settings and file formats.

Contact: alun{at}genepi.med.utah.edu

Lange et al. (1988) and O'Connell and Weeks (1998) have detailed the importance of precise error checking for genetic marker data in pedigrees used for linkage analysis, and produced programs, MENDEL and PedCheck, for finding errors by detecting implied allele segregations that violate Mendelian rules of inheritance. Unlike previous programs, which use ad hoc heuristics and only part of the data available at a locus, both these programs calculate the posterior probabilities for genotypes given the entire pedigree structure and observed data. Each, however, has implementational drawbacks: MENDEL cannot compute posterior probabilities for loci with large numbers of alleles and approximates these with conditional posteriors; PedCheck is unable to deal with looped pedigrees efficiently (O'Connell and Weeks, 1998). Loki (Heath, 1998) is an efficient program that handles arbitrarily structured pedigrees but exits at the discovery of the first error. Merlin (Abecasis et al., 2001) uses exact computation to find the posterior probabilities of genotypes given multilocus data, but only for small pedigrees. SimWalk2 (Sobel et al., 2002) addresses the same problem using Markov chain Monte Carlo integration, which takes considerable time.

GMCheck is a new program that frames detecting and reporting errors from single locus genetic data as a Bayesian network and uses the methods of Lauritzen and Spiegelhalter (1988) for efficient exact calculation. The novel aspect is the introduction of an explicit indicator variable to represent the occurrence or the non-occurrence of an error for each datum.

The probability of a genotype configuration is represented as a product of simple factors which defines a Markov random field and corresponding Markov graph. This graph is triangulated using the heuristic greedy algorithm, and a sequence of maximal cliques that have the running intersection property is found using lexicographic search (Tarjan, 1985). This gives us an efficient order for calculating the sum of the probabilities of all possible genotype configurations, referred to as the collect evidence phase in expert systems literature, but which is recognizable in genetics as the peeling method of Elston and Stewart (1971) and Cannings et al. (1978). We can also reverse the peeling order and perform a distribute evidence step which computes the posterior marginal distributions for each clique and hence for each variable. Replacing the summation operation by maximization gives a dynamic program that will find a state of maximal posterior probability (Dawid, 1992).
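The relationship between the summation (collect evidence) pass and its maximization variant can be illustrated on a much simpler structure than a pedigree. The following is a minimal sketch, assuming a plain chain of variables with pairwise factors rather than a junction tree of cliques; the function names and factor tables are hypothetical and are not taken from GMCheck.

```python
from itertools import product


def peel_chain(factors, n_states, combine):
    """Eliminate variables of a chain x0 - x1 - ... - xk from the far end.

    factors[i][a][b] is the non-negative factor linking x_i = a to x_{i+1} = b.
    combine is sum (total probability mass) or max (best configuration value).
    """
    message = [1.0] * n_states
    for table in reversed(factors):
        message = [
            combine(table[a][b] * message[b] for b in range(n_states))
            for a in range(n_states)
        ]
    return combine(message)


def brute_force(factors, n_states, combine):
    """Reference answer obtained by enumerating every configuration."""
    values = []
    for x in product(range(n_states), repeat=len(factors) + 1):
        value = 1.0
        for i, table in enumerate(factors):
            value *= table[x[i]][x[i + 1]]
        values.append(value)
    return combine(values)


# Hypothetical factor tables for a chain of three binary variables.
FACTORS = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.3, 0.7]],
]

total_mass = peel_chain(FACTORS, 2, sum)   # sum over all configurations
best_value = peel_chain(FACTORS, 2, max)   # value of the best configuration
```

Passing `sum` or `max` as the combining operation is the only difference between the two passes, which mirrors the remark above that maximization turns the same elimination order into a dynamic program. Recovering the maximizing configuration itself would additionally require storing the arg max at each step.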

Let $x_i$ be the true unobserved genotype of the $i$-th individual, let $y_i$ be the corresponding observation. For each observation we have a binary variable $e_i$ indicating the presence or absence of an error. Let $x = \{x_i\}$, $e = \{e_i\}$. The factors of the Markov random field are

  • $\pi(x_i)$: The population frequency for genotype $x_i$.
  • $\tau(x_i \mid x_{f_i}, x_{m_i})$: The probability that a child inherits genotype $x_i$ from parents with genotypes $x_{f_i}$ and $x_{m_i}$ defined by the usual Mendelian rules.
  • $\rho(y_i \mid x_i)$: The probability that an individual with genotype $x_i$ has observation $y_i$.
  • $\rho^*(y_i \mid x_i, e_i)$: The observational error model which is $\rho(y_i \mid x_i)$ if $e_i = 0$, but constant if $e_i = 1$. Thus, if $e_i = 0$ there is no error and we use the usual penetrance function, whereas if $e_i = 1$ an error has occurred and the $i$-th observation is deemed uninformative.
  • $\varepsilon(e_i)$: The prior probability of an observational error.
Our graphical model is then defined by the product

$$f(x, e) = \prod_{i \in F} \pi(x_i) \prod_{i \notin F} \tau(x_i \mid x_{f_i}, x_{m_i}) \prod_{i} \rho^*(y_i \mid x_i, e_i)\, \varepsilon(e_i)$$

where $F$ is the set of founders of the pedigree. We first compute $\sum_x \sum_e f(x, e)$ and $\sum_x f(x, e = \{0, 0, \ldots\})$, whose ratio gives the probability that the data are error free. If this is less than a threshold value, we then find $\hat{x}$ and $\hat{e}$ such that $f(\hat{x}, \hat{e}) = \max_{x, e} f(x, e)$ to obtain a configuration of variables of maximal posterior probability. The values of $\hat{e}$ indicate the combination of genotypes most probably in error, and computing $\sum_x f(x, \hat{e}) / \sum_x \sum_e f(x, e)$ gives us the posterior probability of this. Finally, we compute posterior marginals and report genotypes with high posterior error probability and the most probable correct states.
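To make the ratio of sums concrete, the following minimal sketch evaluates the model by brute-force enumeration for a single father, mother and child trio. It is an illustration under stated assumptions, not GMCheck's algorithm: the biallelic locus, the equal allele frequencies, Hardy-Weinberg founder frequencies, a penetrance that is 1 only when observation and genotype match, and a uniform constant for the error case are all assumptions made here. Only the prior error probability of 0.01 comes from the text.

```python
from itertools import product

ALLELES = (1, 2)                 # hypothetical biallelic locus
FREQ = {1: 0.5, 2: 0.5}          # assumed allele frequencies
GENOTYPES = [(a, b) for a in ALLELES for b in ALLELES if a <= b]
PRIOR_ERROR = 0.01               # default prior error probability from the text


def pi(g):
    """Founder genotype frequency, assuming Hardy-Weinberg proportions."""
    a, b = g
    return FREQ[a] * FREQ[b] * (1 if a == b else 2)


def tau(child, father, mother):
    """Mendelian probability of the child genotype given both parents."""
    total = 0.0
    for pa in father:
        for ma in mother:
            if tuple(sorted((pa, ma))) == child:
                total += 0.25
    return total


def rho_star(obs, g, e):
    """Error model: exact-match penetrance if e == 0, constant if e == 1."""
    if e == 1:
        return 1.0 / len(GENOTYPES)   # assumed uniform constant
    return 1.0 if obs == g else 0.0


def eps(e):
    """Prior probability of the error indicator."""
    return PRIOR_ERROR if e == 1 else 1.0 - PRIOR_ERROR


def f(x, e, y):
    """Product of factors for a (father, mother, child) trio."""
    xf, xm, xc = x
    p = pi(xf) * pi(xm) * tau(xc, xf, xm)
    for yi, xi, ei in zip(y, x, e):
        p *= rho_star(yi, xi, ei) * eps(ei)
    return p


def prob_error_free(y):
    """Ratio of the sum over x with e = 0 to the sum over all x and e."""
    configs = list(product(GENOTYPES, repeat=3))
    total = sum(f(x, e, y) for x in configs for e in product((0, 1), repeat=3))
    clean = sum(f(x, (0, 0, 0), y) for x in configs)
    return clean / total


consistent = prob_error_free(((1, 1), (1, 1), (1, 1)))
mendelian_violation = prob_error_free(((1, 1), (1, 1), (2, 2)))
```

With a Mendelian violation no error-free configuration has positive probability, so the ratio is zero and the probability of at least one error is 1, as in the program output reported later for the example pedigree. Enumeration is exponential in pedigree size; it is used here only because the trio is small enough to check by hand.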

The time and storage needed for these computations are determined by the product of the number of states of variables in each clique of the Markov graph. In the case of zero-loop pedigrees these cliques correspond to parent–offspring triplets. If there are $a$ alleles at a locus, each genotype has $a(a + 1)/2$ states. A generic graphical modeling program would, therefore, need time and storage of order $O(a^6)$. However, for each parental-genotype pair there are at most four possible offspring genotypes. We take advantage of this to reduce the computational requirements to order $O(a^4)$. GMCheck will also compute probabilities for looped pedigrees but will behave as a generic graphical modeling program when dealing with cliques that are not parent–offspring triplets.
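The counting argument can be checked directly for small numbers of alleles. The sketch below, with hypothetical helper names, enumerates unordered genotypes, confirms that no parental pair yields more than four offspring genotypes, and compares the size of a full parent–offspring table with the number of entries that can be non-zero.

```python
from itertools import combinations_with_replacement


def genotypes(a):
    """All unordered genotypes at a locus with alleles 1..a."""
    return list(combinations_with_replacement(range(1, a + 1), 2))


def offspring(father, mother):
    """Set of child genotypes obtainable from one allele of each parent."""
    return {tuple(sorted((p, m))) for p in father for m in mother}


def clique_sizes(a):
    """Return (genotype count, full triplet table size, non-zero entries)."""
    g = genotypes(a)
    full = len(g) ** 3
    sparse = sum(len(offspring(f, m)) for f in g for m in g)
    return len(g), full, sparse


for a in (2, 4, 8):
    n, full, sparse = clique_sizes(a)
    print(a, n, full, sparse)
```

The full table grows as the cube of the genotype count, whereas the sparse count is bounded by four times its square, which is the source of the reduction from order $a^6$ to order $a^4$.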

To further speed up computations we consider only the alleles observed in the pedigree being checked, even though other alleles may occur in other pedigrees. Although this parsimonious approach can alter the posterior probabilities slightly, it greatly increases tractability.


[Figure: a fictitious pedigree of seven linked nuclear families.]

As an illustration, the figure shows a fictitious pedigree of seven linked nuclear families that is problematic for heuristic methods as all seven need to be considered jointly in order to detect the error. It is also problematic for programs that cannot handle loops. GMCheck gave the following output in 4.05 s on the author's laptop computer.

Pedigree (#1) locus (#1)

P(at least one error) = 1.0

Most probable individual(s) in error:

(#28) 1 28, with probability 0.8286

Individuals with high error probability:

Individual (#14) 1 14

Observation = 3 2

P(Error) = 0.077

Probable genotype:

P(2,3) = 0.922

Individual (#28) 1 28

Observation = 2 4

P(Error) = 0.842

Probable genotype:

P(1,2) = 0.364

P(1,1) = 0.258

P(2,4) = 0.157

P(2,2) = 0.105

This indicates that there must be at least one error and that it most probably lies with individual 28. Although the data could also be explained by an error for individual 14, this is far less probable. The posterior probabilities of likely genotypes are also given; since for individual 28 no genotype is much more probable than the others, this is a case where the observation might be deleted rather than corrected. The program has also been used on larger datasets: for example, a set of 8839 individuals in 132 pedigrees of up to 326 individuals, including a looped pedigree of 61 individuals, typed at 25 biallelic loci, was checked in 185 s.

The current default parameter values for GMCheck are set to report cases where the overall probability of error-free data is <0.75, and where the marginal probability of a particular genotype error is >0.05. The prior probability of an error is set at 0.01. All these values can be changed using command line arguments. These defaults were chosen informally based on experience with a small but diverse collection of datasets. For a more complete discussion of this issue see Douglas et al. (2002).

This approach provides exact computation of posterior error probabilities using all available genotyping data in pedigrees of arbitrary size and complexity. It is efficient, and for zero-loop pedigrees its requirements grow linearly with the size of the pedigree. It also handles looped pedigrees, although the resources required to do so can grow quickly if there are several intersecting loops. The output from the program is informative and should enable straightforward correction or deletion of unreliable data. Thus, heuristics, partial analyses or simulation methods for checking single locus data should be needed for only the most complex pedigrees.

If the genotypes of a particular individual over a broad range of loci are indicated to be in error this may point to problems in the pedigree data or a possible sample mix up. A full multilocus analysis would be a more reliable way of detecting this, however, and also of addressing the problem of individually uninformative loci, such as single nucleotide polymorphisms in linkage disequilibrium. This is still an open question for large pedigrees.


    Acknowledgments
 
This work was supported by NIH NIGMS grant R21 GM070710 to A.T.

Received on November 19, 2004; revised on April 29, 2005; accepted on May 3, 2005

    REFERENCES

    Abecasis, G.R., et al. (2001) Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101.

    Cannings, C., et al. (1978) Probability functions on complex pedigrees. Adv. Appl. Probab., 10, 26–61.

    Dawid, A.P. (1992) Applications of a general propagation algorithm for probabilistic expert systems. Stat. Comput., 2, 25–36.

    Douglas, J.A., et al. (2002) A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am. J. Hum. Genet., 66, 1287–1297.

    Elston, R.C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered., 21, 523–542.

    Heath, S.C. (1998) Generating consistent genotypic configurations for multi-allelic loci and large complex pedigrees. Hum. Hered., 48, 1–11.

    Lange, K., et al. (1988) Programs for pedigree analysis: MENDEL, FISHER and dGENE. Genet. Epidemiol., 5, 471–472.

    Lauritzen, S.L. and Spiegelhalter, D.J. (1988) Local computations with probabilities on graphical structures and their applications to expert systems. J. R. Stat. Soc. Ser. B, 50, 157–224.

    O'Connell, J.R. and Weeks, D.E. (1998) PedCheck: A program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet., 63, 259–266.

    Sobel, E., et al. (2002) Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet., 70, 496–508.

    Tarjan, R.E. (1985) Decomposition by clique separators. Discrete Math., 55, 221–232.

