Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(4):464-470; doi:10.1093/bioinformatics/bti027
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 4 © Oxford University Press 2005; all rights reserved.
Improving genome annotations using phylogenetic profile anomaly detection
The Eli & Edythe L. Broad Institute, Massachusetts Institute of Technology and Harvard University 320 Charles Street, Cambridge, MA 02141, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: A promising strategy for refining genome annotations is to detect features that conflict with known functional or evolutionary relationships between groups of genes. Previous work in this area has been focused on investigating the absence of housekeeping genes or components of well-studied pathways. We have sought to develop a method for improving new annotations that can automatically synthesize and use the information available in a database of other annotated genomes.
Results: We show that a probabilistic model of phylogenetic profiles, trained from a database of curated genome annotations, can be used to reliably detect errors in new annotations. We use our method to identify 22 genes that were missed in previously published annotations of prokaryotic genomes.
Availability: The method was evaluated using MATLAB and open source software referenced in this work. Scripts and datasets are available from the authors upon request.
Contact: tarjei{at}broad.mit.edu
| INTRODUCTION |
|---|
|
|
|---|
The first step in understanding a newly sequenced genome is to identify its genes and their putative functions. Unfortunately, current homology-based annotations of protein-coding genes remain far from perfect, even for relatively simple bacterial genomes (Brenner, 1999; Devos and Valencia, 2001; Bocs et al., 2002; Iliopoulos et al., 2003). Broadly speaking, their shortcomings can be divided into two categories. First, as many as 40% of the predicted genes in most bacterial genomes remain annotated as hypothetical or unknown proteins. Second, even for genes that do have significant database matches to known genes, systematic errors in the annotation methods themselves, such as software defects, unrealistic statistical models and failure to distinguish functional genes disrupted by sequencing errors from pseudogenes, can lead to annotation inaccuracies. Here, we describe a statistical method designed to detect the second type of errors.
A promising strategy for refining genome annotations is to take into account functional or evolutionary relationships between groups of genes. Genomes continually evolve through gene transfer, duplication and loss. However, the presence or absence of specific genes in a genome is not arbitrary, but reflects functional and structural dependences between the proteins they encode. For example, metabolic enzymes are the building blocks of biochemical pathways that would be inactive, and perhaps even fatal to the organism, if incomplete. Other genes encode connected and interdependent subcomponents of large complexes, such as ribosomes and flagella. Since incomplete, non-functional metabolic pathways and cellular complexes confer little evolutionary benefit to an organism, groups of functionally or structurally linked genes tend to be either present in their entirety or completely absent (Pellegrini et al., 1999). As a consequence, unexpected co-occurrence patterns (phylogenetic profiles after Pellegrini et al.) may point toward systematic errors in an annotation. If only one gene from a group of co-occurring genes is missing in an annotation, then there is reason to suspect that it is a false negative, which has somehow been overlooked by the annotation system. Similarly, if only one gene from a co-occurring group is present, then it may be a false positive.
Several groups have demonstrated the utility of gene dependences in improving genome annotations, using varying degrees of phylogenetic and functional information. Methods for inferring missing genes from well-studied gene families were developed long before the first complete genomes were sequenced (Goodman et al., 1979, Page, 1994). These methods use character-based approaches to find genes whose presence or absence are inconsistent with a known phylogeny. More recently, Natale et al., (2000) reported finding novel genes in previously annotated bacterial genomes by examining specific unexpected phylogenetic patterns, such as a gene being present in all but one of the genomes, in the Clusters of Orthologous Groups (COG) database. Karp, (2001) has developed a symbolic framework for inferring missing genes in known metabolic pathways, relying primarily on functional dependences rather than phylogenetic information.
In this work, we present Phylogenetic Profile Anomaly Detection (PPAD), a computational framework for annotation refinement. We first generate a probabilistic model of expected phylogenetic profiles from the growing databases of existing genome annotations. The model is encoded as a Bayesian network (Pearl, 1988) which has been shown to be a principled and powerful tool for integrating heterogeneous information and for modeling uncertainty and noise in biological systems (Delcher et al., 1993; Friedman et al., 2000; Hartemink et al., 2001; Stuart et al., 2003; Troyanskaya et al., 2003). The model essentially encodes a set of probabilistic rules that predict the presence or absence of a gene, given the presence or absence of a small number of other genes. We then use this dependency model to identify statistical anomalies in new annotations that signal the presence of unexpected components in their phylogenetic profiles or, potentially, systematic errors.
Our PPAD method extends and generalizes the above-mentioned methods in the following ways:
- The inputs and outputs are probabilistic, allowing explicit modeling of uncertainties.
- The probabilistic dependency model is better suited for modeling noisy biological correlations than logical rules or ad hoc pattern matching.
- All or part of the dependency model can be efficiently learned directly from existing annotations and other relevant datasets.
We show through simulations that our method can reliably improve on the accuracy of genome annotations to which artificial errors have been added. We also identify 22 previously missed genes by applying the PPAD method to bacterial genome annotations from GenBank.
| METHODS |
|---|
|
|
|---|
Problem domain and definitions
We define the genome annotation task as the problem of assigning descriptive labels to every predicted gene in a genome. We denote the set of available labels G = {G 1, G 2,...,G N ,NA}. The NA label is used to designate no assignment for genes for which no other label in G is applicable. In the context of homology-based annotations, the labels are the names of known genes or gene families that have been identified or characterized previously.
In this work, we use the COG database as our reference source of descriptive labels (Tatusov et al., 2003) although the method applies equally well to any other protein classification scheme. A COG is defined as a set of predicted orthologous genes identified in three or more species, where at least one species is a prokaryote. The database contains COGs covering 25 major functional categories, as well as curated annotations of 50 bacterial genomes. We focus on the annotation of genes with information processing and storage functions, such as transcription, translation and replication. There are 731 such COGs, and thus in our notation N = 731 and G i corresponds to the i-th COG in functional categories A, B, J, K and L, as defined in the COG database (http://www.ncbi.nlm.nih.gov/COG/new/).
For a genome we wish to annotate, we define a binary vector of length N, where the i-th entry is 1 if at least one gene in the genome is best described using label G i , and 0 otherwise. This vector, denoted
, is equivalent to a phylogenetic profile, as first introduced by Gaasterland and Ragan, (1998) and Pellegrini et al., (1999). A genome annotation, which assigns labels to each gene in the genome, implicitly defines a phylogenetic profile indicating which of the available labels in G have been used in that annotation. This phylogenetic profile can be evaluated against a probabilistic model of dependences between genes to identify potential errors and inconsistencies.
A Bayesian network model of gene dependences
We utilize a Bayesian network (Pearl, 1988; see reviews in Charniak, 1991; Cowell, 1998) to capture and represent dependences between gene families as we observe them in a set of trusted phylogenetic profiles, D. The nodes in the Bayesian network are a set of binary random variables
, corresponding directly to the entries of the phylogenetic profiles introduced above (Fig. 1a). Edges between nodes describe a directed acyclic graph and define a set of conditional independence relations between the binary variables. The probability distribution associated with each random variable depends only on the parent nodes in the graph (Fig. 1b). The complete Bayesian network B D therefore represents a probability distribution over all possible phylogenetic profiles, given D:
![]() |
|
The structure and parameters of a Bayesian network that captures the dependences in D can be estimated using a variety of learning algorithms. Here, we use a greedy hill-climbing approach, as implemented by the LibB software package (N.Friedman and G.Elidan, http://www.cs.huji.ac.il/labs/compbio/LibB/). Starting from an unconnected network, the learning algorithm computes which single structural modification (edge addition, removal or inversion) will optimize the Bayesian Information Criterion (BIC) score. The BIC score of a network is the log-likelihood of the training data, given the network, minus a complexity penalty to prevent overfitting. The penalty depends on d, the number of parameters in the model:
![]() |
The learning algorithm iteratively applies the best modification to the Bayesian network until the BIC score converges. Although this greedy approach is not guaranteed to find a globally optimal model, it tends to perform well in practice (Heckerman et al., 1995). The end result of this learning procedure is a Bayesian network that assigns low probabilities to phylogenetic profiles are inconsistent with the gene dependencies observed in the training set. Thus, if the resulting phylogenetic profile for a newly annotated genome is deemed highly unlikely by our trained Bayesian network, then either the new genome is very different from anything we have seen before or the genome annotation is flawed.
Note that we explicitly assume that each phylogenetic profile in the training set is equally informative, and that the training set is a random sampling from some underlying probability distribution of correct annotations. This is a naive assumption that simplifies the learning procedure, but which may result in uninformative predictions if the species included in the training set are not sufficiently diverse. More complex variants of PPAD can be developed to take full advantage of known phylogenetic relationships between the species included in the training set.
Detecting potential errors in a phylogenetic profile
To detect potential errors in a genome annotation, we consider the phylogenetic profile
implied by the annotation. We then make explicit our confidence in the entries of this initial phylogenetic profile in terms of probabilities. In general, each assignment in
can have its own degree of confidence, but in this work we use a simple two-parameter model, where
is the probability that a label which is called present in the initial annotation is actually absent (the assumed false positive rate) and ß is the probability that a label which is called absent is actually present (the assumed false negative rate):
![]() |
By incorporating our confidence in the initial phylogenetic profile as virtual evidence (Pearl, 1988) we can compute the most probable (MPE) phylogenetic profile
using this information and B D as the Bayesian prior probability distribution over possible phylogenetic profiles:
![]() |
We find an exact solution to this optimization problem using the junction tree inference algorithm (Dawid, 1992) as implemented in the Bayesian Network Toolbox for Matlab (Murphy, 2001) and then compare each entry in
to the corresponding entry in
.
Discordances between the most probable and the initial phylogenetic profiles represent features in the initial profile that are statistically anomalous according to our Bayesian network model of gene dependencies, and may therefore correspond to annotation errors. For example, the presence of a label in the MPE profile, but not in the initial profile, signals that the annotation pipeline may have missed that label. However, it is important to note that it is not possible to distinguish between annotation errors and unexpected evolutionary events a priori. The statistically anomalous presence or absence of that particular label may also be due to a recent gene loss event in the species or strain we are annotating, or it may simply be a result of the species being significantly different from the set of species from which our Bayesian network was learned. Any discordance is therefore a hypothesis that can be tested by re-evaluating the relevant evidence and, if necessary, follow-up experiments.
| RESULTS |
|---|
|
|
|---|
Improvement over simulated annotations
We evaluated the ability of the PPAD method to improve upon inaccurate initial annotations by performing a series of simulated leave-one-out tests. We induced a Bayesian network from the curated annotations in the COG database, leaving one species out of the training set. If the database contained annotations of more than one strain of the same species, we removed all. We then simulated inaccurate annotations by randomly adding and removing entries from the phylogenetic profiles of the species that were left out. Finally, treating the original COG annotations as correct, we calculated the pairwise number of differences (the Hamming distance) between them, and the simulated profiles and the computed MPE profiles. We repeated this process 20 times for each species and for each set of parameters.
The PPAD method improved upon the simulated annotations in 582 of 600 trials (97%), where either the false positive or the false negative rate of the initial profile was >0.01. For each of the species listed in Table 1, the first row of column A shows the mean increase in the number of correct COG calls in the MPE profiles relative to the simulated initial profiles. With false positive and false negative rates of 0.1 in the initial profiles, we saw approximately 40 corrected COG calls for all five species, which is 5.5% of the 731 COGs that were considered. All positive differences are statistically significant (P < 103, MannWhitney U test). We obtained similar results for the species that are not shown here.
|
We note that the PPAD method is dependent on accurate error rate estimates. The column B in Table 1 shows the results of setting
and ß to values that differed significantly from the actual error rates. In these cases we saw little or no improvement in the MPE profiles. In practice,
and ß can be estimated from the significance values returned from the sequence alignment tools used to compute the initial annotation. The assumed error rates can also be increased or decreased to obtain more or less conservative MPE profiles.
Improvement over nearest-relative profiles
We also compared the MPE profiles from our leave-one-out tests to the most closely related species present in the corresponding training sets. Given the amount of genomic variation across the bacterial kingdom, it is not immediately evident that comparing an initial annotation to the MPE profile is better than simply comparing it to the closest sequenced relative, which is a common practice. If MPE profiles tend to make predictions consistent with averaged genomes, they might not fit any real species particularly well. We found that this was not the case.
The MPE profiles were more similar to the unmodified correct COG annotations than to the most closely related species in the training set in every one of 600 simulations where either the false positive or the false negative rate of the initial profile was <0.1. For all five species shown in Table 1 the difference between the MPE profile and the corresponding nearest-relative profile increased with decreasing error rates (column A). This was true even with inaccurate error rate estimates (column B).
Identification of 22 new genes
We searched for annotation errors in the COG database by repeating our leave-one-out tests, but without adding simulated noise to the initial annotations. Since the COGs are themselves defined as a part of the annotation process (Tatusov et al., 1997) we expected very few or no false positive assignments. However, the input peptide sets might not have been complete. To search for the evidence of missed genes, we therefore assumed a very low false positive rate (
= 0.001) and a moderately high false negative rate (ß = 0.1).
The PPAD method predicted the presence of a total of 244 additional genes in the 50 curated annotations, a predicted increase of 2% for the number of genes in these species. We examined these predictions by using TBLASTN to align known members of the missing COGs to the complete genomic sequence of the genomes from which they were predicted to be missing. If a significant alignment was found, we determined whether it correspond to a complete open reading frame, and compared its location to the GenBank annotation of the genome to see if it overlaps a known feature.
In the end, we were able to confirm 22 annotation errors across 10 bacterial genomes using TBLASTN sequence alignments alone. This set, as shown in Table 2, was dominated by ribosomal proteins. In particular, we found nine missing ribosomal genes in Agrobacterium tumefaciens, suggesting a systematic underdetection of ribosomal genes in the original annotation of this genome (Goodner et al., 2001). A similar pattern was found by Natale et al., (2000) in their analysis of an earlier version of the COG database, suggesting that underdetection of ribosomal genes may be a systematic error in several annotation systems, perhaps due to their short length.
|
We also found that a number of the remaining predicted genes were present in the examined genomes as pseudogenes and consequently left out of the COG database, which only catalogs functional genes. For Mycobacterium leprae, a species known to have undergone extensive gene loss (Cole et al., 2001) we found that six out of eight genes predicted to be present were pseudogenes annotated elsewhere. Since detectable pseudogenes are likely to have been deactivated in the recent evolutionary history, they may be as statistically anomalous as annotation errors, but may also be actual genes that have been obscured by sequencing errors or strain polymorphisms. Additional laboratory work will be required to distinguish these alternatives. Finally, we note that since our Bayesian network only modeled a subset of the COGs, we expect that there may be additional missing genes in other functional categories.
The set of 244 PPAD predicted genes was significantly enriched for confirmable annotation errors. By repeating the TBLASTN analysis for 5 randomly picked sets of 244 genes absent in the initial COG annotations, we could confirm only a total of 7 missing genes (P < 1012 by Fisher's exact test), of which 5 were also found using PPAD. The two genes not found using PPAD were COG2094 and COG2167 in Corynebacterium glutamicum.
| DISCUSSION |
|---|
|
|
|---|
We have described a methodology for improving whole-genome annotations using a probabilistic model of gene co-occurrence patterns. Unlike earlier ad hoc analyses of unexpected phylogenetic profiles, the PPAD method is formulated within a principled probabilistic framework. PPAD can be incorporated into existing annotation pipelines to spot potential systematic errors and assist manual annotators. The fact that the annotations we examined and improved upon were published, best effort annotations shows the value of such independent quality controls.
We note that there are many potential avenues for follow-up and improvement on the results presented here. For example, it would be desirable to develop a better quantitative understanding of both the potential and the limitations of phylogenetic information in this context. Recent evolutionary innovations and species-specific adaptations cannot be distinguished from true errors by a method that relies on historic correlations. However, it is currently unclear whether the numbers of false predictions encountered in this and related methods (Natale et al., 2000; Paley and Karp, 2001; Zheng et al., 2002) can be lowered by using better training sets and explicit models of phylogenetic relationships between the species in the training set, or whether they are absolute limits set by the phylogenetic and functional plasticity of gene families (Mirkin et al., 2003; Peregrin-Alvarez et al., 2003).
A pragmatic strategy for improvement might be to assess the statistical significance of each component of the probabilistic model and avoid making predictions based on uncertain parameters. Several such strategies have been developed in other contexts, such as gene expression analysis (Pe'er et al., 2001). More sophisticated learning algorithms, such as simulated annealing, might also be helpful in improving the predictive power of the models.
Finally, it appears worthwhile to develop models with more problem-specific semantics or constraints. One promising possibility is a hierarchical or noisy-or model (Pearl, 1988) where the observable variables, which indicate the presence and absence of genes, only depend on a smaller set of hidden variables that indicate the presence or absence of particular functional pathways, cellular complexes or other higher-level biological features. It would also be possible to use other annotation labels such as enzymatic activities, instead of protein families, analogous to the pathway inference systems developed by (Karp, 2001).
| Acknowledgments |
|---|
We thank Nir Friedman and Gal Elidan for sharing their LibB software; Simon Kasif, Nick Patterson and Douglas Lauffenburger for their helpful discussions and support; and referees for their helpful comments.
Received on February 12, 2004; revised on July 9, 2004; accepted on September 1, 2004
| REFERENCES |
|---|
|
|
|---|
Bocs, S., Danchin, A., Médigue, C. (2002) Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes. BMC Bioinformatics, 3, 5[CrossRef][Medline].
Brenner, S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132133[CrossRef][ISI][Medline].
Charniak, E. (1991) Bayesian networks without tears. AI Mag., 12, 5063.
Cole, S.T., Eiglmeier, K., Parkhill, J., James, K.D., Thomson, N.R., Wheeler, P.R., Honore, N., Garnier, T., Churcher, C., Harris, D., et al. (2001) Massive gene decay in the leprosy bacillus. Nature, 409, 10071011[CrossRef][Medline].
Cowell, R. (1998) Introduction to inference for Bayesian networks. In Jordan, M.I. (Ed.). Learning in Graphical Models, , Dordrecht, The Netherlands Kluwer Academic Publishers, pp. 926.
Dawid, A.P. (1992) Applications of a general propagation algorithm for probabilistic expert systems. Stat. Comput., 2, 2526.
Delcher, A.S., Kasif, S., Goldberg, H., Xsu, W. (1993) Protein secondary-strucutre modeling with probabilistic networks. Proc. Int. Conf. Intell. Syst. Mol. Biol., 109117.
Devos, D. and Valencia, A. (2001) Intrinsic errors in genome annotation. Trends Genet., 17, 429431[CrossRef][ISI][Medline].
Friedman, N., Linial, M., Nachman, I., Pe'er, D. (2000) Using Bayesian networks to analyze expression data. J. Comput. Biol., 7, 601620[CrossRef][ISI][Medline].
Gaasterland, T. and Ragan, M.A. (1998) Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics, 3, 199217[Medline].
Goodman, M., Czelusniak, J., Moore, G.W., Matsuda, G. (1979) Fitting the gene lineage into its species lineage: a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool., 28, 132163[CrossRef].
Goodner, B., Hinkle, G., Gattung, S., Miller, N., Blanchard, M., Qurollo, B., Goldman, B.S., Cao, Y., Askenazi, M., Halling, C., et al. (2001) Genome sequence of the plant pathogen and biotechnology agent Agrobacterium tumefaciens C58. Science, 294, 23232328
Hao, W. and Golding, G.B. (2004) Patterns of bacterial gene movement. Mol. Biol. Evol., 21, 12941307
Hartemink, A.J., Gifford, D.K., Jaakkola, T.S., Young, R.A. (2001) Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac. Symp. Biocomput., 6, 422433.
Heckerman, D., Geiger, D., Chickering, D. (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20, 197243[CrossRef].
Iliopoulos, I., Tsoka, S., Andrade, M.A., Enright, A.J., Carroll, M., Poullet, P., Promponas, V., Liakopoulos, T., Palaios, G., Pasquier, C., et al. (2003) Evaluation of annotation strategies using an entire genome sequence. Bioinformatics, 19, 717726
Karp, P.D. (2001) Pathway databases: a case study in computational symbolic theories. Science, 293, 20402044
Mirkin, B.G., Fenner, T.I., Galperin, M.Y., Koonin, E.V. (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol., 3, 2[CrossRef][Medline].
Murphy, K.P. (2001) The Bayes Net Toolbox for Matlab. Comput. Sci. Stat., 33, 120.
Natale, D.A., Galperin, M.Y., Tatusov, R.L., Koonin, E.V. (2000) Using the COG database to improve gene recognition in complete genomes. Genetica, 108, 917[CrossRef][ISI][Medline].
Page, R.D.M. (1994) Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol., 43, 5877[CrossRef].
Paley, S.M. and Karp, P.D. (2001) Evaluation of computational metabolic-pathway predictions for Heliobacter pylori . Bioinformatics, 18, 715724.
Pearl, J. Probabilistic Inference in Intelligent Systems, (1988) , San Mateo, CA Morgan Kaufmann.
Pe'er, D., Regev, A., Elidan, G., Friedman, N. (2001) Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, (Suppl. 1),, pp. S215S224[Abstract].
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci., USA, 96, 42854288
Peregrin-Alvarez, J.P., Tsoka, S., Ouzounis, C.A. (2003) The phylogenetic extent of metabolic enzymes and pathways. Genome Res., 13, 422427
Stuart, J.M., Segal, E., Koller, D., Kim, S.K. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science, 302, 249255
Tatusov, R.L., Koonin, E.V., Lipman, D.J. (1997) A genomic perspective on protein families. Science, 278, 631637
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 11, 41.
Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., Botstein, D. (2003) A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci., USA, 100, 83488353
Zheng, Y., Roberts, R.J., Kasif, S. (2002) Genomic functional annotation using co-evolution profiles of gene clusters. Genome Biol., 3, RESEARCH0060[Medline].
This article has been cited by other articles:
![]() |
P. Ternes, P. Sperling, S. Albrecht, S. Franke, J. M. Cregg, D. Warnecke, and E. Heinz Identification of Fungal Sphingolipid C9-methyltransferases by Phylogenetic Profiling J. Biol. Chem., March 3, 2006; 281(9): 5582 - 5592. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





