Bioinformatics Advance Access originally published online on January 10, 2008
Bioinformatics 2008 24(5):597-605; doi:10.1093/bioinformatics/btn004

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction

Qian Liu 1,*, Aaron J. Mackey 2,3, David S. Roos 2,3 and Fernando C. N. Pereira 1

1Department of Computer and Information Science, 2Department of Biology and 3Penn Genomics Institute, University of Pennsylvania, Philadelphia PA 19104, USA

*To whom correspondence should be addressed.


    ABSTRACT
 ABSTRACT
 1 INTRODUCTION
 2 EXPERIMENTS AND RESULTS
 3 DISCUSSION
 4 METHODS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The increasing diversity and variable quality of evidence relevant to gene annotation argues for a probabilistic framework that automatically integrates such evidence to yield candidate gene models.

Results: Evigan is an automated gene annotation program for eukaryotic genomes, employing probabilistic inference to integrate multiple sources of gene evidence. The probabilistic model is a dynamic Bayes network whose parameters are adjusted to maximize the probability of observed evidence. Consensus gene predictions are then derived by maximum likelihood decoding, yielding n-best models (with probabilities for each). Evigan can accommodate a variety of evidence types, including (but not limited to) gene models computed by diverse gene finders, BLAST hits, EST matches, and splice site predictions; learned parameters encode the relative quality of evidence sources. Since separate training data are not required (apart from the training sets used by individual gene finders), Evigan is particularly attractive for newly sequenced genomes where little or no reliable manually curated annotation is available. The ability to produce a ranked list of alternative gene models may facilitate identification of alternatively spliced transcripts. Experimental applications to ENCODE regions of the human genome and to the genomes of Plasmodium vivax and Arabidopsis thaliana show that Evigan achieves better performance than any of the individual data sources used as evidence.

Availability: The source code is available at http://www.seas.upenn.edu/~strctlrn/evigan/evigan.html

Contact: qianliu{at}seas.upenn.edu


    1 INTRODUCTION
 
As genome sequencing becomes more efficient, manual gene annotation has increasingly become a bottleneck in genome projects. Curated gene models are extremely valuable, but suffer from several inherent limitations. First, manual curation is slow and expensive, and thus lags increasingly behind as the acquisition of genomic sequence data accelerates. Second, successful manual curation depends heavily on biological evidence, which is often unavailable in the relatively understudied organisms that are becoming the target of sequencing efforts. Third, consistency and reproducibility are difficult to maintain, because individual curators use different annotation methods and evidence, as may even be the case for the same curator at different times. Finally, it can be difficult to track the sources of evidence that inform the manual curation process, complicating model revision as new data is acquired.

During manual curation, human annotators typically examine multiple sources of evidence, in order to produce a consensus view of the structure of a newly sequenced genome. We have developed an automated method that mimics this process of evidence integration and consensus building. Evigan (EVidence Integration for Genome Annotation using a Network) employs a dynamic Bayes net (DBN), a type of probabilistic graphical model that can accommodate multiple (possibly incomplete) gene predictions and other lines of evidence, yielding consensus gene models that maximize the probability of the evidence provided. The DBN model supports a wide variety of evidence types, including computational gene predictions, sequence homology search results, EST alignments and splice site predictions and is easily extensible to incorporate other evidence types, such as proteomics hits, predicted domain architecture, SAGE tags, or Affymetrix tiling array data. Evigan's annotation process simulates an idealized human curator: different evidence sources are compared, those that tend to agree in particular contexts are assigned higher confidence and a consensus model is then created that reflects those confidence estimates. Evigan can produce a single consensus gene model or an ordered list of the n-best gene models, along with associated posterior probabilities for each. The resulting consensus may be edited by a human annotator to reflect biological knowledge and experience not available to the algorithm.

Previous algorithms for creating consensus gene models from diverse evidence can be classified into two main approaches: (i) voting and graph-based methods that assemble a consensus prediction using predefined rules to combine predictions from different sources (Murakami and Takagi, 1998; Rogic et al., 2002; Schiex et al., 2001; Howe et al., 2002; Brejova et al., 2005; Coghlan and Durbin, 2007), and (ii) machine-learning methods that learn rules for combining multiple sources by maximizing the prediction accuracy on a training set of manually annotated genes (Pavlovic et al., 2002; Allen et al., 2004, 2005). Evigan brings together the best features of those two approaches. Like approach (ii), Evigan adapts its combination rule to the actual data being integrated, and like approach (i), it does not require a manually annotated training set. Evigan achieves this by learning a dynamic Bayes network that maximizes the likelihood of the observed sources of evidence. The consensus prediction is then the most probable state sequence for the learned network (see below for further discussion).
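
Schematically, the two steps described above (parameter learning from the evidence alone, followed by decoding) can be written as below. This is only a notational sketch: the symbols are assumptions introduced here for illustration, with Y standing for the hidden gene-structure state sequence, E_1, ..., E_m for the evidence from the m sources, and θ for the network parameters. The actual parameterization of the dynamic Bayes network is the one given in Section 4.

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \; P(E_1, \dots, E_m \mid \theta)
            \;=\; \arg\max_{\theta} \; \sum_{Y} P(Y, E_1, \dots, E_m \mid \theta),
\qquad
\hat{Y} \;=\; \arg\max_{Y} \; P(Y \mid E_1, \dots, E_m, \hat{\theta}).
```

No annotated Y appears in the first expression, which is why a manually curated training set is not needed.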

In this report, we have applied Evigan to three large-scale data sets: the ENCODE regions of the human genome (ENCODE project consortium, 2004), and the genomes of Plasmodium vivax (TIGR, 2007) and Arabidopsis thaliana (Allen et al., 2004). These experiments demonstrate that for all three species, Evigan achieves better performance than any individual data source used as evidence.


    2 EXPERIMENTS AND RESULTS
 
Evigan (see Section 4 for algorithmic details) has been applied to ENCODE regions of the human genome, the genome of the protozoan parasite P.vivax (one of the causative agents of malaria), and the genome of the flowering plant A.thaliana. We assessed Evigan's performance against well-curated gene models assumed to be the gold standard. We also compared Evigan with the following baseline gene models:

  • Models used as individual evidence sources for Evigan;
  • Majority voting among the evidence sources;
  • For A.thaliana, the Combiner model (Allen et al., 2004);
  • GLEAN (Elsik et al., 2007).

2.1 ENCODE
For this experiment, we ran Evigan on 31 of the 44 fragments of the human genome selected in The Encyclopedia of DNA Elements (ENCODE) project (ENCODE project consortium, 2004), which were also used as the testing set in the ENCODE Genome Annotation Assessment Project (EGASP) (Guigo and Reese, 2005; Guigo et al., 2006) gene prediction evaluation. EGASP evaluated the performance of more than 20 gene finders on those 31 regions by comparing their results with gene models curated by the HAVANA group (Guigo et al., 2006). Participating gene finders were categorized into four groups based on the information that they were allowed to exploit in making predictions: the human genomic sequence only (ab initio gene prediction; EGASP category 2); genomic sequences from other species (EGASP category 4); EST, mRNA and protein sequence alignments (EGASP category 3); or any available information (EGASP category 1). The general trend found in EGASP was that gene finders in categories 1 and 3 exhibit better accuracy than those in categories 2 and 4 because they have access to richer features.

Source gene prediction sets used in our experiment include: (I) EGASP entries: AUGUSTUS-any (Stanke and Waack, 2003; Stanke et al., 2006), FgenesH++ (Solovyev et al., 2006), Jigsaw (Allen et al., 2005, 2006), Pairagon-any (Arumugam et al., 2006), GeneMark (Lukashin and Borodovsky, 1998), GeneZilla (Majoros et al., 2004), AceView (NCBI, http://www.ncbi.nlm.nih.gov/IEB/Research), EnsEmbl (Curwen et al., 2004), Exogean (Djebali et al., 2006), ExonHunter (Brejova et al., 2005), DogFish (Carter and Durbin, 2006), Mars (Flicek and Brent, 2006), and Saga (Chatterji and Pachter, 2005); (II) UCSC tracks: GeneID (Parra et al., 2000), Genscan (Burge and Karlin, 1997) and Twinscan (Korf et al., 2001; Flicek et al., 2001); (III) a predictor published after EGASP: CRAIG (Bernal et al., 2007).

At present, Evigan can only use evidence about a single candidate transcript from each source (Section 3). While the gene finders in categories 2 and 4 predict single-transcript genes only, some gene finders in categories 1 and 3 predict multiple transcripts at a locus, based on evidence of alternatively spliced, overlapping EST or cDNA sequences. Where multiple transcripts are provided, Evigan takes just the longest available transcript, although all transcripts are retained in the annotation set used for performance evaluation. This filtering was applied to the Ensembl, Aceview, Exogean, FgenesH++ and Mars predictions. We also left out non-coding predictions, integrating and evaluating only predicted CDS between translation start and stop sites. Following the EGASP guidelines, an exon (or CDS) is counted correct if its boundaries and strand are exactly matched. A (coding) transcript is counted correct if all its CDS are exactly matched. A gene is counted as correct if one of its transcripts is predicted correctly. For incomplete transcripts in the annotation set, the EGASP analysis counts a predicted transcript as correct if its components are consistent with those of an annotated transcript; we have adopted a slightly more stringent measure wherein a predicted transcript is considered correct only if its components coincide precisely with those of an annotated transcript.
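
The exact-match criteria above can be sketched in a few lines of Python. This is an illustrative sketch only, not the EGASP evaluation code: the `CDS` record, the function names, and the choice of total CDS length as the measure of the "longest" transcript are assumptions made here. Specificity is computed in the gene-prediction sense, as the fraction of predicted exons that are correct.

```python
from typing import Iterable, List, NamedTuple, Tuple


class CDS(NamedTuple):
    """One coding segment; all field names are hypothetical."""
    seqid: str
    start: int
    end: int
    strand: str


Transcript = List[CDS]


def longest_transcript(transcripts: Iterable[Transcript]) -> Transcript:
    """Keep a single transcript per locus (assumption: longest = most coding bases)."""
    return max(transcripts, key=lambda t: sum(c.end - c.start + 1 for c in t))


def exon_scores(predicted: List[Transcript],
                annotated: List[Transcript]) -> Tuple[float, float, float]:
    """Exon-level sensitivity, specificity and F-score under exact matching."""
    pred = {c for t in predicted for c in t}
    ref = {c for t in annotated for c in t}
    tp = len(pred & ref)  # boundaries and strand must match exactly
    sn = tp / len(ref) if ref else 0.0
    sp = tp / len(pred) if pred else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f


def transcript_correct(pred_tx: Transcript, annotated: List[Transcript]) -> bool:
    """Correct only if the full CDS set coincides with an annotated transcript."""
    return any(set(pred_tx) == set(ref_tx) for ref_tx in annotated)


def gene_correct(gene_transcripts: List[Transcript],
                 predicted: List[Transcript]) -> bool:
    """A gene is correct if at least one of its transcripts is predicted exactly."""
    pred_sets = [set(p) for p in predicted]
    return any(set(tx) in pred_sets for tx in gene_transcripts)
```

Comparing whole CDS sets, rather than checking that a prediction is merely consistent with an incomplete annotation, corresponds to the more stringent measure adopted in the text.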

Source gene prediction sets in the ENCODE dataset are of variable quality. To estimate the sources' quality in the absence of reference gene annotations, we computed a quality score for each source based on the Evigan model parameters associated with the source, which represent an average level of agreement between the inferred consensus and the source, as detailed in Section 4.4. Evigan was first run with all 17 sources, producing the quality scores shown as blue diamonds in Figure 1. The results of this analysis show that gene finders from categories 1 and 3 tend to have better scores than the ones from categories 2 and 4, matching closely the EGASP results on reference annotation (see Supplementary Material), suggesting that Evigan quality scores are good indicators of the true quality of sources.


 
Fig. 1. Quality of ENCODE data sources, as ranked by Evigan. Blue diamonds represent quality scores for 17 source gene prediction sets from EGASP on ENCODE regions, derived from the parameters inferred by Evigan using all the 17 sources, as described in Section 4. Green dots represent quality scores inferred from integrating random subsets of eight sources.

 
Although the ENCODE project provides a large collection of gene models for the human genome, far fewer data sources are likely to be available for other organisms. We therefore sought to determine how many and which data sources are sufficient for best Evigan performance. We ran Evigan on each subset of the top k sources from the ranked list shown in Figure 1, for k = 1,...,17, yielding the F-score curves for the exon, gene and transcript levels shown in Figure 2. As more sources are added, performance improves until k = 8 for all three levels. For k > 8, performance starts decreasing as lower-quality sources are integrated. It is interesting to note that the average quality score for the top k sources, computed from the parameters inferred by Evigan for those sources, tracks closely with the accuracy of the Evigan-produced consensus for those sources with respect to the EGASP reference annotation. This suggests that the Evigan individual source and average quality scores may be useful proxies for the accuracy of individual sources and of the Evigan consensus when no gold standard is available.
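
A minimal sketch of this top-k sweep is given below, for the case where no gold standard is available and the average quality score is used as the proxy. The callable `infer_quality` is a hypothetical stand-in for one Evigan run on a subset of sources that returns a quality score per source; it is not part of any published Evigan interface, and the function names are likewise assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def sweep_top_k(
    ranked_sources: Sequence[str],
    infer_quality: Callable[[List[str]], Dict[str, float]],
) -> List[Tuple[int, float]]:
    """For k = 1..n, re-estimate quality on the top-k sources and average it.

    ranked_sources must already be ordered from best to worst, for example
    by the quality scores obtained when all sources are integrated together.
    """
    results = []
    for k in range(1, len(ranked_sources) + 1):
        subset = list(ranked_sources[:k])
        scores = infer_quality(subset)  # hypothetical: one Evigan run per subset
        results.append((k, sum(scores[s] for s in subset) / k))
    return results


def pick_k(results: List[Tuple[int, float]]) -> int:
    """Choose the k whose top-k subset has the highest average quality score."""
    return max(results, key=lambda r: r[1])[0]
```

When reference annotation is available, as in the ENCODE experiment, the F-score of each k-source consensus can be used in place of the average quality score.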


 
Fig. 2. Evigan average quality score and true performance as a function of the top k evidence sources. Average quality score of the top k sources (filled symbols) and F-scores at the exon, transcript and gene levels for the Evigan combination of the top k sources. For the top k evidence sources as ranked by Evigan (Figure 1), average quality score of those sources is computed from the inferred Evigan parameters; true performance of the corresponding k-source Evigan consensus is measured against reference annotation. F-score is the harmonic mean of sensitivity and specificity.

 
Following the above heuristic, we selected the eight sources with the highest quality scores for integration with Evigan. Figure 3 displays the sensitivity and specificity of the Evigan consensus of those eight sources, labeled Evigan-8g, in comparison with individual sources and with the Evigan combination of all sources, Evigan-17g (see Supplementary Material for detailed evaluation results). Overall, Evigan-8g outperforms any of the constituent data sources on which its analysis was based at all levels (exon, gene and transcript). Although the exclusion of multiple overlapping transcripts (such as those from differentially spliced cDNAs) from the source datasets supplied to Evigan decreases their accuracy, particularly with respect to sensitivity, a comparison of the performance statistics with the EGASP results indicates that this effect is small (see Supplementary Material). Using a larger collection of less accurate data sources (Evigan-17g) significantly reduces performance.


 
Fig. 3. Performance of Evigan (filled symbols) and each of the individual evidence sources on the human ENCODE regions. ‘Evigan-8g’ integrates eight gene finders as data sources: Augustus-any, Jigsaw, Pairagon-any, Ensembl, Aceview, CRAIG, Exogean and FgenesH++; ‘Evigan-17g’ integrates all the available sources.

 
Evigan's quality score estimate for a source depends on the set of sources being integrated. The quality score for a source inferred by running Evigan on all 17 sources might differ from the one inferred by running Evigan on a subset of the sources. To evaluate the robustness of Evigan quality score estimates when different subsets of sources are integrated, 20 random subsets of eight sources were sampled and integrated by Evigan. Quality score estimates for each source from these experiments were collected and are shown as green dots in Figure 1. Overall, the quality scores inferred from the random subsets are robust and close to the ones inferred from all 17 sources, although Genscan and Saga have outliers. Furthermore, Evigan's performance using the top eight sources (Evigan-8g) is better than its performance using any of the random eight-source subsets, suggesting that Evigan-8g has near-optimal performance among eight-source subsets.

2.2 Arabidopsis thaliana
A more limited set of gene predictions is available for A.thaliana, but annotation of this plant genome has benefitted from Combiner (Allen et al., 2004), which uses a decision tree trained on a separate curated training set to adjudicate among the predictions of multiple gene finders. The complete set of predictions from all gene finders used by Combiner, gene predictions from Combiner itself, and a list of 2163 genes annotated based on full-length cDNA alignments and manual curation (Haas et al., 2002) were kindly provided by the authors. Evidence sources included predictions from Genemark HMM (Lukashin and Borodovsky, 1998), Genscan (Burge and Karlin, 1997), GlimmerA (Pertea and Salzberg, 2002), GlimmerM (Pertea and Salzberg, 2002) and Twinscan (Korf et al., 2001), splice site predictions (Pertea et al., 2001) and protein alignments. All gene finders were trained before the annotated gene set became available, and they only produce non-overlapping gene models. Protein alignments between (the translations) of candidate coding DNA sequences and elements of a non-redundant protein database were created with the dps and nap programs (Huang et al., 1997). Alignment data was filtered to remove proteins from A.thaliana, which might have biased Evigan's predictions. Only HSPs with >50% amino acid sequence identity were used, and only the longest HSP was selected in cases of HSP overlap.
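
The HSP filtering rule in the last sentence can be sketched as follows. The text does not say how chains of mutually overlapping HSPs were resolved, so the greedy longest-first pass below is an assumption, as are the `HSP` record and the function name.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """A high-scoring segment pair on the genome; field names are hypothetical."""
    start: int
    end: int
    identity: float  # percent amino acid identity


def filter_hsps(hsps: List[HSP], min_identity: float = 50.0) -> List[HSP]:
    """Drop HSPs at or below the identity cutoff, then keep the longest of any overlap."""
    candidates = [h for h in hsps if h.identity > min_identity]
    kept: List[HSP] = []
    # Visit the longest HSPs first so that they win every overlap (greedy choice).
    for h in sorted(candidates, key=lambda h: h.end - h.start, reverse=True):
        if all(h.end < k.start or h.start > k.end for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h.start)
```

The same pattern with a different sort key (for example, ascending E value) would express the rule of keeping the best-scoring hit among overlapping BLASTX hits.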

Following the original Combiner publication (Allen et al., 2004), the data set was split into a small training set (380 annotated genes) and a larger test set (1783 annotated genes), and Combiner parameters were estimated from the annotated training set. Evigan's parameters were estimated on the whole data set (training and test) and GLEAN's on the test set from which reference gene annotations had been removed. The performance of each of the individual gene finders, GLEAN, Evigan and Combiner was assessed by comparison with the 1783 gene test set. Slight differences from the results previously reported for Combiner are attributable to the use of a more recent version of Combiner in this study.

Two sets of experiments were carried out on the Arabidopsis genome, with the results summarized in Figure 4 (see the Supplementary Material for detailed performance tables). In the first set, analogous to our ENCODE study, five gene finders were used as data sources for Combiner, GLEAN and Evigan (Combiner-G, GLEAN-G, Evigan-G). In the second set, these five gene finders were integrated with splice site and protein sequence alignment evidence (Combiner-A, Evigan-A); GLEAN currently cannot incorporate protein alignment data. In both of these experiments, all of the data integration strategies (Combiner, GLEAN and Evigan) achieve higher sensitivity and specificity at both the gene and exon levels than any of the gene finders used as data sources. In particular, both Combiner and Evigan improve upon Twinscan, which in turn far exceeds the results from GenemarkHMM, Genscan, GlimmerA and GlimmerM. That is, information from four gene finders that perform relatively poorly is actually helpful in correcting some of the mistakes of a much better gene finder. Integrating additional information (splice sites, protein sequence alignments) results in further improvements (compare -A with -G). The performance of Evigan and GLEAN is almost the same at the exon and gene levels. The performance of Combiner slightly exceeds that of Evigan and GLEAN, but Combiner depends on a carefully curated training set, while Evigan and GLEAN infer their parameters solely from the data sources provided, none of which includes the training annotations used by Combiner.


 
Fig. 4. Performance of Evigan, Combiner and each of the individual gene finders used as evidence sources on the A.thaliana genome. Evigan-G, GLEAN-G and Combiner-G integrate information from gene finders only, while Evigan-A and Combiner-A also include splice site predictions and protein sequence alignment data. Evigan and Combiner perform similarly at both the exon and gene level, and significantly better than any of the component data sources.

 
2.3 Plasmodium vivax
Plasmodium vivax is one of four parasite species that cause human malaria. Although most deaths from malaria are attributable to P.falciparum, P.vivax infections may be even more prevalent (Mendis et al., 2001). In order to expedite research on this parasite, which cannot be cultivated in vitro, a reference P.vivax genome has recently been sequenced (TIGR, http://www.tigr.org/tdb/e2k1/pva1/pva1.shtml). The first-pass annotation of this genome exploited gene predictions from GeneZilla (Majoros et al., 2004) (6090 predicted genes), GlimmerHMM (Pertea and Salzberg, 2002) (12047), and Phat (Cawley, 2001) (4563), as well as ESTs and BLASTX hits. Manual curation efforts yielded 5354 annotated genes. All of these datasets were downloaded from plasmodb.org (PlasmoDB, http://www.plasmodb.org/plasmo/home.jsp). All three gene finders predict non-overlapping gene models; in cases of overlapping ESTs, only the longest was used as input for Evigan; in cases where BLASTX hits overlapped, the one with the lowest E value was used.

As with A.thaliana, two sets of experiments were carried out on the P.vivax genome, integrating three gene finder models, with or without the inclusion of EST and BLASTX data, as shown in Figure 5 (see Supplementary Material for detailed evaluation results). Sensitivity and specificity were evaluated at the exon- and gene-level on a manually curated dataset. Once again, the performance of Evigan exceeds any individual gene finder. Evigan also outperforms majority voting (designated by ‘MV’ in Figure 5), because this analysis treats all evidence sources equally. Ascribing equal weight to the limited EST dataset (low sensitivity) and very noisy BLASTX evidence (low specificity) causes MV-A to perform worse than MV-G. Since Evigan learns to weigh individual data sources according to their patterns of agreement and disagreement, however, Evigan-A extracts useful information from the EST and BLASTX datasets, leading to better performance than Evigan-G.
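
To make the contrast with majority voting concrete, the sketch below implements an unweighted vote. The details of the MV baseline are not specified in this section, so the per-position binary representation (1 = coding, 0 = non-coding), the tie-breaking rule and the function name are assumptions.

```python
from typing import List, Sequence


def majority_vote(source_labels: Sequence[Sequence[int]]) -> List[int]:
    """Unweighted per-position vote over binary coding calls.

    source_labels[j][i] is source j's call at position i. Every source
    counts equally; ties are resolved as non-coding (an assumption).
    """
    n_sources = len(source_labels)
    length = len(source_labels[0])
    consensus = []
    for i in range(length):
        votes = sum(src[i] for src in source_labels)
        consensus.append(1 if 2 * votes > n_sources else 0)
    return consensus
```

Because every source carries the same weight, a low-sensitivity or noisy source shifts the vote as much as an accurate gene finder does, which is consistent with MV-A performing worse than MV-G.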


 
Fig. 5. Performance of Evigan compared with three gene finders and majority vote consensus on the P.vivax genome. GeneZilla, GlimmerHMM and Phat predictions were used as input for Evigan-G; Evigan-A also incorporated EST and BLASTX results. MV designates results from majority voting by each of the data sources.

 

    3 DISCUSSION
 
Several groups have previously sought to improve gene annotation by integrating diverse evidence sources using predefined rules. Murakami and Takagi (1998) used logical operators to combine coding regions predicted by several gene finders, reporting that disjunctions improve sensitivity and conjunctions specificity, as might be expected. Further studies aimed to improve both sensitivity and specificity by combining two gene finders with reliable exon scores (Rogic et al., 2002). EuGene (Schiex et al., 2001) described a more global integration method in which gene evidence is represented as a weighted directed acyclic graph, and the shortest path represents the proposed consensus gene model. GAZE (Howe et al., 2002) provides a generic framework that allows users to specify rules and scoring functions for assembling gene elements (such as signals and segments predicted by sensors), and then uses dynamic programming to find the highest-scoring gene parse. ExonHunter (Brejova et al., 2005) combines two distributions to model gene structures: a distribution over annotations given a genomic sequence, and a distribution over annotations given various sources of evidence (such as genome-genome comparisons, EST matches and protein alignments). All of these methods depend on manually specified combination rules and parameters to propose consensus gene models. A more recently published consensus finder, Genomix (Coghlan and Durbin, 2007), uses conservation features to score and select exons predicted by source gene finders.

Learning-based methods have also been developed for gene model prediction by evidence integration. Combiner (Allen et al., 2004) and its successor Jigsaw (Allen et al., 2005), which have been used successfully to annotate several organisms and in the EGASP competition (Allen et al., 2006), integrate multiple types of evidence using a machine-learned classifier. Pavlovic et al. (2002) use naïve Bayes models and input–output HMMs (based on complete gene models only) to model per-nucleotide consensus.

Evigan represents evidence from diverse sources (such as gene models, BLAST hits and EST matches) similarly to Combiner and Jigsaw, and, like Pavlovic et al., uses a probabilistic graphical model (Jordan, 1999) to represent the evidence integration task. Unlike those previous programs, however, Evigan does not rely on annotated training data. Instead, Evigan uses a carefully initialized expectation-maximization (EM) method to estimate model parameters directly from the observed evidence, making it suitable for systems where little or no annotated data is available.
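
The idea of estimating source quality by EM without annotated data can be illustrated on a drastically simplified model. The sketch below is not Evigan's dynamic Bayes network: it assumes independent positions, binary coding/non-coding calls, a single symmetric accuracy per source, and sources that are conditionally independent given the hidden consensus (a Dawid-Skene-style model swapped in here purely for illustration). All names are hypothetical. Initialization uses the sources' vote fractions, mirroring the majority-vote initialization described in this Discussion.

```python
from typing import List, Sequence, Tuple

EPS = 1e-6  # keeps probabilities away from exactly 0 and 1


def _clamp(p: float) -> float:
    return min(max(p, EPS), 1.0 - EPS)


def em_source_accuracy(
    labels: Sequence[Sequence[int]], n_iter: int = 50
) -> Tuple[List[float], List[int]]:
    """Estimate per-source accuracy and a consensus with no reference labels.

    labels[j][i] in {0, 1} is source j's call at position i.
    Returns (accuracy per source, consensus call per position).
    """
    m, n = len(labels), len(labels[0])
    # Initial posterior that each position is coding: fraction of sources voting 1.
    q = [sum(labels[j][i] for j in range(m)) / m for i in range(n)]
    acc = [0.5] * m
    for _ in range(n_iter):
        # M-step: prior and accuracies are expected agreement with the soft consensus.
        prior = _clamp(sum(q) / n)
        acc = [
            _clamp(sum(q[i] if labels[j][i] == 1 else 1.0 - q[i] for i in range(n)) / n)
            for j in range(m)
        ]
        # E-step: posterior of the hidden label under conditional independence.
        for i in range(n):
            p1, p0 = prior, 1.0 - prior
            for j in range(m):
                if labels[j][i] == 1:
                    p1 *= acc[j]
                    p0 *= 1.0 - acc[j]
                else:
                    p1 *= 1.0 - acc[j]
                    p0 *= acc[j]
            q[i] = p1 / (p1 + p0)
    consensus = [1 if qi > 0.5 else 0 for qi in q]
    return acc, consensus
```

In this toy setting the learned accuracies play the role of source quality: sources that agree more often with the evolving consensus receive more weight in the next round.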

The sources integrated by Evigan may have been developed with the help of annotated training sets, for instance ab initio gene finders. Evigan itself does not depend on what training data was used for the sources or on how they were trained. No training data is needed for Evigan to estimate its model parameters. In addition, our experiments with Evigan-derived source quality scores for EGASP sources suggest that scores based on Evigan's comparison of sources are useful proxies for the expected accuracy of those sources, even when the sources are quite similar, and can thus be used to select sources likely to enhance the accuracy of the Evigan consensus. When a random subset of the sources is selected and quality scores inferred for the sources in the subset, the quality scores are consistent with the ones inferred from running Evigan on all the sources together.

Evigan is a refinement and generalization of the GLEAN program (Elsik et al., 2007). GLEAN represents sources of evidence through their support or non-support for proposed signals (coding starts and stops, splice donors and acceptors), while Evigan represents them as gene components at consecutive segments between signals. GLEAN learns a scoring function for each signal type and constructs the optimal gene parse with dynamic programming, whereas Evigan formulates evidence and gene parse in a single sequential probabilistic model so that learning and inference can be integrated. Like Evigan, GLEAN does not require a training set for parameter estimation. GLEAN's signal-based evidence representation and scoring function, however, make it difficult to incorporate segment-based evidence types, such as BLASTX (protein to genome sequence) alignments, SAGE tags, or tiling array data.

We made several important assumptions in choosing to learn the relative quality of evidence sources without resorting to a training set. First, we assume that there is a fair amount of agreement among sources because we actually learn their quality by checking how often they agree with each other via a proposed consensus. Second, we assume that majority voting is significantly better than random, because we rely on majority voting to initialize the EM algorithm with a reasonable initial consensus. It is easy to see that a malicious or poorly selected set of sources in which the majority vote consensus is frequently wrong would mislead Evigan. Third, we assume that the evidence sources are approximately conditionally independent given the consensus gene model. This is a much weaker condition than asking for the sources to be unconditionally independent, since it only requires that there should not be strong correlations between source departures from the truth. Still, this is a matter of some concern, as many gene finders are trained on related datasets, use similar features, or are based on similar algorithms. Evigan could then fail if the errors of different sources were strongly correlated. For example, if one source provides good quality evidence whereas all the others are copies of a poor source, Evigan would wrongly learn to prefer the replicated poor sources.

Another limitation of Evigan is that the evidence provided by each source has to be extendible to a single gene model. In particular, gene models with alternative transcripts cannot be used, which is why we had to select just the longest transcript for each gene predictor that produces multiple transcripts. The reason for this limitation is that the probabilistic representation of each source (Section 4) is a single chain of random variables. We do not yet have a practical way of lifting this limitation, but this is an interesting area for further work.

In practice, however, Evigan performed quite well on several different datasets, as shown in Section 2. For all three genomes examined, Evigan outperformed all of the data sources used as evidence. Although these data sources are not always identical to the gene prediction sets generated by published algorithms, as some gene predictions were removed to provide the current version of Evigan with a non-overlapping set of gene models, the important point is that Evigan yields better gene predictions than any of the input data sources and also better even than the majority voting consensus (Fig. 5). Overall, the performance of Evigan is comparable to that of Combiner on A.thaliana, without Combiner's need for a curated training set.

One further feature of Evigan is the ability to generate the n-best gene models, with associated posterior probabilities. Although the highest probability model is selected in the absence of further information, the larger list of candidates is potentially valuable, suggesting possible alternatively spliced transcripts, and as hypotheses to be integrated with additional computational and biological information. Automated annotation cannot completely replace manual annotation because human experts will use additional knowledge not present in the individual sources of evidence. However, a program like Evigan can serve as a fast first-pass annotation tool. Human curators can then quickly look through Evigan's predictions to decide whether they want to make further changes. For example, if the posterior probability of the best consensus gene model is high, the curator may decide to keep it. If instead the posterior probability of the best model is below some threshold, the curator may examine the n-best gene models or go back to the individual sources of evidence to resolve the uncertainty. Curation effort will thus be focused on just those cases in which the evidence is so conflicting that the automated method cannot come up with a high-probability consensus, rather than having to be applied uniformly everywhere. Although Evigan's n-best gene models have the potential to suggest possible alternatively spliced transcripts, there is no guarantee that all the models proposed by Evigan are legitimate transcripts. To illustrate this, Figure 6 shows alternative models for a gene from the ENCODE set (EntrezGene ID: LAIR1, leukocyte-associated Ig-like receptor 1, on reverse strand). The curated transcripts and Evigan's five best models are visualized on the top and bottom of the figure, respectively. The posterior probabilities of Evigan's five best models are all similar. Evigan's third and fourth best models Evigan_3 and Evigan_4 match exactly with the curated alternative transcripts AC0087461.1-001 and AC0087461.1-003, respectively. However, the other Evigan models differ from the curated transcripts. How to automatically recognize true alternative transcripts in Evigan's n-best gene models is an open research question that we plan to investigate.


Fig. 6. Curated transcripts and Evigan's five best models for an ENCODE gene (EntrezGene ID: LAIR1, on reverse strand). Curated transcripts are displayed on the top, and Evigan's models on the bottom. Evigan's third and fourth best models Evigan_3 and Evigan_4 match exactly the curated alternative transcripts AC0087461.1-001 and AC0087461.1-003, respectively. However, the other Evigan models differ from the curated transcripts. Generated with the gene visualization program GFF2PS (Abril and Guigó, 2000).

 
As shown in the experiments on A.thaliana and P.vivax, adding splice site prediction, protein and EST evidence improves Evigan's performance over integrating gene prediction sets only. Aggregation of different types of evidence works well in our experiments, provided that the sources are complementary to each other and properly weighted according to their relative strength. It may seem contradictory that the best results for A.thaliana use all of the sources, whereas using all of the sources leads to poorer results in the ENCODE experiment. This apparent conflict can be explained by the differences between the sources in the two experiments. For A.thaliana, all of the sources are ab initio or de novo gene finders that exploit genomic sequence only, use similar features, have similar performance and often agree with each other. Therefore, each source seems mostly right from the point of view of the other sources, and Evigan's task is the relatively simple one of adjudicating a relatively small number of disagreements. In contrast, the 17 sources available for the ENCODE experiment differ from each other drastically in terms of input features and performance. Their higher level of disagreement has the effect of diluting a minority of good sources with a majority of relatively bad sources, decreasing overall performance unless sources with low quality scores are excluded.

In addition to the experiments described in this study, we have also entered Evigan in the Nematode Genome Annotation Assessment Project (NGASP) evaluation for C.elegans gene prediction. Preliminary results of that evaluation show that Evigan performs competitively among combiner gene finders (NGASP, http://www.wormbase.org/wiki/index.php/NGASP). Furthermore, Evigan has recently been used to create gene models for a P.falciparum re-annotation workshop whose results will soon be available on PlasmoDB (PlasmoDB, http://www.plasmodb.org/plasmo/home.jsp).

As we already noted, the current version of Evigan cannot use evidence for alternative models from a single source, and is thus restricted to selecting a single hypothesis for each source. This might cause Evigan to underestimate the value of such sources. In addition, alternative model evidence could be very useful in helping Evigan select a set of n-best models that is more likely to include true alternative transcripts. This would involve representing multiple alternative states for each segment and associating them with appropriate state transitions at segment boundaries. These questions clearly deserve further research. Another area for further study is how to use partial sources of evidence to improve the consensus for specific gene model components. For example, EST and protein sequence alignments may be very informative for splice donor and acceptor prediction, even if they are less accurate than gene finders in predicting complete gene models. Preliminary results indicate that other noisy data types can also be very valuable. For example, SAGE tag results can improve the recognition of terminal coding exons, and Affymetrix tiling array data can help UTR identification. Finally, it would be useful to allow the user to specify prior beliefs about evidence sources. For example, a researcher may prefer EST evidence from their own lab to computational evidence based on poorly matched training data, and may thus want to assign different initial weights to the sources. Ideally, the researcher would specify a prior distribution over source weights and Evigan would use a Bayesian inference procedure with that prior to produce posterior probabilities over candidate models.


    4 METHODS
We formulate the gene evidence combination task as a sequence labeling problem, and then represent the sequence labeling problem with a dynamic Bayes network (DBN). This representation allows us to model different types of evidence in a uniform way. Standard algorithms are used to estimate the DBN parameters and to compute the consensus gene model. (See Supplementary Material for a flow chart of the system.)

4.1 Sequence segmentation
Figure 7 provides a schematic view of how evidence and consensus are represented in Evigan. In this toy example, track 1 displays a gene model, produced by gene finder 1, consisting of two exons and one intron; tracks 2 and 3 display slightly different gene models, generated by gene finders 2 and 3. The splice predictor shown in track 4 proposes possible donor and acceptor sites, while tracks 5 and 6 display HSPs representing transcript and protein alignments.


Fig. 7. Given multiple sources of gene evidence, the genomic sequence is segmented at positions proposed by any of these sources. Track 0 contains the consensus prediction (not yet established in this example), and other tracks represent sources of evidence. Curly brackets at right indicate the state alphabet for each track.

 
The cumulative set of potential signals proposed by any of these sources is then used to divide the genomic sequence into segments, with signal sites as boundaries, as shown on the top track in the figure (16 segments in this case). Alternatively, one could segment the sequence based on each occurrence of a conserved motif for each signal type, but this would produce a significantly larger number of segments, and therefore increased computational cost.

Next, sources of gene evidence are mapped onto these segments as possible states (exon, intron or intergenic). For example, the gene model predicted by gene finder 1 (track 1) maps to the 16 sequence segments to yield the following series: one intergenic state, five frame 1 exon states, two intron states, seven frame 1 exon states and one intergenic state. Track 0 represents the emerging consensus gene prediction, where the consecutive segments will be labeled with exon, intron, or intergenic states. (See Supplementary Material for details on the various state alphabets.)
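The segmentation and mapping steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Evigan's implementation: the function names, the half-open coordinate convention and the toy coordinates are hypothetical, and exon frames are ignored for brevity.

```python
def segment_boundaries(sources, seq_start, seq_end):
    """Collect every boundary proposed by any source, plus the sequence ends."""
    cuts = {seq_start, seq_end}
    for features in sources.values():
        for start, end, _label in features:
            cuts.update((start, end))
    return sorted(c for c in cuts if seq_start <= c <= seq_end)


def label_segments(features, cuts, default="intergenic"):
    """Label each segment [cuts[i], cuts[i+1]) with the feature that covers it."""
    labels = []
    for left, right in zip(cuts, cuts[1:]):
        label = default
        for start, end, feature_label in features:
            if start <= left and right <= end:
                label = feature_label
                break
        labels.append(label)
    return labels


# Hypothetical toy evidence: (start, end, label) with half-open coordinates.
sources = {
    "finder1": [(10, 40, "exon"), (40, 60, "intron"), (60, 90, "exon")],
    "finder2": [(10, 35, "exon"), (35, 60, "intron"), (60, 95, "exon")],
}
cuts = segment_boundaries(sources, 0, 100)
tracks = {name: label_segments(feats, cuts) for name, feats in sources.items()}
for name, labels in tracks.items():
    print(name, labels)
```

Because segment boundaries come only from positions that some source actually proposes, the number of segments grows with the amount of disagreement among sources rather than with sequence length.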

4.2 Dynamic Bayes networks, state alphabets and probability constraints
Dynamic Bayes networks (DBNs) (Murphy, 2002) are probabilistic graphical models that encode dependencies among a set of random variables evolving along a single dimension, conventionally called "time". DBNs generalize hidden Markov models (HMMs; Rabiner, 1989) by allowing complex probabilistic dependencies among the variables at a given time, and between variables at consecutive times. In our DBN model, observed sequences of variables represent the sources of evidence, and another, unobserved sequence represents the hidden consensus that we wish to compute. We use the EM algorithm (Dempster et al., 1977) to estimate the optimal model parameters, and the Viterbi algorithm (Rabiner, 1989) to find the single most probable consensus.

In the DBN model used in Evigan, the random variables represent consensus and source evidence for each segment, and they take values from appropriate state alphabets. The first track is always the sequence of consensus variables and the other tracks are the sequences of evidence variables. Model edges encode statistical dependencies among variables. Two types of dependencies are represented: (1) dependency from consensus variable to evidence variables on the same segment; (2) dependency between consensus variables or evidence variables on successive segments. (See Supplementary Material for a detailed model definition.)

Each track, including the consensus sequence and each evidence sequence, is associated with a state alphabet that defines permissible states for that track. For example, track 1 in Figure 7 has one intergenic state, one intron state, and exon states with different phases (see the Supplementary Material for more details). In addition, we constrain the probabilistic relationships between segment states in the various tracks, so as to encode constraints in gene structure and evidence. Specifically we constrain the conditional probability tables (CPTs) of the DBN to accept only valid gene structures by setting certain probabilities to zero. The consensus state transitions are constrained to enforce correct transitions between intergenic, exonic and intronic states. Furthermore, each segment boundary is labeled with the signals associated with that boundary, and the latent state transitions across the boundary must obey the labeling. For example, a boundary associated with a donor signal requires the latent state to either stay the same (ignore the signal) or to make a transition from exonic to intronic. Other evidence types, such as SAGE tags, can be easily incorporated into the model by defining an appropriate alphabet and corresponding CPT constraints.
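The idea of constraining a CPT by setting entries to zero can be illustrated as follows. This is a hedged sketch only: the three-state alphabet without exon phases, the signal names and the functions `allowed_transitions` and `constrain_cpt` are assumptions made for illustration, and the actual alphabets and constraints are defined in the Supplementary Material.

```python
# Simplified consensus alphabet (exon phases omitted; an assumption).
STATES = ("intergenic", "exon", "intron")

# Transition licensed by each boundary signal (hypothetical signal names).
SIGNAL_TRANSITIONS = {
    "start": ("intergenic", "exon"),
    "donor": ("exon", "intron"),
    "acceptor": ("intron", "exon"),
    "stop": ("exon", "intergenic"),
}


def allowed_transitions(signals):
    """Staying in the same state ignores the signal; otherwise obey it."""
    allowed = {(s, s) for s in STATES}
    for signal in signals:
        allowed.add(SIGNAL_TRANSITIONS[signal])
    return allowed


def constrain_cpt(cpt, allowed):
    """Zero out forbidden transitions and renormalize each row."""
    constrained = {}
    for prev, row in cpt.items():
        kept = {nxt: (p if (prev, nxt) in allowed else 0.0) for nxt, p in row.items()}
        total = sum(kept.values())
        constrained[prev] = {nxt: p / total for nxt, p in kept.items()}
    return constrained


# Start from a uniform table and constrain it at a boundary carrying a donor signal.
uniform = {p: {n: 1.0 / len(STATES) for n in STATES} for p in STATES}
donor_cpt = constrain_cpt(uniform, allowed_transitions(["donor"]))
print(donor_cpt["exon"])
```

At a donor boundary the exon row keeps only the exon-to-exon and exon-to-intron entries, so a path that jumps from exon directly to intergenic has probability zero.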

4.3 Parameter estimation and inference
We use the expectation maximization (EM) algorithm (Dempster et al., 1977) to estimate the parameters of our DBN given the observed data. The EM algorithm needs a probabilistic inference algorithm as a subroutine to compute the posterior probabilities of the consensus variables given the observed data. For general DBNs, this inference problem is computationally hard because of the very large number of possible variable configurations, and approximate inference is typically used instead (Murphy, 1999). However, in the present case the state space is relatively small and the state transitions are relatively sparse due to the gene structure constraints on the consensus path, so we can reduce the inference problem to a straightforward variant of the forward–backward algorithm for HMMs (Murphy, 2002). Given the posterior probabilities for the consensus variables, new model parameters are computed as appropriate ratios of expectations. The EM algorithm converges to a local maximum of the data likelihood, and it may converge slowly if the starting model parameters are poorly chosen. To help the algorithm converge quickly to a good local maximum, the first expectation step is calculated with the consensus variables set to the majority votes of the corresponding evidence variables, with ties broken randomly.
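The majority-vote initialization used for the first expectation step can be sketched as below. The function name `majority_vote` and the representation of evidence as one tuple of labels per segment are assumptions for illustration; the sketch also assumes all sources share one label alphabet, which is a simplification of the per-track alphabets described in Section 4.2.

```python
import random
from collections import Counter


def majority_vote(evidence_columns, rng):
    """Pick the most common label per segment, breaking ties at random."""
    consensus = []
    for votes in evidence_columns:
        counts = Counter(votes)
        top = max(counts.values())
        tied = sorted(label for label, count in counts.items() if count == top)
        consensus.append(rng.choice(tied))
    return consensus


# Hypothetical evidence: one tuple of source labels per segment.
columns = [
    ("exon", "exon", "intron"),
    ("intron", "intron", "intron"),
    ("exon", "intron"),  # a tie, resolved randomly
]
initial_consensus = majority_vote(columns, random.Random(0))
print(initial_consensus)
```

Passing an explicit `random.Random` instance keeps the tie-breaking reproducible across runs, which is convenient when comparing EM runs started from the same initialization.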

After estimating the model parameters with the EM algorithm outlined above, we can query the model to answer inference questions. The most important such question is to find the hidden state sequence (the consensus gene structure) with highest probability given the evidence. This can be calculated using a straightforward adaptation of the Viterbi algorithm for HMMs (Durbin et al., 1998). In addition to finding the best consensus gene structure, we also implement an n-best decoding algorithm (Schwartz and Chow, 1990) to find the n-best consensus gene structures. One may also be interested in computing the posterior probability of a (possibly partial) consensus gene structure. This again can be computed in time linear in the number of segments by dynamic programming using a straightforward variant of the forward–backward algorithm for HMMs.
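To make the decoding step concrete, the following sketch runs a standard HMM-style Viterbi recursion over segments, with each segment's evidence treated as conditionally independent emissions given the consensus state. It is a simplified stand-in rather than Evigan's adaptation: it omits dependencies between successive evidence variables and the signal constraints, and the names `viterbi` and `noisy_source` and all probabilities are hypothetical.

```python
import math


def safe_log(p):
    """Log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def noisy_source(states, accuracy):
    """Toy emission table: the source reports the true state with prob. `accuracy`."""
    wrong = (1.0 - accuracy) / (len(states) - 1)
    return {s: {lab: (accuracy if lab == s else wrong) for lab in states} for s in states}


def viterbi(columns, states, start_p, trans_p, emit_p):
    """Most probable consensus path; columns[t][k] is source k's label on segment t."""
    def emit(state, column):
        return sum(safe_log(emit_p[k][state][lab]) for k, lab in enumerate(column))

    score = {s: safe_log(start_p[s]) + emit(s, columns[0]) for s in states}
    backpointers = []
    for column in columns[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + safe_log(trans_p[r][s]))
            new_score[s] = score[prev] + safe_log(trans_p[prev][s]) + emit(s, column)
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)

    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ("intergenic", "exon")
start_p = {"intergenic": 1.0, "exon": 0.0}
trans_p = {s: {t: 0.5 for t in states} for s in states}
emit_p = [noisy_source(states, 0.9), noisy_source(states, 0.6)]
columns = [
    ("intergenic", "intergenic"),
    ("exon", "intergenic"),
    ("exon", "exon"),
    ("intergenic", "exon"),
]
best_path = viterbi(columns, states, start_p, trans_p, emit_p)
print(best_path)
```

In this toy run the more reliable first source wins wherever the two sources disagree, which mirrors how learned per-source parameters weight the evidence.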

The inference algorithms implemented in Evigan rule out impermissible consensus sequences. For example, it is required that the consensus state for each segment be supported by at least one evidence source for the segment or that it extends from the state for the previous segment; it is also required that only the state transitions consistent with a signal type be allowed: for example, only exon-to-intron transition is allowed at a proposed donor position.
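The evidence-support rule can be expressed as a simple predicate on candidate consensus paths. The sketch below is illustrative only: the function names and the assumption that the path starts from an intergenic state are hypothetical, and the signal-consistency rule is not covered here.

```python
def is_permissible(state, prev_state, evidence_column):
    """A state must be proposed by some source or extend the previous state."""
    return state in evidence_column or state == prev_state


def path_is_permissible(path, evidence_columns, initial_state="intergenic"):
    """Check the support rule on every segment of a candidate consensus path."""
    prev = initial_state
    for state, column in zip(path, evidence_columns):
        if not is_permissible(state, prev, column):
            return False
        prev = state
    return True


# Hypothetical evidence: one tuple of source labels per segment.
evidence = [
    ("intergenic", "intergenic"),
    ("exon", "intergenic"),
    ("intron", "intron"),
    ("exon", "exon"),
]
supported = ["intergenic", "exon", "exon", "exon"]        # segment 3 extends the exon
unsupported = ["intergenic", "intron", "intron", "exon"]  # no source proposes an intron on segment 2
print(path_is_permissible(supported, evidence))
print(path_is_permissible(unsupported, evidence))
```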

4.4 Quality scores and evaluation
Evigan's source quality score is calculated from the conditional probability table associated with the source, averaging the entries where the source variable and the consensus variable have consistent values. (See Supplementary Material for more details.) The quality score is a surrogate for a source's performance that does not need a curated training set.
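One way to read this definition is sketched below. The exact formula is given in the Supplementary Material, which is not reproduced here, so the details are assumptions: the CPT is represented as nested dictionaries, "consistent" is taken to mean equal labels by default, and the function name `quality_score` and all probabilities are hypothetical.

```python
def quality_score(cpt, consistent=lambda consensus, observed: consensus == observed):
    """Average the CPT entries where source and consensus labels are consistent."""
    entries = [
        p
        for consensus, row in cpt.items()
        for observed, p in row.items()
        if consistent(consensus, observed)
    ]
    return sum(entries) / len(entries)


# Hypothetical CPT: P(source label | consensus label) for one gene finder.
cpt = {
    "exon": {"exon": 0.9, "intron": 0.05, "intergenic": 0.05},
    "intron": {"exon": 0.1, "intron": 0.8, "intergenic": 0.1},
    "intergenic": {"exon": 0.2, "intron": 0.1, "intergenic": 0.7},
}
score = quality_score(cpt)
print(round(score, 3))
```

The `consistent` argument is left pluggable because a source with a different alphabet (for example a splice site predictor) would need its own notion of agreement with the consensus.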

Gene prediction accuracy was evaluated using the Eval software package (Keibler and Brent, 2003) by sensitivity and specificity on exon, transcript and gene levels. (See Supplementary Material for more details.)
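For orientation, exon-level sensitivity and specificity can be computed as in the sketch below. It assumes the convention common in gene prediction evaluation, where specificity is the fraction of predicted exons that are correct, and it requires exact matches on both exon boundaries. It does not use or reproduce the Eval package; the function name and coordinates are hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact exon matches."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical exon coordinates as (start, end) pairs.
annotated_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted_exons = [(100, 200), (300, 400), (500, 650)]
sn, sp = exon_accuracy(predicted_exons, annotated_exons)
print(sn, sp)
```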


    ACKNOWLEDGEMENTS
We wish to thank Axel Bernal for providing CRAIG predictions on the ENCODE regions, and Jonathan Allen for evidence sources and Combiner predictions on A.thaliana. We would also like to thank Anat Caspi and Zhongqiang Chen for helpful suggestions for improving the manuscript. This work was supported in part through the Health Research Formula Fund from the Commonwealth of Pennsylvania, NSF ITR grant EIA 0205456 and grants from the NIH. Implementation was supported under the auspices of a Bioinformatics Resource Center contract from NIAID. DSR is an Ellison Medical Foundation Senior Scholar in Global Infectious Diseases.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on October 15, 2007; revised on December 13, 2007; accepted on January 3, 2008

    REFERENCES

    Abril JF, Guigó R. gff2ps: visualizing genomic annotations. Bioinformatics (2000) 16:743–744.

    Allen JE, et al. Computational gene prediction using multiple sources of gene evidence. Genome Res (2004) 14.

    Allen JE, et al. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics (2005) 21:3596–3603.

    Allen JE, et al. JIGSAW, GeneZilla and GlimmerHMM: puzzling out the feature of human genes in the ENCODE regions. Genome Biol (2006) 7(Suppl 1):S9.

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–10.

    Arumugam M, et al. Pairagon+NSCAN_EST: a model-based gene annotation pipeline. Genome Biol (2006) 7(Suppl 1):S5.

    Bernal A, et al. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput. Biol (2007) 3:e54.

    Brejova B, et al. ExonHunter: a comprehensive approach to gene finding. Bioinformatics (2005) 21(Suppl 1):i57–i65.

    Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol (1997) 268:78–94.

    Carter D, Durbin R. Vertebrate gene finding from multiple-species alignments using a two-level strategy. Genome Biol (2006) 7(Suppl 1):S6.

    Cawley SE. Phat: a gene finding program for Plasmodium falciparum. Mol. Biochem. Parasitol (2001) 118:167–174.

    Chatterji S, Pachter L. Large multiple organism gene finding by collapsed Gibbs sampling. J. Comput. Biol (2005) 99:33–54.

    Coghlan A, Durbin R. Genomix: a method for combining gene-finders predictions, which uses evolutionary conservation of sequence and intron-exon structure. Bioinformatics (2007) 23.

    Curwen V, et al. The Ensembl automatic gene annotation system. Genome Res (2004) 14:942–950.

    Dempster AP, et al. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Series B (Methodological) (1977) 39:1–38.

    Djebali S, et al. Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA. Genome Biol (2006) 7(Suppl 1):S7.

    Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (1998) Cambridge University Press.

    Elsik CG, et al. Creating a honey bee consensus gene set. Genome Biol (2007) 8:R13.

    ENCODE project consortium. The ENCODE (ENCyclopedia Of DNA Elements) project. Science (2004) 306:636–40.

    Flicek P, Brent MR. Using several pair-wise informant sequences for de novo prediction of alternatively spliced transcripts. Genome Biol (2006) 7(Suppl 1):S8.

    Flicek P, et al. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res (2001) 13:46–54.

    Guigo R, Reese MR. EGASP: collaboration through competition to find human genes. Nat. Methods (2005) 2:575–577.

    Guigo R, et al. EGASP: The human ENCODE genome annotation assessment project. Genome Biol (2006) 7(Suppl 1):S2.

    Haas BJ, et al. Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol (2002) 3.

    Howe KL, et al. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res (2002) 12:1418–1427.

    Huang X, et al. A tool for analyzing and annotating genomic sequences. Genomics (1997) 46:37–45.

    Jordan MI. Learning in Graphical Models. (1999) Cambridge, MA: The MIT Press.

    Keibler E, Brent MR. Eval: a software package for analysis of genome annotations. BMC Bioinformatics (2003) 4:50.

    Korf I, et al. Integrating genomic homology into gene structure prediction. Bioinformatics (2001) 17(Suppl 1):S140–148.

    Lukashin A, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucl. Acids Res (1998) 26:1107–1115.

    Majoros WH, et al. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics (2004) 20:2878.

    Mendis K, et al. The neglected burden of Plasmodium vivax malaria. Am. J. Tropical. Med. Hygiene (2001) 64. S.

    Murakami K, Takagi T. Gene recognition by combination of several gene-finding programs. Bioinformatics (1998) 14:665–675.

    Murphy K. Dynamic Bayesian Networks: representation, inference and learning. (2002) PhD Thesis, UC Berkeley.

    Murphy KP, et al. Loopy belief propagation for approximate inference: an empirical study. (1999) Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 467–475.

    Parra G, et al. GeneID in Drosophila. Genome Res (2000) 10:511–515.

    Pavlovic V, et al. A Bayesian framework for combining gene predictions. Bioinformatics (2002) 18:19–27.

    Pertea M, et al. GeneSplicer: a new computational method for splice site prediction. Nucl. Acids Res (2001) 29:1185–1190.

    Pertea M, Salzberg SL. Computational gene finding in plants. Plant Mol. Biol (2002) 48:39.

    Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE (1989) 77:257–286.

    Rogic S, et al. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics (2002) 18:1034–1045.

    Schiex T, et al. EuGène, an eukaryotic gene finder that combines several type of evidence. Comput. Biol (2001) 118–133.

    Schwartz R, Chow Y. The n-best algorithm: an efficient and exact procedure for finding the n most likely sentence hypotheses. (1990) Proceedings of International Conference on Acoustics, Speech and Signal Processing. 81–84.

    Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics (2003) 19(Suppl 2):II215–II225.

    Stanke M, et al. AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol (2006) 7(Suppl 1):S11.

    Solovyev V, et al. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol (2006) 7(Suppl 1):S10.

