Skip Navigation


Bioinformatics Advance Access originally published online on August 2, 2005
Bioinformatics 2005 21(18):3596-3603; doi:10.1093/bioinformatics/bti609
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
21/18/3596    most recent
bti609v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (19)
Google Scholar
Right arrow Articles by Allen, J. E.
Right arrow Articles by Salzberg, S. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Allen, J. E.
Right arrow Articles by Salzberg, S. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions{at}oupjournals.org

JIGSAW: integration of multiple sources of evidence for gene prediction

Jonathan E. Allen 1,3,* and Steven L. Salzberg 1,2

1Center for Bioinformatics and Computational Biology, University of Maryland Institute for Advanced Computer Studies, University of Maryland College Park, MD 20742, USA
2Department of Computer Science, University of Maryland Institute for Advanced Computer Studies, University of Maryland College Park, MD 20742, USA
3Department of Computer Science, Johns Hopkins University 3400 N. Charles Street, Baltimore, MD 21218, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 

Motivation: Computational gene finding systems play an important role in finding new human genes, although no systems are yet accurate enough to predict all or even most protein-coding regions perfectly. Ab initio programs can be augmented by evidence such as expression data or protein sequence homology, which improves their performance. The amount of such evidence continues to grow, but computational methods continue to have difficulty predicting genes when the evidence is conflicting or incomplete. Genome annotation pipelines collect a variety of types of evidence about gene structure and synthesize the results, which can then be refined further through manual, expert curation of gene models.

Results: JIGSAW is a new gene finding system designed to automate the process of predicting gene structure from multiple sources of evidence, with results that often match the performance of human curators. JIGSAW computes the relative weight of different lines of evidence using statistics generated from a training set, and then combines the evidence using dynamic programming. Our results show that JIGSAW's performance is superior to ab initio gene finding methods and to other pipelines such as Ensembl. Even without evidence from alignment to known genes, JIGSAW can substantially improve gene prediction accuracy as compared with existing methods.

Availability: JIGSAW is available as an open source software package at http://cbcb.umd.edu/software/jigsaw

Contact: jeallen{at}umiacs.umd.edu


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
Determining the true set of human genes has turned out to be a much harder problem than many expected; the exact number of genes has gradually decreased since the publication of the draft human genome in 2001 (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001), but it has not stabilized. The protein-coding regions of many human genes are generally agreed upon, but even for these, the precise gene structure, comprising the boundaries of all the exons and of the coding sequence, remains less than certain. Evidential support for existing genes varies widely, from tentatively defined to experimentally confirmed. For some rarely expressed genes, the evidence is limited to a small number of expressed sequence tags (ESTs) and to the overlapping but inconsistent predictions of multiple gene finders. For some loci, evidence of expression is absent but evidence from protein sequence alignments to other species strongly suggests the presence of a gene. Many human genes have been carefully confirmed through full-length cDNA sequencing, the remapping of those cDNAs to the original chromosomes is considered the ‘gold standard’ for defining the true exon–intron structure. The numerous genes for which full-length cDNA sequences have not been generated pose an ongoing challenge in our efforts to produce complete and accurate predictions of all genes.

To illustrate this challenge, we consider an example from human chromosome 20, shown (Fig. 1) in a display from the UCSC genome browser, which provides an interface to a collection of programs that have been run on each human chromosome. Figure 1 shows gene predictions from several gene finding programs, plus alignments of both protein and cDNA data from Swiss-Prot (Bairoch et al., 2005), UniGene (Wheeler et al., 2003) and the TIGR Gene Index (Lee et al., 2005). Despite strong evidence for a gene in Figure 1, the various programs disagree on the precise gene structure. It is possible that zero, one, or multiple different predictions are correct. Currently it is left to the user to decide what evidence to use for gene structure prediction.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 1 Gene structure evidence from the UCSC human genome annotation database (chromosome 20). Each row shows evidence generated from adistinct source.

 
One track shown in Figure 1 is the Ensembl prediction (fourth row from the top), generated by the Ensembl group's automated method for integrating various forms of evidence. Ensembl, one of the leading genome annotation systems, applies a collection of rules to decide when to use the output from different prediction programs, depending on the type of evidence available for a particular gene (Curwen et al., 2004). The rules attempt to filter out unreliable data, leaving only high quality alignments for use in prediction. A strength of this approach is that strict criteria are established to identify high quality alignments, which is particularly useful when complete cDNA sequence can be mapped to the originating chromosome. Applying criteria that are too strict, however, will miss genes that are not supported by expression data (for example), even if those genes are supported by other forms of evidence. One of the goals of JIGSAW is to capture more of these genes through an automated, statistically principled method for weighing the different evidence sources.

JIGSAW uses a manually curated gene set, along with all of the alignments and predictions associated with that set, to collect statistics on the accuracy of the gene prediction evidence. The goal is to provide consistent, reproducible predictions based on sound statistical principles, even in cases of conflicting evidence about gene structure. The program is a successor to our earlier Combiner system (Allen et al., 2004). JIGSAW and Combiner have been used by annotators as the basis for gene calls in several recently sequenced organisms, including Oryza sativa (rice) (Buell et al., in press) and Cryptococcus neoformans (Loftus et al., 2005). This paper describes several new algorithmic developments, explains the relationship between the JIGSAW algorithm and generalized hidden Markov models (GHMMs), and evaluates JIGSAW's prediction accuracy on the human genome. Our experiments show that JIGSAW's accuracy on both exon prediction and whole-gene prediction is superior to other methods.


    2 SYSTEMS AND METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
Computational gene finding programs are designed to model different aspects of protein coding genes, often using different statistical models for different parts of a transcript. For example, 3-periodic inhomogeneous interpolated Markov models (IMMs) have proven successful at modeling protein coding intervals (Salzberg et al., 1999), and decision trees (Burge and Karlin, 1997) have been used to capture dependencies between non-adjacent nucleotides near splice junctions. JIGSAW is able to take advantage of diverse models and evidence types and to combine them into frame-consistent gene predictions using dynamic programming. Here we define our dynamic programming algorithm for computing an optimally scoring parse of a genome sequence, where the parse directly corresponds to a prediction of exon–intronstructure.

We represent the components of a gene with a collection of gene structure labels Y = {Initial, Internal 1, Internal 2, Internal 3, Intron 1, Intron 2, Intron 3, Terminal, Single, Intergenic}, with each element y Y matching an exon, intron or intergenic region. The three distinct labels for introns and for internal exons (Internal 1, Internal 2 and Internal 3) are required to track the phase in cases where an intron interrupts a codon. Four signal types are allowed: start codons, stop codons, and splice junctions (acceptor and donor sites), which denote the beginning and ending (5' and 3' ends) of introns. Internal exons begin one base downstream of acceptor sites and end one base upstream of donor sites. By default, the acceptor site is an AG dinucleotide, and the donor site is either a GT or GC dinucleotide. As an option, the user can replace the default set of consensus splice sites with a custom dinucleotide set. To specify which DNA strand each gene occurs on, separate labels are defined for each strand (e.g. ‘Initial Plus Strand’, ‘Initial Minus Strand’ and ‘Internal 1 Plus Strand’) with the Intergenic label applied to both strands. Distinct genes are not permitted to overlap even if they occur on opposite strands. Without loss of generality we define the problem for predicting genes on one strand, which can be extended to both strands using the expanded set of labels. As implemented, JIGSAW predicts genes on both strands simultaneously.

Let S be a genome sequence and S[i,j] be the subsequence from position i to j (inclusive). A parse of genome sequence S, t = (t0,t1,...,tn) consists of partitions ti = (bi,ei,yi) with subsequence S[bi,ei] assigned label yi and t spanning the entire length of S. The parse covers each nucleotide in the sequence so that bi+1 = ei + 1. For example, the parse for a 1000-base genome sequence containing a single-exon beginning at position 120 and ending at position 730 would be t = [(0, 119, Intergenic), (120, 730, Single),(731, 999, Intergenic)]. The gene prediction problem is to find a parse t to maximize the joint probability of t and S: maxtP(t,S). The problem is made tractable by imposing a Markov assumption, so that label yi in partition ti depends only on the previous label yi–1 leading to .

Figure 2 shows a generalized hidden Markov model (GHMM) to parse genomic sequence using the Markov assumption, where states in the GHMM correspond to labels. A GHMM is defined by a set of states Q with states q and q', and an initial state probability P(q); a set of transitions from q' to q with probability P(q|q'); a set of probabilities P(S[i,j]|q) representing the probability of generating subsequence S[i,j] in state q and a set of probabilities Pq(l) for the likelihood of generating a sequence of length l in state q. Using the GHMM, the joint probability for parse t and sequence S is



View larger version (29K):
[in this window]
[in a new window]
 
Fig. 2 Generalized hidden Markov model for predicting gene structure in genomic sequence. States represent Initial, Internal, Terminal and Single exons, respectively, plus Intron and Intergenic sequence.

 
The JIGSAW dynamic programming algorithm finds the most probable parse for S of length l using an lx|Q| matrix D. Moving from left to right in the sequence, the highest scoring series of states ending in position j with state q assigned to the subsequence from i to j is stored in

where D(j,q) is initialized to P(S[0,j]|qP(qPq(j+1). The most probable parse spanning the sequence S is then found by retracing the sequence of states ending in maxqD(l–1,q). Stop codons are not allowed to span two adjacent exons within the same gene. When leaving an intron state (using the dynamic programming algorithm), the upstream and downstream exons are checked for the possible occurrence of a stop codon spanning the two adjacent exons. If a stop codon is found, the parse must end in the current intron state.

2.1 Representing gene structure evidence
Gene prediction using a set of evidence sources introduces an additional input parameter E, defined to be the gene structure evidence mapping to S. Figure 3 shows an example representation of annotation data for four sources of evidence: two gene prediction programs, GP1 and GP2 (GP2 reports a confidence score of 0.65), and two alignments to expression data with 86% and 95% identity, respectively. JIGSAW allows each evidence source to predict up to six gene features:

  • start codon (sta)
  • stop codon (stp)
  • intron (inr)
  • protein coding nucleotide (cod)
  • donor (don)
  • acceptor (acc)
The function

returns a set Bk of six feature vectors, one for each feature type, for each position k occurring in S. One example for each feature type is shown in Figure 3. For example, the start codon feature vector for position k0—the evidence that a protein starts in that position—is (1,0.65,0,0) because GP1 and GP2 both predict the beginning of a protein here, but the sequence alignment evidence predicts a gene to start downstream at position k1. In general, feature vector reflects what program x predicts with confidence on nucleotide S[k,k], and gi,j = gi,j(E,S) = (Bi,...,Bj) are the sets of feature vectors from position i to j. In Figure 3 the feature vector set for k0 is

The vector set Bk0 asserts that GP1 and GP2 predict a start codon at k0, which implies a coding interval, and no other predictions overlap k0. JIGSAW accepts any raw exon prediction score from an evidence source; however, the score should represent the program's confidence in the accuracy of its prediction. For example, the percent identity value of a transcript aligned to the target genomic sequence can be used as the confidence score, with the presumption that similar transcripts are more reliable predictors of genes than dissimilar transcripts. In cases where an evidence source does not report a confidence value (such as GP1 in Fig. 3), the entry in the feature vector is 1 or 0 to indicate the presence or absence of a prediction, respectively.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 3 Representation of four sources of gene structure evidence mapping to genome sequence S. Two gene prediction programs (GP1 and GP2), a cDNA alignment with 86% identity to S and an EST alignment with 95% identity to S. Examples of the six features, start (sta), stop (stp), coding (cod), intron (inr), donor (don) and acceptor (acc) encoded in feature vectors are shown. The predicted exon boundaries are k0,...,k6.

 
The input sequence S determines the type of prediction. For example, the cDNA alignment in Figure 3 is presumed to predict a donor site at k3 since the alignment stops at this position and begins again at position k4. However, the function fk(S,E) checks the sequence to ensure that a consensus splice site occurs at k3, and k4. Programs are assumed to predict a feature only when the feature is consistent with the sequence. The size of each feature vector - m0,...,m5 can differ, so that the set of programs used to predict each type, sta, stp, inr, cod, don and acc, are assumed to be independent. Therefore, programs designed to predict only one part of the gene can be used in addition to sources that predict complete genes. Evidence for introns typically comes indirectly from gene finders and sequence alignment data. For example, in Figure 3, the evidence for an intron from k3 + 1 to k4 – 1 is implied by three of the four evidence sources, since each source predicts flanking exons.

In Figure 3, positions from k4 + 1 to k5 – 1 return the same set of feature vector values:

In cases like this, where Bk = Bk–1, JIGSAW compresses the two sets in one. To capture important neighboring features each Bk includes Bk–1,Bk and Bk+1 (when k = 0, Bk–1 is defined to be a 0 valued vector).

To find the parse with the highest-score, maxt P(t,S,E), JIGSAW uses the dynamic programming matrix where D(j,q) is initialized to

and

Assuming independence, the probability of generating the feature vectors from i to j in a given state q is , but modeling Bk poses a problem because the distribution of percent identity values from the sequence alignments cannot easily be combined with the confidence values from the gene prediction programs. Moreover, even if we assume that each of the prediction programs only generates discrete values, the number of parameters grows exponentially with respect to the number of programs in the annotation database.

2.2 Predictions conditioned on input evidence
To improve flexibility in the type and amount of evidence used, an independent conditional probability is defined for each of the six gene features, type = {sta,stp,inr,cod,don,acc}. The independence assumption is justified by the fact that the collection of programs used to predict each gene feature type is assumed to be independent. In practice this is not true, since many programs are used to predict all six features (and the input sequence is the same); however, assuming independence reduces a large number of the statistical parameters. An estimate is made for the probability of a gene feature of type occurring at position k given the feature vector for type at position k: P(typek|vtype).

The function h(q,k,type,S[i,j],Bk)=

checks to see if the gene feature, type, should occur at position k when predicting the state q to align to subsequence S[i,j]. As an example, when state q is Initial, the first nucleotide (at position i) should correspond to the beginning of a start codon and to the first coding region for a protein. In this case, h(q,i,sta,S[i,j],Bi) = P(stai|vsta), h(q,i,cod,S[i,j],Bi) = P(codi|vcod) and h(q,i,type,S[i,j],Bi) = 1 – P(typei|vtype) for the four remaining feature types (stp, inr, don and acc). The probability that JIGSAW tries to compute for P(q|gi,j,S[i,j]) when q is Initial is the probability of a start codon at position i given the evidence for a start codon, times the probability of a coding interval from i to j given the evidence, times the probability of a donor site at j + 1 given the evidence, times the probability that no conflicting features occur.

In general, the probability of state q aligning to subsequence S[i,j] given all the feature vectors gi,j between i and j is the product of probabilities for each set of feature vectors Bk: , where type enumerates over all six gene features. The model makes the simplifying assumption that each feature vector set, Bk, is independent. Note that the Intergenic state is not explicitly modeled. Intergenic sequence is defined to be the absence of evidence predicting gene features in the sequence:

A gene finder like JIGSAW that uses other gene finders as input should build on the success of existing gene finders rather than duplicating their function. Therefore, we construct a probabilistic model to compute the probability of a parse conditioned on the input evidence. [Related work in natural language processing (Sarawagi and Cohen, 2004) has demonstrated useful applications of conditional probabilities in graphical models.] The most probable parse t given S, maxtP(t|S,E) is used to make predictions. The inference algorithm for finding maxt P(t|S,E) uses the dynamic programming matrix

where D(j,q) is initialized to P(q|g0,j,S[0,j])· P(q).

For each scored parse, the six feature types contribute to each independent probability, in some cases predicting support for the parse and in other cases predicting support against the parse. Since each possible parse is scored using the same fixed number of independent random feature type events, the length of an exon, intron or intergenic sequence interval is dependent solely on the evaluation from the feature type models and the state transition probabilities.

2.3 Parameter estimation
Feature models are estimated for P(sta|vsta), P(stp|vstp), P(inr|vinr), P(cod|vcod), P(don|vdon) and P(acc|vacc) through enumeration over labeled sequences in a training set. Each feature model, sta, stp, inr, cod, don and acc, is trained independently. As an example, in Figure 3, the index into fk0 for start codons is vsta = (1,0.65,0,0). The training procedure counts the number of times the evidence (1,0.65,0,0) occurs in the training data and the percentage of cases where (1,0.65,0,0) correctly predicts the start codon location. The operating assumption is the more evidence predicting a start codon with high confidence the greater chance that a start codon actually occurs. Thus, the training set should show that when all sources of evidence predict a start codon with high confidence [e.g. vsta=(1,1,1,1)] the probability of a start codon is much higher than when no evidence predicts a start codon [e.g. vsta=(0,0,0,0)].

Simple counting methods do not accurately estimate actual probability values because the sample space is theoretically infinite (in practice it is finite but extremely large). For example, (1,0.66,0,0) may not occur in the training set, while (1,0.65,0,0) happens to occur 50 times showing 80% accuracy, resulting in P(sta|(1,0.65,0,0)) = 0.8 and leaving P(sta|(1,0.66,0,0)) undefined. To avoid the problem of a large sample space, the observed feature vectors are first divided into two groups, accurate and inaccurate. For each feature vector vtype, c(vtype) is the percentage of cases in which vtype is observed to correctly predict type. vtype is assigned to the group accurate if c(vtype) >0.5 and inaccurate otherwise.

A decision tree is induced to partition the feature vector space into subregions, distinguishing feature vectors in the accurate set from feature vectors in the inaccurate set. For the human genome data, JIGSAW uses the OC1 (Murthy et al., 1994) decision tree system to create these trees. Each element of the feature vector is tested for ‘yes’ or ‘no’ questions to separate the accurate set from the inaccurate set. From the example in Figure 3, a trivial one-node decision tree is ‘Is GP2's confidence value >0.5?’, which partitions the data into two disjoint sets. These sets can be partitioned further by building a larger tree, but OC1 implements procedures to avoid partitioning the data too finely. It does this by withholding 10% of the data to determine when to stop building the tree. As the tree is built, its classification accuracy is tested on the hold-out set, and tree-building terminates when the classification accuracy drops below a threshold.

The decision tree captures in an automated way a set of rules that is similar to those used in a rule based system such as Ensembl, where percent identity cutoffs are used to make a prediction. For example, a simple prediction rule might be that a gene is valid if predicted by one gene finder and by alignment to a non-human RefSeq protein with 98% nucleotide identity. Figure 4 shows example protein coding (cod) feature vectors for non-human RefSeq alignments and non-human cDNA alignments, where a gene finder made an overlapping prediction. Each point in Figure 4 shows the percent identity of the respective alignments. An example is marked ‘+’ if it is in the accurate class and ‘x’ if it is in the inaccurate class. The decision tree creates three partitions, which determine the cutoff values. Examples in the training set show that a threshold of 95% for either source of alignments is an accurate predictor of a protein coding region. When both alignments are <95%, the accuracy is questionable.



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 4 The plot on the left side of the figure shows the accuracy of predictions based on alignments to non-human sequences that overlap a gene finder's predictions. Each point is a pair of alignments observed in training and their percent identity to the genomic sequence. ‘+’ points are labeled ‘{accurate}’ and ‘x’ points are labeled ‘{inaccurate}.’ The two lines correspond to the non-leaf nodes in the decision tree shown on the right side of the figure.

 
We can still make use of the examples below the 95% percent identity cutoff using the average probability from the individual examples. For feature vector vtype and decision tree ßtype, V = ßtype(vtype) is the list of the feature vectors from training, grouped into a local region. In Figure 4, evaluating the evidence vector (0.58,0.52) returns the set of examples in V3. The probability estimate is the average accuracy of the individual examples in V3. In general, the probability of type given feature vector vtype is P(type|vtype) = {sum}vVc(v)/|V|, where |V| is the size of V.

In the earliest Combiner implementations (Allen et al., 2004), each evidence source was assigned an independent weight, which was shown to be effective in some cases. However, assuming independence precludes the inclusion of potentially useful sources of evidence, such as predictions from a single gene finder using two different parameter settings. The JIGSAW training procedure addresses the problem of interdependence by evaluating the accuracy of each observed evidence combination. If observed correlated sources produce similar predictions the decision tree induction algorithm can ignore a redundant source. When differences are observed, the decision tree can treat the case where correlated evidence sources overlap differently from the case where the evidence sources do not overlap.


    3 RESULTS AND DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
We used two test sets to evaluate the accuracy of JIGSAW on human gene prediction. For the first, we selected 1563 genes at random (uniformly distributed among the 24 chromosomes) from a set of 17 477 non-redundant RefSeq genes (Pruitt et al., 2005). We eliminated genes that are known to exhibit alternative splicing, although of course many alternative splice forms are still unknown and therefore some of the 1563 genes may still have multiple splice variants. We estimated accuracy using 3-fold cross validation, training on 2/3 of the data and testing on 1/3, and averaging results for the three experiments. For the second test, we used annotations from the Havana group (Ashurst et al., 2005) in the 44 ENCODE regions (The ENCODE Project Consortium, 2004), which span ~1% of the human genome. The 44 ENCODE regions do not overlap with the 1563 genes from our first set and do include known alternatively spliced genes.

JIGSAW predictions were based on the sequence using the default consensus splice sites (GT/GC and AG) and a collection of evidence from an annotation database. Annotation data was downloaded from the UCSC genome annotation database (http://hgdownload.cse.ucsc.edu/goldenPath/hg17/database) using NCBI Build 35; examples of this evidence are shown in Figure 1.

The main evidence types are as follows:

For each of the 1563 test genes JIGSAW was run on an interval that included 50 000 bases of upstream and downstream sequence. If test genes occurred in overlapping regions, the regions were merged to form a longer contiguous sequence. At a minimum, therefore, each input sequence contains over 100 000 bases. Three prediction criteria were measured: accuracy on genes, on exons and on nucleotides. A gene-level prediction was counted correct only if the entire exon structure matches the test gene, from start codon to stop codon, including all intron boundaries. Non-coding exons were not included in these tests. For an exon prediction to be counted correct both the 5' and 3' boundaries must match the test exon exactly. A predicted protein coding nucleotide was counted correct when it matched a protein coding nucleotide in a test gene. Sensitivity is defined here as the percentage of true genes (exons) that were correctly predicted, and specificity is the percentage of the predicted genes (exons) that are correct.

On average, Ensembl predicts 1.2 isoforms per gene locus, and the alignments matching the Swiss-Prot/TrEmbl proteins generate an average of 1.7 distinct isoforms per test gene. In order not to penalize any of the programs for predicting correct alternatively spliced genes, if any one of the predicted isoforms matched the ‘true’ gene, the prediction was counted as correct. However, multiple identical predictions at the exon and nucleotide level were not double-counted when computing specificity.

Table 1 shows results for the 1563 test set using different combinations of evidence, along with the two single most accurate evidence sources, Ensembl and the cDNA alignments matching Swiss-Prot/TrEMBL proteins. One objective of this study was to evaluate how closely JIGSAW predictions matched the human-curated set. When JIGSAW has access to the same information as a human curator, the goal should be to report the most accurate gene predictions possible, subject to the constraints of the available evidence. Wherever JIGSAW matches the RefSeq gene, it would appear that the program is matching the ability of human curators. The first row in Table 1, JIGSAW—All, shows JIGSAW output using all available evidence, including the gene finders, cross-species conservation and expression evidence, including known proteins plus Ensembl. Interestingly, while this version yields nearly the best performance, the second row, JIGSAW, shows the results from excluding evidence from the explicit non-human gene expression sources. Excluding the explicit non-human expression evidence results in slightly more genes and exons being missed completely, but the remaining prediction accuracy measures increase.


View this table:
[in this window]
[in a new window]
 
Table 1 Prediction performance on 1563 test genes for JIGSAW and for the most accurate of the evidence sources

 
In theory, when JIGSAW uses both Ensembl predictions and the cDNA alignments from Swiss-Prot/TrEMBL as input, it should either match or improve upon the accuracy of the those lines of evidence. The analysis is complicated by the fact that both Ensembl and the cDNA alignments predict multiple isoforms, while JIGSAW predicts (in its current implementation) only a single isoform. This shows up in the number of whole genes predicted correctly (Gene Sensitivity in Table 1), the one accuracy measure where JIGSAW does not have the best results. Predicting multiple isoforms enables Ensembl to predict 62% of the genes exactly correct, and the aligned cDNA correctly capture 65% of the genes, compared with JIGSAW's 59%. For 50% of JIGSAW predictions not matching a RefSeq gene, Ensembl or a cDNA alignment match the test gene while predicting additional isoforms. Since JIGSAW assembles a single isoform, while looking at all of the possible local gene features, the algorithm can merge the predicted isoforms into a single gene. Despite this potential limitation, JIGSAW is able to correctly detect as many or more exons completely correct as the underlying sources while picking up 5% more of the true protein coding nucleotides and completely missing less genes and exons. Moreover, the combination of specificity and sensitivity in JIGSAW is high. With 90% of protein coding nucleotides correctly detected, 66% of JIGSAW's gene predictions exactly match a RefSeq gene, showing a strong balance between detecting genes and making accurate predictions.

Row three in Table 1 (JIGSAW—No Ensembl) shows JIGSAW performance when the Ensembl predictions were excluded from its input. While JIGSAW benefits from Ensembl input, JIGSAW is able to make predictions without Ensembl and achieves comparable performance. The overall number of correctly detected genes drops slightly, but for all other measures JIGSAW's accuracy is superior.

It is important that computational gene finders be able to identify genes even when evidence from known curated human proteins is not available. Table 1 shows results for JIGSAW when we excluded the Swiss-Prot/TrEMBL protein evidence from its input. The first two rows in Table 1 show JIGSAW's results when the curated proteins were not used as evidence, but instead we used expression evidence. Using only human expression data appears to improve performance slightly in most categories, but at the cost of missing 1% more of the genes completely and 1% more of the protein coding nucleotides. Without benefit of the cDNA alignments matching the curated proteins as input, overall performance drops; however, JIGSAW still correctly detects 88–89% of the protein coding nucleotides and 41–42% of the test genes. The third row in Table 2 shows JIGSAW's performance when we excluded all expression data, using only the gene finders and sequence conservation as evidence. While performance drops still further, the results show that JIGSAW still does better than ab initio gene finders.


View this table:
[in this window]
[in a new window]
 
Table 2 Prediction performance on 1563 test genes, excluding the use of curated human proteins

 
To further measure specificity in the alternative splice site prediction programs, we looked at the ENCODE regions and the Havana manual annotations for genes with start and stop codons, no in-frame stop codons and consensus splice sites. The set includes 195 loci, with 41% of the loci producing multiple transcripts, for a total of 330 isoforms. Each transcript was compared against the different programs' predictions so that all isoforms of each gene were checked. If an exon or nucleotide was correctly identified by any one of a program's predicted isoforms, we counted it as a correct prediction, but only once. (Incorrect predictions are likewise counted only once.) Results are shown in Table 3. As expected, the alternative isoform predictors (Ensembl and the cDNA alignments derived from Swiss-Prot/TrEMBL proteins) are able to capture a higher percentage of isoforms than JIGSAW. However, both programs predict many more isoforms, and thus have substantially lower specificity than JIGSAW. At the exon level, (Exon Sensitivity and Specificity in Table 3) JIGSAW performs slightly better in both sensitivity and specificity than Ensembl and the Swiss-Prot/TrEMBL cDNA based alignments, suggesting that many of the isoforms involve small changes in the exon structure. JIGSAW's balance between sensitivity and specificity remains strong, with 92% of the protein coding nucleotides correctly detected and 72% of the gene predictions exactly matching a test gene.


View this table:
[in this window]
[in a new window]
 
Table 3 Prediction performance on ENCODE regions

 
The ENCODE Gene Prediction Workshop (EGASP, 2005) http://genome.imim.es/gencode/workshop2005.html held in April 2005 was a meeting whose primary purpose was to assess the accuracy of computational human gene predictions in the ENCODE regions. JIGSAW predictions were submitted for comparison with dozens of other methods. Results from the workshop support our findings that JIGSAW is able to create accurate computational predictions of human genes, and in most cases outperform both ab initio methods and other ‘combination’ methods that use homology in various forms. JIGSAW's gene predictions for the ENCODE regions are freely available for download through the UCSC genome browser (http://genome.ucsc.edu/encode).


    4 CONCLUSION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 
Cataloging the complement of human proteins remains an important yet elusive scientific milestone. Progress continues as the evidence available to support predictions increases. Computational methods need to be able to integrate this new data quickly and effectively. JIGSAW demonstrates the benefit of combining a statistical approach to evidence evaluation with a GHMM based algorithm in order to integrate multiple, often disagreeing, sources of evidence. JIGSAW is flexible with respect to types of input, requiring only that each evidence source produce a list of coordinates on a genome sequence. The program smoothly incorporates DNA and protein sequence alignment scores as well as the confidence values produced by some gene finders. As gene structure evidence continues to grow and improve, furthermore, JIGSAW's accuracy should improve as well.

Finally, JIGSAW's gene finding accuracy benefits from having free access to the rich annotation sources inside genome databases. Public access to genome sequences and annotation not only benefits biologists working on this data, but also speeds development of better computational methods.


    Acknowledgments
 
The authors thank William H. Majoros and Mihaela Pertea for extraction of RefSeq genes and for numerous useful comments and suggestions. This work was supported in part by the National Institutes of Health under grants R01-LM06845 and R01-LM007938. Funding to pay the Open Access publication charges for this article was provided by grant R01-LM06845 to SLS from the National Institutes of Health.

Conflict of Interest: none declared.

Received on June 20, 2005; revised on July 28, 2005; accepted on July 29, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSION
 REFERENCES
 

    Allen, J.E., Pertea, M., Salzberg, S.L. (2004) Computational gene prediction using multiple sources of evidence. Genome Research, 14, .

    Ashurst, J.L., et al. (2005) The Vertebrate genome annotation ({V}ega) database. Nucleic Acids Res., 33, 459–465.

    Bairoch, A., et al. (2005) The universal protein resource ({U}ni{P}rot). Nucleic Acids Res., 33, 154–159[CrossRef].

    Buell, C.R., et al. (2005) Sequence, annotation, and analysis of synteny between rice chromosome 3 and diverged grass speices. Genome Res., in press.

    Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–84[CrossRef][ISI][Medline].

    Curwen, V., et al. (2004) The Ensembl automatic gene annotation system. Genome Res., 14, 942–950[Abstract/Free Full Text].

    EGASP. (2005) Gene prediction workshop. http://genome.imim.es/gencode/workshop2005.html.

    Flicek, P., et al. (2003) Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res., 13, 46–54[Abstract/Free Full Text].

    Guigo, R. (1998) Assembling genes from predicted exons in linear time with dynamic programming. J. Comput. Biol., 5, 681–702[ISI][Medline].

    International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921[CrossRef][Medline].

    Kent, W.J. (2002) BLAT—the BLAST-like alignment tool. Genome Res., 12, 656–664[Abstract/Free Full Text].

    Lee, Y., et al. (2005) The TIGR gene indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res., 33, 71–74.

    Loftus, B.J., et al. (2005) The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. Science, 307, 1321–1324[Abstract/Free Full Text].

    Majoros, W.H., et al. (2004) Tigr{S}can and Glimmer{HMM}: two open source ab initio eukaryotic gene-finders. Bioinformatics, 20, 2878–2879[Abstract/Free Full Text].

    Murthy, S.K., et al. (1994) A system for induction of oblique decision trees. J. Artif. Intell. Res., 2, 1–32.

    Parra, G., et al. (2003) Comparative gene prediction in human and mouse. Genome Res., 13, 108–117[Abstract/Free Full Text].

    Pruitt, K.D., et al. (2005) NCBI Reference Sequence ({R}ef{S}eq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 1, 501–504.

    Salzberg, S.L., et al. (1999) Interpolated markov models for eukaryotic gene finding. Genomics, 59, 24–31[CrossRef][ISI][Medline].

    Sarawagi, S. and Cohen, W.W. (2004) Semi-markov conditional random fields for information extraction. Proceedings of the Advances in Neural Information Processing Systems, 17 (NIPS 2004)Vancourer, BC, Canada.

    Siepel, A. and Haussler, D. (2003) Combining phylogenetic and hidden markov models in biosequence analysis. Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB 2003)Berlin, Germany , pp. 277–286.

    The ENCODE Project Consortium. (2004) The ENCODE (ENCyclopedia of DNA elements) project. Science, 306, 636–640[Abstract/Free Full Text].

    Venter, J.C., et al. (2001) The sequence of the human genome. Science, 291, 1304–1351[Abstract/Free Full Text].

    Wheeler, D.L., et al. (2003) Database resources of the national center for biotechnology. Nucleic Acids Res., 31, 28–33[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
Q. Liu, A. J. Mackey, D. S. Roos, and F. C. N. Pereira
Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction
Bioinformatics, March 1, 2008; 24(5): 597 - 605.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Genome Information Integration Project And H-Invit
The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts
Nucleic Acids Res., January 11, 2008; 36(suppl_1): D793 - D799.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
C. M. Bergman and H. Quesneville
Discovering and detecting transposable elements in genome sequences
Brief Bioinform, November 1, 2007; 8(6): 382 - 392.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
D. DeCaprio, J. P. Vinson, M. D. Pearson, P. Montgomery, M. Doherty, and J. E. Galagan
Conrad: Gene prediction using conditional random fields
Genome Res., September 1, 2007; 17(9): 1389 - 1398.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
S. Moretti, F. Armougom, I. M. Wallace, D. G. Higgins, C. V. Jongeneel, and C. Notredame
The M-Coffee web server: a meta-method for computing multiple sequence alignments by combining alternative alignment methods
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W645 - W648.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
H. Xia, J. Bi, and Y. Li
Identification of alternative 5'/3' splice sites based on the mechanism of splice site competition
Nucleic Acids Res., December 4, 2006; 34(21): 6305 - 6313.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
N. Pierstorff, C. M. Bergman, and T. Wiehe
Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA
Bioinformatics, December 1, 2006; 22(23): 2858 - 2864.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
S. J. Hsieh, C. Y. Lin, N. H. Liu, W. Y. Chow, and C. Y. Tang
GeneAlign: a coding exon prediction tool based on phylogenetical comparisons.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W280 - W284.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
I. M. Wallace, O. O'Sullivan, D. G. Higgins, and C. Notredame
M-Coffee: combining multiple sequence alignment methods with T-Coffee
Nucleic Acids Res., March 23, 2006; 34(6): 1692 - 1699.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
21/18/3596    most recent
bti609v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (19)
Google Scholar
Right arrow Articles by Allen, J. E.
Right arrow Articles by Salzberg, S. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Allen, J. E.
Right arrow Articles by Salzberg, S. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?