Bioinformatics Advance Access originally published online on August 2, 2005
Bioinformatics 2005 21(18):3596-3603; doi:10.1093/bioinformatics/bti609
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
JIGSAW: integration of multiple sources of evidence for gene prediction
1Center for Bioinformatics and Computational Biology, University of Maryland Institute for Advanced Computer Studies, University of Maryland College Park, MD 20742, USA
2Department of Computer Science, University of Maryland Institute for Advanced Computer Studies, University of Maryland College Park, MD 20742, USA
3Department of Computer Science, Johns Hopkins University 3400 N. Charles Street, Baltimore, MD 21218, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Computational gene finding systems play an important role in finding new human genes, although no systems are yet accurate enough to predict all or even most protein-coding regions perfectly. Ab initio programs can be augmented by evidence such as expression data or protein sequence homology, which improves their performance. The amount of such evidence continues to grow, but computational methods continue to have difficulty predicting genes when the evidence is conflicting or incomplete. Genome annotation pipelines collect a variety of types of evidence about gene structure and synthesize the results, which can then be refined further through manual, expert curation of gene models.
Results: JIGSAW is a new gene finding system designed to automate the process of predicting gene structure from multiple sources of evidence, with results that often match the performance of human curators. JIGSAW computes the relative weight of different lines of evidence using statistics generated from a training set, and then combines the evidence using dynamic programming. Our results show that JIGSAW's performance is superior to ab initio gene finding methods and to other pipelines such as Ensembl. Even without evidence from alignment to known genes, JIGSAW can substantially improve gene prediction accuracy as compared with existing methods.
Availability: JIGSAW is available as an open source software package at http://cbcb.umd.edu/software/jigsaw
Contact: jeallen{at}umiacs.umd.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Determining the true set of human genes has turned out to be a much harder problem than many expected; the exact number of genes has gradually decreased since the publication of the draft human genome in 2001 (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001), but it has not stabilized. The protein-coding regions of many human genes are generally agreed upon, but even for these, the precise gene structure, comprising the boundaries of all the exons and of the coding sequence, remains less than certain. Evidential support for existing genes varies widely, from tentatively defined to experimentally confirmed. For some rarely expressed genes, the evidence is limited to a small number of expressed sequence tags (ESTs) and to the overlapping but inconsistent predictions of multiple gene finders. For some loci, evidence of expression is absent but evidence from protein sequence alignments to other species strongly suggests the presence of a gene. Many human genes have been carefully confirmed through full-length cDNA sequencing, the remapping of those cDNAs to the original chromosomes is considered the gold standard for defining the true exonintron structure. The numerous genes for which full-length cDNA sequences have not been generated pose an ongoing challenge in our efforts to produce complete and accurate predictions of all genes.
To illustrate this challenge, we consider an example from human chromosome 20, shown (Fig. 1) in a display from the UCSC genome browser, which provides an interface to a collection of programs that have been run on each human chromosome. Figure 1 shows gene predictions from several gene finding programs, plus alignments of both protein and cDNA data from Swiss-Prot (Bairoch et al., 2005), UniGene (Wheeler et al., 2003) and the TIGR Gene Index (Lee et al., 2005). Despite strong evidence for a gene in Figure 1, the various programs disagree on the precise gene structure. It is possible that zero, one, or multiple different predictions are correct. Currently it is left to the user to decide what evidence to use for gene structure prediction.
|
One track shown in Figure 1 is the Ensembl prediction (fourth row from the top), generated by the Ensembl group's automated method for integrating various forms of evidence. Ensembl, one of the leading genome annotation systems, applies a collection of rules to decide when to use the output from different prediction programs, depending on the type of evidence available for a particular gene (Curwen et al., 2004). The rules attempt to filter out unreliable data, leaving only high quality alignments for use in prediction. A strength of this approach is that strict criteria are established to identify high quality alignments, which is particularly useful when complete cDNA sequence can be mapped to the originating chromosome. Applying criteria that are too strict, however, will miss genes that are not supported by expression data (for example), even if those genes are supported by other forms of evidence. One of the goals of JIGSAW is to capture more of these genes through an automated, statistically principled method for weighing the different evidence sources.
JIGSAW uses a manually curated gene set, along with all of the alignments and predictions associated with that set, to collect statistics on the accuracy of the gene prediction evidence. The goal is to provide consistent, reproducible predictions based on sound statistical principles, even in cases of conflicting evidence about gene structure. The program is a successor to our earlier Combiner system (Allen et al., 2004). JIGSAW and Combiner have been used by annotators as the basis for gene calls in several recently sequenced organisms, including Oryza sativa (rice) (Buell et al., in press) and Cryptococcus neoformans (Loftus et al., 2005). This paper describes several new algorithmic developments, explains the relationship between the JIGSAW algorithm and generalized hidden Markov models (GHMMs), and evaluates JIGSAW's prediction accuracy on the human genome. Our experiments show that JIGSAW's accuracy on both exon prediction and whole-gene prediction is superior to other methods.
| 2 SYSTEMS AND METHODS |
|---|
|
|
|---|
Computational gene finding programs are designed to model different aspects of protein coding genes, often using different statistical models for different parts of a transcript. For example, 3-periodic inhomogeneous interpolated Markov models (IMMs) have proven successful at modeling protein coding intervals (Salzberg et al., 1999), and decision trees (Burge and Karlin, 1997) have been used to capture dependencies between non-adjacent nucleotides near splice junctions. JIGSAW is able to take advantage of diverse models and evidence types and to combine them into frame-consistent gene predictions using dynamic programming. Here we define our dynamic programming algorithm for computing an optimally scoring parse of a genome sequence, where the parse directly corresponds to a prediction of exonintronstructure.
We represent the components of a gene with a collection of gene structure labels Y = {Initial, Internal 1, Internal 2, Internal 3, Intron 1, Intron 2, Intron 3, Terminal, Single, Intergenic}, with each element y
Y matching an exon, intron or intergenic region. The three distinct labels for introns and for internal exons (Internal 1, Internal 2 and Internal 3) are required to track the phase in cases where an intron interrupts a codon. Four signal types are allowed: start codons, stop codons, and splice junctions (acceptor and donor sites), which denote the beginning and ending (5' and 3' ends) of introns. Internal exons begin one base downstream of acceptor sites and end one base upstream of donor sites. By default, the acceptor site is an AG dinucleotide, and the donor site is either a GT or GC dinucleotide. As an option, the user can replace the default set of consensus splice sites with a custom dinucleotide set. To specify which DNA strand each gene occurs on, separate labels are defined for each strand (e.g. Initial Plus Strand, Initial Minus Strand and Internal 1 Plus Strand) with the Intergenic label applied to both strands. Distinct genes are not permitted to overlap even if they occur on opposite strands. Without loss of generality we define the problem for predicting genes on one strand, which can be extended to both strands using the expanded set of labels. As implemented, JIGSAW predicts genes on both strands simultaneously.
Let S be a genome sequence and S[i,j] be the subsequence from position i to j (inclusive). A parse of genome sequence S, t = (t0,t1,...,tn) consists of partitions ti = (bi,ei,yi) with subsequence S[bi,ei] assigned label yi and t spanning the entire length of S. The parse covers each nucleotide in the sequence so that bi+1 = ei + 1. For example, the parse for a 1000-base genome sequence containing a single-exon beginning at position 120 and ending at position 730 would be t = [(0, 119, Intergenic), (120, 730, Single),(731, 999, Intergenic)]. The gene prediction problem is to find a parse t to maximize the joint probability of t and S: maxtP(t,S). The problem is made tractable by imposing a Markov assumption, so that label yi in partition ti depends only on the previous label yi1 leading to
.
Figure 2 shows a generalized hidden Markov model (GHMM) to parse genomic sequence using the Markov assumption, where states in the GHMM correspond to labels. A GHMM is defined by a set of states Q with states q and q', and an initial state probability P(q); a set of transitions from q' to q with probability P(q|q'); a set of probabilities P(S[i,j]|q) representing the probability of generating subsequence S[i,j] in state q and a set of probabilities Pq(l) for the likelihood of generating a sequence of length l in state q. Using the GHMM, the joint probability for parse t and sequence S is
![]() |
|
The JIGSAW dynamic programming algorithm finds the most probable parse for S of length l using an lx|Q| matrix D. Moving from left to right in the sequence, the highest scoring series of states ending in position j with state q assigned to the subsequence from i to j is stored in
![]() |
2.1 Representing gene structure evidence
Gene prediction using a set of evidence sources introduces an additional input parameter E, defined to be the gene structure evidence mapping to S. Figure 3 shows an example representation of annotation data for four sources of evidence: two gene prediction programs, GP1 and GP2 (GP2 reports a confidence score of 0.65), and two alignments to expression data with 86% and 95% identity, respectively. JIGSAW allows each evidence source to predict up to six gene features:
- start codon (sta)
- stop codon (stp)
- intron (inr)
- protein coding nucleotide (cod)
- donor (don)
- acceptor (acc)
![]() |
reflects what program x predicts with confidence
on nucleotide S[k,k], and gi,j = gi,j(E,S) = (Bi,...,Bj) are the sets of feature vectors from position i to j. In Figure 3 the feature vector set for k0 is
![]() |
|
The input sequence S determines the type of prediction. For example, the cDNA alignment in Figure 3 is presumed to predict a donor site at k3 since the alignment stops at this position and begins again at position k4. However, the function fk(S,E) checks the sequence to ensure that a consensus splice site occurs at k3, and k4. Programs are assumed to predict a feature only when the feature is consistent with the sequence. The size of each feature vector - m0,...,m5 can differ, so that the set of programs used to predict each type, sta, stp, inr, cod, don and acc, are assumed to be independent. Therefore, programs designed to predict only one part of the gene can be used in addition to sources that predict complete genes. Evidence for introns typically comes indirectly from gene finders and sequence alignment data. For example, in Figure 3, the evidence for an intron from k3 + 1 to k4 1 is implied by three of the four evidence sources, since each source predicts flanking exons.
In Figure 3, positions from k4 + 1 to k5 1 return the same set of feature vector values:
![]() |
To find the parse with the highest-score, maxt P(t,S,E), JIGSAW uses the dynamic programming matrix where D(j,q) is initialized to
![]() |
![]() |
, but modeling Bk poses a problem because the distribution of percent identity values from the sequence alignments cannot easily be combined with the confidence values from the gene prediction programs. Moreover, even if we assume that each of the prediction programs only generates discrete values, the number of parameters grows exponentially with respect to the number of programs in the annotation database.
2.2 Predictions conditioned on input evidence
To improve flexibility in the type and amount of evidence used, an independent conditional probability is defined for each of the six gene features, type = {sta,stp,inr,cod,don,acc}. The independence assumption is justified by the fact that the collection of programs used to predict each gene feature type is assumed to be independent. In practice this is not true, since many programs are used to predict all six features (and the input sequence is the same); however, assuming independence reduces a large number of the statistical parameters. An estimate is made for the probability of a gene feature of type occurring at position k given the feature vector for type at position k: P(typek|vtype).
The function h(q,k,type,S[i,j],Bk)=
![]() |
In general, the probability of state q aligning to subsequence S[i,j] given all the feature vectors gi,j between i and j is the product of probabilities for each set of feature vectors Bk:
, where type enumerates over all six gene features. The model makes the simplifying assumption that each feature vector set, Bk, is independent. Note that the Intergenic state is not explicitly modeled. Intergenic sequence is defined to be the absence of evidence predicting gene features in the sequence:
![]() |
A gene finder like JIGSAW that uses other gene finders as input should build on the success of existing gene finders rather than duplicating their function. Therefore, we construct a probabilistic model to compute the probability of a parse conditioned on the input evidence. [Related work in natural language processing (Sarawagi and Cohen, 2004) has demonstrated useful applications of conditional probabilities in graphical models.] The most probable parse t given S, maxtP(t|S,E) is used to make predictions. The inference algorithm for finding maxt P(t|S,E) uses the dynamic programming matrix
![]() |
For each scored parse, the six feature types contribute to each independent probability, in some cases predicting support for the parse and in other cases predicting support against the parse. Since each possible parse is scored using the same fixed number of independent random feature type events, the length of an exon, intron or intergenic sequence interval is dependent solely on the evaluation from the feature type models and the state transition probabilities.
2.3 Parameter estimation
Feature models are estimated for P(sta|vsta), P(stp|vstp), P(inr|vinr), P(cod|vcod), P(don|vdon) and P(acc|vacc) through enumeration over labeled sequences in a training set. Each feature model, sta, stp, inr, cod, don and acc, is trained independently. As an example, in Figure 3, the index into fk0 for start codons is vsta = (1,0.65,0,0). The training procedure counts the number of times the evidence (1,0.65,0,0) occurs in the training data and the percentage of cases where (1,0.65,0,0) correctly predicts the start codon location. The operating assumption is the more evidence predicting a start codon with high confidence the greater chance that a start codon actually occurs. Thus, the training set should show that when all sources of evidence predict a start codon with high confidence [e.g. vsta=(1,1,1,1)] the probability of a start codon is much higher than when no evidence predicts a start codon [e.g. vsta=(0,0,0,0)].
Simple counting methods do not accurately estimate actual probability values because the sample space is theoretically infinite (in practice it is finite but extremely large). For example, (1,0.66,0,0) may not occur in the training set, while (1,0.65,0,0) happens to occur 50 times showing 80% accuracy, resulting in P(sta|(1,0.65,0,0)) = 0.8 and leaving P(sta|(1,0.66,0,0)) undefined. To avoid the problem of a large sample space, the observed feature vectors are first divided into two groups, accurate and inaccurate. For each feature vector vtype, c(vtype) is the percentage of cases in which vtype is observed to correctly predict type. vtype is assigned to the group accurate if c(vtype) >0.5 and inaccurate otherwise.
A decision tree is induced to partition the feature vector space into subregions, distinguishing feature vectors in the accurate set from feature vectors in the inaccurate set. For the human genome data, JIGSAW uses the OC1 (Murthy et al., 1994) decision tree system to create these trees. Each element of the feature vector is tested for yes or no questions to separate the accurate set from the inaccurate set. From the example in Figure 3, a trivial one-node decision tree is Is GP2's confidence value >0.5?, which partitions the data into two disjoint sets. These sets can be partitioned further by building a larger tree, but OC1 implements procedures to avoid partitioning the data too finely. It does this by withholding 10% of the data to determine when to stop building the tree. As the tree is built, its classification accuracy is tested on the hold-out set, and tree-building terminates when the classification accuracy drops below a threshold.
The decision tree captures in an automated way a set of rules that is similar to those used in a rule based system such as Ensembl, where percent identity cutoffs are used to make a prediction. For example, a simple prediction rule might be that a gene is valid if predicted by one gene finder and by alignment to a non-human RefSeq protein with 98% nucleotide identity. Figure 4 shows example protein coding (cod) feature vectors for non-human RefSeq alignments and non-human cDNA alignments, where a gene finder made an overlapping prediction. Each point in Figure 4 shows the percent identity of the respective alignments. An example is marked + if it is in the accurate class and x if it is in the inaccurate class. The decision tree creates three partitions, which determine the cutoff values. Examples in the training set show that a threshold of 95% for either source of alignments is an accurate predictor of a protein coding region. When both alignments are <95%, the accuracy is questionable.
|
We can still make use of the examples below the 95% percent identity cutoff using the average probability from the individual examples. For feature vector vtype and decision tree ßtype, V = ßtype(vtype) is the list of the feature vectors from training, grouped into a local region. In Figure 4, evaluating the evidence vector (0.58,0.52) returns the set of examples in V3. The probability estimate is the average accuracy of the individual examples in V3. In general, the probability of type given feature vector vtype is P(type|vtype) =
v
Vc(v)/|V|, where |V| is the size of V. In the earliest Combiner implementations (Allen et al., 2004), each evidence source was assigned an independent weight, which was shown to be effective in some cases. However, assuming independence precludes the inclusion of potentially useful sources of evidence, such as predictions from a single gene finder using two different parameter settings. The JIGSAW training procedure addresses the problem of interdependence by evaluating the accuracy of each observed evidence combination. If observed correlated sources produce similar predictions the decision tree induction algorithm can ignore a redundant source. When differences are observed, the decision tree can treat the case where correlated evidence sources overlap differently from the case where the evidence sources do not overlap.
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
We used two test sets to evaluate the accuracy of JIGSAW on human gene prediction. For the first, we selected 1563 genes at random (uniformly distributed among the 24 chromosomes) from a set of 17 477 non-redundant RefSeq genes (Pruitt et al., 2005). We eliminated genes that are known to exhibit alternative splicing, although of course many alternative splice forms are still unknown and therefore some of the 1563 genes may still have multiple splice variants. We estimated accuracy using 3-fold cross validation, training on 2/3 of the data and testing on 1/3, and averaging results for the three experiments. For the second test, we used annotations from the Havana group (Ashurst et al., 2005) in the 44 ENCODE regions (The ENCODE Project Consortium, 2004), which span
1% of the human genome. The 44 ENCODE regions do not overlap with the 1563 genes from our first set and do include known alternatively spliced genes. JIGSAW predictions were based on the sequence using the default consensus splice sites (GT/GC and AG) and a collection of evidence from an annotation database. Annotation data was downloaded from the UCSC genome annotation database (http://hgdownload.cse.ucsc.edu/goldenPath/hg17/database) using NCBI Build 35; examples of this evidence are shown in Figure 1.
The main evidence types are as follows:
- cDNA from human genes
- UniGene transcripts (Wheeler et al., 2003)
- GenBank cDNAs matching Swiss-Prot and TrEMBL proteins (Bairoch et al., 2005) aligned using BLAT (Kent, 2002) to the genome with at least 98% identity
- cDNA sequences from non-human species
- RefSeq genes from non-human species
- the TIGR Gene Index (Lee et al., 2005) (includes both assembled human and related non-human ESTs)
- ab initio gene finders: Genscan (Burge and Karlin, 1997), Geneid (Guigo, 1998), GeneZilla and GlimmerHMM (Majoros et al., 2004)
- Alignment-based gene finders: Twinscan (Flicek et al., 2003) and SGP (Parra et al., 2003) (uses mouse-human sequence conservation)
- Predicted conserved elements from phylogenetic analysis (Siepel and Haussler, 2003)
On average, Ensembl predicts 1.2 isoforms per gene locus, and the alignments matching the Swiss-Prot/TrEmbl proteins generate an average of 1.7 distinct isoforms per test gene. In order not to penalize any of the programs for predicting correct alternatively spliced genes, if any one of the predicted isoforms matched the true gene, the prediction was counted as correct. However, multiple identical predictions at the exon and nucleotide level were not double-counted when computing specificity.
Table 1 shows results for the 1563 test set using different combinations of evidence, along with the two single most accurate evidence sources, Ensembl and the cDNA alignments matching Swiss-Prot/TrEMBL proteins. One objective of this study was to evaluate how closely JIGSAW predictions matched the human-curated set. When JIGSAW has access to the same information as a human curator, the goal should be to report the most accurate gene predictions possible, subject to the constraints of the available evidence. Wherever JIGSAW matches the RefSeq gene, it would appear that the program is matching the ability of human curators. The first row in Table 1, JIGSAWAll, shows JIGSAW output using all available evidence, including the gene finders, cross-species conservation and expression evidence, including known proteins plus Ensembl. Interestingly, while this version yields nearly the best performance, the second row, JIGSAW, shows the results from excluding evidence from the explicit non-human gene expression sources. Excluding the explicit non-human expression evidence results in slightly more genes and exons being missed completely, but the remaining prediction accuracy measures increase.
|
In theory, when JIGSAW uses both Ensembl predictions and the cDNA alignments from Swiss-Prot/TrEMBL as input, it should either match or improve upon the accuracy of the those lines of evidence. The analysis is complicated by the fact that both Ensembl and the cDNA alignments predict multiple isoforms, while JIGSAW predicts (in its current implementation) only a single isoform. This shows up in the number of whole genes predicted correctly (Gene Sensitivity in Table 1), the one accuracy measure where JIGSAW does not have the best results. Predicting multiple isoforms enables Ensembl to predict 62% of the genes exactly correct, and the aligned cDNA correctly capture 65% of the genes, compared with JIGSAW's 59%. For 50% of JIGSAW predictions not matching a RefSeq gene, Ensembl or a cDNA alignment match the test gene while predicting additional isoforms. Since JIGSAW assembles a single isoform, while looking at all of the possible local gene features, the algorithm can merge the predicted isoforms into a single gene. Despite this potential limitation, JIGSAW is able to correctly detect as many or more exons completely correct as the underlying sources while picking up 5% more of the true protein coding nucleotides and completely missing less genes and exons. Moreover, the combination of specificity and sensitivity in JIGSAW is high. With 90% of protein coding nucleotides correctly detected, 66% of JIGSAW's gene predictions exactly match a RefSeq gene, showing a strong balance between detecting genes and making accurate predictions.
Row three in Table 1 (JIGSAWNo Ensembl) shows JIGSAW performance when the Ensembl predictions were excluded from its input. While JIGSAW benefits from Ensembl input, JIGSAW is able to make predictions without Ensembl and achieves comparable performance. The overall number of correctly detected genes drops slightly, but for all other measures JIGSAW's accuracy is superior.
It is important that computational gene finders be able to identify genes even when evidence from known curated human proteins is not available. Table 1 shows results for JIGSAW when we excluded the Swiss-Prot/TrEMBL protein evidence from its input. The first two rows in Table 1 show JIGSAW's results when the curated proteins were not used as evidence, but instead we used expression evidence. Using only human expression data appears to improve performance slightly in most categories, but at the cost of missing 1% more of the genes completely and 1% more of the protein coding nucleotides. Without benefit of the cDNA alignments matching the curated proteins as input, overall performance drops; however, JIGSAW still correctly detects 8889% of the protein coding nucleotides and 4142% of the test genes. The third row in Table 2 shows JIGSAW's performance when we excluded all expression data, using only the gene finders and sequence conservation as evidence. While performance drops still further, the results show that JIGSAW still does better than ab initio gene finders.
|
To further measure specificity in the alternative splice site prediction programs, we looked at the ENCODE regions and the Havana manual annotations for genes with start and stop codons, no in-frame stop codons and consensus splice sites. The set includes 195 loci, with 41% of the loci producing multiple transcripts, for a total of 330 isoforms. Each transcript was compared against the different programs' predictions so that all isoforms of each gene were checked. If an exon or nucleotide was correctly identified by any one of a program's predicted isoforms, we counted it as a correct prediction, but only once. (Incorrect predictions are likewise counted only once.) Results are shown in Table 3. As expected, the alternative isoform predictors (Ensembl and the cDNA alignments derived from Swiss-Prot/TrEMBL proteins) are able to capture a higher percentage of isoforms than JIGSAW. However, both programs predict many more isoforms, and thus have substantially lower specificity than JIGSAW. At the exon level, (Exon Sensitivity and Specificity in Table 3) JIGSAW performs slightly better in both sensitivity and specificity than Ensembl and the Swiss-Prot/TrEMBL cDNA based alignments, suggesting that many of the isoforms involve small changes in the exon structure. JIGSAW's balance between sensitivity and specificity remains strong, with 92% of the protein coding nucleotides correctly detected and 72% of the gene predictions exactly matching a test gene.
|
The ENCODE Gene Prediction Workshop (EGASP, 2005) http://genome.imim.es/gencode/workshop2005.html held in April 2005 was a meeting whose primary purpose was to assess the accuracy of computational human gene predictions in the ENCODE regions. JIGSAW predictions were submitted for comparison with dozens of other methods. Results from the workshop support our findings that JIGSAW is able to create accurate computational predictions of human genes, and in most cases outperform both ab initio methods and other combination methods that use homology in various forms. JIGSAW's gene predictions for the ENCODE regions are freely available for download through the UCSC genome browser (http://genome.ucsc.edu/encode).
| 4 CONCLUSION |
|---|
|
|
|---|
Cataloging the complement of human proteins remains an important yet elusive scientific milestone. Progress continues as the evidence available to support predictions increases. Computational methods need to be able to integrate this new data quickly and effectively. JIGSAW demonstrates the benefit of combining a statistical approach to evidence evaluation with a GHMM based algorithm in order to integrate multiple, often disagreeing, sources of evidence. JIGSAW is flexible with respect to types of input, requiring only that each evidence source produce a list of coordinates on a genome sequence. The program smoothly incorporates DNA and protein sequence alignment scores as well as the confidence values produced by some gene finders. As gene structure evidence continues to grow and improve, furthermore, JIGSAW's accuracy should improve as well.
Finally, JIGSAW's gene finding accuracy benefits from having free access to the rich annotation sources inside genome databases. Public access to genome sequences and annotation not only benefits biologists working on this data, but also speeds development of better computational methods.
| Acknowledgments |
|---|
The authors thank William H. Majoros and Mihaela Pertea for extraction of RefSeq genes and for numerous useful comments and suggestions. This work was supported in part by the National Institutes of Health under grants R01-LM06845 and R01-LM007938. Funding to pay the Open Access publication charges for this article was provided by grant R01-LM06845 to SLS from the National Institutes of Health.
Conflict of Interest: none declared.
Received on June 20, 2005; revised on July 28, 2005; accepted on July 29, 2005
| REFERENCES |
|---|
|
|
|---|
Allen, J.E., Pertea, M., Salzberg, S.L. (2004) Computational gene prediction using multiple sources of evidence. Genome Research, 14, .
Ashurst, J.L., et al. (2005) The Vertebrate genome annotation ({V}ega) database. Nucleic Acids Res., 33, 459465.
Bairoch, A., et al. (2005) The universal protein resource ({U}ni{P}rot). Nucleic Acids Res., 33, 154159[CrossRef].
Buell, C.R., et al. (2005) Sequence, annotation, and analysis of synteny between rice chromosome 3 and diverged grass speices. Genome Res., in press.
Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 7884[CrossRef][ISI][Medline].
Curwen, V., et al. (2004) The Ensembl automatic gene annotation system. Genome Res., 14, 942950
EGASP. (2005) Gene prediction workshop. http://genome.imim.es/gencode/workshop2005.html.
Flicek, P., et al. (2003) Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res., 13, 4654
Guigo, R. (1998) Assembling genes from predicted exons in linear time with dynamic programming. J. Comput. Biol., 5, 681702[ISI][Medline].
International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860921[CrossRef][Medline].
Kent, W.J. (2002) BLATthe BLAST-like alignment tool. Genome Res., 12, 656664
Lee, Y., et al. (2005) The TIGR gene indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res., 33, 7174.
Loftus, B.J., et al. (2005) The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. Science, 307, 13211324
Majoros, W.H., et al. (2004) Tigr{S}can and Glimmer{HMM}: two open source ab initio eukaryotic gene-finders. Bioinformatics, 20, 28782879
Murthy, S.K., et al. (1994) A system for induction of oblique decision trees. J. Artif. Intell. Res., 2, 132.
Parra, G., et al. (2003) Comparative gene prediction in human and mouse. Genome Res., 13, 108117
Pruitt, K.D., et al. (2005) NCBI Reference Sequence ({R}ef{S}eq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 1, 501504.
Salzberg, S.L., et al. (1999) Interpolated markov models for eukaryotic gene finding. Genomics, 59, 2431[CrossRef][ISI][Medline].
Sarawagi, S. and Cohen, W.W. (2004) Semi-markov conditional random fields for information extraction. Proceedings of the Advances in Neural Information Processing Systems, 17 (NIPS 2004)Vancourer, BC, Canada.
Siepel, A. and Haussler, D. (2003) Combining phylogenetic and hidden markov models in biosequence analysis. Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB 2003)Berlin, Germany , pp. 277286.
The ENCODE Project Consortium. (2004) The ENCODE (ENCyclopedia of DNA elements) project. Science, 306, 636640
Venter, J.C., et al. (2001) The sequence of the human genome. Science, 291, 13041351
Wheeler, D.L., et al. (2003) Database resources of the national center for biotechnology. Nucleic Acids Res., 31, 2833
This article has been cited by other articles:
![]() |
Q. Liu, A. J. Mackey, D. S. Roos, and F. C. N. Pereira Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction Bioinformatics, March 1, 2008; 24(5): 597 - 605. [Abstract] [Full Text] [PDF] |
||||
![]() |
Genome Information Integration Project And H-Invit The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts Nucleic Acids Res., January 11, 2008; 36(suppl_1): D793 - D799. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. M. Bergman and H. Quesneville Discovering and detecting transposable elements in genome sequences Brief Bioinform, November 1, 2007; 8(6): 382 - 392. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. DeCaprio, J. P. Vinson, M. D. Pearson, P. Montgomery, M. Doherty, and J. E. Galagan Conrad: Gene prediction using conditional random fields Genome Res., September 1, 2007; 17(9): 1389 - 1398. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Moretti, F. Armougom, I. M. Wallace, D. G. Higgins, C. V. Jongeneel, and C. Notredame The M-Coffee web server: a meta-method for computing multiple sequence alignments by combining alternative alignment methods Nucleic Acids Res., July 13, 2007; 35(suppl_2): W645 - W648. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Xia, J. Bi, and Y. Li Identification of alternative 5'/3' splice sites based on the mechanism of splice site competition Nucleic Acids Res., December 4, 2006; 34(21): 6305 - 6313. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Pierstorff, C. M. Bergman, and T. Wiehe Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA Bioinformatics, December 1, 2006; 22(23): 2858 - 2864. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. J. Hsieh, C. Y. Lin, N. H. Liu, W. Y. Chow, and C. Y. Tang GeneAlign: a coding exon prediction tool based on phylogenetical comparisons. Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W280 - W284. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. M. Wallace, O. O'Sullivan, D. G. Higgins, and C. Notredame M-Coffee: combining multiple sequence alignment methods with T-Coffee Nucleic Acids Res., March 23, 2006; 34(6): 1692 - 1699. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

















