Bioinformatics Advance Access originally published online on January 20, 2005
Bioinformatics 2005 21(9):1776-1781; doi:10.1093/bioinformatics/bti283
Gene finding for the helical cytokines
1Department of Computing, City University London, United Kingdom
2ZymoGenetics Inc. Seattle, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Gene finding remains an open problem well after the sequencing of the human genome. The low gene sensitivity of current methods is a problem for divergent protein families, because fairly accurate exon assemblies are required before sensitive fold recognition algorithms can be applied. This paper presents a new genomic threading algorithm which integrates the gene finding and fold recognition steps into a single process. The method is applicable to evolutionarily divergent protein families that have retained some trace of their common ancestry, number and phase of introns, sizes of exons and placement of structural elements on specific exons. Such conserved structural signals may be visible despite dramatic evolution of protein sequence.
Results: The method is evaluated on the family of helical cytokines by cross-validation sensitivity analysis. The method has also been applied to all intergenic regions of the human genome, and an expression and cloning approach has been coupled with the predictions of the method. Two genes discovered by this method are discussed.
Supplementary information: All data used and the results obtained in the cross-validation analysis are available at http://www.soi.city.ac.uk/~conklin/papers/GT/
Contact: conklin{at}city.ac.uk
| 1 INTRODUCTION |
|---|
Accurate gene finding remains an open problem well after the sequencing of the human genome. Though the low missed gene rate of ab initio gene finding methods makes them valuable for identifying the loci of unknown genes, their low gene sensitivity (in validation studies, the fraction of genes with all exons found and correctly spliced) can result in predictions of limited utility. For the goal of predicting the structure and function of a predicted gene, low gene sensitivity is tolerable for conserved protein families, because even a single partially predicted exon may have homology readily detectable at a high level of statistical significance. For divergent protein families, however, it is essential to have a gene prediction of roughly the correct size and exon composition before applying sensitive homology detection or fold recognition methods. Though limited substitutions of the true exons by the wrong exons of approximately the correct size may be tolerated, most insertions of wrong exons and deletion of true exons will make the fold of the protein unrecognizable.
More specifically, Guigó et al. (2000) have estimated the missed gene rate of Genscan (Burge and Karlin, 1997), consistently shown to be one of the best ab initio gene finding methods, at only 3%, a result that concurs with the estimation of 6% on the annotation of human chromosome 22 (Dunham et al., 1999). The results of this paper show that Genscan has an impressive 0% missed gene rate on the helical cytokines. Regarding the gene sensitivity of Genscan, the results are much less encouraging, with Korf et al. (2001) estimating only 16%, and Dunham et al. (1999) estimating 20%. The results of this paper estimate the gene sensitivity of Genscan at 24% on the multiexon helical cytokine genes. These low gene sensitivity figures work against the success of any approach which decouples the gene finding and fold recognition steps for divergent protein families.
The helical cytokines are a divergent protein family, many members having no identifiable intrafamily sequence similarity. Evidence of evolutionary relationships for divergent members of this family arises from their conserved protein structure, similarities in intron phases (Bazan, 1990; Betts et al., 2001), and broadly similar receptor families (Voßhenrich and Di Santo, 2002; Boulay et al., 2003). The high divergence of the helical cytokines limits the applicability of similarity-based gene finding methods, such as GenomeScan (Yeh et al., 2001), which use homologous protein sequences to guide the gene finding process.
Several helical cytokines are produced or are in clinical trials as biopharmaceuticals for the treatment of human disease, and there is great interest in recognizing new family members in the human genome. This paper presents a specialized gene finding method to identify new helical cytokines directly in the raw genomic sequence.
| 2 METHODS |
|---|
The objective of this research is the development of a new gene finding method with high sensitivity for revealing potential helical cytokines in the human genome. This section outlines the method in detail and an overview of the main steps of the method can be found in Table 1.
2.1 The helical cytokines
A preprotein sequence is a translated mRNA sequence, sometimes containing a signal peptide and without any post-translational processing. The helical cytokine preprotein sequences can be classified into four groups (Conklin, 2004): group 4, type I membrane proteins; group 3, having a long C-terminal extension after the last helix; group 2, having no signal peptide and finally group 1 (Fig. 1). Most of the known helical cytokines fall into group 1 (Conklin, 2004).
2.2 Fold recognition of the helical cytokines
To recognize group 1 helical cytokines, a specialized fold recognition method has been developed (Conklin, 2004). This method aligns seven profiles, including four core helix profiles and three non-structural profiles (Fig. 1), with permitted spacing between them, to a target sequence. Alignments are scored by the product of P-values of component profile alignments. The core profiles are created by a learning algorithm applied to a training set of human helical cytokines. The method can be viewed as a threading (Madej et al., 1995) method, but without pairwise interaction terms between residues in contact. The method achieves a high accuracy for the recognition of divergent helical cytokines. For the present paper, the threshold score for helical cytokine recognition is the lowest score observed during cross validation, i.e. the score at which the model has 100% cross-validated sensitivity at identifying known helical cytokines.
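As a rough illustration of the scoring rule described above, the sketch below combines the P-values of the component profile alignments into a single score. The function name and the negative log10 transform (chosen so that larger scores are better, which fits a threshold defined as the lowest cross-validated score) are assumptions made here; the text states only that the score is the product of the P-values.

```python
import math


def alignment_score(p_values):
    """Score an alignment as -log10 of the product of its component P-values.

    Working in log space avoids floating-point underflow when several
    small P-values are multiplied. The sign convention (larger is better)
    is an assumption of this sketch.
    """
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    return -sum(math.log10(p) for p in p_values)
```

Under this convention, a target would be called a helical cytokine when its score is at least the lowest score seen for any known family member during cross validation.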
2.3 Gene structures of the helical cytokines
To accurately map the gene structures of the helical cytokines, each protein sequence was used as a query to find its longest cDNA sequence in Genbank. Each cDNA sequence was then used to search the human genome sequence for appropriate genomic clones. The coding-frame translated cDNA sequence (including the translation of its 5' UTR) was used to identify the accurate coding exon splicing using the GeneWise program (Birney et al., 2004). By including the translated 5' UTR in this process, the approach addresses the instability of GeneWise on very short coding segments in initial exons, quite common in the helical cytokines.
The mapping from profile to exon number was identified by applying the helical cytokine fold recognition method to each protein sequence. In most cases the profiles placed by the fold recognition method were found to lie completely and unambiguously within exon boundaries. However, there were a few cases where the helix placement predicted by the fold recognition method appeared to span an intron junction. In these cases, the helix was mapped to the exon containing the majority of the predicted helix.
Table 2 presents a compilation of the gene structure classes for the multiexon human helical cytokines. The genes are placed into the same class if their phase pattern and their profile to exon mapping are identical. Only those classes containing at least one group 1 cytokine are presented. The mapping of the seven core profiles (Fig. 1) to coding exon number is presented, using the following encoding: M, initiating Methionine; S, secretory signal sequence; A, B, C, D, four structural helices and *, stop codon. The mapping of the stop codon therefore indicates the exact number of coding exons in the gene.
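The encoding above can be made concrete with a small parser. This is a minimal sketch, assuming the slash-separated "phase pattern/mapping pattern" string form in which patterns are quoted later in the text (for example 020/1223333), one digit per profile in the order M, S, A, B, C, D, *, and single-digit exon numbers. The function and key names are hypothetical.

```python
PROFILES = ("M", "S", "A", "B", "C", "D", "*")


def parse_class(pattern):
    """Split 'phases/mapping' into a phase list and a profile-to-exon map."""
    phases, mapping = pattern.split("/")
    if len(mapping) != len(PROFILES):
        raise ValueError("mapping must give one exon number per profile")
    exon_of = {profile: int(digit) for profile, digit in zip(PROFILES, mapping)}
    return {
        "phases": [int(c) for c in phases],
        "exon_of": exon_of,
        # The exon carrying the stop codon is the last coding exon.
        "n_coding_exons": exon_of["*"],
    }
```

For the pattern 020/1223333 this yields three coding exons, with the initiating Methionine on exon 1, the signal sequence and helix A on exon 2, and helices B, C, D and the stop codon on exon 3.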
Aside from the five genes with unique phase patterns and/or profile to exon mappings, the helical cytokines fall into six major gene structure classes, with the IL3 and the CSF3 classes (both with 5 exons, though with different phase patterns) being the largest. All genes except FLT3LG have the helix D and stop codon on the last coding exon. The BSF3 and the CSF3 classes, and all of the 6 and 7 exon classes have only the initiating Methionine on their first coding exon. The intron sizes in the multiexon helical cytokines vary from as small as 75 bp (in IL12A) to 58 kb (in IL7) (intron size data for individual genes not shown).
The ability of the gapped BLAST method (Altschul et al., 1997) to recognize helical cytokines is also illustrated in Table 2. Each training sequence was used as a gapped BLASTP query against the helical cytokine protein set itself with the parameters Z = 250e6 (to simulate a large database size) and E = 10 (a permissive E-value threshold). The queries that do not contain another helical cytokine within their BLAST output are noted in bold in Table 2. It is apparent that many cytokines have no significant sequence similarity to other family members. Interestingly, many of these divergent genes fall into the CSF2 gene structure class, containing several genes on the 5q31 cytokine cluster.
2.4 Genomic threading method
The fold recognition method for helical cytokine preprotein sequences (Section 2.2) has been adapted to find genes directly within genomic sequence. This adaptation is done in two ways. First, the method is allowed to score partial alignments to the first k profiles, for 1 ≤ k ≤ 7. For each k, the lowest encountered cross-validated score to the first k profiles is used as a cutoff below which a partial sequence is predicted to be a prefix of a full helical cytokine sequence. The cutoff score to the first k profiles is min(k) where min(k) is the minimum cross-validated score to profile k. Second, to enforce profile to exon mappings within a gene structure class, the method can now constrain the placement of profiles to lie in specified regions of a target sequence.
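The per-prefix cutoffs can be illustrated with a short sketch. It assumes that a table of cross-validated partial scores is available (one row per known gene, one entry per prefix length k) and that larger scores are better, in line with the requirement stated for partial solutions that threading scores exceed the cutoff. The data layout and function names are hypothetical.

```python
N_PROFILES = 7


def prefix_cutoffs(cv_scores):
    """Return the cutoff for each prefix length k = 1..7.

    cv_scores maps a gene name to a list whose entry k-1 is that gene's
    cross-validated score against the first k profiles. The cutoff for k
    is the lowest such score over all genes.
    """
    rows = list(cv_scores.values())
    if not rows or any(len(row) != N_PROFILES for row in rows):
        raise ValueError("each gene needs one score per prefix length")
    return [min(row[k] for row in rows) for k in range(N_PROFILES)]


def may_be_prefix(partial_score, k, cutoffs):
    """True if a partial threading to the first k profiles passes its cutoff."""
    return partial_score >= cutoffs[k - 1]
```

Using the minimum as the cutoff means every known family member passes at every prefix length, so pruning on partial scores never discards a training gene.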
The genomic threading (GT) algorithm finds exon splices that are compatible with a known helical cytokine gene structure (having conserved intron phases and similar exon sizes), such that the translation of the spliced exons is predicted to be a potential helical cytokine. The key to performing this efficiently is to use threshold scores for each partial threading as described above, enforcing the profile to exon mapping observed for a gene structure class, and recasting the problem so that the dynamic programming principle can be applied. The remainder of this section describes the GT algorithm in more detail.
Although the GT method could be applied separately for each individual helical cytokine gene structure, a generalized gene model is created for a gene structure class. For each exon in a gene structure class, the observed minimum and maximum exon sizes are expanded to more permissive ranges, and the exon phases are converted to exon frames and remainders.
Given a gene model with n exons, a partial solution of k ≤ n exons is a sequence of exons e1, ..., ek, ordered by increasing donor site position and on the same strand of DNA, that satisfies all of the following conditions:
- profiles map to the exons specified by the gene model,
- all k partial threading scores exceed the cutoff score specified by the fold recognition method,
- the partial GT is reading frame consistent (no stop codons are created at exon junctions),
- the first exon e1 is an Initial exon,
- the frame and remainder of each exon is as specified by the gene model,
- the size of each exon is within the gene model expanded size ranges and
- the intron length is less than a maximum allowed intron size. A permissive 60 kb is the default value.
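The structural conditions in this list can be sketched as a single test applied when a partial solution is extended by one exon. The sketch below covers only strand, ordering, exon size, intron length and frame compatibility; the profile mapping, the threading cutoffs and the check for stop codons at the junction are left out. The frame and remainder definitions are one common convention and are assumptions here, as are all names; coordinates are for the forward strand.

```python
from dataclasses import dataclass

MAX_INTRON = 60_000  # default maximum intron size given in the text


@dataclass(frozen=True)
class Exon:
    start: int   # first base, inclusive
    end: int     # last base, inclusive
    strand: str  # "+" or "-"
    frame: int   # bases to skip before the first complete codon (0, 1 or 2)

    @property
    def length(self):
        return self.end - self.start + 1

    @property
    def remainder(self):
        # Bases left over after the last complete codon.
        return (self.length - self.frame) % 3


def can_extend(prev, nxt, size_range, max_intron=MAX_INTRON):
    """True if exon nxt may follow exon prev under the structural conditions."""
    lo, hi = size_range
    intron = nxt.start - prev.end - 1
    return (
        prev.strand == nxt.strand
        and 0 < intron < max_intron
        and lo <= nxt.length <= hi
        # Leftover bases of prev plus skipped bases of nxt must form a codon.
        and (prev.remainder + nxt.frame) % 3 == 0
    )
```

In a fuller implementation the size range would come from the expanded exon size ranges of the gene model for the exon position being filled.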
For a gene model, there may be many solutions ending in a particular exon, each a possible helical cytokine by meeting the conditions above, including the partial fold recognition cutoff scores. For reasons of efficiency, it is desirable to compute and report only one of these solutions. Preferred solutions will have exon sizes closer to the expected size provided by the gene model, and higher-scoring exon splice signals, as outlined below.
The score of a partial solution e1, ..., ek is defined to be the sum of two independent log likelihood ratios, modeling exon size and exon splice signals:
![]() | (1) |
The score s(e, k) of the highest-scoring partial solution with k exons, ending in a particular exon e, can be derived from Equation (1), leading to the recurrence relation
![]() | (2) |
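One form of Equations (1) and (2) that is consistent with the description in this section can be sketched as follows. The symbol g_i for the exon size log likelihood ratio at position i of the gene model is notation introduced here as an assumption; f(·) is the exon signal score referred to in Section 2.5. This is a reading of the text, not a transcription of the published equations.

```latex
% Assumed notation:
%   g_i(e) : log likelihood ratio for the size of exon e at position i of the gene model
%   f(e)   : log likelihood ratio for the splice signals of exon e
\begin{align}
  \mathrm{score}(e_1, \ldots, e_k) &= \sum_{i=1}^{k} \bigl[\, g_i(e_i) + f(e_i) \,\bigr] \\
  s(e, k) &= g_k(e) + f(e) + \max_{e'} \, s(e', k-1),
  \qquad s(e, 1) = g_1(e) + f(e)
\end{align}
```

In the second line the maximum is taken over exons e' ending a partial solution of k - 1 exons that can be extended by e while still meeting the conditions listed above.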
The recurrence relation of Equation (2) is solved by a branch and bound tree search with redundant path pruning (Winston, 1992). The algorithm maintains a queue of partial solutions found so far along with their scores. A partial solution is taken from the front of the queue and extended with new compatible exons to form new partial solutions. These are compared with existing partial solutions starting from the back of the queue. If a partial solution ending in the same exon is encountered, the new partial solution replaces the existing partial solution if the new score is higher. The comparison of the new partial solution can also terminate when either all existing solutions of length k have been seen (a partial solution with k − 1 exons is encountered during the queue scan) or the start position of its last exon is greater than that of the partial solution in the queue. The queue is therefore always in a sorted state; primarily by increasing the number of exons in partial solutions, and secondarily by increasing the start position of their last exon. Therefore the addition of new partial solutions to the queue can be done very efficiently.
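The pruning rule described above (keep only the highest-scoring partial solution for each number of exons and each last exon) can be sketched with a level-by-level search. This sketch uses one dictionary per level in place of the sorted queue scan, so it illustrates the pruning but not the queue mechanics. The `size_score` and `compatible` callables, and all other names, are hypothetical placeholders for the gene model terms and the partial-solution conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int
    end: int
    f: float  # exon signal score, assumed to be precomputed


def best_per_last_exon(exons, n_exons, size_score, compatible):
    """Return {last exon: (score, path)} for solutions with n_exons exons.

    size_score(e, k) scores exon e as the k-th exon of the gene model.
    compatible(prev, e, k) says whether e may be the k-th exon after prev
    (prev is None when k == 1).
    """
    ordered = sorted(exons, key=lambda e: e.start)
    level = {
        e: (size_score(e, 1) + e.f, (e,))
        for e in ordered
        if compatible(None, e, 1)
    }
    for k in range(2, n_exons + 1):
        nxt = {}
        for prev, (score, path) in level.items():
            for e in ordered:
                if e.start <= prev.end or not compatible(prev, e, k):
                    continue
                cand = score + size_score(e, k) + e.f
                # Redundant path pruning: one best path per (k, last exon).
                if e not in nxt or cand > nxt[e][0]:
                    nxt[e] = (cand, path + (e,))
        level = nxt
    return level
```

Because only one path is kept per last exon at each level, the number of stored partial solutions is bounded by the number of candidate exons per level.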
The GT method can find multiple genes in a sequence. Some of these may be suboptimal, sharing one or more overlapping exons with other solutions. To remove this redundancy, all the genes found are clustered by the overlap of the corresponding exons, and only the highest-scoring gene in a cluster is reported. To apply GT to the whole human genome, gene models for all 11 gene structure classes of Table 2 are applied separately to the database. This may lead to some redundancy in the results, in that two or more gene models might find overlapping variants of the same predicted gene.
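The redundancy removal step can be sketched as follows. The sketch assumes that two predicted genes belong to the same cluster when any of their exons overlap, and that clusters are merged transitively; the text does not specify the clustering rule in this detail. The dictionary layout and function names are hypothetical.

```python
def intervals_overlap(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def genes_overlap(g, h):
    return any(intervals_overlap(x, y) for x in g["exons"] for y in h["exons"])


def best_per_cluster(genes):
    """Cluster genes by exon overlap and keep the top-scoring gene of each."""
    clusters = []
    for g in genes:
        merged, rest = [g], []
        for cluster in clusters:
            if any(genes_overlap(g, h) for h in cluster):
                merged.extend(cluster)  # g links this cluster to the others it hits
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return [max(cluster, key=lambda x: x["score"]) for cluster in clusters]
```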
All predictions within a genome are BLASTed against the Refseq database to filter out known proteins (Refseq NP entries). Mining of the remaining results requires intensive study because many predictions have highly significant yet uninformative BLAST hits to other Genscan predictions, incompletely specified patent sequences, pseudogenes, and translated partial cDNA sequences.
2.5 Exon identification
To identify and score candidate exons in the genomic sequences, the Geneid (v1.1a) software (Guigó, 1998) is used. On the human helical cytokines, Geneid was found to have an excellent sensitivity on exon prediction, missing only one exon of one gene (IL23) though with a low overall specificity (data not shown). The log likelihood exon scores reported by Geneid were converted from base 2 to base e and form the f(·) term of Equation (2).
Both the specificity and the efficiency of the GT method are reduced with increasing numbers of false positive exons predicted. Therefore, exon score thresholds for Geneid were calibrated to achieve just 100% exon sensitivity (aside from the one difficult exon of IL23) on all exons in known human helical cytokines (for internal exons, a Geneid minimum exon score of 4.24, and all exons at most 600 bp in length).
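The exon filtering and score conversion described in this section can be sketched as below. The sketch assumes that the 4.24 threshold applies to the raw Geneid score before conversion and that exon types are available as simple labels; thresholds for other exon types are not given in the text, so none are applied. All names are hypothetical.

```python
import math

MIN_INTERNAL_SCORE = 4.24  # Geneid minimum score for internal exons (from the text)
MAX_EXON_LENGTH = 600      # bp, applied to all exons (from the text)


def keep_exon(exon_type, geneid_score, length):
    """Apply the length cap to every exon and the score cutoff to internal exons."""
    if length > MAX_EXON_LENGTH:
        return False
    if exon_type == "Internal" and geneid_score < MIN_INTERNAL_SCORE:
        return False
    return True


def to_natural_log(score_base2):
    """Convert a base-2 log likelihood score to base e: ln(x) = log2(x) * ln(2)."""
    return score_base2 * math.log(2)
```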
2.6 Investigative cloning
An investigative cloning approach has been developed to test and clone the predictions of the GT method. Oligonucleotide primers for PCR screening of cDNA cloning sources were designed in two separate exons of each threading prediction to avoid generating false positives from the presence of contaminating genomic DNA in the cDNA samples. When possible, primers were designed to predicted exons and/or helices of higher confidence. To evaluate the primer sensitivity and specificity, a synthetic DNA fragment was created with primer binding sites and spiked into genomic DNA at the level expected for a single copy gene. PCR on a dilution series of this spiked genomic control was then performed. If the primers did not produce a specific product on at least 100 pg of this control DNA, they were not used. In the case of primers that had <1.5 kb of genomic DNA between their binding sites, it was not necessary to create a synthetic control fragment.
For each candidate, a standard set of cDNA panels was screened, consisting of cDNA made from 111 cell lines and 313 tissues representing the most diverse sample set possible. Each PCR reaction was scored for potential positives and those products were subcloned for sequence verification. For those threading candidates producing positives in the cDNA screening experiment, standard 5' and 3' RACE techniques were performed to elucidate the full-length transcripts.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Evaluation by cross validation
The GT method has been evaluated by a leave-one-out cross-validation study using all 38 group 1 helical cytokines of Table 2. The validation set for these 38 genes had a mean length of 53 kb per genomic sequence. A total of 70K exons were predicted (including both strands) in the validation set. For every gene in the validation set, the helical cytokine fold recognition model described in Section 2.2 was retrained without this gene in the training set. This produces an ablated model and new partial threading cutoff scores. Furthermore, gene model exon size parameters for the relevant gene model are recreated without this gene.
The standard gene finding algorithm evaluation measures of nucleotide, exon and gene sensitivity (Guigó et al., 2000) do not accurately convey the ability of a method to splice together exons into a gene of approximately the correct size and recognizable by a fold recognition method. Therefore, a new measure called threading sensitivity has been developed to evaluate the performance of gene finding methods on the helical cytokines. A helical cytokine gene is said to be recognized if a predicted gene is predicted to be a helical cytokine, and overlaps with one or more true exons. The threading sensitivity of a method is the fraction of genes that can be recognized by the method.
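The definition of threading sensitivity can be written down directly. In this minimal sketch each validation case pairs the true exons of a gene with the predictions made on its genomic sequence; the `is_helical_cytokine` flag stands for the outcome of the fold recognition step, and the data layout is hypothetical.

```python
def intervals_overlap(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_recognized(true_exons, predictions):
    """A gene is recognized if some prediction is called a helical cytokine
    and overlaps at least one true exon."""
    return any(
        p["is_helical_cytokine"]
        and any(intervals_overlap(e, t) for e in p["exons"] for t in true_exons)
        for p in predictions
    )


def threading_sensitivity(cases):
    """Fraction of validation genes that are recognized.

    cases is a list of (true_exons, predictions) pairs, one per gene.
    """
    if not cases:
        raise ValueError("at least one validation gene is required")
    return sum(is_recognized(t, p) for t, p in cases) / len(cases)
```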
Table 3 shows the results of the cross-validation study, for the group 1 multiexon helical cytokines. For the GT method, all 11 gene models are applied to each validation sequence, and the results are concatenated. For comparison purposes, three other gene finding methods were also run on the same dataset: Fgenes v1.6 (Solovyev et al., 1994), Genscan v1.0 (Burge and Karlin, 1997) and Geneid v1.1 (Guigó, 1998), all with default parameter settings. To simulate a realistic scenario, all the gene finding methods (including GT) were directed to report genes on both the forward and reverse complement strands of the validation sequences, even though all genes are oriented on the forward strand.
The results indicate that GT performs well in terms of gene sensitivity, accurately predicting 12 genes, compared with 9 correct genes for each of Genscan and Fgenes. In terms of the ability to recognize the helical cytokine fold in predicted genes (threading sensitivity), GT outperforms the other gene finding methods tested (34 of 38 recognized). This result is expected, since GT has integrated the gene finding and fold recognition steps specifically for this gene family. However, the result is not at the expense of total genes found: GT predicts a lower number of genes in the validation set than the other gene finding methods. For the divergent subset of 17 helical cytokines, GT also outperforms the other methods (14 of 17 recognized), indicating the power of integrating fold recognition with gene finding.
The four genes that are missed by GT have arisen from three unrelated phenomena. The IL22 gene is missed because the predicted helices mapped to this protein did not exactly conform to the IL3 gene structure class (Table 2, exon mapping pattern 1113355), though IL22 was placed in that class due to the sequence homology with other members. Its threading score with profile to exon mapping constraints enforced is below the cross-validated threshold. The IL12A and FLT3LG genes are missed because they are singleton gene structure classes and are not a solution for any other gene model. Though the other singletons (IL9, CSHL1, IL15) are found by the method, FLT3LG is substantially different from other gene structure classes in that it has only the stop codon profile on its last exon, and IL12A in that it is the only helical cytokine with 7 exons. Finally, IL23 is not found because it has an unusual donor site for exon 1; this exon is not found by Geneid, and all other exon/gene finding methods tested have difficulty with this exon (data not shown).
The cross-validation study provided some data on the complexity of the GT method. Time complexity of the branch and bound tree search algorithm can be defined in terms of the total number of exons considered by the algorithm for queue extension (Section 2.4), as a multiple of the raw number of exons in the sequence. Over all the 11 gene models and 38 validation sequences (both strands), the mean number of exons considered for queue extension within one strand of a sequence was only six times the total number of identified exons in a strand. Over all the tests, the queue grew to at most 200 partial solutions, indicating that the memory requirements of the branch and bound search are modest.
3.2 Results on the human genome
The Ensembl (v22.34d) database (www.ensembl.org), using the Ensembl Perl API, is used to extract slices of the human genome between known genes. This extraction procedure is intended to deflect the method away from predicting genes that overlap with or span known genes, and therefore to direct the method to areas of the genome with potentially novel or incompletely specified genes. All the intergenic regions are masked for repetitive DNA sequence. The derived intergenic database contained approximately 18 000 intergenic regions, with mean length 113 kb. All the exons for these sequences (both strands) are computed (Section 2.5) and stored in a database comprising about 30M exons.
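The extraction of intergenic regions can be illustrated independently of any database API. The sketch below takes known gene intervals on one chromosome and returns the gaps between them; it does not use the Ensembl Perl API and does not perform repeat masking. The function name and the 1-based inclusive coordinate convention are assumptions.

```python
def intergenic_regions(chrom_length, genes):
    """Return the regions of a chromosome not covered by any known gene.

    genes is an iterable of (start, end) pairs, 1-based and inclusive,
    in any order and possibly overlapping.
    """
    regions = []
    cursor = 1  # first base not yet covered by a gene seen so far
    for start, end in sorted(genes):
        if start > cursor:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= chrom_length:
        regions.append((cursor, chrom_length))
    return regions
```

Sorting the genes first and tracking the furthest gene end seen so far handles overlapping and nested genes without special cases.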
An initial application of the GT algorithm to the exon database yields 1300 predicted genes in the human genome. A sample of ten high-scoring GT candidates was chosen for detailed study in our cloning approach (Table 4). The following presents in more detail two genes, called G30 and G32, discovered by this method.
The genes G30 and G32 are both 3-exon predictions fitting the LIF gene model (Table 2: phase/mapping pattern 020/1223333). Primers were designed for each and these produced PCR products that were subject to further RACE extensions. The cloned gene for G30 had a slightly longer N-terminus than the original prediction (Fig. 2), but still fit a structural prediction for a helical cytokine and matched the CNTF gene model (Table 2, phase/mapping pattern 00/1112222). In addition, at least one splice variant was indicated at this locus. For G32, the cell line assay showed that G32 mRNA was present in many of the cell lines on the panel (Fig. 3), and that G32 was not a rare transcript, at least by this assay. Two variants of G32 were cloned, one without a putative signal sequence but otherwise containing a possible four core helices, and another a long transcript which no longer satisfied the helical cytokine fold recognition model.
For all candidates in the expressed set, neither the Genscan predictions nor the GT predictions were entirely correct. Both G30 and G32 have been cloned to full-length transcripts containing reading frames that contain the necessary helical cytokine core helices; this was not obvious from the overlapping Genscan predictions at these loci. However, because neither of these two transcripts revealed a putative signal peptide, and because both had splice variants not conforming to the helical cytokine structure, it is doubtful that they represent new helical cytokines.
The genes in our initial study were found to cluster into two categories: one set of five candidates that had expression of mRNA related to the initial prediction and the other set which had no apparent expression of transcript (Table 4). All the candidates in the expressed set had a putative mouse ortholog, defined as the existence of a significant HSP using the GT prediction as a TBLASTN query of the mouse genome. Partially overlapping Genscan predictions existed for all five of the expressed set, and for two of the non-expressed set. Based on the sample of candidates studied above, for further practical expression and cloning work it was subsequently decided to retain only those predictions having some overlap with a Genscan predicted gene and with a putative mouse ortholog. This reduces the set of predictions to a more manageable number (a total of 230 predicted genes), on which cloning experiments are ongoing.
| 4 DISCUSSION |
|---|
This paper has developed and evaluated a new method for finding new helical cytokine genes in the human genome. One important result of this study was the compilation and presentation of gene structures for human helical cytokines. A cross-validation analysis of the GT method showed that it achieves high sensitivity with a manageable number of false positives. The investigative cloning approach developed has shown its potential with several novel genes predicted by the method.
The GT method uses conserved gene structure information and conserved placements of structural profiles to exons, with the goal of achieving high sensitivity on identifying new members of divergent gene families. The basic idea of exploiting conserved gene structure information has been explored previously by Brown et al. (1995), who use exon structure as an additional signal during sequence comparison of proteins with known genomic structure. The work presented here extends the use of gene structure conservation to the problem of gene finding of novel genes in genomic sequence.
One important caveat of the expression and cloning strategy is that one must choose only two exons of the prediction to be used for primer design. If the primers are designed to a predicted exon that is missing from the true transcript, then a false negative will result. However, assaying for individual exons in a prediction is complicated by the extreme difficulty of generating reagents that are completely free of contaminating genomic DNA. The approach used instead was to design inter-exon primers to more confident exons, using data generated from the GT prediction, any overlapping Genscan predictions, and placement of structural core profiles. It is recognized, however, that false negative results may arise in the non-expressed set because of primer placement. Also, assays for genes that may be real, but are strictly spatially and temporally regulated, will generate false negatives if those tissues and cells are not in the sample set used for screening.
The objective of this research is the development of a new gene finding method with high sensitivity for revealing potential helical cytokines in the human genome. Though cross-validation analysis indicates that this objective has been achieved, there are several ideas that can be explored to potentially increase the specificity of the approach without affecting its sensitivity. For example, one idea is to integrate comparative genomics directly into the computation of the exon database, by masking human genomic regions that cannot be aligned to a comparative genome. Another idea is to augment gene models with nucleotide models; e.g. models for transcription start sites or for 3'-UTR patterns found in many helical cytokine genes.
The GT method is expected to have a high sensitivity for identification of new helical cytokines in the human genome, subject to the important qualifications that they are group 1 multiexon helical cytokine genes, and that their gene structure is isomorphic or nearly so to that of a known helical cytokine. The method could be extended to other protein families with features similar to the helical cytokines: families with a few highly populated gene structure classes and with conserved structural core profile to exon mappings.
| Acknowledgments |
|---|
We wish to thank Lennie Chen, Teresa Gilbert, and Kimberly Shoemaker for their excellent technical contributions to this work, the Nucleic Acids Technology group at ZymoGenetics for exceptional sequencing and oligonucleotide synthesis support; Brian Fox and Scott Presnell for their dedicated support of bioinformatics tools and sequence databases at ZymoGenetics; David Adler for assistance with the preparation of Figure 2 and Patrick O'Hara.
Received on September 20, 2004; revised on January 7, 2005; accepted on January 18, 2005
| REFERENCES |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Bazan, J.F. (1990) Structural design and molecular evolution of a cytokine receptor superfamily. Proc. Natl Acad. Sci. USA, 87, 6934-6938.
Betts, M.J., Guigo, R., Agarwal, P., Russell, R.B. (2001) Exon structure conservation despite low sequence similarity: a relic of dramatic events in evolution? EMBO J., 20, 5354-5360.
Birney, E., Clamp, M., Durbin, R. (2004) GeneWise and Genomewise. Genome Res., 14, 988-995.
Boulay, J.-L., O'Shea, J.J., Paul, W.E. (2003) Molecular phylogeny within type I cytokines and their cognate receptors. Immunity, 19, 159-163.
Brown, N.P., Whittaker, A.J., Newell, W.R., Rawlings, C.J., Beck, S. (1995) Identification and analysis of multigene families by comparison of exon fingerprints. J. Mol. Biol., 249, 342-359.
Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78-94.
Conklin, D. (2004) Recognition of the helical cytokine fold. J. Comput. Biol., 11, 1189-1200.
Dunham, I., Shimizu, N., Roe, B.A., Chissoe, S., Hunt, A.R., Collins, J.E., Bruskiewich, R., Beare, D.M., Clamp, M., Smink, L.J., et al. (1999) The DNA sequence of human chromosome 22. Nature, 402, 489-495.
Feng, Y., Klein, B.K., McWherter, C.A. (1996) Three-dimensional solution structure and backbone dynamics of a variant of human interleukin-3. J. Mol. Biol., 259, 524-541.
Guigó, R. (1998) Assembling genes from predicted exons in linear time with dynamic programming. J. Comput. Biol., 5, 681-702.
Guigó, R., Agarwal, P., Abril, J., Burset, M., Fickett, J. (2000) An assessment of gene prediction accuracy in large DNA sequences. Genome Res., 10, 1631-1642.
Korf, I., Flicek, P., Duan, D., Brent, M. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics, 17, S140-S148.
Madej, T., Boguski, M.S., Bryant, S.H. (1995) Threading analysis suggests that the obese gene product may be a helical cytokine. FEBS Lett., 373, 13-18.
Solovyev, V., Salamov, A., Lawrence, C. (1994) Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res., 22, 5156-5163.
Voßhenrich, C. and Di Santo, J. (2002) Interleukin signalling. Curr. Biol., 12, R760-R763.
Winston, P.H. (1992) Artificial Intelligence, 3rd edn. Addison-Wesley, Reading, MA.
Yeh, R., Lim, L., Burge, C. (2001) Computational inference of homologous gene structures in the human genome. Genome Res., 11, 803-816.