Bioinformatics Advance Access originally published online on May 27, 2005
Bioinformatics 2005 21(16):3347-3351; doi:10.1093/bioinformatics/bti521
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Quantifying optimal accuracy of local primary sequence bioinformatics methods
Department of Physics, Williams College Williamstown, MA 01267, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Traditional bioinformatics methods scan primary sequences for local patterns. It is important to assess how accurate local primary sequence methods can be.
Results: We study the problem of donor pre-mRNA splice site recognition, where the sequence overlaps between real and decoy datasets can be quantified, exposing the intrinsic limitations of the performance of local primary sequence methods. We assess the accuracy of primary sequence methods generally by studying how they scale with dataset size and demonstrate that our new primary sequence ranking methods have superior performance.
Availability: Our primary sequence ranking analysis tools are available at http://rna.williams.edu/
Contact: aalberts{at}williams.edu, +1-413-597-3520
Supplementary information: Supplementary data for this paper are available on Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
The chief contributions of bioinformaticists are the powerful tools which allow related genomic sequences to be identified. When trained on related sequences, even as few as one, simple bioinformatics methods have been remarkably successful at matching patterns to identify related sequences in other contexts or organisms. Techniques ranging from local sequence alignment to simple pattern-based models now permit researchers to readily identify consensus sequences, homologies and evolutionary histories.
One of the Grand Challenge bioinformatics problems is accurate gene finding (Collins et al., 2003). The expressed coding sequences (exons) in higher eukaryotes are interrupted by non-coding intervening sequences (introns). Within the precursor mRNA transcript, exons (E) and introns (I) alternate; the splicing process removes the introns, leaving the exons to form mature mRNA:
![]() |
Primary sequences are scanned for local patterns, generally not taking into consideration non-local contacts or folding. One straightforward measure for scoring an RNA string's similarity to a set of aligned sequences is the Weight Matrix Model, or WMM (Staden, 1984). The WMM extracts single nucleotide probabilities pi(xi) for each position i from a training set, with xi
{A,C,G,U}. Then, to score a biological sequence X = x1x2x3
xL, the probabilities of each position are multiplied,
![]() | (1) |
Many other local primary sequence models (Zhang and Marr, 1993; Burge and Karlin, 1997; Yeo and Burge, 2003; Garland and Aalberts, 2004) have been employed to find splice sites. The more sophisticated models not only account for interdependencies between positions, but also require more training data to achieve the greater reliability. All local primary sequence methods misdiagnose splice sites; in fact, we shall demonstrate that perfect accuracy can never be achieved.
An alternative approach to biostatistical problems arises when considering the analogy between biological sequences and English language words. A WMM-type approach to identifying language words uses the probabilities of individual letters [p(e) > p(t) >
> p(z)]. Correlations (e.g. qu or th) extracted from a relatively small list of words can also be included (Zhang and Marr, 1993; Bussemaker et al., 2000). These are reasonably good predictors of such probabilities and correlations in other English words. If one can refer to an actual dictionary (Pachter et al., 1999; Laub and Smith, 1998), other questions can be asked since each dictionary entry is a genuine word with 100% certainty. Our method benefits from this complementary approach.
In this paper, (1) we demonstrate why there are intrinsic limitations to every local primary sequence method, (2) we quantify an upper bound on the maximum accuracy of donor splice site identification and describe new methods with superior performance and (3) we compare the effectiveness of many models as a function of theamount of training data.
| 2 DONOR SPLICE SITE DATASET |
|---|
|
|
|---|
Statistical methods are subject to the limitations of their datasets. Let us begin by examining the characteristics of a large donor splice site database compiled by Yeo and Burge (2003) available at http://genes.mit.edu/burgelab/maxent/ssdata/. A nine-base sequence (the final three nucleotides of the exon, the consensus GU and four more bases of the intron) appears in each case. The spliceosome's U1 snRNA binds to approximately this much of the pre-mRNA (Hodas and Aalberts, 2004).
The YeoBurge (YB) training dataset for the donor splice site contains 8415 real annotated human splice sites (with no alternative splicing). There are also 179 438 decoy splice sites, generated by taking the other RNA strings of the desired length which meet the consensus requirements for splicing (the first two letters of the intron are GU) but which are not annotated as splice sites. Note that there are 47 = 16 384 possible nine-letter sequences (with GU consensus).
The extent of overlap in the real and decoy datasets is perhaps our most startling observation. Of 8415 real sequences, 1191 are unique; surprisingly, only 168 of the 8415 are not also in the decoy dataset. Fully 98% of the real sequences are also decoy sequences!
This stunning overlap makes it abundantly clear that local primary sequence alone is insufficient to separate reals from decoys, illuminating the need to consider other factors. Accepting a real also admits those decoys with the same sequence. Thus, at least for the problem of donor splice site recognition, the accuracy of every local primary sequence method is necessarily imperfect.
Nevertheless, it is still of fundamental interest to assess the relative merits of different primary sequence methods. In the following section, we will propose a new approach which complements existing methods and permits the calculation of an upper bound on the optimal local primary sequence performance.
| 3 PRIMARY SEQUENCE RANKING |
|---|
|
|
|---|
The goal of statistical methods is to maximize the number of true positives while minimizing the number of false positives. The most direct way to achieve this goal in this case is simply to rank order the sequences X according to the probability X is + = true,
![]() | (2) |
Lower values of P(+|X) have a greater number of decoys per real. As is done in implementing Equation (1), a threshold probability PC is chosen and test sequences with P(+|X)
PC are accepted as true and the others are predicted to be false. The accuracy of the PSR method is assessed using receiver operating characteristic (ROC) analysis (Egan, 1975). The true positive rate = (true positives)/(all reals) and the false positive rate = (false positives)/(all decoys) are plotted by varying the cutoff PC. A greater true positive rate at a given false positive rate indicates a more accurate method.
The upper bound for accuracy, the one imposed by the overlap of real and decoy datasets, can be attained using the PSR approach using the training data both to train and to test the model. We notate such auto-validation, i.e. models trained and tested with the same data, using a subscript 0. As we see in Figure 1, the PSR0 outperforms all other auto-validated models. Auto-validation is useful for demonstrating that reals and decoys cannot be perfectly separated with any local primary sequence method, but it does not reveal how the method will perform on unknown data.
|
To evaluate actual performance, we must cross-validate using fresh test data, distinct from the training data. For this purpose, we also employ the YB test dataset of 4208 real and 89 717 decoy sites. In Figure 1, the performance of the PSR on fresh data is compared with competing statistical models, such as WMM [Equation (1)] and maximum entropy modeling (MEM), a recent model that includes all dinucleotide dependencies (Yeo and Burge, 2003). It is evident that the PSR is undertrained, not reaching high true positive rates because additional unique real sequences appear in the test dataset (Section 4.2).
Our PSR method uses the full nine-base primary sequence as its basic unit. This is in contrast with WMM-type methods which bundle together single nucleotide probabilities. The PSR score depends only on the sequence in question, and one simply looks up the score for each X. This procedure is like using a dictionary to determine if a word is a part of a language, so it is very easy to implement. However, we know there may be other real words which do not appear in abridged dictionaries.
A sequence cannot always be definitively labeled by the real and decoy dictionaries since it may be in both. Furthermore, even though the datasets are becoming large, the PSR dictionary will remain abridged until all primary sequences are known (Section 4.2). To improve accuracy, we now extend our ranking method to accommodate the possibility of incomplete information.
Consider test sequences Y and Z which are not members of the real training dataset [i.e. R(Y) = R(Z) = 0]. The sequences near Y appear often in the real training set, but no reals are to be found near Z. Common sense tells us that Y is more likely real than Z. We improve the ranking by devising a smoothing technique which involves nearby sequences. Nearness measures have been developed previously to predict splice sites (Ting et al., 2000); we instead use nearness information to smooth our splicing dataset.
We smooth the YB data by adding pseudocounts from sequences one or two substitution mutations away. For example, the single mutation dataset,
![]() | (3) |
nearest neighbors X'. Recall that in the YB donor dataset, sequences vary at seven positions. The factor w' determines the relative weight of the neighbors and is varied to optimize the results (the scaling of w' and w'' with training dataset size is presented in the online Supplementary data). R'(X) and D'(X) are computed for all 47 sequences, allowing us to calculate P'(+|X) for our PSR' method. Further dataset expansion is done by considering
double mutations X'' and constructing
![]() | (4) |
|
Figure 1 demonstrates that performance improves after smoothing the datasets. The PSR' model is our top performer. All PSR methods can be accessed at our http://rna.williams.edu/ website.
| 4 SCALING |
|---|
|
|
|---|
4.1 Model performance
The quantity of training data is important to consider when choosing a model. In this section, we compare the performance characteristics of different models as a function of the training dataset size. The 2D nature of an ROC curve complicates the analysis; however, in Figure 1, we see that with transformed axes (Egan, 1975) the full ROC curves are straight, so their information can be captured by a 1D measure. The optimal binary Pearson correlation coefficient (Matthews, 1975; Burset and Guigó, 1996) between prediction and reality is the 1D measure we use:
![]() | (5) |
Figure 2 shows how the models perform with varying training dataset size. In this figure, C is plotted as a function of the number of reals N = #R for several different models; our number of training decoys is (179 438/8415) N to preserve the ratio of reals to decoys. For cross-validation, the test dataset size is fixed at 4208 reals and 89 717 decoys, randomly selected, for all N.
|
For small datasets, the uncorrelated WMM is best, but its performance also saturates earliest. The PSR models outperform the WMM and other statistical models (Yeo and Burge, 2003), but require more data to do so. For small N, using nearby sequences to enhance the dataset is an effective strategy; more the smoothing, the better. Figure 2 shows that the 2-edit PSR'' model is the top performer when 100
N
3000. With 103
N, the lack of specificity of double mutations is detrimental. The 1-edit PSR' prevails with a higher C for 3000
N. To estimate the upper bound of local primary sequence accuracy we also show how the auto-validated method (i.e. training and testing with the same dataset) scales with dataset size in Figure 2. The auto-validated PSR0 is the best at separating a few reals from decoys, but as the datasets grow, the number of overlaps increases. The gap between the performance of local primary sequence models and their auto-validated upper bounds narrows as N increases. Note that our cross-validated PSR models exceed the auto-validated MEM0 and WMM0. This saturation makes it clear that other non-local information is the only possible source for major improvements in accuracy.
4.2 Unique sequences
In Figure 3, we plot the number of unique real sequences #u R as a function of the total number of real sequences N = #R in the dataset. This is a monotonic function, rising quickly at first when there are fewer unique sequences to duplicate. Interestingly, it is clearly far from saturation, with a positive slope even at a total of 12 623 real sequences (the pooled YB train and test sets).
|
This means that local sequence methods will need to accommodate the possibility of new patterns. Our smoothing methods are designed to achieve this goal.
One can model the distribution of real donor splice sites with a simple thermodynamic theory. Assume that the sequences are Boltzmann distributed around the U1 consensus match. Any mutation will cost some free energy g, so after n mutations we have a free energy
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
To estimate the number of unique sequences, we will use the Poisson distribution. The average number of instances for a given sequence is pn N, so the Poisson probability of that sequence appearing is
![]() | (10) |
![]() | (11) |
| 5 CONCLUSION |
|---|
|
|
|---|
Bioinformatics datasets possess only local primary sequence information for practical reasons, so it is important to consider the limitations of local primary sequence methodologies. Our scaling study reveals the upper limit of local primary sequence models (extrapolating to N
yields C = 0.676 for locating donor splice sites), and our smoothed PSR' model was able to achieve the best performance of any method (C = 0.668), far surpassing traditional methods (see Table 2). The PSR methods make full use of the available data and are extremely easy to implement.
|
Other smoothing strategies may also be employed. In nature, mutations favor transitions (G
A or C
T) over transversions (purine
pyrimidine); this half-prime PSR gives C = 0.664, intermediate between PSR and PSR' (see Table 2). The smoothing of Equation (3) can also be used in conjunction with WMM to make a WMM' with C = 0.606, a small improvement in performance. We emphasize that the performance of our PSR' method is now very close to the asymptotic upper bound for donor splice sites. Further improvements in accuracy will need to come, not from minor adjustments to the details of the methods but from going beyond the local primary sequence paradigm. The YB dataset for acceptor splice sites at the 3' end of the intron contains 23 characters and consequently has a huge space of 423 = 7 x 1013 sequences, much greater than the training 8465 reals or 180 957 decoys. PSR-type methods only become viable when the sequence space is well covered with training data. The acceptor splice dataset does not meet this requirement, so WMM-type methods are more appropriate.
Local primary sequence data are insufficient to determine if a given location will be a splice site. The substantial overlap of real and decoy donor sets proves this. There are even overlaps between real and decoy acceptor sites, in spite of its giant sequence space. Additional information away from the exon/intron junction is needed to put a definitive label on a site. This is consistent with the approaches used to construct gene finding algorithms, which, in addition to the local information at the splice site, also rely on other signals such as codon biases or intron length (Burge and Karlin, 1997). We must look next to non-local factors, such as preexisting RNA secondary structure (Patterson et al., 2002), the effect of splicing enhancers or silencers (Blencowe, 2000; Fairbrother et al., 2002; Del Gatto-Konczak et al., 1999) and proximity of donor, acceptor and branch point signals within the intron (Lim and Burge, 2001).
| Acknowledgments |
|---|
The authors thank Gene Yeo and Chris Burge for providing their datasets. Jeff Garland first noticed the dataset overlaps and provided useful discussion. Andrea Danyluk, Dick DeVeaux, Steve Sousa and Mehdi Yahyanejad also contributed to our thinking. DPA was supported by National Institutes of Health grant GM068485.
Conflict of Interest: none declared.
Received on October 27, 2004; revised on May 25, 2005; accepted on May 25, 2005
| REFERENCES |
|---|
|
|
|---|
Blencowe, B.J. (2000) Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. Trends Biochem. Sci., 25, 106110[CrossRef][Web of Science][Medline].
Burge, C.B., et al. (1999) Splicing precursors to mRNAs by the spliceosomes. In Gesteland, R.F., Cech, T.R., Atkins, J.F. (Eds.). The RNA World, , Cold Spring Harbor, NY Cold Spring Harbor Lab. Press, pp. 525560.
Burge, C.B. (1998) Modeling dependencies in pre-mRNA splicing signals. In Salzberg, S.L., Searls, D.B., Kasif, S. (Eds.). Computational Methods in Molecular Biology, , Amsterdam Elsevier, pp. 129164.
Burge, C.B. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 7894[CrossRef][Web of Science][Medline].
Burset, M. and Guigó, R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353367[CrossRef][Web of Science][Medline].
Bussemaker, H.J., et al. (2000) Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl Acad. Sci. USA, 97, 1009610100
Collins, F.S., et al. (2003) A vision for the future of genomics research. Nature, 422, 835847[CrossRef][Medline].
Del Gatto-Konczak, F., et al. (1999) hnRNP A1 recruited to an exon in vivo can function as an exon splicing silencer. Mol. Cell Biol., 19, 251260
Egan, J.P. Signal Detection Theory and ROC Analysis, (1975) , NY Academic.
Fairbrother, W.G., et al. (2002) Predictive identification of exonic splicing enhancers in human genes. Science, 297, 10071013
Garland, J.A. and Aalberts, D.P. (2004) Thermodynamic modeling of donor splice site recognition in pre-mRNA. Phys. Rev., E., 69, 041903[CrossRef].
Hodas, N.O. and Aalberts, D.P. (2004) Efficient computation of optimal oligo-RNA binding. Nucleic Acids Res., 32, 66366642
Laub, M.T. and Smith, D.W. (1998) Finding intron/exon splice junctions usingINFO, INterruption Finder and Organizer. J. Comp. Biol., 5, 307321.
Lim, L.P. and Burge, C.B. (2001) A computational analysis of sequence features involved in recognition of short introns. Proc. Natl Acad. Sci. USA, 98, 1119311198
Mathews, D.H., et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911940[CrossRef][Web of Science][Medline].
Matthews, B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, 405, 442451[Medline].
Pachter, L., et al. (1999) A dictionary-based approach for gene annotation. J. Comp.Biol., 6, 419430.
Patterson, D.J., et al. (2002) Pre-mRNA secondary structure prediction aids splice site prediction. In Altman, R.B., Dunker, A.K., Hunter, L., Lauderdale, K., Klein, T.E. (Eds.). Pacific Symposium on Biocomputing, , Singapore World Scientific Vol. 7, , pp. 223234.
Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res., 12, 505519
Ting, K.M., Zheng, Z., Webb, G. (2000) Learning lazy rules to improve the performance of classifiers. Proceedings of the Nineteenth SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence , London Springer, pp. 122131.
Xia, T., et al. (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with WatsonCrick base pairs. Biochemistry, 37, 1471914735[CrossRef][Medline].
Yeo, G. and Burge, C.B. (2003) Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. In Miller, W., Vingron, M., Istrail, S., Pevzner, P., Waterman, M. (Eds.). Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB'03), , New York ACM Press, pp. 322331.
Zhang, M.Q. and Marr, T.G. (1993) A weight array method for splicing signal analysis. Comput. Appl. Biosci., 9, 499509
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






, R', R'', D, D' and D'' with w' = w'' = 1







