Bioinformatics Advance Access originally published online on August 31, 2006
Bioinformatics 2006 22(21):2590-2596; doi:10.1093/bioinformatics/btl441
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PSoL: a positive sample only learning algorithm for finding non-coding RNA genes
1 Physical Biosciences Division, Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA
2 Computational Research Division, Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Small non-coding RNA (ncRNA) genes play important regulatory roles in a variety of cellular processes. However, detection of ncRNA genes is a great challenge to both experimental and computational approaches. In this study, we describe a new approach called positive sample only learning (PSoL) to predict ncRNA genes in the Escherichia coli genome. Although PSoL is a machine learning method for classification, it requires no negative training data, which, in general, is hard to define properly and affects the performance of machine learning dramatically. In addition, using the support vector machine (SVM) as the core learning algorithm, PSoL can integrate many different kinds of information to improve the accuracy of prediction. Besides the application of PSoL for predicting ncRNAs, PSoL is applicable to many other bioinformatics problems as well.
Results: The PSoL method is assessed by 5-fold cross-validation experiments which show that PSoL can achieve about 80% accuracy in recovery of known ncRNAs. We compared PSoL predictions with five previously published results. The PSoL method has the highest percentage of predictions overlapping with those from other methods.
Contact: srholbrook{at}lbl.gov
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
RNA molecules are endowed with extraordinary capacities owing to their intrinsic conformational versatility and catalytic abilities. However, their potentials have mostly remained hidden from attention until recently through the discoveries of non-coding RNA (ncRNA) genes. In bacteria, ncRNAs have been found to be involved in the control of transcription (Wassarman and Storz, 2000), RNA processing (Wassarman et al., 1999), RNA stability (Masse and Gottesman, 2002), mRNA translation (Altuvia and Wagner, 2000) and even protein degradation (Gillet and Felden, 2001) and translocation (Keenan et al., 2001). Therefore, ncRNAs play important roles in a variety of cellular processes and correspondingly, efforts to identify the whole set of ncRNAs and then to elucidate their functions are becoming more and more prominent.
However, it is a big challenge to identify the whole set of ncRNA genes in a genome. Most ncRNAs are small and non-susceptible to frame-shift and nonsense mutations, which makes them very difficult to detect using routine biochemical and genetic methods (Hershberg et al., 2003). In addition, ncRNAs have varied stability and are expressed under a variety of environmental and physiological conditions. Therefore, methods such as whole genome microarrays (Tjaden et al., 2002) and the whole genome cloning method (Vogel et al., 2003) are unlikely to fully characterize all ncRNA genes in a genome. The development of computational methods for efficiently finding ncRNA genes in genomic sequences has proven difficult. Unlike protein genes, ncRNA genes lack clear endpoints, vary in size and have few common statistical features. This poses a great challenge to computational approaches. Despite the difficulties, great efforts have been devoted to predict ncRNA genes by exploring different aspects of properties about known ncRNA genes. Evolutionary conservation of secondary structures provides compelling evidence for biologically relevant RNA function; thus comparative genomics approaches are particularly attractive for ncRNA gene prediction. In a study by Rivas et al. (2001), pair stochastic context free grammars were exploited to modeling patterns of co-variation in sequence alignment from related genomes. The program RNAz developed by Washietl et al. (2005) basically combines structural conservation and thermodynamic stability of RNA secondary structures in multiple sequence alignments to predict functional RNA structures including ncRNA. Functional sites (i.e. promoter and terminator) are required in ncRNA gene expression. Just as one can reach the melon by following the vine, it is possible to use the predicted signals to approach the boundaries of ncRNA genes. Chen et al. (2002) pinpointed ncRNA genes with genomic positions of promoters and terminators, which were predicted based on profile-based methods. The nucleotide composition of known ncRNA genes has been tested to search for discriminative variables between primary sequences of ncRNA genes and intergenic regions in bacterial genome sequences. However, no particular measure stands out to be very discriminative. The combination of some measures such as k-mer (i.e. the usage k nt words) usage might provide a certain level of predictive capability. In addition, different measures often examine different aspects of an actual gene, all of which may complement each other. Therefore, combining different predictive features is highly likely to yield a more accurate prediction. The integrated strategy was initially used to identify ncRNA genes in E.coli by Carter et al. (2001). Selected discriminative base composition measures and calculated minimum free energies of folding (MFE) were used to train a neural network to distinguish ncRNA from other intergenic sequences. However, <10% of all predictions are shared among different methods above (Hershberg et al., 2003), suggesting that some computational ncRNA gene-finding methods are not highly successful.
We approach the problem of computational prediction of ncRNA genes using a single-class discriminative machine-learning algorithm. Machine-learning involves training a prediction algorithm with knowledge derived from already available data and applying this knowledge to prediction. For this ncRNA prediction problem, we try to train a support vector machine (SVM) algorithm (Vapnik, 1995) to distinguish ncRNA genes from intergenic sequences based on statistical differences between biologically relevant, computable representations of these sequences. In general, an SVM is used as a discriminative method to learn a decision boundary from a set of existing examples that can generalize to unseen examples. The performance of an SVM highly depends on the training dataset which should consist of examples from all classes to be learned, and have as few misclassifications as possible. However, in many computational biology problems, there are only a limited number of positive (desired) training examples available and the negative examples are difficult to define appropriately.
To overcome the lack of appropriate negative training samples, we developed a new approach called the positive sample only learning (PSoL) algorithm. The PSoL algorithm defines the first set of negative examples by maximizing both the distances between negative sample points to the known positive sample points and the distances among negative samples points simultaneously, and then refines the negative data iteratively using an SVM algorithm based on current positive and negative samples until no additional negative samples can be found according to pre-determined rules. By doing so, a decision boundary is updated iteratively from far away to nearby positive samples, achieving high specificity. In this manuscript we detail the algorithm and apply this approach to the prediction of ncRNA genes in E.coli.
| 2 MATERIALS AND METHODS |
|---|
|
|
|---|
2.1 Transformation of biological sequence to feature vectors
The M52 version of the E.coli K-12 genome sequence (Blattner et al., 1997) was used to compile a database of ncRNA and non-annotated sequences. The well-characterized ncRNA sequences of E.coli were collected based on a literature search (3 rRNA, 20 tRNA and 69 known ncRNA genes, see Supplementary Table 1). The sequences of these RNA molecules served as positive examples from which we derived parameters for machine learning. The non-coding, or intergenic sequences were obtained by removing all protein and known functional RNA coding regions from the genome along with a buffer of 50 nucleotides on both the 5' and 3' sides so as to remove possible promoter, terminator and other untranslated control elements. Sequences in both strands were removed when there was a protein or RNA coding region on either strand. Each RNA or non-annotated intergenic sequence was then divided into sequence windows of 80 nt with a 40 nt overlap between windows (i.e. each window slides 40 nt along the sequence). Any window of <40 nucleotides was excluded from the study. A total of 5909 windows from each strand (11818 total) were partitioned from the non-coding sequences, while 321 unique RNA sequence windows were generated from the known RNA sequences (after removing redundant RNAs).
|
Each window was transformed into a feature vector consisting of sequence statistics, the MFE and similarity measurement between related genomes. Sequence statistics were the counts of individual nucleotide (A, C, G, T), dimer (AA, AC
TT) and trimer (AAA, AAC
TTT) in each window. The conservation of the sequence of a window was simply represented by the highest bits score with WU-BLAST (W = 4) between a sequence and the genomic sequence of a reference species. The three reference species are Salmonella typhimurium LT2 (access number NC_003197
[GenBank]
) Salmonella typhi CT18 (access number NC_003198
[GenBank]
) and S.typhi Ty2 (access number NC_004631
[GenBank]
). The MFE for each window was calculated using the program RNAfold (Washietl et al., 2005) with default parameters. All values were then normalized by dividing by the size of window.
2.2 Feature selection
A total of 88 possible features were generated from the feature extraction method described above. In general, too many features often degrade the performance of the discriminant method by overfitting the training data. Therefore, we picked a small number of features and discarded the rest. The most common feature selection involves computing the t-statistic test (for two-class problems) or F-statistic (for multi-class problems) on the class-conditional distributions. Then the features were ranked according to their scores. Those most highly ranked features were then selected.
Both t-statistics and F-statistics assume that for each class, the data follow a normal distribution. In reality, this assumption is not always correct. For this reason, we used a L1 distance metric between two distributions p, q:
![]() |
Since log(x) is a very slow changing function, we ignore it. The L1 distance has an intuitive interpretation. If we plot the probability density distribution curves for two different classes, the L1 distance is the total area sum of the difference between the two curves (Figs 13). The most discriminant features should have the largest differences on these class-conditional distributions.
|
|
|
The L1 ranking does not require the underlying data to follow a particular distribution. When the class-conditional distributions are Gaussian, the ranked orders based on t-statistics and on L1 distance are very similar.
2.3 Learning from partially labeled data
Discriminative machine learning algorithms require labeled data during the training phase. The windows derived from previously identified ncRNA genes were labeled as the positive (+) data. We were trying to distinguish putative ncRNA genes from intergenic sequences. Intergenic sequences contain positive examples (putative ncRNAs) as well as negative examples (sequences that do not encode putative ncRNAs). Therefore, we considered the intergenic sequences to be unlabeled data. Thus our problem became learning from a positively labeled-only dataset.
2.4 Positive sample only learning
In this problem, we have two types of data: (1) positive data samples and (2) the unlabeled data set, which contains both positives and negatives, and generally much more data than the positive data samples. The goal of PSoL is to predict the positives in the unlabeled data.
PSoL is a challenging problem because there are no negative data. The usual discriminative methods, which require both positive and negative samples for training, cannot be applied to this problem directly. In our earlier approach (Carter et al., 2001), we first took random samples from the unlabeled data and assumed they were negative data. This negative dataset plus the true positive dataset were used for training the discriminant decision function between the positive and the negative data. This approach is reasonably effective for RNA gene prediction (Carter et al., 2001) since there are many more negatives than positives in the unlabeled sequences.
However, some of the negative samples in training the decision function could in fact be positives embedded in the unlabeled data. These wrongly assumed negative samples could tilt the decision boundary in an unpredictable way and thus affect the decision boundary significantly.
The key to the success of PSoL is to generate a negative training set without contamination from those positives embedded in the unlabeled data. In this paper, we describe a more sophisticated method to determine the negative training set. The basic spirit of this method has appeared previously (Yu et al., 2002; Yu, 2003; Li and Liu, 2003; Liu et al., 2002).
The method first identifies a small number of data points in the unlabeled dataset that are very far away from the positive training dataset. In this way, we minimized the possibility of those picked data points to be positive. In addition, we minimized the redundancy in those picked data points by maximizing their mutual distances to achieve a better representativeness for negative data.
Given the small initial negative set, we expanded them in multiple steps, each time we picked more data from the currently unlabeled set, using a criteria that they are far-away from the positive training set and close to the current negative set. (The decision function of an SVM gives a convenient measure for the distances to the positives and to the negatives). The negative training set built up in this way will be less contaminated by the positives embedded in the unlabeled dataset.
Once this negative training set is built, we have N: current negative dataset, U: remaining unlabeled dataset and P: positive dataset. The process of predicting positives from the remaining unlabeled dataset is the same as in the two-class prediction.
2.5 Initial negative set selection
2.5.1 Maximum distance minimum redundancy negative set.
For the initial negative set, we selected from the unlabeled set m data points that are (1) most dissimilar from the positive set P and (2) least redundant among themselves. We call this maximum distanceminimum redundancy (MDMR) set (Ding and Peng, 2005).
We first defined the distance between a single data point and the positive set, d(xi, P), as the minimum Euclidean distance between xi and P:
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
2.5.2 Forward incremental selection algorithm
The algorithm first selects a point according to Equation (2). The rest of N is chosen incrementally. Suppose we already have several points in the current negative set N; the new point xi is selected based on maximum dissimilarity to the positive set:
![]() | (5) |
![]() | (6) |
![]() | (7) |
2.6 Negative set expansion
Given an initial negative set, the PSoL method gradually expands the negative set by classifying more and more unlabeled data points as negative. This is done iteratively using a two-class SVM. At each iteration, an SVM is trained; the decision function values for all remaining unlabeled points are computed, and some of them are classified as negative. Thus |N| is increased and |U| is decreased at each step.
At the stop point, N contains the negative training set and U contains the remaining unlabeled dataset. A final SVM is trained. Based on this, a portion of those in U N are classified as positive; the remaining ones in U N are classified as undecided.
2.6.1 Controlled stepwise expansion
Given the current negative training set Ntrain and the current unlabeled set U, we perform negative set expansion. We begin by training an SVM on the data P + Ntrain to obtain a large margin decision boundary. The support vectors in Ntrain for this SVM are denoted Nsv. All objects in the currently unlabeled set U are tested against the SVM.
We classify unlabeled data points as negative in a conservative and controlled fashion. At an iteration, once the SVM is trained, each unlabeled point will have an decision value f(xi). Normally, a point xi is classified as negative if f(xi) < 0. To insure the quality of the negative set, we build a safety margin h > 0 by requiring
![]() | (8) |
Besides the safety margin, we also control the size of the newly predicted negative samples Npred at each step by setting
![]() | (9) |
Once Npred are selected, they are added to the current negative set: N
N + Npred and they are subtracted from the current unlabeled set: U
U Npred.
2.6.2 SVM training
In SVM training, it is well-known that if the sizes of the classes differ substantively, say 1:5, SVM training typically converges to a solution where all data points in the smaller class are classified as belonging to the larger class.
To overcome this problem, we maintain a current negative training set Ntrain whose size is comparable with that of |P|. At the first iteration, Ntrain = N. Later on, after each SVM, the support vectors on the negative side Nsv are used to represent the existing negative set N. This is combined with the newly predicted negatives to give the negative training dataset for next round of SVM training: Ntrain = Npred + Nsv. Since |Npred|
r |P|, the size of Ntrain is controllable and is maintained in the range where the SVM can be successfully trained with high accuracy.
2.6.3 Stopping criteria of negative expansion
Negative set expansion is repeated until the size of the remaining unlabeled set goes below a predefined number, typically about three times the number of expected positives in the unlabeled set. At this last step, the unlabeled data points with the largest positive decision function values are declared as the positives.
2.7 SVM parameter selection
We used the libsvm (Fan et al., 2005) to perform SVM training and predicting. A radial basis function (RBF) kernel was used. There are two parameters for the RBF kernel:
, which determines the effective range of distances between points, and C, which determines the trade-off between margin maximization and training error minimization. The parameter search is carried out with cross validation. We used a grid-search approach to search for a pair of C and
with the best performance in cross validation.
It should be emphasized that we fixed parameters for the entire PSoL, i.e. parameters for training the SVM were fixed for each iterative step in the negative set expansion. If we let parameters change during the negative expansion, the data would be overfit and poorer performance in cross validation would result.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Feature selection
For each feature, distributions for positive and unlabeled classes were computed, from which t-score and L1 score were derived. Detailed distributions for three features are shown in Figures 13. The figures show that distributions follow the normal distributions by varying degrees and the validity of t-score becomes questionable. Since the L1 measure does not require the underlying data to follow a particular distribution, the L1 measure can capture the difference. We decided to use 30 features (A, C, G, T, AA, AT, CC, CG, GG, GT, TA, TT, AAA, AAT, ATA, ATT, CCG, CGG, GCC, GGC, GGG, GGT, TAT, TGG, TTA, TTT, MFE, Typhi_CT18, Typhi_Ty2 and Typhi_LT2) with highest L1 scores.
3.2 The 5-fold cross validation
In order to calibrate the performance of PSoL on the ncRNA data, we carried out a 5-fold cross validation. Briefly, the positive data were randomly divided into five subsets of approximately equal sizes. We ran the validation process five times; each time, we merged four subsets into positive training data and merged the remaining subset into unlabeled data. We ran the PSoL procedure described above and counted the number of positive samples embedded in the unlabeled data which remain to be unlabeled. Figure 4 shows the results for five independent 5-fold cross validation experiments. From those curves, it is apparent that the embedded 64 (321/5) known positives are mostly present in the remaining unlabeled samples as negative expansion proceeds, suggesting that the negative set is not contaminated by the positives. This validates our design of the negative set expansion. When the negative expansion stops at |U| = 1000 (1000 samples predicted to be positive),
80% recovery rate is achieved (Fig. 4). The optimal parameters are C = 1000 and
= 0.04.
|
ROC curve analysis was carried out to further assess the performance of PSoL. A total of 321 negative control samples were generated by shuffling each positive sample window once using the program SHUFFLE in Sean Eddy's Squid toolbox (http://hmmer.wustl.edu/) to randomize the sequence while perserving mono- and di-nucleotide composition. The negative samples were marked and put into an unlabeled dataset to do a 5-fold cross validation experiement as described above. The true positive rate and false positive rate were then calculated based on those known positive windows and those negative windows generated by shuffling. The ROC curve of this analysis is shown in Figure 5. When the negative expansion stops at |U| = 1000 (1000 samples predicted to be positive), the false positive rate is 6 ± 1 %. Using true positives and true negatives only (igorning the unlabeled category), the average Q
(average of the percentage of correctly predicted positive windows and the percentage of correctly predicted negative windows) (Baldi et al., 2000) of the 5-fold cross validation experiment is 87.3%. We note that this type of estimation of false positive rate can be automatically computed in PSoL and used to help judge when to terminate the process.
|
3.3 Prediction
Using the best parameters C and
from the 5-fold cross-validation experiment, we ran PSoL with all positive data and predicted 1000 windows. This choice is based on the observation that when the number of remaining unlabeled windows is close to 1000, the curves in Figure 4 show a sharp downturn. Since many of these predicted windows were consecutive, we then merged windows that overlapped each other into one. The predicted 1000 windows were assembled into 420 independent RNA sequence segments (details listed in Supplementary Table 2). One of the difficulties in computational prediction of ncRNA genes is the lack of benchmark data to validate the method. Experimental approaches are expensive and time-consuming, therefore only a limited number of predictions were subject to verification using Northern blots or RTPCR. There are also additional drawbacks to such approaches. Since ncRNA expression may vary according to environmental and physiological conditions, some authentic ncRNA genes might not be detected under experimental conditions. In this study, we compared our predictions to results from previous work. We argued that if our results have more agreement with other studies, that would be a validation of our method. The data used in our comparison were predicted by methods which are listed below.
| |||||||||||||||||||||
The results of pairwise comparison are listed in Table 1. Note that a predicted ncRNA gene from one method could overlap with two predicted ncRNA genes from another method (see Supplementary figures). In general, PSoL has the largest overlap with other methods. This can be seen clearly from row or column sums. The greater overlap of PSoL with Affy is significant since Affy is the only experimentally based method of which the results are more reliable.
SVM function values can be used to measure the confidence of prediction. In Figure 6 we show the percentage agreement. It is clear, as the decision values increases, the percentage agreement increases. This suggests that predictions with higher decision values are more likely to be true positives. The rank of each prediction based on its decision value is provided in Supplementary Table 2. The secondary structures and their genome schema for the top five predictions are also provided in Supplementary Data.
|
3.4 Comparison of the statistics of predictions from different methods
Currently, there is no consensus as to the characteristics of an ncRNA gene. In this study, we examined the distribution of length, GC-content, and MFE (minimum free energy) for predictions from different methods mentioned above, as shown in Figures 79. Methods based sequence statistics such as NNs, bGP and PSoL predict more short ncRNA genes. There is less bias in length of prediction from Affy, QRNA and IBIS when compared to the distribution of length of known ncRNA genes. The overall GC content is 50.8 and 40.3% in the E.coli K12 genome and in the intergenic regions, respectively. It appears that NNs, bGP and PSoL pick up prediction with slightly higher GC content than the other three methods.
|
|
|
Currently MFE is commonly used as the major predictor for ncRNA genes. Three out of six methods (PSoL, NNs and bGP) utilize MFE as a prediction parameter. However, as shown in Figure 9, only a small fraction of both known ncRNA genes and predictions from all methods has a very low normalized MFE, suggesting that MFE cannot be used as the only predictor of an ncRNA gene.
| 4 SUMMARY |
|---|
|
|
|---|
In summary, the PSoL algorithm addresses two significant concerns in machine learning for biological systems: (1) the uncertainty of the negatives or the lack of negatives and (2) the overwhelming majority of unlabeled data relative to known positives. This situation is quite common in many bioinformatics problems. We believe our method could provide an effective prediction tool in these difficult situations.
We tested this technique on the prediction of ncRNA genes in the E.coli genome sequence solely based on known functional RNA molecules. The 5-fold cross-validation experiments show that PSoL has a recovery rate of 80%. When we compare our predictions with results from previous studies, we find that our prediction has the most overlap with other results, especially with the experimental microarray data, Affy, suggesting the success of this technique.
| Acknowledgments |
|---|
The authors gratefully acknowledge Nancy Schimmelman and Libby Holbrook for editing this manuscript. This work is supported by the National Human Genome Research Institute of the National Institute of Health grant 5RO1H6002665-02. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC03-76SF00098.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on May 30, 2006; revised on August 2, 2006; accepted on August 13, 2006
| REFERENCES |
|---|
|
|
|---|
Altuvia, S. and Wagner, E.G. (2000) Switching on and off with RNA. Proc. Natl Acad. Sci. USA, 97, 98249826
Baldi, P., et al. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412424
Blattner, F.R., et al. (1997) The complete genome sequence of Escherichia coli K-12. Science, 277, 14531474
Carter, R.J., et al. (2001) A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Res, . 29, 39283938
Chen, S., et al. (2002) A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems, 65, 157177[CrossRef][ISI][Medline].
Ding, C. and Peng, H. (2005) Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol, . 3, 185205[CrossRef][Medline].
Fan, R.E. (2005) Working set selection using second order information for training support vector machines. J. Mach. Learn. Res, . 6, 18891918.
Gillet, R. and Felden, B. (2001) Emerging views on tmRNA-mediated protein tagging and ribosome rescue. Mol. Microbiol, . 42, 879885[CrossRef][ISI][Medline].
Gottesman, S. (2005) Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet, . 21, 399404[CrossRef][ISI][Medline].
Hershberg, R., et al. (2003) A survey of small RNA-encoding genes in Escherichia coli. Nucleic Acids Res, . 31, 18131820
Hildebrandt, M. and Nellen, W. (1992) Differential antisense transcription from the Dictyostelium EB4 gene locus: implications on antisense-mediated regulation of mRNA stability. Cell, 69, 197204[CrossRef][ISI][Medline].
Keenan, R.J., et al. (2001) The signal recognition particle. Annu. Rev. Biochem, . 70, 755775[CrossRef][ISI][Medline].
Kullback, S. and Leibler, R. A. (1951) On information and sufficiency. Ann. Math. Stat, . 22, 7986.
Lankenau, S., et al. (1994) The Drosophila micropia retrotransposon encodes a testis-specific antisense RNA complementary to reverse transcriptase. Mol. Cell. Biol, . 14, 17641775
Li, X. and Liu, B. (2003) Learning to classify text using positive and unlabeled data. Proceedings of Eighteenth International Joint Conference on Artificial Intelligence, , Mexico Acapulco, pp. 587594.
Liu, B., et al. (2002) Partially supervised classification of text documents. Proceedings of the 19th Int Conf. on Machine LearningSydney, Australia, pp. , pp. 387394.
Masse, E. and Gottesman, S. (2002) A small RNA regulates the expression of genes involved in iron metabolism in Escherichia coli. Proc. Natl Acad. Sci. USA, 99, 46204625
Morfeldt, E., et al. (1995) Activation of alphatoxin translation in Staphylococcus aureus by the trans-encoded antisense RNA, RNAIII. Embo J, 14, 45694577[ISI][Medline].
Pfeffer, S., et al. (2005) Identification of microRNAs of the herpesvirus family. Nat. Methods, 2, 269276[CrossRef][ISI][Medline].
Rivas, E., et al. (2001) Computational identification of noncoding RNAs in E.coli by comparative genomics. Curr. Biol, . 11, 13691373[CrossRef][ISI][Medline].
Saetrom, P., et al. (2005) Predicting non-coding RNA genes in Escherichia coli with boosted genetic programming. Nucleic Acids Res, . 33, 32633670
Sharp, T.V., et al. (1993) Comparative analysis of the regulation of the interferoninducible protein kinase PKR by Epstein-Barr virus RNAs EBER-1 and EBER-2 and adenovirus VAI RNA. Nucleic Acids Res, . 21, 44834490
Tjaden, B., et al. (2002) Transcriptome analysis of Escherichia coli using high-density oligonucleotide probe arrays. Nucleic Acids Res, . 30, 37323738
Vapnik, V.N. The Nature of Statistical Learning Theory, . (1995) , SpringerVerlag, New York.
Vogel, J., et al. (2003) RNomics in Escherichia coli detects new sRNA species and indicates parallel transcriptional output in bacteria. Nucleic Acids Res, . 31, 64356443
Wagner, E.G. and Simons, R.W. (1994) Antisense RNA control in bacteria, phages, and plasmids. Annu. Rev. Microbiol, . 48, 713742[ISI][Medline].
Washietl, S., et al. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA, 102, 24542459
Wassarman, K.M. and Storz, G. (2000) 6S RNA regulates E.coli RNA polymerase activity. Cell, 101, 613623[CrossRef][ISI][Medline].
Wassarman, K.M., et al. (1999) Small RNAs in Escherichia coli. Trends Microbiol, . 7, 3745[CrossRef][ISI][Medline].
Wightman, B., et al. (1993) Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell, 75, 855862[CrossRef][ISI][Medline].
Yu, H. (2003) SVMC: single-class classification with support vector machines. Proceedings of International Joint Conference on Artificial Intelligence,, , Acapulco Maxico.
Yu, H., et al. (2002) PEBL: positive example-based learning for web page classification using SVM. Proceedings of the ACM SIGKDD International Conference Knowledge Discovery in Databases (KDD02), ACM press, pp. 239248.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


















