Skip Navigation


Bioinformatics Advance Access originally published online on August 31, 2006
Bioinformatics 2006 22(21):2590-2596; doi:10.1093/bioinformatics/btl441
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
22/21/2590    most recent
btl441v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, C.
Right arrow Articles by Holbrook, S. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, C.
Right arrow Articles by Holbrook, S. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

PSoL: a positive sample only learning algorithm for finding non-coding RNA genes

Chunlin Wang 1, Chris Ding 2, Richard F. Meraz 1 and Stephen R. Holbrook 1,*

1 Physical Biosciences Division, Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA
2 Computational Research Division, Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 SUMMARY
 REFERENCES
 

Motivation: Small non-coding RNA (ncRNA) genes play important regulatory roles in a variety of cellular processes. However, detection of ncRNA genes is a great challenge to both experimental and computational approaches. In this study, we describe a new approach called positive sample only learning (PSoL) to predict ncRNA genes in the Escherichia coli genome. Although PSoL is a machine learning method for classification, it requires no negative training data, which, in general, is hard to define properly and affects the performance of machine learning dramatically. In addition, using the support vector machine (SVM) as the core learning algorithm, PSoL can integrate many different kinds of information to improve the accuracy of prediction. Besides the application of PSoL for predicting ncRNAs, PSoL is applicable to many other bioinformatics problems as well.

Results: The PSoL method is assessed by 5-fold cross-validation experiments which show that PSoL can achieve about 80% accuracy in recovery of known ncRNAs. We compared PSoL predictions with five previously published results. The PSoL method has the highest percentage of predictions overlapping with those from other methods.

Contact: srholbrook{at}lbl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 SUMMARY
 REFERENCES
 
RNA molecules are endowed with extraordinary capacities owing to their intrinsic conformational versatility and catalytic abilities. However, their potentials have mostly remained hidden from attention until recently through the discoveries of non-coding RNA (ncRNA) genes. In bacteria, ncRNAs have been found to be involved in the control of transcription (Wassarman and Storz, 2000), RNA processing (Wassarman et al., 1999), RNA stability (Masse and Gottesman, 2002), mRNA translation (Altuvia and Wagner, 2000) and even protein degradation (Gillet and Felden, 2001) and translocation (Keenan et al., 2001). Therefore, ncRNAs play important roles in a variety of cellular processes and correspondingly, efforts to identify the whole set of ncRNAs and then to elucidate their functions are becoming more and more prominent.

However, it is a big challenge to identify the whole set of ncRNA genes in a genome. Most ncRNAs are small and non-susceptible to frame-shift and nonsense mutations, which makes them very difficult to detect using routine biochemical and genetic methods (Hershberg et al., 2003). In addition, ncRNAs have varied stability and are expressed under a variety of environmental and physiological conditions. Therefore, methods such as whole genome microarrays (Tjaden et al., 2002) and the whole genome cloning method (Vogel et al., 2003) are unlikely to fully characterize all ncRNA genes in a genome. The development of computational methods for efficiently finding ncRNA genes in genomic sequences has proven difficult. Unlike protein genes, ncRNA genes lack clear endpoints, vary in size and have few common statistical features. This poses a great challenge to computational approaches. Despite the difficulties, great efforts have been devoted to predict ncRNA genes by exploring different aspects of properties about known ncRNA genes. Evolutionary conservation of secondary structures provides compelling evidence for biologically relevant RNA function; thus comparative genomics approaches are particularly attractive for ncRNA gene prediction. In a study by Rivas et al. (2001), pair stochastic context free grammars were exploited to modeling patterns of co-variation in sequence alignment from related genomes. The program RNAz developed by Washietl et al. (2005) basically combines structural conservation and thermodynamic stability of RNA secondary structures in multiple sequence alignments to predict functional RNA structures including ncRNA. Functional sites (i.e. promoter and terminator) are required in ncRNA gene expression. Just as one can reach the melon by following the vine, it is possible to use the predicted signals to approach the boundaries of ncRNA genes. Chen et al. (2002) pinpointed ncRNA genes with genomic positions of promoters and terminators, which were predicted based on profile-based methods. The nucleotide composition of known ncRNA genes has been tested to search for discriminative variables between primary sequences of ncRNA genes and intergenic regions in bacterial genome sequences. However, no particular measure stands out to be very discriminative. The combination of some measures such as k-mer (i.e. the usage k nt words) usage might provide a certain level of predictive capability. In addition, different measures often examine different aspects of an actual gene, all of which may complement each other. Therefore, combining different predictive features is highly likely to yield a more accurate prediction. The integrated strategy was initially used to identify ncRNA genes in E.coli by Carter et al. (2001). Selected discriminative base composition measures and calculated minimum free energies of folding (MFE) were used to train a neural network to distinguish ncRNA from other intergenic sequences. However, <10% of all predictions are shared among different methods above (Hershberg et al., 2003), suggesting that some computational ncRNA gene-finding methods are not highly successful.

We approach the problem of computational prediction of ncRNA genes using a single-class discriminative machine-learning algorithm. Machine-learning involves training a prediction algorithm with knowledge derived from already available data and applying this knowledge to prediction. For this ncRNA prediction problem, we try to train a support vector machine (SVM) algorithm (Vapnik, 1995) to distinguish ncRNA genes from intergenic sequences based on statistical differences between biologically relevant, computable representations of these sequences. In general, an SVM is used as a discriminative method to learn a decision boundary from a set of existing examples that can generalize to unseen examples. The performance of an SVM highly depends on the training dataset which should consist of examples from all classes to be learned, and have as few misclassifications as possible. However, in many computational biology problems, there are only a limited number of positive (desired) training examples available and the negative examples are difficult to define appropriately.

To overcome the lack of appropriate negative training samples, we developed a new approach called the positive sample only learning (PSoL) algorithm. The PSoL algorithm defines the first set of negative examples by maximizing both the distances between negative sample points to the known positive sample points and the distances among negative samples points simultaneously, and then refines the negative data iteratively using an SVM algorithm based on current positive and negative samples until no additional negative samples can be found according to pre-determined rules. By doing so, a decision boundary is updated iteratively from far away to nearby positive samples, achieving high specificity. In this manuscript we detail the algorithm and apply this approach to the prediction of ncRNA genes in E.coli.


    2 MATERIALS AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 SUMMARY
 REFERENCES
 
2.1 Transformation of biological sequence to feature vectors
The M52 version of the E.coli K-12 genome sequence (Blattner et al., 1997) was used to compile a database of ncRNA and non-annotated sequences. The well-characterized ncRNA sequences of E.coli were collected based on a literature search (3 rRNA, 20 tRNA and 69 known ncRNA genes, see Supplementary Table 1). The sequences of these RNA molecules served as positive examples from which we derived parameters for machine learning. The ‘non-coding’, or intergenic sequences were obtained by removing all protein and known functional RNA coding regions from the genome along with a buffer of 50 nucleotides on both the 5' and 3' sides so as to remove possible promoter, terminator and other untranslated control elements. Sequences in both strands were removed when there was a protein or RNA coding region on either strand. Each RNA or non-annotated intergenic sequence was then divided into sequence windows of 80 nt with a 40 nt overlap between windows (i.e. each window slides 40 nt along the sequence). Any window of <40 nucleotides was excluded from the study. A total of 5909 windows from each strand (11818 total) were partitioned from the non-coding sequences, while 321 unique RNA sequence windows were generated from the known RNA sequences (after removing redundant RNAs).


View this table:
[in this window]
[in a new window]

 
Table 1 Pairwise overlap between ncRNA prediction methods

 
Each window was transformed into a feature vector consisting of sequence statistics, the MFE and similarity measurement between related genomes. Sequence statistics were the counts of individual nucleotide (A, C, G, T), dimer (AA, AC ··· TT) and trimer (AAA, AAC ··· TTT) in each window. The conservation of the sequence of a window was simply represented by the highest bits score with WU-BLAST (W = 4) between a sequence and the genomic sequence of a reference species. The three reference species are Salmonella typhimurium LT2 (access number NC_003197 [GenBank] ) Salmonella typhi CT18 (access number NC_003198 [GenBank] ) and S.typhi Ty2 (access number NC_004631 [GenBank] ). The MFE for each window was calculated using the program RNAfold (Washietl et al., 2005) with default parameters. All values were then normalized by dividing by the size of window.

2.2 Feature selection
A total of 88 possible features were generated from the feature extraction method described above. In general, too many features often degrade the performance of the discriminant method by overfitting the training data. Therefore, we picked a small number of features and discarded the rest. The most common feature selection involves computing the t-statistic test (for two-class problems) or F-statistic (for multi-class problems) on the class-conditional distributions. Then the features were ranked according to their scores. Those most highly ranked features were then selected.

Both t-statistics and F-statistics assume that for each class, the data follow a normal distribution. In reality, this assumption is not always correct. For this reason, we used a L1 distance metric between two distributions p, q:

Formula
where s is summed over different states. This metric can be viewed as a simplified version of the symmetrized Kullback–Leibler divergence (Kullback and Leibler, 1951): Formula Since log(x) is a very slow changing function, we ignore it. The L1 distance has an intuitive interpretation. If we plot the probability density distribution curves for two different classes, the L1 distance is the total area sum of the difference between the two curves (Figs 13). The most discriminant features should have the largest differences on these class-conditional distributions.


Figure 1
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Distributions (histogram) of normalized bits score for both positive and unlabeled classes on S.typhi Ty2 genome sequence (extracted from the best HSP in WU-BLAST search). Both distributions deviate substantially from normal distribution; the sample means shift away from the peak regions. Thus the use of t-score is questionable. L1 score can capture the difference in distributions.

 


Figure 2
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Distributions of normalized G content for both positive and unlabeled classes. Both t-score and L1 score can capture the difference in distributions.

 


Figure 3
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Distributions of normalized GGT content for both positive and unlabeled classes. t-score can not capture the difference in distributions while L1 score can.

 
The L1 ranking does not require the underlying data to follow a particular distribution. When the class-conditional distributions are Gaussian, the ranked orders based on t-statistics and on L1 distance are very similar.

2.3 Learning from partially labeled data
Discriminative machine learning algorithms require labeled data during the training phase. The windows derived from previously identified ncRNA genes were labeled as the positive (+) data. We were trying to distinguish putative ncRNA genes from intergenic sequences. Intergenic sequences contain positive examples (putative ncRNAs) as well as negative examples (sequences that do not encode putative ncRNAs). Therefore, we considered the intergenic sequences to be unlabeled data. Thus our problem became learning from a positively labeled-only dataset.

2.4 Positive sample only learning
In this problem, we have two types of data: (1) positive data samples and (2) the unlabeled data set, which contains both positives and negatives, and generally much more data than the positive data samples. The goal of PSoL is to predict the positives in the unlabeled data.

PSoL is a challenging problem because there are no negative data. The usual discriminative methods, which require both positive and negative samples for training, cannot be applied to this problem directly. In our earlier approach (Carter et al., 2001), we first took random samples from the unlabeled data and assumed they were negative data. This negative dataset plus the true positive dataset were used for training the discriminant decision function between the positive and the negative data. This approach is reasonably effective for RNA gene prediction (Carter et al., 2001) since there are many more negatives than positives in the unlabeled sequences.

However, some of the ‘negative samples’ in training the decision function could in fact be positives embedded in the unlabeled data. These wrongly assumed ‘negative samples’ could tilt the decision boundary in an unpredictable way and thus affect the decision boundary significantly.

The key to the success of PSoL is to generate a negative training set without contamination from those ‘positives’ embedded in the unlabeled data. In this paper, we describe a more sophisticated method to determine the negative training set. The basic spirit of this method has appeared previously (Yu et al., 2002; Yu, 2003; Li and Liu, 2003; Liu et al., 2002).

The method first identifies a small number of data points in the unlabeled dataset that are very far away from the positive training dataset. In this way, we minimized the possibility of those picked data points to be positive. In addition, we minimized the redundancy in those picked data points by maximizing their mutual distances to achieve a better representativeness for negative data.

Given the small initial negative set, we expanded them in multiple steps, each time we picked more data from the currently unlabeled set, using a criteria that they are far-away from the positive training set and close to the current negative set. (The decision function of an SVM gives a convenient measure for the distances to the positives and to the negatives). The negative training set built up in this way will be less contaminated by the positives embedded in the unlabeled dataset.

Once this negative training set is built, we have N: current negative dataset, U: remaining unlabeled dataset and P: positive dataset. The process of predicting positives from the remaining unlabeled dataset is the same as in the two-class prediction.

2.5 Initial negative set selection
2.5.1 Maximum distance minimum redundancy negative set.
For the initial negative set, we selected from the unlabeled set m data points that are (1) most dissimilar from the positive set P and (2) least redundant among themselves. We call this maximum distance—minimum redundancy (MDMR) set (Ding and Peng, 2005).

We first defined the distance between a single data point and the positive set, d(xi, P), as the minimum Euclidean distance between xi and P:

Formula 1(1)
The maximum distance negative set was constructed by selecting the initial negative set N from the unlabeled set U such that the distance between N and P was maximized:

Formula 2(2)
This optimization is trivially solved by picking the N points with largest distance d(xi, P). However, often the chosen set has many members close to each other and the space represented by N is narrow. From the viewpoint of learning, we may say that there is a certain redundancy in N. To reduce the redundancy, we added a second requirement that maximizes the distance among data points in N:

Formula 3(3)
To satisfy these two criteria simultaneously, we maximize:

Formula 4(4)
The exact solution of Equation (4), however, is NP hard. We propose the following simple approximate algorithm that is efficient and gives good results in practice.

2.5.2 Forward incremental selection algorithm
The algorithm first selects a point according to Equation (2). The rest of N is chosen incrementally. Suppose we already have several points in the current negative set N; the new point xi is selected based on maximum dissimilarity to the positive set:

Formula 5(5)
And the maximum distance to the current set.

Formula 6(6)
Now Equation (5) is an exact solution to Equation (2) and Equation (6) in an approximate solution to Equation (3). As in Equation (4) these two criteria are combined into one:

Formula 7(7)
This can be solved by a simple linear search. Once the specified size of N is reached, the algorithm is terminated and we set the initial negative training set Ntrain = N.

2.6 Negative set expansion
Given an initial negative set, the PSoL method gradually expands the negative set by classifying more and more unlabeled data points as negative. This is done iteratively using a two-class SVM. At each iteration, an SVM is trained; the decision function values for all remaining unlabeled points are computed, and some of them are classified as negative. Thus |N| is increased and |U| is decreased at each step.

At the stop point, N contains the negative training set and U contains the remaining unlabeled dataset. A final SVM is trained. Based on this, a portion of those in U – N are classified as positive; the remaining ones in U – N are classified as ‘undecided’.

2.6.1 Controlled stepwise expansion
Given the current negative training set Ntrain and the current unlabeled set U, we perform negative set expansion. We begin by training an SVM on the data P + Ntrain to obtain a large margin decision boundary. The support vectors in Ntrain for this SVM are denoted Nsv. All objects in the currently unlabeled set U are tested against the SVM.

We classify unlabeled data points as negative in a conservative and controlled fashion. At an iteration, once the SVM is trained, each unlabeled point will have an decision value f(xi). Normally, a point xi is classified as negative if f(xi) < 0. To insure the quality of the negative set, we build a safety margin h > 0 by requiring

Formula 8(8)
for xi to belong to the negative set. We typically set h = 0.2.

Besides the safety margin, we also control the size of the newly predicted negative samples Npred at each step by setting

Formula 9(9)
where r is set to be 3 in most of our experiments. This size control is necessary because the size of unlabeled data samples can be huge compared with that of the positive samples. Therefore the number of newly predicted negative samples is possibly very large in each expansion.

Once Npred are selected, they are added to the current negative set: N <- N + Npred and they are subtracted from the current unlabeled set: U <- UNpred.

2.6.2 SVM training
In SVM training, it is well-known that if the sizes of the classes differ substantively, say 1:5, SVM training typically converges to a solution where all data points in the smaller class are classified as belonging to the larger class.

To overcome this problem, we maintain a current negative training set Ntrain whose size is comparable with that of |P|. At the first iteration, Ntrain = N. Later on, after each SVM, the support vectors on the negative side Nsv are used to represent the existing negative set N. This is combined with the newly predicted negatives to give the negative training dataset for next round of SVM training: Ntrain = Npred + Nsv. Since |Npred| ≤ r |P|, the size of Ntrain is controllable and is maintained in the range where the SVM can be successfully trained with high accuracy.

2.6.3 Stopping criteria of negative expansion
Negative set expansion is repeated until the size of the remaining unlabeled set goes below a predefined number, typically about three times the number of expected positives in the unlabeled set. At this last step, the unlabeled data points with the largest positive decision function values are declared as the positives.

2.7 SVM parameter selection
We used the libsvm (Fan et al., 2005) to perform SVM training and predicting. A radial basis function (RBF) kernel was used. There are two parameters for the RBF kernel: {gamma}, which determines the effective range of distances between points, and C, which determines the trade-off between margin maximization and training error minimization. The parameter search is carried out with cross validation. We used a grid-search approach to search for a pair of C and {gamma} with the best performance in cross validation.

It should be emphasized that we fixed parameters for the entire PSoL, i.e. parameters for training the SVM were fixed for each iterative step in the negative set expansion. If we let parameters change during the negative expansion, the data would be overfit and poorer performance in cross validation would result.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 SUMMARY
 REFERENCES
 
3.1 Feature selection
For each feature, distributions for positive and unlabeled classes were computed, from which t-score and L1 score were derived. Detailed distributions for three features are shown in Figures 13. The figures show that distributions follow the normal distributions by varying degrees and the validity of t-score becomes questionable. Since the L1 measure does not require the underlying data to follow a particular distribution, the L1 measure can capture the difference. We decided to use 30 features (A, C, G, T, AA, AT, CC, CG, GG, GT, TA, TT, AAA, AAT, ATA, ATT, CCG, CGG, GCC, GGC, GGG, GGT, TAT, TGG, TTA, TTT, MFE, Typhi_CT18, Typhi_Ty2 and Typhi_LT2) with highest L1 scores.

3.2 The 5-fold cross validation
In order to calibrate the performance of PSoL on the ncRNA data, we carried out a 5-fold cross validation. Briefly, the positive data were randomly divided into five subsets of approximately equal sizes. We ran the validation process five times; each time, we merged four subsets into positive training data and merged the remaining subset into unlabeled data. We ran the PSoL procedure described above and counted the number of positive samples embedded in the unlabeled data which remain to be ‘unlabeled’. Figure 4 shows the results for five independent 5-fold cross validation experiments. From those curves, it is apparent that the embedded 64 (321/5) known positives are mostly present in the remaining unlabeled samples as negative expansion proceeds, suggesting that the negative set is not contaminated by the positives. This validates our design of the negative set expansion. When the negative expansion stops at |U| = 1000 (1000 samples predicted to be positive), ~80% recovery rate is achieved (Fig. 4). The optimal parameters are C = 1000 and {gamma} = 0.04.


Figure 4
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 5-fold cross validation results. Each curve presents the percentage of correctly recovered positives. We run five 5-fold CV experiments with different random partitions of the positive data. The horizontal coordinate denotes the number of unlabeled samples left after each negative expansion. A colour version of this figure appears in the Supplementary data.

 
ROC curve analysis was carried out to further assess the performance of PSoL. A total of 321 negative control samples were generated by shuffling each positive sample window once using the program SHUFFLE in Sean Eddy's Squid toolbox (http://hmmer.wustl.edu/) to randomize the sequence while perserving mono- and di-nucleotide composition. The negative samples were marked and put into an unlabeled dataset to do a 5-fold cross validation experiement as described above. The true positive rate and false positive rate were then calculated based on those known positive windows and those negative windows generated by shuffling. The ROC curve of this analysis is shown in Figure 5. When the negative expansion stops at |U| = 1000 (1000 samples predicted to be positive), the false positive rate is 6 ± 1 %. Using true positives and true negatives only (igorning the unlabeled category), the average Q{alpha} (average of the percentage of correctly predicted positive windows and the percentage of correctly predicted negative windows) (Baldi et al., 2000) of the 5-fold cross validation experiment is 87.3%. We note that this type of estimation of false positive rate can be automatically computed in PSoL and used to help judge when to terminate the process.


Figure 5
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 ROC curves of 5-fold cross validation experiment. Each curve represents the result of one single run of the experiment. A colour version of this figure appears in the Supplementary data.

 
3.3 Prediction
Using the best parameters C and {gamma} from the 5-fold cross-validation experiment, we ran PSoL with all positive data and predicted 1000 windows. This choice is based on the observation that when the number of remaining unlabeled windows is close to 1000, the curves in Figure 4 show a sharp downturn. Since many of these predicted windows were consecutive, we then merged windows that overlapped each other into one. The predicted 1000 windows were assembled into 420 independent RNA sequence segments (details listed in Supplementary Table 2).

One of the difficulties in computational prediction of ncRNA genes is the lack of benchmark data to validate the method. Experimental approaches are expensive and time-consuming, therefore only a limited number of predictions were subject to verification using Northern blots or RT–PCR. There are also additional drawbacks to such approaches. Since ncRNA expression may vary according to environmental and physiological conditions, some authentic ncRNA genes might not be detected under experimental conditions. In this study, we compared our predictions to results from previous work. We argued that if our results have more agreement with other studies, that would be a validation of our method. The data used in our comparison were predicted by methods which are listed below.


Abbreviation Methods Reference

Affy microarray experiments (Tjaden et al., 2002)
QRNA stochastic context free grammars (Rivas et al., 2001)
IBIS promoter and terminator prediction (Chen et al., 2002)
bGP boosted genetic programming (Saetrom et al., 2005)
NNs neural networks (Carter et al., 2001)

The results of pairwise comparison are listed in Table 1. Note that a predicted ncRNA gene from one method could overlap with two predicted ncRNA genes from another method (see Supplementary figures). In general, PSoL has the largest overlap with other methods. This can be seen clearly from row or column sums. The greater overlap of PSoL with Affy is significant since Affy is the only experimentally based method of which the results are more reliable.

SVM function values can be used to measure the confidence of prediction. In Figure 6 we show the percentage agreement. It is clear, as the decision values increases, the percentage agreement increases. This suggests that predictions with higher decision values are more likely to be true positives. The rank of each prediction based on its decision value is provided in Supplementary Table 2. The secondary structures and their genome schema for the top five predictions are also provided in Supplementary Data.


Figure 6
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Percentage of PSoL predictions overlapping with other methods versus decision values. The solid line shows the sum of percentage overlapping with all other five methods. The dash line shows the percentage overlapping with Affy result.

 
3.4 Comparison of the statistics of predictions from different methods
Currently, there is no consensus as to the characteristics of an ncRNA gene. In this study, we examined the distribution of length, GC-content, and MFE (minimum free energy) for predictions from different methods mentioned above, as shown in Figures 79. Methods based sequence statistics such as NNs, bGP and PSoL predict more short ncRNA genes. There is less bias in length of prediction from Affy, QRNA and IBIS when compared to the distribution of length of known ncRNA genes. The overall GC content is 50.8 and 40.3% in the E.coli K12 genome and in the intergenic regions, respectively. It appears that NNs, bGP and PSoL pick up prediction with slightly higher GC content than the other three methods.


Figure 7
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 The length distribution of predicted ncRNA genes and known ncRNA genes (tRNA and rRNA genes are not included). A colour version of this figure appears in the Supplementary data.

 


Figure 8
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8 The GC content distribution of predicted ncRNA genes and known ncRNA genes (tRNA and rRNA genes are not included). A colour version of this figure appears in the Supplementary data.

 


Figure 9
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9 The normalized (by length) MFE distribution of predicted ncRNA genes and known ncRNA genes (tRNA and rRNA genes are not included). A colour version of this figure appears in the Supplementary data.

 
Currently MFE is commonly used as the major predictor for ncRNA genes. Three out of six methods (PSoL, NNs and bGP) utilize MFE as a prediction parameter. However, as shown in Figure 9, only a small fraction of both known ncRNA genes and predictions from all methods has a very low normalized MFE, suggesting that MFE cannot be used as the only predictor of an ncRNA gene.


    4 SUMMARY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 SUMMARY
 REFERENCES
 
In summary, the PSoL algorithm addresses two significant concerns in machine learning for biological systems: (1) the uncertainty of the negatives or the lack of negatives and (2) the overwhelming majority of unlabeled data relative to known positives. This situation is quite common in many bioinformatics problems. We believe our method could provide an effective prediction tool in these difficult situations.

We tested this technique on the prediction of ncRNA genes in the E.coli genome sequence solely based on known functional RNA molecules. The 5-fold cross-validation experiments show that PSoL has a recovery rate of 80%. When we compare our predictions with results from previous studies, we find that our prediction has the most overlap with other results, especially with the experimental microarray data, Affy, suggesting the success of this technique.


    Acknowledgments
 
The authors gratefully acknowledge Nancy Schimmelman and Libby Holbrook for editing this manuscript. This work is supported by the National Human Genome Research Institute of the National Institute of Health grant 5RO1H6002665-02. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC03-76SF00098.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on May 30, 2006; revised on August 2, 2006; accepted on August 13, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 SUMMARY
 REFERENCES
 

    Altuvia, S. and Wagner, E.G. (2000) Switching on and off with RNA. Proc. Natl Acad. Sci. USA, 97, 9824–9826[Free Full Text].

    Baldi, P., et al. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424[Abstract/Free Full Text].

    Blattner, F.R., et al. (1997) The complete genome sequence of Escherichia coli K-12. Science, 277, 1453–1474[Abstract/Free Full Text].

    Carter, R.J., et al. (2001) A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Res, . 29, 3928–3938[Abstract/Free Full Text].

    Chen, S., et al. (2002) A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems, 65, 157–177[CrossRef][ISI][Medline].

    Ding, C. and Peng, H. (2005) Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol, . 3, 185–205[CrossRef][Medline].

    Fan, R.E. (2005) Working set selection using second order information for training support vector machines. J. Mach. Learn. Res, . 6, 1889–1918.

    Gillet, R. and Felden, B. (2001) Emerging views on tmRNA-mediated protein tagging and ribosome rescue. Mol. Microbiol, . 42, 879–885[CrossRef][ISI][Medline].

    Gottesman, S. (2005) Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet, . 21, 399–404[CrossRef][ISI][Medline].

    Hershberg, R., et al. (2003) A survey of small RNA-encoding genes in Escherichia coli. Nucleic Acids Res, . 31, 1813–1820[Abstract/Free Full Text].

    Hildebrandt, M. and Nellen, W. (1992) Differential antisense transcription from the Dictyostelium EB4 gene locus: implications on antisense-mediated regulation of mRNA stability. Cell, 69, 197–204[CrossRef][ISI][Medline].

    Keenan, R.J., et al. (2001) The signal recognition particle. Annu. Rev. Biochem, . 70, 755–775[CrossRef][ISI][Medline].

    Kullback, S. and Leibler, R. A. (1951) On information and sufficiency. Ann. Math. Stat, . 22, 79–86.

    Lankenau, S., et al. (1994) The Drosophila micropia retrotransposon encodes a testis-specific antisense RNA complementary to reverse transcriptase. Mol. Cell. Biol, . 14, 1764–1775[Abstract/Free Full Text].

    Li, X. and Liu, B. (2003) Learning to classify text using positive and unlabeled data. Proceedings of Eighteenth International Joint Conference on Artificial Intelligence, , Mexico Acapulco, pp. 587–594.

    Liu, B., et al. (2002) Partially supervised classification of text documents. Proceedings of the 19th Int Conf. on Machine LearningSydney, Australia, pp. , pp. 387–394.

    Masse, E. and Gottesman, S. (2002) A small RNA regulates the expression of genes involved in iron metabolism in Escherichia coli. Proc. Natl Acad. Sci. USA, 99, 4620–4625[Abstract/Free Full Text].

    Morfeldt, E., et al. (1995) Activation of alphatoxin translation in Staphylococcus aureus by the trans-encoded antisense RNA, RNAIII. Embo J, 14, 4569–4577[ISI][Medline].

    Pfeffer, S., et al. (2005) Identification of microRNAs of the herpesvirus family. Nat. Methods, 2, 269–276[CrossRef][ISI][Medline].

    Rivas, E., et al. (2001) Computational identification of noncoding RNAs in E.coli by comparative genomics. Curr. Biol, . 11, 1369–1373[CrossRef][ISI][Medline].

    Saetrom, P., et al. (2005) Predicting non-coding RNA genes in Escherichia coli with boosted genetic programming. Nucleic Acids Res, . 33, 3263–3670[Abstract/Free Full Text].

    Sharp, T.V., et al. (1993) Comparative analysis of the regulation of the interferoninducible protein kinase PKR by Epstein-Barr virus RNAs EBER-1 and EBER-2 and adenovirus VAI RNA. Nucleic Acids Res, . 21, 4483–4490[Abstract/Free Full Text].

    Tjaden, B., et al. (2002) Transcriptome analysis of Escherichia coli using high-density oligonucleotide probe arrays. Nucleic Acids Res, . 30, 3732–3738[Abstract/Free Full Text].

    Vapnik, V.N. The Nature of Statistical Learning Theory, . (1995) , SpringerVerlag, New York.

    Vogel, J., et al. (2003) RNomics in Escherichia coli detects new sRNA species and indicates parallel transcriptional output in bacteria. Nucleic Acids Res, . 31, 6435–6443[Abstract/Free Full Text].

    Wagner, E.G. and Simons, R.W. (1994) Antisense RNA control in bacteria, phages, and plasmids. Annu. Rev. Microbiol, . 48, 713–742[ISI][Medline].

    Washietl, S., et al. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA, 102, 2454–2459[Abstract/Free Full Text].

    Wassarman, K.M. and Storz, G. (2000) 6S RNA regulates E.coli RNA polymerase activity. Cell, 101, 613–623[CrossRef][ISI][Medline].

    Wassarman, K.M., et al. (1999) Small RNAs in Escherichia coli. Trends Microbiol, . 7, 37–45[CrossRef][ISI][Medline].

    Wightman, B., et al. (1993) Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell, 75, 855–862[CrossRef][ISI][Medline].

    Yu, H. (2003) SVMC: single-class classification with support vector machines. Proceedings of International Joint Conference on Artificial Intelligence,, , Acapulco Maxico.

    Yu, H., et al. (2002) PEBL: positive example-based learning for web page classification using SVM. Proceedings of the ACM SIGKDD International Conference Knowledge Discovery in Databases (KDD02), ACM press, pp. 239–248.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
22/21/2590    most recent
btl441v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, C.
Right arrow Articles by Holbrook, S. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, C.
Right arrow Articles by Holbrook, S. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?