Bioinformatics Advance Access originally published online on January 31, 2007
Bioinformatics 2007 23(11):1321-1330; doi:10.1093/bioinformatics/btm026
De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures
1Bioinformatics Institute, Singapore 138671 and 2NUS Graduate School for Integrative Sciences & Engineering, Centre for Life Sciences, Singapore 117456
*To whom correspondence should be addressed.
ABSTRACT
Motivation: MicroRNAs (miRNAs) are small ncRNAs participating in diverse cellular and physiological processes through the post-transcriptional gene regulatory pathway. Critically associated with miRNA biogenesis, the hairpin structure is a necessary feature for the computational classification of novel precursor miRNAs (pre-miRs). Though many of the abundant genomic inverted repeats (pseudo hairpins) can be filtered computationally, novel species-specific pre-miRs are likely to remain elusive.
Results: miPred is a de novo Support Vector Machine (SVM) classifier for identifying pre-miRs without relying on phylogenetic conservation. To achieve significantly higher sensitivity and specificity than existing (quasi) de novo predictors, it employs a Gaussian Radial Basis Function kernel (RBF) as a similarity measure for 29 global and intrinsic hairpin folding attributes. They characterize a pre-miR at the dinucleotide sequence, hairpin folding, non-linear statistical thermodynamics and topological levels. Trained on 200 human pre-miRs and 400 pseudo hairpins, miPred achieves 93.50% (5-fold cross-validation accuracy) and 0.9833 (ROC score). Tested on the remaining 123 human pre-miRs and 246 pseudo hairpins, it reports 84.55% (sensitivity), 97.97% (specificity) and 93.50% (accuracy). Validated on 1918 pre-miRs across 40 non-human species and 3836 pseudo hairpins, it yields 87.65% (92.08%), 97.75% (97.42%) and 94.38% (95.64%) for the mean (overall) sensitivity, specificity and accuracy. Notably, A.mellifera, A.geoffroyi, C.familiaris, E.Barr, H.Simplex virus, H.cytomegalovirus, O.aries, P.patens, R.lymphocryptovirus, Simian virus and Z.mays are unambiguously classified with 100.00% (sensitivity) and >93.75% (specificity).
Availability: Data sets, raw statistical results and source code are available at http://web.bii.a-star.edu.sg/~stanley/Publications
Contact: stanley{at}bii.a-star.edu.sg; santosh{at}bii.a-star.edu.sg
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
MicroRNAs (miRNAs) constitute an abundant class of small (21–23 nts), endogenous, and evolutionarily conserved ncRNA molecules that post-transcriptionally mediate the production of intra-cellular proteins in most eukaryotes via sequence-specific target mechanisms (Bartel, 2004). The founding members, the lin-4 and let-7 miRNAs discovered in 1993 and 2000, respectively, are key heterochronic regulators directing temporal aspects of developmental timing in early larval C.elegans (Lee et al., 1993; Reinhart et al., 2000). Subsequently, thousands of novel miRNA genes have been unraveled across plants, worms, flies, vertebrates and even viruses; >4000 miRNAs spanning 45 species are listed in miRBase 8.2 (Griffiths-Jones et al., 2006). Biologically pivotal and genomically more prevalent than presumed, miRNAs perform key regulatory roles in diverse cellular and physiological events such as apoptosis, proliferation, and fat metabolism in D.melanogaster (Brennecke et al., 2003; Xu et al., 2003); patterning and developmental specification in plants (Chen, 2004; Palatnik et al., 2003); and genetic diseases including oncogenesis (Calin and Croce, 2006; Lu et al., 2005).
Previously, novel miRNA genes were identified almost exclusively by directional cloning of endogenous small RNAs and high-throughput sequencing of large numbers of cDNA clones (Lagos-Quintana et al., 2001; Lau et al., 2001; Lee and Ambros, 2001). Conventional forward genetic screening is highly biased toward abundantly and/or ubiquitously expressed miRNAs that usually dominate the cloned products (Lagos-Quintana et al., 2003). Evidently, miRNAs expressed constitutively at low levels or in highly constrained tissue- and time-specific patterns are difficult to detect experimentally. Computational prediction techniques have been employed extensively to overcome this technical hurdle (Berezikov et al., 2006). The underlying principle revolves around two tenets. First, precursor miRNAs (pre-miRs) should possess a statistically significant and evolutionarily conserved (a)symmetric RNA hairpin, a structural prerequisite functionally critical for the early stages of mature miRNA biogenesis (Bartel, 2004; Kim, 2005); details at supplemental Biogenesis of Mature MicroRNAs. Second, the hairpin feature of pre-miRs should be distinct from that of random inverted repeats (termed pseudo hairpins) that can potentially fold into dysfunctional candidate hairpins, e.g. 1.1E7 in human (Bentwich et al., 2005) and 4.4E4 in C.elegans (Pervouchine et al., 2003). Removing this overwhelming and irrelevant genomic pool of false-positives without excessively sacrificing putative pre-miRs is technically most challenging, as pre-miRs are relatively short (60–80 nts in animals and 100–400 nts in plants) and have highly diverse base compositions (Zhang et al., 2006b). Unlike protein-coding genes, they frequently exhibit weak, or lack detectable, statistically significant primary-sequence signals such as open reading frames (ORFs), promoter motifs, and codon signatures (Berezikov et al., 2006).
Earlier attempts at circumventing these difficulties relied on identifying close homologs of published pre-miRs, e.g. let-7 (Pasquinelli et al., 2000). This can be as straightforward as aligning sequences through NCBI BlastN (McGinnis and Madden, 2004) while allowing several mismatches and gaps depending on their inter-phylogenetic distance. False-positives not residing in the orthologous locations are deemed not phylogenetically conserved among closely related species, and are consequently masked (Floyd and Bowman, 2004; Pasquinelli et al., 2000). The candidate orthologues of evolutionarily conserved miRNA genes are then assessed for their capability to fold into hairpin structures with the lowest Minimum Free Energy of folding (MFE), to have 16-bps involving the first 22-nts of the mature miRNA embedded within one arm of the fold-back precursor, and to lack large internal loops or bulges, especially large asymmetric ones (Ambros et al., 2003). Apparently, mere application of simple alignment queries and positive-selection rules is likely to overlook novel families lacking clear homologues to published mature miRNAs.
Advanced comparative approaches like MiRscan (Lim et al., 2003a,b), MIRcheck (Jones-Rhoades and Bartel, 2004), miRFinder (Bonnet et al., 2004b), miRseeker (Lai et al., 2003), findMiRNA (Adai et al., 2005), PalGrade (Bentwich et al., 2005) and MiRAlign (Wang et al., 2005) have systematically exploited the greater availability of sequenced genomes for eliminating the over-represented false-positives. Cross-species sequence conservation based on computationally intensive multiple genome alignments is a powerful approach for genome-wide screening of phylogenetically well conserved pre-miRs between closely related species. However, it suffers from lower sensitivity at divergent evolutionary distances (Berezikov et al., 2005; Boffelli et al., 2003). Identifying pre-miRs that differ significantly or evolve rapidly at the sequence level while retaining their characteristic evolutionarily conserved hairpin structures may also be an issue. Another significant drawback is that non-conserved pre-miRs with genus-specific patterns are likely to evade detection. Pathogenic viral-encoded pre-miRs have been uncovered in E.Barr virus, K.sarcoma-associated herpesvirus, M.γ-herpesvirus 68, H.cytomegalovirus and Simian virus 40 that share significant sequence similarities neither with known host pre-miRs nor among themselves (Cullen, 2006; Sarnow et al., 2006).
(Table 1) To surmount the technical shortfalls of comparative works for distinguishing species-specific and non-conserved pre-miRs, several state-of-the-art de novo (or ab initio) predictive approaches have been developed. The inaugural and definitive work by Sewer et al. (2005) compiled 40 distinctive sequence and structural markers from the hairpins that obviate the use of comparative genomics information. The SVM classifier model, trained with the experimental domain knowledge and binary-labeled feature vectors, recovered 71% of the positive pre-miRs with a remarkably low false-positive rate of 3%. It also predicted 50 to 100 novel pre-miRs for several species; 30% of these were previously experimentally validated. The validation rate among the predicted cases that were conserved in 1 other species was higher at 60%; many had not been detected by comparative genomics approaches. The 3SVM (Xue et al., 2005) improved the performances to 90.00% for human and up to 90.00% in other species. Despite its methodological simplicity, promising performances and independence of comparative genomics information, 3SVM was strictly limited to classifying RNA sequences that fold into secondary structures without multiple loops. RNAmicro (Hertel and Stadler, 2006), incorporating sequence and structural information as part of its feature vector, reported a highly promising efficiency of 91.16% (sensitivity) and 99.47% (specificity). Besides requiring computationally expensive multiple sequence alignments as inputs, another major drawback of its classification pipeline was that it excluded assessment of alignment windows whose consensus structure contained a stem with <10-bps or 2 hairpins with 5-bps each, and classified them instantly as non-pre-miRs.
ProMiR (Nam et al., 2005) exploited a probabilistic co-learning model technique, HMM, to discriminate miRNA genes according to their pairwise aligned sequences. It achieved a promisingly low false-positive rate of 4.00%, but at the cost of a lower sensitivity of 73.00%. A relatively recent work, BayesMIRfinder (Yousef et al., 2006), adopted an alternative discriminative machine learning algorithm, NBI, as its underlying classifier model. Notwithstanding its technical novelty, BayesMIRfinder relied on the comparative analysis of conserved genomic regions for post-processing to yield, in mouse, a considerably higher sensitivity of 97.00% and a specificity of 91.00% that is comparable to existing algorithms.
2 MATERIALS AND METHODS
2.1 Biologically relevant datasets
Training, testing and independent data sets. They were pooled separately from four independent sources; details at supplemental Materials and Methods. To improve the quality of this comprehensive collection, sequences with non-ACG[TU] nucleotides were filtered and no sequence was reused. The entire set of 2241 pre-miRs was obtained from miRBase 8.2 (Griffiths-Jones et al., 2006); 8494 pseudo hairpins were obtained from human RefSeq genes (Pruitt and Maglott, 2001) without any known experimentally validated alternative splicing (AS) events. For hyperparameter estimation and training the decision function of miPred, binary-class labeled samples consisting of 200 human pre-miRs (positives) and 400 pseudo hairpins (negatives) were randomly selected without replacement to avoid the classifier being skewed toward specifically screened training samples. The remaining 123 human pre-miRs (positives) and 246 randomly selected pseudo hairpins (negatives) were used for testing. The training and testing sets are denoted TR-H and TE-H, respectively. The comparable ratio of 1 : 2 ensures that the selected negatives contribute more significantly to the specificity of a classifier than positives, while avoiding the problem of overtraining. Typically, if the class sizes differ significantly, the decision function of SVM converges to a solution where all samples belonging to the smaller class are classified as members of the larger class. The performance of miPred was evaluated against three datasets: IE-NH, IE-NC and IE-M. They represent the remaining 1918 pre-miRs spanning 40 non-human species (positives) and 3836 randomly selected pseudo hairpins (negatives); 12 387 functional ncRNAs (negatives) from Rfam 7.0 (Griffiths-Jones et al., 2005); and 31 mRNAs (negatives) from GenBank DNA database (Benson et al., 2005), respectively.
Four complete viral genomes. They were downloaded from GenBank DNA database (Benson et al., 2005), namely E.Barr virus (EBV; 171,823-bps; DNA circular; AJ507799.2), K.sarcoma-associated herpesvirus (KSHV; 137,508-bps; DNA linear; U75698.1), M.γ-herpesvirus 68 strain WUMS (MGHV68; 119,451-bps; DNA linear; U97553.2) and H.cytomegalovirus strain AD169 (HCMV; 229,354-bps; DNA linear; X17403.1).
2.2 Computational pipeline of miPred
Background of SVM. Integral to miPred is the Support Vector Machine (SVM), a supervised classification technique derived from the structural risk minimization principle of statistical learning theory (Burges, 1998). Given its ability to deal with multi-dimensional data sets that can be noisy or redundant (non-informative or highly correlated), SVM has been adopted extensively as an invaluable discriminative machine learning tool to address diverse bioinformatics problems (Dror et al., 2005; Han et al., 2004; Liu et al., 2006).
Briefly, the primary objective of SVM is to explicitly construct a multi-dimensional hyperplane separating a set of complex feature vectors xi into binary labeled classes yi (1 or –1) with the distance between the hyperplane and the closest support vectors (the margin) maximized. In non-linear separable cases, the maximum-margin hyperplane is obtained after transforming uniquely the input variables into a high-dimensional feature space via the Gaussian Radial Basis Function kernel (RBF) K(x, xi) in Equation (1). Typically, SVM is conducted using three straightforward steps: feature extraction, training the decision function on a set of selected binary-labeled training vectors, and classifying a given test sample xi into either positive or negative classes (Burges, 1998).
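$$K(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\gamma\,\lVert \mathbf{x} - \mathbf{x}_i \rVert^{2}\right) \qquad (1)$$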
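As a minimal sketch of how the kernel of Equation (1) behaves, the Python snippet below evaluates a Gaussian RBF similarity between two feature vectors. It assumes the libSVM parameterization exp(-gamma * ||x - xi||^2); the function name rbf_kernel and the toy vectors are illustrative only and are not part of miPred. The gamma value 0.03125 is the optimum reported in Section 3.1.

```python
import math


def rbf_kernel(x, x_i, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(x_i):
        raise ValueError("feature vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-gamma * sq_dist)


# Toy 3-dimensional vectors (hypothetical values, not real miPred features).
u = [0.2, -0.5, 1.0]
v = [0.2, -0.5, 1.0]
w = [-1.0, 1.0, -1.0]

same = rbf_kernel(u, v, gamma=0.03125)       # identical vectors: similarity 1.0
different = rbf_kernel(u, w, gamma=0.03125)  # distant vectors: value in (0, 1)
```

The similarity decays smoothly with distance, so a larger gamma makes the decision function more local to individual training samples.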
Extraction of miPred's features. Considering that a single criterion to filter pseudo hairpins has not yet been identified, miPred undertakes a novel approach that posits the entire hairpin-shaped structure of each pre-miR can be characterized solely by a feature vector xi containing 29 RNA global and intrinsic folding attributes, without relying on phylogenetic conservation information (Ng and Mishra, 2007); details at supplemental Materials and Methods. Seventeen base composition variables: 16 dinucleotide frequencies %XY such that X, Y ∈ [A, C, G, U], and 1 aggregate dinucleotide frequency %G + C ratio. Dinucleotide is the preferred predicting descriptor over mononucleotide or higher-order frequencies, as it strikes a compromise between resolution and computational tractability. Six folding measures: adjusted base pairing propensity dP (Schultes et al., 1999), adjusted Minimum Free Energy of folding (MFE) denoted as dG (Freyhult et al., 2005; Seffens and Digby, 1999), MFE index 1 MFEI1 (Zhang et al., 2006a), adjusted base pair distance dD (Freyhult et al., 2005; Moulton et al., 2000), adjusted Shannon entropy dQ (Freyhult et al., 2005), and MFE index 2 MFEI2. One topological descriptor: degree of compactness dF (Fera et al., 2004; Gan et al., 2004). Five normalized variants of dP, dG, dQ, dD and dF, i.e. zP, zG, zQ, zD and zF, derived from dinucleotide shuffling. We computed the 17 sequence composition variables as well as the non-linear statistical thermodynamics measures dQ and dD by a custom Perl program interfaced to the module RNAlib of Vienna RNA Package 1.4 (Hofacker, 2003); dG by the RNAfold program (Hofacker, 2003) that predicts the most favorable RNA secondary structure and the corresponding MFE; the topological descriptors S and dF by a custom program RNAspectral. After synthesizing the set of random RNA sequences, the normalized variants zP, zG, zQ, zD, and zF were computed.
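The base composition variables and the normalized variants can be sketched as follows. This is an illustration under stated assumptions, not the authors' Perl program: %XY is assumed to be counted over overlapping windows and divided by the number of windows, %G + C is assumed to be the fraction of G and C bases, and the z-type variants are assumed to be an ordinary z-score of the observed value against values from shuffled sequences (the sample standard deviation is an arbitrary choice here). All function names and the example sequence are hypothetical.

```python
from itertools import product
from statistics import mean, stdev

BASES = "ACGU"


def dinucleotide_percentages(seq):
    """Return %XY for all 16 dinucleotides, counted over overlapping windows."""
    seq = seq.upper().replace("T", "U")
    if len(seq) < 2 or set(seq) - set(BASES):
        raise ValueError("expected a sequence of length >= 2 over ACG[TU]")
    counts = {x + y: 0 for x, y in product(BASES, repeat=2)}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    windows = len(seq) - 1
    return {pair: 100.0 * n / windows for pair, n in counts.items()}


def gc_percentage(seq):
    """Return the percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def z_score(observed, shuffled_values):
    """Standardize an observed measure against values from shuffled sequences."""
    spread = stdev(shuffled_values)
    if spread == 0:
        raise ValueError("shuffled values have zero spread")
    return (observed - mean(shuffled_values)) / spread


example = "GGCAUGCC"  # hypothetical 8-nt sequence, for illustration only
freqs = dinucleotide_percentages(example)
gc = gc_percentage(example)
z = z_score(5.0, [1.0, 2.0, 3.0])  # toy numbers, not real folding measures
```

The 16 percentages plus %G + C give the 17 composition entries of the feature vector; the folding and topological measures require an RNA folding package and are not reproduced here.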
Parameter estimation, training, and evaluation of miPred. The libSVM version 2.82 (http://www.csie.ntu.edu.tw/~cjlin/libsvm), a free SVM implementation, was used for training and testing miPred's binary classification. Samples were randomly selected without replacement via a custom Python script. First, the 29 attributes of miPred were rescaled linearly by the svm-scale program to the interval [–1.0, 1.0] for all the data sets to guard against bias from differing numeric ranges, since attributes with larger numeric ranges may dominate the classification, e.g. [6.0, 50.0] versus [–0.5, –0.2]. All miPred classifier models were generated with svm-train -b 1 -c 2C -g γ; the RBF kernel is the default, and the -b 1 option computes the SVM probability estimates (P-values) for classification thresholding. As both the penalty parameter C (which determines the trade-off between training error minimization and margin maximization) and the RBF kernel parameter γ (which defines the nonlinear mapping from input space to some high-dimensional feature space) are critical for the performance of SVM (Duan et al., 2003), they were calibrated by an exhaustive grid-search strategy. Briefly, at each hyperparameter pair (C, γ) selected from the search space log2C ∈ [–10, –9, ... , 15] and log2γ ∈ [–15, –14, ... , 10], we performed a 5-fold cross validation. The training data set was randomly partitioned into five distinct, approximately equal-sized subsets. The validation process was repeated five times, once per subset, i.e. retaining one subset for testing and the remaining four for training; the average accuracy of the five models gave the 5-fold cross-validation accuracy rate (Duan et al., 2003). To avoid over-fitting, the combination of hyperparameters (C, γ) maximizing the 5-fold cross-validation accuracy rate served as the default setting for training miPred. Finally, the classification was conducted on the testing and independent evaluation data sets with svm-predict -b 1. See supplemental Materials and Methods for details on statistical tests and performance evaluation metrics.
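The calibration procedure can be outlined in plain Python as below. This is a sketch of the bookkeeping only (linear rescaling, the hyperparameter grid, and the 5-fold partition); it does not call libSVM, and the helper names, the random seed and the user-supplied fit_and_score callback are all assumptions made for illustration.

```python
import random


def rescale(values, lower=-1.0, upper=1.0):
    """Linearly map one attribute's values onto [lower, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: place it at the interval midpoint
        return [0.5 * (lower + upper) for _ in values]
    return [lower + (upper - lower) * (v - lo) / (hi - lo) for v in values]


def hyperparameter_grid():
    """All (C, gamma) pairs with log2 C in -10..15 and log2 gamma in -15..10."""
    return [(2.0 ** c, 2.0 ** g)
            for c in range(-10, 16)
            for g in range(-15, 11)]


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_accuracy(folds, fit_and_score):
    """Average held-out accuracy over the folds.

    fit_and_score(train_idx, test_idx) is a user-supplied callback that trains
    on train_idx and returns the accuracy measured on test_idx.
    """
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)


grid = hyperparameter_grid()      # 26 x 26 candidate pairs
folds = k_fold_indices(600)       # TR-H holds 200 + 400 = 600 samples
scaled = rescale([6.0, 28.0, 50.0])
```

In a full run, each pair in grid would be passed to the classifier inside fit_and_score, and the pair with the highest cross-validation accuracy would be kept.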
3 RESULTS AND DISCUSSION
3.1 Training and classifying human pre-miRs
We calibrate miPred using TR-H; the optimal hyperparameter pair (C, γ) is (16.0, 0.03125), which maximizes the 5-fold cross-validation accuracy rate at 93.50%. miPred assigns each hairpin a classification score in the range [0.0, 1.0], and a hairpin is designated as a candidate pre-miR if its score is beyond a specified threshold. Across the entire spectrum of thresholds, a trade-off generally exists between specificity (greater at higher thresholds) and sensitivity (greater at lower thresholds) (Dror et al., 2005; Liu et al., 2006). The ROC analysis of miPred's classification model (figure not shown) reported a ROC score close to unity, i.e. 0.9833. (Fig. 1a and Table S1) With miPred's default threshold predefined at 0.5, the SE (Sensitivity), SP (Specificity) and ACC (Accuracy) reported for TR-H are 88.00%, 97.50% and 94.33%, respectively. Here, SP > SE is more desirable in screening for novel pre-miRs from entire genomic sequences or cloned small RNAs, as abundant dysfunctional hairpins are encoded in the human (Bentwich et al., 2005) and C.elegans (Pervouchine et al., 2003) genomes. A slightly lower SP than SE would reduce the signal (genuine pre-miRs) to background (pseudo hairpins) ratio, significantly inflating the effort and resources demanded in experimentally validating the putative precursor transcripts as biologically functional pre-miRs.
(Fig. 1b and Table S1) Next, applying miPred to TE-H yields comparable performances of 84.55% (SE), 97.97% (SP) and 93.50% (ACC). In all, miPred correctly classifies 86.69% (280/323) of human pre-miRs as positives and 97.68% (631/646) of pseudo hairpins as negatives. Three of the human pre-miRs designated as negatives receive very low classification scores from miPred: hsa-mir-565 (0.454), hsa-mir-566 (0.012) and hsa-mir-594 (0.187). Coincidentally, they have been suspected to be falsely annotated as precursor transcripts encoding mature miRNAs on two grounds (Berezikov et al., 2006). First, both hsa-mir-565 and hsa-mir-594 overlap with tRNA annotations; hsa-mir-566 overlaps with Alu repeats. Second, none was represented by >1 clone or differentially expressed in a Dicer-deficient cell-line (Cummins et al., 2006). Nevertheless, we believe that neither criterion is sufficient to eliminate a candidate, as repeat-derived (Smalheiser and Torvik, 2005) and pseudogene-derived miRNAs (Devor, 2006) have been discovered, and miRNAs expressed at low levels may elude detection in a Dicer-disrupted mutant (Berezikov et al., 2006).
(Table S1) In contrast, 3SVM based on triplet-encoding scheme (Xue et al., 2005) yields slightly poorer results: 86.00% (SE), 97.00% (SP) and 93.33% (ACC) for TR-H; 73.15%, 95.37% and 87.96% for TE-H; or overall 81.49% (251/308) of human pre-miRs as positives and 96.43% (594/616) of pseudo hairpins as negatives. The evaluation demonstrates the outstanding and consistent classification performance of miPred in partitioning specifically human pre-miRs from pseudo hairpins. The improved distinct separation by miPred is likely due to its excellent capability in recognizing the specific intrinsic and global features of human pre-miRs against those of pseudo hairpins.
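The reported percentages follow directly from the confusion counts. As a small sketch, the snippet below recomputes SE and SP from the pooled human counts quoted above (280 of 323 pre-miRs and 631 of 646 pseudo hairpins) and applies the default 0.5 threshold to the three scores listed for hsa-mir-565, hsa-mir-566 and hsa-mir-594. The function names are illustrative, and treating "beyond the threshold" as a strict inequality is an assumption.

```python
def sensitivity(tp, fn):
    """Fraction of true pre-miRs classified as positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of pseudo hairpins classified as negatives."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def is_candidate(score, threshold=0.5):
    """Assumed rule: a hairpin is a candidate pre-miR if score > threshold."""
    return score > threshold


# Pooled human counts (TR-H plus TE-H) as quoted in the text.
tp, fn = 280, 323 - 280
tn, fp = 631, 646 - 631

se = 100 * sensitivity(tp, fn)
sp = 100 * specificity(tn, fp)
acc = 100 * accuracy(tp, tn, fp, fn)

# Classification scores quoted in the text for three human pre-miRs.
scores = {"hsa-mir-565": 0.454, "hsa-mir-566": 0.012, "hsa-mir-594": 0.187}
flagged = {name: is_candidate(s) for name, s in scores.items()}
```

All three scores fall below 0.5, which is why these hairpins are reported as negatives at the default threshold.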
3.2 Improved classification of non-human pre-miRs
(Fig. 1c and Table S1) We next extend the validation of miPred to IE-NH and quantify its mean (overall) SE, SP, and ACC. Here, mean denotes the average performance for all species within IE-NH; overall performance is derived from the entire IE-NH independent of species. In this setting, miPred yields excellent classification performances comparable to those of TR-H and TE-H, with respective SE, SP and ACC of 87.65% (92.08%; 1766/1918 non-human pre-miRs as positives), 97.75% (97.42%; 3737/3836 pseudo hairpins as negatives) and 94.38% (95.64%). (Table S1) In contrast, 3SVM reports 80.10% (86.15%; 1443/1675 non-human pre-miRs as positives), 96.81% (96.27%; 3225/3350 pseudo hairpins as negatives) and 91.24% (92.90%). Apparently, these results point to miPred as a more credible and consistent classifier for reliably distinguishing species-specific and evolutionarily well-conserved pre-miRs across plants, worms, flies, vertebrates and viruses (Griffiths-Jones et al., 2006).
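The difference between the mean and overall figures can be illustrated with a short sketch. The per-species counts below are hypothetical toy values chosen only to show how the two averages diverge when group sizes differ; the pooled IE-NH count (1766 of 1918) is taken from the text. Function names are illustrative.

```python
def mean_rate(counts):
    """Unweighted average of per-group rates (the 'mean' figure)."""
    return sum(correct / total for correct, total in counts.values()) / len(counts)


def overall_rate(counts):
    """Pooled rate over all samples regardless of group (the 'overall' figure)."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


# Hypothetical per-species (correct, total) counts, for illustration only.
toy = {"species_a": (9, 10), "species_b": (50, 100)}

mean_se = mean_rate(toy)        # averages the rates 0.9 and 0.5
overall_se = overall_rate(toy)  # pools the counts: 59 correct out of 110

# Pooled non-human pre-miR count quoted in the text.
pooled_ie_nh = 100 * overall_rate({"IE-NH": (1766, 1918)})
```

The mean weights every species equally, whereas the overall figure is dominated by species contributing many pre-miRs, which is why the two values reported for IE-NH differ.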
Notably, those pre-miRs present in the genomes of A.mellifera, A.geoffroyi, C.familiaris, E.Barr, H.Simplex virus, H.cytomegalovirus, O.aries, P.patens, R.lymphocryptovirus, Simian virus and Z.mays are unambiguously identified by miPred with 100.00% (SE) and >93.75% (SP). Moreover, pre-miRs encoded in C.briggsae and C.elegans are excellently classified with SE of 94.74% and 84.96%, as well as SP of 99.34% and 96.90%; the remaining two pathogenic viruses M.γ-herpesvirus and K.sarcoma-associated herpesvirus have SE of 88.89% and 91.67%, as well as SP of 94.44% and 100.00%. Since miPred was not trained on any species-specific pre-miRs, and especially not on viral-encoded ones, this evidence reinforces the premise that its selected descriptors have successfully captured the intrinsic and global properties characterizing biologically functional pre-miRs across different species, including viruses.
(Table S2) An obvious question is how viral-encoded pre-miRs can be distinguished by miPred so outstandingly, especially when they are known to lack homologs in other viruses or in the host (Cullen, 2006; Sarnow et al., 2006). As there are few experimental studies elucidating their biological activities and biogenesis (Sullivan et al., 2005), we speculate pathogenic viruses do not possess homologous genes that can express functionally similar host miRNA processing proteins e.g. Drosha, Dicer and RISC. After infecting the human immune cells, they hijack these critical host proteins to regulate viral and host gene expression (Cullen, 2006; Sarnow et al., 2006). This will facilitate their viral replication and pathogenesis by blocking the innate or adaptive host immune responses or by interfering with the appropriate regulation of apoptosis, cell growth or DNA replication. Consequently, viral-encoded pre-miRs are likely to be recognized and processed identically to the host (i.e. human) pre-miRs that miPred was trained on.
3.3 Performance comparison with existing predictors
(Fig. 1d) Judging by the published results of existing (quasi) de novo classifiers (Table 1), both RNAmicro (Hertel and Stadler, 2006) and miPred are the highest-scoring predictors in identifying putative pre-miRs from a genomic pool of candidate hairpins. RNAmicro displays comparable F-measure and Matthews correlation coefficient of 98.90% and 92.97% (pre-miRs from various animals) versus 95.29% and 85.47% (human pre-miRs) or 95.34% and 90.14% (non-human pre-miRs) for miPred. In contrast, 3SVM (Xue et al., 2005) is the worst performer among the remaining classifiers, which report 20.85–91.87% and 30.80–79.51%, respectively.
Notably, miPred benefits from two key technical advances. First, its 29 features are extracted from a single RNA sequence for classifying novel pre-miRs against pseudo hairpins in an unequivocal de novo manner. This is the primary advantage that miPred has over RNAmicro: it avoids costly and occasionally unreliable multiple sequence alignments of phylogenetically distant or rapidly evolving pre-miRs. RNAmicro relies on computationally expensive comparative genomic alignments for predicting the consensus secondary structures and computing its feature vector (Hertel and Stadler, 2006). Moreover, ProMiR (Nam et al., 2005) and BayesMIRfinder (Yousef et al., 2006) depend on similar phylogenetic/conservation information to avoid a significant loss of performance. Because these methods rely on sequence homology, their predictive accuracy may suffer when the cross-species evolutionary distance (e.g. vertebrates versus nematode/urochordate) is so large that reliable multi-genome alignment becomes technically difficult or impossible. Second, distinct from the classifiers by Sewer et al. (2005) and 3SVM (Xue et al., 2005), the 29 attributes of miPred represent the global and intrinsic properties of any RNA structure, and not specific regions of it. Besides avoiding the pars pro toto fallacy of mistaking a part for the whole, miPred can handle both hairpin structures and RNA sequences that fold with multiple loops.
3.4 Classification of functional ncRNAs and mRNAs
The original intent of miPred is to distinguish pre-miRs spanning diverse species from genomic pseudo hairpins, according to the classifier model trained solely on human data sets. Since ncRNAs and mRNAs were not included in the initial training, it will be very instructive to assess how well miPred can discriminate them as non pre-miRs without relying on their specific dinucleotide sequence, structural and topological characteristics. Moreover, such assessment was lacking or not available from existing (quasi) de novo predictors (Table 1). (Fig. 1e and Table S3) Evaluating miPred and 3SVM (Xue et al., 2005) onto IE-NC and IE-M, the former reports mean (overall) SP of 76.15% (68.68%; 8507/12387 ncRNAs) and 87.10% (27/31 mRNAs). Here, mean or average SP is computed from all ncRNA types within IE-NC; overall SP corresponds to the entire IE-NC independent of ncRNA types. In contrast, 3SVM yields 90.30% (78.37%; 1884/2404 ncRNAs across 155 types) and 0.00% (0/31 mRNAs) for SP (figure not shown). Upon scrutiny, its better performances are attained at the expense of excluding 9983 ncRNAs spanning 302 types (IE-NC) and 31 mRNAs (IE-M) that fold into complex structures containing multiple loops. This structural exclusion is a major limitation experienced commonly by most of the existing (quasi) de novo classifiers (Table 1) that extract modularized features from predefined RNA sub-structures. The comparison with 3SVM clearly demonstrates that miPred trained solely on human pre-miRs and pseudo hairpins, can provide reasonable generalization in identifying unambiguously at least two-thirds of all the samples in IE-NC and IE-M as bona fide negatives.
Among the ncRNA samples in IE-NC, tRNAs (Sprinzl and Vassilenko, 2005) and snoRNAs (Weinstein and Steitz, 1999) are two of the largest classes of small ncRNAs present in the eukaryotic genomes. They are frequently misclassified as pre-miRs in most experimental settings, due to the absence of statistical signatures like the codon structure and open reading frame (ORF) of protein-coding genes (Sprinzl and Vassilenko, 2005; Weinstein and Steitz, 1999). The snoRNAs can be divided into C/D snoRNAs and H/ACA snoRNAs, acting as guides for site-specific 2'-O-ribose methylation or for pseudouridylation in the post-transcriptional processing of rRNAs (Weinstein and Steitz, 1999). (Fig. 1e and Table S4) miPred identifies 94.61% of C/D snoRNAs, 60.97% of H/ACA snoRNAs, and 85.55% of tRNAs as genuine non pre-miRs. To enhance the quality of miPred's identification, specialized algorithmic tools like snoseeker (Yang et al., 2006) and tRNAscan-SE (Lowe and Eddy, 1997) can serve as rapid pre-processing filters for excluding these abundant ncRNAs, except C/D snoRNAs. They have reported SE of 90.00%, 75.00% and 99.5% for detecting C/D snoRNAs, H/ACA snoRNAs and tRNAs, respectively.
(Fig. 1e and Table S4) miPred is capable of correctly discriminating 75.75% frameshift, 85.47% IRES, 75.00% thermoregulator, 70.66% rRNA and 85.71% snRNA as authentic non-pre-miRs. Interestingly, a novel and abundant class of ncRNAs known as riboswitches (Winkler and Breaker, 2003) is correctly classified by miPred as non pre-miRs with a comparable SP of 82.28%. These riboswitches, found only in prokaryotes to date, can cis-modulate their expression upon binding to a metabolite (e.g. guanine and thiamine pyrophosphate) without involving accessory protein cofactors. Our SVM classifier miPred will likely become an invaluable pre-experimental predictor in the event that eukaryotic riboswitch(-like) molecules are identified.
(Fig. 1e and Table S4) Several classes of ncRNA are poorly classified by miPred, i.e. mistaken for potential pre-miRs, with SP below 60.00%: Antisense, Ribozymes, Spliceosomes like U1–2 and U4–6, and Group I/II intron RNAs. Careful inspection of their sequence, structural, and topological properties reveals no generally noticeable trends to explain the evasive detection. This finding prompts us to speculate that the feature vector used by miPred may lack specific discriminative components against these elusive classes of functional ncRNAs, or in part that they may be exceedingly mobile or rapidly evolving. Identifying and eliminating such ncRNAs will require specialized tools built on the domain knowledge of their characteristic properties.
3.5 Contribution of individual features
We next investigate which attributes of miPred contribute substantially to the class distinctions between pre-miRs and pseudo hairpins, and whether excluding selected feature(s) can enhance or degrade miPred's performances. Elucidating the contribution of individual attributes within a feature vector has the potential benefits of enhancing the predictive performance and computational tractability of the classifier, and of gaining deeper insights into the domain problem (Isabelle and Andre, 2003). Despite its importance, only 3SVM (Xue et al., 2005) among the existing (quasi) de novo classifiers (Table 1) has conducted an analysis (less detailed than ours) of its feature selection.
(Fig. 1f and Table S5) We evaluate the F-scores F1 and F2 (defined in the supplemental Materials and Methods) on the class-conditional distributions; they measure the discriminative power of miPred's 29 attributes. The two scores are strongly and positively correlated, with Pearson correlation coefficient r = 0.977 and p = 1.272E–19. As expected, structural features possess the strongest discriminative power, dominating the 12 highest-scoring attributes (ranked by descending F1 score): MFEI1, zG, dP, zP, zQ, dG, dQ, zD, dD, MFEI2, %AU, and %G + C. They overlap to some degree with RNAmicro's features (Hertel and Stadler, 2006), i.e. %G + C, MFEI1, dG (RNAmicro uses the mean MFE of the aligned sequences and the MFE of the consensus structure), and zG (which RNAmicro computes via a regression model). Since the majority of pre-miRs are well-defined and thermodynamically stable stem-loop structures critical for the biogenesis of mature miRNAs (Bonnet et al., 2004b), these common features and miPred's top-ranking ones are most likely to be conserved across all species from human to viruses. We believe they are likely to be indispensable for making miPred's multi-feature classification more robust against erroneous classifications of novel pre-miRs.
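To make the ranking step concrete, the following minimal sketch scores each feature with a two-class F-score and sorts the features in descending order. The exact F1 and F2 definitions are given in the supplemental Materials and Methods and are not reproduced here; the formula below (squared deviations of the two class means from the overall mean, divided by the summed within-class sample variances) is a commonly used form and is an assumption, as are the function names and the toy values.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """Two-class F-score of a single feature (assumed form, see the note above).

    pos: feature values for pre-miRs; neg: feature values for pseudo hairpins.
    """
    overall = mean(list(pos) + list(neg))
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    return between / within


def rank_features(table_pos, table_neg):
    """Return feature names sorted by descending F-score.

    table_pos / table_neg map a feature name to its list of values in each class.
    """
    scores = {name: f_score(table_pos[name], table_neg[name]) for name in table_pos}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy values only; they are not taken from the paper's datasets.
    toy_pos = {"well_separated": [1.0, 1.1, 0.9], "overlapping": [0.5, 0.6, 0.4]}
    toy_neg = {"well_separated": [0.0, 0.1, -0.1], "overlapping": [0.5, 0.7, 0.3]}
    print(rank_features(toy_pos, toy_neg))
```

A feature whose class means coincide receives a score of zero regardless of its spread, which is why weakly separated sequence attributes end up at the bottom of such a ranking.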
Generally, the efficiency and reliability of classifiers depend on the size and selection of both the relevant data samples and the specific attributes (Isabelle and Andre, 2003). We next repeat the previous experiments using 10 variants of miPred that have a smaller collection of features; they are trained in exactly the same manner as miPred with identical samples in TR-H, and their performances are assessed against the remaining datasets (TE-H, IE-NH, IE-NC, and IE-M). miPred3 contains a subset of 26 features from miPred that excludes dQ, dD and zD. When evaluated statistically on the 2241 non-redundant pre-miRs, three pairs of attributes are strongly and positively correlated (Ng and Mishra, 2007), with r ranging over 0.9221–0.9846 and P < 0.001: dQ versus dD, dQ versus zQ and zQ versus zD. zQ is retained because of its higher discriminative power (as indicated by both its F1 and F2 scores) than dQ, dD and zD (Fig. 1f). Derived from miPred3, the remaining nine variants miPred3/5, miPred3/10, ... , miPred3/24, and miPred3/25 include only the top-ranking 21, 16, 11, 6, 5, 4, 3, 2, and 1 feature(s), respectively.
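The bookkeeping behind these variants can be sketched as follows. The sketch assumes that each miPred3/k variant drops the k lowest-ranked of miPred3's 26 features, which is consistent with the stated sizes (26 − 5 = 21, ..., 26 − 25 = 1); the variant names between miPred3/10 and miPred3/24 are inferred from that pattern rather than quoted. Only the 12 top-ranked attributes are named in the text, so feature lists can be produced only for the smaller variants.

```python
# Top 12 attributes by descending F1 score, as listed in the text.
TOP12_BY_F1 = ["MFEI1", "zG", "dP", "zP", "zQ", "dG",
               "dQ", "zD", "dD", "MFEI2", "%AU", "%G+C"]
EXCLUDED = {"dQ", "dD", "zD"}   # removed to obtain miPred3
N_MIPRED3 = 26                  # 29 features minus the 3 excluded ones

# Known prefix of the miPred3 ranking (9 of its 26 features).
RANKED_PREFIX = [f for f in TOP12_BY_F1 if f not in EXCLUDED]

# Sizes reported for the nine derived variants.
SIZES = [21, 16, 11, 6, 5, 4, 3, 2, 1]
VARIANTS = {f"miPred3/{N_MIPRED3 - size}": size for size in SIZES}


def top_features(n_removed):
    """Features kept after removing the n_removed lowest-ranked ones."""
    size = N_MIPRED3 - n_removed
    if size > len(RANKED_PREFIX):
        raise ValueError("ranking beyond the top 12 attributes is not given in the text")
    return RANKED_PREFIX[:size]


if __name__ == "__main__":
    for name, size in VARIANTS.items():
        print(name, size)
    print(top_features(20))  # the six-feature variant
```

Under these assumptions the six-feature variant keeps MFEI1, zG, dP, zP, zQ and dG.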
(Fig. 1g and Table S6) As expected, miPred and miPred3 demonstrate consistent and comparable classification accuracies across the five datasets. The former, which contains the nearly perfectly correlated features dQ, dD, and zD as part of its larger feature vector, is highly resilient to redundancy because it relies on an SVM. An SVM incorporates regularization and is based on the theory of risk minimization, which can provide robust generalization control when accommodating redundant (i.e. strongly correlated) variables (Burges, 1998). With 5–15 low-scoring features removed, miPred3/5 – miPred3/15 yield negligible performance differences compared to miPred3 when applied to the pre-miR datasets; larger improvements are reported by miPred3/5 for the ncRNA and mRNA datasets. This result suggests that the removed features contribute little to miPred, as non-informative attributes, and that they generally do not degrade the performance of the discriminant method by overfitting the training data. With fewer than seven top-ranking features, miPred3/20 – miPred3/25 show slightly degraded overall classification accuracies on the pre-miR datasets, but generally perform better on the ncRNA and mRNA datasets. Both findings indicate that the six highest-scoring attributes MFEI1, zG, dP, zP, zQ, and dG are likely to be the predominant contributors to the prediction accuracy of miPred.
(Fig. 1g and Table S6) Features with weak discriminative power (like the sequence attributes in miPred that possess low F-scores) are viewed largely as redundant (i.e. non-informative), as no additional performance is gained by including them (Isabelle and Andre, 2003). To test this premise, we evaluate another three variants of miPred: miPredI (17 features; 16 dinucleotide frequencies and %G + C), miPredII (12 features; MFEI1, MFEI2, dP, dG, dQ, dD, dF, zP, zG, zQ, zD and zF), and miPredIII (9 features; a subset of miPredII that excludes dQ, dD and zD). miPredI performs the worst when identifying pre-miRs and degrades moderately for IE-NC, but classifies better than expected when applied to IE-M. In contrast, the absence of sequence information (i.e. the 16 dinucleotide frequencies and %G + C) shows no noticeable effect on the performance of miPredII and miPredIII for human pre-miRs in comparison to miPred and miPred3; both classifiers fare slightly worse than miPredI for IE-NH and much worse for IE-NC and IE-M. As indicated by both findings, the sequence information contributes little or nothing toward discriminating pre-miRs from pseudo hairpins. Nevertheless, it is likely to play a critical or compensatory role in the classification of ncRNAs and mRNAs as non pre-miRs.
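The composition of these feature groups can be written down directly, as in the sketch below. The structural attribute names are taken from the text; the labels used for the 16 dinucleotide frequencies (a percent sign followed by the two bases over the RNA alphabet ACGU, mirroring the "%AU" label above) are an assumption made for illustration.

```python
from itertools import product

# 16 dinucleotide frequencies over the RNA alphabet (label format is assumed).
DINUCLEOTIDES = ["%" + a + b for a, b in product("ACGU", repeat=2)]

# miPredI: sequence information only.
MIPRED_I = DINUCLEOTIDES + ["%G+C"]

# miPredII: folding, thermodynamic and topological attributes only.
MIPRED_II = ["MFEI1", "MFEI2", "dP", "dG", "dQ", "dD", "dF",
             "zP", "zG", "zQ", "zD", "zF"]

# Attributes dropped because of their strong correlation with zQ.
CORRELATED = {"dQ", "dD", "zD"}

MIPRED_III = [f for f in MIPRED_II if f not in CORRELATED]
MIPRED = MIPRED_I + MIPRED_II                      # full feature vector
MIPRED_3 = [f for f in MIPRED if f not in CORRELATED]

if __name__ == "__main__":
    for name, features in [("miPred", MIPRED), ("miPred3", MIPRED_3),
                           ("miPredI", MIPRED_I), ("miPredII", MIPRED_II),
                           ("miPredIII", MIPRED_III)]:
        print(name, len(features))
```

The sizes reproduce the counts stated in the text: 29, 26, 17, 12 and 9 features, respectively.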
3.6 Screening viral-encoded miRNA genes
A recent rna22-based census suggested that the previous numbers of pre-miRs present in several species were gross underestimates, and are likely to range in the tens of thousands (Miranda et al., 2006): C.elegans (359), D.melanogaster (654), M.musculus (>25,000) and H.sapiens (>25,000). As an illustrative application of miPred, we randomly select four complete viral genomes for screening novel pre-miRs via a similar methodology (Miranda et al., 2006): E.Barr virus (EBV), K.sarcoma-associated herpesvirus (KSHV), M.γ-herpesvirus 68 strain WUMS (MGHV68), and H.cytomegalovirus strain AD169 (HCMV). To date, miRBase 8.2 (Griffiths-Jones et al., 2006) has annotated 23 (EBV; 23+ strands), 13 (KSHV; 12– and 1 unknown strand), 9 (MGHV68; 9+ strands), and 11 (HCMV; 6+, 4–, and 1 unknown strands) viral-encoded pre-miRs. The four viral genomic sequences are oriented to the corresponding +/– strands along which the published pre-miRs are located, and then scanned with a predefined sliding window (size of 95-nts in 1-nt steps) for potential viral-encoded hairpins. Those genomic regions satisfying the thresholds on maximum length (≤95-nts), minimum size of terminal loop (≥3-nts), and MFE (≤–25 kcal/mol) are retained for classification via miPred. The three thresholds were empirically determined from the available genuine pre-miRs encoded in the four pathogenic viruses. The computational approach was described previously by Grad et al. (2003), with differences in the parameter settings as mentioned earlier. Briefly, it uses a BLAST-like algorithm to search for short complementary words (stem-shaped structure) within a specified distance, and dynamic programming to determine the complete alignment. MFEs are predicted by the RNAfold program (Hofacker, 2003) with default parameters.
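The window-and-filter step can be illustrated with the simplified sketch below. It is not the BLAST-like procedure of Grad et al.; it folds every 95-nt window directly and keeps those that pass the loop and MFE filters, reading the thresholds as "at least 3-nts" and "at most –25 kcal/mol". The `fold` argument is a hypothetical adapter that returns a dot-bracket structure and an MFE for a sequence (for example, a wrapper around RNAfold), and all function names are assumptions.

```python
import re
from typing import Callable, Iterator, List, Tuple

# Hypothetical adapter type: sequence -> (dot-bracket structure, MFE in kcal/mol).
Fold = Callable[[str], Tuple[str, float]]


def hairpin_loop_sizes(structure: str) -> List[int]:
    """Sizes of all hairpin loops, i.e. unpaired runs closed directly by one base pair."""
    return [len(run) for run in re.findall(r"\((\.*)\)", structure)]


def scan(genome: str, fold: Fold, window: int = 95, step: int = 1,
         min_loop: int = 3, max_mfe: float = -25.0
         ) -> Iterator[Tuple[int, str, str, float]]:
    """Yield (start, sequence, structure, mfe) for windows passing all filters."""
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        seq = genome[start:start + window]      # never longer than `window`
        structure, mfe = fold(seq)
        loops = hairpin_loop_sizes(structure)
        if loops and min(loops) >= min_loop and mfe <= max_mfe:
            yield start, seq, structure, mfe


if __name__ == "__main__":
    # Stand-in folding function for demonstration only; it is not a real predictor.
    def fake_fold(seq: str) -> Tuple[str, float]:
        if seq.startswith("G"):
            return "((((....))))", -30.0
        return "." * len(seq), 0.0

    for hit in scan("G" + "A" * 11 + "C", fake_fold, window=12):
        print(hit)
```

Passing the folding routine as a parameter keeps the filtering logic testable without an RNA folding package installed; the retained windows would then be handed to the classifier.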
(Fig. 2a) Roughly 30.15% (EBV; 60/199), 16.51% (KSHV; 36/218), 10.87% (MGHV68; 20/184), and 27.71% (HCMV; 133/480) of the hairpins are classified as putative pre-miRs (positives) at the default miPred score cut-off of 0.5; the remaining ones are regarded as negatives. (Table S7) When the viral-encoded hairpins are manually mapped to the published pre-miRs, 25 true-positives (and 1 false-negative) match 25 published viral-encoded pre-miRs (red region) and their mature miRNAs (underlined region): 12 (1) EBV, 6 (0) KSHV, 3 (0) MGHV68 and 4 (0) HCMV. Except for kshv-mir-K12-9 and kshv-mir-K12-9, the remaining true-positive predictions have one or two mature miRNAs embedded exclusively in either arm of their (a)symmetric stem. kshv-mir-K12-9 is subsequently eliminated as it is a duplicate copy containing the exact sequence of kshv-mir-K12-9, and the encoded mature miRNAs overlap the most with its predicted 4-nts (uaua) terminal loop. Together, we can identify 44.64% (25/56) of the known pre-miRs for the four viruses as hairpins, and recover 96.00% of these hairpins (24/25) as true-positives.
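The percentages quoted above follow directly from the stated counts, as the short check below shows. The dictionary layout and the helper name `pct` are illustrative choices; the counts are those given in the text.

```python
def pct(part, whole):
    """Percentage rounded to two decimals, matching the precision used in the text."""
    return round(100.0 * part / whole, 2)


# (positives, hairpins scanned) per virus at the default cut-off.
POSITIVES = {"EBV": (60, 199), "KSHV": (36, 218),
             "MGHV68": (20, 184), "HCMV": (133, 480)}

# Published pre-miRs in miRBase 8.2 and true-positives mapped to them.
PUBLISHED = {"EBV": 23, "KSHV": 13, "MGHV68": 9, "HCMV": 11}
TRUE_POSITIVES = {"EBV": 12, "KSHV": 6, "MGHV68": 3, "HCMV": 4}

if __name__ == "__main__":
    for virus, (pos, total) in POSITIVES.items():
        print(virus, pct(pos, total))

    known = sum(PUBLISHED.values())        # published pre-miRs over the four viruses
    mapped = sum(TRUE_POSITIVES.values())  # published pre-miRs recovered as hairpins
    print(known, mapped, pct(mapped, known))
    print(pct(24, 25))                     # 24 of the 25 hairpins, as stated in the text
```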
The 25 identified positives report high miPred scores of at least 0.815, except for two, ebv-mir-BHRF1-1 (miPred score of 0.437) and mghv-mir-M1-8 (0.658), indicating that the default cut-off of 0.5 is unlikely to be stringent enough. (Table S7) With the new cut-off set at 0.815, only 92.00% (EBV; 23/35), 60.00% (KSHV; 9/15), 75.00% (MGHV68; 6/8) and 92.73% (HCMV; 51/55) of the previous positives (excluding published pre-miRs) survive as novel putatives. The majority have not yet been discovered (more will arise due to innate evolutionary mutations), suggesting that previous estimates of viral-encoded pre-miRs and miRNAs, especially in EBV and HCMV, may be grossly understated. (Fig. 2b) By carefully mapping the six newly found MGHV68-encoded pre-miRs to the entire genome of MGHV68, the closest relative to human EBV and KSHV (Pfeffer et al., 2005), we observe that p1 overlaps exactly with m6 but is shorter by 3-nts (UUU) at the 3' terminus (see inset for RNA structure). Since the mature miRNA (red region) encoded in m6 was experimentally cloned (Pfeffer et al., 2005), p1 is reassigned as a false-positive. p2 resides immediately downstream of m3 and within a known miRNA cluster of roughly 1.5 kb consisting of m1–7, which are transcribed by RNA Polymerase III (Pol-III) (Pfeffer et al., 2005); this indicates that p2 is likely to be regulated by a similar Pol-III promoter. Known host miRNA transcripts are synthesized from intergenic or intronic regions of annotated transcription units (Rodriguez et al., 2004) by Pol-II with the hallmarks of 5' m7G cap structures and 3' poly(A) tails (Cai et al., 2004; Lee et al., 2004); however, there is emerging evidence of them being transcribed from the exons of protein-coding genes, as in O.sativa (Sunkar et al., 2005). Thus, p3, p4, and p5–6, located in the exons of three proteins, may also undergo processing and nuclear export mechanisms distinct from the host cell's miRNA maturation machinery.
| 4 CONCLUSION |
|---|
|
|
|---|
In this work, we have proposed a de novo SVM classifier model miPred to address specifically the challenges in improving the classification accuracy of existing (quasi) de novo approaches. Our comprehensive analysis reported that it yielded comparable or significantly better predictive performances (in terms of sensitivity and specificity) than existing classifiers for distinguishing non-conserved functional pre-miRs (spanning diverse species) from genomic pseudo hairpins and non pre-miRs (most classes of ncRNAs and mRNAs) with high discriminative accuracy.
Deployment of miPred is likely to translate into considerable savings on the precious and scarce experimental resources devoted to validation, because significantly fewer false-positives need to be tested and we are highly assured that the predicted precursor transcripts would be experimentally confirmed as functional pre-miRs. Recognizing these benefits, which underscore miPred as a potentially invaluable pre-experimental screening tool, we are currently revamping our research prototype into a user-friendly online predictor. As part of our ongoing research, novel and clustered pre-miRs in human, mouse, and viruses are being actively identified. Only with the availability of a comprehensive repository of combined pre-miRs and mature miRNAs will computational mRNA target identification and comprehensive genome annotation be greatly advanced. An expanded repertoire of miRNA genes will signify both a huge opportunity and a technical challenge, as we delve into the functional roles of miRNAs and their interplay with other genetic regulatory networks, biological pathways, and signaling cascades.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
The authors are deeply indebted to the anonymous reviewers (current and ISMB2006) for their generous feedback and constructive suggestions, which have greatly improved the quality of this article and inspired the technical ideas developed in it. Sincere appreciation goes to BII's Clustering Group for their best efforts in ensuring that the three clusters ran smoothly. This work was supported by the Bioinformatics Institute. SNKL received PhD scholarship funds from the Agency for Science, Technology and Research (A*STAR), Singapore.
Authors' contributions: SNKL and SKM conceived the initial ideas. SNKL designed and performed the experiments. SNKL and SKM wrote this manuscript.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Charlie Hodgman
Received on July 1, 2006; revised on December 13, 2006; accepted on January 23, 2007
| REFERENCES |
|---|
|
|
|---|
Adai A, et al. Computational prediction of miRNAs in Arabidopsis thaliana. Genome Res. (2005) 15:78–91.
Ambros V, et al. A uniform system for microRNA annotation. RNA (2003) 9:277–279.
Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell (2004) 116:281–297.
Benson DA, et al. GenBank. Nucleic Acids Res. (2005) 33:D34–D38.
Bentwich I, et al. Identification of hundreds of conserved and nonconserved human microRNAs. Nat. Genet. (2005) 37:766–770.
Berezikov E, et al. Approaches to microRNA discovery. Nat. Genet. (2006) 38(Suppl):S2–S7.
Berezikov E, et al. Phylogenetic shadowing and computational identification of human microRNA genes. Cell (2005) 120:21–24.
Boffelli D, et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science (2003) 299:1391–1394.
Bonnet E, et al. Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. Proc. Natl Acad. Sci. USA (2004a) 101:11511–11516.
Bonnet E, et al. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics (2004b) 20:2911–2917.
Brennecke J, et al. Bantam encodes a developmentally regulated microRNA that controls cell proliferation and regulates the proapoptotic gene hid in Drosophila. Cell (2003) 113:25–36.
Burges CJC. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery (1998) 2:121–167.
Cai X, et al. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA (2004) 10:1957–1966.
Calin GA, Croce CM. MicroRNA-Cancer Connection: the Beginning of a New Tale. Cancer Res. (2006) 66:7390–7394.
Chen X. A MicroRNA as a translational repressor of APETALA2 in Arabidopsis flower development. Science (2004) 303:2022–2025.
Cullen BR. Viruses and microRNAs. Nat. Genet. (2006) 38(Suppl):S25–S30.
Cummins JM, et al. The colorectal microRNAome. Proc. Natl Acad. Sci. USA (2006) 103:3687–3692.
Devor EJ. Primate MicroRNAs miR-220 and miR-492 Lie within processed pseudogenes. J. Hered. (2006) 97:186–190.
Dror G, et al. Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics (2005) 21:897–901.
Duan K, et al. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing (2003) 51:41–59.
Fera D, et al. RAG: RNA-As-Graphs web resource. BMC Bioinformatics (2004) 5:88.
Floyd SK, Bowman JL. Gene regulation: ancient microRNA target sequences in plants. Nature (2004) 428:485–486.
Freyhult E, et al. A comparison of RNA folding measures. BMC Bioinformatics (2005) 6:241.
Gan HH, et al. RAG: RNA-As-Graphs database—concepts, analysis, and features. Bioinformatics (2004) 20:1285–1291.
Griffiths-Jones S, et al. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. (2006) 34:D140–D144.
Griffiths-Jones S, et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. (2005) 33:D121–D124.
Han LY, et al. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA (2004) 10:355–368.
Hertel J, Stadler PF. Hairpins in a Haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics (2006) 22:e197–e202.
Hofacker IL. Vienna RNA secondary structure server. Nucleic Acids Res. (2003) 31:3429–3431.
Isabelle G, Andre E. An introduction to variable and feature selection. J. Mach. Learn. Res. (2003) 3:1157–1182.
Jones-Rhoades MW, Bartel DP. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol. Cell (2004) 14:787–799.
Kim VN. MicroRNA biogenesis: coordinated cropping and dicing. Nat. Rev. Mol. Cell Biol. (2005) 6:376–385.
Lagos-Quintana M, et al. New microRNAs from mouse and human. RNA (2003) 9:175–179.
Lagos-Quintana M, et al. Identification of Novel Genes Coding for Small Expressed RNAs. Science (2001) 294:853–858.
Lai E, et al. Computational identification of Drosophila microRNA genes. Genome Biol. (2003) 4:R42.
Lau NC, et al. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science (2001) 294:858–862.
Lee RC, Ambros V. An extensive class of small RNAs in Caenorhabditis elegans. Science (2001) 294:862–864.
Lee RC, et al. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell (1993) 75:843–854.
Lee Y, et al. MicroRNA genes are transcribed by RNA polymerase II. EMBO J. (2004) 23:4051–4060.
Lim LP, et al. Vertebrate MicroRNA genes. Science (2003a) 299:1540.
Lim LP, et al. The microRNAs of Caenorhabditis elegans. Genes Dev. (2003b) 17:991–1008.
Liu J, et al. Distinguishing Protein-Coding from Non-Coding RNAs through support vector machines. PLoS Genet. (2006) 2:e29.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. (1997) 25:955–964.
Lu J, et al. MicroRNA expression profiles classify human cancers. Nature (2005) 435:834–838.
McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. (2004) 32:W20–W25.
Miranda KC, et al. A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell (2006) 126:1203–1217.
Moulton V, et al. Metrics on RNA secondary structures. J. Comp. Biol. (2000) 7:277–292.
Nam JW, et al. Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic Acids Res. (2005) 33:3570–3581.
Ng KLS, Mishra SK. Unique folding of precursor microRNAs: quantitative evidence and implications for de novo identification. RNA (2007) 13:170–187.
Palatnik JF, et al. Control of leaf morphogenesis by microRNAs. Nature (2003) 425:257–263.
Pasquinelli AE, et al. Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature (2000) 408:86–89.
Pervouchine DD, et al. On the normalization of RNA equilibrium free energy to the length of the sequence. Nucleic Acids Res. (2003) 31:e49.
Pfeffer S, et al. Identification of microRNAs of the herpesvirus family. Nat. Methods (2005) 2:269–276.
Pruitt KD, Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. (2001) 29:137–140.
Rebeiz M, Posakony JW. GenePalette: a universal software tool for genome sequence visualization and analysis. Dev. Biol. (2004) 271:431–438.
Reinhart BJ, et al. The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature (2000) 403:901–906.
Rodriguez A, et al. Identification of mammalian microRNA host genes and transcription units. Genome Res. (2004) 14:1902–1910.
Sarnow P, et al. MicroRNAs: expression, avoidance and subversion by vertebrate viruses. Nat. Rev. Microbiol. (2006) 4:651–659.
Schultes EA, et al. Estimating the contributions of selection and self-organization in RNA secondary structure. J. Mol. Evol. (1999) 49:76–83.
Seffens W, Digby D. mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res. (1999) 27:1578–1584.
Smalheiser NR, Torvik VI. Mammalian microRNAs derived from genomic repeats. Trends Genet. (2005) 21:322–326.
Sprinzl M, Vassilenko KS. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. (2005) 33:D139–D140.
Sullivan CS, et al. SV40-encoded microRNAs regulate viral gene expression and reduce susceptibility to cytotoxic T cells. Nature (2005) 435:682–686.
Sunkar R, et al. Cloning and characterization of MicroRNAs from rice. Plant Cell (2005) 17:1397–1411.
Wang X, et al. MicroRNA identification based on sequence and structure alignment. Bioinformatics (2005) 21:3610–3614.
Weinstein LB, Steitz JA. Guided tours: from precursor snoRNA to functional snoRNP. Curr. Opin. Cell Biol. (1999) 11:378–384.
Winkler WC, Breaker RR. Genetic control by metabolite-binding riboswitches. Chembiochem. (2003) 4:1024–1032.
Xu P, et al. The Drosophila MicroRNA Mir-14 suppresses cell death and is required for normal fat metabolism. Curr. Biol. (2003) 13:790–795.
Xue C, et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics (2005) 6:310.
Yang JH, et al. Snoseeker: an advanced computational package for screening of guide and orphan snoRNA genes in the human genome. Nucleic Acids Res. (2006) gkl672.
Yousef M, et al. Combining multi-species genomic data for microRNA identification using a naive bayes classifier. Bioinformatics (2006) 22:1325–1334.
Zhang B, et al. Evidence that miRNAs are different from other RNAs. Cell. Mol. Life Sci. (2006a) 63:246–254.
Zhang B, et al. Plant microRNA: A small regulatory molecule with big impact. Dev. Biol. (2006b) 289:3–16.