Bioinformatics Advance Access originally published online on March 22, 2005
Bioinformatics 2005 21(11):2611-2617; doi:10.1093/bioinformatics/bti385
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures
Keio University, Department of Biosciences and Informatics 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Since the whole genome sequences of many species have been determined, computational prediction of RNA secondary structures and computational identification of those non-coding RNA regions by comparative genomics become important. Therefore, more advanced alignment methods are required. Recently, an approach of structural alignment for RNA sequences has been introduced to solve these problems. Pair hidden Markov models on tree structures (PHMMTSs) proposed by Sakakibara are efficient automata-theoretic models for structural alignment of RNA secondary structures, although PHMMTSs are incapable of handling pseudoknots. On the other hand, tree adjoining grammars (TAGs), a subclass of context-sensitive grammars, are suitable for modeling pseudoknots. Our goal is to extend PHMMTSs by incorporating TAGs to be able to handle pseudoknots.
Results: We propose pair stochastic TAGs (PSTAGs) for aligning and predicting RNA secondary structures including a simple type of pseudoknot which can represent most known pseudoknot structures. First, we extend PHMMTSs defined on alignment of trees to PSTAGs defined on alignment of TAG trees which represent derivation processes of TAGs and are functionally equivalent to derived trees of TAGs. Then, we develop an efficient dynamic programming algorithm of PSTAGs for obtaining an optimal structural alignment including pseudoknots. We implement the PSTAG algorithm and demonstrate the properties of the algorithm by using it to align and predict several small pseudoknot structures. We believe that our implemented program based on PSTAGs is the first grammar-based and practically executable software for comparative analyses of RNA pseudoknot structures, and, further, non-coding RNAs.
Availability: The source code of PSTAG and its web application are available at http://phmmts.dna.bio.keio.ac.jp/pstag/
Contact: yasu{at}bio.keio.ac.jp
| 1 INTRODUCTION |
|---|
|
|
|---|
Secondary structures including pseudoknots of non-coding RNA molecules play important roles for their own functions such as catalytic functions (Dam et al., 1992). Thus, the computational prediction of pseudoknot RNA structures from primary RNA sequences has become an active research area in bioinformatics, and further there are several theoretical or heuristic works to predict pseudoknot RNA structures such as by maximizing stacking base pairs or free energy minimizations (Abrahams et al., 1990; Cary and Stormo, 1995; Gultyaev et al., 1995; van Batenburg et al., 1995; Rivas and Eddy, 1999; Lyngsø and Pedersen, 2000; Ieong et al., 2003; Ruan et al., 2004). On the other hand, since the whole genome sequences of many species have been determined, computational identification of non-coding RNA regions by comparative genomics become important. Therefore, more advanced methods such as precise algorithms for database search are required for detecting non-coding RNA regions.
Recently, Sakakibara (2003) proposed pair hidden Markov models on tree structures (PHMMTSs) which is an extension of pair HMMs. The PHMMTSs are defined on alignment of trees based on stochastic context-free grammars (SCFGs), and applied to the problem of structural alignment of RNA secondary structures. The approach of structural alignment is to calculate a pairwise alignment to align an unfolded RNA sequence into a folded RNA sequence of known secondary structure (as illustrated in Fig. 1). Thus, an unfolded RNA sequence will be folded by the structural alignment into a single folded RNA sequence, and hence the structural alignment is clearly different from the usual pairwise alignment only based on sequence homology. Two important features of structural alignment are (1) to predict secondary structures for primary RNA sequences and (2) to detect non-coding RNA regions with more sensitivity than sequence homology. The second feature is obviously an advantage compared with conventional methods which can only predict RNA secondary structures.
|
However, PHMMTSs are incapable of handling pseudoknots because modeling pseudoknot RNA structures is beyond the generative power of context-free grammars, thus inevitably involves in the hard complexity of context sensitivity.
In this paper, we propose a novel method for structural alignment to align and predict RNA secondary structures including pseudoknots. For modeling pseudoknot RNA structures, we first employ special subclasses of tree adjoining grammars (TAGs), which are more generative than context-free grammars but less than context-sensitive grammars. Second, we extend PHMMTSs defined on the alignment of trees to pair stochastic TAGs (PSTAGs) defined on the alignment of TAG trees which can represent the derivation process of TAGs for pseudoknot RNA structures. Thus, by combining it with a parsing algorithm for TAGs, we can solve the alignment problem of TAG trees by an efficient dynamic programming algorithm of PSTAGs for obtaining an optimal structural alignment including pseudoknots.
| 2 TAGs FOR PSEUDOKNOT STRUCTURES |
|---|
|
|
|---|
For modeling pseudoknot RNA structures, we employ special subclasses of TAGs, which were introduced to study pseudoknot structures by Uemura et al. (1999). In this section, we briefly describe TAGs and the two subclasses, simple linear TAGs (SL-TAGs) and extended simple linear TAGs (ESL-TAGs). For more details, refer to Uemura et al. (1999).
Let t be a tree which is a rooted indirected acyclic graph, each node of which is labeled with X
VN
VT
{
}, where VN is a finite set of non-terminal symbols, VT is a finite set of terminal symbols and
is the empty sequence. Nodes in a tree t of the size m are numbered from 1 to m according to the preorder where the root node of t is numbered 1. By t(p) = A, we denote that the node p of t is labeled with A
VN
VT. A yield of a tree t, which is denoted as
, is defined as a concatenating sequence of labels at leaf nodes of t traced from left to right.
A TAG, introduced by Joshi et al. (1975) is specified by a 5-tuple
, where S
VN is an initial symbol,
is a finite set of initial trees and
is a finite set of adjunct trees.
and
must satisfy the following conditions:
- if
, then
(1) = S and
;
- if
, then ß(1) = X
VN and
,
denotes a set of all finite sequences over VT. A foot node of an adjunct tree
is the node which has a label of X
VN in
. Each path of an adjunct tree from the root node to a foot node is called a backbone. All of the initial trees and adjunct trees are referred to as elementary trees. Then, adjoining operations over trees in TAGs are defined as follows. Let
be a tree such that
(p) = X
VN and ß be an adjunct tree such that ß(1) = X, and one of the foot nodes is also labeled X. An adjoining operation is to derive
' from
and ß such that ß is adjoining
at the node p as shown in Figure 2. We call
' a derived tree from
. A node p of
with a label X
VN is active if and only if there exists
which can adjoin
at p such as the node indicated by * in Figure 2.
|
Uemura et al. (1999) introduced two subclasses of TAGs, SL-TAGs and ESL-TAGs, and parsing algorithms of them which run in time O(n4) and O(n5), respectively, for an input sequence of the length n.
An initial tree tinitial is simple linear if tinitial has one active node exactly. Similarly, an adjunct tree tadjunct is simple linear if tadjunct has one active node on its backbone exactly. Then, a TAG G is a simple linear TAG if all of the elementary trees in G are simple linear. An adjunct tree tadjunct is semi-simple linear if tadjunct has two active nodes exactly, where one is on its backbone and the other is elsewhere. Then, a TAG G is an extended simple linear TAG if initial trees in G are simple linear and all of the adjunct trees in G are either simple linear or semi-simple linear.
Let ß be a simple linear adjunct tree such that ß(1)=X,
and q is the active node of ß labeled with Y*, and the yield of a subtree of ß rooted at the node q be ai'
aiX ai+1
aj' for i', j' (1
i'
i, i + 1
j'
j). We decompose
into four subsequences as LU(ß) = a1
ai'1, LD(ß) = ai'
ai, RD(ß) = ai+1
aj' and RU(ß) = aj'+1
aj, as shown in Figure 3. For any sequence
, we denote the length of w as |w|. Note that for empty sequence
, |
| = 0.
|
For representing RNA secondary structures including pseudoknots, we define a special case of ESL-TAGs, denoted by
), in which VT = {A,C, G, U} for representing four kinds of nucleotides and VN = {S}, that is, S is the only non-terminal symbol. GRNA uses the following forms of elementary trees: a form for initial trees of TYPE-1 and forms for adjunct trees of TYPE-2, 3, 4 and 5 as shown in Figure 4. The forms of TYPE-2 and TYPE-3 are used for generating base pairs, the form of TYPE-4 is used for generating unpaired bases and the form of TYPE-5 is used for representing branching structures. For instance, Figure 5. illustrates a derivation process to produce a pseudoknot RNA secondary structure for (A(G[AC)U)U] by GRNA, where two kinds of parentheses, and [ ], indicate base pairs.
|
|
Let us consider a secondary structure T of an RNA sequence
, which is a set of base pairs (ai, aj) such that 1
i < j
n. Then, we say that (ai, aj) and (ak, al) in T is crossing if and only if either i < k < j < l or k < i < l < j. A secondary structure T has m-crossing property if and only if there exists a subset T' of T with |T' = m
2 such that any pair of (ai, aj) and (ak, al) in T' is crossing, where |T| is the number of base pairs included in T. ESL-TAGs have the ability to generate a simple type of pseudoknot RNA structures with exactly 2-crossing property, which can represent most known pseudoknot structures but cannot represent all of them. | 3 PSTAGs |
|---|
|
|
|---|
In this section, we propose PSTAGs for aligning and predicting RNA structures including pseudoknots.
3.1 TAG trees
We introduce a TAG tree which represents the derivation process of TAGs, that is, what order of adjoining trees could be adjoining to induce a derived tree whose yield is an input sequence. Each node of TAG trees is labeled with an adjunct tree, and each edge means an adjoining operation on an active node at its parent. The root node of a TAG tree is labeled with an adjunct tree which adjoins an initial tree. Figure 6(a) and (b) illustrate some examples of TAG trees for parsing two structured sequences, (a(bB)A)(d[eD)E] and (bB) d (fp).
|
TAG trees have two important properties: (1) every TAG tree has a one-to-one correspondence to the derived tree, and (2) the set of TAG trees of an ESL-TAG can be recognized by a tree automaton, and therefore, we can extend tree automata to TAG tree automata defined on TAG trees.
An alignment for a pair of trees is obtained by inserting some null nodes labeled with
into each other such that two resulting trees have the same topology. Since each node of TAG trees represents an adjunct tree, an alignment of two TAG trees requires matches between adjunct trees. In the case of TYPE-4 and TYPE-5, two adjunct trees of exactly the same form can be matched. On the other hand, adjunct trees of TYPE-2 and TYPE-3 are allowed to be matched as shown in Table 1.
|
For instance, Figure 6 illustrates that two TAG trees (a) and (b) are aligned into an alignment of TAG trees (c) which corresponds to a derived tree (d). First, since both nodes p1 and q1 in the TAG tree (a) and (b) are of the same form
, they are matched into the node (p1, q1) in the aligned TAG tree (c). Similarly, either node p2 or node p3 will be matched with q2, and a null node is inserted to make both TAG trees of the same topology. The nodes p4 of the form
and q3 of the form
are aligned into the node (p4, q3) of the form
. Then, the nodes p5 of the form
and q4 of the form
cannot be matched, and thus null nodes are inserted. Consequently, the resulting structural alignment is obtained as below:
![]() |
3.2 Algorithm of PSTAGs
Given an RNA sequence with annotations of secondary structures including pseudoknots, where the annotations are usually given by parentheses, a TAG tree is obtained by parsing the annotated RNA sequence with the ESL-TAG GRNA. We call it a skeletal tree.
Let
be an unfolded RNA sequence of the length n, T be a skeletal tree of the size m representing a folded RNA sequence of known pseudoknot structure, T[q] (1
q
m) be the subtree of T rooted at the node q and w[i,j,k,l] (0
i
j
k
l
n) be two subsequences ai+1
aj and ak+1
al of w. We denote the children of q as q1 and q2.
Then, in order to calculate an optimal structural alignment between an unfolded RNA sequence w and a skeletal tree T for a folded RNA sequence, we present recurrence equations based on the affine gap model with three states: match states (M), insertion states (I) and deletion states (D).
![]() |
is a set of simple-linear adjunct trees of TYPE-2, TYPE-3 and TYPE-4 defined in Figure 4,
![]() |
is a set of simple-linear adjunct trees of TYPE-4 defined in Figure 4,
![]() |
XY for X, Y
{M, I, D} denotes the probability of state transition from the state X to the state Y,
denotes the probability of emission for a pair of adjunct trees
and ß at the state X, vq denotes an adjunct tree labeled at a node q in the tree T, and
denotes the empty tree. A state transition diagram among three states M, I and D for affine gap alignment is given in Figure 7. An optimal structural alignment between a sequence w and a skeletal tree T is obtained by calculating
![]() |
M,
I,
D.
|
Our implementation of PSTAG employs a non-stochastic score matrix proposed by Gorodkin et al. (1997) instead of the full stochastic model described above, because each score in Gorodkin's matrix is essentially identical to the probabilistic log-odds score approximated by their round number.
An efficient algorithm for calculating the above recurrence equations can be implemented by using dynamic programming techniques. The computational complexity of executing PSTAGs for structural alignment is the same order as that of parsing an input sequence with ESL-TAGs theoretically. More precisely, time complexity to run a PSTAG for an input pair of an unfolded sequence of the length N and a skeletal tree of the size M with m branch nodes and n other nodes (M = m + n) is O(KnN4 + KmN5), where K is the number of states in the PSTAG (K = 3 for the affine gap model), and space complexity is O(KMN4).
| 4 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
To confirm our method, we performed some experiments using a certain RNA family in the database. We first randomly chose an RNA sequence annotated with a known pseudoknot structure and parsed it into a skeletal tree, then aligned all the other unfolded RNA sequences in the family into the selected folded skeletal tree without using annotations of them. We evaluated the results of our experiments by specificity and sensitivity, that is, the rate of correctly predicted base pairs by the method to all predicted base pairs, and the rate of correctly predicted base pairs to all of the trusted base pairs in the database, respectively. Further, in order to remove the dependency of the prediction results on the selected folded RNA sequence, we performed cross-validation and calculated the average for all cases.
The datasets used in our experiments were taken from RNA families database Rfam at Sanger Institute (Griffiths-Jones et al., 2003; http://www.sanger.ac.uk/Software/Rfam/) and a collection of RNA pseudoknots PseudoBase at Leiden University (van Batenburg et al., 2000; http://wwwbio.leidenuniv.nl/~Batenburg/PKB.html). RNA sequences in Rfam are aligned and annotated with secondary structures by using the covariance model (CM) method (Eddy and Durbin, 1994). Among 176 RNA families in Rfam (version 5.0), 7 RNA families have pseudoknot annotations which are unreliable because CM is based on profile SCFGs for modeling RNA sequences which cannot deal with pseudoknots. On the other hand, the annotations of pseudoknot RNA structures in PseudoBase are biologically reliable.
First, we evaluated the accuracy of predicting base pairs by the PSTAG algorithm for three RNA families,
,
and
, which have pseudoknot annotations in Rfam.
and
constitute simple pseudoknot structures which can be analyzed by an SL-TAG, whereas
has one branching secondary structure involving a pseudoknot which requires an ESL-TAG. The results in Table 2 show that PSTAG can predict accurate structural alignments for all three RNA families.
|
Second, we compared the prediction accuracy of the PSTAG algorithm with that of PHMMTS and of the standard alignment software Clustal-W (Thompson et al., 1994; http://www.ebi.ac.uk/clustalw/) by an RNA of
in PseudoBase with reliable annotations about pseudoknot structures. In this experiment, PHMMTS ignores annotations of some stacked base pairs with crossing dependency due to the lack of generative power for pseudoknots. Similarly, Clustal-W ignores any structural annotations due to lack of generative power for secondary structures. Each row of Table 3 shows the accuracy of predicting base pairs by PSTAG, PHMMTS and Clustal-W, respectively. Figure 8 shows the detailed comparison among the three methods, in which correctly predicted structures by each method are indicated by the mark _. Obviously, PSTAG succeeded in predicting both ( ) and [ ] base pairs, PHMMTS can predict only "( )" base pairs and Clustal-W can predict a few structural annotations. These results indicate that more grammatically powerful the method used, the more accurate the predictions obtained. However, a more grammatically powerful method would consume lager CPU time and memory space generally.
|
|
In the third experiment, we structurally re-aligned all the RNA sequences of the
family in Rfam, which are unreliable regarding pseudoknots, into reliable pseudoknot structures of
in PseudoBase by PSTAG. As a result, PSTAG significantly improved about 25% base pairs in Rfam for
, which are undesirable in comparison to PseudoBase. For example, there are some significant differences between the annotation in Rfam and prediction by PSTAG as shown in Figure 9, where some undesirable base pairs, indicated with the mark ^, are annotated in Rfam. Therefore, PSTAG can predict more stable secondary structures on these undesirable base pairs than the annotations in Rfam.
|
In addition, the predictions by PSTAG have some suggestion of a new structure, which constitutes an additional internal loop in the 3'-end of
, as indicated with the mark _ in Figure 9 and also indicated with the arrow
in Figure 10.
|
| 5 RELATED WORK |
|---|
|
|
|---|
There does not exist any other structural alignment approach to align and predict pseudoknot RNA structures.
In non-comparative approaches, there are several theoretical or heuristic works to predict pseudoknot RNA structures for a single RNA sequence by maximizing stacking base pairs or free energy minimizations (Abrahams et al., 1990; Cary and Stormo, 1995; Gultyaev et al., 1995; van Batenburg et al., 1995; Rivas and Eddy, 1999; Lyngsø and Pedersen, 2000; Ieong et al., 2003; Ruan et al., 2004). Ruan et al. (2004) recently proposed a simple but effective heuristic method, called iterated loop matching (ILM), for predicting pseudoknot structures, and showed high performance results compared with other existing methods. Although their approach is completely different from ours, we compared PSTAG with ILM to confirm the effectiveness of our approach. Table 4 shows comparisons between ILM and PSTAG in predicting pseudoknot structures for
and a tobamovirus
. In this experiment, PSTAG aligned unfolded RNA sequences of
in Rfam into a folded RNA sequence of
with structural annotations in PseudoBase, and similarly, aligned unfolded sequences of
in Rfam into a folded RNA sequence of sunn-hemp mosaic virus
whose structure has been determined by van Belkum et al. (1985). Note that sequence homology between the sequences of
in Rfam and the selected sequence of
in PseudoBase is 65.1% on average and that between
in Rfam and
is only 26.0%. This result exhibits comparable performances with ILM for prediction accuracy of pseudoknot structures, and further suggests that structural alignment by PSTAG does not require so much sequence homology between an unfolded sequence and a folded sequence.
|
Another important feature of our approach is searching and detecting non-coding RNA regions on genome. Klein and Eddy (2003) have shown an interesting direction to search non-coding RNA regions in a structural alignment approach based on SCFGs.They have developed a local alignment program, called RSEARCH, to search a database for finding structurally homologous RNA sequences, and compared performances with a well-known BLAST program. Our approach using PSTAG enables us to develop a more accurate database search method which takes a type of pseudoknot RNA structures into account.
Received on December 7, 2004; revised on March 7, 2005; accepted on March 8, 2005
| REFERENCES |
|---|
|
|
|---|
Abrahams, J.P., et al. (1990) Prediction of RNA secondary structure, including pseudoknotting, by computer simulation. Nucleic Acids Res., 18, 30353044
Cary, R.B. and Stormo, G.D. (1995) Graph-theoretic approach to RNA modeling using comparative data. Proceedings of the 3rd International Conference on Intelligent Systems for Molecular Biology , Robinson College, Cambridge American Association for Artificial Intelligence Press, pp. 7580.
Dam, E., et al. (1992) Structural and functional aspects of RNA pseudoknots. Biochemistry, 31, 1166511676[CrossRef][Medline].
Eddy, S.R. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 20792088
Gorodkin, J., et al. (1997) Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res., 25, 37243732
Griffiths-Jones, S., et al. (2003) Rfam: an RNA family database. Nucleic Acid Res., 31, 439441
Gultyaev, A.P., et al. (1995) The computer simulation of RNA folding pathways using a genetic algorithm. J. Mol. Biol., 250, 3751[CrossRef][ISI][Medline].
Ieong, S., et al. (2003) Predicting RNA secondary structures with arbitrary pseudoknots by maximizing the number of stacking pairs. J. Computat. Biol., 10, 981995.
Joshi, A.K., et al. (1975) Tree adjunct grammars. J. Comput. Syst. Sci., 10, 136163.
Klein, R.J. and Eddy, S.R. (2003) RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinform., 4, 44[CrossRef][Medline].
Lyngsø, R.B. and Pedersen, C.N.S. (2000) RNA pseudoknot prediction in energy-based models. J. Computat. Biol., 7, 409427.
Rivas, E. and Eddy, S.R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285, 20532068[CrossRef][ISI][Medline].
Ruan, J., et al. (2004) An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics, 20, 5866
Sakakibara, Y. (2003) Pair hidden Markov models on tree structures. Bioinformatics, 19, i232i240[Abstract].
Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 46734680
Uemura, Y., et al. (1999) Tree adjoining grammars for RNA structure prediction. Theoret. Comput. Sci., 210, 277303[CrossRef].
van Batenburg, F.H.D., et al. (1995) An APL-programmed genetic algorithm for the prediction of RNA secondary structure. J. Theoret. Biol., 174, 269280[CrossRef][ISI][Medline].
van Batenburg, F.H.D., et al. (2000) PseudoBase: a database with RNA pseudoknots. Nucleic Acids Res., 28, 201204
van Belkum, A., et al. (1985) Five pseudoknots are present at the 204 nucleotides long 3' noncoding region of tobacco mosaic virus RNA. Nucleic Acids Res., 13, 76737686
This article has been cited by other articles:
![]() |
K. Asai, H. Kiryu, M. Hamada, Y. Tabei, K. Sato, H. Matsui, Y. Sakakibara, G. Terai, and T. Mituyama Software.ncrna.org: web servers for analyses of RNA sequences Nucleic Acids Res., July 1, 2008; 36(suppl_2): W75 - W78. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Cai, B. Hartnett, C. Gustafsson, and J. Peccoud A syntactic model to design and verify synthetic genetic constructs derived from standard biological parts Bioinformatics, October 15, 2007; 23(20): 2760 - 2767. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














