Bioinformatics Advance Access originally published online on March 9, 2006
Bioinformatics 2006 22(11):1302-1307; doi:10.1093/bioinformatics/btl088
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A generalization of the PST algorithm: modeling the sparse nature of protein sequences
Instituto de Matemática e Estatística, Universidade de São Paulo Rua do Matão 1010 CEP 05508-090, São Paulo, Brazil
| ABSTRACT |
|---|
|
|
|---|
Motivation: A central problem in genomics is to determine the function of a protein using the information contained in its amino acid sequence. Variable length Markov chains (VLMC) are a promising class of models that can effectively classify proteins into families and they can be estimated in linear time and space.
Results: We introduce a new algorithm, called Sparse Probabilistic Suffix Trees (SPST), that identifies equivalences between the contexts of a VLMC. We show that, in many cases, the identification of these equivalences can improve the classification rate of the classical Probabilistic Suffix Trees (PST) algorithm. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains, and this variation in the SPST algorithm is called F-SPST.
Availability: The SPST algorithm can be freely downloaded from the site http://www.ime.usp.br/~leonardi/spst/
Contact: leonardi{at}ime.usp.br
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Efficient family classification of newly discovered protein sequences is a central problem in bioinformatics (Rust et al., 2002). Nowadays, the most popular methods for generating a hypothesis about the function of a protein are BLAST and hidden Markov models (HMM). These methods, although very useful, are time-consuming considering the actual size of the databases. Recently, Bejerano and Yona (2001) applied VLMC in protein classification. They showed that variable length Markov chains (VLMC) detects much more related sequences than Gapped-BLAST and it is almost as sensitive as HMM. A major advantage of VLMC is that it does not depend on multiple alignments of the input sequences and it can be estimated in linear time and space (Apostolico and Bejerano, 2000).
In the present work we affront the problem of protein family classification creating a model for each protein family
and testing if each query sequence belongs to
or not. The model is a variation of VLMC, called sparse Markov chains (SMC). These two models are a class of stochastic processes that relies on the estimation of conditional probabilities of symbols in a sequence given the preceding symbols. The remarkable property of VLMC is that the number of symbols in the conditional sequence is variable, and depends on the symbols involved. This property attempts to capture variable dependencies in different configurations of amino acids in a protein family. A VLMC model can be represented by a tree in which each branch corresponds to a conditional sequence of symbols. For example, a VLMC tree for a set of sequences in the binary alphabet {0, 1} is illustrated in Figure 1. The conditional probabilities are associated to each branch, indicating the probability of the next symbol in a query sequence given that the preceding symbols match the sequence in the branch. The property of variable length in the conditional sequences is maintained in SMC. The difference between SMC and VLMC is that in SMC the conditional sequences are given by sequences of subsets of amino acids. This property attempts to identify which positions are more relevant than others in the configurations of amino acids that define the conditional sequences. The sequences composed by subsets of amino acids are called sparse.
|
VLMC was first introduced in Rissanen (1983) as an universal model for data compression. Rissanen also introduced an estimation algorithm for this class of models called Context Algorithm (CA). It was proven in Bülhmann and Wyner (1999) that this algorithm is consistent. This means that if we suppose that the data were generated with a VLMC, with a sufficiently large sample the CA will find the real model. PST is another algorithm to estimate a VLMC (Ron et al., 1996), and it has the advantage of being computationally more economical than the CA (Apostolico and Bejerano, 2000). The PST algorithm was successfully used in protein classification, even though its performance decreases with less conserved families (Bejerano and Yona, 2001). For that reason some attempts were made to use VLMC related models for sparse sequences (Eskin et al., 2000; Bourguignon and Nicolas, 2004). Although very attractive, these two methods have the major disadvantage of having computationally very expensive algorithms.
In this work we introduce a variation in the PST algorithm to estimate an SMC for sparse sequences, called SPST. Another variation is introduced to improve the classification of PST and SPST, and it is called F-SPST. The paper is organized as follows. In Section 2 the formal definitions of VLMC and SMC and a simple example are given. In Section 3 we introduce the two variations in the PST algorithm and we explain how to classify protein sequences using these algorithms. In Section 4 we present the experimental results and in Section 5 we discuss these results and some ideas for future work.
| 2 THEORY |
|---|
|
|
|---|
Let
be a finite alphabet (e.g. the alphabet of twenty amino acids for protein sequences, or four nucleotides for DNA sequences). Let (Xn)n
be a stationary stochastic process with values Xn
. We say that the process (Xn) is a variable length Markov chain (VLMC) if
![]() |
is a function of the sequence x0, ... , xn1; i.e.
. This function is called length function and the finite sequences (xn
, ... , xn1) are called contexts. As we are assuming a time homogeneous process, the length function does not depend on n, so we simply denote the contexts by (x
, ... , x1).
In a VLMC the set of contexts has the suffix property: this means that no context is a suffix of another context. This makes it possible to define without ambiguity the probability distribution of the next symbol. The suffix property also permits to represent the set of contexts as a tree. In this tree, each context (x
, ... , x1) is represented by a complete branch, in which the first node on top is x1 and so on until the last element x
which is represented by the terminal node of the branch. For example, in Figure 1 the set of contexts is {00, 10, 1}. A complete description of VLMC can be found in Bülhmann and Wyner (1999).
We introduce here the definition of an SMC. An SMC is a VLMC in which some contexts can be grouped together into an equivalence class. In a SMC the probability transitions are given by
![]() |
for all i = n
, ... , n1. This relation induces a partition of the set of contexts of the corresponding VLMC. Then, the contexts in an SMC are given by the equivalence classes denoted by the sequences of subsets (A
, ... , A1). A tree representation for this type of contexts can be seen in Figure 2. To each node of the tree is associated a conditional probability distribution over the next symbol given the context represented by the node. For example, if we want to compute the probability of the sequence 01001 in the model given by Figure 2 we have P(01001) = P(0) x P(1|0) x P(0|01) x P(0|10) x P(1|00) = 9/16 x 7/16 x 3/10 x 9/10 x 7/10.
|
In this work we propose a variation in the PST algorithm to identify the equivalence classes given by the sparse contexts (A
, ... , A1). This algorithm is presented in the next section. | 3 ALGORITHM |
|---|
|
|
|---|
Let p1, p2, ... , pm be the sample sequences belonging to a protein family
. For each i = 1, ... , m we denote
, where ni is the length of sequence pi.
Given a sparse context w = (A
, ... , A1) and a symbol a
, we denote by N(w, a) the number of occurrences of context w followed by symbol a in the sample set. That is,
![]() |
, or 0 otherwise. We also define
![]() |
Given a sparse context w = (A
, ... , A1) and symbols a1, a2
we denote by a1w and a2w the sparse contexts ({a1}, A
, ... , A1) and ({a2}, A
, ... , A1), respectively. We also denote by [a1, a2]w the sparse context ({a1, a2}, A
, ... , A1). Using this notation we define the operator
w (a1, a2) as the logarithm of the ratio between the estimated probability of the sequences in the model that has the contexts a1w and a2w as equivalent and the model that distinguishes the two contexts as different. That is,
![]() |
Using the preceding definitions we can specify how the SPST algorithm works. The free parameters that must be specified by the user are the maximum depth of the tree, Lmax, the minimum number of times that a sparse context has to be seen in the sample to be considered, Nmin, and a cutoff parameter that establishes the equivalence between two contexts, rmax. The SPST algorithm works as follows. It starts with a tree T consisting of a single root node. At each step, for every terminal node in T labeled by a sparse sequence w with depth less than L and for every symbol a
, the child a is added to node w if N(aw,·)
Nmin. Then, for every pair of children of w, a1 and a2, we test the equivalence of the contexts a1w and a2w using the
operator. That is, we compute
w(a1, a2) for every pair of symbols (a1, a2)
added to node w, and choose the minimum between all the pairs. If this minimum is smaller than rmax, the corresponding nodes are merged together into a single node. This procedure is iterated with the new set of children of w until no more nodes can be merged. Taking the minimum of
w(a1, a2) between all the possible pairs (a1, a2) ensures the independence of the order in which the tests are performed. To conclude the construction of the tree we assign to each node a transition probability distribution. This distribution gives the probability of a symbol in
given the sparse context between the node and the root of the tree. The transition probabilities are estimated using the maximum likelihood estimates. That is, given a sparse context w = (A
, ... , A1), the estimated probability of a symbol a
given the context w is given by
![]() |
Given a sparse context w = (A
, ... , A1) we denote by w' the sparse context (A
+1, ... , A1) and by |w| the length of context w. Summarizing the steps of the SPST algorithm, we have the following:
SPST-Algorithm (Nmin, rmax, Lmax)
- Initialization: let T be a tree consisting of a single root node, and let

- Iteration: while
do- Remove u of
and add u to T. Then remove all sparse contexts
such that w' = u' and add them to T. Note that the context w is of the form aiu' for some symbol ai
. Let C denote the set of contexts added to T in this step.
- Compute
and

- If r < rmax merge
and
in a single node. Replace the contexts
and
in C by the context
.
- Repeat steps (b) and (c) until no more changes can be made in C.
- For each sparse context w
C, if |w| < Lmax then add the set {aw : a
and N (aw, · )
Nmin} (if any) to
.
- Remove u of
- Estimation of the transition probabilities: assign to each node in T, associated with a sparse context w, the transition probability distribution over
given by 
3.1 Application of the SPST algorithm to classify protein sequences
Given a protein family
and a new sequence of amino acids p = p1, ... , pn, we want to know if p belongs to
or not. To answer this question we first construct a model for the family
using the sequences already classified into the family. Then we compute a score, and depending on this value we classify the sequence p as belonging to the family
or not. The model constructed for the family
is an SMC model obtained by estimating the sparse contexts and the transition probabilities given by the SPST algorithm, as explained above. There is no restriction about the minimum number of sequences needed to estimate an SMC, because SPST will only estimate the next symbol probability conditioned on sequences that appear a minimum number of times in the sample. But it is also true that with a very small family the estimation of the probability distributions could be very poor. For that reason we only test the SPST algorithm with families containing more than 10 sequences, as explained in the next section.
Given a query sequence p, its score in the estimated SMC model for
is given by
![]() |
is the smoothed probability distribution derived from
. That is
![]() |
min is a smoothing parameter to avoid zero probabilities, and therefore, a
score.
Sometimes the region of high similarity between the sequences in a protein family is considerably smaller than the length of the sequences. This is because a protein sequence can be composed by several domains, performing different functions in the cell. Then, computing the score S over the entire sequence p may not be appropriate. For this reason we propose a change in the computation of the score S and called it S'. In this case we fix an integer M, and for sequences with length n > M we compute the score S'(p) by
![]() |
M, the score is computed using S as before. The algorithm that implements the score S' is called F-SPST. | 4 IMPLEMENTATION |
|---|
|
|
|---|
In order to test our algorithm and to compare it with PST published results (Bejerano and Yona, 2001) we use protein families of the Pfam database (Bateman et al., 2004) release 1.0. This database contains 175 families derived from the Swiss-Prot 33 database (Boeckmann et al., 2004). We selected the first 50 families (in alphabetical order) that contain more than 10 sequences. For each family in this set, we trained an SMC model with the SPST algorithm. We used four-fifths of the sequences in each family for training, and then we applied the resulting model to classify all the sequences in the Swiss-Prot 33 database. To establish the family membership threshold we used the equivalence number criterion (Pearson, 1995). This method sets the threshold at the point where the number of false positives (the number of non-member proteins with score above the threshold) equals the number of false negatives (the number of member proteins with score below the threshold), i.e. it is the point of balance between selectivity and sensitivity. A member protein that scores above the threshold (true positive) is considered successfully detected. The quality of the model is measured by the percentage of true positives detected with respect to the total number of proteins in the family.
Table 1 shows the classification rates obtained with the SPST and F-SPST algorithms, together with the published results obtained with the PST algorithm. We also show the number of false positives for each algorithm. Because of the way of establishing the family membership threshold, the percentage of false positives is equal to 100% minus the percentage of true positives (with respect to the total number of sequences in the family). For example, in the case of family 7tm_1, the percentage of true positives detected by the F-SPST algorithm is 97.7%, so the percentage of false positives is 2.3%. This gives 12 sequences erroneously classified as members of the 7tm_1 family. Figure 3 summarizes the classification rates of Table 1 in two scatter-plots.
|
|
It was shown in Bejerano and Yona (2001) that the performance of the PST algorithm can be improved increasing the number of nodes in the tree. In the case of SPST and F-SPST, this fact depends on the values of the parameters L and Nmin. In this work we study the performance of the F-SPST algorithm as a function of the other user-specific parameters rmax,
min and M. Five families of Table 1 were randomly chosen and the SMC models were estimated varying the value of the corresponding parameter. The performance of the F-SPST algorithm was not significantly affected in the case of
min and M. The reader can see these results in the Supplementary Material for this paper. In Figure 4 we show the results for rmax.
|
The implementation of the SPST algorithm described in this paper was coded in ANSI C and compiled using gcc. The run time spent training a model on a AMD (Athlon) 851 Mhz PC, for a protein family in the Pfam 1.0 database, varies between 2 s and 49 min. The computation of the scores for all sequences in the Swiss-Prot 33 database takes on average 1 min 40 s for the SPST algorithm and 1 min 5 s for the F-SPST algorithm.
| 5 DISCUSSION |
|---|
|
|
|---|
The results presented in this paper strongly suggest that the SPST and F-SPST algorithms can improve the classification rates obtained with the PST algorithm. This probably owes to the fact that the sparse model mimics well the sparse nature of relevant domains in the amino acid chains. Another very interesting feature of SPST appears when comparing nodes in the estimated trees with the classes obtained by grouping the amino acids according to their physical and chemical properties. For instance, the estimated tree for the ATPase family associated with various cellular activities (AAA) family has as a sparse node the set of amino acids {I, V, L}. This set corresponds exactly with the group of aliphatic amino acids (Fig. 5).
|
In Figure 4 we see that the performance of F-SPST as a function of the parameter rmax decreases when the value of rmax increases. This is due to the fact that when rmax increases, the number of nodes in the tree decreases (more nodes are merged), and an under-fitted model is obtained. But it is important to note that for all values of rmax inferior to 50, the performance of F-SPST for the five families is maintained over 90%.
Actually we do not know which conditions must be satisfied by the stochastic process in order for the SPST algorithm to be consistent. We know that in the simple tree configuration of Figure 2 the SPST algorithm is consistent, but in the general case it is not. For space limitations we can not present in an appropriate way these facts, but we will discuss the important issue of consistency of the SPST algorithm in a forthcoming paper.
| Acknowledgments |
|---|
The author thanks her advisors, A. Galves and H. Armelin, for illuminating discussions during the preparation of this paper, P.-Y. Bourguignon for making his manuscript available for her and C. Coletti and R. Vêncio for helpful remarks. She also thanks the reviewers for their useful comments. This work is supported by CAPES and is part of PRONEX/FAPESP's project Stochastic behavior, critical phenomena and rhythmic pattern identification in natural languages (grant number 03/09930-9) and CNPq's project Stochastic modeling of speech (grant number 475177/2004-5).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Thomas Lengauer
Received on October 26, 2005; revised on February 19, 2006; accepted on March 6, 2006
| REFERENCES |
|---|
|
|
|---|
Apostolico, A. and Bejerano, G. (2000) Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput Biol, . 7, 381393[Medline].
Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res, . 32, 138141.
Bejerano, G. (2004) Algorithms for variable length Markov chain modeling. Bioinformatics, 20, 788789
Bejerano, G. and Yona, G. (2001) Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, 2343
Boeckmann, B., et al. (2004) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, . 31, 365370.
Bourguignon, P.-Y. and Nicolas, P. (2004) Modèles de Markov parcimonieux: sélection de modèle et estimation. proceedings of JOBIM 2004Montreal.
Bülhmann, P. and Wyner, A.J. (1999) Variable length Markov chains. Annal. Stat, . 27, 480513[CrossRef].
Eskin, E., et al. (2000) Protein family classification using sparse Markov transducers. Proc. Intl Conf. Intell. Syst. Mol. Biol, . 8, 134145.
Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci, 4, 11451160[Web of Science][Medline].
Rissanen, J. (1983) A universal data compression system. IEEE Trans. Inform. Theory, 29, 656664[CrossRef].
Ron, D., et al. (1996) The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Mach. Learn, . 25, 117149.
Rust, A.G., et al. (2002) Genome annotation techniques: new approaches and challenges. Drug Discov. Today, 7, 7076.
This article has been cited by other articles:
![]() |
H. Li, X. Dai, and X. Zhao A nearest neighbor approach for automated transporter prediction and categorization from protein sequences Bioinformatics, May 1, 2008; 24(9): 1129 - 1136. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














