Skip Navigation


Bioinformatics Advance Access originally published online on March 9, 2006
Bioinformatics 2006 22(11):1302-1307; doi:10.1093/bioinformatics/btl088
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/11/1302    most recent
btl088v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Leonardi, F. G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leonardi, F. G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A generalization of the PST algorithm: modeling the sparse nature of protein sequences

Florencia G. Leonardi

Instituto de Matemática e Estatística, Universidade de São Paulo Rua do Matão 1010 CEP 05508-090, São Paulo, Brazil


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 

Motivation: A central problem in genomics is to determine the function of a protein using the information contained in its amino acid sequence. Variable length Markov chains (VLMC) are a promising class of models that can effectively classify proteins into families and they can be estimated in linear time and space.

Results: We introduce a new algorithm, called Sparse Probabilistic Suffix Trees (SPST), that identifies equivalences between the contexts of a VLMC. We show that, in many cases, the identification of these equivalences can improve the classification rate of the classical Probabilistic Suffix Trees (PST) algorithm. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains, and this variation in the SPST algorithm is called F-SPST.

Availability: The SPST algorithm can be freely downloaded from the site http://www.ime.usp.br/~leonardi/spst/

Contact: leonardi{at}ime.usp.br

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 
Efficient family classification of newly discovered protein sequences is a central problem in bioinformatics (Rust et al., 2002). Nowadays, the most popular methods for generating a hypothesis about the function of a protein are BLAST and hidden Markov models (HMM). These methods, although very useful, are time-consuming considering the actual size of the databases. Recently, Bejerano and Yona (2001) applied VLMC in protein classification. They showed that variable length Markov chains (VLMC) detects much more related sequences than Gapped-BLAST and it is almost as sensitive as HMM. A major advantage of VLMC is that it does not depend on multiple alignments of the input sequences and it can be estimated in linear time and space (Apostolico and Bejerano, 2000).

In the present work we affront the problem of protein family classification creating a model for each protein family F and testing if each query sequence belongs to F or not. The model is a variation of VLMC, called sparse Markov chains (SMC). These two models are a class of stochastic processes that relies on the estimation of conditional probabilities of symbols in a sequence given the preceding symbols. The remarkable property of VLMC is that the number of symbols in the conditional sequence is variable, and depends on the symbols involved. This property attempts to capture variable dependencies in different configurations of amino acids in a protein family. A VLMC model can be represented by a tree in which each branch corresponds to a conditional sequence of symbols. For example, a VLMC tree for a set of sequences in the binary alphabet {0, 1} is illustrated in Figure 1. The conditional probabilities are associated to each branch, indicating the probability of the next symbol in a query sequence given that the preceding symbols match the sequence in the branch. The property of variable length in the conditional sequences is maintained in SMC. The difference between SMC and VLMC is that in SMC the conditional sequences are given by sequences of subsets of amino acids. This property attempts to identify which positions are more relevant than others in the configurations of amino acids that define the conditional sequences. The sequences composed by subsets of amino acids are called sparse.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
 
Fig. 1 A tree representation of a VLMC over the alphabet Formula = {0, 1}. The branches represent the sequences that are relevant to predict the next symbol in a sequence, and the probability distribution associated to each branch gives the probability of having 0 or 1 given that the preceding symbols match the sequence in the branch. For example, the probability of having symbol 0 in any position of a query sequence given that the preceding symbols are 0 and 1 respectively is P(0|10) = 0.9.

 
VLMC was first introduced in Rissanen (1983) as an universal model for data compression. Rissanen also introduced an estimation algorithm for this class of models called Context Algorithm (CA). It was proven in Bülhmann and Wyner (1999) that this algorithm is consistent. This means that if we suppose that the data were generated with a VLMC, with a sufficiently large sample the CA will find the real model. PST is another algorithm to estimate a VLMC (Ron et al., 1996), and it has the advantage of being computationally more economical than the CA (Apostolico and Bejerano, 2000). The PST algorithm was successfully used in protein classification, even though its performance decreases with less conserved families (Bejerano and Yona, 2001). For that reason some attempts were made to use VLMC related models for sparse sequences (Eskin et al., 2000; Bourguignon and Nicolas, 2004). Although very attractive, these two methods have the major disadvantage of having computationally very expensive algorithms.

In this work we introduce a variation in the PST algorithm to estimate an SMC for sparse sequences, called SPST. Another variation is introduced to improve the classification of PST and SPST, and it is called F-SPST. The paper is organized as follows. In Section 2 the formal definitions of VLMC and SMC and a simple example are given. In Section 3 we introduce the two variations in the PST algorithm and we explain how to classify protein sequences using these algorithms. In Section 4 we present the experimental results and in Section 5 we discuss these results and some ideas for future work.


    2 THEORY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 
Let Formula be a finite alphabet (e.g. the alphabet of twenty amino acids for protein sequences, or four nucleotides for DNA sequences). Let (Xn)nisinN be a stationary stochastic process with values Xn isin Formula. We say that the process (Xn) is a variable length Markov chain (VLMC) if

Formula
where {ell} is a function of the sequence x0, ... , xn–1; i.e. Formula. This function is called length function and the finite sequences (xn{ell}, ... , xn–1) are called contexts. As we are assuming a time homogeneous process, the length function does not depend on n, so we simply denote the contexts by (x{ell}, ... , x–1).

In a VLMC the set of contexts has the suffix property: this means that no context is a suffix of another context. This makes it possible to define without ambiguity the probability distribution of the next symbol. The suffix property also permits to represent the set of contexts as a tree. In this tree, each context (x{ell}, ... , x–1) is represented by a complete branch, in which the first node on top is x–1 and so on until the last element x{ell} which is represented by the terminal node of the branch. For example, in Figure 1 the set of contexts is {00, 10, 1}. A complete description of VLMC can be found in Bülhmann and Wyner (1999).

We introduce here the definition of an SMC. An SMC is a VLMC in which some contexts can be grouped together into an equivalence class. In a SMC the probability transitions are given by

Formula
where Ai sub Formula for all i = n{ell}, ... , n–1. This relation induces a partition of the set of contexts of the corresponding VLMC. Then, the contexts in an SMC are given by the equivalence classes denoted by the sequences of subsets (A{ell}, ... , A–1). A tree representation for this type of contexts can be seen in Figure 2. To each node of the tree is associated a conditional probability distribution over the next symbol given the context represented by the node. For example, if we want to compute the probability of the sequence 01001 in the model given by Figure 2 we have P(01001) = P(0) x P(1|0) x P(0|01) x P(0|10) x P(1|00) = 9/16 x 7/16 x 3/10 x 9/10 x 7/10.


Figure 2
View larger version (11K):
[in this window]
[in a new window]
 
Fig. 2 A sparse tree over the alphabet Formula = {0, 1}. The set of contexts of the associated VLMC is {00, 01, 10, 11}, and the equivalence classes are {00, 01} and {10, 11}. This means that P(Xn = xn|Xn–2 = 0, Xn–1 = 0) = P(Xn = xn|Xn–2 = 0, Xn–1 = 1) and P(Xn = xn|Xn–2 = 1, Xn–1 = 0) = P(Xn = xn|Xn–2 = 1, Xn–1 = 1), for all xn isin Formula.

 
In this work we propose a variation in the PST algorithm to identify the equivalence classes given by the sparse contexts (A{ell}, ... , A–1). This algorithm is presented in the next section.


    3 ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 
Let p1, p2, ... , pm be the sample sequences belonging to a protein family F. For each i = 1, ... , m we denote Formula, where ni is the length of sequence pi.

Given a sparse context w = (A{ell}, ... , A–1) and a symbol a isin Formula, we denote by N(w, a) the number of occurrences of context w followed by symbol a in the sample set. That is,

Formula
where 1 is the indicator function that takes value 1 if Formula, or 0 otherwise. We also define

Formula

Given a sparse context w = (A{ell}, ... , A–1) and symbols a1, a2 isin Formula we denote by a1w and a2w the sparse contexts ({a1}, A{ell}, ... , A–1) and ({a2}, A{ell}, ... , A–1), respectively. We also denote by [a1, a2]w the sparse context ({a1, a2}, A{ell}, ... , A–1). Using this notation we define the operator {Delta}w (a1, a2) as the logarithm of the ratio between the estimated probability of the sequences in the model that has the contexts a1w and a2w as equivalent and the model that distinguishes the two contexts as different. That is,

Formula
Note that N([a1, a2]w, a) = N(a1w, a) + N(a2w, a) and N([a1, a2] w, ·) = N(a1w, ·) + N(a2w, ·).

Using the preceding definitions we can specify how the SPST algorithm works. The free parameters that must be specified by the user are the maximum depth of the tree, Lmax, the minimum number of times that a sparse context has to be seen in the sample to be considered, Nmin, and a cutoff parameter that establishes the equivalence between two contexts, rmax. The SPST algorithm works as follows. It starts with a tree T consisting of a single root node. At each step, for every terminal node in T labeled by a sparse sequence w with depth less than L and for every symbol a isin Formula, the child a is added to node w if N(aw,·) ≥ Nmin. Then, for every pair of children of w, a1 and a2, we test the equivalence of the contexts a1w and a2w using the {Delta} operator. That is, we compute {Delta}w(a1, a2) for every pair of symbols (a1, a2) isin Formula added to node w, and choose the minimum between all the pairs. If this minimum is smaller than rmax, the corresponding nodes are merged together into a single node. This procedure is iterated with the new set of children of w until no more nodes can be merged. Taking the minimum of {Delta}w(a1, a2) between all the possible pairs (a1, a2) ensures the independence of the order in which the tests are performed. To conclude the construction of the tree we assign to each node a transition probability distribution. This distribution gives the probability of a symbol in Formula given the sparse context between the node and the root of the tree. The transition probabilities are estimated using the maximum likelihood estimates. That is, given a sparse context w = (A{ell}, ... , A–1), the estimated probability of a symbol a isin Formula given the context w is given by

Formula

Given a sparse context w = (A{ell}, ... , A–1) we denote by w' the sparse context (A{ell}+1, ... , A–1) and by |w| the length of context w. Summarizing the steps of the SPST algorithm, we have the following:

SPST-Algorithm (Nmin, rmax, Lmax)

  1. Initialization: let T be a tree consisting of a single root node, and let

    Formula

  2. Iteration: while Formula do
    1. Remove u of Formula and add u to T. Then remove all sparse contexts Formula such that w' = u' and add them to T. Note that the context w is of the form aiu' for some symbol ai isin Formula. Let C denote the set of contexts added to T in this step.
    2. Compute

      Formula
      and

      Formula

    3. If r < rmax merge Formula and Formula in a single node. Replace the contexts Formula and Formula in C by the context Formula.
    4. Repeat steps (b) and (c) until no more changes can be made in C.
    5. For each sparse context w isin C, if |w| < Lmax then add the set {aw : a isin Formula and N (aw, · ) ≥ Nmin} (if any) to Formula.

  3. Estimation of the transition probabilities: assign to each node in T, associated with a sparse context w, the transition probability distribution over Formula given by

    Formula

3.1 Application of the SPST algorithm to classify protein sequences
Given a protein family F and a new sequence of amino acids p = p1, ... , pn, we want to know if p belongs to F or not. To answer this question we first construct a model for the family F using the sequences already classified into the family. Then we compute a score, and depending on this value we classify the sequence p as belonging to the family F or not. The model constructed for the family F is an SMC model obtained by estimating the sparse contexts and the transition probabilities given by the SPST algorithm, as explained above. There is no restriction about the minimum number of sequences needed to estimate an SMC, because SPST will only estimate the next symbol probability conditioned on sequences that appear a minimum number of times in the sample. But it is also true that with a very small family the estimation of the probability distributions could be very poor. For that reason we only test the SPST algorithm with families containing more than 10 sequences, as explained in the next section.

Given a query sequence p, its score in the estimated SMC model for F is given by

Formula
where Formula is the smoothed probability distribution derived from Formula. That is

Formula
where w(p1, ... , pi–1) is the sparse context corresponding to the sequence p1, ... , pi–1. The parameter {gamma}min is a smoothing parameter to avoid zero probabilities, and therefore, a –{infty} score.

Sometimes the region of high similarity between the sequences in a protein family is considerably smaller than the length of the sequences. This is because a protein sequence can be composed by several domains, performing different functions in the cell. Then, computing the score S over the entire sequence p may not be appropriate. For this reason we propose a change in the computation of the score S and called it S'. In this case we fix an integer M, and for sequences with length n > M we compute the score S'(p) by

Formula
In the case n ≤ M, the score is computed using S as before. The algorithm that implements the score S' is called F-SPST.


    4 IMPLEMENTATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 
In order to test our algorithm and to compare it with PST published results (Bejerano and Yona, 2001) we use protein families of the Pfam database (Bateman et al., 2004) release 1.0. This database contains 175 families derived from the Swiss-Prot 33 database (Boeckmann et al., 2004). We selected the first 50 families (in alphabetical order) that contain more than 10 sequences. For each family in this set, we trained an SMC model with the SPST algorithm. We used four-fifths of the sequences in each family for training, and then we applied the resulting model to classify all the sequences in the Swiss-Prot 33 database. To establish the family membership threshold we used the equivalence number criterion (Pearson, 1995). This method sets the threshold at the point where the number of false positives (the number of non-member proteins with score above the threshold) equals the number of false negatives (the number of member proteins with score below the threshold), i.e. it is the point of balance between selectivity and sensitivity. A member protein that scores above the threshold (true positive) is considered successfully detected. The quality of the model is measured by the percentage of true positives detected with respect to the total number of proteins in the family.

Table 1 shows the classification rates obtained with the SPST and F-SPST algorithms, together with the published results obtained with the PST algorithm. We also show the number of false positives for each algorithm. Because of the way of establishing the family membership threshold, the percentage of false positives is equal to 100% minus the percentage of true positives (with respect to the total number of sequences in the family). For example, in the case of family 7tm_1, the percentage of true positives detected by the F-SPST algorithm is 97.7%, so the percentage of false positives is 2.3%. This gives 12 sequences erroneously classified as members of the 7tm_1 family. Figure 3 summarizes the classification rates of Table 1 in two scatter-plots.


View this table:
[in this window]
[in a new window]
 
Table 1 Performance comparison between PST, SPST and F-SPST.

 

Figure 3
View larger version (9K):
[in this window]
[in a new window]
 
Fig. 3 Scatter-plots of performances from PST, SPST and F-SPST protein classification methods. Above: SPST versus PST. Below: F-SPST versus PST.

 
It was shown in Bejerano and Yona (2001) that the performance of the PST algorithm can be improved increasing the number of nodes in the tree. In the case of SPST and F-SPST, this fact depends on the values of the parameters L and Nmin. In this work we study the performance of the F-SPST algorithm as a function of the other user-specific parameters rmax, {gamma}min and M. Five families of Table 1 were randomly chosen and the SMC models were estimated varying the value of the corresponding parameter. The performance of the F-SPST algorithm was not significantly affected in the case of {gamma}min and M. The reader can see these results in the Supplementary Material for this paper. In Figure 4 we show the results for rmax.


Figure 4
View larger version (12K):
[in this window]
[in a new window]
 
Fig. 4 Performance evaluation of the F-SPST algorithm as a function of the user-specific parameter rmax. The x-axis shows the values of this parameter used to estimate the model. The evaluation was made for five randomly chosen families.

 
The implementation of the SPST algorithm described in this paper was coded in ANSI C and compiled using gcc. The run time spent training a model on a AMD (Athlon) 851 Mhz PC, for a protein family in the Pfam 1.0 database, varies between 2 s and 49 min. The computation of the scores for all sequences in the Swiss-Prot 33 database takes on average 1 min 40 s for the SPST algorithm and 1 min 5 s for the F-SPST algorithm.


    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 
The results presented in this paper strongly suggest that the SPST and F-SPST algorithms can improve the classification rates obtained with the PST algorithm. This probably owes to the fact that the sparse model mimics well the sparse nature of relevant domains in the amino acid chains. Another very interesting feature of SPST appears when comparing nodes in the estimated trees with the classes obtained by grouping the amino acids according to their physical and chemical properties. For instance, the estimated tree for the ATPase family associated with various cellular activities (AAA) family has as a sparse node the set of amino acids {I, V, L}. This set corresponds exactly with the group of aliphatic amino acids (Fig. 5).


Figure 5
View larger version (20K):
[in this window]
[in a new window]
 
Fig. 5 An SMC tree estimated with the SPST algorithm. This tree corresponds to sequences in the AAA family (ATPase family associated with various cellular activities). In each node we can see the subsets of amino acids corresponding to different positions of the sparse contexts (the curly brackets of the subsets were dropped). Some nodes of the tree are in correspondence with the physico-chemical groups of the amino acids (shown in the square), as for example the set {I, L, V} that corresponds to the set of aliphatic amino acids.

 
In Figure 4 we see that the performance of F-SPST as a function of the parameter rmax decreases when the value of rmax increases. This is due to the fact that when rmax increases, the number of nodes in the tree decreases (more nodes are merged), and an under-fitted model is obtained. But it is important to note that for all values of rmax inferior to 50, the performance of F-SPST for the five families is maintained over 90%.

Actually we do not know which conditions must be satisfied by the stochastic process in order for the SPST algorithm to be consistent. We know that in the simple tree configuration of Figure 2 the SPST algorithm is consistent, but in the general case it is not. For space limitations we can not present in an appropriate way these facts, but we will discuss the important issue of consistency of the SPST algorithm in a forthcoming paper.


    Acknowledgments
 
The author thanks her advisors, A. Galves and H. Armelin, for illuminating discussions during the preparation of this paper, P.-Y. Bourguignon for making his manuscript available for her and C. Coletti and R. Vêncio for helpful remarks. She also thanks the reviewers for their useful comments. This work is supported by CAPES and is part of PRONEX/FAPESP's project Stochastic behavior, critical phenomena and rhythmic pattern identification in natural languages (grant number 03/09930-9) and CNPq's project Stochastic modeling of speech (grant number 475177/2004-5).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on October 26, 2005; revised on February 19, 2006; accepted on March 6, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 

    Apostolico, A. and Bejerano, G. (2000) Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput Biol, . 7, 381–393[Medline].

    Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res, . 32, 138–141.

    Bejerano, G. (2004) Algorithms for variable length Markov chain modeling. Bioinformatics, 20, 788–789[Abstract/Free Full Text].

    Bejerano, G. and Yona, G. (2001) Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, 23–43[Abstract/Free Full Text].

    Boeckmann, B., et al. (2004) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, . 31, 365–370.

    Bourguignon, P.-Y. and Nicolas, P. (2004) Modèles de Markov parcimonieux: sélection de modèle et estimation. proceedings of JOBIM 2004Montreal.

    Bülhmann, P. and Wyner, A.J. (1999) Variable length Markov chains. Annal. Stat, . 27, 480–513[CrossRef].

    Eskin, E., et al. (2000) Protein family classification using sparse Markov transducers. Proc. Intl Conf. Intell. Syst. Mol. Biol, . 8, 134–145.

    Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci, 4, 1145–1160[Web of Science][Medline].

    Rissanen, J. (1983) A universal data compression system. IEEE Trans. Inform. Theory, 29, 656–664[CrossRef].

    Ron, D., et al. (1996) The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Mach. Learn, . 25, 117–149.

    Rust, A.G., et al. (2002) Genome annotation techniques: new approaches and challenges. Drug Discov. Today, 7, 70–76.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
H. Li, X. Dai, and X. Zhao
A nearest neighbor approach for automated transporter prediction and categorization from protein sequences
Bioinformatics, May 1, 2008; 24(9): 1129 - 1136.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/11/1302    most recent
btl088v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Leonardi, F. G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leonardi, F. G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?