Bioinformatics Advance Access originally published online on March 29, 2005
Bioinformatics 2005 21(10):2287-2293; doi:10.1093/bioinformatics/bti374
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Quasi-consensus-based comparison of profile hidden Markov models for protein sequences
1Delaware Biotechnology Institute Newark, DE 19715, USA
2Fox Chase Cancer Center Philadelphia, PA 19111, USA
3Department of Computer and Information Sciences, University of Delaware Newark, DE 19716, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
A simple approach for the sensitive detection of distant relationships among protein families and for sequencestructure alignment via comparison of hidden Markov models based on their quasi-consensus sequences is presented. Using a previously published benchmark dataset, the approach is demonstrated to give better homology detection and yield alignments with improved accuracy in comparison to an existing state-of-the-art dynamic programming profileprofile comparison method. This method also runs significantly faster and is therefore suitable for a server covering the rapidly increasing structure database. A server based on this method is available at http://liao.cis.udel.edu/website/servers/modmod
Contact: roland.dunbrack{at}fccc.edu; lliao{at}mail.eecis.udel.edu
| INTRODUCTION |
|---|
|
|
|---|
As the number of available protein sequences increases at a much faster pace than their structure and functions can be experimentally determined, the development of more sensitive and accurate sequence analysis algorithms remains important. Evolutionary relationships implied by sequence similarity remain a major basis for computationally predicting structure and function. Despite the tremendous number of methods devised, the task of detecting distant homologies, those that share sequence similarity lower than 30%, and predicting accurate sequencestructure alignments is still quite a challenge.
Detection ability and sequencealignment accuracy can be improved by using multiple sequence information in the form of a position-specific scoring matrix or profile (Gribskov et al., 1987; Altschul et al., 1997). In contrast to methods that try to establish homology based on similarity between a pair of protein sequences, sequenceprofile methods compare an individual protein sequence to one or more profiles. In a similar vein, profileprofile methods (Pietrokovski, 1996) compare two profiles, attempting to capture remote homologies that may be otherwise missed.
Widely accepted sequenceprofile comparison methods for homology detection can be put into two major categories depending on how the profiles are constructed and represented. The first category consists of methods that represent protein families as position-specific scoring matrices, obtained by transforming a multiple sequence alignment into log-odds or frequency matrices of the 20 amino acids at each position in the alignment (Gribskov et al., 1987). The second category consists of hidden Markov models (HMMs) (Krogh et al., 1994; Eddy, 1998), which have a more complex architecture designed to model the observed multiple sequence alignment, including insertiondeletion regions, probabilistically. In both categories, a sequenceprofile comparison method can work in both directions: a profile query against a sequence database or a sequence query against a profile database.
To exploit the information contained in profiles, profileprofile comparison methods have been developed that compare a query profile to a database of profiles for enhanced detection sensitivity (Pietrokovski, 1996; Rychlewski et al., 2000; Yona and Levitt, 2002; Sadreyev and Grishin, 2003; Wang and Dunbrack Jr, 2004). Most of these profileprofile methods start with defining a scoring scheme between columns of the two profiles. The similarity between two profiles is then measured as the total score when aligning columns in one profile to columns in the other profile with an appropriate gap-penalty scheme, and such an alignment can be found by using the same dynamic programming (DP) technique as used in SmithWaterman local alignment for a pair of sequences (Smith and Waterman, 1981). Several recent papers have compared many of the scoring functions previously proposed (Mittelman et al., 2003; Edger and Sjolander, 2004; Wang and Dunbrack Jr, 2004).
Because of the favorable properties of HMMs in comparison with matrix profiles (Durbin et al., 1998), it is desirable to develop methods for comparing HMMs with one another. A DP scheme to align states of two HMMs for protein families was described several years ago by Shi and States (http://www.ornl.gov/sci/techresources/Human_Genome/publicat/99santa/108.html) and applied to cluster the models in the PFAM database (Bateman et al., 2004). Another approach for comparing two HMMs based on their co-emission probability has also been presented and applied to HMMs for signal peptides (Lyngso et al., 1999). Recently, Madera and Chothia have made the PRC program available on the SUPERFAMILY site (http://supfam.org) that compares HMMs with each other, although the details of this method are not currently published.
In this paper, we propose a simple, alternative approach for comparing two profile HMMs using their quasi-consensus sequences. These sequences are those derived by taking the most probable sequence from the highest-probability path through the HMM. This method provides rapid remote homology detection and alignment prediction.
| METHODS |
|---|
|
|
|---|
Consensus-based comparison of profile HMMs
A profile HMM for a protein family can be considered as a probability distribution over an infinite space of all possible amino acid strings that assigns higher probability for the members of the family. Therefore, comparing two profile HMMs amounts to comparing two distributions of probability. A conceptual approach to comparing two HMMs Mi and Mj is to consider them as (row) vectors in the infinite space of all sequences and define a metric that measures the similarity between the vectors. If we choose the components of these infinite dimension vectors to be the probabilities or scores assigned to all sequences by the models, the models can be then represented as
![]() |
} is an infinite set of all sequences in an arbitrarily preset order. Although represented as vectors, the similarity measure between these vectors is not computable in practice due to the vectors' infinite size. To circumvent this problem, we consider a plausible heuristic in which we focus on comparing the two vectors by their most significant components, where the corresponding sequence receives a maximal score. Let us assume that Mi peaks at si(Ci) and Mj peaks at sj(Cj), with Ci and Cj defined as the consensus sequences for Mi and Mj respectively. We consider two models to be highly similar if their respective consensus sequences are very probable in both models. Specifically, we compare profile HMMs by defining a similarity metric given as
![]() | (1) |
One way to address the issue of score significance is to transform raw scores into z-scores, which measure how far the raw scores deviate from the distribution mean. Specifically, for a raw score si (x), its z-score zi (x) is defined as
![]() | (2) |
i is the standard deviation of the distribution {si(x1), si(x2), ..., si (x
)} induced by model Mi. In practice, the mean and standard deviation of this distribution over an infinite set is estimated from a finite sampling, which in most cases is naturally obtained when searching a database of profile HMMs. In other words, in our case of searching a database of profile HMMs, the infinite dimensional vector representation of a model is reduced to a vector of a finite dimension {si(C1), si (C2), ..., si (CN)} where N is equal to the database size and the Ck are the consensus sequences of the models in the database. Thus, si(Ck) is the kth component of Mi, with Ck being the consensus sequence of the kth model in the database, and can be transformed into a z-score zi(Ck) as defined above. Our z-score-based similarity measure can then be taken as
![]() | (3) |
One problem with using consensus sequences to compare profile HMMs is that the task of finding the exact consensus sequence for a given HMM is an NP-hard problem (Lyngso et al., 1999). As a substitute for the exact consensus sequence, in this work we use a quasi-consensus sequence computed by a heuristic approach using the hmmemit program from the HMMER package (Eddy, 1998) with the -c option. This is found by traversing the nodes in the profile HMM and calculating the state occupancy of the match and insert states in each node. If the match state in a node is used with probability
0.5, then this match state is consensus and the maximally likely residue in the state is chosen. On the other hand, if the state occupancy says that the insert state is used with probability
0.5, this insert state is consensus and is used to generate the degenerate symbol (X) as many times as the expectation value of the insertion length.
Sequence alignments
To search a database of HMMs associated with known structures, we first construct an HMM for the query sequence from a multiple sequence alignment of similar proteins, for instance those generated by a PSI-BLAST search. Our goal is to find an alignment between the query sequence (as seed sequence to its own HMM) and the seed sequences for HMMs in the database. Alignments between the seed sequences of two profile HMMs being compared can be found by aligning quasi-consensus sequences to the HMMs and using simple transitivity. To illustrate this, let us assume that there are two HMMs, Mi and Mj, generated using seed sequences Si and Sj, and let Ci and Cj be quasi-consensus sequences for Mi and Mj, respectively. Our goal is to find an alignment between the seed sequences Si and Sj. To get such an alignment, we can use either of the quasi-consensus sequences as an intermediate sequence between Si and Sj. That is, we can align Ci to models Mi and Mj using a standard sequencemodel alignment procedure, such as the posterior decoding or Viterbi algorithm (Durbin et al., 1998). We use the align2model program in the SAM package for this purpose (Krogh et al., 1994). The resulting SiCi and CiSj alignments can then be used to determine the SiSj alignment using transitivity. Similarly, the other consensus sequence Cj can also be used. In a later section on alignment accuracy, we describe how to combine these two alignments to reach a single final alignment between Si and Sj.
Benchmark datasets
In a previous paper on comparison of different profileprofile alignment scoring functions (Wang and Dunbrack Jr, 2004), we built two datasets for benchmarking alignment accuracy and homology detection based on consensus DALI (Holm and Sander, 1995) and CE (Shindyalov and Bourne, 1998) structural alignments of domains defined in SCOP (Murzin et al., 1995). These were defined as a Set A of 3441 pairs of sequences for experiments on sequence alignment accuracy and a Set S of 665 sequences for assessing sequence search sensitivity and specificity. In this work, we reduced Set S to 569 sequences. We used multiple sequence alignments of proteins homologous to those in Sets A and S from our previous work (Wang and Dunbrack Jr, 2004). Profile HMMs were constructed from these multiple sequence alignments using the SAM package (Karplus et al., 1998) with the w0.5 script. Then, for each of the constructed models, quasi-consensus sequences were generated using the hmmemit program from the HMMER package (Eddy, 1998).
For accuracy, following Wang and Dunbrack Jr (2004), about one-third of those sequence pairs in Set A were selected randomly to be used as a training set (1136 pairs), with the rest serving as a testing set (2305 pairs). The testing set was further split into eight groups, each corresponding to a range of sequence percent identity. The distribution of sequence pairs over the eight ranges is shown in Table 1.
|
We compare our method with the COMPASS program (Sadreyev and Grishin, 2003). COMPASS was used as described in its documentation. The same multiple sequence alignments used to construct HMMs were provided to COMPASS for profileprofile alignments. For COMPASS, all versus all comparisons were made by comparing each of the 569 multiple sequence alignments (as queries) to a database of the 569 profiles (created from the 569 multiple sequence alignments using the COMPASS package program mk_compass_db) using the compass_vs_db program.
| RESULTS |
|---|
|
|
|---|
Evaluation of homolog detection ability
We performed all versus all comparison experiments on the reduced Set S using three approaches: profileprofile alignments with COMPASS (Sadreyev and Grishin, 2003), quasi-consensus sequence-based HMMHMM alignment, and seed sequence-based HMMHMM alignment. In the quasi-consensus sequence-based HMMHMM alignment, each model was used to score the quasi-consensus sequences of all other models in the set using the hmmscore program in the SAM package (with -sw 3 option and reverse-sequence null model) and the resulting vector of 568 scores was normalized into a z-score vector. The symmetric form (using both terms) and asymmetric form (using the first term) of Equation (3) were then used as a measure of similarity between a pair of models Mi and Mj. The same procedure was used for the seed sequence-based HMMHMM alignment by scoring seed sequences. For COMPASS, all versus all comparisons were made by comparing each of the 569 multiple sequence alignments (as queries) to a database of the 569 profiles. Each list of 568 querydatabase comparison scores were then transformed into z-scores and the same symmetric and asymmetric similarity measures of Equation (3) were used. We also used the E-values for COMPASS as similarity measures. However, for the quasi-consensus- and seed sequence-based comparison of HMMs, we found that using the E-value from the hmmscore program of the SAM package gave poor performance (data not shown).
The criterion used for deciding whether two models are related or not is based on whether the seeds used to generate these models are from the same SCOP fold or not. As in our earlier work (Wang and Dunbrack Jr, 2004), if two proteins were defined as having a Rossmann fold in SCOP, we considered them as true positives, even if they were in different SCOP fold categories. However, we found that avoiding this assumption on the Rossmann fold affects the results for all cases in this benchmark the same way (data not shown). Detection performance was measured by using the receiver operator characteristic (ROC), which plots the true positive count (Y-axis) as a function of false positive count (X-axis) when scanning through the output sorted by score. Receiver operator characteristic curves offer a combined measure of both sensitivity and specificity (Gribskov and Robinson, 1996). For numeric quantification of these curves, we calculated the ROC100 score, which is the normalized area under each curve up to the first 100 false positives. The ROC curves and ROC100 scores for different approaches are reported in Figure 1 and Table 2 respectively.
|
|
As shown in Table 2 and Figure 1, the results for COMPASS show that using the symmetric Z measure as given in Equation (3) does better than using the E-value measure in terms of search specificity and are comparable in terms of search sensitivity. However, we see that the symmetric Z measure is much better than the asymmetric Z measure. In both the consensus and seed-based comparisons, the symmetric Z measure is much better than the asymmetric one. As shown in the last plot of Figure 1 and Table 2, the consensus-based comparison of HMMs achieves better performance than the other two approachesCOMPASS and seed sequence-based HMM comparison. The symmetric ROC100 score for the quasi-consensus method is 0.915 while COMPASS achieves a value of 0.883. More importantly, the quasi-consensus method is much faster and is shown to produce more accurate alignments, as discussed below.
Evaluation of alignment accuracy
The accuracy of the predicted alignments is quantified by three parameters QModeler, QDeveloper and QCombined as proposed in earlier work (Sauder et al., 2000; Yona and Levitt, 2002). The accuracy parameter from the modeler's point of view, QModeler, is the ratio of the number of correctly aligned pairs of residues to the total number of aligned positions in the alignment being evaluated. The parameter from the developer's point of view, QDeveloper, is the ratio of the number of correctly aligned pairs to the number of aligned positions in the structural alignment. The combined accuracy parameter, QCombined, is the total number of correct matches divided by the total number of positions that are aligned in either the structural alignment or the alignment being evaluated. Aligned pairs that are in both the sequence and the structure alignment are counted only once in the denominator of QCombined.
In this experiment, for every pair in the testing part of Set A (2305 pairs), we created two alignments by using each of the two consensus sequences as a hinge as described earlier. We then combined these two alignments into a final alignment according to the following two schemes. The first scheme (MAX) simply picks one alignment, as a whole, out of the two alignments. The alignment that is picked has the higher score when the quasi-consensus sequence was aligned to the HMM. The second scheme (called AND selection) is to take segments that are common in both alignments. The final alignment thus obtained misses out segments where the two alignments disagree. Three variations to this AND-selection are suggested on which alignment of these uncommon segments shall be picked to patch the final alignment. In the first variation (AND1 selection), for a region where the two alignments disagree, we choose the alignment that contains fewer gap positions. As a refinement of AND1 selection, the second variation (AND2 selection) calculates a BLOSUM score for each of the two alignments at an uncommon segment and picks the alignment that has a larger score. The third variation (AND3 selection) chooses for a non-common segment the alignment that agrees better with the structural alignment for that segment. However, the AND3 selection is not usable when the correct answer, i.e. the structural alignment, is not available. The AND3 was the best alignment one could obtain with a method that was always able to tell which alignment was better.
The quality of the final alignments obtained by using these selection schemes and by using COMPASS was evaluated, and the accuracy parameters were averaged in the eight regions of sequence identity range as listed in Table 1. The results are reported in Figure 2. We observe that regions in the two alignments that agree with one another (AND) are more accurate per residue of the alignment (QModeler) than those that include regions where the alignments disagree (AND1, AND2 and AND3). However, the latter alignments are also considerably longer and do provide some correct information, resulting in higher QDeveloper scores. We see that, with the exception of the first identity range, both the MAX selection and AND2 selection outperform COMPASS in terms of QCombined.
|
We repeated the same experiment by using seed sequences instead of the quasi-consensus sequences to compare the HMMs. In this case, a seed sequence of one model is aligned to the other model, and vice versa. From the two resulting alignments, a final alignment was generated by using the selection schemes discussed above. The accuracy for these seed-based alignments is shown in Figure 3. The results show that the seed-sequence method does not behave well at very low sequence identities, but it does better than COMPASS at higher sequence identities.
|
The results of these experiments as shown in Figures 2 and 3 indicate that it is advisable to use consensus-based alignments in the lower sequence identity regions and seed-based alignments in the higher sequence identity regions. Because of this, we need to decide whether to choose a seed-based or a consensus-based alignment depending on the symmetric z-score between the two HMMs as given in Equation (3). Using the training part of Set A (1136 pairs), we found that a value of 22 resulted in the highest alignment accuracy. Below this cutoff (i.e. more negative scores) we used the seed-based alignment and above this cutoff we used the consensus-based. The results of this mixed scheme on the testing set of Set A (2305) pairs are shown in Figure 4 along with the COMPASS results.
|
Statistical significance
To provide an estimation of the significance of hits produced by the server, we calculated the probability of a false positive at each value of the symmetric z-score [Equation (3)]. This was performed on a set of 7924 profiles in the SUPERFAMILY database (Gough and Chothia, 2002). All unique HMMHMM comparisons for this set were performed using the quasi-consensus method, and the results were sorted by symmetric z-scores. For a particular value of the symmetric z-score, the proportion of false positives with a score equal to or greater than that score was calculated. The results are plotted in Figure 5. In this figure, F(zsym) = log10 [P(zsym)] is plotted as a function of the symmetric z-score. Because it is not possible to estimate the P-value until the first false positive is reached (as zsym decreases from infinity), the values in the figure represent an upper bound on the value of P and a lower bound on F. The points are fitted to a polynomial of degree 8, so that
with a correlation coefficient of 0.98. This function is used only to calculate the statistical significance for any value of zsym. No meaning is attached to the parameter values.
|
Algorithmic complexity
Like most profileprofile methods, COMPASS uses a DP technique to align two files of size n and m columns where a column is represented by a distribution over amino acids. Consider two profiles A = [a1 a2
am] and B = [b1 b2
bn] where ai is the ith column in profile A and bj is the jth column in profile B. To align A and B with affine gap scores, we need to fill three n-by-m DP matrices as given by
![]() |
As described above, generating a quasi-consensus sequence requires just a walk through the chain of states in the model and the time complexity for this step is O(n), where n is the number of match states in the model. Moreover, in the task of searching a database of profile HMMs, quasi-consensus sequences for all the models in the database can be generated and stored. In our experiment running on a system with a Pentium 1.13 GHz CPU and 376 Mb of RAM, it took COMPASS about 38 h to compute the all-versus-all comparison of 569 protein families, but it took our approach only about 1 h to do the same.
| DISCUSSION |
|---|
|
|
|---|
In this work, we propose a method for comparing profile HMMs via quasi-consensus sequences. While finding a true consensus sequence for a given profile HMM is NP-hard, quasi-consensus sequences seem to be a good surrogate to the true consensus sequences. We compare two profile HMMs by using quasi-consensus sequences from one model and score it against the other model and vice versa, and the similarity between the models is measured by a metric defined based on such comparisons. We also proposed aligning seed sequences of these profile HMMs using quasi-consensus sequences as hinges. To test the utility and effectiveness of our method, we used a benchmark dataset from our earlier published work. Our experiments demonstrate that our method outperforms a state-of-the-art profileprofile method, COMPASS, at identifying protein families with similar structures as defined at the SCOP fold level. The alignments generated by our method are more accurate than those generated by COMPASS, as measured against three established benchmark metrics. Moreover, unlike COMPASS, our approach is basically a sequenceprofile comparison and does not require the computation of a substitution matrix and hence runs much faster.
With a slightly better accuracy and a considerably faster speed, our approach should prove to be a good alternative to current profileprofile methods for use in database searches against proteins of known structure. As the size of the Protein Data Bank increases, this advantage is significant.
A server based on this method is available at http://liao.cis.udel.edu/website/servers/modmod
| Acknowledgments |
|---|
This publication was made possible by NIH Grant P20 RR-15588 from the COBRE Program of the National Center for Research Resources (L.L.), DuPont Science & Engineering award (L.L.), NIH Grant R01-HG02302 to R.L.D., and funds from the Pennsylvania Tobacco Settlement (R.L.D. and G.W.).
Received on August 26, 2004; revised on February 14, 2005; accepted on March 2, 2005
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F., et al. (1997) Gapped blast and psi-blast: a new generation of database programs. Nucleic Acids Res., 25, 33893402
Bateman, A., et al. (2004) The pfam protein families database. Nucleic Acids Res., 32, D138D141
Durbin, R., Eddy, S., Krogh, A., Mitchison, G. Biological Sequence Analysis, (1998) , Cambridge, UK Cambridge University Press.
Eddy, S.R. (1998) Profile hidden markov models. Bioinformatics, 14, 755763
Edgar, R.C. and Sjolander, K. (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 13011308
Gough, J. and Chothia, C. (2002) Superfamily: Hmms representing all proteins of known structure. Scop sequence searches, alignments and genome assignments. Nucleic Acids Res., 30, 268272
Gribskov, M. and Robinson, N.L. (1996) Use of receiver-operator characteristic (roc) analysis to evaluate sequence matching. Comput. Chem., 20, 2533[CrossRef][Web of Science][Medline].
Gribskov, M., et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 43554358
Holm, L. and Sander, C. (1995) Dali: a network tool for protein structure comparison. Trend. Biochem. Sci., 20, 478480[CrossRef][Web of Science][Medline].
Karplus, K., et al. (1998) Hidden markov models for detecting remote protein homologies. Bioinformatics, 14, 846856
Krogh, A., et al. (1994) Hidden markov models in computational biology. Applications to protein modeling. J. Mol. Biol., 235, 15011531[CrossRef][Web of Science][Medline].
Lyngso, R.B., et al. (1999) Metrics and similarity measures for hidden markov models. Proc. Int. Conf. Intell. Syst. Mol. Biol., 178186.
Mittelman, D., et al. (2003) Probabilistic scoring measures for profileprofile comparison yield more accurate short seed alignments. Bioinformatics, 19, 15311539
Murzin, A.G., et al. (1995) Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536540[CrossRef][Web of Science][Medline].
Pietrokovski, S. (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res., 24, 38363845
Rychlewski, L., et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci., 9, 232241[Web of Science][Medline].
Sadreyev, R. and Grishin, N. (2003) Compass: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317336[CrossRef][Web of Science][Medline].
Sauder, J.M., et al. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 622[CrossRef][Web of Science][Medline].
Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (ce) of the optimal path. Protein Eng., 11, 739747
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195197[CrossRef][Web of Science][Medline].
Wang, G. and Dunbrack, R.L., Jr. (2004) Scoring profileprofile alignments. Protein Sci., 13, 16121626[CrossRef][Web of Science][Medline].
Yona, G. and Levitt, M. (2002) Within the twilight zone: a sensitive profileprofile comparison tool based on information theory. J. Mol. Biol., 315, 12571275[CrossRef][Web of Science][Medline].
This article has been cited by other articles:
![]() |
R. I. Sadreyev, M. Tang, B.-H. Kim, and N. V. Grishin COMPASS server for homology detection: improved statistical accuracy, speed and functionality Nucleic Acids Res., July 1, 2009; 37(suppl_2): W90 - W94. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Przybylski and B. Rost Powerful fusion: PSI-BLAST and consensus sequences Bioinformatics, September 15, 2008; 24(18): 1987 - 1993. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. I. Sadreyev and N. V. Grishin Accurate statistical model of comparison between multiple sequence alignments Nucleic Acids Res., April 1, 2008; 36(7): 2240 - 2248. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Przybylski and B. Rost Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments Nucleic Acids Res., April 1, 2007; 35(7): 2238 - 2246. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











