Skip Navigation


Bioinformatics Advance Access originally published online on March 29, 2005
Bioinformatics 2005 21(10):2287-2293; doi:10.1093/bioinformatics/bti374
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2287    most recent
bti374v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kahsay, R. Y.
Right arrow Articles by Dunbrack, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kahsay, R. Y.
Right arrow Articles by Dunbrack, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Quasi-consensus-based comparison of profile hidden Markov models for protein sequences

Robel Y. Kahsay 1, Guoli Wang 2, Guang Gao 1, Li Liao 1,3,* and Roland Dunbrack 2,*

1Delaware Biotechnology Institute Newark, DE 19715, USA
2Fox Chase Cancer Center Philadelphia, PA 19111, USA
3Department of Computer and Information Sciences, University of Delaware Newark, DE 19716, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 

A simple approach for the sensitive detection of distant relationships among protein families and for sequence–structure alignment via comparison of hidden Markov models based on their quasi-consensus sequences is presented. Using a previously published benchmark dataset, the approach is demonstrated to give better homology detection and yield alignments with improved accuracy in comparison to an existing state-of-the-art dynamic programming profile–profile comparison method. This method also runs significantly faster and is therefore suitable for a server covering the rapidly increasing structure database. A server based on this method is available at http://liao.cis.udel.edu/website/servers/modmod

Contact: roland.dunbrack{at}fccc.edu; lliao{at}mail.eecis.udel.edu


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
As the number of available protein sequences increases at a much faster pace than their structure and functions can be experimentally determined, the development of more sensitive and accurate sequence analysis algorithms remains important. Evolutionary relationships implied by sequence similarity remain a major basis for computationally predicting structure and function. Despite the tremendous number of methods devised, the task of detecting distant homologies, those that share sequence similarity lower than 30%, and predicting accurate sequence–structure alignments is still quite a challenge.

Detection ability and sequence–alignment accuracy can be improved by using multiple sequence information in the form of a position-specific scoring matrix or profile (Gribskov et al., 1987; Altschul et al., 1997). In contrast to methods that try to establish homology based on similarity between a pair of protein sequences, sequence–profile methods compare an individual protein sequence to one or more profiles. In a similar vein, profile–profile methods (Pietrokovski, 1996) compare two profiles, attempting to capture remote homologies that may be otherwise missed.

Widely accepted sequence–profile comparison methods for homology detection can be put into two major categories depending on how the profiles are constructed and represented. The first category consists of methods that represent protein families as position-specific scoring matrices, obtained by transforming a multiple sequence alignment into log-odds or frequency matrices of the 20 amino acids at each position in the alignment (Gribskov et al., 1987). The second category consists of hidden Markov models (HMMs) (Krogh et al., 1994; Eddy, 1998), which have a more complex architecture designed to model the observed multiple sequence alignment, including insertion–deletion regions, probabilistically. In both categories, a sequence–profile comparison method can work in both directions: a profile query against a sequence database or a sequence query against a profile database.

To exploit the information contained in profiles, profile–profile comparison methods have been developed that compare a query profile to a database of profiles for enhanced detection sensitivity (Pietrokovski, 1996; Rychlewski et al., 2000; Yona and Levitt, 2002; Sadreyev and Grishin, 2003; Wang and Dunbrack Jr, 2004). Most of these profile–profile methods start with defining a scoring scheme between columns of the two profiles. The similarity between two profiles is then measured as the total score when aligning columns in one profile to columns in the other profile with an appropriate gap-penalty scheme, and such an alignment can be found by using the same dynamic programming (DP) technique as used in Smith–Waterman local alignment for a pair of sequences (Smith and Waterman, 1981). Several recent papers have compared many of the scoring functions previously proposed (Mittelman et al., 2003; Edger and Sjolander, 2004; Wang and Dunbrack Jr, 2004).

Because of the favorable properties of HMMs in comparison with matrix profiles (Durbin et al., 1998), it is desirable to develop methods for comparing HMMs with one another. A DP scheme to align states of two HMMs for protein families was described several years ago by Shi and States (http://www.ornl.gov/sci/techresources/Human_Genome/publicat/99santa/108.html) and applied to cluster the models in the PFAM database (Bateman et al., 2004). Another approach for comparing two HMMs based on their ‘co-emission’ probability has also been presented and applied to HMMs for signal peptides (Lyngso et al., 1999). Recently, Madera and Chothia have made the PRC program available on the SUPERFAMILY site (http://supfam.org) that compares HMMs with each other, although the details of this method are not currently published.

In this paper, we propose a simple, alternative approach for comparing two profile HMMs using their quasi-consensus sequences. These sequences are those derived by taking the most probable sequence from the highest-probability path through the HMM. This method provides rapid remote homology detection and alignment prediction.


    METHODS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Consensus-based comparison of profile HMMs
A profile HMM for a protein family can be considered as a probability distribution over an infinite space of all possible amino acid strings that assigns higher probability for the members of the family. Therefore, comparing two profile HMMs amounts to comparing two distributions of probability. A conceptual approach to comparing two HMMs Mi and Mj is to consider them as (row) vectors in the infinite space of all sequences and define a metric that measures the similarity between the vectors. If we choose the components of these infinite dimension vectors to be the probabilities or scores assigned to all sequences by the models, the models can be then represented as

where {x1, x2, x3, ..., x{infty}} is an infinite set of all sequences in an arbitrarily preset order. Although represented as vectors, the similarity measure between these vectors is not computable in practice due to the vectors' infinite size. To circumvent this problem, we consider a plausible heuristic in which we focus on comparing the two vectors by their most significant components, where the corresponding sequence receives a maximal score. Let us assume that Mi peaks at si(Ci) and Mj peaks at sj(Cj), with Ci and Cj defined as the consensus sequences for Mi and Mj respectively. We consider two models to be highly similar if their respective consensus sequences are very probable in both models. Specifically, we compare profile HMMs by defining a similarity metric given as

(1)
where nCi and nCj are the unit vectors in the directions of the consensus sequences Ci and Cj of models Mi and Mj respectively. However, this similarity metric is based on raw scores and does not take the significance of the scores si(Cj) and sj(Ci) into consideration.

One way to address the issue of score significance is to transform raw scores into z-scores, which measure how far the raw scores deviate from the distribution mean. Specifically, for a raw score si (x), its z-score zi (x) is defined as

(2)
where µi is the mean and {sigma}i is the standard deviation of the distribution {si(x1), si(x2), ..., si (x{infty})} induced by model Mi. In practice, the mean and standard deviation of this distribution over an infinite set is estimated from a finite sampling, which in most cases is naturally obtained when searching a database of profile HMMs. In other words, in our case of searching a database of profile HMMs, the infinite dimensional vector representation of a model is reduced to a vector of a finite dimension {si(C1), si (C2), ..., si (CN)} where N is equal to the database size and the Ck are the consensus sequences of the models in the database. Thus, si(Ck) is the kth component of Mi, with Ck being the consensus sequence of the kth model in the database, and can be transformed into a z-score zi(Ck) as defined above. Our z-score-based similarity measure can then be taken as

(3)
where zi (Cj) and zj (Ci) are the respective z-scores. In order to avoid the effect of the ‘self component’ si(Ci) on the above metric, Ci is excluded before calculating the mean and standard deviation.

One problem with using consensus sequences to compare profile HMMs is that the task of finding the exact consensus sequence for a given HMM is an NP-hard problem (Lyngso et al., 1999). As a substitute for the exact consensus sequence, in this work we use a quasi-consensus sequence computed by a heuristic approach using the hmmemit program from the HMMER package (Eddy, 1998) with the -c option. This is found by traversing the nodes in the profile HMM and calculating the state occupancy of the match and insert states in each node. If the match state in a node is used with probability ≥0.5, then this match state is ‘consensus’ and the maximally likely residue in the state is chosen. On the other hand, if the state occupancy says that the insert state is used with probability ≥0.5, this insert state is ‘consensus’ and is used to generate the degenerate symbol (‘X’) as many times as the expectation value of the insertion length.

Sequence alignments
To search a database of HMMs associated with known structures, we first construct an HMM for the query sequence from a multiple sequence alignment of similar proteins, for instance those generated by a PSI-BLAST search. Our goal is to find an alignment between the query sequence (as seed sequence to its own HMM) and the seed sequences for HMMs in the database. Alignments between the seed sequences of two profile HMMs being compared can be found by aligning quasi-consensus sequences to the HMMs and using simple transitivity. To illustrate this, let us assume that there are two HMMs, Mi and Mj, generated using seed sequences Si and Sj, and let Ci and Cj be quasi-consensus sequences for Mi and Mj, respectively. Our goal is to find an alignment between the seed sequences Si and Sj. To get such an alignment, we can use either of the quasi-consensus sequences as an intermediate sequence between Si and Sj. That is, we can align Ci to models Mi and Mj using a standard sequence–model alignment procedure, such as the posterior decoding or Viterbi algorithm (Durbin et al., 1998). We use the align2model program in the SAM package for this purpose (Krogh et al., 1994). The resulting SiCi and CiSj alignments can then be used to determine the SiSj alignment using transitivity. Similarly, the other consensus sequence Cj can also be used. In a later section on alignment accuracy, we describe how to combine these two alignments to reach a single final alignment between Si and Sj.

Benchmark datasets
In a previous paper on comparison of different profile–profile alignment scoring functions (Wang and Dunbrack Jr, 2004), we built two datasets for benchmarking alignment accuracy and homology detection based on consensus DALI (Holm and Sander, 1995) and CE (Shindyalov and Bourne, 1998) structural alignments of domains defined in SCOP (Murzin et al., 1995). These were defined as a ‘Set A’ of 3441 pairs of sequences for experiments on sequence alignment accuracy and a ‘Set S’ of 665 sequences for assessing sequence search sensitivity and specificity. In this work, we reduced ‘Set S’ to 569 sequences. We used multiple sequence alignments of proteins homologous to those in Sets A and S from our previous work (Wang and Dunbrack Jr, 2004). Profile HMMs were constructed from these multiple sequence alignments using the SAM package (Karplus et al., 1998) with the w0.5 script. Then, for each of the constructed models, quasi-consensus sequences were generated using the hmmemit program from the HMMER package (Eddy, 1998).

For accuracy, following Wang and Dunbrack Jr (2004), about one-third of those sequence pairs in Set A were selected randomly to be used as a training set (1136 pairs), with the rest serving as a testing set (2305 pairs). The testing set was further split into eight groups, each corresponding to a range of sequence percent identity. The distribution of sequence pairs over the eight ranges is shown in Table 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Number of alignment pairs in each range of sequence identity

 
We compare our method with the COMPASS program (Sadreyev and Grishin, 2003). COMPASS was used as described in its documentation. The same multiple sequence alignments used to construct HMMs were provided to COMPASS for profile–profile alignments. For COMPASS, all versus all comparisons were made by comparing each of the 569 multiple sequence alignments (as queries) to a database of the 569 profiles (created from the 569 multiple sequence alignments using the COMPASS package program mk_compass_db) using the compass_vs_db program.


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Evaluation of homolog detection ability
We performed all versus all comparison experiments on the reduced ‘Set S’ using three approaches: profile–profile alignments with COMPASS (Sadreyev and Grishin, 2003), quasi-consensus sequence-based HMM–HMM alignment, and seed sequence-based HMM–HMM alignment. In the quasi-consensus sequence-based HMM–HMM alignment, each model was used to score the quasi-consensus sequences of all other models in the set using the hmmscore program in the SAM package (with -sw 3 option and reverse-sequence null model) and the resulting vector of 568 scores was normalized into a z-score vector. The symmetric form (using both terms) and asymmetric form (using the first term) of Equation (3) were then used as a measure of similarity between a pair of models Mi and Mj. The same procedure was used for the seed sequence-based HMM–HMM alignment by scoring seed sequences. For COMPASS, all versus all comparisons were made by comparing each of the 569 multiple sequence alignments (as queries) to a database of the 569 profiles. Each list of 568 query–database comparison scores were then transformed into z-scores and the same symmetric and asymmetric similarity measures of Equation (3) were used. We also used the E-values for COMPASS as similarity measures. However, for the quasi-consensus- and seed sequence-based comparison of HMMs, we found that using the E-value from the hmmscore program of the SAM package gave poor performance (data not shown).

The criterion used for deciding whether two models are related or not is based on whether the seeds used to generate these models are from the same SCOP fold or not. As in our earlier work (Wang and Dunbrack Jr, 2004), if two proteins were defined as having a Rossmann fold in SCOP, we considered them as true positives, even if they were in different SCOP ‘fold’ categories. However, we found that avoiding this assumption on the Rossmann fold affects the results for all cases in this benchmark the same way (data not shown). Detection performance was measured by using the receiver operator characteristic (ROC), which plots the true positive count (Y-axis) as a function of false positive count (X-axis) when scanning through the output sorted by score. Receiver operator characteristic curves offer a combined measure of both sensitivity and specificity (Gribskov and Robinson, 1996). For numeric quantification of these curves, we calculated the ROC100 score, which is the normalized area under each curve up to the first 100 false positives. The ROC curves and ROC100 scores for different approaches are reported in Figure 1 and Table 2 respectively.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 1 ROC curves. The legend sym/asym indicates whether the symmetric (both terms) or asymmetric (only the first term) is used in Equation (3). QUASI is when quasi-consensus sequences are used for HMM–HMM comparison and SEED is when seed sequences are used.

 

View this table:
[in this window]
[in a new window]
 
Table 2 ROC100 (normalized areas under the curves in Figure 1) values for the three approaches: COMPASS, seed sequence-based HMM comparison and quasi-consensus sequence-based HMM comparisona

 
As shown in Table 2 and Figure 1, the results for COMPASS show that using the symmetric Z measure as given in Equation (3) does better than using the E-value measure in terms of search specificity and are comparable in terms of search sensitivity. However, we see that the symmetric Z measure is much better than the asymmetric Z measure. In both the consensus and seed-based comparisons, the symmetric Z measure is much better than the asymmetric one. As shown in the last plot of Figure 1 and Table 2, the consensus-based comparison of HMMs achieves better performance than the other two approaches—COMPASS and seed sequence-based HMM comparison. The symmetric ROC100 score for the quasi-consensus method is 0.915 while COMPASS achieves a value of 0.883. More importantly, the quasi-consensus method is much faster and is shown to produce more accurate alignments, as discussed below.

Evaluation of alignment accuracy
The accuracy of the predicted alignments is quantified by three parameters QModeler, QDeveloper and QCombined as proposed in earlier work (Sauder et al., 2000; Yona and Levitt, 2002). The accuracy parameter from the modeler's point of view, QModeler, is the ratio of the number of correctly aligned pairs of residues to the total number of aligned positions in the alignment being evaluated. The parameter from the developer's point of view, QDeveloper, is the ratio of the number of correctly aligned pairs to the number of aligned positions in the structural alignment. The combined accuracy parameter, QCombined, is the total number of correct matches divided by the total number of positions that are aligned in either the structural alignment or the alignment being evaluated. Aligned pairs that are in both the sequence and the structure alignment are counted only once in the denominator of QCombined.

In this experiment, for every pair in the testing part of Set A (2305 pairs), we created two alignments by using each of the two consensus sequences as a hinge as described earlier. We then combined these two alignments into a final alignment according to the following two schemes. The first scheme (‘MAX’) simply picks one alignment, as a whole, out of the two alignments. The alignment that is picked has the higher score when the quasi-consensus sequence was aligned to the HMM. The second scheme (called ‘AND’ selection) is to take segments that are common in both alignments. The final alignment thus obtained misses out segments where the two alignments disagree. Three variations to this AND-selection are suggested on which alignment of these uncommon segments shall be picked to patch the final alignment. In the first variation (‘AND1’ selection), for a region where the two alignments disagree, we choose the alignment that contains fewer gap positions. As a refinement of AND1 selection, the second variation (‘AND2’ selection) calculates a BLOSUM score for each of the two alignments at an uncommon segment and picks the alignment that has a larger score. The third variation (‘AND3’ selection) chooses for a non-common segment the alignment that agrees better with the structural alignment for that segment. However, the AND3 selection is not usable when the correct ‘answer’, i.e. the structural alignment, is not available. The AND3 was the best alignment one could obtain with a method that was always able to tell which alignment was better.

The quality of the final alignments obtained by using these selection schemes and by using COMPASS was evaluated, and the accuracy parameters were averaged in the eight regions of sequence identity range as listed in Table 1. The results are reported in Figure 2. We observe that regions in the two alignments that agree with one another (AND) are more accurate per residue of the alignment (QModeler) than those that include regions where the alignments disagree (AND1, AND2 and AND3). However, the latter alignments are also considerably longer and do provide some correct information, resulting in higher QDeveloper scores. We see that, with the exception of the first identity range, both the MAX selection and AND2 selection outperform COMPASS in terms of QCombined.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 2 Values for averaged sequence alignment accuracy parameters QModeler, QDeveloper and QCombined. In each plot, averaged parameters for COMPASS and alignments from five selection schemes for consensus-based HMM–HMM alignments are shown.

 
We repeated the same experiment by using seed sequences instead of the quasi-consensus sequences to compare the HMMs. In this case, a seed sequence of one model is aligned to the other model, and vice versa. From the two resulting alignments, a final alignment was generated by using the selection schemes discussed above. The accuracy for these seed-based alignments is shown in Figure 3. The results show that the seed-sequence method does not behave well at very low sequence identities, but it does better than COMPASS at higher sequence identities.



View larger version (27K):
[in this window]
[in a new window]
 
Fig. 3 Values for averaged sequence alignment accuracy parameters QModeler, QDeveloper and QCombined. In each plot, averaged parameters for COMPASS and alignments from five selection schemes for seed-based HMM–HMM alignments are shown.

 
The results of these experiments as shown in Figures 2 and 3 indicate that it is advisable to use consensus-based alignments in the lower sequence identity regions and seed-based alignments in the higher sequence identity regions. Because of this, we need to decide whether to choose a seed-based or a consensus-based alignment depending on the symmetric z-score between the two HMMs as given in Equation (3). Using the training part of Set A (1136 pairs), we found that a value of –22 resulted in the highest alignment accuracy. Below this cutoff (i.e. more negative scores) we used the seed-based alignment and above this cutoff we used the consensus-based. The results of this mixed scheme on the testing set of Set A (2305) pairs are shown in Figure 4 along with the COMPASS results.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 4 Values for averaged sequence alignment accuracy parameters QModeler, QDeveloper and QCombined. In each plot, averaged parameters for COMPASS and alignments from five selection schemes for optimal mixture of seed-based and consensus-based HMM–HMM alignments are shown. The mixing approach is discussed in the text.

 
Statistical significance
To provide an estimation of the significance of hits produced by the server, we calculated the probability of a false positive at each value of the symmetric z-score [Equation (3)]. This was performed on a set of 7924 profiles in the SUPERFAMILY database (Gough and Chothia, 2002). All unique HMM–HMM comparisons for this set were performed using the quasi-consensus method, and the results were sorted by symmetric z-scores. For a particular value of the symmetric z-score, the proportion of false positives with a score equal to or greater than that score was calculated. The results are plotted in Figure 5. In this figure, F(zsym) = –log10 [P(zsym)] is plotted as a function of the symmetric z-score. Because it is not possible to estimate the P-value until the first false positive is reached (as zsym decreases from infinity), the values in the figure represent an upper bound on the value of P and a lower bound on F. The points are fitted to a polynomial of degree 8, so that with a correlation coefficient of 0.98. This function is used only to calculate the statistical significance for any value of zsym. No meaning is attached to the parameter values.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 5 Probability of a false positive as a function of the symmetric z-score from an all-against-all comparison of 7924 models from the SUPERFAMILY database. The curve is a polynomial fit to the data and has no explicit meaning otherwise.

 
Algorithmic complexity
Like most profile–profile methods, COMPASS uses a DP technique to align two files of size n and m columns where a column is represented by a distribution over amino acids. Consider two profiles A = [a1 a2 ··· am] and B = [b1 b2 ··· bn] where ai is the ith column in profile A and bj is the jth column in profile B. To align A and B with affine gap scores, we need to fill three n-by-m DP matrices as given by

where s(ai, bj) is the substitution score for columns ai and bj, and d and e are the gap opening and gap extension penalties respectively. This substitution score s(ai, bj) for each pair of columns (ai, bj) needs to be computed on the fly and this is an expensive computation. Similarly, in our consensus-based approach, to align a sequence of length n to a profile HMM with m nodes using the align2model program in the SAM package with the adpstyle 5 option (posterior decoding), we need to fill six n-by-m DP matrices (three forward, three backward) as given in Durbin et al. (1998). However, in this case, there is no need to compute the substitution score between the ith residue in the sequence and the jth match/insert state in the HMM since this is simply given as the emission probability/score of the residue in the state. Thus, even though both profile–profile alignment and sequence–HMM alignment have the same algorithmic complexity of O(nm), the need to compute the substitution matrix makes profile–profile methods computationally expensive.

As described above, generating a quasi-consensus sequence requires just a ‘walk’ through the chain of states in the model and the time complexity for this step is O(n), where n is the number of match states in the model. Moreover, in the task of searching a database of profile HMMs, quasi-consensus sequences for all the models in the database can be generated and stored. In our experiment running on a system with a Pentium 1.13 GHz CPU and 376 Mb of RAM, it took COMPASS about 38 h to compute the all-versus-all comparison of 569 protein families, but it took our approach only about 1 h to do the same.


    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
In this work, we propose a method for comparing profile HMMs via quasi-consensus sequences. While finding a true consensus sequence for a given profile HMM is NP-hard, quasi-consensus sequences seem to be a good surrogate to the true consensus sequences. We compare two profile HMMs by using quasi-consensus sequences from one model and score it against the other model and vice versa, and the similarity between the models is measured by a metric defined based on such comparisons. We also proposed aligning seed sequences of these profile HMMs using quasi-consensus sequences as hinges. To test the utility and effectiveness of our method, we used a benchmark dataset from our earlier published work. Our experiments demonstrate that our method outperforms a state-of-the-art profile–profile method, COMPASS, at identifying protein families with similar structures as defined at the SCOP fold level. The alignments generated by our method are more accurate than those generated by COMPASS, as measured against three established benchmark metrics. Moreover, unlike COMPASS, our approach is basically a sequence–profile comparison and does not require the computation of a substitution matrix and hence runs much faster.

With a slightly better accuracy and a considerably faster speed, our approach should prove to be a good alternative to current profile–profile methods for use in database searches against proteins of known structure. As the size of the Protein Data Bank increases, this advantage is significant.

A server based on this method is available at http://liao.cis.udel.edu/website/servers/modmod


    Acknowledgments
 
This publication was made possible by NIH Grant P20 RR-15588 from the COBRE Program of the National Center for Research Resources (L.L.), DuPont Science & Engineering award (L.L.), NIH Grant R01-HG02302 to R.L.D., and funds from the Pennsylvania Tobacco Settlement (R.L.D. and G.W.).

Received on August 26, 2004; revised on February 14, 2005; accepted on March 2, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 

    Altschul, S.F., et al. (1997) Gapped blast and psi-blast: a new generation of database programs. Nucleic Acids Res., 25, 3389–3402[Abstract/Free Full Text].

    Bateman, A., et al. (2004) The pfam protein families database. Nucleic Acids Res., 32, D138–D141[Abstract/Free Full Text].

    Durbin, R., Eddy, S., Krogh, A., Mitchison, G. Biological Sequence Analysis, (1998) , Cambridge, UK Cambridge University Press.

    Eddy, S.R. (1998) Profile hidden markov models. Bioinformatics, 14, 755–763[Abstract/Free Full Text].

    Edgar, R.C. and Sjolander, K. (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 1301–1308[Abstract/Free Full Text].

    Gough, J. and Chothia, C. (2002) Superfamily: Hmms representing all proteins of known structure. Scop sequence searches, alignments and genome assignments. Nucleic Acids Res., 30, 268–272[Abstract/Free Full Text].

    Gribskov, M. and Robinson, N.L. (1996) Use of receiver-operator characteristic (roc) analysis to evaluate sequence matching. Comput. Chem., 20, 25–33[CrossRef][ISI][Medline].

    Gribskov, M., et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358[Abstract/Free Full Text].

    Holm, L. and Sander, C. (1995) Dali: a network tool for protein structure comparison. Trend. Biochem. Sci., 20, 478–480[CrossRef][ISI][Medline].

    Karplus, K., et al. (1998) Hidden markov models for detecting remote protein homologies. Bioinformatics, 14, 846–856[Abstract/Free Full Text].

    Krogh, A., et al. (1994) Hidden markov models in computational biology. Applications to protein modeling. J. Mol. Biol., 235, 1501–1531[CrossRef][ISI][Medline].

    Lyngso, R.B., et al. (1999) Metrics and similarity measures for hidden markov models. Proc. Int. Conf. Intell. Syst. Mol. Biol., 178–186.

    Mittelman, D., et al. (2003) Probabilistic scoring measures for profile–profile comparison yield more accurate short seed alignments. Bioinformatics, 19, 1531–1539[Abstract/Free Full Text].

    Murzin, A.G., et al. (1995) Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540[CrossRef][ISI][Medline].

    Pietrokovski, S. (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res., 24, 3836–3845[Abstract/Free Full Text].

    Rychlewski, L., et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci., 9, 232–241[Abstract].

    Sadreyev, R. and Grishin, N. (2003) Compass: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336[CrossRef][ISI][Medline].

    Sauder, J.M., et al. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6–22[CrossRef][ISI][Medline].

    Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (ce) of the optimal path. Protein Eng., 11, 739–747[Abstract/Free Full Text].

    Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197[CrossRef][ISI][Medline].

    Wang, G. and Dunbrack, R.L., Jr. (2004) Scoring profile–profile alignments. Protein Sci., 13, 1612–1626[Abstract/Free Full Text].

    Yona, G. and Levitt, M. (2002) Within the twilight zone: a sensitive profile–profile comparison tool based on information theory. J. Mol. Biol., 315, 1257–1275[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
D. Przybylski and B. Rost
Powerful fusion: PSI-BLAST and consensus sequences
Bioinformatics, September 15, 2008; 24(18): 1987 - 1993.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
R. I. Sadreyev and N. V. Grishin
Accurate statistical model of comparison between multiple sequence alignments
Nucleic Acids Res., April 1, 2008; 36(7): 2240 - 2248.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
D. Przybylski and B. Rost
Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments
Nucleic Acids Res., April 1, 2007; 35(7): 2238 - 2246.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2287    most recent
bti374v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kahsay, R. Y.
Right arrow Articles by Dunbrack, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kahsay, R. Y.
Right arrow Articles by Dunbrack, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?