Biological Sequence Analysis
Incremental window-based protein sequence alignment algorithms
Department of Computer Science & Engineering, University of Minnesota Minneapolis, MN 55455, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Protein sequence alignment plays a critical role in computational biology as it is an integral part in many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling.
Methods: We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO.
Results: We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also showed that these algorithms can be used to provide better information as to which of the aligned positions are more reliablea critical piece of information for comparative modeling applications.
Contact: rangwala{at}cs.umn.edu
Supplementary information: http://bioinfo.cs.umn.edu/supplements/win-aln/
| 1 INTRODUCTION |
|---|
|
|
|---|
Alignment algorithms serve as the most basic sequence analysis methods in computational biology and have a wide range of applications dealing with sequence database searching, comparative modeling, protein structure and function prediction.
The current state-of-the-art sequence alignment algorithms have a well-defined optimal dynamic programming-based solution, introduced decades ago. These optimal algorithms, SmithWaterman (1981) and NeedlemanWunsch (1970), solve the local and global sequence alignment problems, respectively. Over the years, alignment methods have advanced with several variations of the optimal alignment method, use of gap modeling techniques (Gusfield, 1997), heuristics (Altschul et al., 1990; Pearson and Lipman, 1988), and more recently the use of profile (Gribskov and Robinson, 1996; Edgar and Sjolander, 2004b; Altschul, 1997) and structure information (Jones et al., 1992).
In recent years, there has been a considerable research effort in developing kernel-based methods for building discriminatory models for remote homology detection and fold recognition. This research has led to the development of a number of protein string kernels that determine the similarity between a pair of proteins as a function of the number of sufficiently similar short subsequences that they share. These string kernels have proven to be extremely effective in building very accurate models, and these methods are among the best performing schemes for remote homology prediction and fold recognition (Leslie et al., 2002; Kuang et al., 2004; Rangwala and Karypis, 2005).
Motivated by these developments in string kernels, the work in this paper is designed to address the question as to the extent to which, ideas motivated by these string kernels can be used to build alignments between a pair of sequences. Toward this goal, we developed a set of window-based alignment algorithms that are heuristic in nature. Our methods incrementally constructed the alignment by using the highest scoring pairs of residues between the two sequences at each step. The residue-pair scoring was borrowed from string-kernel theory where to score the residue-pairs in consideration, we examined short subsequences and referred to a wmers centered around each of the two residues. We introduced several heuristics to identify aligned residue-pairs using the wmers coupled with profile information.
We determined the quality of our alignment methods by evaluation on a template-based (Edgar and Sjolander, 2004b; Sauder, 2000) and a model-based dataset (Elofsson, 2002; Cristobal, 2001). Our empirical results on the two datasets showed the competitive performance of our introduced schemes to state-of-the-art methods. We also evaluated our methods by determining the reliability of the aligned positions (Jaroszewski et al., 2002; Cline et al., 2002; Schlosshauer and Ohlsson, 2002; Mevissen and Vingron, 1996; Tress, 2003). The positive results for some of our alignment algorithms on such a reliability metric are very encouraging due to far reaching applications, like comparative modeling.
| 2 METHODS |
|---|
|
|
|---|
2.1 Sequence profiles and profile scoring
The alignment algorithms that we developed take advantage of evolutionary information by utilizing PSI-BLAST (Altschul et al., 1997) generated sequence profiles.
The profile of a sequence X of length m is represented by two m x 20 matrices. The first is its position-specific scoring matrix
that is computed directly by PSI-BLAST using the scheme described in (Altschul et al., 1997). The rows of this matrix correspond to the various positions in X and the columns correspond to the 20 distinct amino acids. The second matrix is its position-specific frequency matrix
that contains the frequencies used by PSI-BLAST to derive
. These frequencies [also referred to as target frequencies (Mittelman et al., 2003)] contain both the sequence-weighted observed frequencies [also referred to as effective frequencies (Mittelman et al., 2003)] and the BLOSUM62 (Henikoff and Henikoff, 1992) derived-pseudocounts (Altschul et al., 1997).
Many different schemes have been developed for determining the similarity between profiles that combine information from the original sequence, position-specific scoring matrix, or position-specific target and/or effective frequencies (Mittelman et al., 2003; Wang and Dunbrack, 2004; Marti-Renom et al., 2004). In our work we use a scheme that is derived from PICASSO (Heger and Holm, 2001; Mittelman et al., 2003) that was recently used in developing effective remote homology detection and fold recognition algorithms (Rangwala and Karypis, 2005). Specifically, the similarity score between the i-th position of protein's X profile and the j-th position of protein's Y profile is given by
![]() | (1) |
and
are the values corresponding to the l-th amino acid at the i-th position of X's position-specific scoring and frequency matrices.
and
are defined in a similar fashion.
2.2 Window-based alignments
The overall methodology of the alignment algorithms developed in this work is to incrementally construct the alignment by using various heuristics to identify the pairs of aligned residues. The key idea shared by these algorithms is that they determine whether or not a pair of residues should be aligned together by examining the (short) subsequences, referred to as wmers, that are centered around each of the two residues.
Given a sequence X of length m and a user-supplied parameter w, the wmer at position i of X (
) is defined to be the
-length subsequence of X centered at position i. That is, the wmer contains xi, the w amino acids before, and the w amino acids after xi. A pair of wmers are compared by computing their ungapped alignment scores. Given two sequences X and Y, the ungapped alignment score, wscore(xi, yj), between a pair of wmers at positions i and j of X and Y, respectively, is given by
![]() | (2) |
is the alignment score between
and
and is computed using Equation (1).
2.2.1 Central alignment scheme
This is the simplest alignment algorithm that we developed and it computes the alignment by progressively aligning the pairs of residues that have the highest positive wscore values subject to the constraint that they do not conflict with the portion of the alignment that has been constructed thus far.
Specifically, given two sequences X and Y of length m and n, respectively, and a value for w, it starts by computing the set
of residue-pairs that are candidates for inclusion in the alignment by considering only the pairs that have positive wscore values. That is,
![]() | (3) |
and
. Then it performs a series of iterations in which it performs the following three steps: first, it extracts from Sw the residue-pair with the highest wscore value
; second, it aligns
against
; third, it removes from Sw all residue-pairs that cannot be part of a valid alignment given that
and
have been aligned with each other. This process terminates when Sw becomes empty. Positions that do not belong to any of the selected residue-pairs are left unaligned (i.e. aligned against spaces).
The residue-pairs that need to be removed are (1) 
, (2) 
, (3)
and (4)
. The first two conditions remove from
all residue-pairs involving
or
, as these positions have now been aligned, whereas the last two conditions remove the residue-pairs that, if aligned, will introduce inversions in the alignment.
2.2.2 Subset alignment scheme
A limitation of the central alignment (CA) scheme is that it may leave a large number of residues unaligned because (1) it only considers the residue-pairs with positive wscores, and (2) it will not align the first and last w positions of the two sequences (Sw contains only pairs involving interior residues).
To address this problem we developed the subset alignment scheme (SA), which can be considered an extension to the CA scheme. Specifically, the SA scheme modifies the second and third steps of the CA algorithm as follows. During the second step, in addition to including the
pair in the alignment, it also includes in the alignment all previously unaligned residue-pairs of the form
for
. That is, it can potentially include all residue-pairs involved in
's wmer. Note that due to the incremental nature of the algorithm, the second step essentially extends the alignment around the
residue-pair until it encounters a residue (from either X or Y) that has already been aligned. We will refer to this as the alignment extension operation. During the third step the SA algorithm removes from Sw all residue-pairs that are now in conflict with all aligned residue-pairs that were selected in second step.
2.2.3 Central and subset alignment scheme
A potential problem with the SA scheme is that it may align a pair of residues
with each other, even when Sw contains residue-pairs with higher wscore values for either or both of the two residues. This happens, because SA's alignment extension operation extends the alignment as soon as it extracts the highest scoring residue-pair from Sw and there may be some higher-scoring wmers for these positions in Sw.
For this reason, we developed a hybrid scheme that combines the CA and SA approaches. Specifically, the new scheme first computes a CA alignment and then augments it by applying the alignment extension approach used by SA to each pair of its aligned residues.
2.2.4 Variable wmer alignment scheme
The alignment schemes, CA, SA and CSA were discussed in the context of a fixed length wmer. The potential drawback of this scheme is that if w is set to a relatively large value, it may fail to identify positive scoring subsequences; whereas if it is set too low, it may fail to reward residue-pairs that have relatively long similar subsequences.
For this reason we extended the algorithms to also operate with variable length wmers. The key difference, from the use of fixed length wmers centered around residue-pairs
and
, is the fact that we define length
in the range of 1 to w, such that
![]() | (4) |
subsequence score as defined in Equation (2).
Our alignment schemes start by computing the set
of residue-pairs that are candidates for inclusion in the alignment by considering only pairs that have positive
values. With this change all steps of our alignment algorithms remain the same. Note that the SA scheme using the variable-length wmers will have its alignment extension operation extended till a maximum length of
.
As a notation reference we denote the variable
alignment algorithms by
,
and
to distinguish them from the fixed wmer alignment algorithms denoted in this study by
,
and
.
| 3 Materials |
|---|
|
|
|---|
3.1 Evaluation methodologies and metrics
We evaluated the performance of the proposed window-based alignment algorithms by considering (1) the quality of the alignment itself and (2) the extent to which the inherent ordering of the aligned pairs of residues can be used to identify portions of the alignment that are more reliable than others. In order to assess alignment quality we used two widely used methodologies, often referred to as template based (Edgar and Sjolander, 2004b) and model based (Elofsson, 2002), whereas the reliability was assessed by following a methodology that was recently proposed in the context of comparative modeling (Tress et al., 2003).
3.1.1 Template-based approach
The first method for evaluating alignment quality compares the differences between the alignment generated to template alignments (Edgar and Sjolander, 2004b; Sauder et al., 2000; Elofsson, 2002). These template alignments are generally derived from various structural alignment programs and are considered to be the gold standard.
We use three quality scores, namely the developer's score (fD) (Sauder et al., 2000), the modeler's score (fM) (Sauder et al., 2000) and the Cline score (CS) (Cline et al., 2002) to compare the template alignments with the generated alignments. The developer's score is the number of correctly aligned residue-pairs in the generated alignment divided by the length of the template alignment. (The length of an alignment is defined as the number of aligned residue-pairs.) The modeler's score computes the ratio of correctly aligned residue-pairs with the length of the generated alignment. The Cline score was developed to address the issues with
and
by penalizing both under-alignment and over-alignment, and also crediting regions in the generated alignment that may be shifted by a few positions relative to the reference alignment (Edgar and Sjolander, 2004b; Cline et al., 2002). The steps for computation of the Cline score can be found in the study (Cline et al., 2002).
Note that the fD and fM scores are equivalent to the more traditional measures of recall and precision (Fawcett, 2004), respectively, that are used extensively to measure prediction performance. In the rest of the discussion we will primarily refer to fD and fM by the more intuitive names of recall and precision, respectively.
3.1.2 Model-based approach
An alternative to using a template-based approach is to build a structural model from the alignment and to evaluate the similarity between the model and the template structure (Elofsson, 2002; Panchenko et al., 2000). Starting from the alignment between a pair of proteins (one protein considered to be the query protein, the second considered to be the target protein whose 3D structure is known), a model protein is created which consists of the carbon alpha,
atoms of the query protein. The atomic coordinates of this model protein are the atomic coordinates of the target protein i.e. for every aligned pair of residues, the query protein has its
atomic coordinates replaced by the corresponding atomic coordinates of the target protein. The similarity between the two structures (the model protein and target protein) after a structural superimposition (Levitt and Gerstein, 1998), is used as an assessment of sequence alignment quality. In our study, we computed this similarity using the LGscore (Cristobal et al., 2001) that takes into account the common segments between the pair of proteins. The LGscore measure was used to evaluate the structures obtained by threading methods (Panchenko et al., 2000) in the CAFASP2 (Fischer et al., 2001) and LiveBench (Bujnicki et al., 2001) experiments as well as a sequence alignment quality measure (Elofsson, 2002).
3.1.3 Reliability of aligned regions
In comparative modeling and several other applications, it is essential not only to align residue-pairs but also to provide some reliability index or confidence measure associated with the aligned residue-pairs. While building protein structure models using comparative modeling strategies it is important to include only those regions where the alignment is considered to be good or reliable (Jaroszewski, 2002; Cline et al., 2002; Schlosshauer and Ohlsson, 2002; Mevissen and Vingron, 1996; Tress et al., 2003).
Using the template-based benchmarks we evaluated the reliability of the aligned residue-pairs by ranking the aligned pairs in the query alignment. We score the aligned positions using fixed length wscores. The reliability measure is computed as the recall at different percent levels of incorrectly aligned residue-pairs (up to 5%). The notion of a hit is defined as having the same aligned residue-pairs in both the query and template alignments. Our approach of using the wscores was similar to a method that used a smoothed profile-derived score for reliability assessment (Tress et al., 2003).
3.2 Datasets
For the template-based assessment scheme we used a dataset created to evaluate the various profile-to-profile scoring functions for protein sequence alignment (Edgar and Sjolander, 2004b). The dataset consists of 588 reference alignment pairs having high structural similarity but low sequence identity (
30%). This dataset was selected to have a high pairwise structural similarity using the consensus of FSSP (Holm and Sander, 1996) and CE (Shindyalov and Bourne, 1998).
For the model-based evaluation scheme, we used a benchmark created from SCOP 1.39 filtered to only contain domains with <50% pairwise sequence identity (Elofsson, 2002). This dataset contains 9983 protein domain pairs, such that 1903 belong to the same families, 3101 share only the same superfamily and 4979 share only the same fold. Due to the non-symmetrical nature of models built from alignments, each pair of sequences were evaluated twiceleading to a benchmark of 19 966 domain pairs.
3.3 Profile generation
The position-specific score and frequency matrices used by the profile-based scoring method of Equation (1) were generated using the latest version of the PSI-BLAST algorithm (available in NCBI's blast release 2.2.10), and were derived from the multiple sequence alignment constructed after five iterations using an e-value of 103. The PSI-BLAST was performed against NCBI's nr database that was downloaded in November of 2004 and contained 2 171 938 sequences.
| 4 RESULTS |
|---|
|
|
|---|
4.1 Assessment of incremental window-based alignments
Table 1 provides an extensive set of results illustrating the performance of the CA, SA and CSA schemes on the template-based dataset for different values of w and for fixed- and variable-length wmers. Note that the column labeled
shows the CS results for the subset of sequence-pairs that have <15% sequence identity (i.e. a subset that is inherently harder to align well).
|
4.1.1 Central versus subset versus combined
The results of Table 1 show that with respect to the CS scores, SA tends to perform better than either CA or CSA, whereas CA performs consistently the worst. The only exception is for variable-length wmers, in which SA's performance is comparable to that of CSA. The relative advantage of SA is more evident if we consider the subset of sequence-pairs with <15% sequence identity, for which its CS scores are consistently higher than those achieved by the other schemes (SA achieves a score of 0.649 whereas CA and CSA achieve scores of 0.614 and 0.628, respectively).
By looking at the performance of the various schemes in terms of recall, we can see that SA's higher CS-based performance is due to the fact that it achieves significantly better recall values than the other schemes. This was to be expected, as it was one of the motivation behind the development of SA. Also, the precision-based results show that CA achieves somewhat better precisions than CSA, whereas SA's precision is comparable with or better than that of the other schemes.
4.1.2 Fixed versus variable-length alignments
Analyzing the performance of alignment methods that use fixed length wmers compared with the methods that use variable-length wmers, we notice that for the CA and CSA schemes, for the same wmer length the recall as well as the precision scores have higher values. Note that the higher recall is expected, because the methods using a variable wmer size window will have a higher flexibility in allowing larger number of wmers (with a positive score) to be picked for the candidate set
.
Another key observation is that
performs better in terms of recall than
. This is because for the same value of w, the
value selected by
may be smaller than w (i.e. the value used by
). As a result,
's alignment extension operations will involve longer windows, which can produce longer alignments than
, and thus higher recall values.
4.1.3 Sensitivity of schemes with respect to varying wmer size
Looking at the performance achieved by the various schemes in Table 1 as w ranges from two to five, we see that in general, SA's and CSA's performance does not significantly change (e.g. CS scores stay within a tight range), whereas
's performance tends to deteriorate with increasing w. This latter behavior is due to the fact that as we increase the
size, fewer wmers will have a positive score and hence will not be included as part of the set
. We see a direct effect of this leading to a decrease in the recall scores. Also increase in the
size does lead to a decrease in precision score as well. This is because for a larger wmer window the positive scoring wmers may not be due to the more central positions. Evidence of this can be seen by comparing the behavior of the
scheme in which both the precision and recall scores stay the same.
Another key observation is that the schemes that utilize variable-length wmers tend to perform better for larger values of w. This is because of the flexibility associated with using a variable-length wmer.
4.2 Comparison with earlier results
4.2.1 Template-based benchmark
Table 2 shows the comparative performance of our window-based schemes against some of the best profile-to-profile scoring techniques studied previously (Edgar and Sjolander, 2004b). In the table we show results for the schemes pdotp, correlp and coach. pdotp uses dot product to compute the similarity between two profiles, correlp computes the Pearson correlation between the profile columns, whereas coach (Edgar and Sjolander, 2004a) uses an asymmetrical complex dot product between the HMM profile and a position frequency matrix.
|
We show results of these schemes as published previously (Edgar and Sjolander, 2004b) using SAM T99 profiles [(The performance of these alignment methods using SAM T99 profiles is 34% better than the PSI-BLAST-based profiles (Edgar and Sjolander, 2004b)]. Our methods show comparable performance to these alignment methods using SAM T99 templates.
We also compare the results of the window-based alignment methods to a local SmithWaterman (1981) alignment algorithm implementation (SW-PSSM) using the same profile-to-profile scoring function as used for the window-based alignments [Equation (1)]. Within this local alignment framework we use an affine gap model along with a zero-shift parameter (Wang and Dunbrack, 2004) to maintain certain necessary requirements of a good optimal alignment. We optimize the gap modeling parameters [gap opening (go), gap extension (ge)] and the zero-shift value (zs) to obtain highly optimal alignments for comparative purposes.
We observe in Table 2 that the incremental window-based alignment schemes perform very competitively when compared with our fully optimized SW-PSSM implementation. Also notice the superiority of our optimized SW-PSSM implementation to the alignment methods using pdotp, correlp and coach as their profileprofile scoring functions. The difference in the SW-PSSM results with the other standard alignment techniques may be due to the use of a more sensitive PICASSO-based profile-to-profile scoring function. Further, these results verify that we are comparing our novel window-based alignment methods to a fully optimized SW-PSSM alignment algorithm.
The performance of the window-based scheme is actually very promising. We select one of the better performing schemes (
) and compare it with the optimized SW-PSSM algorithm using the CS score. Figure 1 shows that the comparative performance of the two methods across the 588 alignment pairs in the dataset.
|
4.2.2 Model-based benchmark
We also performed a comparative study of alignment quality using the average log LGscore based on the model-based benchmark. Our results in Table 3 reiterate the closeness in performance of the incremental window-based alignment method to the highly optimized SW-PSSM alignment algorithm for the family-, superfamily- and fold-level subsets. For this dataset too, we performed a thorough parameter study by varying wmer lengths for our alignment schemes. We observed similar results as seen in Table 1 for the template-based dataset. (Owing to space constraints we do not show the detailed results on the model-based dataset here.)
|
Table 3 also shows results for the optimized local (local sequence alignment using a global scoring matrix), global (global sequence alignment using a global scoring matrix), PSI [3D-PSSM (Kelley et al., 2000) based global sequence alignment against a profile (Gribskov et al., 1987) obtained from PSI-BLAST], SSPSI (Elofsson, 2002) (3D-PSSM-based global sequence alignment against a profile obtained from PSI-BLAST using secondary structure information) and structural (alignment using structural superimposition by LGscore2) alignment methods published previously (Elofsson, 2002). The structural alignment sets up a higher reference quality score for the benchmark. Using sequence alignment techniques we would like to achieve these high levels of accuracy. The results shown in Table 3 for the various previously published schemes, as well as for our methods, are the best achieved after optimization of the various parameters.
We further analyze the data by annotating a model as being correct based on the LGscore value. As done in the study (Elofsson, 2002) we use the less strict LGscore cutoff (103) to define a correct model and a more stringent cutoff (105) to identify models of higher quality. The percentage of models correct based on these cutoffs are shown in Table 4. Both the incremental window-based alignment methods, as well as the SW-PSSM alignment method, are able to pick the correct models with similar degrees of accuracy. Our techniques also seem to identify a higher percentage of correct models when compared with the previously studied schemes, especially PSI and SSPSI, both of which also incorporate some profile information. As seen from Table 4 our methods are able to pick a larger fraction of higher quality models for the family and superfamily levels.
|
4.2.3 Reliability performance
Table 5 shows the reliability performance for the window-based alignment schemes in comparison to the optimized SW-PSSM-based alignment scheme. These results correspond to the average recall scores obtained for all the alignment pairs at different error rates using the procedure described in Section 3.1.3.
|
Though the SW-PSSM algorithm showed slightly better performance in terms of the overall alignment quality (Tables 2 and 3), it is interesting to note the window-based schemes using variable-length wmers showed far better performance at the lower error rates. In particular before seeing any incorrect predictions in the ranked aligned positions, the alignment methods using variable-length wmers have a recall around 0.260 compared with the recall of 0.205 for the SW-PSSM algorithm. Note that the recall performance of the CSA scheme is slightly better than the CA scheme and slightly worse compared with the SA alignment scheme. These results can be explained by the fact that the high-scoring residue-pairs aligned by CA are also aligned by the CSA scheme.
| 5 CONCLUSION |
|---|
|
|
|---|
In this study we developed algorithms that identify the aligned pairs of residues using an incremental approach. These algorithms capture the most similar pairs of subsequences as part of the final alignment. The concepts from string-kernel theory (use of ungapped subsequences, scored using profiles) played an integral role in the design of these alignment algorithms.
Our comprehensive experimental study on the template-based and model-based benchmark datasets showed comparable performance to a fully optimized SmithWaterman profile-based implementation. In terms of the reliability performance of the aligned residue-pairs we notice that the alignment schemes using variable-length wmers had very promising results. Among the window-based schemes we noticed that the subset alignment, SA using both the fixed and variable wmers showed the best performance. The sensitivity analysis done by varying the wmer size showed the SA schemes to have a robust performance.
The simplicity of our methods and competitive alignment quality as well as aligned region reliability will lead to the application of our algorithms in key bioinformatic problems, especially comparative modeling.
| Acknowledgments |
|---|
We would like to express our deepest thanks to Professor Arne Elofsson and Dr Robert C. Edgar for providing us their datasets and codes for the study. This work was supported by NSF EIA-9986042, ACI-0133464, IIS-0431135, NIH RLM008713A, the Army High Performance Computing Research Center contract number DAAD19-01-2-0014 and by the Digital Technology Center at the University of Minnesota.
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403410[CrossRef][ISI][Medline].
Altschul, S.F., et al. (1997) Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res, . 25, 33893402
Bujnicki, J., et al. (2001) Livebench: continuous benchmarking of protein structure prediction servers. Protein Sci, . 10, 352361
Cline, M., et al. (2002) Predicting reliable regions in protein sequence alignments. Bioinformatics, 18, 306314
Cristobal, S., et al. (2001) A study of quality measured for protein threading models. BMC Bioinformatics, 2, 5[CrossRef][Medline].
Edgar, R. and Sjolander, K. (2004a) Coach: profile-profile alignment of protein families using hidden Markov models. Bioinformatics, 20, 13091318
Edgar, R. and Sjolander, K. (2004b) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 13011308
Elofsson, A. (2002) A study on protein sequence alignment quality. Proteins, 46, 330339[CrossRef][ISI][Medline].
Fawcett, T. (2004) Roc graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Labs.
Fischer, D., et al. (2001) Cafasp2: the second critical assessment of fully automated structure prediction methods. Proteins, 45, 171183[CrossRef].
Gribskov, M. and Robinson, N. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem, . 20, 2533[CrossRef][ISI][Medline].
Gribskov, M., et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci USA, 84, 43554358
Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, (1997) , New York, NY Cambridge University Press.
Heger, A. and Holm, L. (2001) Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272279
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 1091510919
Holm, L. and Sander, C. (1996) Mapping the protein universe. Science, 273, 595602
Jaroszewski, L., et al. In search for more accurate alignments in the twilight zone. Protein Sci, . 11, 17021713.
Jones, D.T., et al. (1992) A new approach to protein fold recognition. Nature, 358, 8689[CrossRef][Medline].
Kelley, L., et al. (2000) Enhanced genome annotation using structural profiles in the program 3d-PSSM. J. Mol. Biol, . 299, 523544[CrossRef].
Kuang, R., et al. (2004) Profile-based string kernels for remote homology detection and motif extraction. Comput. Syst. Bioinform, ., pp. 152160.
Leslie, C., et al. (2002) The spectrum kernel: a string kernel for SVM protein classification. Pac. Symp. Biocomput, . 564575.
Levitt, M. and Gerstein, M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. USA, 95, 59135920
Marti-Renom, M., et al. (2004) Alignment of protein sequences by their profiles. Protein Sci, . 13, 10711087
Mevissen, H.T. and Vingron, M. (1996) Quantifying the local reliability of sequence alignment. Protein Eng, . 9, 127132
Mittelman, D., et al. (2003) Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 19, 15311539
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol, . 48, 443453[CrossRef][ISI][Medline].
Panchenko, A., et al. (2000) Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol, . 296, 13191331[CrossRef][ISI][Medline].
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 24442448
Rangwala, H. and Karypis, G. (2005) Profile based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21, 42394247
Sauder, J.M., et al. (2000) Large-scale comparison of protein sequence alignments with structural alignments. Proteins, 40, 622[CrossRef][ISI][Medline].
Schlosshauer, M. and Ohlsson, M. (2002) A novel approach to local reliability of sequence alignments. Bioinformatics, 18, 847854
Shindyalov, I. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng, . 11, 739747
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195197[CrossRef][ISI][Medline].
Tress, M.L., et al. (2003) Predicting reliable regions in protein alignments from sequence profiles. J. Mol. Biol, . 330, 705718[CrossRef][ISI][Medline].
Wang, G. and Dunbrack, R.L., Jr. (2004) Scoring profile-to-profile sequence alignments. Protein Sci, . 13, 16121626
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




