Skip Navigation

Bioinformatics 2007 23(2):e17-e23; doi:10.1093/bioinformatics/btl297
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rangwala, H.
Right arrow Articles by Karypis, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rangwala, H.
Right arrow Articles by Karypis, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

Biological Sequence Analysis

Incremental window-based protein sequence alignment algorithms

Huzefa Rangwala * and George Karypis

Department of Computer Science & Engineering, University of Minnesota Minneapolis, MN 55455, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Materials
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

Motivation: Protein sequence alignment plays a critical role in computational biology as it is an integral part in many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling.

Methods: We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO.

Results: We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also showed that these algorithms can be used to provide better information as to which of the aligned positions are more reliable—a critical piece of information for comparative modeling applications.

Contact: rangwala{at}cs.umn.edu

Supplementary information: http://bioinfo.cs.umn.edu/supplements/win-aln/


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Materials
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
Alignment algorithms serve as the most basic sequence analysis methods in computational biology and have a wide range of applications dealing with sequence database searching, comparative modeling, protein structure and function prediction.

The current state-of-the-art sequence alignment algorithms have a well-defined optimal dynamic programming-based solution, introduced decades ago. These optimal algorithms, Smith–Waterman (1981) and Needleman–Wunsch (1970), solve the local and global sequence alignment problems, respectively. Over the years, alignment methods have advanced with several variations of the optimal alignment method, use of gap modeling techniques (Gusfield, 1997), heuristics (Altschul et al., 1990; Pearson and Lipman, 1988), and more recently the use of profile (Gribskov and Robinson, 1996; Edgar and Sjolander, 2004b; Altschul, 1997) and structure information (Jones et al., 1992).

In recent years, there has been a considerable research effort in developing kernel-based methods for building discriminatory models for remote homology detection and fold recognition. This research has led to the development of a number of protein string kernels that determine the similarity between a pair of proteins as a function of the number of sufficiently similar short subsequences that they share. These string kernels have proven to be extremely effective in building very accurate models, and these methods are among the best performing schemes for remote homology prediction and fold recognition (Leslie et al., 2002; Kuang et al., 2004; Rangwala and Karypis, 2005).

Motivated by these developments in string kernels, the work in this paper is designed to address the question as to the extent to which, ideas motivated by these string kernels can be used to build alignments between a pair of sequences. Toward this goal, we developed a set of window-based alignment algorithms that are heuristic in nature. Our methods incrementally constructed the alignment by using the highest scoring pairs of residues between the two sequences at each step. The residue-pair scoring was borrowed from string-kernel theory where to score the residue-pairs in consideration, we examined short subsequences and referred to a wmers centered around each of the two residues. We introduced several heuristics to identify aligned residue-pairs using the wmers coupled with profile information.

We determined the quality of our alignment methods by evaluation on a template-based (Edgar and Sjolander, 2004b; Sauder, 2000) and a model-based dataset (Elofsson, 2002; Cristobal, 2001). Our empirical results on the two datasets showed the competitive performance of our introduced schemes to state-of-the-art methods. We also evaluated our methods by determining the reliability of the aligned positions (Jaroszewski et al., 2002; Cline et al., 2002; Schlosshauer and Ohlsson, 2002; Mevissen and Vingron, 1996; Tress, 2003). The positive results for some of our alignment algorithms on such a reliability metric are very encouraging due to far reaching applications, like comparative modeling.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Materials
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
2.1 Sequence profiles and profile scoring
The alignment algorithms that we developed take advantage of evolutionary information by utilizing PSI-BLAST (Altschul et al., 1997) generated sequence profiles.

The profile of a sequence X of length m is represented by two m x 20 matrices. The first is its position-specific scoring matrix Formula that is computed directly by PSI-BLAST using the scheme described in (Altschul et al., 1997). The rows of this matrix correspond to the various positions in X and the columns correspond to the 20 distinct amino acids. The second matrix is its position-specific frequency matrix Formula that contains the frequencies used by PSI-BLAST to derive Formula. These frequencies [also referred to as target frequencies (Mittelman et al., 2003)] contain both the sequence-weighted observed frequencies [also referred to as effective frequencies (Mittelman et al., 2003)] and the BLOSUM62 (Henikoff and Henikoff, 1992) derived-pseudocounts (Altschul et al., 1997).

Many different schemes have been developed for determining the similarity between profiles that combine information from the original sequence, position-specific scoring matrix, or position-specific target and/or effective frequencies (Mittelman et al., 2003; Wang and Dunbrack, 2004; Marti-Renom et al., 2004). In our work we use a scheme that is derived from PICASSO (Heger and Holm, 2001; Mittelman et al., 2003) that was recently used in developing effective remote homology detection and fold recognition algorithms (Rangwala and Karypis, 2005). Specifically, the similarity score between the i-th position of protein's X profile and the j-th position of protein's Y profile is given by

Formula 1(1)
where Formula 1 and Formula 1 are the values corresponding to the l-th amino acid at the i-th position of X's position-specific scoring and frequency matrices. Formula 1 and Formula 1 are defined in a similar fashion.

2.2 Window-based alignments
The overall methodology of the alignment algorithms developed in this work is to incrementally construct the alignment by using various heuristics to identify the pairs of aligned residues. The key idea shared by these algorithms is that they determine whether or not a pair of residues should be aligned together by examining the (short) subsequences, referred to as wmers, that are centered around each of the two residues.

Given a sequence X of length m and a user-supplied parameter w, the wmer at position i of X (Formula 1) is defined to be the Formula 1-length subsequence of X centered at position i. That is, the wmer contains xi, the w amino acids before, and the w amino acids after xi. A pair of wmers are compared by computing their ungapped alignment scores. Given two sequences X and Y, the ungapped alignment score, wscore(xi, yj), between a pair of wmers at positions i and j of X and Y, respectively, is given by

Formula 2(2)
where Formula 2 is the alignment score between Formula 2 and Formula 2 and is computed using Equation (1).

2.2.1 Central alignment scheme
This is the simplest alignment algorithm that we developed and it computes the alignment by progressively aligning the pairs of residues that have the highest positive wscore values subject to the constraint that they do not conflict with the portion of the alignment that has been constructed thus far.

Specifically, given two sequences X and Y of length m and n, respectively, and a value for w, it starts by computing the set Formula 2 of residue-pairs that are candidates for inclusion in the alignment by considering only the pairs that have positive wscore values. That is,

Formula 3(3)
where Formula 3 and Formula 3. Then it performs a series of iterations in which it performs the following three steps: first, it extracts from Sw the residue-pair with the highest wscore value Formula 3; second, it aligns Formula 3 against Formula 3; third, it removes from Sw all residue-pairs that cannot be part of a valid alignment given that Formula 3 and Formula 3 have been aligned with each other. This process terminates when Sw becomes empty. Positions that do not belong to any of the selected residue-pairs are left unaligned (i.e. aligned against spaces).

The residue-pairs that need to be removed are (1) Formula 3Formula 3, (2) Formula 3Formula 3, (3) Formula 3 Formula 3 and (4) Formula 3. The first two conditions remove from Formula 3 all residue-pairs involving Formula 3 or Formula 3, as these positions have now been aligned, whereas the last two conditions remove the residue-pairs that, if aligned, will introduce inversions in the alignment.

2.2.2 Subset alignment scheme
A limitation of the central alignment (CA) scheme is that it may leave a large number of residues unaligned because (1) it only considers the residue-pairs with positive wscores, and (2) it will not align the first and last w positions of the two sequences (Sw contains only pairs involving interior residues).

To address this problem we developed the subset alignment scheme (SA), which can be considered an extension to the CA scheme. Specifically, the SA scheme modifies the second and third steps of the CA algorithm as follows. During the second step, in addition to including the Formula 3 pair in the alignment, it also includes in the alignment all previously unaligned residue-pairs of the form Formula 3 for Formula 3. That is, it can potentially include all residue-pairs involved in Formula 3's wmer. Note that due to the incremental nature of the algorithm, the second step essentially extends the alignment around the Formula 3 residue-pair until it encounters a residue (from either X or Y) that has already been aligned. We will refer to this as the alignment extension operation. During the third step the SA algorithm removes from Sw all residue-pairs that are now in conflict with all aligned residue-pairs that were selected in second step.

2.2.3 Central and subset alignment scheme
A potential problem with the SA scheme is that it may align a pair of residues Formula 3 with each other, even when Sw contains residue-pairs with higher wscore values for either or both of the two residues. This happens, because SA's alignment extension operation extends the alignment as soon as it extracts the highest scoring residue-pair from Sw and there may be some higher-scoring wmers for these positions in Sw.

For this reason, we developed a hybrid scheme that combines the CA and SA approaches. Specifically, the new scheme first computes a CA alignment and then augments it by applying the alignment extension approach used by SA to each pair of its aligned residues.

2.2.4 Variable wmer alignment scheme
The alignment schemes, CA, SA and CSA were discussed in the context of a fixed length wmer. The potential drawback of this scheme is that if w is set to a relatively large value, it may fail to identify positive scoring subsequences; whereas if it is set too low, it may fail to reward residue-pairs that have relatively long similar subsequences.

For this reason we extended the algorithms to also operate with variable length wmers. The key difference, from the use of fixed length wmers centered around residue-pairs Formula 3 and Formula 3, is the fact that we define length Formula 3 in the range of 1 to w, such that

Formula 4(4)
where Kscore is the Formula 4 subsequence score as defined in Equation (2).

Our alignment schemes start by computing the set Formula 4 of residue-pairs that are candidates for inclusion in the alignment by considering only pairs that have positive Formula 4 values. With this change all steps of our alignment algorithms remain the same. Note that the SA scheme using the variable-length wmers will have its alignment extension operation extended till a maximum length of Formula 4.

As a notation reference we denote the variable Formula 4 alignment algorithms by Formula 4, Formula 4 and Formula 4 to distinguish them from the fixed wmer alignment algorithms denoted in this study by Formula 4, Formula 4 and Formula 4.


    3 Materials
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Materials
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
3.1 Evaluation methodologies and metrics
We evaluated the performance of the proposed window-based alignment algorithms by considering (1) the quality of the alignment itself and (2) the extent to which the inherent ordering of the aligned pairs of residues can be used to identify portions of the alignment that are more reliable than others. In order to assess alignment quality we used two widely used methodologies, often referred to as template based (Edgar and Sjolander, 2004b) and model based (Elofsson, 2002), whereas the reliability was assessed by following a methodology that was recently proposed in the context of comparative modeling (Tress et al., 2003).

3.1.1 Template-based approach
The first method for evaluating alignment quality compares the differences between the alignment generated to template alignments (Edgar and Sjolander, 2004b; Sauder et al., 2000; Elofsson, 2002). These template alignments are generally derived from various structural alignment programs and are considered to be the gold standard.

We use three quality scores, namely the developer's score (fD) (Sauder et al., 2000), the modeler's score (fM) (Sauder et al., 2000) and the Cline score (CS) (Cline et al., 2002) to compare the template alignments with the generated alignments. The developer's score is the number of correctly aligned residue-pairs in the generated alignment divided by the length of the template alignment. (The length of an alignment is defined as the number of aligned residue-pairs.) The modeler's score computes the ratio of correctly aligned residue-pairs with the length of the generated alignment. The Cline score was developed to address the issues with Formula 4 and Formula 4 by penalizing both under-alignment and over-alignment, and also crediting regions in the generated alignment that may be shifted by a few positions relative to the reference alignment (Edgar and Sjolander, 2004b; Cline et al., 2002). The steps for computation of the Cline score can be found in the study (Cline et al., 2002).

Note that the fD and fM scores are equivalent to the more traditional measures of recall and precision (Fawcett, 2004), respectively, that are used extensively to measure prediction performance. In the rest of the discussion we will primarily refer to fD and fM by the more intuitive names of recall and precision, respectively.

3.1.2 Model-based approach
An alternative to using a template-based approach is to build a structural model from the alignment and to evaluate the similarity between the model and the template structure (Elofsson, 2002; Panchenko et al., 2000). Starting from the alignment between a pair of proteins (one protein considered to be the query protein, the second considered to be the target protein whose 3D structure is known), a model protein is created which consists of the carbon alpha, Formula 4 atoms of the query protein. The atomic coordinates of this model protein are the atomic coordinates of the target protein i.e. for every aligned pair of residues, the query protein has its Formula 4 atomic coordinates replaced by the corresponding atomic coordinates of the target protein. The similarity between the two structures (the model protein and target protein) after a structural superimposition (Levitt and Gerstein, 1998), is used as an assessment of sequence alignment quality. In our study, we computed this similarity using the LGscore (Cristobal et al., 2001) that takes into account the common segments between the pair of proteins. The LGscore measure was used to evaluate the structures obtained by threading methods (Panchenko et al., 2000) in the CAFASP2 (Fischer et al., 2001) and LiveBench (Bujnicki et al., 2001) experiments as well as a sequence alignment quality measure (Elofsson, 2002).

3.1.3 Reliability of aligned regions
In comparative modeling and several other applications, it is essential not only to align residue-pairs but also to provide some reliability index or confidence measure associated with the aligned residue-pairs. While building protein structure models using comparative modeling strategies it is important to include only those regions where the alignment is considered to be good or reliable (Jaroszewski, 2002; Cline et al., 2002; Schlosshauer and Ohlsson, 2002; Mevissen and Vingron, 1996; Tress et al., 2003).

Using the template-based benchmarks we evaluated the reliability of the aligned residue-pairs by ranking the aligned pairs in the query alignment. We score the aligned positions using fixed length wscores. The reliability measure is computed as the recall at different percent levels of incorrectly aligned residue-pairs (up to 5%). The notion of a hit is defined as having the same aligned residue-pairs in both the query and template alignments. Our approach of using the wscores was similar to a method that used a smoothed profile-derived score for reliability assessment (Tress et al., 2003).

3.2 Datasets
For the template-based assessment scheme we used a dataset created to evaluate the various profile-to-profile scoring functions for protein sequence alignment (Edgar and Sjolander, 2004b). The dataset consists of 588 reference alignment pairs having high structural similarity but low sequence identity (≤30%). This dataset was selected to have a high pairwise structural similarity using the consensus of FSSP (Holm and Sander, 1996) and CE (Shindyalov and Bourne, 1998).

For the model-based evaluation scheme, we used a benchmark created from SCOP 1.39 filtered to only contain domains with <50% pairwise sequence identity (Elofsson, 2002). This dataset contains 9983 protein domain pairs, such that 1903 belong to the same families, 3101 share only the same superfamily and 4979 share only the same fold. Due to the non-symmetrical nature of models built from alignments, each pair of sequences were evaluated twice—leading to a benchmark of 19 966 domain pairs.

3.3 Profile generation
The position-specific score and frequency matrices used by the profile-based scoring method of Equation (1) were generated using the latest version of the PSI-BLAST algorithm (available in NCBI's blast release 2.2.10), and were derived from the multiple sequence alignment constructed after five iterations using an e-value of 10–3. The PSI-BLAST was performed against NCBI's nr database that was downloaded in November of 2004 and contained 2 171 938 sequences.


    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Materials
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
4.1 Assessment of incremental window-based alignments
Table 1 provides an extensive set of results illustrating the performance of the CA, SA and CSA schemes on the template-based dataset for different values of w and for fixed- and variable-length wmers. Note that the column labeled ‘Formula 4’ shows the CS results for the subset of sequence-pairs that have <15% sequence identity (i.e. a subset that is inherently harder to align well).


View this table:
[in this window]
[in a new window]

 
Table 1 Alignment accuracy results on a template-based dataset

 
4.1.1 Central versus subset versus combined
The results of Table 1 show that with respect to the CS scores, SA tends to perform better than either CA or CSA, whereas CA performs consistently the worst. The only exception is for variable-length wmers, in which SA's performance is comparable to that of CSA. The relative advantage of SA is more evident if we consider the subset of sequence-pairs with <15% sequence identity, for which its CS scores are consistently higher than those achieved by the other schemes (SA achieves a score of 0.649 whereas CA and CSA achieve scores of 0.614 and 0.628, respectively).

By looking at the performance of the various schemes in terms of recall, we can see that SA's higher CS-based performance is due to the fact that it achieves significantly better recall values than the other schemes. This was to be expected, as it was one of the motivation behind the development of SA. Also, the precision-based results show that CA achieves somewhat better precisions than CSA, whereas SA's precision is comparable with or better than that of the other schemes.

4.1.2 Fixed versus variable-length alignments
Analyzing the performance of alignment methods that use fixed length wmers compared with the methods that use variable-length wmers, we notice that for the CA and CSA schemes, for the same wmer length the recall as well as the precision scores have higher values. Note that the higher recall is expected, because the methods using a variable wmer size window will have a higher flexibility in allowing larger number of wmers (with a positive score) to be picked for the candidate set Formula 4.

Another key observation is that Formula 4 performs better in terms of recall than Formula 4. This is because for the same value of w, the Formula 4 value selected by Formula 4 may be smaller than w (i.e. the value used by Formula 4). As a result, Formula 4 's alignment extension operations will involve longer windows, which can produce longer alignments than Formula 4, and thus higher recall values.

4.1.3 Sensitivity of schemes with respect to varying wmer size
Looking at the performance achieved by the various schemes in Table 1 as w ranges from two to five, we see that in general, SA's and CSA's performance does not significantly change (e.g. CS scores stay within a tight range), whereas Formula 4's performance tends to deteriorate with increasing w. This latter behavior is due to the fact that as we increase the Formula 4 size, fewer wmers will have a positive score and hence will not be included as part of the set Formula 4. We see a direct effect of this leading to a decrease in the recall scores. Also increase in the Formula 4 size does lead to a decrease in precision score as well. This is because for a larger wmer window the positive scoring wmers may not be due to the more ‘central’ positions. Evidence of this can be seen by comparing the behavior of the Formula 4 scheme in which both the precision and recall scores stay the same.

Another key observation is that the schemes that utilize variable-length wmers tend to perform better for larger values of w. This is because of the flexibility associated with using a variable-length wmer.

4.2 Comparison with earlier results
4.2.1 Template-based benchmark
Table 2 shows the comparative performance of our window-based schemes against some of the best profile-to-profile scoring techniques studied previously (Edgar and Sjolander, 2004b). In the table we show results for the schemes pdotp, correlp and coach. pdotp uses dot product to compute the similarity between two profiles, correlp computes the Pearson correlation between the profile columns, whereas coach (Edgar and Sjolander, 2004a) uses an asymmetrical complex dot product between the HMM profile and a position frequency matrix.


View this table:
[in this window]
[in a new window]

 
Table 2 Comparative performance with earlier results on template-based dataset

 
We show results of these schemes as published previously (Edgar and Sjolander, 2004b) using SAM T99 profiles [(The performance of these alignment methods using SAM T99 profiles is 3–4% better than the PSI-BLAST-based profiles (Edgar and Sjolander, 2004b)]. Our methods show comparable performance to these alignment methods using SAM T99 templates.

We also compare the results of the window-based alignment methods to a local Smith–Waterman (1981) alignment algorithm implementation (SW-PSSM) using the same profile-to-profile scoring function as used for the window-based alignments [Equation (1)]. Within this local alignment framework we use an affine gap model along with a zero-shift parameter (Wang and Dunbrack, 2004) to maintain certain necessary requirements of a good optimal alignment. We optimize the gap modeling parameters [gap opening (go), gap extension (ge)] and the zero-shift value (zs) to obtain highly optimal alignments for comparative purposes.

We observe in Table 2 that the incremental window-based alignment schemes perform very competitively when compared with our fully optimized SW-PSSM implementation. Also notice the superiority of our optimized SW-PSSM implementation to the alignment methods using pdotp, correlp and coach as their profile–profile scoring functions. The difference in the SW-PSSM results with the other standard alignment techniques may be due to the use of a more sensitive PICASSO-based profile-to-profile scoring function. Further, these results verify that we are comparing our novel window-based alignment methods to a fully optimized SW-PSSM alignment algorithm.

The performance of the window-based scheme is actually very promising. We select one of the better performing schemes (Formula 4) and compare it with the optimized SW-PSSM algorithm using the CS score. Figure 1 shows that the comparative performance of the two methods across the 588 alignment pairs in the dataset.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Cline score comparison of SW-PSSM scheme against SAf scheme for the 588 alignment pairs in the template-based dataset.

 
4.2.2 Model-based benchmark
We also performed a comparative study of alignment quality using the average log LGscore based on the model-based benchmark. Our results in Table 3 reiterate the closeness in performance of the incremental window-based alignment method to the highly optimized SW-PSSM alignment algorithm for the family-, superfamily- and fold-level subsets. For this dataset too, we performed a thorough parameter study by varying wmer lengths for our alignment schemes. We observed similar results as seen in Table 1 for the template-based dataset. (Owing to space constraints we do not show the detailed results on the model-based dataset here.)


View this table:
[in this window]
[in a new window]

 
Table 3 Comparative performance with earlier results on a model-based dataset

 
Table 3 also shows results for the optimized local (local sequence alignment using a global scoring matrix), global (global sequence alignment using a global scoring matrix), PSI [3D-PSSM (Kelley et al., 2000) based global sequence alignment against a profile (Gribskov et al., 1987) obtained from PSI-BLAST], SSPSI (Elofsson, 2002) (3D-PSSM-based global sequence alignment against a profile obtained from PSI-BLAST using secondary structure information) and structural (alignment using structural superimposition by LGscore2) alignment methods published previously (Elofsson, 2002). The structural alignment sets up a higher reference quality score for the benchmark. Using sequence alignment techniques we would like to achieve these high levels of accuracy. The results shown in Table 3 for the various previously published schemes, as well as for our methods, are the best achieved after optimization of the various parameters.

We further analyze the data by annotating a model as being correct based on the LGscore value. As done in the study (Elofsson, 2002) we use the less strict LGscore cutoff (10–3) to define a correct model and a more stringent cutoff (10–5) to identify models of higher quality. The percentage of models correct based on these cutoffs are shown in Table 4. Both the incremental window-based alignment methods, as well as the SW-PSSM alignment method, are able to pick the correct models with similar degrees of accuracy. Our techniques also seem to identify a higher percentage of correct models when compared with the previously studied schemes, especially PSI and SSPSI, both of which also incorporate some profile information. As seen from Table 4 our methods are able to pick a larger fraction of higher quality models for the family and superfamily levels.


View this table:
[in this window]
[in a new window]

 
Table 4 Fraction of correct models based on the LGscore

 
4.2.3 Reliability performance
Table 5 shows the reliability performance for the window-based alignment schemes in comparison to the optimized SW-PSSM-based alignment scheme. These results correspond to the average recall scores obtained for all the alignment pairs at different error rates using the procedure described in Section 3.1.3.


View this table:
[in this window]
[in a new window]

 
Table 5 Reliability assessment: recall for the first k% errors

 
Though the SW-PSSM algorithm showed slightly better performance in terms of the overall alignment quality (Tables 2 and 3), it is interesting to note the window-based schemes using variable-length wmers showed far better performance at the lower error rates. In particular before seeing any incorrect predictions in the ranked aligned positions, the alignment methods using variable-length wmers have a recall around 0.260 compared with the recall of 0.205 for the SW-PSSM algorithm. Note that the recall performance of the CSA scheme is slightly better than the CA scheme and slightly worse compared with the SA alignment scheme. These results can be explained by the fact that the high-scoring residue-pairs aligned by CA are also aligned by the CSA scheme.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Materials
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 
In this study we developed algorithms that identify the aligned pairs of residues using an incremental approach. These algorithms capture the most similar pairs of subsequences as part of the final alignment. The concepts from string-kernel theory (use of ungapped subsequences, scored using profiles) played an integral role in the design of these alignment algorithms.

Our comprehensive experimental study on the template-based and model-based benchmark datasets showed comparable performance to a fully optimized Smith–Waterman profile-based implementation. In terms of the reliability performance of the aligned residue-pairs we notice that the alignment schemes using variable-length wmers had very promising results. Among the window-based schemes we noticed that the subset alignment, SA using both the fixed and variable wmers showed the best performance. The sensitivity analysis done by varying the wmer size showed the SA schemes to have a robust performance.

The simplicity of our methods and competitive alignment quality as well as aligned region reliability will lead to the application of our algorithms in key bioinformatic problems, especially comparative modeling.


    Acknowledgments
 
We would like to express our deepest thanks to Professor Arne Elofsson and Dr Robert C. Edgar for providing us their datasets and codes for the study. This work was supported by NSF EIA-9986042, ACI-0133464, IIS-0431135, NIH RLM008713A, the Army High Performance Computing Research Center contract number DAAD19-01-2-0014 and by the Digital Technology Center at the University of Minnesota.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Materials
 4 RESULTS
 5 CONCLUSION
 REFERENCES
 

    Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403–410[CrossRef][Web of Science][Medline].

    Altschul, S.F., et al. (1997) Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res, . 25, 3389–3402[Abstract/Free Full Text].

    Bujnicki, J., et al. (2001) Livebench: continuous benchmarking of protein structure prediction servers. Protein Sci, . 10, 352–361[CrossRef][Web of Science][Medline].

    Cline, M., et al. (2002) Predicting reliable regions in protein sequence alignments. Bioinformatics, 18, 306–314[Abstract/Free Full Text].

    Cristobal, S., et al. (2001) A study of quality measured for protein threading models. BMC Bioinformatics, 2, 5[CrossRef][Medline].

    Edgar, R. and Sjolander, K. (2004a) Coach: profile-profile alignment of protein families using hidden Markov models. Bioinformatics, 20, 1309–1318[Abstract/Free Full Text].

    Edgar, R. and Sjolander, K. (2004b) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 1301–1308[Abstract/Free Full Text].

    Elofsson, A. (2002) A study on protein sequence alignment quality. Proteins, 46, 330–339[CrossRef][Web of Science][Medline].

    Fawcett, T. (2004) Roc graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Labs.

    Fischer, D., et al. (2001) Cafasp2: the second critical assessment of fully automated structure prediction methods. Proteins, 45, 171–183[CrossRef].

    Gribskov, M. and Robinson, N. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem, . 20, 25–33[CrossRef][Web of Science][Medline].

    Gribskov, M., et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci USA, 84, 4355–4358[Abstract/Free Full Text].

    Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, (1997) , New York, NY Cambridge University Press.

    Heger, A. and Holm, L. (2001) Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272–279[Abstract/Free Full Text].

    Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919[Abstract/Free Full Text].

    Holm, L. and Sander, C. (1996) Mapping the protein universe. Science, 273, 595–602[Abstract/Free Full Text].

    Jaroszewski, L., et al. In search for more accurate alignments in the twilight zone. Protein Sci, . 11, 1702–1713.

    Jones, D.T., et al. (1992) A new approach to protein fold recognition. Nature, 358, 86–89[CrossRef][Medline].

    Kelley, L., et al. (2000) Enhanced genome annotation using structural profiles in the program 3d-PSSM. J. Mol. Biol, . 299, 523–544[CrossRef].

    Kuang, R., et al. (2004) Profile-based string kernels for remote homology detection and motif extraction. Comput. Syst. Bioinform, ., pp. 152–160.

    Leslie, C., et al. (2002) The spectrum kernel: a string kernel for SVM protein classification. Pac. Symp. Biocomput, . 564–575.

    Levitt, M. and Gerstein, M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. USA, 95, 5913–5920[Abstract/Free Full Text].

    Marti-Renom, M., et al. (2004) Alignment of protein sequences by their profiles. Protein Sci, . 13, 1071–1087[CrossRef][Web of Science][Medline].

    Mevissen, H.T. and Vingron, M. (1996) Quantifying the local reliability of sequence alignment. Protein Eng, . 9, 127–132[Abstract/Free Full Text].

    Mittelman, D., et al. (2003) Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 19, 1531–1539[Abstract/Free Full Text].

    Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol, . 48, 443–453[CrossRef][Web of Science][Medline].

    Panchenko, A., et al. (2000) Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol, . 296, 1319–1331[CrossRef][Web of Science][Medline].

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448[Abstract/Free Full Text].

    Rangwala, H. and Karypis, G. (2005) Profile based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21, 4239–4247[Abstract/Free Full Text].

    Sauder, J.M., et al. (2000) Large-scale comparison of protein sequence alignments with structural alignments. Proteins, 40, 6–22[CrossRef][Web of Science][Medline].

    Schlosshauer, M. and Ohlsson, M. (2002) A novel approach to local reliability of sequence alignments. Bioinformatics, 18, 847–854[Abstract/Free Full Text].

    Shindyalov, I. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng, . 11, 739–747[Abstract/Free Full Text].

    Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195–197[CrossRef][Web of Science][Medline].

    Tress, M.L., et al. (2003) Predicting reliable regions in protein alignments from sequence profiles. J. Mol. Biol, . 330, 705–718[CrossRef][Web of Science][Medline].

    Wang, G. and Dunbrack, R.L., Jr. (2004) Scoring profile-to-profile sequence alignments. Protein Sci, . 13, 1612–1626[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rangwala, H.
Right arrow Articles by Karypis, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rangwala, H.
Right arrow Articles by Karypis, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?