Bioinformatics Advance Access originally published online on December 10, 2004
Bioinformatics 2005 21(8):1421-1428; doi:10.1093/bioinformatics/bti198
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties
Centre for Bioinformatics and Biological Computing, Murdoch University Murdoch, WA 6150, Australia
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Although pairwise sequence alignment is essential in comparative genomic sequence analysis, it has proven difficult to precisely determine the gap penalties for a given pair of sequences. A common practice is to employ default penalty values. However, there are a number of problems associated with using gap penalties. First, alignment results can vary depending on the gap penalties, making it difficult to explore appropriate parameters. Second, the statistical significance of an alignment score is typically based on a theoretical model of non-gapped alignments, which may be misleading. Finally, there is no way to control the number of gaps for a given pair of sequences, even if the number of gaps is known in advance.
Results: In this paper, we develop and evaluate the performance of an alignment technique that allows the researcher to assign a priori set of the number of allowable gaps, rather than using gap penalties. We compare this approach with the SmithWaterman and NeedlemanWunsch techniques on a set of structurally aligned protein sequences. We demonstrate that this approach outperforms the other techniques, especially for short sequences (56133 residues) with low similarity (<25%). Further, by employing a statistical measure, we show that it can be used to assess the quality of the alignment in relation to the true alignment with the associated optimal number of gaps.
Availability: The implementation of the described methods SANK_AL is available at http://cbbc.murdoch.edu.au/
Contact: matthew{at}cbbc.murdoch.edu.au
| INTRODUCTION |
|---|
|
|
|---|
Pairwise sequence alignment is one of the most fundamental tools for comparing DNA and protein sequences. It establishes the basis for the interpretation of evolutionary and functional relationships between gene sequences and species. Through their seminal work in 1970, Needleman and Wunsch (1970) provided the catalyst to numerous other profound contributions to sequence alignment. To date, the most successful and widely understood algorithm is the SmithWaterman technique (Smith and Waterman, 1981), which can detect similar segments in genomic sequences with maximal score if the compared sequences are closely related. More recently, there have been a number of modifications to the SmithWaterman technique as well as novel approaches that are based on structural information (Geourjon et al., 2001; Wallqvist et al., 2000), profile information (Durbin et al., 1998; Gough et al., 2001) and evolutionary models of insertions and deletions (Knudsen and Miyamoto, 2003; Irie et al., 1999).
Synonymous with these alignment techniques are gap penalties that must be prescribed in order to obtain optimal or near optimal alignments. However, it has proven difficult to precisely determine the penalties that should be selected for a given pair of sequences. There are a number of unavoidable problems when employing gap penalties. Firstly, sequence alignment results vary depending on the penalties associated with the initiation and extension of gaps, which attempt to model insertion and/or deletion events that have occurred in either or both sequences through the course of their evolution. Biologists often find it difficult to determine the gap parameters in order to obtain the true biological alignment. It is well known that artificially assigned default gap parameters are not selective and/or sensitive enough for all types of alignments (Arslan et al., 2001; Apostolico and Giancarlo, 1998; Morgenstern, 1999; Pearson, 1995; Roytberg, 1998). Secondly, the alignments with gaps do not always produce correct E- and P-value statistics. These statistics are based on a solid theoretical foundation for local alignments without gaps (Dembo et al., 1994; Karlin and Altschul, 1990). According to numerous computational experiments and some analytic results (Altschul and Gish, 1996; Altschul et al., 1990; Arratia and Waterman, 1994; Collins et al., 1988; Mott, 1992; Smith et al., 1985; Pearson, 1988; Waterman and Vingron, 1994a, b), it has been suggested that the formula for calculating statistics of ungapped alignments should be applied to gapped alignment as well. However, the statistics of gapped alignments do not exactly fit the theoretical distribution (extreme value distribution). Therefore, very small E- and P-values may not be reliable. Finally, there is no way to control the number of gaps to be assigned for a given pair of sequences even if the number of gaps is known in advance. This problem has been highlighted previously (Bellgard et al., 2003; Elleman, 1978; Sankoff and Cedergren, 1973; Zhu et al., 1997).
There is an alignment program, DIALIGN 2, which does not employ any gap penalties (Morgenstern, 1999). This approach uses segment-to-segment comparison rather than residue-to-residue comparison and alignments are composed from gap-free segment pairs. However, DIALIGN 2 requires user-defined parameters to determine the minimum size of segment and the significance between two segments' pairs and one must therefore specify appropriate parameters.
In a previous study, an approach that constrains the maximal number of allowable gaps was developed (Bellgard et al., 2003). This implementation, which is referred to as gap mapping, is a dynamic programming approach that employs an extra dimension corresponding to the number of gap constraints. Thus, for an priori defined of the maximum allowable number of gaps k, an alignment is obtained with the best alignment score with at most k gaps. Changing the number of allowable gaps can give a different alignment. Therefore, a novel way to explore optimal and near optimal alignments was developed (Bellgard et al., 2003). A search of the literature (and M. Waterman, personal communication) revealed that this implementation was related closely to an approach proposed by Sankoff (1972). Surprisingly, this approach has received little or no attention. For this reason, we have referred to our implementation as the SANK_AL approach.
In the present study, the SANK_AL implementation is evaluated and compared to both the NeedlemanWunsch algorithm (Needleman and Wunsch, 1970) as a global alignment technique and the SmithWaterman algorithm (Smith and Waterman, 1981) as a local alignment technique. A statistical significance score P-value is developed based on a Monte Carlo procedure (Lipman and Pearson, 1985; Gilks et al., 1995) and used as a measure of the alignment for a given number of gaps. The sequences used for comparison are pairs of sequences taken from structurally aligned BAliBASE for a range of protein sequence length and percentage similarity (Thompson et al., 1999a, b). We show that in the majority of cases, the SANK_AL achieves biologically meaningful alignments over the other two techniques without the need to explore gap penalty parameter space. As SANK_AL produces more than one alignment (one alignment corresponds to a given number of maximal allowable gaps), each alignment produces its own score and associated statistical significance. Interestingly, the SANK_AL technique, as we shall show, in the majority of cases, associates the correct biological alignment to a minimal P-value for a given number of allowable gaps. This form of self-evaluation appears to be intuitive and novel in that it allows for the assessment of the most appropriate alignment out of the range of possible alignments. SANK_AL is available at http://cbbc.murdoch.edu.au/
| METHODS |
|---|
|
|
|---|
Outline of SANK_AL algorithm
The SANK_AL algorithm implementation developed independently is in fact a modification to the algorithm proposed by Sankoff (1972). The difference between the SANK_AL implementation and the original Sankoff algorithm is that SANK_AL can apply not only to a given pair of DNA sequences but also to amino acid sequences as well as provide global alignments for them. The SANK_AL algorithm is described briefly below. Let {ai} (where i = 1, 2,..., m) and {bj} (where j = 1, 2,..., n) be two sequences to be aligned and k be a number of maximal allowable gaps. To match a sequence {ai} against another sequence {bj}, three m by n by (k + 1) matrices P, Q and R are constructed. The three matrices correspond to optimal alignments ending (ai, bj), (, bj) and (ai, ), respectively, where is a gap character. The matrix element in P, Q and R at position s, t, g denotes the best score obtainable by matching {ai} (where i = 1, 2,...,s) and {bj} (where j = 1, 2,...,t) including g gap regions. P[i][j][g], Q[i][j][g] and R[i][j][g] are initialized as follows:
![]() |
g
k, and
![]() |
g
k and 2
j
n, and
![]() |
g
k and 2
i
m, where s(ai, bj) is a score in a substitution matrix corresponding to ai and bj. The element
implies impossible alignments for a given i, j and g. For instance, P[1][3][2] should be
, because it is impossible to align a1 and b1b2b3 satisfying both constraints allowing two gap regions and ending (a1, b3). The element 0 implies that such alignments have no residue-to-residue pair. For instance, R[2][1][3] is derived only from the case of alignment a1a2 and b1, therefore 0 is assigned for it. Then inner elements of matrices are calculated recursively by
![]() |
i
m, 1
j
n and 0
g
k. To get an optimal alignment for a pair of sequences {ai} (where i = 1, 2, ..., m) and {bj} (where j = 1, 2, ..., n) with g gap regions, first, the highest value C among P[m][n][g], Q[m][m][g] and R[m][n][g] is selected. The matrix containing C determines whether the last pair of alignment has one of the following patterns (am, bn), (, bn) or (am, ). Then, just as in NeedlemanWunsch algorithm, matrices are traced back to find a path of elements that contribute the highest value C. Finally, the alignment is constructed by assigning (ai, bj), (, bj) or (ai, ) according to the matrices of elements in the path. Similarly, for the other number of gap regions g', an alignment can be obtained as well by finding a path of elements contributing to the highest value C' among P[m][n][g'], Q[m][m][g'] and R[m][n][g'].
In this algorithm, the number of calculations is obviously proportional to the size of matrices, in other words, the total running time of this SANK_AL algorithm is O(mnk). Note that this SANK_AL algorithm does not make use of any gap penalties or any other specific information in the process of alignment. Moreover, this algorithm can find subsequences under the given number of gap regions, in other words, satisfying insertion and/or deletion constraints.
Dataset of structure alignments
To assess the quality of our alignment algorithm, we require a standard for the true alignment. This is usually derived from alignments of protein structures. Several comparative reports of protein alignment algorithms (Sauder et al., 2000; Elofsson, 2002) use SCOP, a protein family database (Andreeva et al., 2004). However, as SCOP does not have alignment information, it is necessary to make true alignments from it by using structural alignment programs such as CE (Shindyalov and Bourne, 1998) and DALI (Holm and Sander, 1999). It is acknowledged that sequence alignments produced by two structure alignment programs may differ substantially when the sequence identity between them is low (Feng and Sippl, 1996; Godzik, 1996).
For that reason, in this study, we selected BAliBASE which is manually constructed and is not intended for any specific alignment program. It should be noted that BAliBASE has a potential bias toward global alignment programs since aligned sequences are trimmed down at the boundaries of the alignment to the core block only (Karplus and Hu, 2001). However, it has been used as a standard multiple alignment sequence set to evaluate alignment algorithms in recent studies (Thompson et al., 1999b; Karplus and Hu, 2001; Notredame et al., 2000). BAliBASE is currently available at http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/. BAliBASE is classified into eight hierarchical reference sets by the length and similarity of the sequences in the core blocks and by the presence of insertions, N/C-terminal extensions and repeats. In this paper, Reference 1 which contains 82 multiple alignment sequences was used. Reference 1 is divided into nine groups depending on the length of the sequences: short (between 56 and 133 residues), medium (182317 residues) and long (329995 residues); and the identity of sequences: <25, 2040 and >35%. For the analysis of pairwise alignment, the first and second sequences from each multiple alignment were chosen. Our test data, therefore, results in the comparison of 712 pairs of sequences in each group.
Comparison of SANK_AL with other techniques
In order to show the performance of our SANK_AL algorithm, SANK_AL (Bellgard et al., 2003) was compared with other alignment methods. As our program produces global alignments and focuses on detailed analysis of alignment for the two given sequences, we chose the NeedlemanWunsch algorithm for comparison. We also chose the SmithWaterman algorithm for a local alignment comparison. Like SANK_AL, these two algorithms can be used generically as they are not intended for specific tasks such as fold recognition or gene finding. The NeedlemanWunsch and the SmithWaterman algorithms were carried out through a local copy of EMBOSS (Rice et al., 2000). The substitution matrix for evaluating the alignments used BLOSUM50, BLOSUM62 and PAM250 and the gap penalties employed a range of parameters for initiation (9, 10, 11, 12 and 13) and extension (0.5, 1, 2, 3 and 4). Therefore, 25 combinations of gap penalties for each substitution matrix have been tested.
To measure alignment accuracy compared to the BAliBASE structural alignment, we chose to use the fD score (Sauder et al., 2000). The fD is defined to be the percentage of paired residues matched against the standard alignment. Although many other measures for assessing the significance of alignments are proposed (Levitt and Gerstein, 1998; Cristobal et al., 2001), the fD score is the simplest and most intuitive method and it has been used in many papers (Sauder et al., 2000; Blake and Cohen, 2001).
Statistics of alignments
The fD score can assess the likely biological significance of the similarity although it is also necessary to know whether the alignments themselves are statistically significant or not. In this study, statistical significance was estimated by using the Monte Carlo techniques first proposed by Lipman and Pearson (1985) and examined by Comet et al. (1999) and Bacro and Comet (2001). This method is known to be less dependent on the sequence lengths and compositions (Comet et al., 1999; Bacro and Comet, 2001). For any two given sequences a and b, this method compares one of the sequences a with the randomly shuffled second sequence b* many times. The shuffled sequences share the same amino acid composition and length with the initial sequence. For an empirical mean
and standard deviation
of the alignment score S(a, b*) estimated by a number of shuffled sequences, the Z-value is defined as
![]() |
![]() |
To obtain a converged Z-value, the number of shuffling times is recommended from 100 to 1000 (Comet et al., 1999; Aude and Louis, 2002); therefore, we have used 1000 times. Then the probability (P-value) observed at least Z-value z, P(Z > z), was calculated by
![]() |
is a Euler constant. This P-value is based on the fact that the Z-value should follow the extreme value distribution, more precisely, the Grumbel distribution. In fact, interestingly, the histogram of Z-values was well fitted with its distribution for random sequences (unpublished data). However, in practical applications, the Z-value may deviate from the Grumbel distribution when it is large. This is because real genomic sequences can be biased (Comet et al., 1999; Bacro and Comet, 2001). Therefore, P-value was used as an ordered relative measure of alignment significance rather than an absolute measure. | RESULTS |
|---|
|
|
|---|
For a given k gaps, SANK_AL produces the alignment with the best score with at most k gaps. Figure 1 shows an example of SANK_AL outputs allowing from one gap through to five gaps. It is worth noting that these alignments do not use any artificial gap penalties. Therefore, they are naturally aligned using only a substitution matrix and a given maximal allowable gap.
|
We have compared our SANK_AL algorithm against the NeedlemanWunsch and SmithWaterman algorithms using the BAliBASE database. In this comparison, SANK_AL performed sequence alignment allowing from 1 to 25 gaps. Figures 2 and 3 show the results of each algorithm using BLOSUM62, by measuring averages of the fD score within each group. Note that SANK_AL produces 25 alignments for a given pair of sequences while the other two programs produce only one. Therefore, we explored 25 gap penalty combinations for the other two programs and picked the top three averages of the fD score for each group among the 25 combinations of gap penalties. These are referred to as N&W1, N&W2 and N&W3 for the NeedlemanWunsch program in Figure 2, and S&W1, S&W2 and S&W3 for the SmithWaterman program in Figure 3. Also, the averages of the fD score obtained from the default gap penalty (insertion = 10, extension = 0.5) are plotted in Figures 2 and 3. These gap penalty combinations are represented as initiation/extension in the box near the plots. The fD score of SANK_AL represents the highest fD among all 25 allowable gaps (SANK_AL).
|
|
As can be seen in Figures 2 and 3, in the majority of cases, the fD scores of SANK_AL alignments are the same or better than that of the other two algorithms. SANK_AL is particularly effective when a pair of sequences is short and the identity between them is low. In fact, 69.6, 56.8 and 55.1% of all SANK_AL alignments for short, medium and long sequences produce higher fD scores than those of the NeedlemanWunsch program. Similarly, 84.7, 78.0 and 71.2% of all SANK_AL alignments for short, medium and long sequences, respectively produce higher fD scores than the score of the SmithWaterman program. Furthermore, note that there is no one gap penalty combination which can provide superior performance at all times. This highlights the difficulty in selecting appropriate gap penalties for a given pair of sequences. In fact, when the default gap penalty is employed for the NeedlemanWunsch and SmithWaterman programs, 68.7 and 78.7% of all SANK_AL alignments outperform them, respectively. As we discuss, SANK_AL can find the optimal or near optimal number of gaps for a given pair of sequences. Consequently, in the majority of cases, SANK_AL makes biologically better alignments than the other two algorithms. Although the alignments using the BLOSUM62 are shown, other alignments using the BLOSUM50 and PAM250 matrices have similar characteristics (data not shown).
We also examined the probability of obtaining each alignment (P-value). Figure 4 plots the fD scores and logarithms of P-values while the number of allowable gaps changes from 1 to 25. As can be seen, the higher fD score and the lower P-value are almost correlated with each other. Actually, 73.1% of the number of gap regions, having the maximum fD, correspond to the number of gap regions with the minimum P-value plus or minus two gaps. If the number of gap regions in alignments is less than the optimal gap, alignments must have unlikely aligned residue pairs and result in small alignment scores. As shuffled sequence alignments with a small number of gaps generally produce small alignment scores, the Z-values are also small. From the formula for P-value, smaller Z-values result in larger P-values. Similarly, if the number of gap regions in alignments is greater than the optimal gap, the alignment tends to contain many unnecessary gaps. Such alignments will be similar to shuffled sequence alignments when the number of gaps increases. Hence, the Z-values will be small. As a result, by using statistics of alignments, we can find an optimal or near optimal number of gaps effectively. However, on the other hand, 26.9% of the number of gap regions, having the maximum fD, do not correspond to the number of gap regions with the minimum P-value plus or minus two gaps. These can be classified into two different cases. One is that the alignment corresponding to the smallest P-value gives almost the same as that with the highest fD value. In other words, the curve of fD value is flat around the highest fD. In this case, even though the alignment obtained from the smallest P-value is not correlated to the best alignment with respect to fD value, the alignment is still guaranteed to have biological features and the alignment is quite similar to the one corresponding to the highest fD value. The other case is that P-value is consistently large (between 1 and 103) while the number of allowable gaps changes. That is, the alignment obtained by SANK_AL is not regarded as a significant alignment for any number of gaps. Generally speaking, the fD value of such alignment is quite low (<0.3). Accordingly, if one finds that P-values are consistently large while the number of gap changes, one should be wary of using this alignment. This P-value statistics allows us to avoid less meaningful alignments.
|
| DISCUSSION |
|---|
|
|
|---|
The source code of SANK_AL was written in the ANSI C++ language. As described before, the computation of SANK_AL is O(mnk) for a pair of sequences of length m and n allowing up to k gap regions. In practice, for a pair of short (56133 residues), medium (182317 residues) and long (329995 residues) sequences, it takes 0.02, 0.20, and 0.99 s on average, respectively, to find the optimal alignment after testing 25 allowable gaps, running on Linux operating system with Intel® XeonTM 2.80 GHz and 2 GB RAM. This computation time is reasonable for the investigation of a few sequences of interest, even if a number of computations to calculate the P-value statistics are required. However, this method is not applicable to a large screening of databanks unlike BLAST (Altschul et al., 1990), FASTA (Pearson and Lipman, 1988) or Automat (Cantalloube et al., 1994). SANK_AL targets detailed and thorough analysis of alignment. In our testing, the number of allowable gaps that led to the best alignment is likely to be <11 for a pair of short sequences (96% of all short pairs and 81% of all medium pairs are under 11) or for a pair of high-similarity sequences (96% of all >35% identity pairs and 87% of all 2040% identity pairs are under 11). Therefore, the recommended maximal gaps is 15 for a short pair or high identity pair of sequences and 25 for other sequences.
In this paper, we detail a new alignment algorithm that does not employ gaps and propose a statistical evaluation that provides the basis for identifying biologically optimal alignments. To our knowledge, this is the first implementation of the algorithm. It is shown that the SANK_AL implementation for pairwise alignment (Bellgard et al., 2003) has a better performance compared with other mainstream approaches. In particular, it performs better than other approaches when the sequence identity is low. For such low identity pairs (often referred to as the Twilight Zone), it is known that almost all existing alignment programs do not perform well (Sauder et al., 2000; Elofsson, 2002). There are some alignment programs attempting to improve alignment quality in the Twilight Zone (Geourjon et al., 2001; Blake and Cohen, 2001). Callebaut et al. (1997) for instance, developed a method that works through a different representation of protein sequences around their hydrophobic core structure and has proven very efficient for aligning low identity proteins. Practically, as there are many low identity pairs in the PDB (Gough et al., 2001), such algorithms may be more useful.
Another feature is that SANK_AL can provide many alignments corresponding to the maximal number of allowable gaps in the alignment. It is well known that optimal alignments in terms of maximal score may not be biologically optimal. For this reason, Vingron and Argos (1990) first proposed a suboptimal alignment algorithm that generates near optimal solutions deduced from the local reliability of alignment. Subsequently various suboptimal algorithms have been proposed (Zuker, 1991; Naor and Brutlag, 1994). However, these algorithms usually create a large number of alternative alignments and it is therefore difficult to discover the differences and/or similarities among alignments within the set of alternatives (Smoot et al., 2004). The SANK_AL algorithm makes it possible to find potential biologically optimal (or near optimal) alignments among the many alignments by calculating the statistics for each of these alignments. While it appears that there are many alignments to assess for a given pair of sequences, in practice, this is not the case. Biologists can select the appropriate alignment(s) using P-value statistics and/or using prior knowledge of the sequences, such as evolutionary information or the number of introns in coding DNA.
| Acknowledgments |
|---|
We wish to acknowledge Ross Taplin and Mark Reynolds for their comments and suggestions, and Hitachi Software Engineering Co., Ltd for supporting Y.N. to do this work.
Received on July 11, 2004; revised on November 17, 2004; accepted on November 29, 2004
| REFERENCES |
|---|
|
|
|---|
Altschul, S. and Gish, W. (1996) Local alignment statistics. Methods Enzymol., 266, 460480[Web of Science][Medline].
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403410[CrossRef][Web of Science][Medline].
Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, 226229.
Apostolico, A. and Giancarlo, R. (1998) Sequence alignment in molecular biology. J. Comput. Biol., 5, 173196[Web of Science][Medline].
Arratia, R. and Waterman, M. (1994) A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Prob., 4, 200225.
Arslan, A., Egecioglu, O., Pevzner, P. (2001) A new approach to sequence comparison: normalized sequence alignment. Bioinformatics, 17, 327337
Aude, J.C. and Louis, A. (2002) An incremental algorithm for Z-value computations. Comput. Chem., 26, 403411[Web of Science][Medline].
Bacro, J.N. and Comet, J.P. (2001) Sequence alignment: an approximation law for the Z-value with applications to databank scanning. Comput. Chem., 25, 401410[CrossRef][Web of Science][Medline].
Bellgard, M., Gamble, T., Reynolds, R., Hunter, A., Trifonov, E., Taplin, R. (2003) Gap Mapping: a paradigm for aligning two sequences. Appl. Bioinformatics, 2, 3135.
Blake, J.D. and Cohen, F.E. (2001) Pairwise sequence alignment below the twilight zone. J. Mol. Biol., 23, 721735.
Callebaut, I., Labesse, G., Durand, P., Poupon, A., Canard, L., Chomilier, J., Henrissat, B., Mornon, J.P. (1997) Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell. Mol. Life Sci., 53, 621645[CrossRef][Web of Science][Medline].
Cantalloube, H., Nahum, C., Achour, A., Lehner, T., Callebaut, I., Burny, A., Bizzini, B., Mornon, J.P., Zagury, D., Zagury, J.F. (1994) Automat: a novel software system for the systematic search for protein (or DNA) similarities with a notable application to autoimmune diseases and AIDS. Comput. Appl. Biosci., 10, 153161
Collins, J., Coulson, A., Lyall, A. (1988) The significance of protein sequence similarities. Comput. Appl. Biosci., 4, 6771
Comet, J.P., Aude, J.C., Glemet, E., Risler, J.L., Henaut, A., Slonimski, P.P., Codani, J.J. (1999) Significance of Z-value statistics of SmithWaterman scores for protein alignments. Comput. Chem., 23, 317331[CrossRef][Web of Science][Medline].
Cristobal, S., Zemla, A., Fischer, D., Rychlewski, L., Elofsson, A. (2001) A study of quality measures for protein threading models. BMC Bioinformatics, 2, 5[CrossRef][Medline].
Dembo, A., Karlin, S., Zeitouni, O. (1994) Limit distribution of maximal nonaligned two-sequence segmental score. Ann. Prob., 22, 20222039[CrossRef].
Durbin, R., Eddy, S., Krogh, A., Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, (1998) , Cambridge Cambridge University Press.
Elleman, T. (1978) A method for detecting distant evolutionary relationships between protein or nucleic acid sequences in the presence of deletions or insertions. J. Mol. Evol., 11, 143161[CrossRef][Web of Science][Medline].
Elofsson, A. (2002) A study on protein sequence quality. Proteins, 46, 330339[CrossRef][Web of Science][Medline].
Feng, Z.K. and Sippl, M.J. (1996) Optimum superimposition of protein structures: ambiguities and implications. Fold. Des., 1, 123132[CrossRef][Web of Science][Medline].
Geourjon, C., Combet, C., Blanchet, C., Deleage, G. (2001) Identification of related proteins with weak sequence identity using secondary structure information. Protein Sci., 10, 788797[CrossRef][Web of Science][Medline].
Gilks, W.R., Richardson, S., Spiegelhalter, D.J. Markov Chain Monte Carlo in Practice, (1995) , Boca Raton, FL CRC Press.
Godzik, A. (1996) The structural alignment between two proteins: is there a unique answer? Protein Sci., 5, 13251338[Web of Science][Medline].
Gough, J., Karplus, K., Hughey, R., Chothia, C. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 903919[CrossRef][Web of Science][Medline].
Holm, L. and Sander, C. (1999) Protein folds and families: sequence and structure alignments. Nucleic Acids Res, 27, 244227
Irie, R., Hiraoka, S., Kasahara, N., Nagai, K. (1999) DNA sequence comparison considering both amino acid and nucleotide insertions/deletions because of evolution and experimental error. J. Biotechnol., 69, 1926[CrossRef][Web of Science][Medline].
Levitt, M. and Gerstein, M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. USA, 26, 59135920.
Lipman, D.J. and Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. Science, 22, 14351441.
Karlin, S. and Altschul, S. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87, 22642268
Karplus, K. and Hu, B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713720
Knudsen, B. and Miyamoto, M.M. (2003) Sequence alignments and pair hidden Markov models using evolutionary history. J. Mol. Biol., 333, 453460[CrossRef][Web of Science][Medline].
Morgenstern, B. (1999) DALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211218
Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of SmithWaterman local sequence similarity scores. Bull. Math. Biol., 54, 5975[Web of Science].
Naor, D. and Brutlag, D.L. (1994) On near-optimal alignments of biological sequences. J. Comput. Biol., 1, 349366[Medline].
Needleman, S. and Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443453[CrossRef][Web of Science][Medline].
Notredame, C., Higgins, D., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205217[CrossRef][Web of Science][Medline].
Pearson, W. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 11451160[Web of Science][Medline].
Pearson, W. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 7184[CrossRef][Web of Science][Medline].
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 24442448
Rice, P., Longden, I., Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276277[CrossRef][Web of Science][Medline].
Roytberg, M. (1998) Sequence alignment without gap penalties. Proceedings of the First International Conference on Bioinformatics of Genome Regulation and Structure , Novosibirsk, Russia Vol. 2, , pp. 311313.
Sankoff, D. (1972) Matching sequences under deletion/insertion constraint. Proc. Natl Acad. Sci. USA, 69, 46
Sankoff, D. and Cedergren, R. (1973) A test for nucleotide sequence homology. J. Mol. Biol, 77, 159164[CrossRef][Web of Science].
Sauder, M., Arthur, J., Dunbrack, R. (2000) Large-scale comparison of protein sequence alignment algorithms with structural alignments. Proteins, 40, 622[CrossRef][Web of Science][Medline].
Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739747
Smith, T. and Waterman, M. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195197[CrossRef][Web of Science][Medline].
Smith, T., Waterman, M., Burks, C. (1985) The statistical distribution of nucleic acid similarities. Nucleic Acids Res., 13, 645656
Smoot, M.E., Guerlain, S.A., Pearson, W.R. (2004) Visualization of near optimal sequence alignments. Bioinformatics, 20, 953958
Thompson, J., Plewniak, F., Poch, O. (1999a) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 8788
Thompson, J., Plewniak, F., Poch, O. (1999b) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 26822690
Vingron, M. and Argos, P. (1990) Determination of reliable regions in protein sequence alignments. Protein Eng., 3, 565569
Wallqvist, A., Fukunishi, Y., Murphy, L.R., Fadel, A., Levy, R.M. (2000) Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases. Bioinformatics, 16, 9881002
Waterman, M. and Vingron, M. (1994a) Rapid and accurate estimates of statistical significance for sequence database searches. Proc. Natl Acad. Sci. USA, 91, 46254628
Waterman, M. and Vingron, M. (1994b) Sequence comparison significance and Poisson approximation. Stat. Sci., 9, 367381.
Zhu, J., Liu, J., Lawrence, C. (1997) Bayesian adaptive alignment and inference. Proc. Int. Conf. Intell. Syst. Mol. Biol., 5, 358368[Medline].
Zuker, M. (1991) Suboptimal sequence alignment in molecular biology. Alignment with error analysis. J. Mol. Biol., 20, 403420.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










