Skip Navigation


Bioinformatics Advance Access originally published online on December 10, 2004
Bioinformatics 2005 21(8):1421-1428; doi:10.1093/bioinformatics/bti198
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1421    most recent
bti198v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nozaki, Y.
Right arrow Articles by Bellgard, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nozaki, Y.
Right arrow Articles by Bellgard, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties

Yasuyuki Nozaki and Matthew Bellgard *

Centre for Bioinformatics and Biological Computing, Murdoch University Murdoch, WA 6150, Australia

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 

Motivation: Although pairwise sequence alignment is essential in comparative genomic sequence analysis, it has proven difficult to precisely determine the gap penalties for a given pair of sequences. A common practice is to employ default penalty values. However, there are a number of problems associated with using gap penalties. First, alignment results can vary depending on the gap penalties, making it difficult to explore appropriate parameters. Second, the statistical significance of an alignment score is typically based on a theoretical model of non-gapped alignments, which may be misleading. Finally, there is no way to control the number of gaps for a given pair of sequences, even if the number of gaps is known in advance.

Results: In this paper, we develop and evaluate the performance of an alignment technique that allows the researcher to assign a priori set of the number of allowable gaps, rather than using gap penalties. We compare this approach with the Smith–Waterman and Needleman–Wunsch techniques on a set of structurally aligned protein sequences. We demonstrate that this approach outperforms the other techniques, especially for short sequences (56–133 residues) with low similarity (<25%). Further, by employing a statistical measure, we show that it can be used to assess the quality of the alignment in relation to the true alignment with the associated optimal number of gaps.

Availability: The implementation of the described methods SANK_AL is available at http://cbbc.murdoch.edu.au/

Contact: matthew{at}cbbc.murdoch.edu.au


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Pairwise sequence alignment is one of the most fundamental tools for comparing DNA and protein sequences. It establishes the basis for the interpretation of evolutionary and functional relationships between gene sequences and species. Through their seminal work in 1970, Needleman and Wunsch (1970) provided the catalyst to numerous other profound contributions to sequence alignment. To date, the most successful and widely understood algorithm is the Smith–Waterman technique (Smith and Waterman, 1981), which can detect similar segments in genomic sequences with maximal score if the compared sequences are closely related. More recently, there have been a number of modifications to the Smith–Waterman technique as well as novel approaches that are based on structural information (Geourjon et al., 2001; Wallqvist et al., 2000), profile information (Durbin et al., 1998; Gough et al., 2001) and evolutionary models of insertions and deletions (Knudsen and Miyamoto, 2003; Irie et al., 1999).

Synonymous with these alignment techniques are gap penalties that must be prescribed in order to obtain optimal or near optimal alignments. However, it has proven difficult to precisely determine the penalties that should be selected for a given pair of sequences. There are a number of unavoidable problems when employing gap penalties. Firstly, sequence alignment results vary depending on the penalties associated with the initiation and extension of gaps, which attempt to model insertion and/or deletion events that have occurred in either or both sequences through the course of their evolution. Biologists often find it difficult to determine the gap parameters in order to obtain the true ‘biological’ alignment. It is well known that artificially assigned default gap parameters are not selective and/or sensitive enough for all types of alignments (Arslan et al., 2001; Apostolico and Giancarlo, 1998; Morgenstern, 1999; Pearson, 1995; Roytberg, 1998). Secondly, the alignments with gaps do not always produce correct E- and P-value statistics. These statistics are based on a solid theoretical foundation for local alignments without gaps (Dembo et al., 1994; Karlin and Altschul, 1990). According to numerous computational experiments and some analytic results (Altschul and Gish, 1996; Altschul et al., 1990; Arratia and Waterman, 1994; Collins et al., 1988; Mott, 1992; Smith et al., 1985; Pearson, 1988; Waterman and Vingron, 1994a, b), it has been suggested that the formula for calculating statistics of ungapped alignments should be applied to gapped alignment as well. However, the statistics of gapped alignments do not exactly fit the theoretical distribution (extreme value distribution). Therefore, very small E- and P-values may not be reliable. Finally, there is no way to control the number of gaps to be assigned for a given pair of sequences even if the number of gaps is known in advance. This problem has been highlighted previously (Bellgard et al., 2003; Elleman, 1978; Sankoff and Cedergren, 1973; Zhu et al., 1997).

There is an alignment program, DIALIGN 2, which does not employ any gap penalties (Morgenstern, 1999). This approach uses segment-to-segment comparison rather than residue-to-residue comparison and alignments are composed from gap-free segment pairs. However, DIALIGN 2 requires user-defined parameters to determine the minimum size of segment and the significance between two segments' pairs and one must therefore specify appropriate parameters.

In a previous study, an approach that constrains the maximal number of allowable gaps was developed (Bellgard et al., 2003). This implementation, which is referred to as ‘gap mapping’, is a dynamic programming approach that employs an extra dimension corresponding to the number of gap constraints. Thus, for an priori defined of the maximum allowable number of gaps k, an alignment is obtained with the best alignment score with at most k gaps. Changing the number of allowable gaps can give a different alignment. Therefore, a novel way to explore optimal and near optimal alignments was developed (Bellgard et al., 2003). A search of the literature (and M. Waterman, personal communication) revealed that this implementation was related closely to an approach proposed by Sankoff (1972). Surprisingly, this approach has received little or no attention. For this reason, we have referred to our implementation as the SANK_AL approach.

In the present study, the SANK_AL implementation is evaluated and compared to both the Needleman–Wunsch algorithm (Needleman and Wunsch, 1970) as a global alignment technique and the Smith–Waterman algorithm (Smith and Waterman, 1981) as a local alignment technique. A statistical significance score P-value is developed based on a Monte Carlo procedure (Lipman and Pearson, 1985; Gilks et al., 1995) and used as a measure of the alignment for a given number of gaps. The sequences used for comparison are pairs of sequences taken from structurally aligned BAliBASE for a range of protein sequence length and percentage similarity (Thompson et al., 1999a, b). We show that in the majority of cases, the SANK_AL achieves biologically meaningful alignments over the other two techniques without the need to explore gap penalty parameter space. As SANK_AL produces more than one alignment (one alignment corresponds to a given number of maximal allowable gaps), each alignment produces its own score and associated statistical significance. Interestingly, the SANK_AL technique, as we shall show, in the majority of cases, associates the correct biological alignment to a minimal P-value for a given number of allowable gaps. This form of self-evaluation appears to be intuitive and novel in that it allows for the assessment of the most appropriate alignment out of the range of possible alignments. SANK_AL is available at http://cbbc.murdoch.edu.au/


    METHODS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Outline of SANK_AL algorithm
The SANK_AL algorithm implementation developed independently is in fact a modification to the algorithm proposed by Sankoff (1972). The difference between the SANK_AL implementation and the original Sankoff algorithm is that SANK_AL can apply not only to a given pair of DNA sequences but also to amino acid sequences as well as provide global alignments for them. The SANK_AL algorithm is described briefly below. Let {ai} (where i = 1, 2,..., m) and {bj} (where j = 1, 2,..., n) be two sequences to be aligned and k be a number of maximal allowable gaps. To match a sequence {ai} against another sequence {bj}, three m by n by (k + 1) matrices P, Q and R are constructed. The three matrices correspond to optimal alignments ending (ai, bj), (‘–’, bj) and (ai, ‘–’), respectively, where ‘–’ is a gap character. The matrix element in P, Q and R at position s, t, g denotes the best score obtainable by matching {ai} (where i = 1, 2,...,s) and {bj} (where j = 1, 2,...,t) including g gap regions. P[i][j][g], Q[i][j][g] and R[i][j][g] are initialized as follows:

for all 0 ≤ g ≤ k, and

for all 0 ≤ g ≤ k and 2 ≤ j ≤ n, and

for all 0 ≤ g ≤ k and 2 ≤ i ≤ m, where s(ai, bj) is a score in a substitution matrix corresponding to ai and bj. The element –{infty} implies impossible alignments for a given i, j and g. For instance, P[1][3][2] should be –{infty}, because it is impossible to align a1 and b1b2b3 satisfying both constraints allowing two gap regions and ending (a1, b3). The element 0 implies that such alignments have no residue-to-residue pair. For instance, R[2][1][3] is derived only from the case of alignment a1a2 and –b1–, therefore 0 is assigned for it. Then inner elements of matrices are calculated recursively by

for all 1 ≤ i ≤ m, 1 ≤ j ≤ n and 0 ≤ g ≤ k.

To get an optimal alignment for a pair of sequences {ai} (where i = 1, 2, ..., m) and {bj} (where j = 1, 2, ..., n) with g gap regions, first, the highest value C among P[m][n][g], Q[m][m][g] and R[m][n][g] is selected. The matrix containing C determines whether the last pair of alignment has one of the following patterns (am, bn), (‘–’, bn) or (am, ‘–’). Then, just as in Needleman–Wunsch algorithm, matrices are traced back to find a path of elements that contribute the highest value C. Finally, the alignment is constructed by assigning (ai, bj), (‘–’, bj) or (ai, ‘–’) according to the matrices of elements in the path. Similarly, for the other number of gap regions g', an alignment can be obtained as well by finding a path of elements contributing to the highest value C' among P[m][n][g'], Q[m][m][g'] and R[m][n][g'].

In this algorithm, the number of calculations is obviously proportional to the size of matrices, in other words, the total running time of this SANK_AL algorithm is O(mnk). Note that this SANK_AL algorithm does not make use of any gap penalties or any other specific information in the process of alignment. Moreover, this algorithm can find subsequences under the given number of gap regions, in other words, satisfying insertion and/or deletion constraints.

Dataset of structure alignments
To assess the quality of our alignment algorithm, we require a standard for the ‘true’ alignment. This is usually derived from alignments of protein structures. Several comparative reports of protein alignment algorithms (Sauder et al., 2000; Elofsson, 2002) use SCOP, a protein family database (Andreeva et al., 2004). However, as SCOP does not have alignment information, it is necessary to make ‘true’ alignments from it by using structural alignment programs such as CE (Shindyalov and Bourne, 1998) and DALI (Holm and Sander, 1999). It is acknowledged that sequence alignments produced by two structure alignment programs may differ substantially when the sequence identity between them is low (Feng and Sippl, 1996; Godzik, 1996).

For that reason, in this study, we selected BAliBASE which is manually constructed and is not intended for any specific alignment program. It should be noted that BAliBASE has a potential bias toward global alignment programs since aligned sequences are trimmed down at the boundaries of the alignment to the core block only (Karplus and Hu, 2001). However, it has been used as a standard multiple alignment sequence set to evaluate alignment algorithms in recent studies (Thompson et al., 1999b; Karplus and Hu, 2001; Notredame et al., 2000). BAliBASE is currently available at http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/. BAliBASE is classified into eight hierarchical reference sets by the length and similarity of the sequences in the core blocks and by the presence of insertions, N/C-terminal extensions and repeats. In this paper, Reference 1 which contains 82 multiple alignment sequences was used. Reference 1 is divided into nine groups depending on the length of the sequences: short (between 56 and 133 residues), medium (182–317 residues) and long (329–995 residues); and the identity of sequences: <25, 20–40 and >35%. For the analysis of pairwise alignment, the first and second sequences from each multiple alignment were chosen. Our test data, therefore, results in the comparison of 7–12 pairs of sequences in each group.

Comparison of SANK_AL with other techniques
In order to show the performance of our SANK_AL algorithm, SANK_AL (Bellgard et al., 2003) was compared with other alignment methods. As our program produces global alignments and focuses on detailed analysis of alignment for the two given sequences, we chose the Needleman–Wunsch algorithm for comparison. We also chose the Smith–Waterman algorithm for a local alignment comparison. Like SANK_AL, these two algorithms can be used generically as they are not intended for specific tasks such as fold recognition or gene finding. The Needleman–Wunsch and the Smith–Waterman algorithms were carried out through a local copy of EMBOSS (Rice et al., 2000). The substitution matrix for evaluating the alignments used BLOSUM50, BLOSUM62 and PAM250 and the gap penalties employed a range of parameters for initiation (9, 10, 11, 12 and 13) and extension (0.5, 1, 2, 3 and 4). Therefore, 25 combinations of gap penalties for each substitution matrix have been tested.

To measure alignment accuracy compared to the BAliBASE structural alignment, we chose to use the fD score (Sauder et al., 2000). The fD is defined to be the percentage of paired residues matched against the standard alignment. Although many other measures for assessing the significance of alignments are proposed (Levitt and Gerstein, 1998; Cristobal et al., 2001), the fD score is the simplest and most intuitive method and it has been used in many papers (Sauder et al., 2000; Blake and Cohen, 2001).

Statistics of alignments
The fD score can assess the likely biological significance of the similarity although it is also necessary to know whether the alignments themselves are statistically significant or not. In this study, statistical significance was estimated by using the Monte Carlo techniques first proposed by Lipman and Pearson (1985) and examined by Comet et al. (1999) and Bacro and Comet (2001). This method is known to be less dependent on the sequence lengths and compositions (Comet et al., 1999; Bacro and Comet, 2001). For any two given sequences a and b, this method compares one of the sequences a with the randomly shuffled second sequence b* many times. The shuffled sequences share the same amino acid composition and length with the initial sequence. For an empirical mean and standard deviation of the alignment score S(a, b*) estimated by a number of shuffled sequences, the Z-value is defined as

where S(a, b) is an alignment score of a and b. However, Comet et al. (1999) reported shuffling the first sequence [Z(b,a)] can lead to Z-values that are markedly different from those obtained by shuffling the second sequence [Z(a,b)]. Thus, a conservative Z-value Z'(a,b) was used by following the formula suggested by Comet et al. (1999) as

To obtain a converged Z-value, the number of shuffling times is recommended from 100 to 1000 (Comet et al., 1999; Aude and Louis, 2002); therefore, we have used 1000 times. Then the probability (P-value) observed at least Z-value z, P(Z > z), was calculated by

where {gamma} is a Euler constant. This P-value is based on the fact that the Z-value should follow the extreme value distribution, more precisely, the Grumbel distribution. In fact, interestingly, the histogram of Z-values was well fitted with its distribution for random sequences (unpublished data). However, in practical applications, the Z-value may deviate from the Grumbel distribution when it is large. This is because real genomic sequences can be biased (Comet et al., 1999; Bacro and Comet, 2001). Therefore, P-value was used as an ordered relative measure of alignment significance rather than an absolute measure.


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
For a given k gaps, SANK_AL produces the alignment with the best score with at most k gaps. Figure 1 shows an example of SANK_AL outputs allowing from one gap through to five gaps. It is worth noting that these alignments do not use any artificial gap penalties. Therefore, they are ‘naturally’ aligned using only a substitution matrix and a given maximal allowable gap.



View larger version (63K):
[in this window]
[in a new window]
 
Fig. 1 Two examples of SANK_AL alignments using the BLOSUM62 substitution matrix are shown (short sequence pair <25% identity and short sequence pair >35% identity). While the number of allowable gaps is changed from one to five, SANK_AL produces biologically meaningful optimal alignments for each.

 
We have compared our SANK_AL algorithm against the Needleman–Wunsch and Smith–Waterman algorithms using the BAliBASE database. In this comparison, SANK_AL performed sequence alignment allowing from 1 to 25 gaps. Figures 2 and 3 show the results of each algorithm using BLOSUM62, by measuring averages of the fD score within each group. Note that SANK_AL produces 25 alignments for a given pair of sequences while the other two programs produce only one. Therefore, we explored 25 gap penalty combinations for the other two programs and picked the top three averages of the fD score for each group among the 25 combinations of gap penalties. These are referred to as N&W1, N&W2 and N&W3 for the Needleman–Wunsch program in Figure 2, and S&W1, S&W2 and S&W3 for the Smith–Waterman program in Figure 3. Also, the averages of the fD score obtained from the default gap penalty (insertion = 10, extension = 0.5) are plotted in Figures 2 and 3. These gap penalty combinations are represented as initiation/extension in the box near the plots. The fD score of SANK_AL represents the highest fD among all 25 allowable gaps (SANK_AL).



View larger version (33K):
[in this window]
[in a new window]
 
Fig. 2 Comparison of SANK_AL and the Needleman–Wunsch algorithm. A range of 25 allowable gaps is used in SANK_AL. N&W1, N&W2 and N&W3 represent the top three averages of fD scores for each group among 25 gap penalty combinations. The averages of fD scores obtained from the default gap penalty are also plotted in this figure. These gap penalties are shown in the figure as insertion/extension. All comparisons are conducted against (a) long sequences, (b) medium sequences and (c) short sequences.

 


View larger version (33K):
[in this window]
[in a new window]
 
Fig. 3 Comparison of SANK_AL and the Smith–Waterman algorithm. A range of 25 allowable gaps is used in SANK_AL. S&W1, S&W2 and S&W3 represent the top three averages of fD scores for each group among 25 gap penalty combinations. The averages of fD scores obtained from the default gap penalty are also plotted in this figure. These gap penalties are shown as insertion/extension. All comparisons are conducted against (a) long sequences, (b) medium sequences and (c) short sequences.

 
As can be seen in Figures 2 and 3, in the majority of cases, the fD scores of SANK_AL alignments are the same or better than that of the other two algorithms. SANK_AL is particularly effective when a pair of sequences is short and the identity between them is low. In fact, 69.6, 56.8 and 55.1% of all SANK_AL alignments for short, medium and long sequences produce higher fD scores than those of the Needleman–Wunsch program. Similarly, 84.7, 78.0 and 71.2% of all SANK_AL alignments for short, medium and long sequences, respectively produce higher fD scores than the score of the Smith–Waterman program. Furthermore, note that there is no one gap penalty combination which can provide superior performance at all times. This highlights the difficulty in selecting appropriate gap penalties for a given pair of sequences. In fact, when the default gap penalty is employed for the Needleman–Wunsch and Smith–Waterman programs, 68.7 and 78.7% of all SANK_AL alignments outperform them, respectively. As we discuss, SANK_AL can find the optimal or near optimal number of gaps for a given pair of sequences. Consequently, in the majority of cases, SANK_AL makes biologically better alignments than the other two algorithms. Although the alignments using the BLOSUM62 are shown, other alignments using the BLOSUM50 and PAM250 matrices have similar characteristics (data not shown).

We also examined the probability of obtaining each alignment (P-value). Figure 4 plots the fD scores and logarithms of P-values while the number of allowable gaps changes from 1 to 25. As can be seen, the higher fD score and the lower P-value are almost correlated with each other. Actually, 73.1% of the number of gap regions, having the maximum fD, correspond to the number of gap regions with the minimum P-value plus or minus two gaps. If the number of gap regions in alignments is less than the optimal gap, alignments must have unlikely aligned residue pairs and result in small alignment scores. As shuffled sequence alignments with a small number of gaps generally produce small alignment scores, the Z-values are also small. From the formula for P-value, smaller Z-values result in larger P-values. Similarly, if the number of gap regions in alignments is greater than the optimal gap, the alignment tends to contain many unnecessary gaps. Such alignments will be similar to shuffled sequence alignments when the number of gaps increases. Hence, the Z-values will be small. As a result, by using statistics of alignments, we can find an optimal or near optimal number of gaps effectively. However, on the other hand, 26.9% of the number of gap regions, having the maximum fD, do not correspond to the number of gap regions with the minimum P-value plus or minus two gaps. These can be classified into two different cases. One is that the alignment corresponding to the smallest P-value gives almost the same as that with the highest fD value. In other words, the curve of fD value is flat around the highest fD. In this case, even though the alignment obtained from the smallest P-value is not correlated to the best alignment with respect to fD value, the alignment is still guaranteed to have biological features and the alignment is quite similar to the one corresponding to the highest fD value. The other case is that P-value is consistently large (between 1 and 10–3) while the number of allowable gaps changes. That is, the alignment obtained by SANK_AL is not regarded as a significant alignment for any number of gaps. Generally speaking, the fD value of such alignment is quite low (<0.3). Accordingly, if one finds that P-values are consistently large while the number of gap changes, one should be wary of using this alignment. This P-value statistics allows us to avoid less meaningful alignments.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 4 Correlations between fD score and P-value for a range of 25 allowable gaps using (a) long sequence pair, >35% identity (P43831 [GenBank] versus P47897 [GenBank] ), (b) medium sequence pair, 20–40% identity (P09211 [GenBank] versus P20136 [GenBank] ) and (c) short sequence pair, >35% identity (P02244 [GenBank] versus P27686 [GenBank] ).

 

    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
The source code of SANK_AL was written in the ANSI C++ language. As described before, the computation of SANK_AL is O(mnk) for a pair of sequences of length m and n allowing up to k gap regions. In practice, for a pair of short (56–133 residues), medium (182–317 residues) and long (329–995 residues) sequences, it takes 0.02, 0.20, and 0.99 s on average, respectively, to find the optimal alignment after testing 25 allowable gaps, running on Linux operating system with Intel® XeonTM 2.80 GHz and 2 GB RAM. This computation time is reasonable for the investigation of a few sequences of interest, even if a number of computations to calculate the P-value statistics are required. However, this method is not applicable to a large screening of databanks unlike BLAST (Altschul et al., 1990), FASTA (Pearson and Lipman, 1988) or Automat (Cantalloube et al., 1994). SANK_AL targets detailed and thorough analysis of alignment. In our testing, the number of allowable gaps that led to the best alignment is likely to be <11 for a pair of short sequences (96% of all short pairs and 81% of all medium pairs are under 11) or for a pair of high-similarity sequences (96% of all >35% identity pairs and 87% of all 20–40% identity pairs are under 11). Therefore, the recommended maximal gaps is 15 for a short pair or high identity pair of sequences and 25 for other sequences.

In this paper, we detail a new alignment algorithm that does not employ gaps and propose a statistical evaluation that provides the basis for identifying biologically optimal alignments. To our knowledge, this is the first implementation of the algorithm. It is shown that the SANK_AL implementation for pairwise alignment (Bellgard et al., 2003) has a better performance compared with other mainstream approaches. In particular, it performs better than other approaches when the sequence identity is low. For such low identity pairs (often referred to as the ‘Twilight Zone’), it is known that almost all existing alignment programs do not perform well (Sauder et al., 2000; Elofsson, 2002). There are some alignment programs attempting to improve alignment quality in the ‘Twilight Zone’ (Geourjon et al., 2001; Blake and Cohen, 2001). Callebaut et al. (1997) for instance, developed a method that works through a different representation of protein sequences around their hydrophobic core structure and has proven very efficient for aligning low identity proteins. Practically, as there are many low identity pairs in the PDB (Gough et al., 2001), such algorithms may be more useful.

Another feature is that SANK_AL can provide many alignments corresponding to the maximal number of allowable gaps in the alignment. It is well known that optimal alignments in terms of maximal score may not be biologically optimal. For this reason, Vingron and Argos (1990) first proposed a suboptimal alignment algorithm that generates near optimal solutions deduced from the local reliability of alignment. Subsequently various suboptimal algorithms have been proposed (Zuker, 1991; Naor and Brutlag, 1994). However, these algorithms usually create a large number of alternative alignments and it is therefore difficult to discover the differences and/or similarities among alignments within the set of alternatives (Smoot et al., 2004). The SANK_AL algorithm makes it possible to find potential biologically optimal (or near optimal) alignments among the many alignments by calculating the statistics for each of these alignments. While it appears that there are many alignments to assess for a given pair of sequences, in practice, this is not the case. Biologists can select the appropriate alignment(s) using P-value statistics and/or using prior knowledge of the sequences, such as evolutionary information or the number of introns in coding DNA.


    Acknowledgments
 
We wish to acknowledge Ross Taplin and Mark Reynolds for their comments and suggestions, and Hitachi Software Engineering Co., Ltd for supporting Y.N. to do this work.

Received on July 11, 2004; revised on November 17, 2004; accepted on November 29, 2004

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 

    Altschul, S. and Gish, W. (1996) Local alignment statistics. Methods Enzymol., 266, 460–480[Web of Science][Medline].

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410[CrossRef][Web of Science][Medline].

    Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, 226–229.

    Apostolico, A. and Giancarlo, R. (1998) Sequence alignment in molecular biology. J. Comput. Biol., 5, 173–196[Web of Science][Medline].

    Arratia, R. and Waterman, M. (1994) A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Prob., 4, 200–225.

    Arslan, A., Egecioglu, O., Pevzner, P. (2001) A new approach to sequence comparison: normalized sequence alignment. Bioinformatics, 17, 327–337[Abstract/Free Full Text].

    Aude, J.C. and Louis, A. (2002) An incremental algorithm for Z-value computations. Comput. Chem., 26, 403–411[Web of Science][Medline].

    Bacro, J.N. and Comet, J.P. (2001) Sequence alignment: an approximation law for the Z-value with applications to databank scanning. Comput. Chem., 25, 401–410[CrossRef][Web of Science][Medline].

    Bellgard, M., Gamble, T., Reynolds, R., Hunter, A., Trifonov, E., Taplin, R. (2003) Gap Mapping: a paradigm for aligning two sequences. Appl. Bioinformatics, 2, 31–35.

    Blake, J.D. and Cohen, F.E. (2001) Pairwise sequence alignment below the twilight zone. J. Mol. Biol., 23, 721–735.

    Callebaut, I., Labesse, G., Durand, P., Poupon, A., Canard, L., Chomilier, J., Henrissat, B., Mornon, J.P. (1997) Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell. Mol. Life Sci., 53, 621–645[CrossRef][Web of Science][Medline].

    Cantalloube, H., Nahum, C., Achour, A., Lehner, T., Callebaut, I., Burny, A., Bizzini, B., Mornon, J.P., Zagury, D., Zagury, J.F. (1994) Automat: a novel software system for the systematic search for protein (or DNA) similarities with a notable application to autoimmune diseases and AIDS. Comput. Appl. Biosci., 10, 153–161[Abstract/Free Full Text].

    Collins, J., Coulson, A., Lyall, A. (1988) The significance of protein sequence similarities. Comput. Appl. Biosci., 4, 67–71[Abstract/Free Full Text].

    Comet, J.P., Aude, J.C., Glemet, E., Risler, J.L., Henaut, A., Slonimski, P.P., Codani, J.J. (1999) Significance of Z-value statistics of Smith–Waterman scores for protein alignments. Comput. Chem., 23, 317–331[CrossRef][Web of Science][Medline].

    Cristobal, S., Zemla, A., Fischer, D., Rychlewski, L., Elofsson, A. (2001) A study of quality measures for protein threading models. BMC Bioinformatics, 2, 5[CrossRef][Medline].

    Dembo, A., Karlin, S., Zeitouni, O. (1994) Limit distribution of maximal nonaligned two-sequence segmental score. Ann. Prob., 22, 2022–2039[CrossRef].

    Durbin, R., Eddy, S., Krogh, A., Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, (1998) , Cambridge Cambridge University Press.

    Elleman, T. (1978) A method for detecting distant evolutionary relationships between protein or nucleic acid sequences in the presence of deletions or insertions. J. Mol. Evol., 11, 143–161[CrossRef][Web of Science][Medline].

    Elofsson, A. (2002) A study on protein sequence quality. Proteins, 46, 330–339[CrossRef][Web of Science][Medline].

    Feng, Z.K. and Sippl, M.J. (1996) Optimum superimposition of protein structures: ambiguities and implications. Fold. Des., 1, 123–132[CrossRef][Web of Science][Medline].

    Geourjon, C., Combet, C., Blanchet, C., Deleage, G. (2001) Identification of related proteins with weak sequence identity using secondary structure information. Protein Sci., 10, 788–797[CrossRef][Web of Science][Medline].

    Gilks, W.R., Richardson, S., Spiegelhalter, D.J. Markov Chain Monte Carlo in Practice, (1995) , Boca Raton, FL CRC Press.

    Godzik, A. (1996) The structural alignment between two proteins: is there a unique answer? Protein Sci., 5, 1325–1338[Web of Science][Medline].

    Gough, J., Karplus, K., Hughey, R., Chothia, C. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 903–919[CrossRef][Web of Science][Medline].

    Holm, L. and Sander, C. (1999) Protein folds and families: sequence and structure alignments. Nucleic Acids Res, 27, 244–227[Abstract/Free Full Text].

    Irie, R., Hiraoka, S., Kasahara, N., Nagai, K. (1999) DNA sequence comparison considering both amino acid and nucleotide insertions/deletions because of evolution and experimental error. J. Biotechnol., 69, 19–26[CrossRef][Web of Science][Medline].

    Levitt, M. and Gerstein, M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. USA, 26, 5913–5920.

    Lipman, D.J. and Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. Science, 22, 1435–1441.

    Karlin, S. and Altschul, S. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87, 2264–2268[Abstract/Free Full Text].

    Karplus, K. and Hu, B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 17, 713–720[Abstract/Free Full Text].

    Knudsen, B. and Miyamoto, M.M. (2003) Sequence alignments and pair hidden Markov models using evolutionary history. J. Mol. Biol., 333, 453–460[CrossRef][Web of Science][Medline].

    Morgenstern, B. (1999) DALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218[Abstract/Free Full Text].

    Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of Smith–Waterman local sequence similarity scores. Bull. Math. Biol., 54, 59–75[Web of Science].

    Naor, D. and Brutlag, D.L. (1994) On near-optimal alignments of biological sequences. J. Comput. Biol., 1, 349–366[Medline].

    Needleman, S. and Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453[CrossRef][Web of Science][Medline].

    Notredame, C., Higgins, D., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217[CrossRef][Web of Science][Medline].

    Pearson, W. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160[Web of Science][Medline].

    Pearson, W. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 71–84[CrossRef][Web of Science][Medline].

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448[Abstract/Free Full Text].

    Rice, P., Longden, I., Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276–277[CrossRef][Web of Science][Medline].

    Roytberg, M. (1998) Sequence alignment without gap penalties. Proceedings of the First International Conference on Bioinformatics of Genome Regulation and Structure , Novosibirsk, Russia Vol. 2, , pp. 311–313.

    Sankoff, D. (1972) Matching sequences under deletion/insertion constraint. Proc. Natl Acad. Sci. USA, 69, 4–6[Abstract/Free Full Text].

    Sankoff, D. and Cedergren, R. (1973) A test for nucleotide sequence homology. J. Mol. Biol, 77, 159–164[CrossRef][Web of Science].

    Sauder, M., Arthur, J., Dunbrack, R. (2000) Large-scale comparison of protein sequence alignment algorithms with structural alignments. Proteins, 40, 6–22[CrossRef][Web of Science][Medline].

    Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747[Abstract/Free Full Text].

    Smith, T. and Waterman, M. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197[CrossRef][Web of Science][Medline].

    Smith, T., Waterman, M., Burks, C. (1985) The statistical distribution of nucleic acid similarities. Nucleic Acids Res., 13, 645–656[Abstract/Free Full Text].

    Smoot, M.E., Guerlain, S.A., Pearson, W.R. (2004) Visualization of near optimal sequence alignments. Bioinformatics, 20, 953–958[Abstract/Free Full Text].

    Thompson, J., Plewniak, F., Poch, O. (1999a) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88[Abstract/Free Full Text].

    Thompson, J., Plewniak, F., Poch, O. (1999b) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690[Abstract/Free Full Text].

    Vingron, M. and Argos, P. (1990) Determination of reliable regions in protein sequence alignments. Protein Eng., 3, 565–569[Abstract/Free Full Text].

    Wallqvist, A., Fukunishi, Y., Murphy, L.R., Fadel, A., Levy, R.M. (2000) Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases. Bioinformatics, 16, 988–1002[Abstract/Free Full Text].

    Waterman, M. and Vingron, M. (1994a) Rapid and accurate estimates of statistical significance for sequence database searches. Proc. Natl Acad. Sci. USA, 91, 4625–4628[Abstract/Free Full Text].

    Waterman, M. and Vingron, M. (1994b) Sequence comparison significance and Poisson approximation. Stat. Sci., 9, 367–381.

    Zhu, J., Liu, J., Lawrence, C. (1997) Bayesian adaptive alignment and inference. Proc. Int. Conf. Intell. Syst. Mol. Biol., 5, 358–368[Medline].

    Zuker, M. (1991) Suboptimal sequence alignment in molecular biology. Alignment with error analysis. J. Mol. Biol., 20, 403–420.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1421    most recent
bti198v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nozaki, Y.
Right arrow Articles by Bellgard, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nozaki, Y.
Right arrow Articles by Bellgard, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?