Bioinformatics Advance Access originally published online on April 7, 2005
Bioinformatics 2005 21(12):2827-2831; doi:10.1093/bioinformatics/bti433
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Convergent Island Statistics: a fast method for determining local alignment score significance
Eidogen-Sertanty Inc. 9381 Judicial Dr., San Diego, CA 92121, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Background distribution statistics for profile-based sequence alignment algorithms cannot be calculated analytically, and hence such algorithms must resort to measuring the significance of an alignment score by assessing its location among a distribution of background alignment scores. The Gumbel parameters that describe this background distribution are usually pre-computed for a limited number of scoring systems, gap schemes, and sequence lengths and compositions. The use of such look-ups is known to introduce errors, which compromise the significance assessment of a remote homology relationship. One solution is to estimate the background distribution for each pair of interest by generating a large number of sequence shuffles and use the distribution of their scores to approximate the parameters of the underlying extreme value distribution. This is computationally very expensive, as a large number of shuffles are needed to precisely estimate the score statistics.
Results: Convergent Island Statistics (CIS) is a computationally efficient solution to the problem of calculating the Gumbel distribution parameters for an arbitrary pair of sequences and an arbitrary set of gap and scoring schemes. The basic idea behind our method is to recognize the lack of similarity for any pair of sequences early in the shuffling process and thus save on the search time. The method is particularly useful in the context of profileprofile alignment algorithms where the normalization of alignment scores has traditionally been a challenging task.
Contact: aleksandar{at}eidogen.com
Supplementary information: http://www.eidogen-sertanty.com/Documents/convergent_island_stats_sup.pdf
| INTRODUCTION |
|---|
|
|
|---|
Sequence homology detection algorithms employ rare event statistical approaches to estimate the significance of an alignment between two gene or protein sequences. For ungapped local alignments and in the asymptotic limit of long sequences, it is well established that the alignment scores follow an extreme value distribution described by two parameters
and K (Altschul et al., 1990; 1997; Dembo et al., 1994). For profile-based algorithms, with or without gaps, it has been conjectured that the score distribution is still of the Gumbel form. However, estimating alignment score significance based on a single or a finite set of Gumbel parameters can introduce errors. In practice
may vary by >10% from one pair of sequences to another, due to variations in sequence composition (Altschul et al., 2001). On the other hand, for marginally significant alignments, even a 4% error in
leads to an error in E-value greater than a factor of 2.7 (Altschul et al., 2001). Various approaches to addressing sequence specific features during the score normalization have been used in lieu of a rigorous theory. In PSI-BLAST, lengths are dealt with using edge-effect correction (Altschul and Gish, 1996) and composition-based effects are dealt with using composition-based statistics (Schaffer et al., 2001). There are other efficient methods not implemented in BLAST (Mott and Tribe, 1999; Mott, 2000) that estimate length- and composition-dependent statistical parameters without further simulation. In contrast to profile-sequence methods, profileprofile alignment algorithms still lack a fast and accurate assessment of the score statistics. This is in particular true for algorithms that use structural information, position specific gaps and various other constraints in the alignment process. Accurate score normalization in these methods is very important because there is plenty of evidence suggesting that profileprofile algorithms can recognize some extremely distant sequence relationships. In the last CASP experiment, for example, profileprofile methods were among the top performers across all categories.
Many of the existing profileprofile alignment methods use Z-score statistics to measure the score significance (Rychlewski et al., 2000; Ginalski et al., 2003). Other methods normalize the alignment scores by assessing their locations in a fixed Gumbel distribution. The first approach is limited in the number of shuffles that can be employed in order to process the database in a timely fashion. It also makes the wrong assumption about the Gaussian form of the underlying score distribution. The second technique is extremely fast, but it does not address the profiles' lengths and the composition bias. COMPASS (Sadreyev and Grishin, 2003) generalizes the PSI-BLAST approach in computing alignment score significance. However, the method is designed to work with the COMPASS alignment algorithm, i.e. for its specific scoring function and gap penalties.
Recently, methods that use scores for local alignment islands have been described (Olsen et al., 1999; Altschul et al., 2001). The so-called island method overcomes many of the bottlenecks of the earlier methods. But, despite its favorable properties, it alone still does not allow for quick estimates of
and K in the context of a large database search. In the island method, independent of how similar sequences are, they are repeatedly shuffled and aligned in order to produce a large number of island scores needed to derive the two Gumbel parameters.
Our method, Convergent Island Statistics (CIS), builds upon the island method, but is able to recognize the lack of sequence similarity early in the shuffling process and thus save on the search time. Full blown shuffling is needed only when there is enough evidence that the sequences are related. Convergent Island Statistics has the following properties:
- It is reasonably fast. Profileprofile alignment algorithms incorporating CIS are able to process standard databases like PFAM or PDB in real time.
- Since it is based on sequence shuffles, the estimation of distribution parameters in CIS takes account of sequence (profile) lengths and compositions.
- CIS provides an explicit, analytical tradeoff between the algorithm's speed and sensitivity.
- It uses no lookup tables and contains no parameters to optimize.
- It estimates score statistics on-the-fly and therefore it can be readily applied to a broad class of local alignment algorithms, without pre-processing of the data of any kind.
Background
Statistics of sequence alignment scores have been extensively studied (Dembo et al., 1994; Altschul and Gish, 1996; Mott and Tribe, 1999; Mott, 2000; Karlin and Altschul, 1990; 1993; Pearson, 1998; Collins et al., 1988; Mott, 1992; Waterman and Vingron, 1994a, b). For sequencesequence alignments lacking gaps, the expected number of locally optimal sub-alignments with a score of at least x is approximately Poisson distributed with mean value E,
![]() | (1) |
is the scale parameter for the scoring system. Equation (1) implies that the probability of finding exactly k alignments with score
x is
![]() | (2) |
![]() | (3) |
, K) in order to specify the underlying distribution parameters. Note that the accurate estimates of
are much more important than those of K, since
enters Equation (1) exponentially.
Island Statistics
Recently, Olsen et al. (1999) proposed a computationally efficient method for determining
and K using scores for local alignment islands. The value of each cell in a SmithWaterman matrix corresponds to the highest scoring local alignment that ends at that cell. The local alignment starts at a so-called anchor cell, and an island is defined to be the set of all cells that have the same anchor. The score of an island is the maximum score of the cells it contains. A simple modification of the SmithWaterman algorithm involving a small extra computation per cell allows one to keep track of the anchor cells, as well as the island end-points and their scores. Since island scores are scores of distinct sub-optimal alignments, Equation (1) describes well the number of islands with a score of at least x, and is increasingly accurate for larger values of x. Thus, the precise estimates of
and K may be obtained by considering those islands with scores of at least some threshold value c. For the case of discrete alignment scores, the maximum likelihood estimate of
is
![]() | (4) |
![]() | (5) |
c} and N = | Ic | (Altschul et al., 2001).
In the case of continuous alignment scores,
![]() | (6) |
![]() | (7) |
. | SYSTEMS AND METHODS |
|---|
|
|
|---|
It has been shown that the island method has a speed advantage over the direct shuffling method. For recommended asymptotic parameter estimation, the speed advantage of the island method even approaches an order of magnitude (Altschul et al., 2001). Yet, in the context of a database search it is still too time-consuming to re-estimate
and K for each sequence pair of potential interest. One accurate estimate of the two parameters would require as much time as searching a typical current database with a standard heuristic method. Convergent Island Statistics is capable of quickly recognizing significant matches and estimating the score distribution only in the case where there is evidence that the two sequences are related. Although our method is applicable to a more general setting, for the sake of simplicity we will only describe the version that is an enhancement of the island statistics method. The main idea is based upon the following two observations:
- If
1
2 and K1
K2 then p(x|
2,K1)
p(x|
1,K2).
- The distribution of
is approximately normal with mean 1 and standard deviation
. The distribution of
is approximately normal with mean 1 and standard deviation
(see the Supplementary material).
2,K1, then it is not statistically significant with respect to EVD
1,K2 for any
1
2 and K1
K2. On the other hand, observation 2 allows one to estimate and control the probability that the true
is within an interval that depends on sampled
and the number of islands used to estimate
. The same argument can be applied to the other parameter K. Observation 2 further implies that the confidence intervals corresponding to the jth multiple of the standard deviation
are respectively
![]() | (8) |
![]() | (9) |
![]() | (10) |
![]() |
![]() | (11) |
![]() |
| ALGORITHM |
|---|
|
|
|---|
Suppose that we are given a sequence Sq and we want to search a (possibly large) database
for sequences similar to Sq. Assume that we are only interested in the matches with the P-value below a certain cutoff p0 and we want to precisely estimate the P-value for every such match. For each target sequence, our algorithm is a series of steps s1,..., sk, where each step si may be described as follows.
Shuffle the sequences (or columns of both sequential profiles) as many times as needed to generate Ni islands. Note that shuffling randomizes the order of the symbols in a sequence without changing the sequence's composition. Calculate
and
. If
(according to observation 1) discard the score as insignificant and proceed to the next sequence. Otherwise, go to the next step. If reached, the last step k ends with computing the final P-value from the distribution with parameters
and
which are estimated from Nk islands.
The pseudo-code for the CIS algorithm is given below.
for r1 to R do
for i
1 to k do
while(number of islands < Ni)
shuffle Sq
shuffle
align shuffled sequences and extract more islands;
end-while
compute
, and
if Pval > p0 then
print No similarity found.
exit inner for-loop (go to the next sequence)
else if i == k
print alignment between Sq and
print Pval.
end-if
end-for
end-for
The number of islands Ni in our algorithm increases at each step (N1 <
< Nk). This allows for more precise estimates of the parameters compared to the estimates in the previous step, but it also requires more CPU time.
Analysis
The algorithm's accuracy is defined as the percentage of true positives (i.e. those found by the regular island method to have the P-value <p0) that are not discarded in any of the algorithm's steps. The accuracy depends on both j (the number of standard deviations above the mean in the distributions of
and
) and k (the number of algorithm steps). For example, j = 3 corresponds to
99.86% one-tailed confidence interval in the Gaussian distribution, which means that Equations (8) and (9) are each true
99.86% of the time. But, since both
and
are required to pass the tests, assuming (wrongly but conservatively) independence of
and
, the confidence drops to about 99.7% (in other words, the chances of keeping a significant hit in any pass of the algorithm are at least 99.7%). Also,
and
have to pass the test k separate times. Assuming (again wrongly but conservatively) independence of the k tests, the confidence decreases accordingly. For j = 3 and k = 3, the confidence interval is at least 99.1%. Figure 1 shows the relationship between j, k and the accuracy of the parameter estimates.
|
In the case of BLOSUM62 substitution matrix with affine gap scores of (11 + k) for gaps of length k, the best cutoff score for saving the alignment islands is around c = 28 (Altschul et al., 2001). In general, for an arbitrary alignment algorithm (including profile-based methods), a cutoff score of c = const * m can be used, where m is the largest positive entry of the scoring matrix (Olsen et al., 1999).
As in the regular island method, the estimates of statistical parameters in CIS can be made free of edge-effect bias by extending the dynamic programming matrix and considering only those islands that are anchored within the central region of the matrix (Altschul et al., 2001).
To test the effectiveness of our method, we have implemented a simple version of the SmithWaterman algorithm and compared the performance of CIS and the regular island method on Lindahl's dataset (Lindahl and Elofssons, 2000). The algorithm uses the BLOSUM62 substitution matrix in conjunction with affine gap scores of (11 + k) for gaps of length k, and the island threshold score of c = 28. To reduce edge-effects, each dynamic programming matrix is surrounded by a border of width 100, filled with the randomly chosen scores from the central area of the matrix (Altschul et al., 2001). For the sake of simplicity and consistency, we assume continuous alignment scores. However, the results should be comparable to those obtained with discrete alignment scores, as the two distributions of ML estimates are very similar for this algorithm's setting (Altschul et al., 2001). The standard error of
in case of continuous scores is
and the standard error for discrete scores is
(Altschul et al., 2001). In our experiment, the island method is set to generate at least 10 000 islands for each pair of sequences [corresponding to standard error in
of about 1% (Altschul et al., 2001)]. The numbers of islands assigned to various steps of the CIS method are 100, 400 and 10 000, corresponding to standard errors of about 10, 5 and 1%, respectively. The P-value cutoff of p0 = 2e 6 is chosen to correspond to the E-value cutoff of E0
1, which is often the default value in database search algorithms. As seen in Table 1, the CIS method is, on average, about 90 times faster then the regular island method.
|
The actual speed advantage of our method also depends on the P-value cutoff p0 set in the search. In other words, the speed advantage is higher if one is not interested in weak matches and it is increasing as the P-value cutoff is lowered. Figure 2 shows the speed advantage of CIS over the regular island method using various threshold values for E0.
|
It should be noted that the speed will not vary much as the significance threshold is reduced if the database contains many true positives. For example, if a query belongs to an abundantly represented superfamily, CIS will still be slow, as every database hit will be thoroughly evaluated by all steps of the algorithm. Also, in its present form, the algorithm creates a completely new set of islands in each step (i.e. islands from step si are not used in step si+1). Thus, an additional speed gain may be obtained by saving the islands from each pass and using them in subsequent steps of the algorithm.
It is easy to see that the algorithm's speed can be further increased by filtering out the database hits based on a background distribution defined by some fixed, pre-computed, conservative estimates of the parameters
and K. Table 2 shows the speed comparison between the CIS method described above and the same method applied in conjunction with the filtering of the database hits based on various conservative estimates of
and K. However, despite the fact that the latter version of CIS has an additional speed advantage over the regular CIS method, this particular approach does not provide on-the-fly statistics, as an additional effort is needed to pre-compute the estimates of the two distribution parameters.
|
| DISCUSSION AND CONCLUSION |
|---|
|
|
|---|
In the case of sequencesequence alignments that are not allowed to contain gaps, the parameters of the score distribution may be calculated analytically. Although no rigorous analytical theory has been developed for profile-based method, the score normalization problem in profile-sequence methods has been extensively studied and efficiently addressed. Profileprofile alignment algorithms still lack fast and accurate score statistics. To account for the rich profile content, structural constraints and different gap models, one has to employ a brute force method of doing extensive random shuffles for every pair of profiles of interest. However, the sequence databases have grown in size enormously over the last few years, making the brute force approach computationally prohibitive.
The method we propose is able to recognize the lack of similarity for any pair of sequences early in the shuffling process and thus save on the search time. Any given sequence will typically have a small number of significant hits in a representative, large database, so the vast percentage of comparisons will be computed very efficiently. On the other hand, if the sequences are related, our method approaches the complexity of the brute force method of doing extensive random shuffles and therefore is able to recognize and precisely estimate the statistics of every such pair.
Convergent Island Statistics can be readily applied to any alignment algorithm whose background scores follow an extreme value distribution. The method contains no parameters to optimize and there is no need for fitting the data of any kind. The CIS method was particularly useful to our group in the preparation for the CASP6 structure prediction experiments (http://predictioncenter.llnl.gov). For CASP6 we needed to test the performance of different profileprofile search strategies by frequently changing the method's parameters, such as gap penalties, scoring schemes and weights on various other input data. Having a method for computing on-the-fly statistics proved to be very convenient. It would be almost impossible to test and validate various theoretical and heuristic approaches to database search if we had to re-estimate the score statistics each time we switch from one algorithm's setting to another. All three of our automated servers in CASP6 and CAFASP4 (http://www.cs.bgu.ac.il/~dfischer/CAFASP4/alev.html) experiments, namely Eidogen-EXPM, Eidogen-BNMX and Eidogen-SFST, used CIS to estimate the significance of database hits. However, the detailed description of the algorithms as well as their CASP and CAFASP performance is beyond the scope of this paper and will be published elsewhere.
Received on March 2, 2005; revised on April 4, 2005; accepted on April 4, 2005
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F. and Gish, W. (1996) Local alignment statistics. Methods Enzymol, 266, 460480[Web of Science][Medline].
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403410[CrossRef][Web of Science][Medline].
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 33893402
Altschul, S.F., et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res., 29, 351361
Collins, J.F., et al. (1988) The significance of protein sequence similarities. Comput. Appl. Biosci, 4, 6771
Dembo, A., et al. (1994) Critical phenomena for sequence matching with scoring. Ann. Prob., 22, 19932021.
Ginalski, K., et al. (2003) Detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res., 31, 38043807
Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87, 22642268
Karlin, S. and Altschul, S.F. (1993) Aplications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl Acad. Sci. USA, 90, 58735877
Lindahl, E. and Elofsson, A. (2000) Identification of related proteins on family, superfamily and fold level. J. Mol. Biol., 295, 613625[CrossRef][Web of Science][Medline].
Mood, A.M., Graybill, F.A., Boes, D.C. Introduction to the Theory of Statistics, (1974) 3rd ed. McGraw-Hill.
Mott, R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. J. Mol. Biol., 300, 649659[CrossRef][Web of Science][Medline].
Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of SmithWaterman local sequence similarity scores. Bull. Math. Biol, 54, 5975[Web of Science].
Mott, R. and Tribe, R. (1999) Approximate statistics of gapped alignments. J. Comput. Biol., 6, 91112[Web of Science][Medline].
Olsen, R., Bundschuh, R., Hwa, T. (1999) Rapid assessment of extremal statistics for gapped local alignment. In Lengauer, T., Schneider, R., Bork, P., Brutlag, D., Glasgow, J., Mewes, H.-W., Zimmer, R. (Eds.). Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology., , Menlo Park, CA AAAI Press, pp. 211222.
Pearson, W.R. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 7184[CrossRef][Web of Science][Medline].
Rychlewski, L., et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci., 9, 232241[Web of Science][Medline].
Sadreyev, R.I. and Grishin, N.V. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317336[CrossRef][Web of Science][Medline].
Schaffer, A.A., et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 29943005
Waterman, M.S. and Vingron, M. (1994a) Sequence comparison significance and Poisson approximation. Statist. Sci, 9, 367381.
Waterman, M.S. and Vingron, M. (1994b) Rapid and accurate estimates of statistical significance for sequence database searches. Proc. Natl Acad. Sci. USA, 91, 46254628
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||













1 to R do


