Bioinformatics Advance Access originally published online on January 6, 2008
Bioinformatics 2008 24(4):577-578; doi:10.1093/bioinformatics/btm594
FAST: Fourier transform based algorithms for significance testing of ungapped multiple alignments
1University of Maryland, College Park, MD-20740 and 2Cornell University, Ithaca, NY-14850, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: As was shown in Nagarajan et al. (2005), commonly used approximations for assessing the significance of multiple alignments can be be very inaccurate. To address this, we present here the FAST package, an open-source collection of programs and libraries for efficiently and reliably computing the significance of ungapped local alignments. We also describe other potential applications in Bioinformatics where these programs can be adapted for significance testing.
Availability: The FAST package includes C++ implementations of various algorithms that can be used as stand-alone programs or as a library of subroutines. The package and a web-server for some of the programs are available at www.cs.cornell.edu/~keich/FAST
Contact: keich{at}cs.cornell.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Assessing the significance of a local multiple alignment can be critical to discriminating between an alignment that is simply a random artifact of the input as opposed to a biologically relevant one. For example, such analysis is commonly used in motif-finding tools to assess the significance of motifs found in vivo (Bailey and Elkan, 1994; Hertz and Stormo, 1999). Recently, Nagarajan et al. (2006), demonstrated that it can also aid in designing more sensitive motif-finders. Other areas where such analysis has been found to be valuable include repeat finding (Darling et al., 2006) and assessing the reliability of alignments (Sadreyev and Grishin, 2004). Consequently, reliable programs to assess the significance of ungapped multiple alignments can be a valuable general-purpose tool in a bioinformatician's toolbox.
In the study of ungapped multiple alignments, their goodness is typically summarized by the information content, or entropy (Hertz and Stormo, 1999) of the alignment. For an alignment of n sequences of length L, over an alphabet of A letters, the entropy score is defined as being
where I(i) is the entropy score for the ith column, i.e.
|
|
A common framework to compute the p-value is as follows: let p denote the probability mass function (pmf) of the column score I(i) under the hypothesis that the column is noise—in the sense that it was sampled from the multinomial distribution described by the background probabilities. Assuming that the entropy score for each of the L columns in the alignment is an independent random variable, the pmf of the alignment's total entropy score I is given by the L-fold convolution of p :
|
|
s0, which is prohibitively expensive in practice because of the large number of possible values of s. As a result, multiple alignment programs have relied on approximations to compute the p-value, striving for a balance between the time spent computing and the accuracy of the result. However, as shown by Nagarajan et al. (2005), two of the popular approximations used in the motif-finders MEME (Bailey and Elkan, 1994) and Consensus (Hertz and Stormo, 1999) can produce very inaccurate results. The sFFT and csFFT algorithms (Nagarajan et al., 2005) on the other hand can reliably compute these p-values and are very efficient in practice. Here, we present the FAST package that includes various algorithms for reliably computing the significance of ungapped local alignments. In particular, it includes C++ implementations for sFFT and csFFT that can be used as stand-alone programs or as part of a library. In addition, the package also includes a new version of csFFT (refine-csFFT) that as described in Section 2.1 can automatically adjust its runtime to achieve a user-specified level of accuracy. Finally, the package includes an implementation of the memo-sFFT algorithm (Nagarajan et al., 2006) (see Section 2.3). The memo-sFFT algorithm enables the design of motif-finders that score motifs based on p-values without paying a significant time penalty (Nagarajan et al., 2006). The resulting motif-finders demonstrate increased sensitivity compared to existing programs in several motif-finding tasks.
| 2 METHODS |
|---|
|
|
|---|
2.1 The refine-csFFT algorithm
As described in the previous section, the p-value for a given score s0 can easily be computed if p*L the L-fold convolution of the pmf p is known. The csFFT algorithm (Nagarajan et al., 2005) relies on Fast Fourier Transform (FFT) based cyclic convolutions on a lattice to approximate this p-value. For doing so, the algorithm conservatively guesses a lattice size for the cyclic convolution, where larger lattice sizes tend to give more accurate results. However, the tradeoff is that larger lattice sizes also lead to slower running times. In the approach proposed in Nagarajan et al. (2005), the lattice size is fixed throughout the computation and there is no direct mechanism to set it based on a user specified level of accuracy. The FAST package includes an alternative novel approach (refine-csFFT) that dynamically tunes the lattice size, based on user input for the desired minimum accuracy of the computed p-values. This goal is achieved by starting with a small lattice size to compute the p-value and then increasing it as required based on a test for the accuracy of the computed p-value. Since the correct p-value is not known during this process, the accuracy of computed p-values cannot be directly calculated. Instead, the agreement between successively computed p-values can be used to estimate this quantity. To make this approach competitive with one that simply computes the p-value with the right lattice size, we also need a procedure to increase the lattice size with little or no overhead. The refine-csFFT algorithm is essentially a variant of csFFT that implements these two ideas. In each step, refine-csFFT increases the lattice size by a factor of two using the following identity to avoid recomputing values in the FFT calculations:
|
|
Q that is padded with zeros when needed. Details of the refine-csFFT algorithm can be found in the documentation accompanying the implementation. The reliability of refine-csFFT was tested and confirmed (i.e. it meets the user-specified accuracy threshold) over a wide range of parameters (see Supplementary Table 1 and 2). In addition, the runtime for refine-csFFT was found to be nearly identical to that of csFFT in all cases (where for a fair comparison, we set the lattice size for csFFT to the final lattice size determined by refine-csFFT and also used the same FFT libraries).
2.2 Other applications for sFFT and csFFT
While sFFT and csFFT were originally designed for computing the p-value for the entropy score, we believe that the general technique can be applied in many other areas where convolutions are needed to compute the p-value. The implementations for sFFT and csFFT, provided as part of the FAST package, could therefore be applied in such applications with little or no change. For example, in the analysis of high-throughput DNA copy number data, a common task is to identify a genomic region where the copy number for a test group (for e.g. breast cancer samples) is significantly different from that for normal samples. If the average copy number of the test group for a region is used as a test-statistic then a p-value can be calculated to assess its significance in a fashion analogous to that used for the entropy score. Another, possible application is in the area of detecting mis-assemblies of paired-read shotgun sequence data. As proposed by some researchers (http://www.genome.umd.edu/reconciliation.htm), regions where the distance between many paired reads is substantially different from that expected for the library are indicative of mis-assembly. The significance of this discrepancy can also be computed using a procedure similar to that used for the entropy score. We hope that the availability of open-source implementations of csFFT and sFFT as part of the FAST package will spur the use of this framework to accurately and efficiently compute p-values in these and other applications.
2.3 Improving motif-finders using p-values
In motif-finding tasks where the dimensions of the sought for motif are unknown, using the p-value [or more precisely the E-value as defined by Hertz and Stormo (1999)] as the score to be optimized can be valuable. In the case of motifs of unknown width, this was shown to be the case by Nagarajan et al. (2006). However, in order to design motif-finders that optimize for p-values, even faster tools are needed to compute the p-value. This is because while traditional scores for evaluating motifs consume a few machine cycles, reliably computing p-values is a much slower operation (typical runtimes are of the order of 0.01 s with a relatively fast csFFT implementation). Since in a typical motif-finding task, thousands of motifs need to be compared, using p-values can slow a motif-finder down significantly. Fortunately, we can exploit the fact that algorithms such as sFFT can reliably compute a range of p-values in one invocation. The memo-sFFT algorithm (Nagarajan et al., 2006) relies on selectively memorizing values computed using sFFT to provide an efficient interface for implementing motif-finders that optimize p-values. The interface for memo-sFFT is illustrated using sample code in Supplementary Figure 1. Also, based on extensive tests we found that memo-sFFT provides a reliable estimate of the p-value over a wide range of parameter values (see Supplementary Figure 1).
The memo-sFFT package has already been incorporated into two motif-finders, conspv, that is similar to CONSENSUS (Hertz and Stormo, 1999) and gibbspv, a gibbs-sampling based program, that both use p-values to effectively search for motifs of unknown width. Using, memo-sFFT in these programs brings down the amortized cost of computing p-values to essentially that of a table-lookup enabling them to efficiently search for motifs. We believe therefore, that its availability as part of the FAST package can be a useful tool for designers of new motif-finders.
| Acknowledgments |
|---|
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on September 10, 2007; revised on November 7, 2007; accepted on November 27, 2007
| REFERENCES |
|---|
|
|
|---|
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. (1994) Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. GA, USA: Menlo Park. 28–36.
Darling AE, et al. Procrastination leads to efficient filtration for local multiple alignment. (2006) Proceedings of the Sixth Workshop on Algorithms in Bioinformatics: Zurich, Switzerland.
Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (1999) 15:563–577.
Nagarajan N, Jones N, Keich U. Computing the P-value of the information content from an alignment of multiple sequences. Bioinformatics (2005) 21(Suppl. 1):i311–i318.[Abstract]
Nagarajan N, et al. Refining motif finders with E-value calculations. (2006) Proceedings of the Third Annual RECOMB Satellite Workshop on Regulatory Genomics. Singapore.
Sadreyev RI, Grishin NV. Estimates of statistical significance for comparison of individual positions in multiple sequence alignments. BMC Bioinformatics (2004) 5.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
