Skip Navigation


Bioinformatics Advance Access originally published online on January 6, 2008
Bioinformatics 2008 24(4):577-578; doi:10.1093/bioinformatics/btm594
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/4/577    most recent
btm594v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nagarajan, N.
Right arrow Articles by Keich, U.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nagarajan, N.
Right arrow Articles by Keich, U.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

FAST: Fourier transform based algorithms for significance testing of ungapped multiple alignments

Niranjan Nagarajan 1,* and Uri Keich 2

1University of Maryland, College Park, MD-20740 and 2Cornell University, Ithaca, NY-14850, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 REFERENCES
 

Summary: As was shown in Nagarajan et al. (2005), commonly used approximations for assessing the significance of multiple alignments can be be very inaccurate. To address this, we present here the FAST package, an open-source collection of programs and libraries for efficiently and reliably computing the significance of ungapped local alignments. We also describe other potential applications in Bioinformatics where these programs can be adapted for significance testing.

Availability: The FAST package includes C++ implementations of various algorithms that can be used as stand-alone programs or as a library of subroutines. The package and a web-server for some of the programs are available at www.cs.cornell.edu/~keich/FAST

Contact: keich{at}cs.cornell.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 REFERENCES
 
Assessing the significance of a local multiple alignment can be critical to discriminating between an alignment that is simply a random artifact of the input as opposed to a biologically relevant one. For example, such analysis is commonly used in motif-finding tools to assess the significance of motifs found in vivo (Bailey and Elkan, 1994; Hertz and Stormo, 1999). Recently, Nagarajan et al. (2006), demonstrated that it can also aid in designing more sensitive motif-finders. Other areas where such analysis has been found to be valuable include repeat finding (Darling et al., 2006) and assessing the reliability of alignments (Sadreyev and Grishin, 2004). Consequently, reliable programs to assess the significance of ungapped multiple alignments can be a valuable general-purpose tool in a bioinformatician's toolbox.

In the study of ungapped multiple alignments, their ‘goodness’ is typically summarized by the information content, or entropy (Hertz and Stormo, 1999) of the alignment. For an alignment of n sequences of length L, over an alphabet of A letters, the entropy score is defined as being Formula where I(i) is the entropy score for the ith column, i.e.


Formula

and nij is the number of occurrences of the jth letter in the ith column of the alignment and bj is the background frequency of the jth letter. While the entropy score can be used to rank alignments of the same dimension, it cannot be used to fairly compare two alignments of varying L and n. In addition, it does not provide any direct information about an alignment's significance. So, to assess the significance of an alignment with entropy score s0, we rely on its p-value, which is the probability of seeing a score of s0 or better under the assumption that each of the L columns has n letters independently sampled according to the background distribution {b1, ... , bA}. The computed p-value is then interpreted as follows: if the value is near 1 then the columns in the alignment are too similar to the background for the alignment to be considered interesting. However, if the p-value is sufficiently small then the alignment is unlikely to be a random artifact and is therefore considered significant.

A common framework to compute the p-value is as follows: let p denote the probability mass function (pmf) of the column score I(i) under the hypothesis that the column is noise—in the sense that it was sampled from the multinomial distribution described by the background probabilities. Assuming that the entropy score for each of the L columns in the alignment is an independent random variable, the pmf of the alignment's total entropy score I is given by the L-fold convolution of p :


Formula

The p-value for an alignment with score s0 is then given by Formula . Unfortunately, to naively compute this requires traversing all s ≥ s0, which is prohibitively expensive in practice because of the large number of possible values of s. As a result, multiple alignment programs have relied on approximations to compute the p-value, striving for a balance between the time spent computing and the accuracy of the result. However, as shown by Nagarajan et al. (2005), two of the popular approximations used in the motif-finders MEME (Bailey and Elkan, 1994) and Consensus (Hertz and Stormo, 1999) can produce very inaccurate results. The sFFT and csFFT algorithms (Nagarajan et al., 2005) on the other hand can reliably compute these p-values and are very efficient in practice.

Here, we present the FAST package that includes various algorithms for reliably computing the significance of ungapped local alignments. In particular, it includes C++ implementations for sFFT and csFFT that can be used as stand-alone programs or as part of a library. In addition, the package also includes a new version of csFFT (refine-csFFT) that as described in Section 2.1 can automatically adjust its runtime to achieve a user-specified level of accuracy. Finally, the package includes an implementation of the memo-sFFT algorithm (Nagarajan et al., 2006) (see Section 2.3). The memo-sFFT algorithm enables the design of motif-finders that score motifs based on p-values without paying a significant time penalty (Nagarajan et al., 2006). The resulting motif-finders demonstrate increased sensitivity compared to existing programs in several motif-finding tasks.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 REFERENCES
 
2.1 The refine-csFFT algorithm
As described in the previous section, the p-value for a given score s0 can easily be computed if p*L the L-fold convolution of the pmf p is known. The csFFT algorithm (Nagarajan et al., 2005) relies on Fast Fourier Transform (FFT) based cyclic convolutions on a lattice to approximate this p-value. For doing so, the algorithm conservatively guesses a lattice size for the cyclic convolution, where larger lattice sizes tend to give more accurate results. However, the tradeoff is that larger lattice sizes also lead to slower running times. In the approach proposed in Nagarajan et al. (2005), the lattice size is fixed throughout the computation and there is no direct mechanism to set it based on a user specified level of accuracy. The FAST package includes an alternative novel approach (refine-csFFT) that dynamically tunes the lattice size, based on user input for the desired minimum accuracy of the computed p-values. This goal is achieved by starting with a small lattice size to compute the p-value and then increasing it as required based on a test for the accuracy of the computed p-value. Since the correct p-value is not known during this process, the accuracy of computed p-values cannot be directly calculated. Instead, the agreement between successively computed p-values can be used to estimate this quantity. To make this approach competitive with one that simply computes the p-value with the ‘right’ lattice size, we also need a procedure to increase the lattice size with little or no overhead. The refine-csFFT algorithm is essentially a variant of csFFT that implements these two ideas. In each step, refine-csFFT increases the lattice size by a factor of two using the following identity to avoid recomputing values in the FFT calculations:


Formula

where Ds is the Discrete Fourier Transform operator for a vector of size s and x is a vector of size ≤ Q that is padded with zeros when needed. Details of the refine-csFFT algorithm can be found in the documentation accompanying the implementation. The reliability of refine-csFFT was tested and confirmed (i.e. it meets the user-specified accuracy threshold) over a wide range of parameters (see Supplementary Table 1 and 2). In addition, the runtime for refine-csFFT was found to be nearly identical to that of csFFT in all cases (where for a fair comparison, we set the lattice size for csFFT to the final lattice size determined by refine-csFFT and also used the same FFT libraries).

2.2 Other applications for sFFT and csFFT
While sFFT and csFFT were originally designed for computing the p-value for the entropy score, we believe that the general technique can be applied in many other areas where convolutions are needed to compute the p-value. The implementations for sFFT and csFFT, provided as part of the FAST package, could therefore be applied in such applications with little or no change. For example, in the analysis of high-throughput DNA copy number data, a common task is to identify a genomic region where the copy number for a test group (for e.g. breast cancer samples) is significantly different from that for normal samples. If the average copy number of the test group for a region is used as a test-statistic then a p-value can be calculated to assess its significance in a fashion analogous to that used for the entropy score. Another, possible application is in the area of detecting mis-assemblies of paired-read shotgun sequence data. As proposed by some researchers (http://www.genome.umd.edu/reconciliation.htm), regions where the distance between many paired reads is substantially different from that expected for the library are indicative of mis-assembly. The significance of this discrepancy can also be computed using a procedure similar to that used for the entropy score. We hope that the availability of open-source implementations of csFFT and sFFT as part of the FAST package will spur the use of this framework to accurately and efficiently compute p-values in these and other applications.

2.3 Improving motif-finders using p-values
In motif-finding tasks where the dimensions of the sought for motif are unknown, using the p-value [or more precisely the E-value as defined by Hertz and Stormo (1999)] as the score to be optimized can be valuable. In the case of motifs of unknown width, this was shown to be the case by Nagarajan et al. (2006). However, in order to design motif-finders that optimize for p-values, even faster tools are needed to compute the p-value. This is because while traditional scores for evaluating motifs consume a few machine cycles, reliably computing p-values is a much slower operation (typical runtimes are of the order of 0.01 s with a relatively fast csFFT implementation). Since in a typical motif-finding task, thousands of motifs need to be compared, using p-values can slow a motif-finder down significantly. Fortunately, we can exploit the fact that algorithms such as sFFT can reliably compute a range of p-values in one invocation. The memo-sFFT algorithm (Nagarajan et al., 2006) relies on selectively memorizing values computed using sFFT to provide an efficient interface for implementing motif-finders that optimize p-values. The interface for memo-sFFT is illustrated using sample code in Supplementary Figure 1. Also, based on extensive tests we found that memo-sFFT provides a reliable estimate of the p-value over a wide range of parameter values (see Supplementary Figure 1).

The memo-sFFT package has already been incorporated into two motif-finders, conspv, that is similar to CONSENSUS (Hertz and Stormo, 1999) and gibbspv, a gibbs-sampling based program, that both use p-values to effectively search for motifs of unknown width. Using, memo-sFFT in these programs brings down the amortized cost of computing p-values to essentially that of a table-lookup enabling them to efficiently search for motifs. We believe therefore, that its availability as part of the FAST package can be a useful tool for designers of new motif-finders.


    Acknowledgments
 
Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on September 10, 2007; revised on November 7, 2007; accepted on November 27, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 REFERENCES
 

    Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. (1994) Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. GA, USA: Menlo Park. 28–36.

    Darling AE, et al. Procrastination leads to efficient filtration for local multiple alignment. (2006) Proceedings of the Sixth Workshop on Algorithms in Bioinformatics: Zurich, Switzerland.

    Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (1999) 15:563–577.[Abstract/Free Full Text]

    Nagarajan N, Jones N, Keich U. Computing the P-value of the information content from an alignment of multiple sequences. Bioinformatics (2005) 21(Suppl. 1):i311–i318.[Abstract]

    Nagarajan N, et al. Refining motif finders with E-value calculations. (2006) Proceedings of the Third Annual RECOMB Satellite Workshop on Regulatory Genomics. Singapore.

    Sadreyev RI, Grishin NV. Estimates of statistical significance for comparison of individual positions in multiple sequence alignments. BMC Bioinformatics (2004) 5.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/4/577    most recent
btm594v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nagarajan, N.
Right arrow Articles by Keich, U.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nagarajan, N.
Right arrow Articles by Keich, U.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?