Skip Navigation


Bioinformatics Advance Access originally published online on February 26, 2009
Bioinformatics 2009 25(8):1084-1085; doi:10.1093/bioinformatics/btp112
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
25/8/1084    most recent
btp112v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Homann, R.
Right arrow Articles by Rehmsmeier, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Homann, R.
Right arrow Articles by Rehmsmeier, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

mkESA: enhanced suffix array construction tool

Robert Homann 1,2,*, David Fleer 2, Robert Giegerich 2 and Marc Rehmsmeier 3,{dagger}

1International NRW Graduate School in Bioinformatics and Genome Research, Center for Biotechnology (CeBiTec), Bielefeld University, 33594 Bielefeld, 2Technische Fakultät, Bielefeld University, Postfach 100 131, 33501, Bielefeld, Germany and 3GMI - Gregor Mendel Institute of Molecular Plant Biology GmbH, Dr. Bohr-Gasse 3, 1030 Vienna, Austria

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 RUNTIME BENCHMARKS
 4 CONCLUSION
 REFERENCES
 

Summary: We introduce the tool mkESA, an open source program for constructing enhanced suffix arrays (ESAs), striving for low memory consumption, yet high practical speed. mkESA is a user-friendly program written in portable C99, based on a parallelized version of the Deep-Shallow suffix array construction algorithm, which is known for its high speed and small memory usage. The tool handles large FASTA files with multiple sequences, and computes suffix arrays and various additional tables, such as the LCP table (longest common prefix) or the inverse suffix array, from given sequence data.

Availability: The source code of mkESA is freely available under the terms of the GNU General Public License (GPL) version 2 at http://bibiserv.techfak.uni-bielefeld.de/mkesa/.

Contact: rhomann{at}techfak.uni-bielefeld.de


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 RUNTIME BENCHMARKS
 4 CONCLUSION
 REFERENCES
 
The program mkESA is a software tool for constructing enhanced suffix arrays (ESAs) from biological sequence data. The ESA is an index data structure for textual data, introduced in Abouelhoda et al. (2004) as an extension of the well-known suffix array (Manber and Myers, 1993). The ESA is equivalent to the suffix tree, another very important, but more space consuming full-text index data structure (Gusfield, 1997). The major advantages of ESAs over suffix trees are their lower space overhead, improved locality of reference and simple storing to files.

A suffix array for text T of length n is a table of size n + 1 that lists the start positions of the suffixes of T in lexicographic order. Using a suffix array, exact string queries can be answered in O(m logn) time, where m is the length of the query, instead of O(m+n) time without a suffix array. ESAs are composed of a suffix array and additional tables that can be used to improve query performance [e.g. O(m+logn) time using the LCP table, called Hgt array in Manber and Myers (1993)], or enabling efficient implementation of more advanced queries (e.g. finding maximum unique matches). Thus, ESAs are fundamental technology in sequence analysis.

Many interesting problems on sequences from the field of computational biology can be solved efficiently by transforming sequence data into (enhanced) suffix arrays [see, for instance, (Beckstette et al., 2006; De Bona et al., 2008; Höhl et al., 2002; Krumsiek et al., 2007; Rahmann, 2003)]. Linear-time algorithms for suffix array construction have been proposed as well as algorithms that are fast in practice and/or tuned for space efficiency, rendering use of suffix arrays feasible for large datasets; see Puglisi et al. (2007) for a comprehensive overview. In addition, by the results of Abouelhoda et al. (2004), any program using suffix trees can be transformed so to employ ESAs instead and benefit from the advantages offered by that data structure.

Despite the great interest in suffix arrays in the literature, only few actual programs for ESA construction are available. Most existing programs are useful for mere suffix array construction, and do not address specificities of computational biology such as handling multiple sequences and very large datasets. A notable exception is the widely used mkvtree program (http://www.vmatch.de/). mkvtree can read common file formats such as FASTA and keeps sequences separated from their descriptions. An ESA generated by mkvtree may contain multiple sequences, stored so that a match can easily be mapped to its corresponding sequence. The program is available free of charge as part of the Vmatch package, but, unfortunately, in binary form and for non-commercial purposes only. This implies that software relying on mkvtree cannot be distributed easily since the terms of the Vmatch license agreement restrict the legal use of mkvtree. Software that requires using mkvtree also requires all users to obtain the Vmatch package, if available for their platform of choice, and have them sign a license agreement, too.

We have implemented the alternative open source software tool mkESA, using the Deep-Shallow algorithm (Manzini and Ferragina, 2004) for in-memory suffix array construction instead of multikey quicksort as used by mkvtree. Thus, mkESA is efficient even for highly repetitive sequence data, and is fast as long as all data can be held in main memory. As further improvement, our implementation of Deep-Shallow can use multiple CPUs for increased speed.


    2 IMPLEMENTATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 RUNTIME BENCHMARKS
 4 CONCLUSION
 REFERENCES
 
With mkvtree being the most widely spread program for ESA construction, we tried to pick up all of the important ideas implemented in mkvtree and improve upon its weaknesses. mkESA has been designed so to produce output as compatible with mkvtree as possible. The files generated by mkESA are in fact the same as those made by mkvtree, meaning that data produced by mkESA can be processed by programs that expect mkvtree-generated ESAs.

mkESA employs the ‘Deep-Shallow’ algorithm of Manzini and Ferragina (2004) for suffix array construction. This algorithm belongs to the family of ‘lightweight’ suffix sorting algorithms, covering algorithms that use only very small additional space besides the suffix array and the input text, i.e. only O((5+{varepsilon})n) bytes space for a text of length n, and using 32 bit integers for the suffix array. Our version of Deep-Shallow is multithreaded, i.e. the computational work for suffix sorting can be distributed over multiple CPUs or CPU cores. Since Deep-Shallow is not useful for building LCP tables as by-product of suffix sorting (as is the case with simple multikey quicksort), we use the space-efficient, linear-time algorithm of Manzini (2004) to construct LCP tables from suffix arrays. Moreover, mkESA can generate the inverse suffix array and the skip table (Beckstette et al., 2006). It is worth noting that mkESA can incrementally add additional tables when they are needed.


    3 RUNTIME BENCHMARKS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 RUNTIME BENCHMARKS
 4 CONCLUSION
 REFERENCES
 
We compared the performance of mkESA with other programs for suffix array construction, namely mkvtree, mksary 1.2.0 (http://sary.sourceforge.net/, included for its ability to run multithreaded), and Manzini's implementation of Deep-Shallow ds. We measured the time and space consumption for building suffix arrays from the datasets in Table 1, using memtime version 1.3. mkESA and mkvtree processed FASTA files, the other programs processed the bare sequence data with FASTA headers removed so that all programs had comparable workloads. Only ‘parallel mkESA’ and ‘parallel mksary’ (Table 2) made explicit use of multiple CPU cores. Measurements were taken on a Sun Fire X4450 (4 Intel Xeon CPUs at 2.93 GHz, 16 cores, 96 GB RAM) running Solaris 10. The programs were compiled with gcc 4.1.1 using flags -m64 -O3 -fomit-frame-pointer. Each experiment was repeated four times in a row; the best (shortest elapsed time) of the results are displayed in Table 2. Our results show comparable memory requirements for all tested programs, while mkESA is usually the fastest among them, even when using only one CPU.


View this table:
[in this window]
[in a new window]

 
Table 1. Datasets used for performance measurements

 

View this table:
[in this window]
[in a new window]

 
Table 2. Results of performance measurements

 

    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 RUNTIME BENCHMARKS
 4 CONCLUSION
 REFERENCES
 
We presented mkESA, a portable, lightweight, multithreaded and fast program for constructing enhanced suffix arrays. We carefully tested the software on a variety of UNIX-like operating systems and hardware architectures, including recent versions of Linux, Solaris, Mac OS X, FreeBSD, OpenBSD and NetBSD. Its ability to generate output compatible with mkvtree makes mkESA a convenient open source drop-in replacement for earlier programs.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger}Present address: GMI - Gregor Mendel Institute of Molecular Plant Biology GmbH, Dr. Bohr-Gasse 3, 1030 Vienna, Austria. Back

Associate Editor: Limsoon Wong

Received on January 21, 2009; revised on February 19, 2009; accepted on February 20, 2009

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 RUNTIME BENCHMARKS
 4 CONCLUSION
 REFERENCES
 

    Abouelhoda M, et al. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms (2004) 2:53–86.[CrossRef]

    Beckstette M, et al. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics (2006) 7.

    De Bona F, et al. Optimal spliced alignments of short sequence reads. Bioinformatics (2008) 24:i174–i180.[Abstract/Free Full Text]

    Giegerich R, et al. Efficient implementation of lazy suffix trees. Softw. Pract. Exp. (2003) 33:1035–1049.[CrossRef]

    Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. (1997) New York, NY, USA: Cambridge University Press.

    Höhl M, et al. Efficient multiple genome alignment. Bioinformatics (2002) 18:S312–S320.[Abstract]

    Krumsiek J, et al. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics (2007) 23:1026–1028.[Abstract/Free Full Text]

    Manber U, Myers E. Suffix Arrays: a new method for on-line string searches. SIAM J. Comput. (1993) 22:935–948.[CrossRef]

    Manzini G. Two space saving tricks for linear time LCP array computation. In: Proceedings of 9th Scandinavian Workshop on Algorithm Theory (SWAT '04)—Hagerup T, Katajainen J, eds. (2004) 3111. Berlin, Germany: Springer-Verlag. 372–383. of Lecture Notes in Computer Science.

    Manzini G, Ferragina P. Engineering a lightweight suffix array construction algorithm. Algorithmica (2004) 40:33–50.[CrossRef][Web of Science]

    Puglisi SJ, et al. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. (2007) 39:4.[CrossRef]

    Rahmann S. Fast large scale oligonucleotide selection using the longest common factor approach. J. Bioinform. Comput. Biol. (2003) 1:343–361.[CrossRef][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
25/8/1084    most recent
btp112v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Homann, R.
Right arrow Articles by Rehmsmeier, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Homann, R.
Right arrow Articles by Rehmsmeier, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?