Bioinformatics Advance Access published online on January 18, 2007
Bioinformatics, doi:10.1093/bioinformatics/btl647
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nested Containment List (NCList): A new algorithm for accelerating interval query of genome alignment and interval databases
aDepartment of Biomathematics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA. E-mail: shuriko{at}ucla.edu
bMolecular Biology Institute, Center for Computational Biology, Institute for Genomics and Proteomics, Department of Chemistry and Biochemistry, University of California Los Angeles, Los Angeles, CA 90095-1570, USA. E-mail: leec{at}mbi.ucla.edu
*To whom correspondence should be addressed. Christopher J Lee, E-mail: leec{at}chem.ucla.edu
| Abstract |
|---|
Motivation: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods for finding overlapping sequence intervals that are highly scalable to database size, query interval size, result size, and construction / updating of the interval database.
Results: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is O(n + log N), where N is the database size and n is the size of the result set. In all cases tested this query algorithm is 5 -500 fold faster than other indexing methods tested in this study, such as MySQL mult-column indexing, MySQL binning, and R-Tree indexing. We provide performance comparisons both in simulated datasets and real-world genome alignment databases, across a wide range of data-base sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times that are approximately 100-fold faster than other methods available. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications.
Availability: NCList data structure is part of Pygr, a bioinformatics graph database library, available at http://sourceforge.net/projects/pygr
Received on June 16, 2006; revised on December 9, 2006; accepted on December 18, 2006
This article has been cited by other articles:
![]() |
M. E. Skinner, A. V. Uzilov, L. D. Stein, C. J. Mungall, and I. H. Holmes JBrowse: A next-generation genome browser Genome Res., September 1, 2009; 19(9): 1630 - 1638. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Kim and C. Lee QPRIMER: a quick web-based application for designing conserved PCR primers from multigenome alignments Bioinformatics, September 1, 2007; 23(17): 2331 - 2333. [Abstract] [Full Text] [PDF] |
||||

