Bioinformatics Advance Access originally published online on November 15, 2005
Bioinformatics 2006 22(2):134-141; doi:10.1093/bioinformatics/bti774
WindowMasker: window-based masker for sequenced genomes
National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services Building 38A, Room 1003N, 8600 Rockville Pike, Bethesda, MD 20894, USA
*To whom correspondence should be addressed.
Motivation: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes.
Results: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis.
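To make the idea of library-free masking with a few linear-time scans concrete, here is a minimal sketch. It is not WM's actual algorithm, which this abstract does not specify. The sketch assumes one plausible scheme: a first scan counts every k-mer in the genome, and a second scan slides a fixed-size window over those counts and soft-masks (lowercases) any window whose mean count reaches a threshold. The function names and the parameters `k`, `window` and `threshold` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(genome: str, k: int) -> Counter:
    """Scan 1 (assumed): count every overlapping k-mer in the genome."""
    seq = genome.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mask_repeats(genome: str, k: int = 4, window: int = 3,
                 threshold: float = 3.0) -> str:
    """Scan 2 (assumed): lowercase every window of `window` consecutive
    k-mers whose mean genome-wide count is at least `threshold`.

    All parameter values are illustrative, not WM defaults.
    """
    seq = genome.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers < window:
        return seq  # too short to score a single window

    counts = count_kmers(seq, k)
    scores = [counts[seq[i:i + k]] for i in range(n_kmers)]

    masked = [False] * len(seq)
    masked_upto = 0               # bases before this index are already handled
    running = sum(scores[:window])  # sum of scores in the current window

    for start in range(n_kmers - window + 1):
        if start > 0:
            # Slide the window one k-mer to the right in O(1).
            running += scores[start + window - 1] - scores[start - 1]
        if running / window >= threshold:
            end = start + window + k - 1  # one past the last base covered
            # Only touch bases not yet marked, so the scan stays linear.
            for j in range(max(start, masked_upto), end):
                masked[j] = True
            masked_upto = end

    return "".join(c.lower() if m else c for c, m in zip(seq, masked))


if __name__ == "__main__":
    # A short unique prefix followed by a low-complexity poly-A run.
    print(mask_repeats("ACGTTGCA" + "A" * 20))
```

In this toy run the unique prefix stays uppercase while the poly-A run, together with the few bases that share a high-scoring window with it, is lowercased. No repeat library is consulted at any point: the counts come from the input sequence alone, which is the property the Results paragraph emphasizes.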
Availability: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building the WindowMasker application in a UNIX environment can be found in the file src/app/winmasker/README.build.
Contact: richa{at}helix.nih.gov
Supplementary information: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf.
Received on July 26, 2005; revised on November 8, 2005; accepted on November 8, 2005