Bioinformatics Advance Access published online on November 15, 2005

Bioinformatics, doi:10.1093/bioinformatics/bti774

Published by Oxford University Press 2005
Received July 26, 2005
Revised November 8, 2005
Accepted November 8, 2005

Article

WindowMasker: window based masker for sequenced genomes

Aleksandr Morgulis 1 *, E. Michael Gertz 1, Alejandro A. Schäffer 1, and Richa Agarwala 1 *

1 National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bldg. 38A, Room 1003N, 8600 Rockville Pike, Bethesda, MD 20894 USA

* To whom correspondence should be addressed.
Richa Agarwala, E-mail: richa{at}helix.nih.gov


   Abstract

Motivation: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes.

Results: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence to each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently nonrepetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis.

Availability: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/.
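The idea of masking with "a few linear-time scans" that use only the genome itself can be illustrated with a small sketch. This is not the WindowMasker algorithm or its code; it is a simplified stand-in under stated assumptions. The function names, the k-mer length `k`, the window length `window`, the `threshold` value, and the use of lowercase letters to mark masked bases are all hypothetical choices made for illustration. The first scan counts how often each k-mer occurs; the second slides a window along the sequence and masks any window whose average k-mer count reaches the threshold.

```python
from collections import Counter


def count_kmers(genome: str, k: int) -> Counter:
    """First scan: count every overlapping k-mer in the sequence."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def mask_repeats(genome: str, k: int = 4, window: int = 8,
                 threshold: float = 2.0) -> str:
    """Second scan: lowercase every window whose mean k-mer count
    is at least `threshold`. All parameter defaults are illustrative
    assumptions, not values taken from WindowMasker."""
    n_kmers = len(genome) - k + 1
    kmers_per_window = window - k + 1
    if n_kmers < kmers_per_window or kmers_per_window < 1:
        return genome  # sequence shorter than one window: nothing to score

    counts = count_kmers(genome, k)
    scores = [counts[genome[i:i + k]] for i in range(n_kmers)]

    masked = list(genome)
    masked_end = 0  # first position not yet lowercased, keeps the scan linear
    running = sum(scores[:kmers_per_window])
    for start in range(n_kmers - kmers_per_window + 1):
        if start > 0:
            # Slide the window: add the k-mer entering, drop the one leaving.
            running += scores[start + kmers_per_window - 1] - scores[start - 1]
        if running / kmers_per_window >= threshold:
            for j in range(max(start, masked_end), start + window):
                masked[j] = masked[j].lower()
            masked_end = start + window
    return "".join(masked)
```

For example, `mask_repeats("ACGTTGCA" + "A" * 12)` leaves the start of the unique prefix in uppercase and lowercases the low-complexity run. No repeat library is consulted at any point, which is the property the abstract emphasizes for newly sequenced genomes.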


* under contract to MSD Inc., Fairfax, VA, USA




