Bioinformatics Advance Access originally published online on November 8, 2006
Bioinformatics 2006 22(24):3099-3100; doi:10.1093/bioinformatics/btl551
Pattern locator: a new tool for finding local sequence patterns in genomic DNA sequences
1 Department of Microbiology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
2 Department of Computer Science, University of Georgia, Athens, GA 30602, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: We present a new tool for finding local sequence patterns in long DNA sequences. The program, Pattern Locator, uses an intuitive syntax for pattern description, and provides more flexibility than existing programs by allowing combinations of specific nucleotide sequences, direct and inverted repeats, variable length tandem repeats of subpatterns, and a specified number of errors in any part of the pattern.
Availability: The program is available for download and as a web service accessible through a CGI interface at http://www.cmbl.uga.edu/software.html. The source code is written in C and distributed under the GNU General Public License.
Contact: mrazek{at}uga.edu
This work was motivated by inquiries from our colleagues about appropriate tools for finding short sequence patterns in complete prokaryotic genomes. The patterns in question are generally too short to trigger significant hits in BLAST (Altschul et al., 1990) searches. Although there are programs designed for this task, in particular PatScan (Dsouza et al., 1997) and several tools integrated in EMBOSS (Rice et al., 2000), their potential users often find them difficult to use or not suitable for the specific problem at hand. In Pattern Locator, we aim to provide an easy to use and versatile tool to search for local sequence patterns in long DNA sequences.
Specifying sequence patterns. Pattern Locator emphasizes ease of use and utilizes an intuitive syntax for pattern description. We use the standard IUPAC code (e.g. NC-IUB, 1986) to refer to individual nucleotides. Additional codes include +n, referring to the actual nucleotide (A, C, G or T) at the nth position past the start of the pattern or an active reference point, and ~n, signifying the nucleotide complementary to that at position n. These codes can be used to describe direct and inverted repeats. In addition, a specified number of errors (mismatches) can be allowed in any segment of the pattern (encoded as {...}[k], where k is the maximum number of errors in the segment within the curly brackets), and any subpattern can be repeated a given number of times (encoded as (...)[n:m], where n and m signify the minimum and maximum number of repeats, respectively, of the segment in the parentheses). The symbol # sets the reference point for the ~n and +n syntax, which affects the subsequent part of the pattern until another #. Table 1 shows several examples of pattern descriptions. Note that parentheses can be nested whereas curly brackets cannot, i.e. constructions such as ({()}) are allowed but {({})} are not. The characters > or < may be included at the start of the pattern definition to specify a search in the direct strand (>), the complementary strand (<) or both strands (<>). If no strand is specified, only the direct strand is searched. Multiple patterns can be located simultaneously. For example, stem-loop structures of the type NNNN(N)[3:7]~4~3~2~1 but allowing a single base bulge in any stem segment can be located by a simultaneous search for the patterns NNNN(N)[3:7]~4N~3~2~1, NNNNN(N)[3:7]~5~3~2~1, NNNN(N)[3:7]~4~3N~2~1, NNNNN(N)[3:7]~5~4~2~1, NNNN(N)[3:7]~4~3~2N~1 and NNNNN(N)[3:7]~5~4~3~1.
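The back-reference codes can be illustrated with a minimal Python sketch. It is not Pattern Locator's C implementation: it reads the complement reference as `~n` (an assumption about the notation), handles only IUPAC letters, `+n` and `~n` counted from the start of the pattern, and omits repeats, mismatches, the `#` reference point and strand selection. The names `tokenize`, `matches_at` and `find_all` are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to the bases each one matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Either a back-reference (+n or ~n, tilde assumed) or a single letter.
TOKEN = re.compile(r"([+~])(\d+)|([A-Z])")


def tokenize(pattern):
    """Split a pattern into ('base', code), ('same', n) or ('comp', n) tokens."""
    pattern = pattern.upper()
    tokens = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None or (m.group(3) and m.group(3) not in IUPAC):
            raise ValueError(f"unsupported symbol at offset {pos}: {pattern[pos]!r}")
        if m.group(3):
            tokens.append(("base", m.group(3)))
        else:
            kind = "same" if m.group(1) == "+" else "comp"
            tokens.append((kind, int(m.group(2))))
        pos = m.end()
    # Back-references must point at a position inside the pattern (1-based).
    for kind, value in tokens:
        if kind != "base" and not 1 <= value <= len(tokens):
            raise ValueError(f"reference to position {value} is outside the pattern")
    return tokens


def matches_at(seq, start, tokens):
    """Return True if the tokens match seq beginning at 0-based index start."""
    if start + len(tokens) > len(seq):
        return False
    for offset, (kind, value) in enumerate(tokens):
        base = seq[start + offset]
        if base not in COMPLEMENT:          # only A, C, G, T can match
            return False
        if kind == "base":
            ok = base in IUPAC[value]
        else:
            ref = seq[start + value - 1]    # nth position past the pattern start
            ok = base == (ref if kind == "same" else COMPLEMENT.get(ref))
        if not ok:
            return False
    return True


def find_all(seq, pattern):
    """Return 1-based inclusive (start, end) positions of every match."""
    tokens = tokenize(pattern)
    seq = seq.upper()
    return [(i + 1, i + len(tokens))
            for i in range(len(seq) - len(tokens) + 1)
            if matches_at(seq, i, tokens)]
```

With these definitions, `find_all("TTGCATATGCAA", "NNNN~4~3~2~1")` returns `[(3, 10)]`, the inverted repeat GCATATGC, and `find_all("ACGACGT", "NNN+1+2+3")` returns `[(1, 6)]`, the direct repeat ACGACG.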
Input and output. Pattern Locator reads sequences in the standard FASTA or GenBank format. Patterns are read from a separate file, one pattern per line. Pattern Locator performs an exhaustive search for all matches with an option to subsequently combine overlapping matches. This approach guarantees the same result regardless of the direction of the search. Pattern Locator generates two output files. The first file contains only locations of the patterns found (starting and ending positions). The second file shows the actual nucleotide sequences and their flanks, and indicates overlapping patterns.
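One way to picture the optional combining of overlapping matches is a standard interval merge, sketched below in Python. This is an assumption about how such a step could work, not a description of Pattern Locator's code; the function name `merge_overlapping` and the use of 1-based inclusive coordinates are illustrative choices.

```python
def merge_overlapping(matches):
    """Combine overlapping (start, end) matches given in inclusive coordinates."""
    merged = []
    # Sorting first makes the result independent of the order in which
    # the matches were found, e.g. left-to-right or right-to-left.
    for start, end in sorted(matches):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if necessary.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `merge_overlapping([(5, 12), (1, 6), (20, 25), (10, 14)])` returns `[(1, 14), (20, 25)]`. Because every match is collected before merging, the same regions are reported whichever direction the sequence was scanned in.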
Limitations. Pattern Locator uses a recursive algorithm that allows flexibility in pattern definitions. On the downside, it can become slow when combinatorial complexity of the search, affected mainly by the number of allowed mismatches and/or repeated segments of variable length, increases. In particular, Pattern Locator is not intended for finding distant direct or inverted repeats. Patterns, such as NNNNNN(N)[0:1000]+1+2+3+4+5+6 (a 6 bp direct repeat within a 1000 bp region) can be found more effectively by specialized programs (e.g. Rice et al., 2000), which utilize specifically designed algorithms. Note that only variable gaps, not those of exact length, increase the search time. For example, searching for the pattern NNNNNN(N)[990:1000]+1+2+3+4+5+6 will take roughly the same time as NNNNNN(N)[0:10]+1+2+3+4+5+6.
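The effect of the gap range on search time can be illustrated with a naive Python enumeration. It is not the recursive algorithm used by Pattern Locator; it simply tries every allowed gap length at every start position and counts the attempts, so the work grows with the number of gap lengths (max_gap - min_gap + 1) rather than with the gap length itself. The function name `find_direct_repeats` and its parameters are hypothetical.

```python
def find_direct_repeats(seq, unit, min_gap, max_gap):
    """Find exact direct repeats of length `unit` separated by min_gap..max_gap bases.

    Returns (hits, tried): hits are 1-based inclusive (start, end) spans and
    tried is the number of (start, gap) combinations that were compared.
    """
    hits = []
    tried = 0
    for start in range(len(seq)):
        for gap in range(min_gap, max_gap + 1):
            second = start + unit + gap       # start of the second copy
            if second + unit > len(seq):
                break                         # longer gaps run off the end too
            tried += 1
            if seq[start:start + unit] == seq[second:second + unit]:
                hits.append((start + 1, second + unit))
    return hits, tried
```

On a sequence much longer than the gap, the ranges [0:10] and [990:1000] both cost about 11 comparisons per start position, whereas [0:1000] costs about 1001, which mirrors the behaviour described above.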
Availability. The source code written in C is available for download from our website http://www.cmbl.uga.edu/software.html and distributed under the GNU General Public License (http://www.gnu.org/copyleft/gpl.html). We also provide a CGI interface to a restricted version of Pattern Locator.
Interface and implementation. Our interface is easy to use. In step one, users can choose a DNA sequence from the provided list, which contains the complete prokaryotic genomes downloaded from the National Center for Biotechnology Information (NCBI) ftp server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/), or upload their own DNA sequence. The search can optionally be limited to a specific region of the genome. In step two, users type or paste the search pattern(s) into a provided text area and choose how to treat overlapping patterns. In step three, users are prompted to enter the email address to which the results will be sent. The process is completed by clicking Submit Query. The two output files and the error/warning log are sent to that email address.
The online version of Pattern Locator has two restrictions that do not apply to the downloaded version of the program. First, the restricted version estimates the time needed to complete the search and stops if the estimated CPU time exceeds a certain limit (currently 120 s). The CPU time limit was introduced in order to stop long jobs that could result from inadvertently mistyped patterns or patterns that are not suitable for Pattern Locator (such as distant direct or inverted repeats). The 120 s limit should be sufficient to find even complex patterns in prokaryotic genomes. For example, a simultaneous search for all 15 patterns listed in Table 1 in the Escherichia coli K12 chromosome of 4.6 Mb length takes about 100 s. The second restriction limits the size of the output files, which is approximately proportional to the number of patterns found in the sequence. Both restrictions can be overcome by limiting the search to a part of the genome rather than the complete sequence.
The CGI interface was written in Python and another Python script was designed to periodically update the locally stored dataset of complete prokaryotic genomes.
| Acknowledgments |
|---|
J.M. wishes to thank Dr Samuel Karlin for his support and suggestions in developing an early version of this program. The authors also wish to thank Dr Ellen Neidle for comments on the manuscript. This work was supported in part by a Faculty Research Grant from the University of Georgia Research Foundation, Inc.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Limsoon Wong
Received on September 5, 2006; revised on October 19, 2006; accepted on October 19, 2006
| REFERENCES |
|---|
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Dsouza, M., et al. (1997) Searching for patterns in genomic data. Trends Genet., 13, 497–498.
Nomenclature Committee of the International Union of Biochemistry (NC-IUB). (1986) Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Proc. Natl. Acad. Sci. USA, 83, 4–8.
Rice, P., et al. (2000) EMBOSS: the European molecular biology open software suite. Trends Genet., 16, 276–277.