Bioinformatics Advance Access originally published online on December 6, 2005
Bioinformatics 2006 22(3):361-362; doi:10.1093/bioinformatics/bti809
TRAP: automated classification, quantification and annotation of tandemly repeated sequences

1 Departamento de Parasitologia, Instituto de Ciências Biomédicas, Universidade de São Paulo, São Paulo SP, 05508-000, Brazil
2 Departamento de Ciências da Computação, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo SP, 05508-000, Brazil
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: TRAP, the Tandem Repeats Analysis Program, is a Perl program that provides a unified set of analyses for the selection, classification, quantification and automated annotation of tandemly repeated sequences. TRAP uses the results of the Tandem Repeats Finder program to perform a global analysis of the satellite content of DNA sequences, permitting researchers to easily assess the tandem repeat content for both individual sequences and whole genomes. The results can be generated in convenient formats such as HTML and comma-separated values. TRAP can also be used to automatically generate annotation data in the format of feature table and GFF files.
Availability: TRAP is available under the GNU General Public License at http://www.coccidia.icb.usp.br/trap/
Contact: argruber{at}usp.br
Supplementary Information: Supplementary data are available at http://www.coccidia.icb.usp.br/trap/
| 1 INTRODUCTION |
|---|
Repetitive sequences are ubiquitous in the genomes of living organisms and comprise interspersed and tandem repeats. The latter category is composed of clusters of tandemly repeated sequences with varying copy numbers (Chambers and MacAvoy, 2000). The high mutation rate of these repeat loci can be used for the differentiation of individuals and populations. In fact, microsatellite markers have become an invaluable tool for genotyping. The classification and quantification of tandem repeats can be useful to understand genome structure and evolution, as well as to determine potential loci involved in genetic diseases. Finally, given the high throughput of genome sequencing, automated annotation of tandem repeats, among other sequence features, has become an important need in any large-scale sequencing project.
Various programs have been designed to find tandemly repeated sequences using essentially two approaches: (1) searching for repeats known a priori, through a dictionary, and (2) ab initio repeat finding. The first approach, adopted by TROLL (Castelo et al., 2002), is more appropriate for microsatellite finding, since the complexity of the dictionary limits the search. The second approach, ab initio finding, implemented by programs such as STRING (Parisi et al., 2003), Mreps (Kolpakov et al., 2003), REPuter (Kurtz et al., 2001) and Tandem Repeats Finder (TRF) (Benson, 1999), can be used for repeats of larger period sizes. All these programs differ in their definition of a tandem repeat, and none of them permits an extensive quantification of the different subclasses of repeats.
Here we report the development of TRAP, the Tandem Repeats Analysis Program, a tool that analyzes TRF's output to achieve the following objectives: tandem repeat classification and quantification, automated selection of the best satellite marker candidates and automated annotation of repeat loci.
| 2 SYSTEM ARCHITECTURE |
|---|
TRAP is a Perl program that uses TRF's HTML output files as input. The relevant information is parsed from these files, and the repeat sequences are selected, classified, quantified and stored according to the end-user's requirements. We chose TRF as the primary repeat finder for three main reasons: it is one of the most flexible repeat finding programs, it is widely used worldwide and it allows for the identification of both perfect and degenerate repeats. It is important to mention that TRAP is not itself an ab initio tandem repeat finder, but rather a companion tool for TRF. Since TRAP is able to select and classify only those repeats previously found by TRF, it is essential to fine-tune the TRF parameters case by case, according to the user's objectives for each study.
TRAP is configured using a set of parameters that can be grouped into four categories: (1) input/output, describing the name and location of input and output files and directories; (2) selection, specifying the criteria for selecting repeat loci; (3) table sorting and format, defining sorting criteria and output format of the tables; and (4) miscellaneous, defining some output files containing additional information and/or format. A complete list and detailed description of all available TRAP parameters can be found in the Supplementary Material.
The selection parameters are especially important, since they determine the criteria utilized for considering the repeat loci appropriate for each one of the possible uses of TRAP's output. The user can define many requirements on the repeat loci: the minimum and maximum repeat copy number, minimum and maximum repeat period size, minimum size of the flanking regions and minimum percentage of matches between adjacent repeat units. Additionally, repeat loci can also be selected according to a predefined nucleotide sequence. All these selection parameters can be set independently, permitting the user to employ different combinations of criteria.
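To make the independent combination of selection criteria concrete, the following is a minimal Python sketch, not TRAP's actual Perl implementation. The names `RepeatLocus`, `SelectionCriteria` and `passes` are hypothetical, as is the assumption that a flanking region is measured as the distance from the locus to each end of the sequence. Selection by a predefined nucleotide sequence is omitted, since the matching rule is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepeatLocus:
    # Hypothetical record for one repeat locus reported by a repeat finder.
    start: int            # 1-based start coordinate
    end: int              # 1-based end coordinate (inclusive)
    period_size: int
    copy_number: float
    percent_matches: int  # matches between adjacent repeat units


@dataclass
class SelectionCriteria:
    # Every criterion is optional; None means "do not filter on this".
    min_copies: Optional[float] = None
    max_copies: Optional[float] = None
    min_period: Optional[int] = None
    max_period: Optional[int] = None
    min_flank: Optional[int] = None
    min_percent_matches: Optional[int] = None


def _at_least(value, minimum):
    return minimum is None or value >= minimum


def _at_most(value, maximum):
    return maximum is None or value <= maximum


def passes(locus, criteria, sequence_length):
    """Return True if the locus satisfies every criterion that is set."""
    # Assumption: flanks are the bases between the locus and the sequence ends.
    left_flank = locus.start - 1
    right_flank = sequence_length - locus.end
    return all([
        _at_least(locus.copy_number, criteria.min_copies),
        _at_most(locus.copy_number, criteria.max_copies),
        _at_least(locus.period_size, criteria.min_period),
        _at_most(locus.period_size, criteria.max_period),
        _at_least(min(left_flank, right_flank), criteria.min_flank),
        _at_least(locus.percent_matches, criteria.min_percent_matches),
    ])


criteria = SelectionCriteria(min_copies=5, min_period=2, min_flank=50)
locus_ok = RepeatLocus(start=101, end=130, period_size=2,
                       copy_number=15.0, percent_matches=100)
locus_short_flank = RepeatLocus(start=20, end=60, period_size=3,
                                copy_number=13.7, percent_matches=92)
locus_mono = RepeatLocus(start=200, end=229, period_size=1,
                         copy_number=30.0, percent_matches=100)
```

Because unset criteria are simply skipped, any subset of them can be combined, which mirrors the independence of the selection parameters described above.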
TRAP can produce a variety of output files such as comma-separated values and HTML files, thus allowing the data to be analyzed in any spreadsheet software and web browser, respectively. Repeat motifs representing circular permutations and/or reverse complements are all classified into a single group. In addition, TRAP detects redundant loci, which arise when TRF reports repeat units at overlapping coordinates, and subtracts the redundant bases from the quantification. TRAP is able to deal with both single- and multiple-sequence FASTA files. In the latter case, the final repeat quantification is calculated as an overall value for the whole set of sequences. Additional output files generated by TRAP can be used for microsatellite marker development (see Supplementary Material). TRAP can also produce a comprehensive automated annotation of tandem repeats for sequencing projects. For this application, TRAP can create feature table flat files (http://www.ncbi.nlm.nih.gov/collab/FT/), which can be viewed and edited using specific tools such as Artemis (Rutherford et al., 2000) and/or submitted to public databases. Alternatively, TRAP can generate GFF files, another format widely used by sequence annotation editors such as Apollo (Lewis et al., 2002).
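The motif grouping and the handling of redundant bases described in this section can be illustrated with a short Python sketch. This is one plausible way to implement both ideas and is not taken from TRAP's source code: `canonical_motif` picks the alphabetically smallest string among all circular permutations of a motif and of its reverse complement, and `nonredundant_bases` counts overlapping coordinates only once. Both function names, and the choice of the smallest rotation as the group representative, are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of an upper-case DNA motif."""
    return motif.translate(_COMPLEMENT)[::-1]


def rotations(motif):
    """All circular permutations of a motif."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical_motif(motif):
    """Single representative for a motif, its rotations and their
    reverse complements (assumed here: the alphabetically smallest)."""
    motif = motif.upper()
    candidates = rotations(motif) + rotations(reverse_complement(motif))
    return min(candidates)


def nonredundant_bases(intervals):
    """Bases covered by 1-based inclusive (start, end) intervals,
    counting positions shared by overlapping intervals only once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # No overlap with the running interval: close it and start anew.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the running interval if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total
```

With this scheme the motifs GA, AG, TC and CT all fall into the same group, and two loci spanning positions 1 to 10 and 5 to 20 contribute 20 bases to the total rather than 26.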
| 3 RESULTS AND DISCUSSION |
|---|
To test TRAP on real-life examples, we analyzed the satellite content of the following genomes: Escherichia coli, Saccharomyces cerevisiae, Plasmodium falciparum, Caenorhabditis elegans and Drosophila melanogaster. The values determined by TRAP for the overall repetitive content of these genomes were similar to those available in the literature. As an example, Karaoglu et al. (2005) reported for S. cerevisiae an occurrence of 3618 repeat loci for repeats of 10 bp or longer, whereas TRAP found a total of 3697 loci. A detailed protocol of the analysis, the corresponding results and a comparison with the literature data are available in the Supplementary Material. Most of the literature reports evaluate the repeat content based only on perfect repeats, thus excluding degenerate repeats from the calculation. Since there is no universal standardization for the definition of tandem repeats in terms of minimum copy number and extent of divergence, any census should be made using more than a single set of criteria. By permitting flexibility in the criteria used for repeat definition, TRAP can generate more comprehensive and comparative surveys.
A second application of TRAP is for the selection of the best candidates for microsatellite marker development. We tested TRAP for selecting repeat loci on Eimeria tenella, a coccidian parasite that infects the domestic fowl. A draft version of the genome (assembly version of December 18, 2002) was downloaded from the Sanger Institute's web site (http://www.sanger.ac.uk/Projects/E_tenella/) and submitted to TRF and TRAP. Markers were selected with a minimum copy number of 5 and minimum period size of 2. From a total of 40 markers selected by TRAP, 15 revealed polymorphism when tested against a panel of 20 distinct isolates of the parasite (unpublished data).
Finally, an important application for TRAP is the automated annotation of the satellite content of DNA sequences. Figure 1 displays an example of a typical automated annotation, including the copy number and period size of the repetitive locus, some additional information, such as the TRF parameters utilized in the analysis, and the respective score obtained for the repeat.
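As a rough illustration of what a single annotation record of this kind might look like in GFF, the Python sketch below builds one tab-separated line with the nine standard GFF columns. The exact layout TRAP writes is not specified here, so the function name `gff_line`, the `TRAP` source label, the `tandem_repeat` feature type and the `period_size` and `copy_number` attribute keys are all assumptions.

```python
def gff_line(seqid, start, end, score, period_size, copy_number):
    """Build one GFF record for a tandem repeat locus.

    Columns: seqid, source, type, start, end, score, strand, phase,
    attributes. Strand and phase are left as '.' because a tandem
    repeat is not treated here as a stranded or coding feature.
    """
    # Hypothetical attribute keys carrying the repeat description.
    attributes = f"period_size={period_size};copy_number={copy_number}"
    columns = [
        seqid,
        "TRAP",            # assumed source label
        "tandem_repeat",   # assumed feature type
        str(start),
        str(end),
        str(score),
        ".",
        ".",
        attributes,
    ]
    return "\t".join(columns)


record = gff_line("contig1", 101, 125, 50, 2, 12.5)
```

A file made of such lines, one per selected locus, is the sort of input that GFF-aware annotation editors can load alongside the sequence.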
| 4 CONCLUSIONS |
|---|
TRAP is a tool that processes the results of TRF, a mainstream application for ab initio tandem repeat finding. TRAP can be used to perform three different tasks: analyze the satellite content of a genome, select candidates for microsatellite marker development and automatically annotate the tandem repeat loci of DNA sequences. In conclusion, TRAP extends the analysis scope of TRF, enabling qualitative and quantitative surveys of the tandemly repeated sequences of a genome.
| 5 SYSTEM REQUIREMENTS |
|---|
TRAP was designed to run on Unix/Linux operating systems with an installed Perl interpreter. TRAP requires and is compatible with Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.html) versions 3.21 and 4.00. A detailed list of tested platforms and operating systems is provided in the Supplementary Material.
| Acknowledgments |
|---|
T.J.P.S. received a fellowship from CNPq/PIBIC. The authors are indebted to André Y. Kashiwabara for the web page construction and TRAP's logo design.
| FOOTNOTES |
|---|
Current address: Instituto do Coração-USP, Av. Prof. Enéas de Carvalho Aguiar 44, Bloco 2, 10° andar, 05403-000, São Paulo SP, Brazil
Associate Editor: Alfonso Valencia
Received on November 21, 2005; revised on November 30, 2005; accepted on November 30, 2005
| REFERENCES |
|---|
Benson, G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573-580.
Castelo, A.T., et al. (2002) TROLL-Tandem Repeat Occurrence Locator. Bioinformatics, 18, 634-636.
Chambers, G.K. and MacAvoy, E.S. (2000) Microsatellites: consensus and controversy. Comp. Biochem. Physiol. B, 126, 455-476.
Karaoglu, H., et al. (2005) Survey of simple sequence repeats in completed fungal genomes. Mol. Biol. Evol., 22, 639-649.
Kolpakov, R., et al. (2003) Mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res., 31, 3672-3678.
Kurtz, S., et al. (2001) REPuter: the manifold application of repeats analysis on a genomic scale. Nucleic Acids Res., 29, 4633-4642.
Lewis, S.E., et al. (2002) Apollo: a sequence annotation editor. Genome Biol., 3, RESEARCH0082.
Parisi, V., et al. (2003) STRING: finding tandem repeats in DNA sequences. Bioinformatics, 19, 1733-1738.
Rutherford, K., et al. (2000) Artemis: sequence visualization and annotation. Bioinformatics, 16, 944-945.

