Bioinformatics Advance Access originally published online on March 22, 2007
Bioinformatics 2007 23(10):1181-1187; doi:10.1093/bioinformatics/btm097
IMEx: Imperfect Microsatellite Extractor
Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics (CDFD), ECIL Road, Nacharam, Hyderabad 500 076, India
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Microsatellites, also known as simple sequence repeats, are tandem repeats of nucleotide motifs of size 1–6 bp found in every genome known so far. Their importance in genomes is well known. Microsatellites are associated with various disease genes, have been used as molecular markers in linkage analysis and DNA fingerprinting studies, and also seem to play an important role in genome evolution. It is therefore important to study the distribution, enrichment and polymorphism of microsatellites in the genomes of interest. A prerequisite for this is a computational tool that extracts microsatellites (perfect as well as imperfect) and their related information from whole genome sequences. Examination of the available tools revealed certain lacunae and prompted us to develop a new tool.
Results: In order to efficiently screen genome sequences for microsatellites (perfect as well as imperfect), we developed a new tool called IMEx (Imperfect Microsatellite Extractor). IMEx uses a simple string-matching algorithm with a sliding-window approach to screen DNA sequences for microsatellites and reports the motif, copy number, genomic location, nearby genes, mutational events and many other features useful for in-depth studies. IMEx is more sensitive, efficient and useful than the widely used tools currently available. IMEx is available as a stand-alone program and as a web server.
Availability: A World Wide Web server and the stand-alone program are available for free access at http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex
Contact: han{at}cdfd.org.in
| 1 INTRODUCTION |
|---|
Microsatellites or simple sequence repeats (SSRs) are nucleotide sequences that arise from tandem repetition of short sequence motifs of size 1–6 bp (Schlotterer, 2000). Microsatellites have been found in all genomes known so far and are widely distributed in both coding and non-coding regions (Sreenu et al., 2006, 2007; Toth et al., 2000). They are known to be highly polymorphic as a result of a high rate of mutations in the form of increases or decreases in their repeat copy numbers (Jarne and Lagoda, 1996). In coding regions, an increase or decrease in the repeat copy number of a microsatellite often leads to a shift in the reading frame, thereby causing changes in protein products (Li et al., 2004; Sreenu et al., 2006); in non-coding regions, it is known to affect gene regulation (Martin et al., 2005). Mutations occurring at microsatellite loci within or near certain genes have been implicated in some human neurodegenerative diseases (Tautz and Schlotterer, 1994). Furthermore, microsatellite instability has also been implicated in the induction of cancer (Thibodeau et al., 1993). Owing to their high mutability, microsatellites are thought to be one of the sources of genetic diversity (Kashi and King, 2006). In recent times, efforts have also been made to study the possible functional roles of microsatellites in giving rise to a certain amount of plasticity and in the evolution of genomes (Sreenu et al., 2006).
Apart from repeat copy number variation, a microsatellite tract (e.g. GCGCGCGCGC) also undergoes substitutions and indels of nucleotides, thereby becoming an imperfect tract (e.g. GCGCGCAGCGC: a GC repeat with an insertion of A). Genomes harbor a significant number of imperfect microsatellites (Brinkmann et al., 1998; Sreenu and Nagarajaram, unpublished data). Imperfect microsatellites are more stable than perfect microsatellites because they are less prone to slippage mutations (Sturzeneker et al., 1998), and they are known to play a role in gene regulation (Meloni et al., 1998).
Most of the studies on microsatellites reported in the literature have focused on their frequencies, abundance and polymorphisms in various genomes, both prokaryotic and eukaryotic. A few hypotheses have also been proposed on the life cycle (birth and death) of microsatellites in eukaryotes (Buschiazzo and Gemmell, 2006; Chambers and MacAvoy, 2000). Some studies have also revealed the role of point mutations (indels and substitutions of nucleotides) in the genesis, annihilation and evolution of microsatellites (Messier et al., 1996; Sreenu and Nagarajaram, unpublished data). However, a large body of microsatellite data from several genome sequences still remains unexplored. Studies of the distribution, enrichment and mutational dynamics of microsatellites, along with their role in gene function and expression, are essential to understand the processes that underpin the evolution and diversity of genomes.
In the course of our studies on microsatellites, we surveyed existing software tools for identification and extraction of microsatellites from nucleotide sequences. These tools can be broadly grouped into two categories: those that identify only perfect microsatellites (e.g. SSRF (Sreenu et al., 2003), Poly (Bizzaro and Marx, 2003), SSRIT (Temnykh et al., 2001)) and those that identify perfect as well as imperfect microsatellites (e.g. TRF (Benson, 1999), ATR Hunter (Wexler et al., 2004) and Sputnik (Abajian, 1994)). Our survey also revealed certain lacunae in the tools. Programs such as mreps (Kolpakov et al., 2003) and TandemSWAN (Boeva et al., 2006) consider only substitutions but not indels. TROLL (Castelo et al., 2002), STAR (Delgrange and Rivals, 2004) and SSRscanner (Anwar and Khan, 2006) use a predefined set of motifs to search for microsatellites in genomes and are therefore not very convenient for global automated searches. The algorithms of TRF (Benson, 1999), ATR Hunter (Wexler et al., 2004) and STRING (Parisi et al., 2003) have been designed to find tandem repeats of motifs as large as 2000 bases, and hence large numbers of microsatellites go unidentified by these methods. Many of these programs do not generate alignments between imperfect microsatellites and their expected perfect counterparts, and therefore require additional post-processing in order to study the mutational events in microsatellites. In view of these lacunae, and to aid our systematic analysis of imperfect microsatellites, we developed a program called IMEx (Imperfect Microsatellite Extractor) with a number of discovery-friendly features. IMEx is fast, highly sensitive and flexible: the user can set the limits for imperfection, so it can be used for both perfect and imperfect microsatellites.
The output comprises a list of microsatellites, each with information such as its total imperfection content, point mutations, sequence alignment with its perfect counterpart, and whether the locus lies in a coding or non-coding region, along with the corresponding known details. The IMEx program is available in two modes: as a stand-alone program and as a web server. Both are available from the web site http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex.
| 2 ALGORITHM |
|---|
We define a sequence at a given locus as a microsatellite if it can be expressed as a tandem repeat of a motif of 1–6 bp. At every iteration the repeating motif can harbor up to k point mutations (substitutions or indels of nucleotides). For example, the sequence ATATGTAGAT is a tandem repeat of the motif AT with two substitutions, A->G and T->G, at the third and fourth iterations, respectively. The IMEx algorithm uses this definition and employs a simple string-searching algorithm with a sliding-window approach. Conceptually, IMEx may be described as a two-step procedure: (a) identification of microsatellite nucleation sites, which are loci where a repeat motif occurs twice, either tandemly (type I nucleation site) (Fig. 1) or after certain intervening nucleotides (type II nucleation site) (Fig. 2); in both cases the repeat motif does not contain any imperfection (i.e. k = 0); and (b) extension of the nucleation sites on both sides in steps of the motif length (with at most k imperfections per repeat copy) until one of the termination criteria is satisfied: (i) the number of imperfections (inclusive of substitutions and a maximum of one indel) between the individual repeat copy and the perfect repeat motif is more than the limit set by the user (the k parameter) or (ii) the percentage of imperfection is more than the limit set by the user (the p parameter). The percentage imperfection is calculated as follows:
[Equation not available]
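The two-step procedure described above can be illustrated with a minimal Python sketch. It is not the IMEx implementation (IMEx is written in C), and it makes several simplifying assumptions: it handles type I nucleation sites only, extends to the right only, counts substitutions but not indels, and omits the p criterion. The function name `extend_right` and its return layout are hypothetical.

```python
def extend_right(seq, start, m, k):
    """Extend a type I nucleation site at `start` (motif size m) to the right.

    Simplifying assumptions (not taken from the paper): substitutions only,
    no indels, no leftward extension, no percentage-imperfection (p) check.
    Returns (motif, copies, mismatches_per_copy), or None when `start` is
    not a perfect two-copy nucleation site.
    """
    motif = seq[start:start + m]
    # Step (a): a nucleation site needs two perfect tandem copies (k = 0).
    if len(motif) < m or seq[start + m:start + 2 * m] != motif:
        return None
    mismatches = [0, 0]
    pos = start + 2 * m
    # Step (b): extend one motif length at a time.
    while pos + m <= len(seq):
        copy = seq[pos:pos + m]
        d = sum(a != b for a, b in zip(copy, motif))
        if d > k:              # termination criterion (i)
            break
        mismatches.append(d)
        pos += m
    return motif, len(mismatches), mismatches


# The example from the text: AT AT GT AG AT, with k = 1 per copy.
print(extend_right("ATATGTAGAT", 0, 2, 1))  # ('AT', 5, [0, 0, 1, 1, 0])
```

Because the tract-level p check is left out, this sketch can over-extend short motifs (for a mononucleotide motif with k = 1 every base is within one substitution), so it should be read only as an illustration of the nucleation and extension steps.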
While identifying the repeating copy, IMEx treats substitutions and indels on par. However, in certain instances, substitutions may have to take precedence over indels or vice versa. For example, the sequence ATGATGATATGATG can be viewed as ATG ATG ATA -TG ATG, with G->A at the third iteration followed by a deletion of A at the fourth iteration. The same tract can also be expressed as ATG ATG AT- ATG ATG, with a single deletion at the third iteration. IMEx chooses the longest repeating tract with the least edit distance; in this example, the latter is reported.
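The choice between the two interpretations can be checked numerically. The sketch below, which is illustrative and not taken from IMEx, scores each gapped alignment against the perfect tract (ATG)5 by counting differing columns, and then confirms with a standard Levenshtein distance that a single edit is the minimum. The helper names are hypothetical.

```python
def alignment_cost(aligned_tract, perfect):
    """Count columns that differ (substitutions and gaps) in a gapped alignment."""
    return sum(a != b for a, b in zip(aligned_tract, perfect))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


perfect = "ATG" * 5
print(alignment_cost("ATGATGATA-TGATG", perfect))  # 2: substitution + deletion
print(alignment_cost("ATGATGAT-ATGATG", perfect))  # 1: single deletion
print(edit_distance("ATGATGATATGATG", perfect))    # 1
```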
The flowchart of the IMEx algorithm is shown in Figure 3. At a given locus, IMEx progressively scans for nucleation sites starting from the longest repeat unit (hexanucleotide) down to the shortest (mononucleotide). In other words, for each position i in the sequence, it first looks for hexanucleotide repeat nucleation sites (motif size m = 6). If no hexanucleotide repeat tract is detected, it looks for a pentanucleotide repeat nucleation site (m = 5), and so on. IMEx automatically removes redundancies. For example, ATGCCCATGCCC is identified as (ATGCCC)2 only, and the internal repeat of C within the hexanucleotide motif is ignored.
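The hexa-to-mono scanning order can be sketched as follows for perfect repeats. This is a simplified illustration, not the IMEx code: the names `scan_perfect` and `is_primitive` are hypothetical, redundancy is avoided simply by jumping past each reported tract, and the check that a motif is not itself a repeat of a shorter unit is an assumption added here so that a run of A is reported as a mononucleotide tract.

```python
def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    return not any(n % d == 0 and motif[:d] * (n // d) == motif
                   for d in range(1, n))


def scan_perfect(seq, max_m=6):
    """Report perfect tandem repeats as (start, motif, copies)."""
    hits, i = [], 0
    while i < len(seq):
        for m in range(max_m, 0, -1):            # hexa first, mono last
            motif = seq[i:i + m]
            if (i + 2 * m <= len(seq)
                    and seq[i + m:i + 2 * m] == motif
                    and is_primitive(motif)):
                copies = 2
                while seq[i + copies * m:i + (copies + 1) * m] == motif:
                    copies += 1
                hits.append((i, motif, copies))
                i += copies * m                   # skip the tract: no nested hits
                break
        else:
            i += 1                                # no nucleation site at i
    return hits


print(scan_perfect("ATGCCCATGCCC"))  # [(0, 'ATGCCC', 2)]
```

Jumping past the tract is what keeps the internal CCC runs of the example from being reported separately.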
While detecting the microsatellite tract as a tandem repeat of a motif, IMEx simultaneously stores the edit operations (indels and substitutions). A pairwise alignment between the identified tract and its perfect counterpart is also produced to indicate the matches, mismatches and gaps. A sample alignment produced by IMEx is shown in Figure 4. Along with the alignments, the details of the repeat tract, such as consensus (repeating unit), number of iterations, tract length, imperfection percentage, nucleotide composition and coding region (if the tract is in a coding region) or flanking coding regions (if it is in a non-coding region), are written to a file in the form of a table. IMEx uses the .ptt file (NCBI's protein table file) for protein-coding region information.
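The coding or flanking-region annotation amounts to an interval lookup. The sketch below assumes that gene coordinates have already been read (for example from a .ptt file) into a list of (start, end, name) tuples that is sorted by start and non-overlapping. The function name, the tuple layout, the gene names and the rule that any overlap counts as coding are assumptions made for illustration, not details taken from IMEx.

```python
from bisect import bisect_right


def annotate(tract_start, tract_end, genes):
    """Classify a tract against sorted, non-overlapping (start, end, name) genes.

    Coordinates are 1-based and inclusive. Returns ("coding", gene) when the
    tract overlaps a gene, else ("non-coding", left_gene, right_gene).
    """
    starts = [g[0] for g in genes]
    idx = bisect_right(starts, tract_end)    # genes[:idx] start at or before tract_end
    if idx > 0 and genes[idx - 1][1] >= tract_start:
        return ("coding", genes[idx - 1][2])
    left = genes[idx - 1][2] if idx > 0 else None
    right = genes[idx][2] if idx < len(genes) else None
    return ("non-coding", left, right)


# Hypothetical gene table for illustration only.
genes = [(190, 255, "geneA"), (337, 2799, "geneB"), (2801, 3733, "geneC")]
print(annotate(400, 420, genes))   # ('coding', 'geneB')
print(annotate(300, 310, genes))   # ('non-coding', 'geneA', 'geneB')
```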
| 3 IMPLEMENTATION |
|---|
IMEx has been developed in standard C and has undergone extensive preliminary testing and comparison with other existing tools, yielding satisfactory results. The program can read a sequence of any length because memory is allocated dynamically; the practical size limit is, however, subject to the system configuration. The program has been successfully tested on the human X chromosome (147 MB) on a system with an Intel Xeon processor and 2 GB RAM. A web server has also been created and can be accessed from http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex. The web server has been developed using CGI-Perl. HTML forms collect the input sequences and the parameters used by the C program, and the results are displayed in the browser. The stand-alone program can be downloaded from the downloads section of the web server homepage.
Input to the program consists of a sequence file and the following parameters: (a) number of edit operations per motif (k); (b) percentage imperfection for the entire tract (p); (c) minimum repeat number (n); (d) coding information file. The web version offers three modes of access to the program: basic, intermediate and advanced. The basic mode runs with default values and offers very few options, chiefly the choice between perfect and imperfect microsatellites. The default parameters of IMEx are as follows: the imperfection percentage (p) is 10% for all repeat sizes; the imperfection limit per repeat unit (k) for each repeat size is Mono: 1, Di: 1, Tri: 1, Tetra: 2, Penta: 2, Hexa: 3; and the minimum number of repeat units (n) is 2 for all repeat sizes, i.e. any repeat unit that is repeated at least twice is reported. The intermediate mode offers a few more options: the user can adjust the p value for all repeat tracts, the k value for each repeat unit size and some other settings. The advanced mode exposes all the available parameters; for example, the user can set the flanking sequence size limit, switch on text outputs and search for a particular pattern. The interface has been designed for the convenience of the users. Using IMEx, the user can also search for a particular pattern (such as CAG repeats), for repeats of a particular size (di or tetra), for perfect repeats only, or for a combination of perfect and imperfect repeats.
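The default parameter set described above can be written down as a small configuration object. This is a hypothetical container for illustration: the class name, field names and the `perfect_only` helper are not IMEx option names, and treating a perfect-only search as k = 0 and p = 0 is an assumption based on the description of the imperfection limits.

```python
from dataclasses import dataclass, field

MOTIF_SIZES = ("mono", "di", "tri", "tetra", "penta", "hexa")


@dataclass
class ImexParams:
    """Hypothetical container for the k, p and n parameters."""
    # p: maximum percentage imperfection of the whole tract
    p: float = 10.0
    # k: maximum imperfections per repeat unit, by motif size
    k: dict = field(default_factory=lambda: dict(zip(MOTIF_SIZES, (1, 1, 1, 2, 2, 3))))
    # n: minimum number of repeat units, by motif size
    n: dict = field(default_factory=lambda: dict.fromkeys(MOTIF_SIZES, 2))

    def perfect_only(self):
        """Assumed equivalent of a perfect-only search: no imperfection allowed."""
        return ImexParams(p=0.0, k=dict.fromkeys(MOTIF_SIZES, 0), n=dict(self.n))


defaults = ImexParams()
print(defaults.k["hexa"], defaults.n["mono"], defaults.p)  # 3 2 10.0
```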
The program generates two files. The first gives a summary table describing the microsatellite tracts, including tract size, number of iterations, percentage imperfection, nucleotide composition and coding/non-coding information. The second contains the alignment of each repeat with its consensus sequence. Both files are produced in HTML as well as in text format. The text files can be downloaded and used for further studies. In the HTML output the files are linked, so that clicking a repeat displays its corresponding alignment in a separate HTML page. A link is also provided to the function of the coding region near which the microsatellite is located. Figure 5 gives a partial extract of the program output.
Primer3, a primer design program (Rozen and Skaletsky, 2000), has been linked to the web version of IMEx. In the HTML summary table, the user can select a microsatellite tract for which primers are to be designed, and the interface automatically prepares the input for Primer3. The user can also modify this input before it is submitted to Primer3.
| 4 RESULTS AND DISCUSSION |
|---|
To demonstrate the capabilities of our program, we analyzed the human atrophin1 gene (GenBank BC051795) and compared the results with those obtained using Tandem Repeats Finder (TRF) and Sputnik. TRF was initially tested with the parameters used in earlier studies (Archak et al., 2007; Boby et al., 2005; Ross et al., 2003), which yielded very few microsatellites. Hence, we used the most relaxed set of parameters (Match: +2, Substitution: –7, Indel: –7, Min Score: 2), which yielded a substantial number of microsatellites. This is because the length of a microsatellite detected by TRF depends on the value of Min Score. For Sputnik also, we used the least stringent parameters (Match: +1, Mismatch: –3, Min Score: –5). For IMEx, we set the p value of all tracts to 10% and the k value for each pattern size to Mono: 1, Di: 1, Tri: 1, Tetra: 2, Penta: 2, Hexa: 3, and further restricted the output to microsatellites with a minimum repeat copy number (Mono: 5, Di: 3, Tri: 2, Tetra: 2, Penta: 2, Hexa: 2) to match those reported by TRF and Sputnik. TRF and Sputnik identified 50 and 19 repeats, respectively, whereas IMEx identified 146 microsatellite tracts (Table 1). In fact, IMEx picks up a total of 876 repeats if the minimum repeat number of all repeat sizes is set to 2; these are useful for studies concerning the evolution of microsatellites, especially when making cross-genome comparisons. Table 1 gives the list of microsatellites picked up by IMEx, TRF and Sputnik.
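The minimum repeat copy number restriction used in this comparison is a simple per-size threshold. A minimal sketch, assuming tracts are available as (motif, copies) pairs; the names `MIN_COPIES` and `filter_tracts` and the input layout are hypothetical, and the example tracts are made up.

```python
# Motif size -> minimum copies, as used for the comparison with TRF and Sputnik.
MIN_COPIES = {1: 5, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}


def filter_tracts(tracts, min_copies=MIN_COPIES):
    """Keep (motif, copies) pairs whose copy number meets the size-specific minimum."""
    return [(motif, c) for motif, c in tracts if c >= min_copies[len(motif)]]


example = [("A", 4), ("A", 5), ("AT", 2), ("AT", 3), ("CAG", 2)]
print(filter_tracts(example))  # [('A', 5), ('AT', 3), ('CAG', 2)]
```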
As can be seen from the results, IMEx reports many tracts that are missed by the other two programs. It is important to mention that Sputnik does not report mononucleotide and hexanucleotide tracts. Of the 146 microsatellites reported by IMEx, 94 correspond to di- to pentanucleotide tracts, of which Sputnik reports only 19.
We also ran the three programs on four whole genome sequences: Plasmodium falciparum chromosome IV (GenBank NC_004318.1), yeast chromosome IV (GenBank NC_001136.8), the Mycobacterium tuberculosis H37Rv genome (GenBank NC_000962.2) and the E. coli K12 genome (GenBank NC_000913.2). The sequences were downloaded from ftp://ftp.ncbi.nih.gov/genomes. The three programs were run with the same parameters that we used in the above analysis. Table 2 shows the execution times and number of repeat tracts extracted from each sequence by each of the three programs.
From the results shown in Table 2, it is clear that IMEx outperforms TRF and Sputnik in its ability to identify and report microsatellite tracts in a relatively short time. It is also clear from the table that the execution time of IMEx is directly proportional to the sequence length rather than to the number of repeats detected, whereas the execution time of TRF is correlated with the number of repeats detected. TRF uses a probabilistic algorithm that includes a detection step to identify candidate repeats and an analysis step that applies different statistical criteria to filter them. Therefore, the more candidate repeats are detected, the longer TRF takes to run. Sputnik uses a recursive algorithm whose performance depends on the recursion depth. Hence, Sputnik's execution time seems to depend on the sequence composition. IMEx, on the other hand, uses a simple string-matching algorithm that scans the entire sequence with a sliding-window approach and reports the results in a single run. Hence, the processing time of IMEx depends on the length of the DNA sequence and not on the number of microsatellites.
In essence, IMEx embodies the features required for a systematic analysis of microsatellites that are not readily available in the other tools, as it has been designed with the limitations of those tools in view. Using IMEx, users can: (i) search for perfect microsatellites only or for perfect as well as imperfect microsatellites; (ii) get the coding/non-coding information of the microsatellite tracts; (iii) generate alignments with their perfect counterparts to learn about substitutions and indels; (iv) restrict the imperfection limit for the repeat unit of each size; (v) set the imperfection percentage threshold of the entire tract for each repeat size; (vi) restrict the minimum number of repeat units of a tract of each size; (vii) search for repeats of a particular size or of all sizes; (viii) search for microsatellite tracts of a particular pattern; (ix) set the flanking sequence size limit and (x) design primers seamlessly.
It is clear from the results that IMEx is attractive in terms of speed and sensitivity in identifying microsatellites, and that it has discovery-friendly inputs and outputs.
| 5 CONCLUSION |
|---|
In this article, we have presented a new tool for extracting imperfect microsatellites from genomic sequences according to the requirements of the user. It uses a simple algorithm that scans the entire DNA sequence and reports the microsatellites in a single run. Information about the microsatellites, such as coding/non-coding status, nucleotide composition, number of iterations and imperfections, is also generated along with the alignments. The tool is sensitive and fast. We have demonstrated its speed and sensitivity by comparing it with other existing tools. This tool can serve as a valuable medium for studying the evolution of microsatellites.
| ACKNOWLEDGEMENTS |
|---|
The authors would like to thank Mr Pankaj Kumar, Mr Mohammad Anwaruddin, Dr V.B.Sreenu and Mr Suprabhat Reddy for their valuable suggestions and assistance. A grant from the Department of Biotechnology (DBT), India is gratefully acknowledged. The authors also thank the anonymous referees for their critical and constructive comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on December 15, 2006; revised on February 26, 2007; accepted on March 7, 2007
| REFERENCES |
|---|
Abajian C. Sputnik. (1994) http://espressosoftware.com/pages/sputnik.jsp
Anwar T, Khan AU. SSRscanner: a program for reporting distribution and exact location of simple sequence repeats. Bioinformation (2006) 1: 89–91.
Archak S, et al. InSatDb: a microsatellite database of fully sequenced insect genomes. Nucleic Acids Res. (2007) 35: D36–D39.
Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. (1999) 27: 573–580.
Bizzaro JW, Marx KA. Poly: a quantitative analysis tool for simple sequence repeat (SSR) tracts in DNA. BMC Bioinformatics (2003) 4: 22.
Boby T, et al. TRbase: a database relating tandem repeats to disease genes in the human genome. Bioinformatics (2005) 21: 811–816.
Boeva V, et al. Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics (2006) 22: 676–684.
Brinkmann B, et al. Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am. J. Hum. Genet. (1998) 62: 1408–1415.
Buschiazzo E, Gemmell NJ. The rise, fall and renaissance of microsatellites in eukaryotic genomes. Bioessays (2006) 28: 1040–1050.
Castelo A, et al. TROLL – Tandem repeat occurrence locator. Bioinformatics (2002) 18: 634–636.
Chambers GK, MacAvoy ES. Microsatellites: consensus and controversy. Comp. Biochem. Physiol. B-Biochem. Mol. Biol. (2000) 126: 455–476.
Delgrange O, Rivals E. STAR: an algorithm to search for tandem approximate repeats. Bioinformatics (2004) 20: 2812–2820.
Jarne P, Lagoda PJL. Microsatellites, from molecules to populations and back. Trends Ecol. Evol. (1996) 11: 424–429.
Kashi Y, King DG. Simple sequence repeats as advantageous mutators in evolution. Trends Genet. (2006) 22: 253–259.
Kolpakov R, et al. mreps: efficient and flexible detection of tandem repeats in DNA sequences. Nucleic Acids Res. (2003) 31: 3672–3678.
Li YC, et al. Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol. (2004) 21: 991–1007.
Martin P, et al. Microsatellite instability regulates transcription factor binding and gene expression. PNAS (2005) 102: 3800–3804.
Meloni R, et al. A tetranucleotide polymorphic microsatellite, located in the first intron of the tyrosine hydroxylase gene, acts as a transcription regulatory element in vitro. Hum. Mol. Genet. (1998) 7: 423–428.
Messier W, et al. The birth of microsatellites. Nature (1996) 381: 483.
Parisi V, et al. STRING: finding tandem repeats in DNA sequences. Bioinformatics (2003) 19: 1733–1738.
Ross CL, et al. Rapid divergence of microsatellite abundance among species of Drosophila. Mol. Biol. Evol. (2003) 20: 1143–1157.
Rozen S, Skaletsky HJ. Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S, eds. Bioinformatics Methods and Protocols: Methods in Molecular Biology. (2000) Totowa, NJ: Humana Press. 365–386.
Schlotterer C. Evolutionary dynamics of microsatellite DNA. Chromosoma (2000) 109: 365–371.
Sreenu VB, et al. MICAS: a fully automated web server for microsatellite extraction and analysis from prokaryote and viral genomic sequences. Appl. Bioinformatics (2003) 2: 165–168.
Sreenu VB, et al. Microsatellite polymorphism across the M. tuberculosis and M. bovis genomes: implications on genome evolution and plasticity. BMC Genomics (2006) 7: 78–88.
Sreenu VB, et al. Simple sequence repeats in mycobacterial genomes. J. Biosci. (2007) 32: 3–15.
Sturzeneker R, et al. Polarity of mutation in tumor-associated microsatellite instability. Hum. Genet. (1998) 102: 231–235.
Tautz D, Schlotterer C. Simple sequences. Curr. Opin. Genet. Dev. (1994) 4: 832–837.
Temnykh S, et al. Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res. (2001) 11: 1441–1452.
Thibodeau SN, et al. Microsatellite instability in cancer of the proximal colon. Science (1993) 260: 816–819.
Toth G, et al. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. (2000) 10: 967–981.
Wexler Y, et al. Finding approximate tandem repeats in genomic sequences. In: RECOMB 2004 (2004).