Bioinformatics Advance Access originally published online on December 18, 2007
Bioinformatics 2008 24(4):468-476; doi:10.1093/bioinformatics/btm613
A novel genome-scale repeat finder geared towards transposons
1CISE Department and 2Plant Molecular and Cellular Biology Program and Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Repeats are ubiquitous in genomes and play important roles in evolution. Transposable elements are a common kind of repeat. Transposon insertions can be nested and make the task of identifying repeats difficult.
Results: We develop a novel iterative algorithm, called Greedier, to find repeats in a target genome given a repeat library. Greedier distinguishes itself from existing methods by taking into account the fragmentation of repeats. Each iteration consists of two passes. In the first pass, it identifies the local similarities between the repeat library and the target genome. Greedier then builds graphs from this comparison output. In each graph, a vertex denotes a similar subsequence pair. Edges denote pairs of subsequences that can be connected to form higher similarities. In the second pass, Greedier traverses these graphs greedily to find matches to individual repeat units in the repeat library. It computes a fitness value for each such match denoting the similarity of that match. Matches with fitness values greater than a cutoff are removed, and the rest of the genome is stitched together. The similarity cutoff is then gradually reduced, and the iteration is repeated until no hits are returned from the comparison. Our experiments on the Arabidopsis and rice genomes show that Greedier identifies approximately twice as many transposon bases as those found by cross_match and WindowMasker. Moreover, Greedier masks far fewer false positive bases than either cross_match or WindowMasker. In addition to masking repeats, Greedier also reports potential nested transposon structures.
Contact: xli{at}cise.ufl.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Genomes typically are rich in repeat sequences. For example, it is estimated that 50–80% of the maize genome (Sanmiguel et al., 1996) and >50% of the human genome (I. H. G. Consortium, 2001) consist of repetitive sequences. Repeats vary in size from less than a hundred bases to tens of kilobases. They are found as either tandem arrays or dispersed throughout the genome. Repeats can generate insertions, deletions and unequal crossing-over within genomes. Hence, repeats play important roles in genome evolution (Bowen and Jordan, 2002) by causing mutations and rearrangements (Mal et al., 2004) that lead to altered gene functions (Buard and Jeffreys, 1997). Repeats also present difficulties in genome annotation and analyses (e.g. (Bennetzen et al., 2004)). For these reasons, it is critical to identify repetitive sequences accurately.
Transposons are mobile genetic elements and often make up a substantial fraction of genomes. For example, 45% of the human genome (Smith, 1999) and 12.4% of the rice genome consist of transposable elements (http://www.tigr.org/tdb/e2k1/osa1/). Transposons accumulate copies within a genome based on a variety of replicative mechanisms. Many transposons have a tendency to target new insertions within existing copies of transposons (Mal et al., 2004; Sanmiguel et al., 1996). Thus, repeats within a genome sequence are split by nested transposon insertions, resulting in individual repeat units being fragmented (Mal et al., 2004; Sanmiguel et al., 1996) (Also see Fig. 1).
|
Fragmented repeats are hard to identify using existing tools for two reasons. First, fragments may be too short to be considered significant for an algorithm's criteria. Second, fragments are likely to have a high level of sequence divergence from known repeat structures. Repeat finders need to maximize the identification of true positives (bona fide repeats) while limiting false positives. False positives are sequences that are tagged as repeats, but are actually non-repeats. The precise definition of a false positive can be debated, but in this article we take a biological perspective towards the identification of repeats. Our goal is to separate transposable elements and tandem repeat arrays from exon sequences that are more likely to have a functional role in an organism's phenotype.
Current repeat finding algorithms either identify repeats de novo or use a library of annotated repeats. Several de novo algorithms such as TRF (Benson, 1999), LTR_STRUC (McCarthy and McDonald, 2003), and PILER (Edgar and Myers, 2005) identify specific structures within the DNA sequence to find repeats. These algorithms are most robust for identifying novel, recently-evolved repeats. REPuter (Kurtz et al., 2000)), WindowMasker (Morgulis, 2006) and RAP (Campagna, 2005) identify fragmented or more divergent repeats based on pairwise or multiple similarities within a genome. REPuter uses suffix trees to locate exact repeats and extends them to degenerate repeats. WindowMasker and RAP use word counting to identify short sequences that are over-represented within an input genome sequence. Both word counting and pairwise similarity algorithms have the potential for masking desired gene families in addition to repeats. These similarity-based algorithms have not been benchmarked against genome annotations to determine if the masked sequences include false positives. Some de novo repeat finders identify fragmented or more divergent repeats based on multiple alignments between related genomes. Caspi and Pachter (2006) is a comparative approach that identifies transposable elements. Inserted sequences that have little or no alignment to other genomes lead to signatures with multiple alignments that can be used to identify transposable elements. Caspi and Pachter (2006) searches for disrupted conservation patterns in whole genome alignments. However, poorly characterized genome families are not suitable for such comparative approaches. These approaches rely on well-assembled genomes, the selection of appropriate evolutionary models. Repeat finders that use annotated repeat libraries include RepeatMasker (www.repeatmasker.org/), MaskerAid (Bedell et al., 2000) and CENSOR (Jurka et al., 1996). All these algorithms use cross_match (www.phrap.org/phredphrapconsed.html) as the default search engine to determine levels of pairwise similarity to a collection of known repeats. cross_match is an efficient implementation of the Gotoh algorithm (Gotoh, 1982). It uses banded searches to find matches with score no less than a cutoff. Dynamic programming has also been used to find multiple non-overlapping copies of the subsequence of a repeat unit in a target sequence (Durbin et al., 1998). However, this solution fails to find divergent fragmented repeats.
Here we consider the problem of finding repeats in a genome while maximizing the true positives and minimizing the number of false positives. We show that both cross_match and WindowMasker mask a small percentage of annotated repeats while masking a large percentage of annotated genic exons. We develop an iterative algorithm, called Greedier, that uses a repeat library. Our tests with the Arabidopsis and rice genomes show that Greedier has a much higher level of accuracy at finding repeats than both cross_match and WindowMasker. In addition to masking repeats, Greedier also reports potential nested transposon structures.
| 2 DESCRIPTION OF THE ALGORITHM |
|---|
|
|
|---|
Greedier takes a target sequence, such as a chromosome, and a repeat library as input. Greedier masks the target sequence based on its sequence similarities to the repeat units in the repeat library. Ideally, a repeat unit is a complete repeat structure. In this article, a repeat unit is an individual sequence entry in a repeat library. Greedier uses multiple iterations to identify divergent matches to the repeat library. Figure 2 shows the pseudocode for Greedier. Each iteration consists of two passes. In the first pass (Steps 2–4), it identifies the local similarities between the chromosome and the repeat sequences (Step 2). This can be done using any off-the-shelf sequence comparison algorithm. We used BLAST (Tatusova and Madden, 1999) for this purpose. It then processes these local similarities and builds graphs (Steps 3–4). A vertex in a graph denotes one similarity. An edge in a graph denotes two similar subsequences that can be attached to form a longer match. A path in a graph represents a longer match consisting of multiple edges, i.e. two or more sequence similarities that can be connected to form a longer match. Each path is associated with a fitness value between zero and one. This value reflects the percent identity and the length of the match between the target subsequences on the path and the repeat unit. In the second pass (Steps 5–12), Greedier traverses the graphs greedily and forms a pool of repeat candidates. Each candidate in the pool corresponds to a path with a fitness value no less than a cutoff,
. Greedier reports the subsequences whose corresponding path has the biggest fitness value as repeats (Step 6). It then modifies all the graphs accordingly (Steps 7–8). It repeats this process until it cannot find more repeats (Step 9). Finally, it removes these reported repeats from the chromosome and stitches the rest of the chromosome together (Steps 10–11). This allows repeats to be split and nested within one another. Finally it relaxes the fitness constraint for the next iteration (Step 12). Note that
decreases to allow divergent repeats to be identified after the most recent insertions have been removed from the genome. Greedier iterates until BLAST returns no hits. Finally, Greedier parses the masked regions from different iterations and reports the potential nested transposon structures (Step 14). These structures are identified either as individual blocks at higher level iterations or as a result of stitching at lower level iterations.
|
2.1 Graph construction
For each repeat in the repeat library, we build an acyclic and directed graph if BLAST reports similarities between that repeat and the target sequence. Each alignment has an alignment orientation, denoted by o. When both positive or negative strands of the chromosome and the repeat participate in this alignment, we say that the alignment has Plus/Plus orientation. Otherwise, we say that it has Plus/Minus orientation. Let sr and er be the starting and the ending coordinates (i.e. positions) of the subsequence of the repeat in the alignment, respectively. Let sc and ec be the starting and the ending coordinates of the subsequence of the target sequence in the alignment, respectively. Let a and b denote the number of identical letters in the alignment and the length of the alignment, respectively. For each alignment, we build a vertex, denoted by (sr, er, sc, ec, a, b, o). In Figure 3, the two vertices corresponding to the two alignments are (400, 889, 1200, 1700, 450, 500, Plus/Plus) and (922, 1545, 1750, 2370, 550, 650, Plus/Plus). Let lr denote the length of the remaining repeat subsequence (i.e. excluding the subsequence corresponding to the alignment). Let x denote the average identity between two random sequences. All letters in the nucleotide alphabet appear at each position of a random sequence with the same probability of 0.25. Hence, x = 0.25. For each vertex, we calculate a fitness value as
|
|
|
Recall that
is the fitness value cutoff introduced at the beginning of Section 2. We say that two vertices from the same repeat, (sr1, er1, sc1, ec1, a1, b1, o1) and (sr2, er2, sc2, ec2, a2, b2, o2), do not conflict if any of the following six sets of conditions is satisfied:- o1 = o2 = Plus/Plus, ec1
sc2 and er1
sr2;
- o1 = o2 = Plus/Plus, sc1
sc2
ec1, er1
sr2, and (ec1–sc2)
(1–
) * min(a1, a2);
- o1 = o2 = Plus/Plus, sr1
sr2
er1, ec1
sc2, and (er1–sr2)
(1–
) * min(a1, a2);
- o1 = o2 = Plus/Minus, ec1
sc2 and er1
sr2;
- o1 = o2 = Plus/Minus, sc1
sc2
ec1, er1
sr2, and (ec1–sc2)
(1–
) * min(a1, a2);
- o1 = o2 = Plus/Minus, sr1
sr2
er1, ec1
sc2, and (sr2–er1)
(1–
) * min(a1, a2).
We construct an edge between two vertices if they do not conflict. Each edge is a directed edge that goes from a vertex with a smaller starting coordinate to another vertex with a bigger starting coordinate. It connects a pair of alignments of the same orientation. Each such pair of alignments forms an alignment longer than the two alignments between the repeat and the chromosome. In Figure 3, for example, the repeat subsequence between coordinates 400 and 1545 and the target subsequence between coordinates 1200 and 2370 form an alignment.
Let r denote the number of repeats in the repeat library. Assume each repeat has t alignments with the chromosome on average. Thus, the average number of vertices in each graph is t. Vertex construction for each graph takes O(t) time. The number of edges in each graph is O(t2) in the worst case. This means that graph construction for each repeat takes O(t+t2) time. The number of graphs is the same as the number of repeats in the library. Hence, graph construction for all repeats at each iteration takes O(rt2) time in the worst case.
2.2 Graph traversal
A path in a graph consists of vertices connected by edges. Suppose all vertices on a path are ordered according to their starting coordinates on the chromosome. For a path consisting of k vertices, we calculate a fitness value as
|
|
For each graph, we adopt a greedy traversal strategy. We start from the vertex with the maximum fitness value in the graph. We extend this vertex to a path in the left and right directions. When extending in one direction, we include a neighbor vertex of the most recently chosen vertex into the path each time. The neighbor vertex is the one which makes the currently extended path have the biggest fitness value. We keep extending the path in one direction until the path cannot be extended further. After extending the path in both directions, we find the longest subpath of the extended path with fitness value no
. We report this subpath as the repeat candidate to be joined into the pool. Note that when extending the path in the right direction, we consider outgoing edges. When extending the path in the left direction, we consider the incoming edges. Figure 4 shows the pseudocode of the traversal.
|
Suppose that Greedier iterates i times before BLAST returns no hits. At each iteration, the worst time spent to traverse graphs is linear in the size of edges, O(rt2). Hence, the total time complexity of Greedier is O((rt2+rt2)i), i.e. O(rit2).
2.3 Further improvement to greedier
In this section, we analyze the gap length constraints enforced by our fitness value formula of a path automatically. These constraints reduce the number of constructed edges and hence the graph size. As a result, they reduce the space and the time complexities of the Greedier. Recall that a gap associated with an edge is the interval between two subsequences that define the alignments for two vertices connected by that edge. There are two kinds of gaps involved in an edge: a gap on a repeat with length gr and a gap on the target sequence with length gc. Let L be the repeat length. We consider two extreme cases of an edge. These cases provide upper bounds for the length difference between the two kinds of gaps.
One extreme case is when there is no gap on the repeat (i.e. gr = 0), but there is a gap on the target sequence. Figure 5(a) shows this case. The fitness value, f, of the edge satisfies
|
|
. Thus, we have |
|
(1–
)L/
when gr > 0.
|
The other extreme case is when there is no gap on the chromosome (i.e., gc = 0), but there is a gap on the repeat. Figure 5(b) illustrates this case. The fitness value, f, of the edge satisfies
|
|
. Thus, we have |
|
(1–
) L, when gc > 0.
During the graph construction, we include an edge between two vertices only if the gaps gr and gc for that edge satisfy these constraints. For
[0.25, 0.95], the upper bound for gr varies between 0.75L and 0.05L and the upper bound for gc varies between 3L and 0.05L. This indicates that only a limited number of edges are drawn from each vertex. Thus, this filter significantly reduces the graph size.
| 3 RESULTS |
|---|
|
|
|---|
We evaluated the accuracy of Greedier by comparing its ability to identify repeats. We compared Greedier with three well known tools, namely cross_match, RepeatMasker and WindowMasker. cross_match is the core algorithm of RepeatMasker, MaskerAid and CENSOR. Similar to Greedier, cross_match and RepeatMasker use a repeat library to identify the repeats in a target genome. RepeatMasker can also identify de novo low complexity regions by self comparison. WindowMasker (ftp://ftp.ncbi.nih.gov/pub/agarwala/windowmasker/) is a de novo repeat finder that was developed specifically to mask large sequence databases. We have downloaded cross_match, RepeatMasker and WindowMasker. Caspi and Pachter (Caspi and Pachter, 2006) proposed to use multiple alignment of the target sequence to a set of sequences that are homolog to the target sequence and that have well annotated transposons. Although it would have been interesting to evaluate the performance of this algorithm with respect to Greedier, we excluded it for several reasons. First, its code or algorithm was not publicly available. Second, it requires having access to homolog and well annotated sequences. None of the four methods we tested requires such a dataset. Therefore, this method would have an unfair advantage against those methods. Furthermore, comparative methods can be used only if well annotated homolog sequences are available. For other sequences, they can not be used.
We used the arabidopsis genome and rice chromosome 10 as benchmarks to compare the algorithms. Both of these genomes have been sequenced to near completion, have extensive annotations of both genes and likely transposons, and have existing repeat libraries. We obtained the Arabidopsis genome data including the five chromosomes assembled into pseudomolecules (ftp://ftp.arabidopsis.org/home/tair/home/tair/Sequences/whole_chromosomes/), the gene model function annotations (ftp://ftp.arabidopsis.org/home/tair/home/tair/Genes/TAIR_sequenced_genes) and the gene model coordinates (ftp://ftp.arabidopsis.org/home/tair/home/tair/Maps/seqviewer_data/sv_gene_feature.data) from TAIR (http://www.arabidopsis.org/). Similarly, we downloaded the rice genome and its annotations from TIGR (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_4.0/). We obtained the arabidopsis and rice repeat libraries from TIGR (ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats/). These repeat libraries contain annotated transposons and other repeat classes such as tandem arrays.
Autonomous transposons encode proteins and share structural features with non-repeat genes. Within the genome annotation, these transposons are identified as gene models, but they are also annotated as transposons. A masked letter was considered to be a true positive (TP) if it belonged to a gene model annotated as a transposon. Repeat sequences can also be found outside of gene models or within non-coding segments of non-transposon genes. Letters in these regions of the genome were ignored, because they were not assigned as repeat or non-repeat sequences in the genome annotation. The remaining exon sequences were considered to be non-repeat sequences that should be retained after masking. Letters masked in these exons were considered to be false positives (FPs).
3.1 Evaluation of accuracy
Tables 1 and 2 show the relative accuracies of Greedier, cross_match and WindowMasker when run with default parameters. For Greedier, we set the average identity between two random DNA samples, y = 0.25. We also computed the actual identity using the Needleman–Wunsch algorithm and obtained similar results (data not shown). Greedier has a much higher rate of identifying annotated transposons and a much lower false positive rate than both cross_match and WindowMasker. The percentages of bases masked by the three programs corresponds to relative levels of repeat sequences found on the chromosome arms. However, in all chromosomes tested, we found a significant increase in the number of transposon bases masked by Greedier.
|
|
For the Arabidopsis genome (Table 1), Greedier masked 2.4 and 2.8 times as many transposons bases as cross_match and WindowMasker, respectively. At the same time, Greedier masked 0.3 and 0.1 times the number of false positives, respectively. Thus, Greedier could be considered to be roughly 8 and 28 times more accurate (2.4/0.3 = 8 and 2.8/0.1 = 28, ratio of true positives to false positives) than cross_match and WindowMasker, respectively. RepeatMasker, which uses cross_match masked 2.3 times more letters than cross_match and Greedier. The TP rate of RepeatMasker is greater than that of Greedier. This is mainly because it masks more letters. The TP/FP rate of Greedier, however, is better than that of RepeatMasker. To understand, why Greedier incurred false negatives, we aligned the false negatives of Greedier with the repeat library using BLAST. We did not get any significant matches. This implies that either repeat library is incomplete or BLAST is not accurate enough to find the false negatives.
Greedier also showed higher accuracy than cross_match and WindowMasker for rice chromosome 10 (Table 2). Greedier masked 1.5 and 2 times as many transposons bases as cross_match and WindowMasker, respectively. At the same time, Greedier masked 0.5 and 0.9 times the number of false positives, respectively. However, the differences between the number of true positives and false positives recognized by the three programs were not as pronounced as with Arabidopsis. Potentially, this is due to incorrect annotation of transposons as gene sequences (Bennetzen et al., 2004) leading to higher false positive rates with all three algorithms.
We also measured the number of fragments that are masked by each of the methods. On the Arabidopsis genome, Greedier, RepeatMasker, cross_match and WindowMasker masked 7364, 15 421, 6750 and 760 243 fragments, respectively. On the rice genome, the same methods masked 7738, 20 827, 7690, 116 614 fragments, respectively. This demonstrates that RepeatMasker and especially WindowMasker masks a large number of small discontinuous regions, whereas Greedier and cross_match mask relatively longer and contiguous regions.
3.2 Fragmented and potential nested repeats
Greedier was developed based on the hypothesis that identifying nested or fragmented repeats would improve the accuracy of library-based repeat masking. Our genome-wide results suggest that Greedier has a higher accuracy. Does this result from the identification of nested repeats, or is this simply due to the implementation of a graph-based fitness value that allows for fragmented pairwise matches? In order to understand this we performed the following experiment to extract potential nested insertions. We defined a match A as a potential insertion if (1) It is similar to a retrotransposon or transposon (i.e., the fitness value is high, (2) There is a new match to another repeat in the repeat library after A is removed and the rest of the sequence is stitched in the subsequent iteration of Greedier which did not exist before.
Figure 6 shows an example of a retrotransposon that has fragmented identity to the repeat library. This retrotransposon, At4g06656, corresponds to coordinates of 3,841,924 and 3,850,389 of Arabidopsis chromosome 4. Greedier masked the entire annotated transposon region, while cross_match missed 43% of the letters. At iteration one, the subsequence with coordinates 3,846,848 and 3,847,232 is identified as a putative Ty3-gypsy-like retrotransposon sequence within the library. This sequence is masked due to a high fitness value, because the genome sequence matches 96% of the repeat over the entire unit length of the repeat. At iteration seven, the subsequence with coordinates 3,844,713 and 3,845,381 is identified. This is a 77% match to a conserved centromere sequence over the entire length of the repeat unit in the repeat library. At iteration eight, the remaining sequence is masked due to identity to two retrotransposons within the repeat library. The middle segment, coordinates 3,845,382–3,846 847, shows no similarity to the repeat library, which reduces the fitness value of the overall match. However, the end segments show high levels of identity with both retrotransposons in the repeat library and the entire segment is masked. This example illustrates how Greedier can compensate for incomplete and redundant entries within the repeat library yet still output accurate masking results.
|
The example above also suggests Greedier could have a higher accuracy due to the graph-based fitness value instead of identifying nested repeats. If this were the case, we would expect that implementing the fitness value at a single, low cutoff would yield the same extent of masking as completing 15 iterations with Greedier. To test this prediction, we ran a single iteration of Greedier on the Arabidopsis genome using the lowest fitness cutoff (i.e. 25%). With this single iteration, we only masked 10.8% of the annotated transposon bases in comparison to 16% of the transposon bases when all iterations are completed. This experiment also gave a slightly lower false positive rate of 0.52% versus that of Greedier at 0.78%. Interestingly, the relative ratio of true positives to false positives in the single iteration versus Greedier is about the same, 19.2 versus 20.5, respectively. These results suggest that stitching the genome together after removing repeats gives a significant gain in the total number of transposon bases masked, while retaining a low rate of masking true genes. Moreover, these results show that the graph-based approach of Greedier results in a larger total number of transposon bases masked with a higher accuracy than cross_match and WindowMasker even using a single iteration (see Table 1).
In addition to masking repeats, Greedier also reports potential nested transposon structures. On the Arabidopsis genome level, Greedier finds a total of potential nested structures with 92, 92, 106, 89 and 208 structures from chromosomes 1, 2, 3, 4 and 5, respectively. These structures can be used by biologists as the candidate set to identity nested transposon structures.
It would be interesting to see the percentage of the transposons that are nested insertions. In order to calculate this number we analyze the Greedier results and the TAIR annotations. Greedier reports which fragments are potential nested insertions. However, Greedier can generate false positives. TAIR annotates repeats, however it does not distinguish insertions and it can have false negatives. In order to count the nested insertions correctly we do the following measurements. We count the number of bases that belong to potential nested insertions identified by Greedier and annotated as repeats by TAIR. Let X denote this number. We also count the number of bases identified as repeats (nested or not) by both Greedier and TAIR. Let Y denote this number. We count the percentage of bases in nested insertions as X x 100/Y. This percentage for chromosomes 1, 2, 3, 4 and 5 of the Arabidopsis genome is 24.6%, 36.1%, 24.5%, 36.5% and 41.4% respectively, with an average of 32.6%.
RepeatMasker identifies 74%, 81%, 91%, 75% and 75% of these nested repeats for chromosomes 1, 2, 3, 4 and 5, respectively, with an average of 79%. Thus, although RepeatMasker masks more than twice as many bases as Greedier, it misses significant portion of the nested insertions. This means that Greedier identifies potential nested repeats more accurately than RepeatMasker. Furthermore, RepeatMasker does not distinguish a nested insertion from a repeat that is not nested insertion.
3.3 Probability analyses
Only vertices on traversed paths correspond to BLAST matches, Greedier, however, masks the regions between these vertices. Would it be possible that these regions are purely random? To test this hypothesis, we performed two probability analyses. As a by-product of the single-iteration experiment in Section 3.2, we calculated the TP/FP (True Positive/False Positive) rate of all regions between the vertices. This rate is 23.52. As shown in Table 1, the total numbers of annotated transposon bases and bases of exons that do not belong to transposons are 5 905 785 and 42 900 000, respectively. These two numbers yields an expected TP/FP rate of 0.14, which is far
23.52. As a second analysis, we calculated the probability (or the p-value) of finding as many TPs as it appears in these regions with uniform background distribution of TPs as given in Table 1. This p-value was zero when calculated using the incomplete beta function. In other words, it was less than the machine precision. Both probability analyses argue that regions between vertices masked by Greedier are not random.
3.4 Evaluation of possible optimization strategies
Tables 1 and 2 show the number of bases masked by Greedier after the final iteration of the program, once no more BLAST hits can be found. It is possible that the last few iterations contribute the bulk of the false positive masking and that the algorithm could be optimized by retaining a higher stringency for the pairwise matches. Figure 7 shows the cumulative percentages of masked bases with each iteration for the Arabidopsis genome. Greedier completes the bulk of the repeat masking in the last few iterations but retains a nearly constant rate of false positive masking. These data suggest that increasing the stringency of Greedier will not significantly reduce false positives but will significantly reduce the number of true positive bases masked. Figure 7 also shows that Greedier identifies more TPs than cross_match and WindowMasker after 12th and 11th iterations, respectively while identifying fewer false positives all the time.
|
Greedier, cross_match, and WindowMasker all masked only a small fraction of the transposons that were annotated in the Arabidopsis and rice genomic sequences we tested. It is possible that significant BLAST hits were missed by Greedier, because they had insufficient fitness values to be masked. This hypothesis predicts that the masked genome should contain transposons that still have significant matches to the repeat library. We extracted the annotated transposons missed by Greedier from the Arabidopsis genome. We ran BLAST with these transposons as the query set and the repeat library as the library. BLAST did not report any matches. Thus, annotated transposons missed by Greedier are primarily due to the absence of these sequences in the repeat library.
| 4 DISCUSSION |
|---|
|
|
|---|
Repeat identification is a challenging problem. Conventionally, repeats are identified through a mix of structure-based and similarity-based searches that are annotated manually. As more genomes have been sequenced, it has become important to identify repeats using automated approaches. However, the biological characteristics of repeat sequences create challenges for automated repeat masking. Transposons share many sequence features with true genes, which confuses gene finding algorithms and leads to the annotation of transposons as genes (Bennetzen et al., 2004). Transposons and other repeat sequences also evolve rapidly making it difficult to recognize repeats through pairwise comparisons. Similarly, transposons have the tendency to insert in existing repeats creating repeat fragments that are more difficult to identify using similarity scores. Greedier addresses many of these challenges and shows a clear improvement over the standard repeat masking algorithm, cross_match (www.phrap.org/phredphrapconsed.html). It also has an additional feature of reporting potential nested transposon structures, which can help biologists annotate complex repeat structures in genomic sequences. However, Greedier requires a library of known repeat units to identify the remaining repeats within a target sequence. Hence, Greedier is limited by the accuracy and completeness of the repeat library.
Word counting algorithms such as WindowMasker (Morgulis et al., 2006) circumvent the need for a repeat library. These algorithms mask short sequences that are over-represented in the target genome. In our experiments, we found WindowMasker to have a low level of accuracy even though it can mask more total bases than Greedier. Word counting methods have lower accuracies, because word counting does not take into account biological characteristics of both repeat and gene sequences. First, repeats do not always exist in high copies (Bennetzen, 1996). Repeats can diverge rapidly leading to sequences that will not be over-represented relative to the whole genome. Second, genes can contain sequences that would be considered high-copy. Genes can evolve through amplification events and be found in closely-related families. Also, sequence motifs within genes are conserved and found in many genes. These conservations are likely to cause segments of genes to be detected as over-represented. In contrast to word counting algorithms, structure-based de novo repeat finders can find low-copy repeats. However, these algorithms tend to be limited to finding repeats that have well-conserved structures and are unlikely to find divergent and fragmented repeats. Potentially, generating a library of highly-conserved repeat sequences using algorithms like TRF (Benson, 1999), LTR_STRUC (McCarthy and McDonald, 2003) and PILER (Edgar and Myers, 2005) would serve to generate a more complete representation of the highly-conserved transposons and other repeat units within a genome. Greedier could then be used to identify divergent and fragmented repeats based on the computationally-generated libraries.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on March 15, 2007; revised on December 10, 2007; accepted on December 10, 2007
| REFERENCES |
|---|
|
|
|---|
Bedell J, et al. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics (2000) 16:1040–1041.
Bennetzen J. The contributions of retroelements to plant genome organization, function and evolution. Trends Microbiol (1996) 4:347–353.[CrossRef][Web of Science][Medline]
Bennetzen J, et al. Consistent over-estimation of gene number in complex plant genomes. Curr. Opin. Plant Biol (2004) 7:732–726.[CrossRef][Web of Science][Medline]
Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research (1999) 27:573–580.
Bowen N, Jordan I. Transposable elements and the evolution of eukaryotic complexity. Curr. Issues Mol. Biol (2002) 4:65–76.[Medline]
Buard J, Jeffreys A. Big, bad minisatellites. Nature Genetics (1997) 15:327–328.[CrossRef][Web of Science][Medline]
Campagna D, et al. RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics (2005) 21:582–588.
Caspi A, Pachter L. Identification of transposable elements using multiple alignments of related genomes. Genome Research (2006) 16:260–270.
Consortium IHG. Initial sequenccing and analysis of the human genome. Nature (2001) 409:860–921.[CrossRef][Medline]
Durbin R, et al. Biological Sequence Analysis. (1998) Cambridge: Cambridge University Press. 24–26. chapter 2. http://www.amazon.com/exec/oboidos/ASIN/0521629713/internationcommun.
Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics (2005) 21:152–158.
Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol (1982) 162:705–708.[CrossRef][Web of Science][Medline]
Jurka J, et al. CENSOR - a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem (1996) 20:119–122.[CrossRef][Web of Science][Medline]
Kurtz S, et al. Computation and visualization of degenerate repeats in complete genomes. In: Intelligent Systems for Molecular Biology (ISMB). (2000) California: AAAI Press. 228–238.
Ma1 J, et al. Analyses of LTR-retrotransposon structures reveal recent and rapid genomic DNA loss in rice. Genome Research (2004) 14:860–869.
McCarthy EM, McDonald JF. LTR STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics (2003) 19:363–367.
Morgulis A, et al. WindowMasker: window-based masker for sequenced genomes. Bioinformatics (2006) 22:134–141.
Needleman SB, Wunsch CD. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology (1970) 48:443–53.[CrossRef][Web of Science][Medline]
Sanmiguel P, et al. Nested retrotransposons in the intergenic regions of the maize genome. Science (1996) 274:765–768.
Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Gene. Dev (1999) 9:657–663.[CrossRef][Web of Science][Medline]
Tatusova T, Madden T. BLAST 2 Sequences, A New Tool for Comparing Protein and Nucleotide Sequences. FEMS Microbiol. Lett (1999) 177:247–250.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




