Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1695-1698; doi:10.1093/bioinformatics/bti181
TERMINUS: Telomeric End-Read Mining IN Unassembled Sequences
¹Department of Biological Sciences, University of Kentucky, Lexington, KY 40546, USA
²Department of Plant Pathology, University of Kentucky, Lexington, KY 40546, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: TERMINUS is a set of tools to map telomeres on draft sequences of whole genome shotgun sequencing projects. It mines raw sequence reads (from a trace archive) for telomeric reads, assembles them into contigs representing individual chromosome ends and BLASTs the resulting consensus sequences against the genome assembly to identify telomere-proximal genomic contigs. Finally, it estimates the sizes of telomeric gaps and identifies clones for gap closure. TERMINUS is implemented as a set of Perl scripts that requires two sets of inputs: the NCBI Trace Archive files for a given genome project; and ancillary genome assembly information. Results are output in spreadsheets containing information that facilitates manual validation.
Availability: The TERMINUS package and supplementary information can be downloaded from http://www.genome.kbrin.uky.edu/fungi_tel/terminus/
Contact: farman@uky.edu
Telomeres are specialized structures at the ends of linear eukaryotic chromosomes, and are crucial for the maintenance of chromosome integrity and genome stability. A telomere usually consists of a simple, tandemly repeated DNA sequence, with (TTAGGG)n being the most common form in vertebrates and fungi (McEachern et al., 2000). Recent studies of microbial telomere regions suggest that subtelomeric regions harbor genes that facilitate niche adaptation (Gardner et al., 2002; Rudenko, 2000; Winzeler et al., 2003). The ever-expanding availability of whole genome shotgun sequences, especially the sequences of fungi from the Fungal Genome Initiative (FGI), presents us with an ideal opportunity to test this hypothesis using comparative genomics. However, when the canonical telomere repeat sequence (TTAGGG)5 was used to query several fungal genome databases, the number of hits obtained was usually far lower than the actual number of chromosomal ends. This shortcoming is true for most genome sequence assemblies (Rehmeyer and Farman, unpublished results) and, in some cases, combined approaches such as targeted cloning, physical mapping and bioinformatic analysis have been necessary to close the gaps (Riethman et al., 2004).
Even for genome projects with few telomere repeats in the assembly, we found that the corresponding NCBI Trace Archive usually contained hundreds of reads matching the (TTAGGG)5 query, suggesting that valuable telomere information might be retrievable from archived reads. To use this potentially rich source of telomere information, we developed TERMINUS, a package of Perl scripts that extracts, assembles and categorizes telomeric reads and their mate-pair reads from the trace archive, then links them to the whole genome assembly through BLAST searches and validates the resulting matches (Fig. 1).
TERMINUS has multiple uses in a genome sequencing project. By mapping the telomeres and subtelomeric regions in the draft sequence, TERMINUS allows analysis of telomere organization and content well before the finished version of the assembly is available. TERMINUS also facilitates gap closure by estimating the gap sizes and revealing the identities of clones that can be used for primer walking across the gaps. In addition, TERMINUS can identify potential assembly problems if a telomere maps to a contig at a position that is too far away from the end of the genomic scaffold. As such, TERMINUS represents a useful first step in an integrated approach toward obtaining complete sequence coverage of telomere regions.
TERMINUS consists of three components, each of which uses user-definable parameters (the values given below are the defaults). Component one extracts telomeric mate-pairs (paired reads from each end of a sequencing template) from the trace archives and subjects them to quality screening, vector trimming and screening for internal telomere sequences. The input is the name of a directory containing uncompressed files, including the sequences (fasta.*), quality scores (qual.*) and ancillary information (anc.*). TERMINUS formats each sequence into a BLAST-ready database using formatdb, which is supplied with the BLAST executables (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/); it then uses a local version of BLAST (Altschul et al., 1990) to search the database for a user-defined sequence (in this case [CCCTAA]5), using parameters -e 1e-7, -F F, -S 1 and 100% identity. The TEMPLATE_ID for each positive hit is retrieved from the ancillary information file, allowing identification of the corresponding subtelomeric mate-pair sequence. The two sets of reads (telomeric and subtelomeric) are then quality-screened based on the PHRED scores (Ewing et al., 1998) provided in the quality files. Low-quality portions (average PHRED score <20 in a 20 bp sliding window) are trimmed from the front and back ends of each read. Vector sequences are removed with cross_match (http://www.phrap.org/phredphrap/general.html), using the NCBI UniVec sequence file (minmatch 10 minscore 20). Finally, mate-pairs are discarded if either read has <100 bp of usable sequence. Reads containing the telomere repeat are then re-examined to ensure that the repeat begins within 50 bp of the sequence start (as would be expected if a true chromosome end were ligated directly to the plasmid vector). Mate-pairs that fail this requirement are eliminated from further analysis.
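The quality and end-proximity filters of Component one can be sketched in Python as follows. This is a minimal illustration, not the TERMINUS Perl code: the function names, the one-base-at-a-time window stepping and the requirement of at least two tandem repeat units are assumptions made for the example. Only the thresholds (20 bp window, mean PHRED 20, 100 bp minimum length, 50 bp repeat offset) come from the description above, and vector trimming with cross_match is left out.

```python
import re

WINDOW = 20        # sliding-window width in bp (default from the text)
MIN_MEAN_Q = 20    # minimum mean PHRED score per window
MIN_LENGTH = 100   # minimum usable bases per read
MAX_OFFSET = 50    # repeat must begin within this many bp of the read start


def trim_low_quality(quals, window=WINDOW, min_mean=MIN_MEAN_Q):
    """Return (start, end) of the region kept after trimming both ends.

    Assumed behaviour: step inward from each end one base at a time
    until a window whose mean PHRED score reaches min_mean is found.
    Returns (0, 0) if no window passes.
    """
    n = len(quals)
    threshold = window * min_mean
    start = None
    for i in range(n - window + 1):
        if sum(quals[i:i + window]) >= threshold:
            start = i
            break
    if start is None:
        return (0, 0)
    end = start + window
    for j in range(n, start + window - 1, -1):
        if sum(quals[j - window:j]) >= threshold:
            end = j
            break
    return (start, end)


def repeat_starts_near_end(seq, unit="CCCTAA", min_units=2,
                           max_offset=MAX_OFFSET):
    """True if a tandem run of `unit` begins in the first max_offset bases."""
    match = re.search("(?:%s){%d,}" % (unit, min_units), seq.upper())
    return match is not None and match.start() < max_offset


def keep_mate_pair(tel_seq, tel_quals, sub_quals):
    """Apply the length and repeat-position filters to one mate-pair."""
    t_start, t_end = trim_low_quality(tel_quals)
    s_start, s_end = trim_low_quality(sub_quals)
    if (t_end - t_start) < MIN_LENGTH or (s_end - s_start) < MIN_LENGTH:
        return False
    return repeat_starts_near_end(tel_seq[t_start:t_end])
```

A read whose repeat run starts 80 bp into the trimmed sequence would be rejected by `repeat_starts_near_end`, which corresponds to the screen against internal telomere sequences.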
Telomere and subtelomere reads are output to two FASTA files, named tel_asb_query and subtel_asb_query, respectively. Relevant ancillary information is also extracted and output to corresponding files. The second component of TERMINUS takes the FASTA files from Component 1 and interacts with StackPACK (Christoffels et al., 2001; http://www.sanbi.ac.za/CODES/STACKPACK_REQUEST/) to create separate assemblies for the telomeric and subtelomeric reads. The assembled contigs are designated as TelContigs and SubTelContigs, and the ancillary information is then used to rebuild mate-pair relationships, thereby providing links between each TelContig and its corresponding SubTelContig sequences. It is important to note that a TelContig usually has more than one associated SubTelContig, owing to variation in clone insert sizes.
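Rebuilding the mate-pair relationships amounts to joining the two assemblies on the shared template identifier. The Python sketch below shows one way to do this, assuming three hypothetical in-memory mappings (read name to TelContig, read name to SubTelContig, read name to TEMPLATE_ID); the function name and data layout are illustrative and are not taken from the TERMINUS scripts or from StackPACK.

```python
from collections import defaultdict


def link_contigs(tel_read_to_contig, sub_read_to_contig, read_to_template):
    """Map each TelContig to the SubTelContigs that share a template with it.

    All three arguments are plain dicts (hypothetical layout):
      tel_read_to_contig: telomeric read name -> TelContig name
      sub_read_to_contig: subtelomeric read name -> SubTelContig name
      read_to_template:   read name -> template ID from the ancillary file
    """
    # Index subtelomeric contigs by the template their reads came from.
    template_to_sub = defaultdict(set)
    for read, contig in sub_read_to_contig.items():
        template_to_sub[read_to_template[read]].add(contig)

    # For every telomeric read, follow its template to the mate's contig.
    links = defaultdict(set)
    for read, contig in tel_read_to_contig.items():
        template = read_to_template[read]
        links[contig] |= template_to_sub.get(template, set())

    return {contig: sorted(subs) for contig, subs in links.items()}
```

Because clones of different insert sizes place the subtelomeric mate at different distances from the chromosome end, a single TelContig will often map to several SubTelContigs in the returned dictionary.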
The last component in TERMINUS retrieves TelContigs and SubTelContigs from the StackPACK system and queries them against the genome assembly using BLAST with stringent parameters (-e 1e-100, 98% identity, query sequence length completely covered by BLAST hit, with the last rule being waived if the match runs off the end of a genomic contig). Hits from TelContigs should verify the assembly, while a SubTelContig hit effectively anchors its associated TelContig to a position in the genome assembly (Fig. 1). To guard against misleading hits due to repetitive sequences or segmental duplications, TERMINUS requires that: (1) at least 50 bp within the query sequence should exhibit a unique BLAST match (see below); (2) the position of the BLAST match within an assembly scaffold must be consistent with a telomere proximal location. More specifically, the distance between the BLAST hit and the scaffold end must not exceed the clone insert size (obtained from the ancillary information file).
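The coverage and positional consistency rules can be written as two small predicates. The Python sketch below is an illustration under stated assumptions: coordinates are 1-based and inclusive, the distance is measured from the nearer edge of the hit to the nearer scaffold end, and "runs off the end of a genomic contig" is approximated as the alignment reaching the first or last base of the contig. The function names are hypothetical, and the uniqueness test is not covered.

```python
def is_telomere_proximal(hit_start, hit_end, scaffold_length, insert_size):
    """True if the hit lies within one insert size of either scaffold end.

    Coordinates are 1-based, inclusive positions on the scaffold; the hit
    may be given in either orientation.
    """
    lo, hi = sorted((hit_start, hit_end))
    dist_left = lo - 1                  # bases between hit and scaffold start
    dist_right = scaffold_length - hi   # bases between hit and scaffold end
    return min(dist_left, dist_right) <= insert_size


def covers_query(q_start, q_end, query_len, s_start, s_end, contig_len):
    """Full-length query coverage, waived when the hit reaches a contig end."""
    full = min(q_start, q_end) == 1 and max(q_start, q_end) == query_len
    runs_off = min(s_start, s_end) == 1 or max(s_start, s_end) == contig_len
    return full or runs_off
```

For example, a SubTelContig hit spanning positions 2001 to 2500 of a 100 kb scaffold passes for a clone with a 4 kb insert, whereas the same hit in the middle of the scaffold does not.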
TERMINUS produces several output files including: (a) consensus sequences for the TelContigs and SubTelContigs in FASTA format and their respective BLAST results; (b) spreadsheets listing the TelContigs/SubTelContigs for which unique BLAST matches can be identified (and the results of the positional consistency tests); (c) a third spreadsheet listing only those TelContigs/SubTelContigs that satisfy all the validation criteria; and (d) a fourth and final spreadsheet that identifies templates that can be used for closing telomeric gaps. This hierarchical system of reports allows the user to focus initially on TelContigs that have been mapped automatically with a high degree of confidence, while facilitating the mapping of additional telomeres through manual interpretation of the complete BLAST results.
Performance was tested by identifying and assembling telomeres for several fungi, as well as a selection of higher organisms. Inspection of the TERMINUS reports for each project revealed that assembly of the two types of reads (telomeres and mate-pairs) produced a number of contigs (based on two or more reads) and several remaining singlets. Only TelContigs were considered reliable because it was not possible to rule out the possibility that TelSinglet templates were chimeric. As expected, the TelContigs rarely exhibited BLAST matches to the genome sequence. Nevertheless, it was still possible to link most of them to the assembly via their associated SubTelContigs/Singlets; genomic matches meeting TERMINUS' positional consistency criteria were considered to be most reliable if confirmed by two or more SubTelContigs/Singlets associated with the TelContig in question. A minimum requirement would be that the match involves a SubTelContig, so that at least two independent reads support telomere placement.
In the cases of Magnaporthe grisea and Aspergillus nidulans, there were instances where multiple SubTelContigs matched a given genomic contig, yet the positional consistency requirement was not met. Given that TERMINUS reports only unique BLAST matches, this points to a probable assembly error, either involving a false sequence merge, or failure of the assembly to document a duplication.
The results of several analyses are summarized in Table 1. For the fungal genomes, TERMINUS was consistently able to identify and map new telomeres to the fungal assemblies and, in some cases, allowed mapping of all chromosome ends, even when these regions were rich in repeated sequences. In the case of M.grisea, TERMINUS enabled us to identify clones that have been used to complete the sequences of all 14 telomeres. For some fungi, TERMINUS identified more TelContigs/Singlets than the known number of chromosome ends. Analysis of M.grisea revealed that this was due to de novo telomere formation during culture of the organism for DNA extraction (Rehmeyer and Farman, unpublished results).
The more complex genomes proved more problematic, with the result that it was possible to map only three telomeres each for rice and rat, and none from the nematode or zebrafish (Table 1). TERMINUS' inability to assemble telomeres in these larger genomes was largely due to low telomere read density, absence of overlapping sequences in the assembly and/or the inability to identify BLAST matches to unique genomic segments. Even in cases where multiple BLAST hits were detected, the matches to the genome sequence were often far from exact, indicating that the true overlapping regions were simply not present in the assembly. As such, incorporation of telomeres into these genome assemblies will likely require targeted cloning and a variety of physical mapping approaches, such as those employed by Riethman et al. (2004). Nevertheless, TERMINUS' strength lies in its ability to explore all possible links between telomeric sequence reads and a genome assembly and provide results and other relevant data in a format that facilitates manual interpretation and validation of predicted telomere locations.
TERMINUS can run on any UNIX/LINUX-based platform. It requires the following external software: NCBI BLAST and formatdb, StackPACK v2.2, Perl 5.x or higher, and MySQL (v. 3.23 or higher). On a Linux system (Redhat 7.3) with a Pentium 2.0 GHz processor and 1.0 GB RAM, it took 30 min to analyze each fungal genome and up to 29 h for the largest project analyzed (rat) (Table 1).
| Acknowledgments |
|---|
This work was supported by a subcontract to Chuck Staben from the Kentucky Biomedical Research Infrastructure Network, 5P20RR016481-03, awarded to Nigel Cooper of the University of Louisville, by the National Center for Research Resources and by a National Science Foundation award, MCB-0135462, to Mark Farman. This is Kentucky Agricultural Experiment Station publication 04-12-186.
Received on March 14, 2004; revised on November 8, 2004; accepted on November 23, 2004
| REFERENCES |
|---|
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Christoffels, A., van Gelder, A., Greyling, G., Miller, R., Hide, T., Hide, W. (2001) STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res., 29, 234–238.
Ewing, B., Hillier, L., Wendl, M.C., Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res., 8, 175–185.
Gardner, M.J., Hall, N., Fung, E., White, O., Berriman, M., Hyman, R.W., Carlton, J.M., Pain, A., Nelson, K.E., Bowman, S., Barrell, B. (2002) Genome sequence of the human malarial parasite, Plasmodium falciparum. Nature, 419, 498–511.
McEachern, M.J., Krauskopf, A., Blackburn, E.H. (2000) Telomeres and their control. Annu. Rev. Genet., 34, 331–358.
Riethman, H., Ambrosini, A., Castaneda, C., Finkelstein, J., Xue-Lan, H., Mudunuri, U., Paul, S., Wei, J. (2004) Mapping and initial analysis of human subtelomeric sequence assemblies. Genome Res., 14, 18–28.
Rudenko, G. (2000) The polymorphic telomeres of the African trypanosome Trypanosoma brucei. Biochem. Soc. Transact., 28, 536–540.
Winzeler, E.A., Castillo-Davis, C.I., Oshiro, G., Liang, D., Richards, D.R., Zhou, Y., Hartl, D.L. (2003) Genetic diversity in yeast assessed with whole-genome oligonucleotide arrays. Genetics, 163, 79–89.