Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1695-1698; doi:10.1093/bioinformatics/bti181
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

TERMINUS—Telomeric End-Read Mining IN Unassembled Sequences

Weixi Li 1, Cathryn J. Rehmeyer 2, Chuck Staben 1 and Mark L Farman 2,*

1Department of Biological Sciences, University of Kentucky, Lexington, KY 40546, USA
2Department of Plant Pathology, University of Kentucky, Lexington, KY 40546, USA

*To whom correspondence should be addressed.


    Abstract

Summary: TERMINUS is a set of tools to map telomeres on draft sequences of whole genome shotgun sequencing projects. It mines raw sequence reads (from a trace archive) for telomeric reads, assembles them into contigs representing individual chromosome ends and BLASTs the resulting consensus sequences against the genome assembly to identify telomere-proximal genomic contigs. Finally, it estimates the sizes of telomeric gaps and identifies clones for gap closure. TERMINUS is implemented as a set of Perl scripts that requires two sets of inputs: the NCBI Trace Archive files for a given genome project; and ancillary genome assembly information. Results are output in spreadsheets containing information that facilitates manual validation.

Availability: The TERMINUS package and supplementary information can be downloaded from http://www.genome.kbrin.uky.edu/fungi_tel/terminus/

Contact: farman@uky.edu

Telomeres are specialized structures at the ends of linear eukaryotic chromosomes, and are crucial for the maintenance of chromosome integrity and genome stability. A telomere usually consists of a simple, tandemly repeated DNA sequence, with (TTAGGG)n being the most commonly seen form in vertebrates and fungi (McEachern et al., 2000). Recent studies of microbial telomere regions suggest that subtelomeric regions harbor genes that facilitate niche adaptation (Gardner et al., 2002; Rudenko, 2000; Winzeler et al., 2003). The ever-expanding availability of whole genome shotgun sequences, especially the sequences of fungi from the Fungal Genome Initiative (FGI), presents us with an ideal opportunity to test this hypothesis using comparative genomics. However, when the canonical telomere repeat sequence (TTAGGG)5 was used to query several fungal genome databases, the number of hits obtained was usually far fewer than the actual number of chromosomal ends. This shortcoming is true for most genome sequence assemblies (Rehmeyer and Farman, unpublished results) and, in some cases, combined approaches such as targeted cloning, physical mapping and bioinformatic analysis have been necessary to close the gaps (Riethman et al., 2004).

Even for genome projects with few telomere repeats in the assembly, we found that the corresponding NCBI Trace Archive usually contained hundreds of reads matching the (TTAGGG)5 query, suggesting that valuable telomere information might be retrievable from archived reads. To use this potentially rich source of telomere information, we developed TERMINUS, a package of Perl scripts that extracts, assembles and categorizes telomeric reads and their ‘mate-pair’ reads from the trace archive, links them to the whole genome assembly through BLAST searches and then validates the resulting matches (Fig. 1).



Fig. 1 Strategy for assembling telomeric and telomere-associated reads into telomeric scaffolds and subsequent linking to the genome assembly. (A) Telomeric reads are identified based on sequence content and their mate-pair reads are retrieved using the TEMPLATE_ID information. (B) Telomeric reads and subtelomeric reads thus identified are assembled into TelContigs and SubTelContigs, respectively. TelContigs and SubTelContigs linked via mate-pair relationships are defined as ‘telomeric scaffolds’. (C) Overlaps between telomeric scaffolds and the genome assembly are identified through unique BLAST matches and are classified as valid if the distance between the match and the scaffold's end is less than the templates' insert sizes.

 
TERMINUS has multiple uses in a genome sequencing project. By mapping the telomeres and subtelomeric regions in the draft sequence, TERMINUS allows analysis of telomere organization and content well before the finished version of the assembly is available. TERMINUS also facilitates gap closure by estimating the gap sizes and revealing the identities of clones that can be used for primer walking across the gaps. In addition, TERMINUS can identify potential assembly problems if a telomere maps to a contig at a position that is too far away from the end of the genomic scaffold. As such, TERMINUS represents a useful first step in an integrated approach toward obtaining complete sequence coverage of telomere regions.

TERMINUS consists of three components, each of which uses user-definable parameters (the values given below are the defaults). Component one extracts ‘telomeric’ mate-pairs (paired reads from each end of a sequencing template) from the trace archives and subjects them to quality screening, vector trimming and screening of internal telomere sequences. The input is the name of a directory containing uncompressed files, including the sequences (fasta.*), quality scores (qual.*) and ancillary information (anc.*). TERMINUS formats the sequences into a BLAST-ready database using formatdb, which is supplied with the BLAST executables (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/); it then uses a local version of BLAST (Altschul et al., 1990) to search the database for a user-defined sequence (in this case [CCCTAA]5), using parameters ‘-e 1e-7’, ‘-F F’, ‘-S 1’ and ‘100% identity’. The TEMPLATE_ID for each positive ‘hit’ is retrieved from the ancillary information file, allowing identification of the corresponding ‘subtelomeric’ mate-pair sequence. The two sets of reads (telomeric and subtelomeric) are then quality-screened based on the PHRED scores (Ewing et al., 1998) provided in the ‘quality’ files. Low quality portions (average PHRED score in 20 bp sliding window <20) are trimmed from the ‘front’ and ‘back’ ends of each read. Vector sequences are removed with cross_match (http://www.phrap.org/phredphrap/general.html), using the NCBI UniVec sequence file (-minmatch 10 -minscore 20). Finally, mate-pairs are discarded if either read has <100 bp of usable sequence. Reads containing the telomere repeat are then re-examined to ensure that the repeat begins within 50 bp of the sequence start (as would be expected if a true chromosome end were ligated directly to the plasmid vector). Mate-pairs that fail this requirement are eliminated from further analysis.
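The read-screening rules above can be illustrated with a minimal Python sketch. It is not the TERMINUS Perl code: the function names are hypothetical, the exact way TERMINUS places trim boundaries inside a window is not described here and is assumed, and the telomere test uses a plain exact-string search for five copies of CCCTAA in place of the BLAST search.

```python
from typing import Optional, Sequence, Tuple


def trim_low_quality(quals: Sequence[int], window: int = 20,
                     min_mean: float = 20.0) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the retained region (0-based, end exclusive).

    Assumption: the retained region runs from the start of the first
    window whose mean PHRED score is >= min_mean to the end of the last
    such window. Returns None if no window qualifies.
    """
    n = len(quals)
    if n < window:
        return None
    # Prefix sums give each window mean in constant time.
    prefix = [0]
    for q in quals:
        prefix.append(prefix[-1] + q)
    good = [i for i in range(n - window + 1)
            if (prefix[i + window] - prefix[i]) / window >= min_mean]
    if not good:
        return None
    return good[0], good[-1] + window


def repeat_starts_early(seq: str, unit: str = "CCCTAA", copies: int = 5,
                        max_offset: int = 50) -> bool:
    """True if an exact run of `copies` repeat units begins within
    max_offset bases of the read start (simplified stand-in for BLAST)."""
    idx = seq.upper().find(unit * copies)
    return 0 <= idx < max_offset


def keep_mate_pair(tel_len: int, subtel_len: int, min_len: int = 100) -> bool:
    """A mate-pair is kept only if both trimmed reads have >= min_len bases."""
    return tel_len >= min_len and subtel_len >= min_len
```

For example, a read with 10 poor bases, 60 good bases and 10 poor bases keeps most of its central portion, while a read whose telomere repeat starts 60 bases in is rejected as a probable internal repeat.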
Telomere and subtelomere reads are output to two FASTA files, named ‘tel_asb_query’ and ‘subtel_asb_query’, respectively. Relevant ancillary information is also extracted and output to corresponding files. The second component of TERMINUS takes the FASTA files from component one and interacts with StackPACK (Christoffels et al., 2001; http://www.sanbi.ac.za/CODES/STACKPACK_REQUEST/) to create separate assemblies for the telomeric and subtelomeric reads. The assembled contigs are designated as ‘TelContigs’ and ‘SubTelContigs’, and the ancillary information is then used to rebuild ‘mate-pair’ relationships, thereby providing ‘links’ between each TelContig and its corresponding SubTelContig sequences. It is important to note that a TelContig usually has more than one associated SubTelContig, owing to variation in clone insert sizes.
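Rebuilding mate-pair links after assembly amounts to a join on the template identifier. The sketch below shows one way this could be done in Python; the dictionary layout and the function name link_contigs are assumptions made for illustration and do not reflect how TERMINUS stores its data.

```python
from collections import defaultdict
from typing import Dict, Set


def link_contigs(tel_read_contig: Dict[str, str],
                 subtel_read_contig: Dict[str, str],
                 read_template: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each TelContig to the SubTelContigs that share a template with it.

    tel_read_contig / subtel_read_contig: read name -> contig name, as
    assigned by the assembler. read_template: read name -> TEMPLATE_ID
    taken from the ancillary information.
    """
    # Index subtelomeric contigs by the template their reads came from.
    template_to_sub: Dict[str, Set[str]] = defaultdict(set)
    for read, contig in subtel_read_contig.items():
        template_to_sub[read_template[read]].add(contig)

    # A telomeric read and a subtelomeric read from the same template are
    # mates, so their contigs are linked.
    links: Dict[str, Set[str]] = defaultdict(set)
    for read, contig in tel_read_contig.items():
        links[contig].update(template_to_sub.get(read_template[read], set()))
    return dict(links)
```

Because templates differ in insert size, reads from one TelContig can have mates that fall into different SubTelContigs, which is why each TelContig maps to a set.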

The last component in TERMINUS retrieves TelContigs and SubTelContigs from the StackPACK system and queries them against the genome assembly using BLAST with stringent parameters (-e 1e-100, ≥98% identity, query sequence length completely covered by BLAST hit, with the last rule being waived if the match ‘runs off’ the end of a genomic contig). Hits from TelContigs should verify the assembly, while a SubTelContig hit effectively anchors its associated TelContig to a position in the genome assembly (Fig. 1). To guard against misleading hits due to repetitive sequences or segmental duplications, TERMINUS requires that: (1) at least 50 bp within the query sequence should exhibit a unique BLAST match (see below); (2) the position of the BLAST match within an assembly scaffold must be consistent with a telomere proximal location. More specifically, the distance between the BLAST hit and the scaffold end must not exceed the clone insert size (obtained from the ancillary information file).
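The stringency filter and the positional consistency test can be sketched as follows. This is a simplified Python illustration, not the TERMINUS implementation: the Hit record and its field names are hypothetical, the genomic contig and the scaffold are treated as a single subject sequence, coordinates are assumed to be 1-based with start <= end, and the uniqueness test on 50 bp of the query is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_len: int
    query_start: int    # 1-based, inclusive
    query_end: int
    subject_start: int  # 1-based position on the genomic sequence
    subject_end: int
    subject_len: int
    evalue: float
    identity: float     # percent identity


def passes_stringency(hit: Hit, max_evalue: float = 1e-100,
                      min_identity: float = 98.0) -> bool:
    """E-value and identity thresholds plus full query coverage; the
    coverage rule is waived when the match runs off the subject's end."""
    if hit.evalue > max_evalue or hit.identity < min_identity:
        return False
    full_cover = hit.query_start == 1 and hit.query_end == hit.query_len
    runs_off = hit.subject_start == 1 or hit.subject_end == hit.subject_len
    return full_cover or runs_off


def distance_to_nearest_end(hit: Hit) -> int:
    """Bases between the match and the closer end of the subject."""
    return min(hit.subject_start - 1, hit.subject_len - hit.subject_end)


def is_positionally_consistent(hit: Hit, insert_size: int) -> bool:
    """The match must lie within one clone insert length of the end."""
    return distance_to_nearest_end(hit) <= insert_size
```

A match 2 kb from the end of a scaffold is therefore accepted for a template with a 4 kb insert but rejected for one with a 1.5 kb insert.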

TERMINUS produces several output files including: (a) consensus sequences for the TelContigs and SubTelContigs in FASTA format and their respective BLAST results; (b) spreadsheets listing the TelContigs/SubTelContigs for which unique BLAST matches can be identified (and the results of the positional consistency tests); (c) a third spreadsheet listing only those TelContigs/SubTelContigs that satisfy all the validation criteria; and (d) a fourth and final spreadsheet that identifies templates that can be used for closing telomeric gaps. This hierarchical system of reports allows the user to focus initially on TelContigs that have been mapped automatically with a high degree of confidence, while facilitating the mapping of additional telomeres through manual interpretation of the complete BLAST results.

Performance was tested by identifying and assembling telomeres for several fungi, as well as a selection of higher organisms. Inspection of the TERMINUS reports for each project revealed that assembly of the two types of reads (telomeres and mate-pairs) produced a number of contigs (based on two or more reads) and several remaining singlets. Only TelContigs were considered reliable because it was not possible to rule out the possibility that TelSinglet templates were chimeric. As expected, the TelContigs rarely exhibited BLAST matches to the genome sequence. Nevertheless, it was still possible to link most of them to the assembly via their associated SubTelContigs/Singlets; genomic matches meeting TERMINUS' positional consistency criteria were considered to be most reliable if confirmed by two or more SubTelContigs/Singlets associated with the TelContig in question. A minimum requirement would be that the match involves a SubTelContig, so that at least two independent reads support telomere placement.

In the cases of Magnaporthe grisea and Aspergillus nidulans, there were instances where multiple SubTelContigs matched a given genomic contig, yet the positional consistency requirement was not met. Given that TERMINUS reports only unique BLAST matches, this points to a probable assembly error, either involving a false sequence merge, or failure of the assembly to document a duplication.

The results of several analyses are summarized in Table 1. For the fungal genomes, TERMINUS was consistently able to identify and map new telomeres to the fungal assemblies and, in some cases, allowed mapping of all chromosome ends, even when these regions were rich in repeated sequences. In the case of M.grisea, TERMINUS enabled us to identify clones that have been used to complete the sequences of all 14 telomeres. For some fungi, TERMINUS identified more TelContigs/Singlets than the known number of chromosome ends. Analysis of M.grisea revealed that this was due to de novo telomere formation during culture of the organism for DNA extraction (Rehmeyer and Farman, unpublished results).


Table 1 Summary of TERMINUS results for a selection of organisms

 
The more complex genomes proved more problematic, with the result that it was possible to map only three telomeres each for rice and rat, and none from the nematode or zebrafish (Table 1). TERMINUS' inability to assemble telomeres in these larger genomes was largely due to low telomere read density, absence of overlapping sequences in the assembly and/or the inability to identify BLAST matches to unique genomic segments. Even in cases where multiple BLAST hits were detected, the matches to the genome sequence were often far from exact, indicating that the true overlapping regions were simply not present in the assembly. As such, incorporation of telomeres into these genome assemblies will likely require targeted cloning and a variety of physical mapping approaches, such as those employed by Riethman et al. (2004). Nevertheless, TERMINUS' strength lies in its ability to explore all possible links between telomeric sequence reads and a genome assembly and provide results and other relevant data in a format that facilitates manual interpretation and validation of predicted telomere locations.

TERMINUS can run on any UNIX/Linux-based platform. It requires the following external software: NCBI BLAST and formatdb, StackPACK v2.2, Perl 5.x or higher, and MySQL (v. 3.23 or higher). On a Linux system (Red Hat 7.3) with a 2.0 GHz Pentium processor and 1.0 GB of RAM, it took ~30 min to analyze each fungal genome and up to 29 h for the largest project analyzed (rat) (Table 1).


    Acknowledgments
 
This work was supported by a subcontract to Chuck Staben from the Kentucky Biomedical Research Infrastructure Network, 5P20RR016481-03, awarded to Nigel Cooper of the University of Louisville, by the National Center for Research Resources and by a National Science Foundation award, MCB-0135462, to Mark Farman. This is Kentucky Agricultural Experiment Station publication 04-12-186.

Received on March 14, 2004; revised on November 8, 2004; accepted on November 23, 2004

    REFERENCES

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

    Christoffels, A., van Gelder, A., Greyling, G., Miller, R., Hide, T., Hide, W. (2001) STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res., 29, 234–238.

    Ewing, B., Hillier, L., Wendl, M.C., Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res., 8, 175–185.

    Gardner, M.J., Hall, N., Fung, E., White, O., Berriman, M., Hyman, R.W., Carlton, J.M., Pain, A., Nelson, K.E., Bowman, S., Barrell, B. (2002) Genome sequence of the human malarial parasite, Plasmodium falciparum. Nature, 419, 498–511.

    McEachern, M.J., Krauskopf, A., Blackburn, E.H. (2000) Telomeres and their control. Annu. Rev. Genet., 34, 331–358.

    Riethman, H., Ambrosini, A., Castaneda, C., Finkelstein, J., Xue-Lan, H., Mudunuri, U., Paul, S., Wei, J. (2004) Mapping and initial analysis of human subtelomeric sequence assemblies. Genome Res., 14, 18–28.

    Rudenko, G. (2000) The polymorphic telomeres of the African trypanosome Trypanosoma brucei. Biochem. Soc. Transact., 28, 536–540.

    Winzeler, E.A., Castillo-Davis, C.I., Oshiro, G., Liang, D., Richards, D.R., Zhou, Y., Hartl, D.L. (2003) Genetic diversity in yeast assessed with whole-genome oligonucleotide arrays. Genetics, 163, 79–89.





