Bioinformatics Vol. 19 no. 5 2003
Pages 579-586
© 2003 Oxford University Press
Mapping multiple co-sequenced T-DNA integration sites within the Arabidopsis genome
Torrey Mesa Research Institute, 3115 Merryfield Row, San Diego, CA 92121, USA
Received on September 3, 2002; revised on October 27, 2002; accepted on November 4, 2002
Motivation: Insertion mutagenesis, using transgenes or endogenous transposons, is a popular method for generating null mutations (knockouts) in model organisms. Insertions are mapped to specific genes by amplifying (via TAIL-PCR) and sequencing genomic regions flanking the inserted DNA. The presence of multiple TAIL-PCR templates in one sequencing reaction results in chimeric sequence of intermittently low quality. Standard processing of this sequence by applying Phred quality requirements results in loss of informative sequence, whereas not trimming low-quality sequence causes inclusion of low-complexity homopolymers from the ends of sequence runs. Accurate mapping of the flanking sequences is complicated by the presence of gene families.
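The trade-off described above can be made concrete with a small sketch. It is not the authors' pipeline: the function names, the Phred threshold of 20, the minimum segment length, and the homopolymer run length are all assumptions chosen for illustration. The first function shows how a strict quality requirement breaks a read of intermittently low quality into short fragments; the second shows the kind of low-complexity tail that remains when no trimming is done.

```python
from typing import List, Tuple


def high_quality_segments(
    quals: List[int], threshold: int = 20, min_len: int = 5
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of consecutive bases whose
    Phred quality is >= threshold and whose length is >= min_len.
    Threshold and min_len are illustrative values, not from the paper."""
    segments = []
    start = None
    for i, q in enumerate(quals):
        if q >= threshold:
            if start is None:
                start = i  # open a new high-quality run
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # close the run at the first low-quality base
    if start is not None and len(quals) - start >= min_len:
        segments.append((start, len(quals)))
    return segments


def strip_homopolymer_tail(seq: str, min_run: int = 8) -> str:
    """Remove a trailing run of a single repeated base if it is at least
    min_run bases long (min_run is an assumed, illustrative cutoff)."""
    if not seq:
        return seq
    last = seq[-1]
    run = len(seq) - len(seq.rstrip(last))
    return seq[:-run] if run >= min_run else seq


# A read whose quality dips in the middle is split into two fragments,
# and everything in the low-quality stretch is lost.
example_quals = [30] * 6 + [10] * 3 + [25] * 7 + [5] * 4
example_segments = high_quality_segments(example_quals)

# An untrimmed read keeps its homopolymer tail unless it is removed explicitly.
example_trimmed = strip_homopolymer_tail("ACGTACGT" + "A" * 10)
```

Here the 20-base read yields two short segments, and the bases between them are discarded even if they would still have aligned to a reference genome.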
Results: Methods for extracting informative regions from sequence traces obtained by sequencing multiple TAIL-PCR fragments in a single reaction are described. The completely sequenced Arabidopsis genome was used to identify informative TAIL-PCR sequence regions. Methods were devised to define and select high quality matches and precisely map each insert to the correct genome location. These methods were used to analyze sequence of TAIL-PCR-amplified flanking regions of the inserts from individual plants in a T-DNA-mutagenized population of Arabidopsis thaliana, and are applicable to similar situations where a reference genome can be used to extract information from poor-quality sequence.
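The abstract does not give the criteria used to define a high quality match, so the following is only a hypothetical sketch of one way to select among candidate genome matches when gene families produce several near-identical hits. The `Hit` record, the `pick_unique_hit` function, and the identity, length, and margin cutoffs are all assumptions introduced here, not values from the paper.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """One candidate alignment of a read region to the reference genome
    (hypothetical record layout for illustration)."""
    chromosome: str
    position: int
    identity: float  # fraction of identical bases in the alignment
    length: int      # aligned length in bases


def pick_unique_hit(
    hits: List[Hit],
    min_identity: float = 0.95,
    min_length: int = 30,
    margin: float = 0.02,
) -> Optional[Hit]:
    """Return the best hit only if it passes the quality cutoffs and beats
    the runner-up by at least `margin` in identity; otherwise return None
    so that ambiguous placements (e.g. gene family members) are not mapped.
    All cutoffs are assumed values."""
    good = [h for h in hits if h.identity >= min_identity and h.length >= min_length]
    if not good:
        return None
    good.sort(key=lambda h: (h.identity, h.length), reverse=True)
    if len(good) > 1 and good[0].identity - good[1].identity < margin:
        return None  # two loci are nearly equally good: ambiguous
    return good[0]


clear = [Hit("chr1", 1000, 0.99, 120), Hit("chr3", 5000, 0.96, 110)]
ambiguous = [Hit("chr1", 1000, 0.99, 120), Hit("chr4", 7000, 0.985, 118)]

best_clear = pick_unique_hit(clear)
best_ambiguous = pick_unique_hit(ambiguous)
```

Returning no placement for the ambiguous case is a deliberately conservative design choice: it trades some mapped inserts for fewer assignments to the wrong member of a gene family.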
Contact: gernot@genome.clemson.edu
Supplementary information: Tables of the calculated T-DNA insertion sites and supplementary data are available at http://www.tmri.org/pages/collaborations/garlic_files/Bioinfo.html