Bioinformatics Advance Access originally published online on January 25, 2005
Bioinformatics 2005 21(9):2097-2098; doi:10.1093/bioinformatics/bti257
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

TruMatch—a BLAST post-processor that identifies bona fide sequence matches to genome assemblies

Weixi Li 1, Cathryn J. Rehmeyer 2, Chuck Staben 1 and Mark L. Farman 2,*

1Department of Biological Sciences, University of Kentucky, Lexington, KY 40546, USA
2Department of Plant Pathology, University of Kentucky, Lexington, KY 40546, USA

*To whom correspondence should be addressed.


    Abstract

Summary: BLAST is a widely used alignment tool for detecting matches between a query sequence and entries in nucleotide sequence databases. Matches (high-scoring pairs, HSPs) are assigned a score based on alignment length and quality and, by default, are reported with the top-scoring matches listed first. For certain types of searches, however, this method of reporting is not optimal. This is particularly true when searching a genome sequence with a query that was derived from the same genome, or a closely related one. If the genome is complex and the assembly is far from complete, correct matches are often relegated to low positions in the results, where they may be easily overlooked. To rectify this problem, we developed TruMatch—a program that parses standard BLAST outputs and identifies HSPs that involve query segments with unique matches to the assembly. Candidates for bona fide matches between a query sequence and a genome assembly are listed at the top of the TruMatch output.

Availability: TruMatch is written in Perl and is freely available to non-commercial users via web download at the URL: http://genome.kbrin.uky.edu/fungi_tel/TruMatch/

Contact: farman{at}uky.edu

A complete genome sequence is an invaluable resource for studying genome organization, rearrangement and evolution. Analysis of chromosome deletions or translocations, as well as comparison of chromosome structure among related species, often necessitates the use of sequence searching tools to orient one genomic region with respect to another. The most widely used tool for comparing DNA sequences in this manner is BLASTN (Altschul et al., 1990), which reports each sequence match as an HSP. Ideally, when a genome sequence is searched with a query from that genome (or a closely related one) the genomic region that corresponds to the query should occur in the top HSP. Unfortunately, this is often not the case. While mapping telomere-associated sequences from the fungus Magnaporthe grisea to the M.grisea genome assembly, we found that the genomic contig representing the true match to a query was often reported at a very low position in the BLAST output. Moreover, in the absence of obvious distinguishing features, the true match was not easily found. Not surprisingly, all of the ‘offending’ queries contained repeated DNA sequences but many also contained regions with a unique match to the genome assembly. Close inspection revealed that the low ranking of alignments spanning unique regions was invariably due to the match being curtailed by a gap in the assembly (Fig. 1). Consequently, if these alignments were shorter than those involving repeated sequences also present in the query, the unique (correct) matches were assigned lower scores and, therefore, occurred at lower positions in the results.



Fig. 1 Relegation of ‘true’ BLAST matches as a result of short overlaps. Lines represent single-copy DNA sequences and boxes represent repeated DNA. A hypothetical query sequence containing parts of two repeats and a stretch of unique DNA (heavy line) is shown. In this example, the query sequence overlaps the very end of the correct genomic contig (W), resulting in an alignment that is shorter than those between repeat 2 and similar sequences in the genome, exemplified by copies found in contigs X, Y and Z.

 

    TruMatch
TruMatch recognizes that the correct alignment need not be the longest one. Instead, it takes the primary indication of a true match to the genome to be that the query sequence, or a portion thereof, matches only one genomic ‘subject’. Accordingly, TruMatch takes a standard BLASTN output and systematically examines every HSP for a given query to determine whether the query contains a region that matches a single genomic sequence. Before it does this, however, it imposes two rules to qualify HSPs for further analysis. First, it requires an HSP to have ≥ 98% identity over the aligned region (a value < 100% allows for base-calling errors). This condition screens out instances where the true matching sequence is absent from the genome assembly, yet a number of similar but non-identical sequences are present.

Next, TruMatch checks the extent of each alignment to make sure that the HSP spans the whole length of the query sequence, as would be expected for a correct match (a 50 nt ‘buffer’ is allowed at each end of the query to account for base-calling errors). In recognition of the possibility that valid alignments can be curtailed when an HSP runs off the end of an ‘incomplete’ subject sequence, the latter rule is waived if the HSP overlaps either end of the subject.
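
The two qualification rules described above can be sketched in a few lines. This is an illustration in Python, not the TruMatch Perl source: the `HSP` record, its field names and the `qualifies` function are hypothetical, and the reading of 'overlaps either end of the subject' as 'the alignment reaches the first or last base of the subject' is an assumption. The defaults of 98% identity and a 50 nt buffer are the values given in the text.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    """One BLASTN high-scoring pair (field names are illustrative)."""
    subject_id: str
    query_start: int      # 1-based, inclusive
    query_end: int
    subject_start: int
    subject_end: int
    identity_pct: float   # percent identity over the aligned region


def qualifies(hsp, query_len, subject_len, min_identity=98.0, end_buffer=50):
    """Return (passed, reason) for the identity and alignment-extent rules."""
    # Rule 1: near-identical alignment, leaving room for base-calling errors.
    if hsp.identity_pct < min_identity:
        return False, "identity below threshold"

    # Rule 2: the alignment reaches to within `end_buffer` nt of both query ends.
    spans_query = (hsp.query_start - 1 <= end_buffer
                   and query_len - hsp.query_end <= end_buffer)

    # Waiver (assumed interpretation): the alignment runs off an end of the
    # subject. Subject coordinates may be reversed for minus-strand hits.
    s_lo = min(hsp.subject_start, hsp.subject_end)
    s_hi = max(hsp.subject_start, hsp.subject_end)
    touches_subject_end = s_lo == 1 or s_hi == subject_len

    if spans_query or touches_subject_end:
        return True, "ok"
    return False, "alignment does not span query"
```

Returning the reason alongside the verdict mirrors the behaviour described for rejected HSPs, which are reported together with the test they failed.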

HSPs that pass these preliminary tests are then scrutinized to determine which portion of the query sequence is involved in each alignment. The query is thus divided into segments, and the number of matches involving each segment is tallied. If TruMatch identifies a query segment that occurs in only one HSP, the corresponding alignment is considered to be the correct one and the full HSP alignment is reported at the top of the results section devoted to that query. Next, HSPs containing segments with a cardinality of two are listed, and so on, up to a user-defined cutoff. HSPs that do not fall within the cutoff, as well as those that failed to qualify through the match percentage and alignment length tests, are reported in a separate file, along with the reason for rejection. To provide flexibility of use, the validation parameters described above are all user-definable.
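
The segment tally can be sketched as follows. The text does not say how segment boundaries are chosen, so this Python illustration assumes that the query is cut at every HSP start and end, and that each HSP is ranked by the lowest cardinality among the segments it covers. The function name `rank_by_uniqueness` and the tuple layout are hypothetical, and the code is not taken from TruMatch itself.

```python
def rank_by_uniqueness(hsps, max_cardinality=3):
    """Rank qualified HSPs by the most unique query segment they cover.

    hsps: list of (subject_id, query_start, query_end), 1-based inclusive.
    Returns (ranked, rejected), each a list of (cardinality, hsp) pairs.
    """
    # Cut the query wherever an HSP starts or just after one ends
    # (assumed segmentation scheme).
    cuts = sorted({s for _, s, _ in hsps} | {e + 1 for _, _, e in hsps})
    segments = list(zip(cuts[:-1], [c - 1 for c in cuts[1:]]))

    def covers(hsp, seg):
        return hsp[1] <= seg[0] and seg[1] <= hsp[2]

    # Cardinality of a segment: the number of HSPs that include it.
    counts = {seg: sum(covers(h, seg) for h in hsps) for seg in segments}

    ranked, rejected = [], []
    for h in hsps:
        card = min(counts[seg] for seg in segments if covers(h, seg))
        (ranked if card <= max_cardinality else rejected).append((card, h))

    # Unique matches (cardinality 1) first; the sort is stable, so ties
    # keep their original BLAST order.
    ranked.sort(key=lambda pair: pair[0])
    return ranked, rejected


# A layout like that of Fig. 1: contig W covers a unique stretch plus part of
# a repeat, while contigs X, Y and Z match only the repeat.
hsps = [("X", 501, 1000), ("Y", 501, 1000), ("W", 350, 600), ("Z", 501, 1000)]
ranked, rejected = rank_by_uniqueness(hsps, max_cardinality=2)
print(ranked[0])
```

In this example the segment 350-500 is covered by W alone, so W is promoted to the top even though its alignment is the shortest of the four.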

TruMatch's efficacy is illustrated by the results obtained when it processed the BLAST output from a search of the M.grisea genome assembly with query sequences derived from 14 telomeric cosmid clones, each of which contained an insert of ~45 kb. BLASTN identified a total of 14 402 matches with an e-value below 1e-100. TruMatch considered only 68 of these to be candidates for bona fide matches (i.e. they contained a segment with a unique match to the genome). Of these, 41 were fully verified by independent means and no valid matches were overlooked. Fourteen of the confirmed matches were elevated from below the fifth position in the BLAST report to the top-ranking TruMatch result, with an average change in ranking of 124 positions (range 7–630). Thus, TruMatch dramatically reduced the amount of manual interpretation required to link telomere sequences to the genome assembly.


    ADDITIONAL SOFTWARE REQUIREMENTS
TruMatch runs on any platform with Perl version 5.x or higher and requires installation of the BioPerl module Bio::SearchIO (http://bio.perl.org/), which it uses to parse the BLAST reports.


    Acknowledgments
 
Thanks to Chris Schardl for his critical review of the manuscript. This work was supported by a subcontract to Chuck Staben from the Kentucky Biomedical Research Infrastructure Network, 5P20RR016481-03, awarded to Nigel Cooper of the University of Louisville, by the National Center for Research Resources and by a National Science Foundation award, MCB-0135462, to Mark Farman. This is Kentucky Agricultural Experiment Station publication #05-12-004.

Received on November 11, 2004; revised on December 29, 2004; accepted on December 29, 2004

    REFERENCE

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

