Bioinformatics Advance Access originally published online on January 25, 2005
Bioinformatics 2005 21(9):2097-2098; doi:10.1093/bioinformatics/bti257
TruMatch: a BLAST post-processor that identifies bona fide sequence matches to genome assemblies
1Department of Biological Sciences, University of Kentucky, Lexington, KY 40546, USA
2Department of Plant Pathology, University of Kentucky, Lexington, KY 40546, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: BLAST is a widely used alignment tool for detecting matches between a query sequence and entries in nucleotide sequence databases. Matches (high-scoring pairs, HSPs) are assigned a score based on alignment length and quality and, by default, are reported with the top-scoring matches listed first. For certain types of searches, however, this method of reporting is not optimal. This is particularly true when searching a genome sequence with a query that was derived from the same genome, or a closely related one. If the genome is complex and the assembly is far from complete, correct matches are often relegated to low positions in the results, where they may be easily overlooked. To rectify this problem, we developed TruMatch, a program that parses standard BLAST outputs and identifies HSPs that involve query segments with unique matches to the assembly. Candidates for bona fide matches between a query sequence and a genome assembly are listed at the top of the TruMatch output.
Availability: TruMatch is written in Perl and is freely available to non-commercial users via web download at the URL: http://genome.kbrin.uky.edu/fungi_tel/TruMatch/
Contact: farman{at}uky.edu
A complete genome sequence is an invaluable resource for studying genome organization, rearrangement and evolution. Analysis of chromosome deletions or translocations, as well as comparison of chromosome structure among related species, often necessitates the use of sequence searching tools to orient one genomic region with respect to another. The most widely used tool for comparing DNA sequences in this manner is BLASTN (Altschul et al., 1990), which reports each sequence match as an HSP. Ideally, when a genome sequence is searched with a query from that genome (or a closely related one) the genomic region that corresponds to the query should occur in the top HSP. Unfortunately, this is often not the case. While mapping telomere-associated sequences from the fungus Magnaporthe grisea to the M.grisea genome assembly, we found that the genomic contig representing the true match to a query was often reported at a very low position in the BLAST output. Moreover, in the absence of obvious distinguishing features, the true match was not easily found. Not surprisingly, all of the offending queries contained repeated DNA sequences but many also contained regions with a unique match to the genome assembly. Close inspection revealed that the low ranking of alignments spanning unique regions was invariably due to the match being curtailed by a gap in the assembly (Fig. 1). Consequently, if these alignments were shorter than those involving repeated sequences also present in the query, the unique (correct) matches were assigned lower scores and, therefore, occurred at lower positions in the results.
| TruMatch |
|---|
TruMatch recognizes that the correct alignment need not be the longest one. Instead, it considers the primary indication of a true match to the genome to be that the query sequence, or a portion thereof, matches only one genomic subject. Accordingly, TruMatch takes a standard BLASTN output and systematically examines every HSP for a given query to determine whether the query contains a region that matches a single genomic sequence. Before it does this, however, it imposes two rules to qualify HSPs for further analysis. First, it requires an HSP to have at least 98% identity over the aligned region (a value < 100% allows for base-calling errors). This condition screens out instances where the true matching sequence is absent from the genome assembly, yet a number of similar but non-identical sequences are present. Next, TruMatch checks the extent of each alignment to make sure that the HSP spans the whole length of the query sequence, as would be expected for a correct match (a 50 nt buffer is allowed at each end of the query to account for base-calling errors). In recognition of the possibility that valid alignments can be curtailed when an HSP runs off the end of an incomplete subject sequence, the latter rule is waived if the HSP overlaps either end of the subject.
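The two qualification rules can be sketched in a few lines. The following Python fragment is an illustration only, not the authors' Perl implementation: the `HSP` record, the `qualifies` function, and the rejection messages are hypothetical names, and the default thresholds simply restate the values given above (98% identity, 50 nt buffer).

```python
from dataclasses import dataclass


@dataclass
class HSP:
    """One BLASTN high-scoring pair; coordinates are 1-based and inclusive.

    Hypothetical record for illustration; field names are assumptions.
    """
    subject_id: str
    identity_pct: float  # percent identity over the aligned region
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    query_len: int
    subject_len: int


def qualifies(hsp, min_identity=98.0, end_buffer=50):
    """Return (passed, reason) for the two preliminary rules."""
    # Rule 1: near-identical alignment.
    if hsp.identity_pct < min_identity:
        return False, "identity below threshold"

    # Rule 2: the alignment should reach both query ends, within the buffer.
    spans_query = (hsp.q_start <= 1 + end_buffer
                   and hsp.q_end >= hsp.query_len - end_buffer)

    # Waiver: the alignment reaches either end of the subject sequence.
    # Minus-strand hits report s_start > s_end, so sort the coordinates first.
    s_lo, s_hi = sorted((hsp.s_start, hsp.s_end))
    touches_subject_end = s_lo <= 1 or s_hi >= hsp.subject_len

    if spans_query or touches_subject_end:
        return True, "ok"
    return False, "alignment does not span query"
```

Returning a reason alongside the verdict mirrors the behaviour described for TruMatch, which reports rejected HSPs together with the cause of rejection.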
HSPs that pass these preliminary tests are then scrutinized to determine which portion of the query sequence is involved in each alignment. The query is thus divided into segments, and the number of matches involving each segment is tallied. If TruMatch identifies a query segment that occurs in only one HSP, the corresponding alignment is considered to be the correct one and the full HSP alignment is reported at the top of the results section devoted to that query. Next, HSPs containing segments with a cardinality of two are listed, and so on, up to a user-defined cutoff. HSPs that do not fall within the cutoff, as well as those that failed to qualify through the match percentage and alignment length tests, are reported in a separate file, along with the reason for rejection. To provide flexibility of use, the validation parameters described above are all user-definable.
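One way to picture the segment tally is to cut the query at every HSP boundary and count how many HSPs cover each resulting segment. The Python sketch below makes that assumption explicit; how TruMatch actually delimits segments is not specified here, and `rank_by_uniqueness`, `max_cardinality`, and the tuple layout are hypothetical. HSP names are assumed to be unique.

```python
def rank_by_uniqueness(hsps, max_cardinality=2):
    """Rank HSPs by the lowest segment cardinality they contain.

    hsps: list of (name, q_start, q_end) with 1-based inclusive query
    coordinates and unique names. Returns (ranked, rejected), where ranked
    is a list of (cardinality, name) in ascending order and rejected holds
    the names whose best segment exceeds the cutoff.
    """
    # Work in half-open form: an HSP covering [s, e] covers [s, e + 1).
    points = sorted({p for _, s, e in hsps for p in (s, e + 1)})
    best = {name: None for name, _, _ in hsps}

    # Each pair of adjacent breakpoints is one elementary query segment.
    for lo, hi in zip(points, points[1:]):
        covering = [name for name, s, e in hsps if s <= lo and hi <= e + 1]
        for name in covering:
            if best[name] is None or len(covering) < best[name]:
                best[name] = len(covering)

    ranked = sorted((c, n) for n, c in best.items() if c <= max_cardinality)
    rejected = sorted(n for n, c in best.items() if c > max_cardinality)
    return ranked, rejected


# A 1000 nt query whose first 600 nt are repetitive (contigs A, B, C) and
# whose remainder matches only contig D.
example = [("A", 1, 600), ("B", 1, 600), ("C", 1, 600), ("D", 500, 1000)]
print(rank_by_uniqueness(example))  # ([(1, 'D')], ['A', 'B', 'C'])
```

In this toy case the HSP to contig D is the only one that covers query positions 601 to 1000, so it is listed first even though the three repeat alignments are longer. A practical version would probably also ignore segments below some minimum length, which this sketch does not do.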
An example of TruMatch's efficacy is provided by the results obtained when it was used to process the BLAST output generated by searching the genome assembly for the fungus M.grisea with query sequences derived from 14 telomeric cosmid clones, each of which contained an insert of approximately 45 kb. BLASTN identified a total of 14 402 matches with an e-value below 1e-100. TruMatch considered only 68 to be candidates for bona fide matches (i.e. they contained a segment with a unique match to the genome). Of these, 41 were fully verified by independent means and no valid matches were overlooked. Fourteen of the confirmed matches were elevated from below the fifth position in the BLAST report to the top-ranking TruMatch result, with an average change in ranking of 124 positions (range 7–630). Thus, TruMatch dramatically reduced the amount of manual interpretation required to link telomere sequences to the genome assembly.
| ADDITIONAL SOFTWARE REQUIREMENTS |
|---|
TruMatch can run on any platform with Perl version 5.x or higher and requires installation of the BioPerl module Bio::SearchIO (http://bio.perl.org/), which it uses to parse the BLAST reports.
| Acknowledgments |
|---|
Thanks to Chris Schardl for his critical review of the manuscript. This work was supported by a subcontract to Chuck Staben from the Kentucky Biomedical Research Infrastructure Network, 5P20RR016481-03, awarded to Nigel Cooper of the University of Louisville, by the National Center for Research Resources and by a National Science Foundation award, MCB-0135462, to Mark Farman. This is Kentucky Agricultural Experiment Station publication #05-12-004.
Received on November 11, 2004; revised on December 29, 2004; accepted on December 29, 2004
| REFERENCE |
|---|
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
