Bioinformatics Advance Access originally published online on February 19, 2004
Bioinformatics 20(9) © Oxford University Press 2004; all rights reserved.
Enhanced homology searching through genome reading frame predetermination





Department of Bioinformatics, Merck & Co., Inc., P.O. Box 2000, RY80-A1, Rahway, NJ 07065, USA
Received on July 29, 2003; revised on October 20, 2003; accepted on December 18, 2003
Advance Access Publication February 19, 2004
Motivation: Many bioinformatic approaches exist for finding novel genes within genomic sequence data. Homology search-based methods are often the first approach employed to determine whether a novel gene exists that is similar to a known gene. Unfortunately, distantly related genes or motifs are often difficult to find using single query-based homology search algorithms against large sequence datasets such as the human genome. We therefore sought to develop an approach that enhances the sensitivity of traditional single query-based homology algorithms against genomic data without losing search selectivity.
Results: We demonstrate that searching against a genome fragmented into all possible reading frames enhances the sensitivity of homology-based searches without degrading their selectivity. Using the ETS-domain, bromodomain and acetyl-CoA acetyltransferase gene as queries, we show that direct protein-protein searches using BLAST2P or FASTA3 against a human genome segmented among all possible reading frames and translated were substantially more sensitive than traditional protein-DNA searches against raw genomic sequence using an application such as TBLAST2N. Receiver operating characteristic analysis demonstrated that the algorithms remained selective, while comparisons of the algorithms showed that the protein-protein searches were more sensitive in identifying hits. Therefore, through the overprediction of reading frames by this method and the increased sensitivity of protein-protein based homology search algorithms, a genome can be deeply mined, potentially finding hits overlooked by protein-DNA searches against raw genomic data.
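
As an illustration of the idea of fragmenting a genome into all possible reading frames, the following Python sketch translates a DNA sequence in all six frames (three offsets on each strand) and splits each translation at stop codons to produce protein fragments that could serve as a protein search database. It is a minimal sketch, not the authors' pipeline: the abstract does not specify how fragments were delimited, so splitting at stop codons, the `min_len` cutoff, the frame labels, and all function names here are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate DNA codon by codon; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_fragments(dna: str, min_len: int = 1):
    """Yield (frame_label, protein_fragment) for all six reading frames.

    Each frame's translation is split at stop codons ('*'); fragments
    shorter than min_len (a hypothetical cutoff) are discarded.
    """
    dna = dna.upper()
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(strand[offset:])
            for fragment in protein.split("*"):
                if len(fragment) >= min_len:
                    yield f"{sign}{offset + 1}", fragment


if __name__ == "__main__":
    # Toy sequence; a real run would stream genomic contigs instead.
    for frame, fragment in six_frame_fragments("ATGGCCTAAATGAAA", min_len=2):
        print(frame, fragment)
```

Because every frame is translated regardless of whether it encodes a real gene, this deliberately overpredicts reading frames; the resulting fragments could then be written out as FASTA and searched with a protein-protein tool.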
Contact: jeffrey_yuan{at}merck.com
* To whom correspondence is to be addressed.
Present address: Department of Molecular Profiling, Merck & Co., Inc., P.O. Box 2000, RY80M-162, Rahway, NJ 07065, USA.