Bioinformatics Advance Access published online on September 19, 2007
Bioinformatics, doi:10.1093/bioinformatics/btm420
Effect of the mutation rate and background size on the quality of pathogen identification
1Department of Computer Science, University of Houston, 501 Philip G. Hoffman Hall, Houston, TX USA 77204.
2Department of Statistics, Rice University, 6100 Main Street, MS138, Houston, TX USA 77005.
3Department of Biology and Biochemistry, University of Houston, Science and Research Bldg 2, Houston, TX USA 77204.
4Departmento de Fisica, CUCEI, Universidad de Guadalajara, Revolucion 1500, Guadalajara, Jal. Mexico 44430.
5Computations Department, Lawrence Livermore National Laboratory, 7000 East Ave. L-174, Livermore, CA USA 94550
*To whom correspondence should be addressed. Yuriy Fofanov, E-mail: yfofanov{at}bioinfo.uh.edu
Abstract
Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when the target organism cannot be isolated for any of a variety of reasons. Despite this potential, the presence of unknown, variable, and usually large quantities of background DNA can cause interference, resulting in false-positive outcomes.
Results: To estimate how the genomic diversity of the background (the total length of all the different genomes present in the background), the target length, and the target mutation rate affect the probability of misidentification, we introduce a mathematical definition of the quality of an individual signature in the presence of a background, based on the signature's length and the number of mismatches needed to transform it into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of backgrounds of different sizes, as well as the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicts that no genomic signature will be able to identify a target whose mutation rate is greater than 5%.
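The mismatch count that the quality definition relies on can be illustrated with a brute-force sketch. This is not the authors' implementation or their full quality measure, which the abstract does not spell out; it only computes the smallest number of substitutions separating a signature from any equal-length subsequence of a background. The function name `min_mismatches` and the toy sequences are hypothetical, and the sketch assumes a single-strand comparison with substitutions only (no insertions, deletions, or reverse complements).

```python
def min_mismatches(signature: str, background: str) -> int:
    """Return the smallest Hamming distance between `signature` and any
    equal-length window of `background`.

    Illustrative assumption: only substitutions on one strand are counted.
    """
    n = len(signature)
    if n == 0 or len(background) < n:
        raise ValueError("background must be at least as long as a non-empty signature")

    best = n  # the worst case is that every position differs
    for start in range(len(background) - n + 1):
        window = background[start:start + n]
        # Count the positions where the signature and this window disagree.
        mismatches = sum(a != b for a, b in zip(signature, window))
        if mismatches < best:
            best = mismatches
            if best == 0:  # exact match found, nothing can be closer
                break
    return best


# Toy example: the closest window to "ACGT" in this background is "ACGA".
closest = min_mismatches("ACGT", "TTACGATT")
```

A signature whose closest background subsequence is many mismatches away is harder to confuse with the background; a result of 0 means the signature itself occurs in the background. The scan takes time proportional to the signature length times the background length, so it is suitable only for small illustrations.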
Contact: yfofanov{at}bioinfo.uh.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: Dr. Chris Stoeckert
Received on May 10, 2007; revised on August 10, 2007; accepted on August 13, 2007