Bioinformatics Advance Access originally published online on September 19, 2007
Bioinformatics 2007 23(20):2665-2671; doi:10.1093/bioinformatics/btm420
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Effect of the mutation rate and background size on the quality of pathogen identification
1Department of Computer Science, University of Houston, 501 Philip G. Hoffman Hall, Houston, TX 77204, 2Department of Statistics, Rice University, 6100 Main Street, MS138, Houston, TX 77005, 3Department of Biology and Biochemistry, University of Houston, Science and Research Building 2, Houston, TX 77204, USA, 4Departmento de Fisica, CUCEI, Universidad de Guadalajara, Revolucion 1500, Guadalajara, Jalisco 44430, Mexico and 5Computations Department, Lawrence Livermore National Laboratory, 7000 East Avenue L-174, Livermore, CA 94550, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed by a variety of reasons. Despite this potential, the presence of the unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes.
Results: In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicted that any genomic signatures will be unable to identify target if its mutation rate is >5%.
Contact: yfofanov{at}bioinfo.uh.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Nucleic acid-based assays, such as PCR, RT-PCR and oligonucleotide microarrays, have the ability to provide a fast, specific, reliable and easy to use means for target (especially pathogen) identification. The success of any assay, however, is directly dependent upon the signature sequence(s) used to recognize and hybridize to the target organism's genomic material. A variety of different signature (probe and primer) design algorithms have been developed during the past decade (e.g. Draghici et al., 2005; Emrich et al., 2003; Fitch et al., 2002; Phillippy et al., 2007; Rahmann, 2003; Slezak et al., 2003; Tembe et al., 2007; Wu et al., 2004). When more than one target sequence is available, most algorithms either select signatures from conserved genomic regions (e.g. Draghici et al., 2005; Fitch et al., 2002; Slezak et al., 2003; Tembe et al., 2007) or select signatures from the set of all short subsequences of length n (n-mers) present simultaneously in all of the targets (e.g. Phillippy et al., 2007; Rahmann, 2003). Moreover, some applications also perform screening of potential signatures against GenBank to insure specificity (e.g. Draghici et al., 2005; Fitch et al., 2002; Phillippy et al., 2007; Slezak et al., 2003; Tembe et al., 2007). The ability of all of the available approaches to avoid false positive outcomes is limited by both the search algorithm employed and the amount of sequence data available.
In practice, one can distinguish at least three different scenarios for the identification of the target genome or gene of interest:
- The organism of interest has already been isolated and cultivated in the lab. The purpose of a test is to confirm the presence of the target and distinguish it from close relatives. The isolation process, however, can be time consuming, costly and sometimes even impossible [as it is estimated that only 0.001–0.1% of living microorganisms are cultivable (Amann et al., 1995)].
- The organism of interest has not been isolated and thus must be identified in the presence of a large, but known background, such as in the presence of host DNA (e.g. detecting bacterial infection in a sample of blood taken from a patient). In this case the challenge for the test is to not only distinguish the target from its close relatives, but also to make sure that the signatures used are unique for the target and do not appear in the background genomic material. In many cases, target amplification via PCR is combined with signature identification (as in TaqMan® PCR) to attempt to overcome this problem.
- The organism of interest has not been isolated and thus must be identified in the presence of a large and unknown background, such as complex environmental samples (air, water, soil, food, etc.).
The third scenario is the most challenging as the signatures must not only be unique enough to distinguish the target from its close relatives but also must have a low probability to appear in the background genomic material which is unknown and typically significantly larger and more variant than the intended target's genome.
In the present study, we seek to explore how the genomic diversity of the background (total length of all the different genomes present in the background), target genome length and target genome mutation rate affect the probability of false positive and false negative outcomes of detection.
| 2 METHODS |
|---|
|
|
|---|
For a glossary of the terms and the derivation of all of the formulas presented within the methods, the reader is referred to Appendices 1 and 2 in the Supplementary Material.
2.1 Quality of genomic signatures
For the purpose of the study presented here, we will consider any subsequence present in target (T) as a potential signature (sT). This signature sT is said to be M mismatches away from the background (B) if the closest subsequence of the same size that is present in the background (sB) is M mismatches away from sT. Based on this definition, M = 0 if the signature sT is present in the background and M > 0 if (a) any combination of M – 1 mismatches or mutations cannot transform sT to any sequence present in the background and (b) there is at least one combination of M mutations which can transform sT into at least one subsequence (sB) that is present in the background. Thus forth, we will consider mutations as substitutions only; incorporating all types of mutations, including any combination of insertions, deletions and substitutions, is the subject of future work.
In order to minimize the probability of a false (positive and negative) outcome of the test, a good signature must be as short as possible and as far away as possible from the closest sequence present in the background. We define the quality (qST) of signature sT in the presence of a particular background B:
|
| (1) |
2.2 Model
The analysis of the frequency of appearance of short n-mers, first discussed over a decade ago for short (2–9 nt long) sequences (Campbell et al., 1999; Deschavanne et al., 1999; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Nussinov, 1984; Sandberg et al., 2001), remains in the scope of interest of many practical applications including pathogen identification (Lehner, 2005; Karlin, 1998; Phillippy et al., 2007; Putonti et al., 2006). In contrast, from the analysis of the presence/absence of longer (4n>>m) subsequences (or n-mers of length n), one may conclude that the appearance of these longer n-mers in genomes can be approximated as a random and independent process (Chumakov et al., 2005; Fofanov et al., 2004).
Assuming that (a) the presence/absence of each n-mer in the background sequence(s) is random and (b) the frequency of appearance of all nucleotides are equal (25% each), the total number of different n-mers (K0, n) expected to be present in both strands of a genome of size m is (also see Appendix 2 in the Supplementary Material):
|
| (2) |
|
| (3) |
|
| (4) |
Obviously, the assumptions that the appearance of all of the overlapping n-mers (as well as their 1, 2, 3, etc. mismatch derivatives) in both strands of the genome can be modeled as a random and independent process has to be experimentally validated. Therefore, we performed a series of Monte Carlo simulations as well as explicit calculations using real genomic sequences. In the Monte Carlo simulations, we rigorously computed the total number of different n-mers (n = 6–14) in the 1, 2, 3 and 4+ mismatch areas for randomly generated double-stranded genomic sequences. We observed an extremely high correspondence between the number of n-mers observed and the number of n-mers predicted by Equations (2) and (4) (see e.g. Fig. 1).
|
The number of different n-mers predicted was also found to fit reasonably well (variance < 6.5%) to the number of n-mers observed for the complete genomic sequences of several viral, bacterial and eukaryotic organisms; Figure 2 illustrates the results for organisms of different sizes, the shorter viral Tick-borne Encephalitis virus [NCBI: NC_001672] (Fig. 2A) and the longer bacterium Escherichia coli O157:H7 [NCBI: NC_002695 [GenBank] ] (Fig. 2B), as well as organisms with extreme G+C contents, Bacillus anthracis str. Ames [NCBI: NC_003997 [GenBank] ] with a G+C content of 35.38% (Fig. 2C) and Mycobacterium tuberculosis H37Rv with a G+C content of 65.61% (Fig. 2D). Not surprisingly, the prediction was observed to work better for the shorter and less repetitious viral genome than the longer genomes with a G+C content near 50% (for more details see Supplementary Material).
|
Considering that there is a special interest in identifying pathogens within human samples, we have rigorously computed the number of all 10–18-mers present in the human genome (assembly 3.2.2) as well as those sequences in the 1, 2, 3 and 4+ mismatch areas (Table 1 and Supplementary Material). As one can see, the actual number of different n-mers found in the human genome and 1, 2, 3 and 4+ mismatches away deviates slightly from the predictions of Equations (2) and (4). This discrepancy is most likely due to the G+C bias (41%) (Lander et al., 2001) and the fact that
43% of the human genome is comprised of repetitive sequences (Li et al., 2001); accordingly the effective human genome size is probably several hundred megabases smaller than its actual size. The results of the comparisons performed between the predicted, simulated and real genomic data leads us to the conclusion that the proposed model can predict with reasonable accuracy the total number of different n-mers present in a genomic sequence as well as the number of n-mers located several mismatches away.
|
2.3 Average quality of signature in the presence of the background
In order to select signature sequences which are unique and unlikely to interfere with the background, one needs to consider the number of different n-mers which are expected to be present in the target sequence (T) of size mT and located M (M = 0, 1, 2, 3,...) mismatches away from a background (B) of size mB. Expanding the estimation of Equation (2) for the total number of different n-mers present in the target is equivalent to:
|
|
Likewise, the total number of different n-mers present in the background is:
|
|
Assuming that the appearances of n-mers in the background (as well as any mismatches away from the background) and the target are independent, the probability that a given n-mer will be present simultaneously in both the target and the background (number of such n-mers as a fraction of the total number of n-mers, 4n) is:
|
|
The probability for a signature (n-mer) to be found simultaneously in the target and M mismatches distance from the background can be formulated as:
|
|
Therefore, the probability for any signature randomly chosen from the target to be found M mismatches from the background area (including the perfect match: M = 0) is:
|
| (5) |
It is important to note that PB,M,n does not depend on the target size mT but rather only on the size of the background mB. The average (expected) quality of a signature randomly chosen from the target sT can be calculated as:
|
| (6) |
2.4 Mutations in the target
In many practical situations, the target organism present in the sample does not have the exact same genomic sequence as the strain or variant which has been sequenced and whose genomic sequence is used for designing signatures for its detection. For simplicity all of the single point base differences between the sequenced landmark target genome (from which signatures will be designed) and the sequence of the target genome present in the sample will henceforth be referred to as mutations. Such mutations can impact the reliability of a particular signature by destroying the signature such that it will no longer recognize the intended target organism. Thus, valid signatures must be common to both the landmark target genome and the target genome such that the signature is likely to be present in the sequence which can be generated from the landmark sequence given a particular mutation rate.
Assuming that each mutation appears at random and is independent from any other mutations in the sequence, the total number of n-mers in the target which is affected (changed) by mutations will be equivalent to (see Appendix 2 in the Supplementary Material for more details):
|
| (7) |
|
|
Assuming that these new n-mers will also appear at random from the possible 4n choices, it is expected that the appearance of these mutated n-mers will appear proportionately within the subset of n-mers that are already present in the target and the rest of the 4n space. Therefore, the number of new (not already present in the target) n-mers expected to appear will be:
|
|
Considering that the total number of n-mers in the mutated target remains the same, the number of n-mers common to the landmark and mutated target genomes will be (Appendix 2 in the Supplementary Material):
|
| (8) |
The probability of a random signature being conserved (not destroyed by mutations) is:
|
|
Multiplying this by the probability of finding an n-mer M mismatches from the background gives the probability for an n-mer picked at random to be common to both the original and mutated sequences and M mismatches from the background as:
|
|
|
| (9) |
In contrast to the average quality prediction of Equation (6), the average signature quality now depends on the target size mT.
Many studies, however, have shown that mutations within genomic sequences do not appear in a completely random manner (e.g. Graur and Li, 2000). In order to estimate how far the proposed model of mutations deviates from observations, we explicitly calculated the number of different n-mers present in various genomic sequences, including complete viral genomes (HIV), microbial genes (16S rRNA) and eukaryotic genes (hemoglobin alpha chain 1). The minor deviation from the prediction (Fig. 3) was found to be greater for the larger n as is expected given the tendency for mutations to appear more frequently in particular hot spots (Bailey et al., 2004; Galtier et al., 2006; Nigro et al., 1989; Tsunoyama et al., 2001; Vowles and Amos, 2004; Webber and Ponting, 2005). Not surprisingly, the predicted number of n-mers is an overestimation such that more signatures are predicted to be destroyed than are observed in the sequence data considered. Thus, the model from a practical viewpoint appears to always be on the safe side.
|
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
3.1 Correspondence between the model and real genomic sequences
In order to compare the results predicted by the model with that of real genomic data, the exact distance from the human genome of each 16- thru 21-mer present in 139 dengue virus serotypes 1–4 genomes (
11 kb) as well as 100+ microbial genomes of various sizes (1–10 mb) and various G+C contents (25–75%) was rigorously computed. The average quality of the signatures for each genome was then calculated (for more details see Supplementary Material) and is summarized in Figure 4A for the 139 dengue virus genomes and in Figure 4B for 14 pathogenic bacteria genomes (3 B.anthracis: G + C = 35.38%, 4 E.coli: G + C = 50.56%, 3 Shigella sp.: G + C = 50.94%, and 4 Yersinia sp.: G + C = 47.63%).
|
As can be seen in the case of the bacterial genomes with a G + C content
50%, the actual quality is greater than the prediction. This discrepancy is most likely due to the fact that the effective human genome size is smaller than the actual size which was used for the predictions. The deviation between the predicted and observed average signature qualities is smaller for the 3 B.anthracis strains considered, presumably because its low G + C content is reducing its effective genome size. In contrast with the bacterial genomes, the correspondence of the actual and predicted average quality of the signatures within the dengue genomes is essentially the same as was predicted. This is likely the result of two competing processes: (a) the overestimation of the effective genome size of the background which increases the predicted average quality of signatures and (b) the observation that all ssRNA viruses which are known to infect mammalian hosts have a tendency to have more n-mers present or close to the human genome than viruses which do not infect mammalian hosts, a possible residual of pathogen adaptation (data not shown).
3.2 Application of the model for signature design
Figure 5A illustrates the model's prediction for the dependence of the average quality from the signature (n-mer) size for different background sizes (mB). As one can observe, for each background size there is a specific n-mer size (threshold) below which the average quality of any given signature is zero. For example, this figure suggests that for the background size 10 Tb (expected diversity of environmental samples) all signatures that are
21 nt in length are expected to interfere with the background. Above this minimum length, however, it is apparent that longer signatures or smaller backgrounds always increase the average quality.
|
Using the proposed model, one can also observe that, while for small mutation rates longer signatures have an improved average quality, when the target has a higher mutation rate, the average signature quality rapidly decreases regardless of the signature length (Fig. 5B). Interestingly, when considering signatures with a length typical of primers currently being employed in assays (16–27 nt long), the reduction of the signature quality due to mutation occurs at nearly the same rate µ
3–5% (Fig. 5C). This leads us to the conclusion, which is of vital importance to many target identification assays: signatures designed using a known landmark genome will be virtually useless to identify any variant strains if it has a mutation rate greater than 5%. For example, signatures designed to identify dengue virus serotype 1 will probably miss dengue virus serotype 2 (28.5% average mutation rate). It is also important to mention that under certain conditions (such as mutation rate, target and background size) one can observe a well pronounced maximum for the average quality versus the length of the signature (Fig. 5D). This observation, however, requires the sizes of the target and background genomes to be of the same magnitude, which is not very likely to occur when performing actual pathogen identification tests.
We believe that the main benefits of the proposed approach are that it allows one to predict the minimum appropriate signature length for each background size as well as the critical mutation rate of the target to be detected by the signature (for more details see Table S1 in the Supplementary Material). Future work, however, could include incorporating into the model additional statistical properties of genomic sequences in order to address the differences between the real and effective sizes of genomic backgrounds and targets which are caused by the presence of repeatable regions, deviations in G + C content, etc.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
Part of this work was funded the Department of Homeland Security Science and Technology Directorate, award NBCHC070054 (Y.F.) and the Texas Learning and Computation Center (Y.F.). A portion of this work was performed under the auspices of the US Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48 (T.S.). The authors would like to thanks Marisa W. Lam for testing early versions of software involved in this work at LLNL.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Chris Stoeckert
Received on May 10, 2007; revised on August 10, 2007; accepted on August 13, 2007
| REFERENCES |
|---|
|
|
|---|
Amann RI, et al. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol. Rev (1995) 59:143–169.
Bailey JA, et al. Hotspots of mammalian chromosomal evolution. Genome Biol (2004) 5:R23.[CrossRef][Medline]
Campbell A, et al. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl Acad. Sci. USA (1999) 96:9184–9189.
Chumakov S, et al. Theoretical basis for universal identification systems for bacteria and viruses. J. Biol. Phys. Chem (2005) 5:121–128.
Deschavanne PJ, et al. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol (1999) 16:1391–1399.[Abstract]
Draghici S, et al. Identification of genomic signatures for the design of assays for the detection and monitoring of anthrax threats. Pac. Symp. Biocomput (2005) 10:248–259.
Emrich SJ, et al. PROBEmer: a web-based software tool for selecting optimal DNA oligos. Nucleic Acids Res (2003) 31:3746–3750.
Fitch JP, et al. Rapid development of nucleic acid diagnostics. Proc. IEEE (2002) 90:1708–1720.[CrossRef]
Fofanov Y, et al. How independent are the appearances of n-mers in different genomes? Bioinformatics (2004) 20:2421–2428.
Galtier N, et al. Mutation hot spots in mammalian mitochondrial DNA. Genome Res (2006) 16:215–222.
Graur D, Li W-H. Fundamentals of Molecular Evolution (2000) 2nd. Massachusetts: Sinauer Associates.
Karlin S. Global dinucleotide signatures and analysis of genomic heterogeneity. Curr. Opin. Micobiol (1998) 1:598–610.[CrossRef]
Karlin S, Ladunga I. Comparisons of eukaryotic genomic sequences. Proc. Natl Acad. Sci. USA (1994) 91:12832–12836.
Karlin S, et al. Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol (1997) 179:3899–3913.
Lander ES, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.[CrossRef][Medline]
Lehner A. Oligonucleotide microarray for identification of Enterococcus species. FEMS Microbiol. Lett (2005) 246:133–142.[CrossRef][Web of Science][Medline]
Li WH, et al. Evolutionary analyses of the human genome. Nature (2001) 409:847–849.[CrossRef][Medline]
Nakashima H, et al. Differences in dinucleotide frequencies of human, yeast, and Escherichia coli genes. DNA Res (1997) 4:185–192.[Abstract]
Nakashima H, et al. Genes from nine genomes are separated into their organisms in the dinucleotide composition space. DNA Res (1998) 5:251–259.[Abstract]
Nigro JM, et al. Mutations in the p53 gene occur in diverse human tumour types. Nature (1989) 342:705–708.[CrossRef][Medline]
Nussinov R. Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res (1984) 12:1749–1763.
Phillippy AM, et al. Comprehensive DNA signature discovery and validation. PLoS Comput. Biol (2007) 3:e98.[CrossRef][Medline]
Putonti C, et al. Human-blind probes and primers for dengue virus identification. FEBS J (2006) 273:398–408.[CrossRef][Medline]
Rahmann S. Fast and sensitive probe selection for DNA chips using jumps in matching statistics. Proc. IEEE Comput. Soc. Bioinform. Conf (2003) 2:57–64.[Medline]
Sandberg R, et al. Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res (2001) 11:1404–1409.
Slezak T, et al. Comparative genomics tools applied to bioterrorism defense. Brief. Bioinform (2003) 4:133–149.
Tembe W, et al. Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays. Bioinformatics (2007) 23:5–13.
Tsunoyama K, et al. Intragenic variation of synonymous substitution rates is caused by nonrandom mutations at methylated CpG. J. Mol. Evol (2001) 53:456–464.[CrossRef][Web of Science][Medline]
Vowles EJ, Amos W. Evidence for widespread convergent evolution around human microsatellites. PLoS Biol (2004) 2:e199.[CrossRef][Medline]
Webber C, Ponting CP. Hotspots of mutation and breakage in dog and human chromosomes. Genome Res (2005) 15:1787–1797.
Wu J-S, et al. Primer design using genetic algorithm. Bioinformatics (2004) 20:1710–1717.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




