Bioinformatics Advance Access originally published online on June 16, 2005
Bioinformatics 2005 21(16):3424-3426; doi:10.1093/bioinformatics/bti547
DNA-BAR: distinguisher selection for DNA barcoding

ndoiu 2,*
1Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607-7053, USA
2Computer Science and Engineering Department, University of Connecticut, 371 Fairfield Road, Unit 2155, Storrs, CT 06269-2155, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: DNA-BAR is a software package for selecting DNA probes (henceforth referred to as distinguishers) that can be used in genomic-based identification of microorganisms. Given the genomic sequences of the microorganisms, DNA-BAR finds a near-minimum number of distinguishers yielding a distinct hybridization pattern for each microorganism. Selected distinguishers satisfy user-specified bounds on length, melting temperature and GC content, as well as redundancy and cross-hybridization constraints.
Availability: DNA-BAR can be used online through the web interface provided at http://dna.engr.uconn.edu/~software/DNA-BAR/. The open source C code, released under the GNU General Public License, is also available at the above address.
Contact: ion{at}engr.uconn.edu
| INTRODUCTION |
|---|
String barcoding is a recently introduced technique for genomic-based identification of microorganisms, such as viruses or bacteria, from among a set of previously sequenced microorganisms. Applications of this technique range from rapid pathogen identification in epidemic outbreaks to point-of-care medical diagnosis to monitoring of microbial communities in environmental studies (Borneman et al., 2001; Rash and Gusfield, 2002; and references therein). Microorganism identification can be performed by spotting or synthesizing on a microarray the Watson-Crick complements of the distinguisher strings, and then hybridizing to the array the fluorescently labeled DNA extracted from the unknown microorganism. Under the assumption of perfect hybridization stringency, the hybridization pattern can be viewed as a string of zeros and ones, referred to as the barcode of the microorganism. For unambiguous identification, distinguishers must be selected such that each microorganism has a distinct barcode.
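The barcode idea can be illustrated with a minimal sketch. It assumes perfect stringency, modeled here as exact substring occurrence on a single strand; the function names and the toy sequences are hypothetical and are not taken from DNA-BAR.

```python
def barcode(genome: str, distinguishers: list[str]) -> str:
    """Return the 0/1 hybridization pattern of one genome.

    Assumption: a distinguisher "hybridizes" exactly when it occurs
    as a substring of the genome (perfect stringency, one strand).
    """
    return "".join("1" if d in genome else "0" for d in distinguishers)


def barcodes_are_distinct(genomes: dict[str, str], distinguishers: list[str]) -> bool:
    """True if every genome receives a different barcode."""
    codes = [barcode(seq, distinguishers) for seq in genomes.values()]
    return len(set(codes)) == len(codes)


# Toy data, made up for illustration only.
genomes = {"g1": "ACGTACGT", "g2": "ACGTTTTT", "g3": "GGGGACCA"}
distinguishers = ["ACGT", "TTT"]
codes = {name: barcode(seq, distinguishers) for name, seq in genomes.items()}
print(codes)
```

In this toy case two distinguishers suffice for three sequences, whereas the single distinguisher "ACGT" would give g1 and g2 the same barcode.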
Since it is difficult to ensure perfect hybridization stringency with current microarray technologies, a method for improving identification robustness is to use redundant distinguishability, e.g. to require that every two barcodes differ in at least r positions, where r is a given integer. Further improvements in identification robustness can be obtained by using a multi-step assay similar to those used for single nucleotide polymorphism genotyping (Hirschhorn et al., 2000). First, primers complementing selected distinguishers are hybridized in solution with unlabeled DNA extracted from the unknown microorganism. Then, primer hybridizations are registered via a single-base extension reaction using the polymerase enzyme and fluorescently labeled dideoxynucleotides. Formed duplexes are separated by heating, and the resulting mixture is hybridized to a microarray containing the distinguishers. Finally, microarray fluorescence levels are used to learn the identity of extended primers and thus determine the barcode of the microorganism. The increased reliability of this multi-step assay comes from two sources. First, solution-based reactions are better understood and much easier to optimize compared to solid-phase hybridization. Second, the relevant oligonucleotides involved in the solid-phase hybridization step have much lower complexity compared to the whole genome of the microorganism, and are fully under the assay designer's control.
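The redundancy requirement amounts to a lower bound on the pairwise Hamming distance between barcodes. The following is a minimal sketch of that check; the helper names and the example barcodes are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance over all pairs of barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def is_r_redundant(barcodes: list[str], r: int) -> bool:
    """True if every two barcodes differ in at least r positions."""
    return min_pairwise_distance(barcodes) >= r


# Made-up barcodes for three microorganisms and four distinguishers.
example = ["1100", "0011", "1010"]
print(min_pairwise_distance(example))
```

With r = 1 the check reduces to requiring that all barcodes be distinct.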
DNA-BAR is a tool for selecting sets of distinguishers to be used in this type of identification assay. The tool accepts as input genomic sequences, possibly containing degenerate bases, given either in FASTA format (http://ngfnblast.gbf.de/docs/fasta.html) or interactively entered by the user. Subject to the given barcode redundancy requirements, the tool attempts to minimize the number of distinguishers, since this reduces assay cost and enables higher effective primer concentration in the solution-based assay steps. The tool enforces user-specified lower and upper bounds on distinguisher length, melting temperature and GC content. The tool also enforces cross-hybridization constraints between extended primers and non-complementary distinguishers on the microarray, using a hybridization model based on nucleation complex theory (Ben-Dor et al., 2000). According to this model, hybridization between two oligonucleotides can take place only if one contains as substring the reverse Watson-Crick complement of a substring of weight >c of the other, where c is a given constant. The weight of a string is the number of weak bases (A and T) plus twice the number of strong bases (G and C).
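The weight function and the nucleation-based hybridization test can be sketched as follows. This is a brute-force illustration, not DNA-BAR's implementation: it assumes sequences over A, C, G and T only (no degenerate bases), uses the strict inequality "weight > c" as stated above, and all function names are hypothetical. Melting temperature is omitted because the text does not state which formula is used.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def weight(s: str) -> int:
    """Weak bases (A, T) count 1, strong bases (G, C) count 2."""
    return sum(1 if base in "AT" else 2 for base in s)


def reverse_complement(s: str) -> str:
    """Reverse Watson-Crick complement of a sequence over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(s))


def gc_content(s: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in s) / len(s)


def may_hybridize(x: str, y: str, c: int) -> bool:
    """True if x contains the reverse complement of a substring of y
    whose weight exceeds c.

    The substrings of reverse_complement(y) are exactly the reverse
    complements of the substrings of y, and complementing preserves
    weight, so it is enough to scan reverse_complement(y). The test is
    symmetric in x and y for the same reason.
    """
    rc_y = reverse_complement(y)
    for i in range(len(rc_y)):
        for j in range(i + 1, len(rc_y) + 1):
            sub = rc_y[i:j]
            if weight(sub) > c and sub in x:
                return True
    return False


print(weight("ACGT"), reverse_complement("AACG"), gc_content("ACGT"))
print(may_hybridize("AAGGCC", "GGCCTT", 5))
```

The cubic-time scan is only meant to make the definition concrete; a scalable tool would need something faster.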
| ALGORITHM AND IMPLEMENTATION |
|---|
We use a simple greedy distinguisher selection strategy: in every iteration we pick a substring that distinguishes the largest number of not-yet-distinguished pairs of genomic sequences. After selecting a distinguisher d, we discard all candidates that have in common with d a substring of weight >c. To achieve high scalability, we use an incremental algorithm for quickly generating a representative set of candidate distinguishers and collecting all their occurrences in the given genomic sequences, and employ a lazy strategy for updating coverage gains in the greedy selection phase of the algorithm. Full implementation details can be found in DasGupta et al. (in press).
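A simplified sketch of the greedy selection loop is given here for redundancy r = 1. It is not the DNA-BAR implementation: candidates are enumerated naively as all substrings in a length range, gains are recomputed in every iteration rather than updated lazily, and the common-substring weight filter, melting temperature and GC bounds are left out. All names and the toy sequences are hypothetical.

```python
from itertools import combinations


def candidate_substrings(genomes: list[str], min_len: int, max_len: int) -> list[str]:
    """All substrings with length in [min_len, max_len], sorted so that
    ties in the greedy step are broken deterministically."""
    cands = set()
    for g in genomes:
        for length in range(min_len, max_len + 1):
            for i in range(len(g) - length + 1):
                cands.add(g[i:i + length])
    return sorted(cands)


def pairs_split_by(candidate: str, genomes: list[str]) -> set[tuple[int, int]]:
    """Pairs (i, j) of genomes such that exactly one contains the candidate."""
    present = [candidate in g for g in genomes]
    return {(i, j) for i, j in combinations(range(len(genomes)), 2)
            if present[i] != present[j]}


def greedy_select(genomes: list[str], min_len: int, max_len: int):
    """Repeatedly pick the candidate that distinguishes the most
    not-yet-distinguished pairs. Returns (chosen, undistinguished_pairs)."""
    uncovered = set(combinations(range(len(genomes)), 2))
    splits = {c: pairs_split_by(c, genomes)
              for c in candidate_substrings(genomes, min_len, max_len)}
    chosen = []
    while uncovered:
        best = max(splits, key=lambda c: len(splits[c] & uncovered), default=None)
        if best is None or not (splits[best] & uncovered):
            break  # the remaining pairs cannot be distinguished
        chosen.append(best)
        uncovered -= splits[best]
    return chosen, uncovered


# Toy data, made up for illustration only.
genomes = ["ACGTACGT", "ACGTTTTT", "GGGGACCA", "GGGGTTTA"]
chosen, left = greedy_select(genomes, 3, 3)
print(chosen, left)
```

On this toy input the loop stops after two picks with no undistinguished pairs left, which matches the lower bound of two distinguishers for four sequences.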
| RESULTS AND DISCUSSION |
|---|
The results of a comprehensive set of experiments on both randomly generated and genomic datasets are reported in DasGupta et al. (in press). Figure 1 gives the distinguishers selected by running DNA-BAR on a set of 20 microbial genomic sequences extracted from NCBI databases (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html) with redundancy requirement r = 1, distinguisher melting temperature range of 55-60°C, GC content range of 40-60% and maximum common substring weight bound of 5. Table 1 gives the number of distinguishers obtained for the 20 microbial genomes using the same melting temperature and GC content bounds, redundancy varying between 1 and 20, and maximum common substring weight bound varying between 5 and 12. For comparison, we include in the table the number of DNA fingerprints, i.e. DNA substrings each appearing in a unique target sequence, required to achieve the same identification redundancy. DNA fingerprints are commonly used in genomic-based identification [e.g. in the recent study of North American birds (Hebert et al., 2004)]. The results in Table 1 show that the number of non-unique distinguishers selected by DNA-BAR can be significantly smaller than the corresponding number of fingerprints [up to 4 times smaller for the 20 microbial genomes in our experiment; experiments on simulated data suggest much higher reductions for larger numbers of sequences (DasGupta et al., in press)]. The reduced number of DNA-BAR distinguishers leads to lower assay cost and, most importantly, makes it possible to enforce more stringent cross-hybridization constraints compared to the fingerprint approach. In future work we plan to experimentally validate our methods and extend them to the problem of simultaneously identifying a small number of microorganisms that may be present in the sample (Klau et al., 2004).
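To make the notion of a fingerprint concrete, the sketch below lists, for each sequence, the length-k substrings that occur in that sequence and in no other. It is an illustration only, with hypothetical names and made-up toy sequences; it does not reproduce the fingerprint counts of Table 1.

```python
from collections import defaultdict


def fingerprints(genomes: dict[str, str], k: int) -> dict[str, list[str]]:
    """Map each genome name to the sorted k-mers found in it and in no
    other genome. A genome without such k-mers maps to an empty list."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    unique = defaultdict(list)
    for kmer, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].append(kmer)
    return {name: sorted(unique[name]) for name in genomes}


# Toy data, made up for illustration only.
genomes = {"g1": "ACGTACGT", "g2": "ACGTTTTT", "g3": "GGGGACCA", "g4": "GGGGTTTA"}
fp = fingerprints(genomes, 3)
print(fp)
```

In this toy example g2 has no fingerprint of length 3, because each of its 3-mers also occurs in g1 or g4, yet non-unique substrings such as "ACG" and "GTT" still separate it from the other sequences.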
| Acknowledgments |
|---|
B.D.G. was supported in part by NSF grants CCR-0206795, CCR-0208749 and NSF CAREER grant IIS-0346973. K.M.K. and A.A.S. were supported in part by NSF ITR grant 0121277. I.I.M. was supported in part by a large grant from the University of Connecticut's Research Foundation.
Conflict of Interest: none declared.
| Footnotes |
|---|
Authors are listed in alphabetical order.
Received on May 2, 2005; revised on June 3, 2005; accepted on June 15, 2005
| REFERENCES |
|---|
Ben-Dor, A., et al. (2000) Universal DNA tag systems: a combinatorial design scheme. J. Comput. Biol., 7, 503-519.
Borneman, J., et al. (2001) Probe selection algorithms with applications in the analysis of microbial communities. Bioinformatics, 1, 19.
DasGupta, B., et al. (2005) Highly scalable algorithms for robust string barcoding. Intl. J. of Bioinformatics Research and Applications, 1, 2.
Hebert, P.D.N., et al. (2004) Identification of birds through DNA barcodes. Public Library of Science Biology, 2, 1657-1663.
Hirschhorn, J.N., et al. (2000) SBE-TAGS: an array-based method for efficient single-nucleotide polymorphism genotyping. Proc. Natl. Acad. Sci. USA, 97, 12164-12169.
Klau, G.W., et al. (2004) Optimal robust non-unique probe selection using integer linear programming. Bioinformatics, 20, Suppl. 1, i186-i193.
Rash, S. and Gusfield, D. (2002) String barcoding: uncovering optimal virus signatures. Proceedings of the 6th Annual International Conference on Computational Biology, pp. 254-261.
