Bioinformatics Advance Access originally published online on February 3, 2005
Bioinformatics 2005 21(9):2083-2084; doi:10.1093/bioinformatics/bti176
SNPsFinder: a web-based application for genome-wide discovery of single nucleotide polymorphisms in microbial genomes
1Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
2Counterterrorism and Forensic Science Research Unit, FBI Academy, Quantico, VA 22135, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation in closely related microbial species, strains or isolates. Some SNPs confer selective advantages on microbial pathogens during infection, and many others are powerful genetic markers for distinguishing closely related strains or isolates that could not otherwise be told apart. To facilitate SNP discovery in microbial genomes, we have developed a web-based application, SNPsFinder, for genome-wide identification of SNPs. SNPsFinder takes multiple genome sequences as input and identifies SNPs within homologous regions. It can also take contig sequences and sequence quality scores from ongoing sequencing projects for SNP prediction. When genome sequence annotation is available, SNPsFinder maps the predicted SNP regions to known genes or regions to assist further evaluation of the functional significance of the predicted SNPs. SNPsFinder can generate PCR primers for all predicted SNP regions according to user-supplied parameters to facilitate experimental validation. The results of a SNPsFinder analysis are accessible through the World Wide Web.
Availability: The SNPsFinder program is available at http://snpsfinder.lanl.gov/.
Contact: murray{at}lanl.gov
Supplementary information: The user's manual is available at http://snpsfinder.lanl.gov/UsersManual/
| INTRODUCTION |
|---|
Although single nucleotide polymorphism (SNP) discovery has attracted much attention in human genome research, there are still relatively few studies on SNPs in microbial genomes. Most efforts have so far focused on identifying unique genes or pathways in different microbial organisms through comparative genomic analysis, but the importance of SNPs in microbial genomics is being recognized. SNPs are the most abundant form of genetic variation in closely related genomes, and the study of SNPs will undoubtedly offer new insights into many evolutionary processes, including host-pathogen interaction (Blaser and Musser, 2001). In bacterial pathogens, a variety of SNPs have been discovered that confer a selective advantage during the course of a single infection, epidemic spread or long-term evolution of virulence (Ramaswamy et al., 2003; Sokurenko et al., 1999). SNPs contribute to the ability of pathogens to cause disease (Boddicker et al., 2002; Weissman et al., 2003). As genetic markers, SNPs have been used to resolve closely related microbial species and strains (Gutacker et al., 2002) and to separate clinical samples collected during a disease outbreak, which facilitates investigations of infectious disease outbreaks (Cleland et al., 2004; Read et al., 2002).
The major limiting factor for SNP discovery in microbial genomes has been the availability of genome sequences. However, with high-throughput microbial genome sequencing projects under way worldwide, many closely related species and strains have recently been sequenced and many more are currently being sequenced. These sequence resources provide an unprecedented opportunity for genome-wide SNP analysis. Here we report the development of SNPsFinder, a web-based application for genome-wide SNP discovery in microbial genomes. It takes multiple genomes as input and performs genome-wide SNP analysis. Using this application, we have successfully identified many useful SNPs that can be used as molecular signatures for clinical diagnostics and infectious disease surveillance. These newly discovered SNPs have helped us to better understand variation in both genotype and phenotype within many important pathogenic species.
| ALGORITHM AND IMPLEMENTATION |
|---|
Our goal is to develop a fully automated, genome-wide SNP discovery program. To achieve this, we have developed an integrated algorithmic solution for the following five major tasks: (1) identifying all of the homologous regions among the multiple genomes being compared using MegaBlast (Zhang et al., 2000); (2) eliminating paralogous sequences from consideration to reduce false-positive SNP identification; (3) generating multiple sequence alignments and detecting SNPs; (4) taking into account the quality of the sequences as well as the locations of the predicted SNPs to assist further evaluation of the predicted SNPs; and (5) picking PCR primers for each predicted SNP region using Primer3 (Rozen and Skaletsky, 2000) to facilitate experimental validation. The major steps performed by SNPsFinder are summarized in Figure 1, and a detailed description of the algorithm and its implementation is available in the Supplementary information (User's Manual).
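Task (3) above, detecting SNPs once a multiple sequence alignment is available, can be illustrated with a minimal Python sketch. This is not the SNPsFinder implementation, which is described in the User's Manual; the function name `find_snps`, the sequence names and the rule of skipping columns that contain gaps or ambiguous bases are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def find_snps(alignment: Dict[str, str]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (column index, {name: base}) for columns whose bases differ.

    Assumption for this sketch: columns containing a gap ('-') or any
    character other than A, C, G, T are skipped, so only single-base
    substitutions are reported.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")

    snps = []
    for col in range(length):
        column = {name: alignment[name][col].upper() for name in names}
        bases = set(column.values())
        # A SNP column has only unambiguous bases and more than one of them.
        if bases <= set("ACGT") and len(bases) > 1:
            snps.append((col, column))
    return snps


# Hypothetical three-genome alignment of one homologous region.
aligned = {
    "anchor":  "ATGCCGTA-GCT",
    "strainB": "ATGCTGTAAGCT",
    "strainC": "ATGCTGTA-GCA",
}
snp_columns = [col for col, _ in find_snps(aligned)]
print(snp_columns)  # prints [4, 11]
```

Column 8 differs among the three sequences only because of a gap, so under the stated assumption it is treated as an insertion or deletion rather than a SNP.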
| INPUT DATA |
|---|
SNPsFinder allows users to upload their genome sequences and other related data (genome annotation, sequence quality scores) from local files. The genome sequences can be either complete sequences or contig sequences. Users should choose a high-quality sequence, preferably one that has been annotated, as the anchor sequence, because the anchor sequence is used to map the predicted SNPs onto annotated genes or DNA regions to facilitate further evaluation of the predicted SNPs (Fig. 1). When contig sequences are used, SNPsFinder allows the corresponding sequence quality scores to be included, which reduces false-positive SNP predictions caused by sequencing errors. Users are required to choose a percentage of sequence identity as the cutoff that determines which homologous sequences will be identified and compared for SNP identification. Users are also required to choose a desired amplicon length (the length of the homologous regions) into which the anchor sequence will be fragmented. To facilitate experimental validation of the predicted SNPs, SNPsFinder also allows users to provide the parameters according to which PCR primers will be picked for each predicted SNP region. Finally, users are required to enter an email address to which a notification will be sent upon completion of the SNP analysis. A web link in the email gives online access to the output of SNPsFinder.
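Two of the inputs described above, the amplicon length and the sequence identity cutoff, can be illustrated with a short Python sketch. The function names are hypothetical, and the choices made here (consecutive non-overlapping fragments, gaps counted as mismatches) are assumptions for illustration; the source does not specify how SNPsFinder fragments the anchor or computes identity.

```python
from typing import List, Tuple


def fragment_anchor(anchor: str, amplicon_length: int) -> List[Tuple[int, str]]:
    """Split the anchor into consecutive, non-overlapping fragments.

    Returns (start coordinate, fragment) pairs; the last fragment may be
    shorter than amplicon_length. Non-overlapping windows are an assumption.
    """
    if amplicon_length <= 0:
        raise ValueError("amplicon_length must be positive")
    return [
        (start, anchor[start:start + amplicon_length])
        for start in range(0, len(anchor), amplicon_length)
    ]


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions with identical bases.

    Both strings must already be aligned to the same length; a gap ('-')
    never counts as a match in this sketch.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def passes_cutoff(a: str, b: str, cutoff: float) -> bool:
    """True if the aligned pair meets the user-chosen identity cutoff."""
    return percent_identity(a, b) >= cutoff


fragments = fragment_anchor("ATGCCGTAGCTA", 5)
identity = percent_identity("ATGCC", "ATGCT")
```

With these toy values, the 12-base anchor yields fragments starting at coordinates 0, 5 and 10, and a pair that is 80% identical would be excluded from comparison at a 90% cutoff.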
| OUTPUT OF SNPsFinder |
|---|
The SNP regions identified by SNPsFinder are presented in a table. By default they are sorted by the number of SNPs found in each region, but they can also be sorted by genomic coordinates. The hyperlink for each predicted SNP region allows users to view the corresponding multiple sequence alignment. Information on the gene that overlaps the predicted SNP region is also provided, and the gene IDs are linked to the GenBank records for further gene annotation. In addition to the SNPs, the numbers of insertions and deletions (InDels) found within the predicted SNP regions are listed. Because InDels within gene-coding regions often result in frameshifts and gene inactivation, identifying them facilitates functional genomic analysis.
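The two sort orders of the output table, and the per-region SNP and InDel counts, can be sketched in Python as follows. The `SnpRegion` record and `summarize_region` function are hypothetical names, the input is assumed to be uppercase aligned sequences, and counting each run of consecutive gap-containing columns as one InDel is an assumption of this sketch rather than the documented behavior of SNPsFinder.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SnpRegion:
    region_id: str
    start: int        # genomic coordinate on the anchor sequence
    snp_count: int
    indel_count: int


def summarize_region(region_id: str, start: int,
                     aligned_seqs: Sequence[str]) -> SnpRegion:
    """Count SNP columns and InDel events in one aligned region."""
    snps = 0
    indels = 0
    in_gap_run = False
    for column in zip(*aligned_seqs):
        bases = set(column)
        if "-" in bases:
            # A run of adjacent gap columns is counted as a single InDel.
            if not in_gap_run:
                indels += 1
            in_gap_run = True
            continue
        in_gap_run = False
        if len(bases) > 1:
            snps += 1
    return SnpRegion(region_id, start, snps, indels)


regions: List[SnpRegion] = [
    summarize_region("R1", 0, ["ATGCCGTA", "ATGCTGTA"]),
    summarize_region("R2", 500, ["AC--GTTA", "ACGGGTCA"]),
    summarize_region("R3", 250, ["ATGCAGTT", "TTGCCGTA"]),
]

# Most SNPs first, ties broken by coordinate.
by_snp_count = sorted(regions, key=lambda r: (-r.snp_count, r.start))
# Alternative view: sorted by genomic coordinate.
by_coordinate = sorted(regions, key=lambda r: r.start)
```

Here `by_snp_count` lists R3 first because it carries three SNPs, whereas `by_coordinate` follows the position of each region on the anchor.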
SNPsFinder generates a pair of primers for each predicted SNP region according to the parameters provided by the user. The user can view the primer information for a set of selected SNP regions or for all of the predicted SNP regions. The primer data include the SNP region ID, primer sequences, melting temperatures, primer lengths and the expected amplicon length. The SNP region IDs are linked to the multiple sequence alignment, in which the locations of the primer pair are labeled.
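The primer fields listed above map naturally onto a small record, sketched here in Python. The `PrimerPair` class and its field names are hypothetical, and the melting temperature is estimated with the simple Wallace rule (2 °C per A or T, 4 °C per G or C), which is swapped in purely for illustration; SNPsFinder obtains its primers and their properties from Primer3, and the values it reports will generally differ from this rule of thumb.

```python
from dataclasses import dataclass
from typing import Dict, Union


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature in degrees C (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


@dataclass
class PrimerPair:
    region_id: str
    forward: str
    reverse: str
    amplicon_length: int

    def report_row(self) -> Dict[str, Union[str, int]]:
        """One row of primer output: ID, sequences, Tm, lengths, amplicon."""
        return {
            "region_id": self.region_id,
            "forward": self.forward,
            "forward_length": len(self.forward),
            "forward_tm": wallace_tm(self.forward),
            "reverse": self.reverse,
            "reverse_length": len(self.reverse),
            "reverse_tm": wallace_tm(self.reverse),
            "amplicon_length": self.amplicon_length,
        }


# Hypothetical primer pair for an illustrative region ID.
pair = PrimerPair("R3", "ATGCCGTAGCTAGGCTAAGC", "GGCTTAACCGTTAGCATCGA", 250)
row = pair.report_row()
```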
| Acknowledgments |
|---|
This research was supported in part by the DOE/DHS Chemical Biological National Security Program (CBNP) and by the FBI. We thank Electra Sutton for help in creating the SNPsFinder logo.
Received on October 5, 2004; revised on November 22, 2004; accepted on November 22, 2004
| REFERENCES |
|---|
Blaser, M.J. and Musser, J.M. (2001) Bacterial polymorphisms and disease in humans. J. Clin. Invest., 107, 391-392.
Boddicker, J.D., et al. (2002) Differential binding to and biofilm formation on, HEp-2 cells by Salmonella enterica serovar Typhimurium is dependent upon allelic variation in the fimH gene of the fim gene cluster. Mol. Microbiol., 45, 1255-1265.
Cleland, C.A., et al. (2004) Development of rationally designed nucleic acid signatures for microbial pathogens. Expert Rev. Mol. Diagn., 4, 303-315.
Gutacker, M.M., et al. (2002) Genome-wide analysis of synonymous single nucleotide polymorphisms in Mycobacterium tuberculosis complex organisms: resolution of genetic relationships among closely related microbial strains. Genetics, 162, 1533-1543.
Ramaswamy, S.V., et al. (2003) Single nucleotide polymorphisms in genes associated with isoniazid resistance in Mycobacterium tuberculosis. Antimicrob. Agents Chemother., 47, 1241-1250.
Read, T.D., et al. (2002) Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis. Science, 296, 2028-2033.
Rozen, S. and Skaletsky, H. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol., 132, 365-386.
Sokurenko, E.V., et al. (1999) Pathoadaptive mutations: gene loss and variation in bacterial pathogens. Trends Microbiol., 7, 191-195.
Weissman, S.J., et al. (2003) Enterobacterial adhesins and the case for studying SNPs in bacteria. Trends Microbiol., 11, 115-117.
Zhang, Z., et al. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203-214.
