Bioinformatics Advance Access originally published online on September 3, 2004
Bioinformatics 2005 21(3):385-387; doi:10.1093/bioinformatics/bti006
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
SNPbox: a modular software package for large-scale primer design
Department of Molecular Genetics (VIB8), Bioinformatics Unit, Flanders Interuniversity Institute for Biotechnology, University of Antwerp, Universiteitsplein 1, B-2610 Antwerpen, Belgium
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: We developed a modular software package, SNPbox, that automates and standardizes the generation of PCR primers and is used in the strategy for constructing single nucleotide polymorphism (SNP) maps. In this strategy, the focus of primer design can be either on the validation of annotated public SNPs or on SNP discovery in exon regions or extended genomic regions, both by resequencing. SNPbox relies on Primer3 for the primer design and combines this program with other publicly available software tools such as BLAST, Spidey and RepeatMasker, and with newly developed algorithms. Primer conditions were chosen such that PCR amplifications are uniform for each PCR amplicon, facilitating the use of high-throughput genetic platforms. SNPbox can also be used for the design of primer sets for mutation analysis, STR marker genotyping and microarray oligo design. Of the 2500 primer sets designed by SNPbox, 95% successfully amplified genomic DNA under uniform PCR conditions.
Availability: The software is available from the authors upon request.
Contact: jurgen.delfavero{at}ua.ac.be
Supplementary information: SNPbox_supplement.
| INTRODUCTION |
|---|
Single nucleotide polymorphisms (SNPs) are the most frequent DNA sequence variations in the human genome, with an average spacing of 1–2 kb (Cooper et al., 1985; Holden, 2002; Sachidanandam et al., 2001), and are therefore the markers of choice in genetic studies aiming at identifying susceptibility genes for complex diseases (Rafalski, 2002). SNPs can be retrieved from public databases such as dbSNP (Sherry et al., 1999, 2001), HGVbase (Fredman et al., 2002) and JSNP (Hirakawa et al., 2002). However, the majority of SNPs in these databases have not yet been validated as true polymorphisms and/or their polymorphic content still needs to be determined in the population under investigation (Vieux et al., 2002; Marth et al., 2001). As a result, the map density when considering only validated public SNPs is often too low for detailed genetic studies. To efficiently construct high-density SNP maps, a combined strategy of SNP validation and discovery is required. For both steps, high-quality PCR and sequencing primers need to be generated, preferably in a fast, automated process, while carefully taking repeat sequences into account. Furthermore, the use of these primers in high-throughput laboratory environments requires that they amplify DNA under consistent and well-defined conditions.
A number of primer design tools are currently available as web applications or as stand-alone programs (Chen et al., 2003; Haas et al., 1998, 2003; Li et al., 1997; Proutski and Holmes, 1996; Raddatz et al., 2001; Rozen and Skaletsky, 2003). Although the efficiency of these programs is beyond dispute, most of them can design only one primer set at a time and are therefore less useful in large-scale primer design projects.
We present the program SNPbox offering a modular strategy for highly automated and standardized primer design in the construction of high-density SNP maps and mutation analysis based on resequencing of target sequences.
| STRATEGY |
|---|
SNPbox automates the primer design for a number of well-defined genomic sequences, hereafter called objects. These objects are the starting points to define targets, for which Primer3 designs primers within a frame of 70 bp 5' and 3' of the target. The default target length is 450 bp but can be changed if required. If an object is shorter than the optimal target length, it is first extended symmetrically 5' and 3' until the optimal length is reached (Fig. 1A). When the object is >450 bp, multiple overlapping targets are defined. Small, neighboring objects are joined into one or more targets. When SNPbox encounters repeat sequences while selecting suitable target sequences, they can be included in the target depending on their size, nature and distance to the object. Repeats that are included should be <300 bp and belong to the interspersed repeat class. Polymorphic repeats are excluded from a target sequence since they often result in problematic sequence reads. In addition, SNPbox does not design primers within repeat sequences. Several potential scenarios are illustrated in the Supplementary file.
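The symmetric extension of a short object and the 70 bp primer frames can be sketched as follows. This is a minimal illustration, not the SNPbox implementation: the function names, the 0-based half-open coordinates and the behaviour at the ends of the sequence (shifting the window rather than shortening it) are assumptions.

```python
def extend_object(start, end, seq_len, target_len=450):
    """Extend an object symmetrically until it reaches target_len.

    Coordinates are 0-based, half-open (an assumption of this sketch).
    Objects that are already long enough are returned unchanged.
    """
    length = end - start
    if length >= target_len:
        return start, end
    pad = target_len - length
    left = pad // 2           # bases added on the 5' side
    right = pad - left        # bases added on the 3' side
    new_start, new_end = start - left, end + right
    # Assumption: near a sequence end, shift the window instead of truncating it.
    if new_start < 0:
        new_end = min(seq_len, new_end - new_start)
        new_start = 0
    if new_end > seq_len:
        new_start = max(0, new_start - (new_end - seq_len))
        new_end = seq_len
    return new_start, new_end


def primer_frames(target_start, target_end, seq_len, frame=70):
    """Return the 5' and 3' windows in which primers may be placed."""
    five_prime = (max(0, target_start - frame), target_start)
    three_prime = (target_end, min(seq_len, target_end + frame))
    return five_prime, three_prime
```

For a 61 bp object at positions 1000 to 1061, `extend_object(1000, 1061, 10000)` yields a 450 bp target, and `primer_frames` then gives the two 70 bp windows flanking it.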
| IMPLEMENTATION |
|---|
SNPbox holds three modules for automated primer design: an SNP module, an exon module and a saturation module. All sequences can be provided as a FASTA file or as a GenBank gi number. In the latter case, the sequence is downloaded directly from the NCBI using EFetch (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). Primer design is always related to a genomic sequence that, upon first use, is masked for repeats using RepeatMasker (http://repeatmasker.org) and an adapted version of Sputnik (http://www.espressosoftware.com). This adapted version is used to detect microsatellite repeats and single base stretches, and its output is arranged in three classes: repeats with fewer than eight repeat units, repeats with eight or more repeat units, and single base stretches of at least eight identical bases.
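The three output classes can be illustrated with a short sketch. It does not reproduce Sputnik or the adapted version used by SNPbox; the regular expression, the function names and the class labels are hypothetical and serve only to make the thresholds concrete.

```python
import re

# A single base followed by at least seven copies of itself: a run of >= 8.
MONO_RUN = re.compile(r"([ACGT])\1{7,}")


def single_base_stretches(seq):
    """Return (start, end, base) for every run of at least eight identical bases."""
    return [(m.start(), m.end(), m.group(1)) for m in MONO_RUN.finditer(seq.upper())]


def classify_repeat(unit, n_units):
    """Assign a detected repeat to one of the three classes described in the text.

    unit    -- the repeated motif, e.g. "CA"
    n_units -- how many times the motif is repeated
    Returns None for single-base runs shorter than eight bases.
    """
    if len(unit) == 1:
        return "single_base_stretch" if n_units >= 8 else None
    return "repeat_ge_8_units" if n_units >= 8 else "repeat_lt_8_units"
```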
The SNP module allows primer design for the validation of public SNPs. In order to map public SNPs on a given sequence, the BLAST program (Altschul et al., 1997) is used to align the genomic DNA to a database containing the HGVbase SNPs. SNPs are selected only if they fulfill the following criteria: a sequence similarity of ≥95% over a minimum length of 40 bp with a maximum E-value of 1E−10. The positions of the SNPs in the genomic sequence are determined and used to define the object as the region 30 bp upstream and 30 bp downstream of each SNP. Objects found within a region of 300 bp are joined into one object.
In the exon module, coding sequences are identified within a genomic sequence by aligning cDNA and/or expressed sequence tag (EST) sequences using the Spidey program (Wheelan et al., 2001). In the object definition, exons are symmetrically extended by 50 bp on both sides to include the branch point and the splice sites. If an excluded repeat sequence lies near an exon, the exon is extended by only 25 bp or not at all. Objects within a region of 250 bp are joined into one object.
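The SNP selection criteria and the joining of nearby objects can be sketched as below. Several points are assumptions rather than documented SNPbox behaviour: the thresholds are read as at least 95% similarity and an E-value of at most 1E−10, coordinates are 1-based and inclusive, and "within a region of 300 bp" is interpreted as the joined object spanning no more than 300 bp. The `Hit` record and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    snp_id: str
    identity: float  # percent similarity of the alignment
    length: int      # alignment length in bp
    evalue: float
    snp_pos: int     # SNP position on the genomic sequence (1-based)


def passes(hit, min_identity=95.0, min_length=40, max_evalue=1e-10):
    """Apply the three selection criteria to one BLAST hit."""
    return (hit.identity >= min_identity
            and hit.length >= min_length
            and hit.evalue <= max_evalue)


def snp_objects(positions, flank=30, join_within=300):
    """Define one object per SNP (SNP +/- flank) and join nearby objects.

    Assumption: an object is joined to the previous one when the combined
    span would not exceed join_within bp.
    """
    objects = []
    for pos in sorted(positions):
        start, end = max(1, pos - flank), pos + flank
        if objects and end - objects[-1][0] + 1 <= join_within:
            objects[-1][1] = max(objects[-1][1], end)
        else:
            objects.append([start, end])
    return [tuple(o) for o in objects]
```

Under the same assumptions, the exon module's joining rule corresponds to calling the merge step with a 50 bp flank and a 250 bp joining distance.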
In the saturation module, the objects are the parts of the genomic sequence between the excluded repeats and can consist of a regulatory region, the introns of a specified gene, a complete gene or an extended chromosomal region. Targets are defined with a default overlap of 35 bp. Because primers are selected within a frame of 70 bp on either side of each target, the actual overlap between adjacent amplicons is at most 175 bp (35 bp + 2 × 70 bp), including the primers. Since the length of an object is not necessarily a multiple of the optimal target length, SNPbox aims to design targets that approach the optimal target length as closely as possible (Fig. 1B).
The output of SNPbox consists of an HTML page with a graphical representation of the annotated genomic sequence and hyperlinks to a variety of files, allowing easy inspection of the data. A tab-delimited file contains the primer sequences, genomic positions and PCR amplification conditions. In addition, the GC-content is calculated per 50 bp of amplicon and summarized as an average, a minimal and a maximal GC-content.
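One way to tile an object into overlapping targets of near-optimal length, and to summarize GC-content per 50 bp, is sketched below. The tiling rule (choosing the number of targets whose common length lies closest to 450 bp), the non-overlapping GC windows, the handling of a shorter final window and all function names are assumptions; the source does not specify how SNPbox makes these choices.

```python
import math


def tile_object(start, end, target_len=450, overlap=35):
    """Split [start, end) into overlapping targets close to target_len."""
    if overlap >= target_len:
        raise ValueError("overlap must be smaller than target_len")
    length = end - start
    if length <= target_len:
        return [(start, end)]
    # n targets of length t overlapping by `overlap` cover n*t - (n-1)*overlap bp.
    ratio = (length - overlap) / (target_len - overlap)
    candidates = {max(1, math.floor(ratio)), max(1, math.ceil(ratio))}
    n = min(sorted(candidates),
            key=lambda k: abs(overlap + (length - overlap) / k - target_len))
    base, extra = divmod(length + (n - 1) * overlap, n)
    targets, pos = [], start
    for i in range(n):
        t = base + (1 if i < extra else 0)  # spread the rounding remainder
        targets.append((pos, pos + t))
        pos += t - overlap
    return targets


def max_amplicon_overlap(overlap=35, frame=70):
    """Target overlap plus one primer frame from each neighbouring amplicon."""
    return overlap + 2 * frame


def gc_summary(amplicon, window=50):
    """Return (mean, min, max) GC percentage over consecutive windows."""
    seq = amplicon.upper()
    if not seq:
        raise ValueError("empty amplicon")
    values = []
    for i in range(0, len(seq), window):
        chunk = seq[i:i + window]
        values.append(100.0 * (chunk.count("G") + chunk.count("C")) / len(chunk))
    return sum(values) / len(values), min(values), max(values)
```

With the defaults, a 1000 bp object is split into two targets of 518 and 517 bp rather than three of roughly 357 bp, because the former lie closer to the 450 bp optimum.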
SNPbox was successfully used in our laboratory to design primers for about 2500 targets, all on human DNA. In >95% of cases, PCR amplification resulted in one specific amplicon of the expected size using the built-in PCR conditions, without the need for optimization. SNPbox was also used to design primer sets for all exons of the human genome, based on Ensembl data. For the 208,202 exons, 227,187 objects were designated, and for 98.62% of these a target could be defined. SNPbox designed primer pairs for 98.53% of the targets, resulting in a global success rate of 97.17% (Weckx et al., 2004).
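The global success rate is consistent with the product of the two stage-wise rates reported above, as this short check shows (the interpretation as a simple product is an inference from the numbers, not a statement from the authors):

```latex
\[
  \underbrace{0.9862}_{\text{objects with a target}}
  \times
  \underbrace{0.9853}_{\text{targets with a primer pair}}
  \approx 0.9717 = 97.17\%
\]
```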
In conclusion, given that the standardization of the in silico primer design for defined targets produced high success rates for both primer selection and subsequent PCR amplification, the software package SNPbox is a valuable asset for laboratories involved in resequencing projects, particularly those aimed at generating long-distance, high-density SNP maps.
| Acknowledgments |
|---|
We thank Dirk Van den Bossche for technical support, and Dominique Audenaert, Godelieve Claes and Rosa Rademakers for their valuable feedback during the development, fine-tuning and use of SNPbox. This work was in part funded by the Special Research Fund of the University of Antwerp, the Fund for Scientific Research Flanders and the Interuniversity Attraction Poles program P5/19 of the Belgian Federal Science Policy Office.
Received on February 10, 2004; revised on July 5, 2004; accepted on August 26, 2004
| REFERENCES |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Chen, S.H., Lin, C.Y., Cho, C.S., Lo, C.Z., Hsiung, C.A. (2003) Primer Design Assistant (PDA): a web-based primer design tool. Nucleic Acids Res., 31, 3751–3754.
Cooper, D.N., Smith, B.A., Cooke, H.J., Niemann, S., Schmidtke, J. (1985) An estimate of unique DNA sequence heterozygosity in the human genome. Hum. Genet., 69, 201–205.
Fredman, D., Siegfried, M., Yuan, Y.P., Bork, P., Lehvaslaiho, H., Brookes, A.J. (2002) HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res., 30, 387–391.
Haas, S., Vingron, M., Poustka, A., Wiemann, S. (1998) Primer design for large scale sequencing. Nucleic Acids Res., 26, 3006–3012.
Haas, S.A., Hild, M., Wright, A.P., Hain, T., Talibi, D., Vingron, M. (2003) Genome-scale design of PCR primers and long oligomers for DNA microarrays. Nucleic Acids Res., 31, 5576–5581.
Hirakawa, M., Tanaka, T., Hashimoto, Y., Kuroda, M., Takagi, T., Nakamura, Y. (2002) JSNP: a database of common gene variations in the Japanese population. Nucleic Acids Res., 30, 158–162.
Holden, A.L. (2002) The SNP consortium: summary of a private consortium effort to develop an applied map of the human genome. Biotechniques, 26.
Li, P., Kupfer, K.C., Davies, C.J., Burbee, D., Evans, G.A., Garner, H.R. (1997) PRIMO: a primer design program that applies base quality statistics for automated large-scale DNA sequencing. Genomics, 40, 476–485.
Marth, G., Yeh, R., Minton, M., Donaldson, R., Li, Q., Duan, S., Davenport, R., Miller, R.D., Kwok, P.Y. (2001) Single-nucleotide polymorphisms in the public domain: how useful are they? Nat. Genet., 27, 371–372.
Proutski, V. and Holmes, E.C. (1996) Primer Master: a new program for the design and analysis of PCR primers. Comput. Appl. Biosci., 12, 253–255.
Raddatz, G., Dehio, M., Meyer, T.F., Dehio, C. (2001) PrimeArray: genome-scale primer design for DNA-microarray construction. Bioinformatics, 17, 98–99.
Rafalski, A. (2002) Applications of single nucleotide polymorphisms in crop genetics. Curr. Opin. Plant Biol., 5, 94–100.
Rozen, S. and Skaletsky, H.J. (2003) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol., 132, 365–386.
Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L., et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.
Sherry, S.T., Ward, M., Sirotkin, K. (1999) dbSNP – database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res., 9, 677–679.
Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., Sirotkin, K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.
Vieux, E.F., Kwok, P.Y., Miller, R.D. (2002) Primer design for PCR and sequencing in high-throughput analysis of SNPs. Biotechniques, 32.
Weckx, S., De Rijk, P., Van Broeckhoven, C., Del Favero, J. (2004) SNPbox: web-based high-throughput primer design from gene to genome. Nucleic Acids Res., 32, W170–W172.
Wheelan, S.J., Church, D.M., Ostell, J.M. (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Res., 11, 1952–1957.