Bioinformatics Advance Access originally published online on September 22, 2005
Bioinformatics 2005 21(22):4181-4186; doi:10.1093/bioinformatics/bti682
SNPselector: a web tool for selecting SNPs for genetic association studies
The Duke Center for Human Genetics, Duke University Medical Center Durham, NC 27710, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Single nucleotide polymorphisms (SNPs) are commonly used in association studies to find genes responsible for complex genetic diseases. With recent advances in SNP technology, researchers can assay thousands of SNPs in a single experiment, but manually choosing thousands of SNPs for genotyping across tens or hundreds of genes is time consuming. We have developed a web-based program, SNPselector, to automate the process. SNPselector takes a list of gene names or a list of genomic regions as input and searches the Ensembl genes or genomic regions for available SNPs. It prioritizes these SNPs by their tagging for linkage disequilibrium, SNP allele frequencies and source, function, regulatory potential and repeat status. SNPselector outputs the results as compressed Excel spreadsheet files for review by the user.
Availability: SNPselector is freely available at http://primer.duhs.duke.edu/
Contact: mike.hauser{at}duke.edu
| INTRODUCTION |
|---|
The single nucleotide polymorphism (SNP) is the most common form of polymorphism in the human genome. A variety of genotyping platforms are available for high-throughput assay of SNPs. SNPs are widely used in human evolution research (Hammer et al., 2001; Underhill et al., 2000), association studies of complex diseases (Colomb et al., 2001; Martin et al., 2001) and pharmacogenetic studies (Goldstein et al., 2003).
The amount of SNP data in public databases is increasing dramatically. The number of unique human SNPs in the current dbSNP release (build 123) is >10 million, approaching the theoretically expected number of SNPs in the human genome (Kruglyak and Nickerson, 2001). The SNP detection method varies, as does the reliability of the SNPs in dbSNP. Only 50% of the SNPs are validated and <20% of the validated SNPs have allele frequency information. In addition to the validation and allele frequency information, it is also important to select SNPs based on their genomic location and their proposed functional significance (coding, intronic, promoter, etc.). These annotation data are available in various resources, such as NCBI dbSNP (http://www.ncbi.nlm.nih.gov/SNP), UCSC Genome Browser (Karolchik et al., 2003) and Ensembl (Birney et al., 2004). They also provide data mining features to retrieve SNP information. There are several other bioinformatics tools developed to select SNPs based on various properties. PromoLign (Zhao et al., 2004) and PupaSNP Finder (Conde et al., 2004) are two web tools to find SNPs that may affect gene transcription levels. SNPper (Riva and Kohane, 2002) provides a web interface to retrieve SNP annotation by chromosome region or SNP names and to refine SNP selection with different filters (such as validation status or minor allele frequency).
Here we describe a new SNP selection program that combines many of these attributes. It has an easy-to-use web interface that provides a feature-rich result spreadsheet. It incorporates LD calculations into SNP selection to help reduce the number of SNPs required for a comprehensive analysis. Further, the output can be tailored to provide the required fields for commercial genotyping systems, such as the Illumina bead-based genotyping platform.
| IMPLEMENTATION |
|---|
SNPselector is implemented in object-oriented Perl to search and analyze SNP data. Its core module can be run as a UNIX command-line application. For ease of use, a CGI wrapper was developed to provide the web interface between users and the application.
To improve performance, all SNP data and related genome annotation data are stored in a local MySQL database (http://www.mysql.com/). SNP data, including SNP location, alleles, function and validation information, were downloaded from the UCSC Genome Browser server. The two 100 bp flanking sequences of each SNP were then extracted from the human genome (NCBI build 35) and added to the SNP table. SNP allele frequency and genotyping data were downloaded from the HapMap project (http://www.hapmap.org/), the SNP Consortium (http://snp.cshl.org/), JSNP (http://snp.ims.u-tokyo.ac.jp/), Affymetrix (http://www.affymetrix.com/) and Perlegen (Hinds et al., 2005). SNPs with experimentally verified genotyping or allele frequency information are considered high-quality SNPs. Ensembl gene structure information was obtained from the Ensembl project. Conserved region information was downloaded from the UCSC Genome Browser multi-genome alignments (Blanchette et al., 2004). CpG island, transcription factor binding site (TFBS), microRNA and simple repeat data were also downloaded from UCSC and stored in the local MySQL database.
The local database is updated whenever new public data are released.
| PROGRAM WORKFLOW |
|---|
SNPselector takes a list of gene names or genomic regions as input and finds all available SNPs in those genes or regions. It finds tagging SNPs by calculating LD bins of genotyped SNPs (Carlson et al., 2004). It then assigns SNP function based on whether the SNP may affect the gene transcript structure or the protein product. It assesses the regulatory potential of the SNP based on its location at a conserved site from multi-genome comparison, a conserved TFBS, a CpG island or a microRNA gene. It also checks whether the SNP is in a repeat region. It scores and sorts SNPs by their LD tagging property, quality, function, regulatory potential and repeat status. Finally, it exports the SNP selection results to Excel files (Fig. 1).
SNP search
SNPselector provides the user with four types of SNP search:
- dbSNP accession ID (rs number).
- Gene names: Ensembl genes and their chromosomal locations are obtained. For each Ensembl gene, SNPselector searches all SNPs in the corresponding chromosomal region (plus the flanking sequence regions defined in the user input).
- Genomic regions (gene centric): SNPselector finds all Ensembl genes and their chromosomal locations in the genomic region. For each Ensembl gene in the region, SNPselector searches all SNPs within the gene.
- Genomic regions: The program breaks each region into smaller 2 Mb regions if necessary. It then searches all SNPs in each of the 2 Mb regions.
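The region-splitting step in the last search type can be illustrated with a short sketch. This is not the SNPselector source (which is written in Perl); the function name `split_region` and the 1-based inclusive coordinate convention are assumptions made for illustration. Only the 2 Mb window size comes from the text.

```python
from typing import Iterator, Tuple

WINDOW_BP = 2_000_000  # 2 Mb, the window size stated in the text


def split_region(chrom: str, start: int, end: int,
                 window: int = WINDOW_BP) -> Iterator[Tuple[str, int, int]]:
    """Yield consecutive sub-regions of at most `window` bases.

    Coordinates are assumed (for this sketch) to be 1-based and inclusive.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    pos = start
    while pos <= end:
        sub_end = min(pos + window - 1, end)  # clip the last window to the region end
        yield chrom, pos, sub_end
        pos = sub_end + 1


# A 4.5 Mb region becomes two full windows and one partial window.
windows = list(split_region("chr1", 1, 4_500_000))
```

Each sub-region can then be queried independently, which keeps individual database queries bounded in size.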
SNP retrieval
After searching by one of these methods, SNPselector retrieves SNP information from the database by SNP rs number. The information includes SNP allele, allele frequency, chromosomal location, validation status, quality, predicted function and flanking sequence. When obtaining the SNP flanking sequence, SNPselector annotates any neighboring SNPs within 100 bases of the target SNP using IUPAC codes. This helps ensure that no assay is designed over a neighboring SNP, which can cause many SNP genotyping assays to fail. SNPselector also compares the SNP location with the Ensembl transcript structure to determine whether the SNP is intronic, exonic or intergenic and, if it is not intergenic, records the exon or intron in which it is located.
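The IUPAC annotation of neighboring SNPs might look roughly like the following sketch. The function `annotate_flank`, its argument layout and the `A/G` allele string format are hypothetical; only the IUPAC ambiguity codes themselves are standard. Alleles that are not plain nucleotide substitutions (for example indels written with `-`) are left unchanged here, which is a simplification.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def annotate_flank(flank: str, flank_start: int, neighbors: dict) -> str:
    """Replace bases at neighboring SNP positions with IUPAC codes.

    flank       -- flanking sequence of the target SNP
    flank_start -- genomic coordinate of flank[0] (hypothetical convention)
    neighbors   -- {genomic position: alleles}, alleles written like "A/G"
    """
    seq = list(flank)
    for pos, alleles in neighbors.items():
        idx = pos - flank_start
        if not 0 <= idx < len(seq):
            continue  # neighbor lies outside this flank
        code = IUPAC.get(frozenset(alleles.upper().replace("/", "")))
        if code is not None:  # skip indels and other non-substitution alleles
            seq[idx] = code
    return "".join(seq)


masked = annotate_flank("ACGTACGT", 100, {102: "G/A", 105: "C/T", 200: "A/C"})
```

An assay design tool reading the masked sequence can then avoid placing a primer or probe over the ambiguous positions.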
For SNPs that are queried by gene or genomic region, SNPselector calculates LD bins using the HapMap genotyping data and the ldSelect program from the University of Washington (Carlson et al., 2004). This helps to select the most informative SNPs and to avoid genotyping redundant SNPs. Perlegen genotyping data for African American, Caucasian and Chinese populations were also added to the SNPselector database. Users can select one of the genotyping data sources or populations for LD bin analysis.
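To give a feel for LD binning, here is a simplified greedy sketch in the spirit of the approach of Carlson et al. (2004). It is not the ldSelect program: it assumes pairwise r² values have already been computed, it reports one tag SNP per bin, and the 0.8 threshold and the function name `greedy_ld_bins` are assumptions for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def greedy_ld_bins(snps: List[str],
                   r2: Dict[FrozenSet[str], float],
                   threshold: float = 0.8) -> List[Tuple[str, Set[str]]]:
    """Group SNPs into bins, each represented by one tag SNP.

    r2 maps an unordered SNP pair (a frozenset of two ids) to its r^2 value;
    missing pairs are treated as unlinked.
    """
    def linked(a: str, b: str) -> bool:
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = set(snps)
    bins = []
    while remaining:
        # Pick the SNP linked to the most remaining SNPs (ties broken by id).
        tag = max(sorted(remaining),
                  key=lambda s: sum(linked(s, t) for t in remaining if t != s))
        members = {tag} | {t for t in remaining if t != tag and linked(tag, t)}
        bins.append((tag, members))
        remaining -= members
    return bins


pairwise = {
    frozenset(("rs1", "rs2")): 0.90,
    frozenset(("rs2", "rs3")): 0.85,
    frozenset(("rs3", "rs4")): 0.10,
}
ld_bins = greedy_ld_bins(["rs1", "rs2", "rs3", "rs4"], pairwise)
```

In this toy input, `rs2` tags a bin of three SNPs and `rs4` forms a singleton bin, so genotyping two SNPs captures the information of four.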
SNP scoring and prioritization
SNP scoring
After retrieving SNP information, SNPselector scores each SNP in multiple categories.
- LD score: If the SNP is a tagging SNP of an LD bin, its LD score is the number of SNPs in that bin. Otherwise, its LD score is zero. The LD score reflects how informative the tagging SNP is: the higher the score, the more SNPs are in the LD bin and the more SNP information the tagging SNP captures.
- Quality score: If the SNP has experimentally verified genotyping or allele frequency information, it is considered as a high-quality SNP. Its quality score is 1. Otherwise, its quality score is 0.
- Function score: The SNP function score is based on the SNP type annotation from dbSNP. A higher score is assigned to the SNPs that might affect gene transcript structure or protein product, such as coding non-synonymous SNPs or SNPs at a splicing site (Table 1).
- Regulatory potential score: For each SNP not in an exonic region, SNPselector calculates a regulatory potential score based on its location within human/chimp/mouse/rat/dog/chicken/fugu/zebra_fish conserved regions, conserved TFBSs, CpG islands or microRNA genes (Table 2). The individual scores are summed into a single regulatory potential score, so a high score suggests high regulatory potential.
- Repetitive score: If the SNP overlaps with a simple repeat region annotated by UCSC, its repetitive score is 1. Otherwise, its repetitive score is 0.
- Illumina pre-assay score: If SNP genotyping is to be performed with the Illumina bead platform, the user can upload an optional file containing the Illumina pre-assay score into the database. This score is calculated by the Illumina proprietary algorithm to assess the success rate of genotyping the SNP with this platform.
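The rule-based scores in the list above can be sketched as a small record type. The names `SnpScores` and `score_snp` are hypothetical, and the per-feature regulatory weights are passed in by the caller because the actual values belong to Table 2, which is not reproduced here; the weights in the usage example are placeholders only. The function score (Table 1) and the Illumina pre-assay score are omitted for the same reason.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class SnpScores:
    ld: int          # size of the LD bin if the SNP is a tag SNP, else 0
    quality: int     # 1 if genotyping or allele frequency data are verified
    regulatory: int  # sum of per-feature regulatory scores
    repetitive: int  # 1 if the SNP overlaps a simple repeat


def score_snp(tag_bin_size: Optional[int],
              has_verified_data: bool,
              regulatory_features: Iterable[str],
              feature_weights: Dict[str, int],
              in_simple_repeat: bool) -> SnpScores:
    """Apply the scoring rules described in the text to one SNP."""
    return SnpScores(
        ld=tag_bin_size or 0,  # None means the SNP is not a tag SNP
        quality=1 if has_verified_data else 0,
        regulatory=sum(feature_weights.get(f, 0) for f in regulatory_features),
        repetitive=1 if in_simple_repeat else 0,
    )


# Placeholder weights for illustration; NOT the values from Table 2.
weights = {"conserved_region": 1, "cpg_island": 1}
scores = score_snp(4, True, ["conserved_region", "cpg_island"], weights, False)
```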
SNP prioritization
After searching and assigning scores to SNPs, SNPselector sorts SNPs by LD score in descending order so that the tag SNPs with the larger LD bin will be at the top. Then SNPselector sorts SNPs by quality score in descending order, followed by functional score in descending order, so that SNPs with functional impact, such as non-synonymous coding SNPs, will have higher rank than those with unknown function. SNPselector also sorts SNPs by regulatory potential score so that SNPs in conserved regions, CpG islands or TFBSs will be ranked higher than those outside these regions. Finally SNPselector sorts SNPs by repetitive score in increasing order so that SNPs in non-repetitive regions will be ranked higher than those in repetitive regions. Since the output is an easily manipulated spreadsheet, the user can sort the SNPs to highlight different SNP features. For example, if the user wants to find SNPs that might affect gene expression, he/she may choose to sort SNPs by regulatory potential score before sorting SNPs by function score.
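One way to read the prioritization above is as a single multi-key sort in which the LD score is the primary key and the repetitive score is the last tie-breaker. That reading is an assumption, as are the dictionary field names and the score values in the example.

```python
def prioritize(snps):
    """Sort SNP records: high LD, quality, function and regulatory scores first,
    non-repetitive SNPs before repetitive ones."""
    return sorted(
        snps,
        key=lambda s: (-s["ld"], -s["quality"], -s["function"],
                       -s["regulatory"], s["repetitive"]),
    )


candidates = [
    {"rs": "rsA", "ld": 0, "quality": 1, "function": 3, "regulatory": 0, "repetitive": 0},
    {"rs": "rsB", "ld": 5, "quality": 1, "function": 1, "regulatory": 2, "repetitive": 1},
    {"rs": "rsC", "ld": 5, "quality": 1, "function": 1, "regulatory": 2, "repetitive": 0},
    {"rs": "rsD", "ld": 5, "quality": 0, "function": 4, "regulatory": 0, "repetitive": 0},
]
ranked = [s["rs"] for s in prioritize(candidates)]
```

Reordering the tuple in the sort key corresponds to the user re-sorting the spreadsheet, for example placing the regulatory score ahead of the function score when looking for SNPs that might affect gene expression.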
SNP selection and data report
For SNPs that are queried by SNP accession IDs, SNPselector selects and exports all the queried SNPs in one Excel spreadsheet file. For SNPs queried by gene names or gene locations, SNPselector exports the top-ranked SNPs, up to the user-specified number per gene, into one Excel spreadsheet file. It also exports all SNPs available for each gene into a second gene-SNP spreadsheet. For genome-scan SNPs that are queried by genomic regions, SNPselector selects evenly distributed SNPs at the user-specified spacing (in base pairs) and puts the result into one Excel spreadsheet file. In each gene or genome SNP Excel file, SNPselector generates a hyperlink called DAS Link in the first field of the first row. It links to the UCSC genome browser, where the LD bins and selected SNPs are displayed as custom tracks (Fig. 2).
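The text does not spell out how evenly distributed SNPs are chosen. One plausible sketch, offered purely as an assumption, divides the region into windows of the user-specified spacing and keeps the best-ranked SNP in each window; the function name and the tuple layout are hypothetical.

```python
def select_evenly_spaced(snps, region_start, spacing):
    """Keep the best-ranked SNP in each window of `spacing` base pairs.

    snps -- iterable of (rs_id, position, rank); a lower rank means higher priority
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    best = {}
    for rs_id, pos, rank in snps:
        window = (pos - region_start) // spacing
        if window not in best or rank < best[window][2]:
            best[window] = (rs_id, pos, rank)
    # Report the chosen SNPs in genomic order.
    return [best[w][0] for w in sorted(best)]


picked = select_evenly_spaced(
    [("rs1", 1_000, 3), ("rs2", 4_000, 1), ("rs3", 12_000, 2), ("rs4", 26_000, 4)],
    region_start=0,
    spacing=10_000,
)
```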
| RESULTS AND DISCUSSION |
|---|
SNPselector was used to select 5 SNPs for each of 140 candidate genes for human cardiovascular disease (Seo et al., 2004). The 140 genes were widely distributed across the human genome, and SNPs had previously been manually selected from these genes. Among the 700 SNPs selected by SNPselector, all were high quality SNPs with allele frequency or genotyping data, and 582 (83%) were LD tagging SNPs.
Figure 3 shows the distribution of selected SNPs and total SNPs of the 140 genes in different functional categories. The majority of the SNPs were intronic SNPs. Through the SNP prioritizing rules, SNPselector decreased the percentage of intronic SNPs from 67.48% in the total available SNPs to 48.86% in the final selected SNPs. It also enriched SNPs that might have an effect on gene function. These included coding-non-synonymous SNPs (enriched 7 times), splice-site SNPs (enriched 14 times), coding-synonymous SNPs (enriched 7 times) and mRNA-UTR SNPs (enriched 2 times). This enrichment of functional SNPs by SNPselector was similar to the SNPs selected by the manual SNP selection process. In some categories, such as splice-site and mRNA-UTR, SNPselector did even better than the manual SNP selection. SNPselector prioritizes splice-site SNPs at the same level as coding-non-synonymous SNPs and chooses UTR SNPs located in conserved genomic region.
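The percentages quoted above can be checked with simple arithmetic. The sketch below assumes that "enriched N times" means the ratio of a category's share among selected SNPs to its share among all available SNPs; that definition is an interpretation, not something the text states explicitly. Only figures given in the text are used.

```python
def fold_change(selected_pct: float, total_pct: float) -> float:
    """Ratio of a category's share among selected SNPs to its share among all SNPs."""
    return selected_pct / total_pct


# Intronic SNPs: 67.48% of all available SNPs versus 48.86% of selected SNPs.
intronic_fold = round(fold_change(48.86, 67.48), 2)

# LD tagging SNPs among the 700 selected SNPs.
tagging_pct = round(582 / 700 * 100)
```

A fold change below 1 (about 0.72 for intronic SNPs) indicates depletion, while the functional categories listed in the text have fold changes above 1.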
There are a few limitations to SNPselector as it is currently configured. The software requires SNP genotyping data to calculate tagging SNPs with ldSelect. To provide the richest possible dataset, we have merged HapMap genotyping data with other genotyping information, such as Perlegen genotyping data (Hinds et al., 2005). This approach generates a large number of LD bins when there is little overlap between the genotyped datasets. This will become less of an issue as the HapMap genotyping progresses. SNPselector infers each SNP's impact on gene function based on its location (e.g. coding region, promoter site or UTR). However, SNPs located in these regions may not affect the gene function. We are working to add more detailed functional annotation from other resources (Karchin et al., 2005).
In summary, SNPselector is a powerful tool for the identification of SNPs for large-scale genetic association studies. This software's output is comparable to that obtained from manual selection, but can be produced in a fraction of the time. The detailed descriptive output can be formatted for submission to a variety of commercial genotyping systems. SNPselector will be a valuable addition to many high-throughput SNP genotyping applications.
| Acknowledgments |
|---|
We would like to thank Carrie Browning, Liyong Wang, Jason Rose and others for their helpful suggestions. This work was supported by the following grants: P01 HL73042 (NHLBI); R01 AG021547, R01 NS36768 and R01 NS31153 (NINDS); R01 AG19085 (NIA) and R01 EY12012 and R01 EY13315 (NEI). Conflict of Interest: none declared.
Received on August 8, 2005; revised on September 18, 2005; accepted on September 18, 2005
| REFERENCES |
|---|
Birney, E., et al. (2004) An overview of Ensembl. Genome Res., 14, 925-928.
Blanchette, M., et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res., 14, 708-715.
Carlson, C.S., et al. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74, 106-120.
Colomb, E., et al. (2001) Association of a single nucleotide polymorphism in the TIGR/MYOCILIN gene promoter with the severity of primary open-angle glaucoma. Clin. Genet., 60, 220-225.
Conde, L., et al. (2004) PupaSNP Finder: a web tool for finding SNPs with putative effect at transcriptional level. Nucleic Acids Res., 32, W242-W248.
Goldstein, D.B., et al. (2003) Pharmacogenetics goes genomic. Nat. Rev. Genet., 4, 937-947.
Hammer, M.F., et al. (2001) Hierarchical patterns of global human Y-chromosome diversity. Mol. Biol. Evol., 18, 1189-1203.
Hinds, D.A., et al. (2005) Whole-genome patterns of common DNA variation in three human populations. Science, 307, 1072-1079.
Karchin, R., et al. (2005) LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics, 21, 2814-2820.
Karolchik, D., et al. (2003) The UCSC Genome Browser Database. Nucleic Acids Res., 31, 51-54.
Kruglyak, L. and Nickerson, D.A. (2001) Variation is the spice of life. Nat. Genet., 27, 234-236.
Martin, E.R., et al. (2001) Association of single-nucleotide polymorphisms of the tau gene with late-onset Parkinson disease. J. Am. Med. Assoc., 286, 2245-2250.
Riva, A. and Kohane, I.S. (2002) SNPper: retrieval and analysis of human SNPs. Bioinformatics, 18, 1681-1685.
Seo, D., et al. (2004) Gene expression phenotypes of atherosclerosis. Arterioscler. Thromb. Vasc. Biol., 24, 1922-1927.
Siepel, A. and Haussler, D. (2005) Phylogenetic hidden Markov models. In Nielsen, R. (ed.), Statistical Methods in Molecular Evolution. Springer, New York, pp. 325-351.
Underhill, P.A., et al. (2000) Y chromosome sequence variation and the history of human populations. Nat. Genet., 26, 358-361.
Zhao, T., et al. (2004) PromoLign: a database for upstream region analysis and SNPs. Hum. Mutat., 23, 534-539.