Bioinformatics Advance Access originally published online on March 15, 2005
Bioinformatics 2005 21(10):2517-2519; doi:10.1093/bioinformatics/bti377
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

SNPNB: analyzing neighboring-nucleotide biases on single nucleotide polymorphisms (SNPs)

Fengkai Zhang 1 and Zhongming Zhao 1,2,3,*

1Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA 23298, USA
2Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, VA 23284, USA
3Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223, China

*To whom correspondence should be addressed at: Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, PO Box 980126, Richmond, VA 23298-0126, USA


    Abstract

Summary: SNPNB is a user-friendly and platform-independent application for analyzing Single Nucleotide Polymorphism NeighBoring sequence context and nucleotide bias patterns, and subsequently evaluating the effective SNP size for the bias patterns observed from the whole data. It is implemented in Java and Perl. SNPNB can efficiently handle genome-wide or chromosome-wide SNP data analysis on a PC or a workstation. It provides visualizations of the bias patterns for all SNPs or for each SNP type.

Availability: SNPNB and its full description are freely available at http://bioinfo.vipbg.vcu.edu/SNPNB/

Contact: zzhao{at}vcu.edu

Single nucleotide polymorphism (SNP) discovery is of major interest in the post-genome era because SNPs have broad applications in biological fields, such as fine mapping, disease studies, population genetics and molecular evolution (Gibbs et al., 2003). As of November 2004, >16 million unique SNPs from 23 species (Build 123) have been released in the dbSNP database of the National Center for Biotechnology Information (NCBI); meanwhile, millions of SNPs are available in the private domain, such as Celera's RefSNP database. This provides us with an unprecedented opportunity to examine the local DNA-sequence context of SNPs and, therefore, to understand the molecular mechanisms of genome sequence evolution. Recent studies revealed that neighboring-nucleotide biases on SNPs were strong in the human and mouse genomes (Krawczak et al., 1998; Zhao and Boerwinkle, 2002; Zhang and Zhao, 2004). Although many computational tools have recently been developed for SNP data, they are mainly applied to data extraction from various databases and allele frequency estimation (Riva and Kohane, 2002; Nguyen et al., 2004), PCR primer and microarray probe design (Weckx et al., 2005), haplotype block partition (Zhang et al., 2005) and functional prediction (Ng and Henikoff, 2003). To our knowledge, there is no user-friendly application for examining the neighboring sequence context of SNPs, which contains abundant genetic information for studying the molecular mechanisms of mutation and genome evolution. Because millions of SNPs from many genomes will be discovered in the near future, we developed and present here a novel application, the Single Nucleotide Polymorphism NeighBoring application (SNPNB), to facilitate the investigation of the neighboring-nucleotide patterns of genetic polymorphisms in whole genomes or in different genome structural categories (e.g. coding regions).

The details of the implementation and the instructions for SNPNB are available at http://bioinfo.vipbg.vcu.edu/SNPNB/. SNPNB was implemented in Java and Perl. Executing it requires the Java 2 Runtime Environment (JRE) and a Perl interpreter. SNPNB provides an interactive, user-friendly interface that allows the user to choose data, set parameters, execute and monitor the computing jobs and view the results (Fig. 1). The graphical user interface (GUI) and the graphical display of the results were implemented with the Java Swing API. The program has been tested on both a standard Windows PC and a Linux workstation. It should be able to run on all versions of the Windows operating system from 95 to XP and in a Linux/Unix environment, owing to the platform independence of Java and Perl.



Fig. 1 SNPNB display of the neighboring-nucleotide biases relative to the human genome average. The program computes the nucleotide frequencies for each neighboring site and obtains the biases relative to the average values. The left panel has the options to display the results in different categories.

 
SNPNB provides two main utilities: (1) analysis of neighboring-nucleotide patterns of SNPs and (2) statistical evaluation of the effective size of SNPs, that is, the number of SNPs that is sufficient to represent the bias patterns observed from the whole data. The method of analyzing nucleotide compositions and biases in the SNP flanking sequences was, in general, described in our previous studies (Zhao and Boerwinkle, 2002; Zhang and Zhao, 2004). SNPNB accepts data in FASTA format, which is widely used in the dbSNP, Celera RefSNP and other databases. It has options for the analysis of SNPs with known or unknown mutation direction. After the user chooses the sites or ranges in the flanking sequences, SNPNB first computes the nucleotide frequencies for each site/range. It subsequently obtains the proportion biases by comparing the frequency values computed from the data with the reference average values, which can be either entered manually by the user or computed from a reference sequence (e.g. human genome sequence). The results may be displayed in a table or in graphics. The user has the option to display the neighboring-nucleotide frequencies or biases by region (5' side, 3' side or two sides combined) or by SNP type (e.g. A/G, C/T, A/C, G/T, A/T, C/G or all).
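The frequency and bias computation described above can be illustrated with a minimal Python sketch. It is not SNPNB's own code (SNPNB is written in Java and Perl), and it rests on several assumptions: the function names are hypothetical, the flanking sequences are already parsed from FASTA into equal-length strings, sites are addressed by a 0-based index, and the bias is taken to be the simple difference between the observed proportion and the reference average.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def site_frequencies(flanks, site):
    """Return the proportion of A, C, G and T at one flanking position.

    `flanks` is a list of equal-length flanking sequences and `site` is a
    0-based index into each sequence (a hypothetical convention).
    """
    counts = Counter(seq[site].upper() for seq in flanks)
    # Only unambiguous nucleotides contribute to the total (N etc. are ignored).
    total = sum(counts[n] for n in NUCLEOTIDES)
    if total == 0:
        return {n: 0.0 for n in NUCLEOTIDES}
    return {n: counts[n] / total for n in NUCLEOTIDES}


def site_biases(flanks, site, reference):
    """Return observed proportion minus reference proportion per nucleotide.

    `reference` maps each nucleotide to its average proportion, either
    entered by the user or computed from a reference sequence.
    Assumption: bias is a difference, not a ratio.
    """
    freqs = site_frequencies(flanks, site)
    return {n: freqs[n] - reference[n] for n in NUCLEOTIDES}


if __name__ == "__main__":
    flanks = ["AC", "AG", "CG", "AT"]  # toy flanking sequences
    reference = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # toy averages
    print(site_frequencies(flanks, 0))
    print(site_biases(flanks, 0, reference))
```

A positive value means the nucleotide is over-represented at that site relative to the reference average, and a negative value means it is under-represented.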

The second utility is to evaluate the effective SNP size. This is important for genome-wide or chromosome-wide analysis because the user would like to evaluate whether the observed patterns are representative or random in the genome. A small effective size means high confidence in the observed biases. A resampling algorithm was implemented. For the total number of SNPs N, the user sets an initial subsample size n, where n << N. The program generates m random subsamples, each of size n. The program then computes the biases for each of the m samples, compares them with the biases observed from the whole data and shows the likelihood (%) of the m samples having the observed bias patterns relative to the user-defined constraints. Theoretically, the next step is an iterative procedure to find the minimum number of SNPs after each round of evaluation of the m subsamples of size n; however, because the resampling algorithm is usually computationally intensive, it is extremely slow and probably unrealistic for a large dataset like human SNPs (N > 8 million). We made two major improvements. First, we improved the resampling algorithm as follows:

  1. A stratified sampling strategy is applied to generate random numbers, i.e. the program generates random numbers for each subrange of N. This significantly improves the efficiency, e.g. it reduced the computational time from ~30 h to 4 h to generate 1000 samples of size 50 000 for N = 8 043 656 on a PC.
  2. When N is large, it is very time consuming to extract m samples and then to compute the biases because the data file has to be scanned m times. In SNPNB, the random numbers indexed by sample are transferred and sorted into an array indexed by SNP. The data file then needs to be read only once, which takes ~1/m (e.g. 1/1000) of the computational time.
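The two improvements above can be sketched in Python as follows. This is an illustration under stated assumptions, not SNPNB's implementation: the function names are hypothetical, the stratification is assumed to split the N indices into n contiguous subranges with one draw per subrange, and a dictionary keyed by SNP index stands in for the sorted array indexed by SNP that the text describes.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(N, n, rng):
    """Draw n distinct indices from range(N), one per contiguous subrange.

    Assumes 1 <= n <= N so that every subrange is non-empty.
    """
    bounds = [i * N // n for i in range(n + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(n)]


def invert_samples(samples):
    """Map each SNP index to the ids of the subsamples that contain it."""
    by_snp = defaultdict(list)
    for sample_id, indices in enumerate(samples):
        for snp_index in indices:
            by_snp[snp_index].append(sample_id)
    return by_snp


def count_in_one_pass(records, samples, site):
    """Scan `records` once and tally nucleotides at `site` for every subsample.

    `records` is any iterable of flanking sequences in file order, so a
    large data file is read a single time rather than once per subsample.
    """
    by_snp = invert_samples(samples)
    counts = [Counter() for _ in samples]
    for snp_index, flank in enumerate(records):
        for sample_id in by_snp.get(snp_index, ()):
            counts[sample_id][flank[site]] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the toy run repeatable
    records = ["A", "C", "G", "T", "A", "C", "G", "T"]  # toy 1-base flanks
    samples = [stratified_sample(len(records), 4, rng) for _ in range(3)]
    print(samples)
    print(count_in_one_pass(records, samples, 0))
```

Because each subrange contributes exactly one index, a subsample never needs a duplicate check, and the single scan replaces m separate passes over the data.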

Second, because a number close to the effective SNP size should be sufficient in research, we suggest that the user evaluate whether the number is a reasonable estimate. The user may set the median likelihood value of all sites on the 5' side, 3' side and both sides to be at least 80%, which is commonly accepted in statistical power analysis. The user provides an initial empirical effective size to run SNPNB and then increases or decreases the size for the next round after an evaluation of the likelihood of the bias patterns. It may take a few rounds for the user to finally obtain a number close to the effective SNP size (e.g. 10 000 -> 100 000 -> 50 000 -> 30 000 for human SNPs). Although it is still slow despite the algorithm improvements above, it was able to obtain an approximate effective SNP size for humans within a couple of days and for mice within a day on a standard PC.
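One round of this evaluation might look like the Python sketch below. It is a simplified reading of the text rather than SNPNB's code: the function names are hypothetical, each site is reduced to a single bias value for brevity (SNPNB reports biases per nucleotide), and the user-defined constraint is assumed to be an absolute tolerance around the bias observed in the whole data.

```python
from statistics import median


def site_likelihood(sample_biases, observed_bias, limit):
    """Percentage of subsamples whose bias lies within `limit` of the observed bias."""
    hits = sum(1 for b in sample_biases if abs(b - observed_bias) <= limit)
    return 100.0 * hits / len(sample_biases)


def size_is_sufficient(per_site_sample_biases, per_site_observed, limit,
                       threshold=80.0):
    """Return True if the median site likelihood reaches `threshold` percent.

    `per_site_sample_biases[i]` holds the m subsample biases at site i and
    `per_site_observed[i]` the whole-data bias at the same site.
    """
    likelihoods = [
        site_likelihood(sample_biases, observed, limit)
        for sample_biases, observed in zip(per_site_sample_biases,
                                           per_site_observed)
    ]
    return median(likelihoods) >= threshold


if __name__ == "__main__":
    # Toy values: two sites, four subsamples each.
    sample_biases = [[0.010, 0.012, 0.020, 0.011], [0.0, 0.0, 0.0, 0.0]]
    observed = [0.011, 0.0]
    print(size_is_sufficient(sample_biases, observed, limit=0.003))
```

If the check fails, the user would rerun with a larger subsample size n; if it passes comfortably, a smaller n can be tried in the next round.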

To illustrate the use of SNPNB, we reanalyzed the neighboring-nucleotide biases and the effective SNP size using human (8 043 656 SNPs) and mouse (469 445 SNPs) data (dbSNP build 121, ftp://ftp.ncbi.nih.gov/snp/). The results confirmed the bias patterns observed in our previous studies (Fig. 1). When we chose 1000 random subsamples, required the median likelihood value of all sites to be at least 80% and set the limit to 0.3% (i.e. ±0.3% of the observed biases), we obtained an effective SNP size of 30 000 for both humans and mice, which is somewhat larger than our previous estimates (Zhang and Zhao, 2004). Since SNPNB can only obtain a number close to the effective SNP size, the user should interpret the estimated number cautiously. A summary of the elapsed computation times and screenshots of the results are shown on the SNPNB website.

SNPNB provides a powerful tool to examine and compare the local sequence patterns of SNPs. The evaluation of the effective SNP size provides a confidence level for interpreting the bias patterns. New features are being added for the analysis of short fragments and CpG dinucleotides in the SNP flanking sequences, comparative analysis of neighboring biases for coding versus non-coding SNPs and statistical improvements in estimating the effective SNP size.


    Acknowledgments
 
We thank the two anonymous reviewers for their constructive comments. This project was supported in part by a NARSAD Young Investigator Award.

Received on September 26, 2004; revised on February 2, 2005; accepted on March 4, 2005

    REFERENCES

    Gibbs, R.A., et al. (2003) The International HapMap Project. Nature, 426, 789–796.

    Krawczak, M., et al. (1998) Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am. J. Hum. Genet., 63, 474–488.

    Ng, P.C. and Henikoff, S. (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res., 31, 3812–3814.

    Nguyen, T.H., et al. (2004) Frequency Finder: a multi-source web application for collection of public allele frequencies of SNP markers. Bioinformatics, 20, 439–443.

    Riva, A. and Kohane, I.S. (2002) SNPper: retrieval and analysis of human SNPs. Bioinformatics, 18, 1681–1685.

    Weckx, S., et al. (2005) SNPbox: a modular software package for large-scale primer design. Bioinformatics, 21, 385–387.

    Zhang, F. and Zhao, Z. (2004) The influence of neighboring-nucleotide composition on single nucleotide polymorphisms (SNPs) in the mouse genome and its comparison with human SNPs. Genomics, 84, 785–795.

    Zhang, K., et al. (2005) HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics, 21, 131–134.

    Zhao, Z. and Boerwinkle, E. (2002) Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res., 12, 1679–1686.

