Bioinformatics Advance Access originally published online on April 6, 2005
Bioinformatics 2005 21(11):2791-2793; doi:10.1093/bioinformatics/bti403
VariScan: Analysis of evolutionary patterns from large-scale DNA sequence polymorphism data

Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Diagonal 645, 08028 Barcelona, Spain
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: VariScan is a software package for the analysis of DNA sequence polymorphisms at the whole-genome scale. Among other features, the software (1) can conduct many population genetic analyses; (2) incorporates a multiresolution wavelet transform-based method that captures relevant information from DNA polymorphism data; and (3) facilitates the visualization of the results in the most commonly used genome browsers.
Availability: The software with documentation is available under the GNU GPL software license from: http://www.ub.es/softevol/variscan
Contact: jrozas{at}ub.edu
| INTRODUCTION |
|---|
Analyses of DNA sequence polymorphisms and of single nucleotide polymorphisms (SNPs) are powerful approaches for understanding the evolutionary forces underlying nucleotide variation and for mapping disease genes. Currently, the detection of Darwinian positive selection (Sabeti et al., 2002; Olson, 2002) is attracting considerable interest since, for instance, knowledge of the specific gene or genomic region under selection can help pharmaceutical research in vaccine and drug development, or in the design of vaccination strategies. This detection is nevertheless not easy, since demographic processes can mimic the footprint of selection. The distinctive signature of natural selection can still be detected by analysing the spatial distribution of polymorphisms across broad regions of the genome (Sabeti et al., 2002; Quesada et al., 2003) using coalescent-based methods (Kingman, 1982; Hudson, 1990; Rosenberg and Nordborg, 2002).
The sliding window (SW) method has been extensively used for exploratory analysis of DNA polymorphism data (Rozas and Rozas, 1995). Unfortunately, the SW approach has a number of limitations, such as the choice of an appropriate window size and the problem of multiple comparisons, that are critical in genome-wide analyses. Here, we describe the VariScan software, which has been designed for the analysis of DNA sequence polymorphisms at the whole-genome scale. Among other features, VariScan incorporates a wavelet transform (WT)-based analysis for capturing relevant information from DNA polymorphism data (Liò, 2003). WT separates a signal into low- and high-frequency components and could therefore be useful in capturing global and local features of DNA polymorphism data, such as conserved regions, peaks and valleys of nucleotide diversity, and linkage disequilibrium (LD) clusters. The software thus has the data handling and analysis capabilities needed for genome-wide resequencing projects, which ultimately could lead to the detection of the imprint of natural selection.
| SYSTEMS AND METHODS |
|---|
VariScan is written in ANSI C and has been tested on Linux, MacOSX and Win32 platforms. It has been optimized for the high-speed processing of large DNA sequence data files (100 sequences of 100 Mb each). Indeed, the algorithms have been implemented to run in O(n) time, where n is the length of the DNA sequence. The input files are multiple aligned DNA sequence data in a number of formats, such as MAF, MGA, XMFA, PHYLIP or the HapMap genotype format. Although VariScan can conduct some analyses using unphased data (in general, genotypic data are phase-unknown), gametic phase information is needed by some of the implemented methods (e.g. LD, haplotype diversity); therefore, the gametic phase should be determined before using these methods in VariScan.
| IMPLEMENTATION |
|---|
Molecular population genetics parameters
VariScan implements a number of population genetics parameters, including coalescent-based statistics (Rozas et al., 2003). In particular, VariScan estimates (1) summary statistics of nucleotide and haplotype polymorphism levels, (2) linkage disequilibrium-based statistics and (3) several coalescent-based tests of neutrality. VariScan can estimate these parameters on a specified number of sequences and offers several options for treating gaps and missing data. All of these analyses can be conducted using the sliding window (SW) method, which, in turn, can be used to obtain a graphical representation of the results.
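To make the sliding-window idea concrete, the following is a minimal Python sketch of a windowed nucleotide diversity (π) profile. It is not VariScan's code (VariScan is written in ANSI C). The function names `site_pi` and `sliding_pi`, the per-site averaging, and the choice to simply skip gap and ambiguous characters at each site are all assumptions made for illustration.

```python
def site_pi(column):
    """Unbiased per-site diversity for one alignment column.

    Assumption: any character other than A, C, G or T (gaps, N, ...) is
    treated as missing data and skipped at that site.
    """
    bases = [b for b in column.upper() if b in "ACGT"]
    n = len(bases)
    if n < 2:
        return None  # too little data to estimate diversity here
    counts = {}
    for b in bases:
        counts[b] = counts.get(b, 0) + 1
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    # Expected heterozygosity with the n/(n-1) small-sample correction.
    return n / (n - 1) * (1.0 - sum_sq)


def sliding_pi(seqs, width, step):
    """Return (start, end, pi) tuples with 1-based inclusive coordinates."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    per_site = [site_pi("".join(s[i] for s in seqs)) for i in range(length)]
    windows = []
    for start in range(0, length - width + 1, step):
        vals = [v for v in per_site[start:start + width] if v is not None]
        # Average over the sites that had usable data in this window.
        pi = sum(vals) / len(vals) if vals else None
        windows.append((start + 1, start + width, pi))
    return windows
```

For example, `sliding_pi(["AAAA", "AAAT", "AATT"], 2, 2)` yields two non-overlapping windows, the first monomorphic and the second polymorphic. The profile depends directly on `width` and `step`, which is the window-size sensitivity discussed above.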
Wavelet transform-based analysis
VariScan can also conduct a signal decomposition analysis by means of WT methods. Unlike SW, WT-based results are nearly independent of the chosen window length and are therefore more suitable for the separate detection of features of variable lengths, independently of their genomic background. The signal, which is the raw profile of the population genetic parameter estimated along the DNA sequence, is analysed using the LastWave v2.0 software (E. Bacry, http://www.cmap.polytechnique.fr/~bacry/LastWave/). We chose Daubechies' D4 as the default wavelet filter (Daubechies, 1992) since it is adequate for locating features such as peaks and valleys in a signal, with a minimum degree of smoothness (Liò and Vanucci, 2000); this filter can nevertheless be changed by the user. The signal is then decomposed at all analysing levels (multiresolution analysis, MRA; Mallat, 1999) using the orthogonal wavelet decomposition method; the orthogonality of Daubechies wavelets allows the subsequent reconstruction of the signal. The outcome, which is the set of reconstructed wavelet-transform profiles of the population genetic parameter along the sequence, can be used to identify genomic features at multiple resolution levels (i.e. at global and local scales); for instance, features located in diverse nucleotide diversity backgrounds.
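The decomposition and reconstruction described above can be illustrated with a small, self-contained Python sketch of an orthogonal D4 multiresolution analysis. This is not LastWave and not VariScan's code: the function names are hypothetical, periodic boundary handling is an assumption chosen for simplicity, and the signal length is assumed to be divisible by 2 raised to the number of levels.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Daubechies D4 low-pass (scaling) filter and its mirror high-pass filter.
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
G = [H[3], -H[2], H[1], -H[0]]


def forward_step(x):
    """One analysis level with periodic wrap-around: (approximation, detail)."""
    n = len(x)
    approx = [sum(H[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(n // 2)]
    detail = [sum(G[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(n // 2)]
    return approx, detail


def inverse_step(approx, detail):
    """Invert forward_step; the transform is orthogonal, so this is its transpose."""
    n = 2 * len(approx)
    x = [0.0] * n
    for i in range(len(approx)):
        for k in range(4):
            x[(2 * i + k) % n] += H[k] * approx[i] + G[k] * detail[i]
    return x


def mra(signal, levels):
    """Return ([D1, ..., DJ], SJ): full-length detail profiles and the smooth."""
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = forward_step(approx)
        details.append(d)

    def rebuild(a, ds):
        for d in reversed(ds):
            a = inverse_step(a, d)
        return a

    zeros = [[0.0] * len(d) for d in details]
    # Reconstruct each level on its own by zeroing every other coefficient set.
    profiles = [
        rebuild([0.0] * len(approx), [details[i] if i == j else zeros[i] for i in range(levels)])
        for j in range(levels)
    ]
    smooth = rebuild(approx, zeros)
    return profiles, smooth
```

Because the transform is linear and orthogonal, the detail profiles and the smooth add back up to the input signal. The smooth corresponds to the global scale, and the low-numbered detail profiles correspond to the most local features.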
Results visualization
VariScan permits the visualization of the results through available genome browsers (Fig. 1). For instance, VariScan can write the outcome in custom annotation track formats, such as the WIG format used in the Genome Browser at UCSC (Kent et al., 2002) or the xyplot format in GBrowse (Stein et al., 2002), providing a visual representation of the wavelet-transform profile integrated with current annotation tracks for the genome of interest. As a result, it is possible to relate the statistic profiles (of nucleotide diversity, LD, etc.) to annotated genomic features (specific genes, intergenic regions, haplotype information, etc.) from available genome projects.
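As an illustration of the kind of output involved, the sketch below formats a windowed profile as a fixedStep WIG custom track. It is not VariScan's writer: the function name `to_wig`, the default track name and the number formatting are assumptions, and it supposes equally spaced, non-overlapping windows with no missing values.

```python
def to_wig(chrom, start, step, values, name="VariScan_profile"):
    """Format equally spaced window values as a fixedStep WIG track.

    `start` is the 1-based position of the first window. Each value is
    assumed to cover `step` bases, so span is set equal to step.
    """
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"fixedStep chrom={chrom} start={start} step={step} span={step}",
    ]
    lines += [f"{v:.6g}" for v in values]
    return "\n".join(lines) + "\n"
```

Calling `to_wig("chr1", 1, 100, [0.0012, 0.0034])` produces a track line, a declaration line and one value per window, which can be saved to a file and loaded as a custom track.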
| Acknowledgments |
|---|
We are very grateful to M. Aguadé, J.M. Aroca, B. Audit, M. Casas, J. Castresana, S.O. Kolokotronis and C. Segarra for their valuable comments and suggestions. This work was supported by grant BMC2001-2906 from the Dirección General de Investigación Científica y Técnica, Spain, conferred on M. Aguadé, and by grant TXT98-1802 from the Dirección General de Enseñanza Superior e Investigación Científica, Spain, conferred on J.R.
| Footnotes |
|---|
Present address: Department Biology II, Evolutionary Biology, University of Munich, Munich, Germany
Received on January 19, 2005; revised on March 16, 2005; accepted on March 21, 2005
| REFERENCES |
|---|
Daubechies, I. (1992) Ten Lectures on Wavelets. SIAM, Philadelphia.
Hudson, R.R. (1990) Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol., 7, 1-44.
Kent, W.J., et al. (2002) The Human Genome Browser at UCSC. Genome Res., 12, 996-1006.
Kingman, J.F.C. (1982) On the genealogy of large populations. J. Appl. Prob., 19A, 27-43.
Liò, P. (2003) Wavelets in bioinformatics and computational biology: state of art and perspectives. Bioinformatics, 19, 2-9.
Liò, P. and Vanucci, M. (2000) Finding pathogenicity islands and gene transfer events in genome data. Bioinformatics, 16, 932-940.
Mallat, S. (1999) A Wavelet Tour of Signal Processing, 2nd edn. Academic Press, San Diego.
Olson, S. (2002) Seeking the signs of selection. Science, 298, 1324-1325.
Patil, N., et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719-1723.
Quesada, H., et al. (2003) Large-Scale Adaptive Hitchhiking upon High Recombination. Genetics, 165, 895-900.
Rosenberg, N.A. and Nordborg, M. (2002) Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms. Nat. Rev. Genet., 3, 380-390.
Rozas, J. and Rozas, R. (1995) DnaSP, DNA sequence polymorphism: an interactive program for estimating population genetics parameters from DNA sequence data. Comput. Appl. Biosci., 11, 621-625.
Rozas, J., et al. (2003) DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics, 19, 2496-2497.
Sabeti, P.C., et al. (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.
Stein, L.D., et al. (2002) The generic genome browser: a building block for a model organism system database. Genome Res., 12, 1599-1610.