Bioinformatics Advance Access originally published online on September 7, 2004
Bioinformatics 2005 21(3):402-404; doi:10.1093/bioinformatics/bti003
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
SNAP: workbench management tool for evolutionary population genetic analysis
1 Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27695, USA
2 Department of Computer Sciences, North Carolina State University, Raleigh, NC 27695, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: The reconstruction of population processes from DNA sequence variation requires the coordinated implementation of several coalescent-based methods, each bound by specific assumptions and limitations. In practice, the application of these coalescent-based methods for parameter estimation is difficult because they make strict assumptions that must be verified a priori and their parameter-rich nature makes the estimation of all model parameters very complex and computationally intensive. A further complication is their distribution as console applications that require the user to navigate through console menus or specify complex command-line arguments. To facilitate the implementation of these coalescent-based tools we developed SNAP Workbench, a Java program that manages and coordinates a series of programs. The workbench enhances population parameter estimation by ensuring that the assumptions and program limitations of each method are met and by providing a step-by-step methodology for examining population processes that integrates both summary-statistic methods and coalescent-based population genetic models.
Availability: SNAP Workbench is freely available at http://snap.cifr.ncsu.edu. The workbench and tools can be downloaded for Mac, Windows and Unix operating systems. Each package includes installation instructions, program documentation and a sample dataset.
Contact: ignazio_carbone@ncsu.edu
Supplementary information: A description of system requirements and installation instructions can be found at http://snap.cifr.ncsu.edu
| INTRODUCTION |
|---|
In recent years, rapid advances in DNA sequencing technology and population genetic theory have resulted in a plethora of new approaches for making inferences on population processes from DNA sequence variation. Among these are tools for estimating population mutation and migration rates, such as MIGRATE (Beerli and Felsenstein, 1999, 2001), MDIV (Nielsen and Wakeley, 2001) and Genetree (Bahlo and Griffiths, 2000); recombination rates, such as Recom58 (Griffiths and Marjoram, 1996), Infs and Fins (Fearnhead and Donnelly, 2001) and Recombine (Kuhner et al., 2000); migration and recombination, such as LAMARC (Beerli and Felsenstein, 2001; Kuhner et al., 2000); and selection (Neuhauser and Krone, 1997). These approaches were built on Wright-Fisher population genetic models by incorporating probability-based coalescent methods, which take full advantage of the data and their inherent stochastic properties (Kingman, 1982a,b,c).
In practice, there are three main limitations to using these coalescent-based methods: (1) they make strict assumptions that must be verified a priori; (2) their parameter-rich nature makes the estimation of all model parameters very complex and computationally intensive; and (3) they are distributed as console applications written in C and require the user to navigate through console menus or specify complex command-line arguments. Although many tools and techniques are being developed for analyzing population-based DNA sequence variation, very few provide step-by-step methodologies for integrating multiple analysis methods into a readily accessible, user-friendly package.
The development of an integrated software environment would eliminate incompatibilities due to the strict data format requirements of different programs and allow data input and output to flow seamlessly between different analysis modules. This approach has been the goal of several emerging program suites, such as Mesquite (Maddison and Maddison, 2002; http://mesquiteproject.org) and EMBOSS (Rice et al., 2000), which present the user with an arsenal of molecular tools and analysis methods. Two major limitations of these software programs are that they provide little or no guidance on how to perform a specific analysis and that they require source code modification of existing C program modules before their inclusion.
| SYSTEMS AND METHODS |
|---|
We have developed a workbench program that can manage and coordinate a suite of nucleotide analysis programs (SNAP). These include the alignment program ClustalW (Thompson et al., 1994); the phylogenetic analysis programs PHYLIP (Felsenstein, 2004) and PAUP* 4.0 (Swofford, 1998); the non-parametric permutation analysis programs Seqtomatrix and Permtest (Hudson et al., 1992); the recombination detection programs RecMin (Myers and Griffiths, 2003) and RecPars (Hein, 1990); the coalescent-based programs Genetree, Recom58, MDIV and MIGRATE; and additional programs (SNAP Combine, Map, Clade and Matrix; Fig. 1; Carbone et al., 2004). The workbench was designed to facilitate the incorporation of new tools as they become available, thereby serving as a bridge between theoretical and applied population genetic analysis. Although our workbench was designed to target population genetics, its approach to workflow integration is flexible enough to be applied in other fields as well. Our goals in developing the workbench were to:
- eliminate the need to use the command line,
- integrate a wide array of approaches for analyzing population genetic data based on both traditional summary-statistic methods and the newer coalescent-based population genetic models,
- ensure that the assumptions and program requirements of each method are not violated and
- provide interactive user tutorials for teaching and training.
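As an illustration of the third goal, the sketch below shows what an automated assumption check might look like before data are passed to a method. The choice of assumption is ours and is not taken from the text: it tests whether an alignment is compatible with an infinite-sites model without recombination, using the standard four-gamete test. The function names are hypothetical and this is not presented as the workbench's own code.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Return the column indices at which the aligned sequences differ."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def check_infinite_sites(seqs):
    """Return a list of human-readable violations (empty if none found).

    Assumption for this sketch: `seqs` is a list of equal-length strings,
    one per sampled sequence, with no gap or ambiguity handling.
    """
    if len({len(s) for s in seqs}) != 1:
        return ["sequences are not aligned to the same length"]
    problems = []
    sites = segregating_sites(seqs)
    # Under infinite sites each variable column carries at most two alleles.
    for i in sites:
        if len({s[i] for s in seqs}) > 2:
            problems.append(f"site {i} has more than two alleles")
    # Four-gamete test: two biallelic sites that show all four allele
    # combinations cannot be explained by one tree without recurrent
    # mutation or recombination.
    biallelic = [i for i in sites if len({s[i] for s in seqs}) == 2]
    for i, j in combinations(biallelic, 2):
        if len({(s[i], s[j]) for s in seqs}) == 4:
            problems.append(f"sites {i} and {j} fail the four-gamete test")
    return problems
```

An empty result means no violation was detected; any other result lists the sites that would need attention before a method relying on this assumption is run.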
| Framework |
|---|
The workbench was programmed in Java to preserve platform independence across multiple operating systems. The program modules integrated in the workbench are written in C or Java and can be readily compiled on a variety of computing platforms. SNAP Workbench allows the user to customize the interface for available program modules without requiring computer programming or shell scripting skills. This is accomplished using the template design feature of the workbench. Templates allow the user to create and organize drop-down menus in the interface. Each menu option is further divided into submenus. Submenus define a set of programs, or options within a single program, that are executed sequentially to complete a particular analysis. For example, the submenu option for performing a Nonparametric test for population subdivision under the Migration menu requires the user to execute the programs SNAP Map, Seqtomatrix and Permtest sequentially.
Because there is usually more than one way to perform an analysis, the workbench uses multithreading to run programs simultaneously, allowing the user to explore different scenarios. Multiple files are displayed in a tabbed format for easy access, and line wrapping has been disabled to facilitate viewing of long DNA sequences. Files may be edited and saved directly in the workbench using basic text-editing functions.
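The template and multithreading ideas described above can be sketched in a few lines. The example is written in Python for brevity, although the workbench itself is a Java program, and everything in it is an assumption made for illustration: the dictionary layout, the function names and the placeholder steps, which only record that a program would have run rather than invoking SNAP Map, Seqtomatrix or Permtest.

```python
from concurrent.futures import ThreadPoolExecutor


def make_step(name):
    """Build a placeholder step that only records that `name` ran."""
    def step(log):
        # Return a new list so concurrent scenarios never share state.
        return log + [name]
    return step


# Hypothetical template: menu -> submenu -> ordered list of steps.
TEMPLATE = {
    "Migration": {
        "Nonparametric test for population subdivision": [
            make_step("SNAP Map"),
            make_step("Seqtomatrix"),
            make_step("Permtest"),
        ],
    },
}


def run_submenu(template, menu, submenu, data):
    """Run the steps of one submenu in order, feeding output to input."""
    for step in template[menu][submenu]:
        data = step(data)
    return data


def run_scenarios(template, menu, submenu, inputs):
    """Run the same submenu on several inputs at once, one thread each."""
    with ThreadPoolExecutor(max_workers=max(1, len(inputs))) as pool:
        futures = [
            pool.submit(run_submenu, template, menu, submenu, item)
            for item in inputs
        ]
        # Results come back in the order the inputs were given.
        return [f.result() for f in futures]
```

Keeping the template as plain data is what would let a user rearrange menus without programming: adding an analysis means adding an entry to the structure, not writing new control flow.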
| IMPLEMENTATION |
|---|
SNAP Workbench has recently been used to examine recombination and migration in Cryphonectria hypovirus 1 (CHV-1) (Carbone et al., 2004). The multithreaded capability of our workbench was particularly important in this study because many independent coalescent runs were necessary to ensure convergence of the programs MIGRATE, MDIV, Recom58 and Genetree. A flowchart showing all the programs and analysis paths for inferring migration and recombination processes in CHV-1 is shown in Figure 1. The template design feature of the workbench allowed us to create menus consisting of a defined set of programs, assumptions and parameter settings for following a particular path in the flowchart (e.g. see paths 1 and 2 in Fig. 1). Currently, the workbench is designed to operate on a single machine. Future versions of SNAP Workbench will be able to use distributed parallel processing on Linux clusters and supercomputers for performing computationally intensive simulations, and will integrate tutorials that provide comprehensive hands-on training in the different analysis methods in the workbench.
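The practice of launching several independent runs and comparing them can be illustrated with a toy example. Nothing here calls MIGRATE, MDIV, Recom58 or Genetree: `toy_run` is a hypothetical stand-in that returns a noisy estimate from a seeded random number generator, and the agreement rule (spread within a relative tolerance) is an assumption chosen for simplicity, not the criterion used in the study.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def toy_run(seed, n_draws=20000):
    """Placeholder for one stochastic run: a noisy estimate near 1.0."""
    rng = random.Random(seed)
    return statistics.fmean([rng.gauss(1.0, 0.5) for _ in range(n_draws)])


def replicate(run, seeds):
    """Launch one run per seed concurrently; results follow seed order."""
    with ThreadPoolExecutor(max_workers=max(1, len(seeds))) as pool:
        return list(pool.map(run, seeds))


def runs_agree(estimates, rel_tol=0.05):
    """True if the spread among runs is small relative to their mean."""
    mean = statistics.fmean(estimates)
    return (max(estimates) - min(estimates)) <= rel_tol * abs(mean)
```

If the estimates from different seeds disagree, the usual response is to lengthen the runs or revisit the settings before interpreting any single result.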
| Acknowledgments |
|---|
The authors thank Doug Brown, Judy Jakobek and two anonymous reviewers for providing valuable comments.
Received on June 18, 2004; revised on August 5, 2004; accepted on August 24, 2004
| REFERENCES |
|---|
Bahlo, M. and Griffiths, R.C. (2000) Inference from gene trees in a subdivided population. Theoret. Popul. Biol., 57, 79-95.
Beerli, P. and Felsenstein, J. (1999) Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics, 152, 763-773.
Beerli, P. and Felsenstein, J. (2001) Maximum-likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl Acad. Sci. USA, 98, 4563-4568.
Carbone, I., Liu, Y., Hillman, B.I. and Milgroom, M.G. (2004) Recombination and migration of Cryphonectria hypovirus 1 as inferred from gene genealogies and the coalescent. Genetics, 166, 1611-1629.
Fearnhead, P. and Donnelly, P. (2001) Estimating recombination rates from population genetic data. Genetics, 159, 1299-1318.
Felsenstein, J. (2004) PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genomic Sciences, University of Washington, Seattle, WA.
Griffiths, R. and Marjoram, P. (1996) Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol., 3, 479-502.
Hein, J. (1990) Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci., 98, 185-200.
Hudson, R.R., Boos, D.D. and Kaplan, N.L. (1992) A statistical test for detecting geographic subdivision. Mol. Biol. Evol., 9, 138-151.
Kingman, J.F.C. (1982a) On the genealogy of large populations. J. Appl. Probab., 19, 27-43.
Kingman, J.F.C. (1982b) Exchangeability and the evolution of large populations. In Koch, G. and Spizzichino, F. (eds), Exchangeability in Probability and Statistics. North-Holland, Amsterdam, pp. 97-112.
Kingman, J.F.C. (1982c) The coalescent. Stochastic Processes and their Applications, 13, 235-248.
Kuhner, M.K., Yamato, J. and Felsenstein, J. (2000) Maximum likelihood estimation of recombination rates from population data. Genetics, 156, 1393-1401.
Maddison, W.P. and Maddison, D.R. (2002) Mesquite: a modular system for evolutionary analysis. Version 0.992.
Myers, S.R. and Griffiths, R.C. (2003) Bounds on the minimum number of recombination events in a sample history. Genetics, 163, 375-394.
Neuhauser, C. and Krone, S.M. (1997) The genealogy of samples in models with selection. Genetics, 145, 519-534.
Nielsen, R. and Wakeley, J. (2001) Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics, 158, 885-896.
Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276-277.
Swofford, D.L. (1998) PAUP*. Phylogenetic Analysis Using Parsimony (* and Other Methods). Version 4.0. Sinauer Associates, Sunderland, MA.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.