Bioinformatics Advance Access originally published online on August 12, 2007
Bioinformatics 2007 23(19):2636-2637; doi:10.1093/bioinformatics/btm391
pIPHULA—parallel inference of population parameters using a likelihood approach

1Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), 2University of Vienna, 3Medical University Vienna, 4University of Veterinary Medicine, Vienna, Austria and 5Department of Bioinformatics, Institute for Computer Sciences, Heinrich-Heine-University Düsseldorf, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: pIPHULA is a parallel program for estimating the parameters of a realistic model of population growth.
Availability: pIPHULA (http://www.cibiv.at/software/piphula) is written in ISO C. Parallel and sequential executables run on UNIX/Linux, Windows and MacOS systems. For (free) MPI libraries, see http://en.wikipedia.org/wiki/Message_Passing_Interface.
Contact: heiko.schmidt{at}univie.ac.at or ha.schmidt{at}web.de
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Population demographic factors provide the cornerstones of evolution within a species. They influence genetic diversity, govern evolutionary processes and, therefore, determine the evolutionary trajectories within lineages through time. In the concerted effort to understand human evolution and diversity as the basis for, e.g., medical and pharmaceutical applications, the effects of population growth on the molecular diversity of the genome have to be taken into account. Including population size changes in analyses of selection regimes is especially important, since population demography and selective forces produce similar patterns of neutral nucleotide diversity within populations (Depaulis et al., 2003; Nordborg, 2001). Multi-gene investigations are therefore essential to distinguish between these two evolutionary factors, since population demography influences the entire genome while selection is gene-specific. The reliable inference of population demographic factors thus forms the basis for further population-genetic investigations into the evolutionary processes governing populations. For the analysis of nucleotide diversity patterns, statistical coalescence-based approaches are available to infer population size changes reliably. Weiss and von Haeseler (1998) introduced a Monte Carlo simulation approach to infer the population demographic parameters of a realistic population growth model. Based on the original implementation, IPHULA, we present an improved and parallelized version of this program that facilitates the large-scale multi-gene studies needed to unravel population demographic parameters from the amounts of DNA sequence data available today.
| 2 MODELING POPULATION SIZE CHANGES |
|---|
Weiss and von Haeseler (1998) suggested a three-parameter model (θ, τ, ρ) of an exponential population size change. The model takes into account the population size θ = 4N0µ before the growth or decline event started, with N0 denoting the initial effective population size and µ the mutation rate per gene and generation. The starting point τ of the population size change event is considered in units of […]. ρ represents the ratio between the current and the initial population size.
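As a rough illustration of the kind of size trajectory such a three-parameter model describes, the sketch below computes the population size relative to its initial value at a time t before the present. The functional form (constant size further back than the start of the event, exponential change since then), the function name `relative_population_size`, and the argument names `tau` (time since the event started) and `rho` (ratio of current to initial size) are assumptions made for this example. They are not taken from the IPHULA code, and the time unit is left abstract.

```python
def relative_population_size(t, tau, rho):
    """Population size at time t before the present, relative to the initial size.

    Assumed form (not taken from the source): the size is constant further
    back than tau, then changes exponentially so that it reaches rho times
    the initial size at the present (t = 0).
    """
    if tau <= 0 or rho <= 0:
        raise ValueError("tau and rho must be positive")
    if t < 0:
        raise ValueError("t is measured backwards from the present and must be >= 0")
    if t >= tau:
        return 1.0  # before the size change started
    # Exponential interpolation between rho (at t = 0) and 1 (at t = tau).
    return rho ** (1.0 - t / tau)
```

With `rho > 1` the function describes growth towards the present, with `rho < 1` a decline, and with `rho = 1` a constant population.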
Population size changes produce typical distributions of neutral genetic variation in populations that provide information on the three parameters of the model. Inference is based on the mean pairwise difference K and the number of polymorphisms S observed in DNA sequence data. A maximum-likelihood approach is applied to estimate the model parameters based on K and S. Weiss and von Haeseler (1998) suggested a Monte Carlo approach to obtain these maximum-likelihood estimates: it approximates the likelihood surface for the three relevant population parameters by coalescent simulations on a 3D grid. Each simulation is conducted in two steps. First, genealogies are produced; their coalescence times are determined according to the coalescent approximation for populations with deterministically varying population sizes (Griffiths and Tavaré, 1994). Second, a sample of sequences is generated by evolving an ancestral sequence along the genealogy. Here, the mutation process runs independently of the specific genealogy and is determined by the nucleotide substitution model parameters set by the user. The simulated DNA sequence data sets are then analyzed with regard to the K- and S-statistics. The frequency with which the simulated data sets produce values in the same range as (K) or identical to (S) those found in the original data set approximates the likelihood lik(θ, τ, ρ | K, S) of the three model parameters.
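The two summary statistics and the frequency-based likelihood approximation described above can be sketched as follows. This is a minimal illustration, not the IPHULA implementation: the function names are hypothetical, the sequences are assumed to be aligned strings of equal length, and the width of the accepted range for K (`k_tolerance`) is an assumed parameter, since the text does not state how that range is chosen.

```python
from itertools import combinations


def mean_pairwise_difference(seqs):
    """K: average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """S: number of alignment columns that show more than one nucleotide."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def approx_likelihood(observed, simulated, k_tolerance):
    """Fraction of simulated (K, S) pairs that match the observed pair.

    A simulated data set counts as a match when its K lies within
    +/- k_tolerance of the observed K (assumed range definition) and
    its S is identical to the observed S.
    """
    if not simulated:
        raise ValueError("need at least one simulated data set")
    k_obs, s_obs = observed
    hits = sum(
        1
        for k_sim, s_sim in simulated
        if abs(k_sim - k_obs) <= k_tolerance and s_sim == s_obs
    )
    return hits / len(simulated)
```

Evaluating `approx_likelihood` once per grid point, each time with the (K, S) pairs simulated under that point's parameter values, would yield an approximation of the likelihood surface whose maximum gives the parameter estimates.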
These likelihood-based simulations, however, are computationally very intensive. Since investigations of population demography and history need to be conducted on multi-gene data sets, the runtime of the simulations has so far prevented large-scale studies from being completed in reasonable time.
| 3 THE SOFTWARE: PIPHULA |
|---|
To reduce the running time of the analysis, we developed an improved and parallel implementation of the IPHULA software.
Two steps were necessary to accomplish the parallelization. First, the previously sequential IPHULA version was analyzed in order to adapt the source code and data structures to distributed-memory parallelism. Second, in the actual parallelization step, portions of the Monte Carlo simulations on the 3D grid of parameters were distributed to different processes using MPI, the Message Passing Interface standard (Snir et al., 1998). MPI was preferred over OpenMP (Dagum and Menon, 1998), since the latter is restricted to shared-memory architectures, while MPI runs on almost all parallel platforms, including workstation clusters and multi-core PCs. In addition, preliminary tests showed superior performance of the MPI version over the OpenMP version.
We have optimized and parallelized the IPHULA program using a master/worker scheme. The master coordinates and distributes blocks of simulations to the workers, which actually perform the tasks (Fig. 1).
To ensure good performance on heterogeneous workstation clusters as well, we applied the smooth guided self-scheduling algorithm (SGSS; Schmidt et al., 2003), which dynamically assigns batches of simulations to free worker processes. SGSS has been shown to perform well by keeping worker processes equally busy while keeping the communication overhead low, even on very heterogeneous clusters (Petzold et al., 2006).
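To give an impression of how dynamically sized batches can be handed out, the sketch below implements classic guided self-scheduling, in which each request receives roughly the remaining work divided by the number of workers. This is a simplified stand-in: the smoothing that distinguishes SGSS is not described here and is therefore not reproduced, and the function name `guided_batches` and the `min_batch` parameter are assumptions made for this example.

```python
def guided_batches(n_tasks, n_workers, min_batch=1):
    """Yield (start, size) batches following classic guided self-scheduling.

    Each request receives ceil(remaining / n_workers) tasks, but never fewer
    than min_batch unless fewer tasks remain, so batches start large and
    shrink as the work runs out.
    """
    if n_workers < 1 or min_batch < 1:
        raise ValueError("n_workers and min_batch must be at least 1")
    start = 0
    remaining = n_tasks
    while remaining > 0:
        # Integer ceiling division avoids floating-point rounding.
        size = (remaining + n_workers - 1) // n_workers
        size = min(max(size, min_batch), remaining)
        yield start, size
        start += size
        remaining -= size
```

In a master/worker setting, the master would hand the next `(start, size)` batch to whichever worker reports back first. Large early batches keep the number of messages low, while small late batches let fast and slow workers finish at about the same time.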
In addition, the pIPHULA package includes scripts to easily generate figures (cf. Fig. 2) using the free statistics software R (http://www.r-project.org).
| 4 BENCHMARKS AND APPLICATION |
|---|
pIPHULA was applied to a data set by Freudenberg-Hua et al. (2003) of the neuronal amiloride-sensitive cation channel 2 (accn2) gene comprising 95 individuals representing the European human population. Benchmarks were performed on 1 (sequential run) to 20 processors of a 2 GHz AMD Opteron cluster. The inference procedure was based on 10 000 simulations for each set of parameters. The runtimes of 10 independent repetitions for each number of CPUs were averaged.
The parallel version of IPHULA shows an almost perfect speedup; only the master process leads to a slight loss in performance (Supplementary Fig. 1). With 20 processors, the complete Monte Carlo analysis of accn2 could be performed within 8.3 h, compared to 6.5 days on a single CPU.
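The two reported runtimes allow a quick check of the speedup claim. The sketch below does the arithmetic; the function name is hypothetical, and treating one of the 20 processes as a dedicated master that runs no simulations is an assumption based on the master/worker description, not a figure stated in the text.

```python
def speedup_and_efficiency(t_sequential_h, t_parallel_h, n_processes):
    """Return (speedup, parallel efficiency) for runtimes given in hours."""
    speedup = t_sequential_h / t_parallel_h
    return speedup, speedup / n_processes


T_SEQUENTIAL_H = 6.5 * 24  # 6.5 days on a single CPU = 156 h
T_PARALLEL_H = 8.3         # runtime reported for 20 processors

# Efficiency relative to all 20 processes.
speedup, efficiency_all = speedup_and_efficiency(T_SEQUENTIAL_H, T_PARALLEL_H, 20)
# Efficiency relative to 19 workers, assuming a dedicated master process.
_, efficiency_workers = speedup_and_efficiency(T_SEQUENTIAL_H, T_PARALLEL_H, 19)

print(f"speedup: {speedup:.1f}")
print(f"efficiency over 20 processes: {efficiency_all:.2f}")
print(f"efficiency over 19 workers: {efficiency_workers:.2f}")
```

The speedup comes out at roughly 18.8, about 94% efficiency over all 20 processes. Under the assumption of 19 working processes, the efficiency is close to 99%, which would be consistent with the statement that only the master process costs performance.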
Figure 2 shows the resulting likelihood surface for accn2. The maximum-likelihood estimates for the three population growth parameters ([…] = 100, […] = 0.16, […] = 0.1) show the signature of population growth over a comparatively long time span. However, this result has to be confirmed by multi-gene analyses to distinguish between demography and selection.
| ACKNOWLEDGEMENTS |
|---|
The authors would like to thank Gunter Weiss for providing helpful comments on the IPHULA code. A.V.H. and J.B. were supported by a grant from the Deutsche Forschungsgemeinschaft (DFG Ha 1628/7-1). H.A.S. and A.V.H. acknowledge financial support from the Wiener Wissenschafts-, Forschungs- und Technologiefonds (WWTF).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Present address: Federal Research Centre for Forestry and Forest Products, Institute of Forest Genetics and Forest Plant Breeding, Großhansdorf, Germany.
Associate Editor: Martin Bishop
Received on June 20, 2006; revised on July 8, 2007; accepted on July 27, 2007
| REFERENCES |
|---|
Dagum L, Menon R. OpenMP: an industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng. (1998) 5:46–55.
Depaulis F, et al. Power of neutrality tests to detect bottlenecks and hitchhiking. J. Mol. Evol. (2003) 57:S190–S200.
Freudenberg-Hua Y, et al. Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. Genome Res. (2003) 13:2271–2276.
Griffiths RC, Tavaré S. Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B (1994) 344:403–410.
Nordborg M. Coalescent theory. In: Handbook of Statistical Genetics, Balding DJ, et al., eds. (2001) Chichester: John Wiley and Sons.
Petzold E, et al. Phylogenetic parameter estimation on COWs. In: Parallel Computing for Bioinformatics, Zomaya AY, ed. (2006) New York: Wiley and Sons, 349–370.
Schmidt HA, et al. Molecular phylogenetics: parallelized parameter estimation and quartet puzzling. J. Parallel Distrib. Comput. (2003) 63:719–727.
Snir M, et al. MPI: The Complete Reference - The MPI Core, Vol. 1, 2nd edn. (1998) Cambridge, Massachusetts: The MIT Press.
Weiss G, von Haeseler A. Inference of population history using a likelihood approach. Genetics (1998) 149:1539–1546.

