Bioinformatics Advance Access originally published online on August 12, 2007
Bioinformatics 2007 23(19):2636-2637; doi:10.1093/bioinformatics/btm391

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

pIPHULA—parallel inference of population parameters using a likelihood approach

Heiko A. Schmidt 1,2,3,4, Arndt von Haeseler 1,2,3,4,* and Jutta Buschbom 5,†

1Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), 2University of Vienna, 3Medical University Vienna, 4University of Veterinary Medicine, Vienna, Austria and 5Department of Bioinformatics, Institute for Computer Sciences, Heinrich-Heine-University Düsseldorf, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MODELING POPULATION SIZE...
 3 THE SOFTWARE: PIPHULA
 4 BENCHMARKS AND APPLICATION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: pIPHULA is a parallel program to estimate the parameters of a realistic model of population growth.

Availability: pIPHULA (http://www.cibiv.at/software/piphula) is written in ISO C; parallel and sequential executables run on UNIX/Linux, Windows and MacOS systems. For (free) MPI libraries see http://en.wikipedia.org/wiki/Message_Passing_Interface.

Contact: heiko.schmidt@univie.ac.at or ha.schmidt@web.de

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Population demographic factors provide the cornerstones of evolution within a species. They influence genetic diversity, govern evolutionary processes and, therefore, determine the evolutionary trajectories within lineages through time. In the concerted effort to understand human evolution and diversity, as the basis for, e.g. medical and pharmaceutical applications, the effects of population growth on the molecular diversity of the genome have to be taken into account. Including population size changes in analyses of selection regimes is especially important, since population demography and selective forces produce similar patterns of neutral nucleotide diversity within populations (Depaulis et al., 2003; Nordborg, 2001). Here, multi-gene investigations are needed to distinguish between these two evolutionary factors, since population demography influences the entire genome while selection is gene-specific. The reliable inference of population demographic factors thus forms the basis for further population-genetic investigations into the evolutionary processes governing populations. For the analysis of nucleotide diversity patterns, statistical coalescent-based approaches are available to infer population size changes. A Monte Carlo simulation approach was introduced by Weiss and von Haeseler (1998) to infer the population demographic parameters of a realistic population growth model. Based on the original implementation, IPHULA, we present an improved and parallelized version of this program that facilitates the large-scale multi-gene studies necessary to unravel population demographic parameters based on the amounts of DNA sequence data available today.


    2 MODELING POPULATION SIZE CHANGES
Weiss and von Haeseler (1998) suggested a three-parameter model (θ, τ, ρ) of an exponential population size change. The model takes into account the population size θ = 4N0µ before the growth or decline event started, with N0 denoting the initial effective population size and µ the mutation rate per gene and generation. The starting point τ of the population size change event is considered in units of Formula in the past, and the population growth factor ρ represents the ratio between the current and the initial population size.
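To make the roles of the three parameters concrete, the following minimal sketch evaluates the scaled initial size and the relative population size backwards in time. The function names and the exact parameterization of the exponential phase (size ρ·N0 at present, changing exponentially until it reaches N0 at time τ, constant before that) are assumptions made for illustration and are not taken from the pIPHULA source code.

```python
def scaled_initial_size(n0, mu):
    """theta = 4 * N0 * mu, the scaled population size before the size change."""
    return 4.0 * n0 * mu


def relative_population_size(t, tau, rho):
    """Population size at time t in the past, relative to the initial size N0.

    Assumed parameterization (illustrative only): the size is rho * N0 at
    present (t = 0), changes exponentially going back in time, and equals N0
    for every t >= tau.
    """
    if t < 0:
        raise ValueError("t is measured backwards from the present and must be >= 0")
    if tau <= 0 or rho <= 0:
        raise ValueError("tau and rho must be positive")
    if t >= tau:
        return 1.0
    # Exponential interpolation between rho (present) and 1 (onset at tau).
    return rho ** (1.0 - t / tau)


if __name__ == "__main__":
    # rho > 1 describes growth, rho < 1 decline, rho = 1 constant size.
    for t in (0.0, 0.05, 0.1, 0.2):
        print(t, relative_population_size(t, tau=0.1, rho=100.0))
```

With ρ = 100 the relative size falls from 100 at the present to 1 at the onset τ; values of ρ below 1 describe a decline in the same way.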

Population size changes produce typical distributions of neutral genetic variation in populations that provide information on the three parameters of the model. Inference is based on the mean pairwise difference K and the number of polymorphisms S observed in DNA sequence data. A maximum-likelihood approach is applied to estimate the model parameters based on K and S. Weiss and von Haeseler (1998) have suggested a Monte Carlo approach to obtain maximum-likelihood estimates of the parameters. This approach approximates the likelihood surface for the three relevant population parameters by coalescent simulations on a 3D grid. Each simulation is conducted in two steps. First, genealogies are produced. Their coalescent times are determined according to the coalescent approximation for populations with deterministically varying population sizes (Griffiths and Tavaré, 1994). In a second step, a sample of sequences is generated by evolving an ancestral sequence along the genealogy. Here, the mutation process runs independently of the specific genealogy and is determined by the nucleotide substitution model parameters set by the user. The simulated DNA sequence data sets are then analyzed with regard to the K- and S-statistics. The frequency with which the simulated data sets produce K-values in the same range as, and S-values identical to, those found in the original data set approximates the likelihood lik(θ, τ, ρ|K, S).

These likelihood-based simulations, however, are computationally very intensive. Since investigations of population demography and history need to be conducted on multi-gene data sets, the runtime of the simulations has so far prevented large-scale studies from being completed in reasonable time.
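The two summary statistics and the frequency-based likelihood approximation can be sketched as follows. This is an illustration under stated assumptions, not the IPHULA implementation: the function names are hypothetical, S is counted as the number of alignment columns with more than one state, and the width of the "same range" for K (`k_tolerance`) is a free parameter chosen here for the example.

```python
from itertools import combinations


def summary_statistics(sequences):
    """Return (K, S) for a list of aligned sequences of equal length."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # S: number of alignment columns showing more than one state.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    # K: number of differences averaged over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    return k, s


def approximate_likelihood(simulated_stats, k_obs, s_obs, k_tolerance):
    """Fraction of simulated (K, S) pairs matching the observed data.

    A simulation counts as a match when its S equals s_obs and its K lies
    within k_tolerance of k_obs (the tolerance is an assumption).
    """
    if not simulated_stats:
        raise ValueError("no simulations supplied")
    hits = sum(
        1 for k, s in simulated_stats
        if s == s_obs and abs(k - k_obs) <= k_tolerance
    )
    return hits / len(simulated_stats)


if __name__ == "__main__":
    k_obs, s_obs = summary_statistics(["ACGT", "ACGA", "ATGA"])
    # Hypothetical (K, S) values standing in for coalescent simulations
    # at one grid point (theta, tau, rho).
    simulated = [(1.3, 2), (1.5, 2), (1.3, 3), (2.5, 2)]
    print(k_obs, s_obs, approximate_likelihood(simulated, k_obs, s_obs, 0.2))
```

Repeating the second function for every point of the 3D parameter grid yields the approximated likelihood surface; the grid point with the highest frequency is the maximum-likelihood estimate.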


    3 THE SOFTWARE: PIPHULA
To reduce the running time of the analysis, we developed an improved and parallel implementation of the IPHULA software.

To accomplish the parallelization, two steps were necessary. First, the previously sequential IPHULA version was analyzed in order to adapt the source code and data structures to distributed-memory parallelism. In the actual parallelization step, portions of the Monte Carlo simulations on the 3D grid of parameters had to be distributed to different processes using MPI, the message passing interface standard (Snir et al., 1998). MPI was preferred over OpenMP (Dagum and Menon, 1998), since the latter is restricted to shared-memory architectures, while MPI can be run on almost all parallel platforms including workstation clusters and multi-core PCs. In addition, preliminary tests showed the MPI version to perform better than the OpenMP version.

We have optimized and parallelized the IPHULA program using a master/worker scheme. The master coordinates and distributes blocks of simulations to the workers, which actually perform the tasks (Fig. 1).
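The division of labour in a master/worker scheme can be illustrated with a small sketch. It is not the pIPHULA code: threads inside one Python process stand in for MPI processes, `simulate_block` is a hypothetical placeholder for the Monte Carlo simulations, and the fixed block size is an assumption made to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate_block(block):
    """Worker task: process one block of grid points.

    Hypothetical stand-in for the simulations; it returns a dummy value
    (the sum of the parameters) for each grid point.
    """
    return [(point, sum(point)) for point in block]


def run_master(thetas, taus, rhos, block_size, n_workers):
    """Master: split the 3D grid into blocks and hand them to free workers."""
    grid = list(product(thetas, taus, rhos))
    blocks = [grid[i:i + block_size] for i in range(0, len(grid), block_size)]
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each idle worker picks up the next pending block; the master
        # only collects the results.
        for block_result in pool.map(simulate_block, blocks):
            results.update(block_result)
    return results


if __name__ == "__main__":
    surface = run_master([1, 2], [3, 4], [5, 6], block_size=3, n_workers=2)
    print(len(surface), "grid points evaluated")
```

In an MPI setting the same roles would be played by separate processes exchanging messages, with the master sending blocks and receiving results instead of sharing memory.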


Fig. 1. Parallelization scheme of pIPHULA.

 
To ensure good performance even on heterogeneous workstation clusters, we applied the smooth guided self-scheduling algorithm (SGSS, Schmidt et al., 2003) that dynamically assigns batches of simulations to free worker processes. SGSS has already been shown to give good performance by keeping worker processes equally busy while keeping the communication overhead low, even on very heterogeneous clusters (Petzold et al., 2006).

In addition, the pIPHULA package also includes scripts to easily generate figures (cf. Fig. 2) using the free statistics software R (http://www.r-project.org).
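The idea of shrinking batch sizes can be illustrated with the classic guided self-scheduling rule, in which each requesting worker receives the remaining work divided by the number of workers, rounded up. Note that this sketch shows plain guided self-scheduling only; the smoothing that distinguishes SGSS is described in Schmidt et al. (2003) and is not reproduced here. The function name is hypothetical.

```python
def guided_chunks(n_tasks, n_workers):
    """Batch sizes handed out under classic guided self-scheduling.

    Every request is answered with ceil(remaining / n_workers) tasks, so
    batches start large (low communication overhead) and shrink towards
    the end (good load balance).
    """
    if n_tasks < 0 or n_workers < 1:
        raise ValueError("n_tasks must be >= 0 and n_workers >= 1")
    chunks = []
    remaining = n_tasks
    while remaining > 0:
        chunk = (remaining + n_workers - 1) // n_workers  # integer ceiling
        chunks.append(chunk)
        remaining -= chunk
    return chunks


if __name__ == "__main__":
    print(guided_chunks(100, 4))
```

Because batches are assigned only when a worker asks for more, a fast machine simply requests batches more often than a slow one, which is what makes this family of schedules suitable for heterogeneous clusters.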


Fig. 2. Likelihood surface of the gene accn2 for the investigated sample of European humans (darker = higher likelihoods). Results for all ρ values used in the simulation are presented in Supplementary Figure 2.

 

    4 BENCHMARKS AND APPLICATION
pIPHULA was applied to a data set by Freudenberg-Hua et al. (2003) of the neuronal amiloride-sensitive cation channel 2 (accn2) gene comprising 95 individuals representing the European human population. Benchmarks were performed on 1 (sequential run) to 20 processors of an AMD 2 GHz Opteron cluster. The inference procedure is based on 10 000 simulations for each set of parameters. The runtimes of 10 independent repetitions for each number of CPUs were averaged.

The parallel version of IPHULA achieves an almost perfect speedup; only the master process leads to a slight loss in performance (Supplementary Fig. 1). With 20 processors the complete Monte Carlo analysis of accn2 could be performed within 8.3 h, compared to 6.5 days on a single CPU.

Figure 2 shows the resulting likelihood surface for accn2. The maximum-likelihood estimates for the three population growth parameters (ρ = 100, θ = 0.16, τ = 0.1) show the signature of population growth over a comparably long time span. However, this result has to be confirmed by multi-gene analyses to distinguish between demography and selection.
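The reported runtimes can be turned into speedup and efficiency figures with a short calculation. The sketch below uses only the two runtimes quoted above; treating 19 of the 20 processors as workers (with one acting as master) is an assumption made here to illustrate the effect of the master process.

```python
def speedup_and_efficiency(sequential_hours, parallel_hours, n_units):
    """Return (speedup, efficiency) for a run on n_units processing units."""
    speedup = sequential_hours / parallel_hours
    return speedup, speedup / n_units


if __name__ == "__main__":
    sequential_hours = 6.5 * 24  # 6.5 days on a single CPU
    parallel_hours = 8.3         # 20 processors

    speedup, eff_all = speedup_and_efficiency(sequential_hours, parallel_hours, 20)
    # Assumption: one processor acts as master, leaving 19 workers.
    _, eff_workers = speedup_and_efficiency(sequential_hours, parallel_hours, 19)

    print(f"speedup: {speedup:.1f}")
    print(f"efficiency over 20 processors: {eff_all:.2f}")
    print(f"efficiency over 19 workers: {eff_workers:.2f}")
```

The speedup comes out at roughly 18.8, that is, close to the number of workers rather than the number of processors, which is consistent with the statement that the master process accounts for most of the loss.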


    ACKNOWLEDGEMENTS
The authors would like to thank Gunter Weiss for providing helpful comments on the IPHULA code. A.V.H. and J.B. were supported by a grant of the Deutsche Forschungsgemeinschaft (DFG Ha 1628/7-1). H.A.S. and A.V.H. acknowledge financial support by the Wiener Wissenschafts-, Forschungs- und Technologiefonds (WWTF).

Conflict of Interest: none declared.


    FOOTNOTES
 
†Present address: Federal Research Centre for Forestry and Forest Products, Institute of Forest Genetics and Forest Plant Breeding, Großhansdorf, Germany.

Associate Editor: Martin Bishop

Received on June 20, 2006; revised on July 8, 2007; accepted on July 27, 2007

    REFERENCES

    Dagum L, Menon R. (1998) OpenMP: an industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng., 5, 46–55.

    Depaulis F, et al. (2003) Power of neutrality tests to detect bottlenecks and hitchhiking. J. Mol. Evol., 57, S190–S200.

    Freudenberg-Hua Y, et al. (2003) Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. Genome Res., 13, 2271–2276.

    Griffiths RC, Tavaré S. (1994) Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B, 344, 403–410.

    Nordborg M. (2001) Coalescent theory. In: Balding DJ, et al. (eds), Handbook of Statistical Genetics. John Wiley and Sons, Chichester.

    Petzold E, et al. (2006) Phylogenetic parameter estimation on COWs. In: Zomaya AY (ed.), Parallel Computing for Bioinformatics. Wiley and Sons, New York, pp. 349–370.

    Schmidt HA, et al. (2003) Molecular phylogenetics: parallelized parameter estimation and quartet puzzling. J. Parallel Distrib. Comput., 63, 719–727.

    Snir M, et al. (1998) MPI: The Complete Reference - The MPI Core, Vol. 1, 2nd edn. The MIT Press, Cambridge, Massachusetts.

    Weiss G, von Haeseler A. (1998) Inference of population history using a likelihood approach. Genetics, 149, 1539–1546.

