Skip Navigation


Bioinformatics Advance Access originally published online on July 17, 2008
Bioinformatics 2008 24(17):1966-1967; doi:10.1093/bioinformatics/btn329
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/17/1966    most recent
btn329v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Dancik, G. M.
Right arrow Articles by Dorman, K. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dancik, G. M.
Right arrow Articles by Dorman, K. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

mlegp: statistical analysis for computer models of biological systems using R

Garrett M. Dancik 1,2 and Karin S. Dorman 1,2,3,*

1Program in Bioinformatics & Computational Biology, 2Department of Statistics and 3Department of Genetics, Development & Cell Biology, Iowa State University, Ames, IA 50010, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL METHODS
 3 SOFTWARE
 4 EXAMPLE APPLICATION: ANALYSIS...
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: Gaussian processes (GPs) are flexible statistical models commonly used for predicting output from complex computer codes. As such, GPs are well suited for the analysis of computer models of biological systems, which have been traditionally difficult to analyze due to their high-dimensional, non-linear and resource-intensive nature. We describe an R package, mlegp, that fits GPs to computer model outputs and performs sensitivity analysis to identify and characterize the effects of important model inputs.

Availability: http://www.biomath.org/mlegp

Contact: kdorman@iastate.edu

Supplementary information: See http://www.biomath.org/mlegp for a user manual and examples.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL METHODS
 3 SOFTWARE
 4 EXAMPLE APPLICATION: ANALYSIS...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Gaussian processes (GPs) commonly facilitate analysis of computer models that are high-dimensional, non-linear and resource-intensive (Santner et al., 2003) by serving as fast and accurate emulators of these models. GPs play a prominent role in computer model calibration and validation (Bayarri et al., 2007; Kennedy and O'Hagan, 2001)), as well as sensitivity analysis (SA) to rank inputs in order of importance [based on functional analysis of variance (FANOVA) decomposition] and to characterize their effects (through visual plots of main and two-way factor interactions) (Schonlau and Welch, 2006).

We describe an R package, mlegp, that implements GP modeling with power exponential correlation structure (Santner et al., 2003), the SA methods described in Schonlau and Welch (2006) and the modeling of functional computer model output described in Heitmann et al. (2006). In addition, mlegp extends previous GP models to handle stochastic computer output with non-constant (heteroscedastic) variance by no longer requiring a constant nugget term across observations.

The package is appropriate for what Kitano (2002) describes as simulation-based research in systems biology. In this context, computer models have been used to simulate gene expression and signal transduction pathways, e.g. in Escherichia coli (Dobrzynski et al., 2007); and infectious disease at the cellular level, e.g. Mycobacterium tuberculosis infection (Segovia-Juarez et al., 2004). We demonstrate the capabilities of mlegp by analyzing a computer model of parasitic infection.


    2 STATISTICAL METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL METHODS
 3 SOFTWARE
 4 EXAMPLE APPLICATION: ANALYSIS...
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The Gaussian process
Let zobs=[z(x(1)), ..., z(x(m))] be a vector of observed computer model outputs for m choices of the input vector x(i)=[xFormula, ..., xFormula]. We are interested in predicting output z(t) at untried input t. The correlation between any two computer model outputs is assumed to have the form


Formula 1

(1)
Let µ(x) be the unconditional mean E[z(x)]. Define the mean matrix


Formula

Under the GP model, computer output follows a multivariate normal distribution


Formula 2

(2)
where I is the mxm identity matrix, C{equiv}{Cij} from Equation (1), {sigma}Formula is the unconditional variance of mean computer model output and the nugget {sigma}Formula accounts for computer model stochasticity. For convenience, denote the variance–covariance matrix as V. Then, the GP predictive distribution of z(t) is normal with mean and variance


Formula

where r=[r1,...,rm]', with ri=cor(z(t), z(x(i))) following Equation (1). For more details, see Santner et al. (2003).


Figure 1
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Sensitivity analysis of a computer model of parasitic infection using the R package mlegp. (a) Computer model output, consisting of pathogen load over time, for 100 simulations obtained by varying inputs x1 through x5, (b) main effects for all inputs on pathogen load at 5 dpi, along with the percentage contribution of each effect to the total functional variance of the GP predictor and (c) main effect of x1 on pathogen load over time.

 
2.2 Sensitivity analysis
Schonlau and Welch (2006) describe SA of computer models using GPs. For independent marginal priors on the components of x, the total variance of the GP predictor can be decomposed into contributions from single and interacting inputs, a technique called FANOVA decomposition. The percent-age of total functional variance attributed to the main effect of an input or the interaction effect among inputs provides a measure of the importance of that effect. The main effect E[z(x)|zobs, xk] of the k-th input variable predicts output, given xk and known results zobs, by integrating against a prior {pi}(xk) on all remaining variables in x. The two-way interaction effect E[z(x)|zobs, xk, xl] is similarly defined. Main effects plots and contour plots conveniently illustrate main effects and two-factor interactions as functions of the model inputs xk and (xk,xl).


    3 SOFTWARE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL METHODS
 3 SOFTWARE
 4 EXAMPLE APPLICATION: ANALYSIS...
 ACKNOWLEDGEMENTS
 REFERENCES
 
The package mlegp, available in R (R Development Core Team, 2007)), finds maximum likelihood estimates of Gaussian process parameters using the likelihood that follows from Equation (2). The package extends previous GP models by allowing the user to replace the identity matrix I in (2) with a user-defined diagonal matrix N. This extension leads to more accurate GP emulators of heteroscedastic computer models when the variance is known or well estimated. Another approach to this problem is implemented in the R package tgp, which fits separate GPs to a partitioned input space using a fully Bayesian approach (Gramacy, 2007). Not all non-constant variance can be partitioned in this way. On the other hand, our model requires knowledge of the nugget matrix up to a multiplicative constant.

GPs with constant mean functions (i.e. µ(x){equiv}µ0) or linear functions in x are supported. For high-dimensional or functional output such as time-series data, the user can opt to fit independent GPs to individual outputs or, instead, to the most important principle component weights following singular value decomposition of the output (Heitmann et al., 2006). For each GP, the R package provides cross-validated diagnostics, performs FANOVA decomposition, and produces plots for all main and two-way factor interaction effects. Main effects for functional output can also be produced.


    4 EXAMPLE APPLICATION: ANALYSIS OF A COMPUTER MODEL OF DISEASE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL METHODS
 3 SOFTWARE
 4 EXAMPLE APPLICATION: ANALYSIS...
 ACKNOWLEDGEMENTS
 REFERENCES
 
The SA methods we describe have been used to analyze computer models with up to 40 input variables (Schonlau and Welch, 2006). For demonstration purposes, we use mlegp to analyze the effects of five input variables (x=[x1,...,x5]) in a computer model of Leishmania major infection (Dancik et al., 2006). Inputs are scaled between 0 and 1 and are described in Supplementary Material. Computer model output consists of pathogen load over time (Fig. 1a). Using mlegp, we fit a GP to 100 observations of pathogen load at 5 days post infection (dpi). Main effects for all inputs and their FANOVA contributions are provided in Figure 1b. Lastly, we use mlegp to calculate the main effect of x1 (pathogen growth rate) on the temporal evolution of pathogen load by fitting independent GPs to the six most important principle component weights (Fig. 1c). The SA shows that the input variable x1 is the most important input for determining pathogen load at 5 dpi and has a positive relationship with this response (Fig. 1b). Low values of x1 result in a gradual increase in pathogen load and a relatively longer infection, whereas high values of x1 result in a sharp increase in pathogen load and a higher peak, but a fast resolution of the infection (Fig. 1c).

In Supplementary Material, we report additional output from mlegp, including GP diagnostic plots, the FANOVA decomposition, and two-way interaction contour plots, for pathogen load at both 5 and 18 dpi. We also illustrate the advantage of using a non-constant nugget term for heteroscedastic computer model output.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL METHODS
 3 SOFTWARE
 4 EXAMPLE APPLICATION: ANALYSIS...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Funding: This work was funded by National Institutes of Health (GM068955) and United States Department of Agriculture (2001-52100-11506).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on October 7, 2007; revised on May 2, 2008; accepted on June 23, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL METHODS
 3 SOFTWARE
 4 EXAMPLE APPLICATION: ANALYSIS...
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bayarri M, et al. A framework for validation of computer models. Technometrics (2007) 49:138–154.[CrossRef][Web of Science]

    Dancik GM, et al. An agent-based model for Leishmania infection. Interj. Complex Sys. (2006) 1853.

    Dobrzynski M, et al. Computational methods for diffusion-influenced biochemical reactions. Bioinformatics (2007) 23:1969–1977.[Abstract/Free Full Text]

    Gramacy RB. tgp: an R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models. J. Stat. Soft. (2007) 19.

    Heitmann K, et al. Cosmic Calibration. Astrophys. J. (2006) 646:L1–L4.[CrossRef][Web of Science]

    Kennedy MC, O'Hagan A. Bayesian calibration of computer models. J.R. Stat. Soc. B (2001) 63:425–464.[CrossRef]

    Kitano H. Computational systems biology. Nature (2002) 420:206–210.[CrossRef][Web of Science][Medline]

    R Development Core Team. (2007) last accessed date July 15 2008. R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at URL http://www.R-project.org/.

    Santner TJ, et al. The Design and Analysis of Computer Experiments (2003) New York: Springer. .

    Schonlau M, Welch W. Screening the input variables to a computer model via analysis of variance and visualization. Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (2006) New York: Springer. 308–327.

    Segovia-Juarez JL, et al. Identifying control mechanisms of granuloma formation during M. tuberculosis infection using an agent-based model. J. Theor. Biol. (2004) 231:357–376.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/17/1966    most recent
btn329v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Dancik, G. M.
Right arrow Articles by Dorman, K. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dancik, G. M.
Right arrow Articles by Dorman, K. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?