Skip Navigation


Bioinformatics Advance Access originally published online on May 26, 2006
Bioinformatics 2006 22(19):2446-2451; doi:10.1093/bioinformatics/btl173
This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/19/2446    most recent
btl173v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wu, X.-L.
Right arrow Articles by Joyce, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wu, X.-L.
Right arrow Articles by Joyce, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Comments on ‘Bayesian hierarchical error model for analysis of gene expression data’

Xiao-Lin Wu 1,{dagger}, Larry J. Forney 1,* and Paul Joyce 2

1 Department of Biological Sciences, University of Idaho Moscow ID 83844-3051, USA
2 Department of Mathematics and Statistics, University of Idaho Moscow ID 83844-1103, USA

*To whom correspondence should be addressed. Email: lforney{at}uidaho.edu

Cho and Lee (2004) proposed a Bayesian hierarchical error model (HEM) to account for heterogeneous error variability in oligonucleotide microarray experiments. They estimated the parameters of their model using Markov Chain Monte Carlo (MCMC) and proposed an F-like summary statistic to identify differentially expressed genes under multiple conditions. Their HEM is one of the emerging Bayesian hierarchical modeling tools that have been developed for the analysis of multiple-level data structures and variation in microarray gene expression data (Broet et al., 2002; Tadesse and Ibrahim, 2004; Cho and Lee, 2004). In this letter, we first discuss the significance of the HEM developed by Cho and Lee. Then, we re-derive the fully conditional distributions for gene and conditional effects, since we think that these two fully conditional distributions were not presented properly in their paper. Finally, we expand the HEM to deal with biological or/and experimental correlations in gene expression data. A FORTRAN 90 program was developed to implement our extended method and it is available from the authors upon request.

Significance of HEM. Many aspects of the HEM presented in the article of Cho and Lee (2004) are significant. The following are the two aspects that we believe are important in the analysis of microarray gene expression data. First, error variability of gene expression data is decomposed into two components, biological and experimental. The former reflects variation owing to biological sample heterogeneity that affects gene expression, while the latter largely comes from variation in sample preparation and sample analysis. Estimating these two sources of errors separately will not only improve estimation of model parameters but also facilitate a better understanding of patterns of over all gene expression.

Second, and more importantly, the use of a Bayesian HEM is an effective way to represent complex relationships that have many parameters, because it allows the parameters to be modeled separately at several levels and provides a link (S) between the data (D) and the unobserved parameters ({theta}) of interest. In multilevel modeling, the top layer is the data, p(D|S,{theta}), which is modeled by an appropriate likelihood function. The data are assumed to be generated by some process that depends on certain unknown parameters. Modeling the link or process is the middle layer, where the art of statistical modeling takes place. On the bottom layer are the prior distributions that represent a priori beliefs about these parameters. Following the Bayesian inference, unknown parameters are inferred from the joint distribution of the parameters given the data

Formula 1(1)
Thus, Bayesian hierarchical modeling based on multivariate normal distributions effectively handles complex relationships that are governed by many parameters, and provides a tool for evaluating correlated gene expression data. For example, the model of Cho and Lee (2004) can be extended to allow for biological or/and experimental correlations. These will be discussed in more detail in the third section of this letter. Statistically, the estimation of these variables separately is biased when they are correlated (Johnston, 1972; Gianola and Sorensen, 2004). From a practical standpoint, ignoring correlations in gene expression data would result in high variability of statistical estimators and cause the reproducibility to deteriorate (Qiu et al., 2005).

Fully conditional distributions of gene (gi) and condition (cj) effects. We claim that Cho and Lee (2004) made a few mistakes when they derived the fully conditional distributions for the gene (gi) and condition (cj) effects. Let y be a vector containing all the data (yijkls), x a vector containing all the link variables (xijks) that correspond to expression intensities free from experimental errors, g a vector containing all the gene effects (gis), c a vector containing all the condition effects (cjs), r a vector containing all the interaction effects (rijs) of genes and conditions, and {Delta} a variance–covariance matrix with diagonal elements or variance Formula 1 and zero off-diagonal elements or covariance. Following their model specification and prior distributions, the posterior of unknown parameters, given the data (y), is


Formula 2

(2)
where {theta}L = {µ, g, c, r, {Delta}} is a collection of unknown parameters determining the link x, with µ, mij, nijk, Formula 1 being defined the same as µ, mi,j, ni,j,k, Formula 1 respectively, in the paper of Cho and Lee (2004). To ensure that our results are comparable with those of Cho and Lee (2004), we made the same assumptions regarding the prior distribution. That is, we also applied a flat prior to the grand mean µ such that p(µ) cancels out in (2) as a constant. [See Cho and Lee (2004) for a complete discussion of the implications of using flat priors.] To derive the fully conditional distribution for an unknown parameter, we simply pick from (2) the component or components that are related to the parameter of interest. First, consider the grand mean. Since

Formula 1
the fully conditional distribution of the overall mean, given the data and all other parameters, is


Formula 3

(3)
where Formula . In (3) and throughout this letter, we use ‘|·’ to mean ‘conditional on the data and all unknown parameters except the one currently under consideration’. For the effect that corresponds to the i-th gene, we pick components in (2) that are related to gi, which leads to

Formula 1
Thus, the fully conditional distribution of gi given the data and all other parameters is

Formula 4(4)
where Formula 4. Similarly, since

Formula 4
the fully conditional distribution of cj, the effect that corresponds to the j-th condition, is

Formula 5(5)
where Formula 5. Comparing (3), (4) and (5) with the first three equations in Appendix 1 of the paper of Cho and Lee (2004), we found that they applied a common quantity Formula for the three fully conditional distributions and this introduced systematic biases to the posterior estimates of model parameters. In our simulation study, we found this flaw to be dramatically downward-biased posterior estimations of both condition effects and gene effects, and it consequently affected posterior estimation of interaction effects and biological variances. Figure 1 illustrates how it affected posterior sampling of a randomly chosen gene effect toward zero. In contrast, using the corrected fully conditional distribution as shown in (4), we were able to obtain a posterior estimate for this gene effect that was close to its true value (i.e. 0.7923). This situation was representative of all gene and condition effects. However, we observed no significant influence of this flaw on the posterior estimation of µ and Formula 5. In other words, the posterior estimates of µ and Formula 5 were very close to their true values using either the algorithm of Cho and Lee (2004) or the corrected one that we propose. This is not very surprising if one considers the restrictions Formula 5 and Formula 5 in our data simulation. When these restrictions hold, bias in the posterior estimation of gi and cj did not significantly affect the posterior sampling of µ and Formula 5.


Figure 1
View larger version (54K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Comparison of Markov chain sampling of a gene effect using (a) previous algorithm by Cho and Lee (2004) and (b) corrected algorithm by Wu, Forney and Joyce. The simulated gene effect was 0.7923. The y-axis represents posterior estimates of the gene effect whereas the x-axis represents saved (i.e. every one-tenth) Markov chain sampling steps.

 
In their paper, the conditional distribution xijk|· was derived with replicates, as shown below:

Formula 6(6)
We think it was improper for them to reduce (6) to

Formula 6
when there is no experimental replication. By our reasoning, no replication is equivalent to putting nijk = 1, not nijk = 0 as stated in their paper, and Formula 6, implying that

Formula 6

Modeling biological/experimental correlations. Cho and Lee (2004) assumed independence for both biological and experimental errors, but these assumptions may be relaxed in the HEM. In this section, we expand their HEM to model biological or/and experimental correlation in gene expression data. First, consider correlations among experimental replicates, which may arise when experimental replicates are exposed to some common environmental factors yet collected at different times or locations. In these cases, the gene expression data may show temporal or spatial variation. In bacterial biofilms, for example, microorganisms undergo significant phenotypic changes during the transition from planktonic to biofilm forms, which reflects altered underlying gene expression and regulation as they adapt to form highly organized and structured sessile communities (Schembri et al., 2003; Watnick and Kolter, 2000) with enhanced resistance to antimicrobial treatments and host immune responses. Furthermore, the development of a biofilm may be tuned differently as a response to a nutrient gradient from the bulk media to the inside of the biofilm. Thus, gene expression at different locations in a biofilm may show spatial variation similar to geographic data in such a way that ‘everything is related to everything else, but near things are more related than distant things’ (Tobler, 1970). Biofilms are the dominant mode of growth for bacteria in the environment, and are of great concern in medical, environmental and industry settings. Therefore, considering the spatial correlation in a model would help better understanding the formation of biofims as well as their altered biological activities.

Consider a balanced experimental design with observations on i = 1, ... , G genes under j = 1, ... , C conditions with k = 1, ... , m biological samples each measured in l = 1, 2, ... , n chips. Suppose that the n chips represent n experimental replicates (e.g. layers of a biofilm). Let Rn be a n x n spatial variance–covariance matrix such that the residuals e ~ N(0, Formula 6), where Formula 6 is an m* x m* identity matrix (m* = G x C x m) and {otimes} indicates the Kronecker product operation. Thus, in the posterior formulation as shown in (2), Formula 6 is then replaced with p(y|x, Rn) and the latter is given as

Formula 6

If we assume that the prior distribution of Rn to be an inverted Wishart distribution with scale matrix Ve with a degree of freedom parameter {kappa}e [denoted as Wishart–1 ({kappa}e, Ve)], it can be shown that posterior samples of Rn are sampled from the following:

Formula 7(7)
where Se is a n x n matrix with the (l, l) diagonal element being Formula 7 and the (l, l') off-diagonal element being Formula 7. We assumed inverted Wishart distributions instead of inverted multivariate gamma distributions as the prior distributions for the variance–covariance matrix for the sake of convenience in programming. Next, each element in x is sampled from the following fully conditional distribution

Formula 8(8)
where µ* = µ + gi + cj + rij, {gamma}ll'is the (l, l') element of matrix Formula 8, tr represents the trace matrix operation. Under the assumption of common experimental error variance (i.e. Formula 8) with zero covariance (i.e. {gamma}ll' = 0), (8) is reduced to (6).

To model correlated biological errors, we similarly define a m x m covariance matrix {Delta}m such that bijk ~ N(0, Ic* {otimes} {Delta}m where Ic* is an C* x C* identity matrix (C* = G x C). Biological covariance matters in analyzing gene expression data when samples are collected from a single experimental system. Then, p(x|µ, g, c, r{Delta}) in (2) is replaced with the following

Formula 8
Again, we assume the prior distribution for {Delta}m to be an inverted Wishart distribution with scale matrix Vb and degree of freedom parameter {kappa}b, leading to the following fully conditional distribution of {Delta}m.

Formula 9(9)
where Sb is a m x m matrix with the diagonal element Formula 9 and the off-diagonal element Formula 9. The fully conditional distributions for elements in g, c and r are derived as below:

Formula 10(10)
where Formula 9 is the (k,k') element in the inverse matrix Formula 9, Formula 9 and

Formula 11(11)
where

Formula 12(12)
where Formula 12 and Formula 12.

Formula 13(13)
where Formula 13 and Formula 13. Under the assumption of a non-zero biological covariance, (10), (11) and (12) are reduced to (3), (4) and (5), respectively, and (13) reduced to the fully conditional distribution of rij by Cho and Lee (2004). When both biological and experimental errors are correlated, (8) becomes

Formula 14(14)
To show how our expanded model works, we simulated the expression data of 1000 genes, which were measured using 5 biological samples with 3 experimental replicates of each and 10 biological samples each with 6 experimental replicates of each. Briefly, the grand mean was given as a constant. Gene effects were generated from Formula 13, condition effects from Formula 13, and interaction effects from Formula 13 , where Formula 13, Formula 13 and Formula 13. Biological and experimental errors were randomly generated either from Formula 13 and Formula 13, or from Formula 13 and Formula 13, where n* = GxCxmxn, Formula 13 . We ran 10 independent Markov chains, each consisting of 100 000 updates for each dataset and 1000 burn-in updates. However, to reduce correlation between successive updates, we used only 1 out of every 10 posterior samples generated for a total of 10 000 saved updates per dataset. Model performance was evaluated based on decomposition of mean squared errors (MSE) of each parameter (say gi) such that

Formula 15(15)
where

Formula
represented sampling variation and

Formula 15
represented squared bias, and Formula 15 was an estimate of gi at saved iteration t=1, 2, ... , T.

Our extended model facilitates estimating patterns of biological or/and experimental (co)variance or correlations (Fig. 2) whereas such information could not be inferred using the previous model (Cho and Lee, 2004). As mentioned before, biological or experimental correlations could be of considerable interest and necessary to help us better understand the nature of correlated gene expression. For example, suppose the pattern of experimental (co)variance in Figure 2b represents the environmental correlations in a biofilm. We might conclude that nutrient differences result in higher correlations between the inner and bottom layers than between the top and the other two layers.


Figure 2
View larger version (45K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 3D patterns of estimated (co)variance components among (a) five biological sample and (b) three experimental replicates.

 
Further, we investigated how different model specifications about biological/experimental covariance (i.e. zero versus non-zero) would affect the posterior estimates of parameters of interest. Ignoring experimental covariance in the model did not lead to significant differences in parameter estimation (data not shown). However biological covariance components are important for estimating gene and interaction effects. If the biological covariance components are ignored then gene effects and interaction effects will be estimated with large biases. However, grand mean, condition effects and the common residual variance appear to be more robust, and gave similar results under either model specification about biological (co)variance. As shown in Table 1, considering non-zero biological (co)variance in the model considerably decreased estimation biases of gene and interaction effects. Squared biases, and consequently MSE, of these parameters were smaller in the analysis considering biological covariance components than by ignoring them. Given the variation that occurs during the collection of gene expression data, posterior estimation of gene and interaction effects was better, in terms of MSE, sampling variation and squared biases, when more samples were used (Table 1). This result would challenge the present practice where a very limited number (say three) of samples are used in most gene expression DNA array experiments.


View this table:
[in this window]
[in a new window]

 
Table 1 Mean squared error (MSE), sampling variance (SV) and squared bias (SB) in posterior estimation of model effects under the assumption of zero [COV(b) = 0] versus non-zero [COV(b) != 0] biological covariancea,b

 
Finally, we show how model specifications about biological covariance would affect the estimation of gene expression difference, (µ + gi + cj + rij) – (µ + gi + cj' + rij') = (cjcj') + (rij–rij') between various conditions (Fig. 3), which is of importance in identifying differential gene expression. Clearly, the estimated gene expression differences were in better agreement with true differences when considering non-zero biological covariance as compared to ignoring them in the model. In the former case, the R-square value for the linear regression of the estimated and true expression differences was higher and the regression slope is closer to 1.0. The low slope in Figure 3b suggests an obvious systematic bias, which resulted from ignoring non-zero biological covariance. On average, the higher the biological correlations were, the greater the difference. In 10 independent simulations starting with different datasets, we found that 6.5–38.7% of the top 100 differentially expressed genes could vary when one assumes zero or non-biological covariance, respectively.


Figure 3
View larger version (44K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Regression of estimated to true gene expression difference obtained by assuming (a) zero versus (b) non-zero biological covariance components.

 
In conclusion, we extended the model of Cho and Lee (2004) to allow the analysis of gene expression data with biological/environmental correlations, which not only facilitate estimating biological/expression correlation patterns, but it also improved estimation of model parameters and consequently identifying differentially expressed genes in DNA array experiments. A FORTRAN 90 program was developed to implement our extended method and it is freely downloadable at http://www.ibest.uidaho.edu/tools/banal_array/


    FOOTNOTES
 
{dagger}Current address: Department of Animal Sciences, University of Wisconsin, Madison, WI 53706, USA Back

Associate Editor: John Quackenbush

Received on January 30, 2006; revised on April 3, 2006; accepted on April 29, 2006

    REFERENCES
 TOP
 REFERENCES
 

    Broet, P., et al. (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol, . 9, 671–683[CrossRef][Web of Science][Medline].

    Cho, H. and Lee, J.K. (2004) Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics, 20, 2016–2025[Abstract/Free Full Text].

    Gianola, D. and Sorensen, D. (2004) Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. Genetics, 167, 1407–1424[Abstract/Free Full Text].

    Johnston, J.M. Econometric Methods, (1972) , New York McGraw-Hill.

    Oleksiak, M.F., et al. (2002) Variation in gene expression within and among natural populations. Nat. Genet, . 32, 261–266[CrossRef][Web of Science][Medline].

    Qiu, X., et al. (2005) The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6, 120[CrossRef][Medline].

    Tadesse, M.G. and Ibrahim, J.G. (2004) A Bayesian hierarchical model for the analysis of Affymetrix arrays. Ann. N. Y. Acad. Sci, . 1020, 41–48[CrossRef][Web of Science][Medline].

    Tobler, W. (1970) Review of J. Forrester, ‘Urban Dynamics’. Geogr. Anal, . 2, 198–199.

    Schembri, M.A., et al. (2003) Global gene expression in Escherichia coli biofilms. Mol. Microbiol, . 48, 253–267[CrossRef][Web of Science][Medline].

    Watnick, P. and Kolter, R. (2000) Biofilm: city of microbes. J. Bacteriol, . 182, 2675–2579[Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
H. Cho and J. K. Lee
Response to comments on 'Bayesian Hierarchical Error Model for Analysis of Gene Expression Data'
Bioinformatics, October 1, 2006; 22(19): 2452 - 2452.
[Full Text] [PDF]


This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/19/2446    most recent
btl173v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wu, X.-L.
Right arrow Articles by Joyce, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wu, X.-L.
Right arrow Articles by Joyce, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?