Skip Navigation


Bioinformatics Advance Access originally published online on November 30, 2006
Bioinformatics 2007 23(3):328-335; doi:10.1093/bioinformatics/btl612
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/3/328    most recent
btl612v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lo, K.
Right arrow Articles by Gottardo, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lo, K.
Right arrow Articles by Gottardo, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Flexible empirical Bayes models for differential gene expression

Kenneth Lo * and Raphael Gottardo

Department of Statistics, University of British Columbia 333-6356 Agricultural Road, Vancouver, BC, Canada V6T 1Z2

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 

Motivation: Inference about differential expression is a typical objective when analyzing gene expression data. Recently, Bayesian hierarchical models have become increasingly popular for this type of problem. The two most common hierarchical models are the hierarchical Gamma–Gamma (GG) and Lognormal–Normal (LNN) models. However, to facilitate inference, some unrealistic assumptions have been made. One such assumption is that of a common coefficient of variation across genes, which can adversely affect the resulting inference.

Results: In this paper, we extend both the GG and LNN modeling frameworks to allow for gene-specific variances and propose EM based algorithms for parameter estimation. The proposed methodology is evaluated on three experimental datasets: one cDNA microarray experiment and two Affymetrix spike-in experiments. The two extended models significantly reduce the false positive rate while keeping a high sensitivity when compared to the originals. Finally, using a simulation study we show that the new frameworks are also more robust to model misspecification.

Availability: The R code for implementing the proposed methodology can be downloaded at http://www.stat.ubc.ca/~c.lo/FEBarrays

Contact: c.lo{at}stat.ubc.ca

Supplementary information: The supplementary material is available at http://www.stat.ubc.ca/~c.lo/FEBarrays/supp.pdf


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 
As a natural development following the success of genome sequencing, DNA microarray technology has emerged for the sake of exploring the functioning of genomes (Schena et al., 1995). By exploiting the ability of a single-strand nucleic acid molecule to hybridize to a complementary sequence, researchers can simultaneously measure the expression levels of thousands of genes within a cell. A common task with microarray is to determine which genes are differentially expressed under two different conditions.

In recent years, there has been a considerable amount of work on the detection of differentially expressed genes. An early statistical treatment can be found in Chen et al. (1997) A common approach is to test a hypothesis for each gene using variants of t or F-statistics and then try to correct for multiple testing (Tusher et al., 2001; Efron et al., 2001; Dudoit et al., 2002). Due to the small number of replicates, variation in gene expression can be poorly estimated. Tusher et al. (2001) and Baldi and Long (2001) suggested using a modified t statistic where the denominator has been regularized by adding a small constant to the gene-specific variance estimate. Similar to an empirical Bayes approach this results in shrinkage of the empirical variance estimates towards a common estimate. Lönnstedt and Speed (2002) proposed an empirical Bayes normal mixture model for gene expression data, which was later extended to the two condition case by Gottardo et al. (2003) and to more general linear models by Smyth (2004) and Cui et al. (2005), though Smyth (2004) and Cui et al. (2005) did not use mixture models but simply empirical Bayes normal models for variance regularization. In each case, the authors derived explicit gene-specific statistics and did not consider the problem of estimating p the proportion of differentially expressed genes. Newton et al. (2001) developed a method for detecting changes in gene expression in a single two-channel cDNA slide using a hierarchical gamma–gamma (GG) model. Kendziorski et al. (2003) extended this to replicate chips with multiple conditions, and provided the option of using a hierarchical lognormal–normal (LNN) model. Both models are implemented in an R package called EBarrays (Empirical Bayes microarrays) and from now on we use the name EBarrays to refer to the methodology. Both EBarrays model specifications rely on the assumption of a constant coefficient of variation across genes. In this paper, we extend both models by releasing this assumption and introduce EM type algorithms for parameter estimation, thus extending the work of Lönnstedt and Speed (2002) and Gottardo et al. (2003) as well.

The structure of the paper is as follows. The extended forms of the two EBarrays hierarchical models and the estimation procedures are presented in Section 2. In Section 3, the performance of the extended models is examined on three experimental datasets and compared to five other baseline and commonly used methods. Section 4 presents a simulation study to further compare our empirical Bayes approach to the other methods. Finally, in Section 5 we discuss our results and possible extensions.


    2 A BAYESIAN FRAMEWORK FOR IDENTIFYING DIFFERENTIAL EXPRESSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 
2.1 A hierarchical model for measured intensities
In a typical microarray experiment, two conditions are compared for gene expression. Let us denote by Xgr and Ygr the intensities of gene g from the rth replicate in the two conditions, respectively. Measurements between the two conditions are assumed to be independent. The proposed model is an extension of the EBarrays framework (Newton et al., 2001; Kendziorski et al., 2003). Extensions to the original two types of model formulation are considered in turn below.

GG. Here, a Gamma distribution is used to model the measured intensities of a given gene. Explicitly, the probability density of Xgr (resp. Ygr) with shape and rate parameters ag and {theta}gx (resp. {theta}gy) is given by


Formula 1

(1)
To borrow strength across genes we assume an exchangeable Gamma(a0, {nu}) prior for the rate parameters, and a Lognormal({eta}, {xi})prior for the shape parameters. The Gamma prior is used for simplicity as it is conjugate to the sampling distribution (Newton et al., 2001) while the Lognormal prior is suggested by a histogram plot of the empirical shape parameters estimated by the method of moments (see Supplementary material). The hyperparameters a0, {nu}, {eta} and {xi} are assumed unknown and will be estimated as part of our approach.

The proposed model extends the EBarrays GG model by placing a prior on the shape parameter. In the original GG model, the shape parameter a was assumed to be constant and common to all genes whereas now it is gene specific. However, strength is borrowed across genes through the prior distribution. By ‘borrowing strength’, we mean that information from all genes is used when estimating ag, which comes from the hyperparmeters through the prior.

LNN. The second formulation is an extension of the EBarrays LNN framework. The intensities are assumed to be lognormally distributed, i.e. the log-transformed intensities are from a normal distribution, and we write log Xgr ~ N(µgx,Formula ) and log Ygr ~ N(µgy, Formula ), respectively. A conjugate prior is imposed on the mean µgx (resp. µgy) and precision {tau}gx (resp. {tau}gy). Explicitly, we set µgx|{tau}gx ~ N(m, Formula and {tau}gx ~ Gamma({alpha}, ß) respectively. In the original LNN model, the precision {tau} was assumed to be constant and common to all genes. Our proposed formulation extends the EBarrays model by releasing the assumption of a constant coefficient of variation Formula , which is equivalent to the assumption of a constant variance {tau}–1 on the log scale. Note that our proposed formulation is also the framework of Gottardo et al. (2003). However, in this paper we use an EM based algorithm to estimate the unknown parameters, including the proportion of differentially expressed genes.

On assuming a prior on both µgx (resp. µgy) and {tau}gx (resp. {tau}gy) common to all genes, strength is borrowed across genes through both means and variances of the distributions when making inferences. Again, we mean that information from all genes is used when estimating both µgx (resp. µgy) and {tau}gx (resp. {tau}gy). In particular, this is essential for variances—due to the small number of replicates variance estimates can be very noisy. Similar ideas have been used in Smyth (2004) and Cui et al. (2005), where the authors concentrated on variance regularization.

2.2 A mixture model for differential expression
We use a mixture model to identify differentially expressed genes. We assume that a priori {theta}gx = {theta}gy (resp. µgx = µgy) with probability 1 – p and {theta}gx != {theta}gy (resp. µgx != µgy) with probability p. For the former case, the model specification is just as stated in Section 2.1, while the latter case is modeled through setting the gene-specific parameters common to both conditions.

Let us denote by zg the indicator variable equal to one if there is real change in expression for gene g and zero otherwise. Then one can define the posterior probability of change, Pr(zg = 1|xg,yg,p,{psi}), where xg = (xg1, xg2, ... , xgRx)' and yg = (yg1, yg2, ... , ygRy)' and {psi} denotes the vector of unknown hyperparameters. Applying the Bayes rule, we obtain


Formula 2

(2)
where pA(xg, yg|{psi}) and p0(xg,yg|{psi}) denote the joint marginal density of the measured intensities of gene g under both the alternative (differential expression) and null (no differential expression) models respectively given {psi}. The marginal density for the extended LNN model can be computed explicitly and is given in Appendix. For the extended GG model only {theta}g can be integrated out, and the corresponding ‘conditional’ marginal density is given in Appendix. In the next section we describe an approximate estimation procedure to deal with this difficulty.

2.3 Parameter estimation using the EM-algorithm
Here we start with the extended LNN model as the estimation procedure is straightforward. The vector of unknown parameters {Phi} = ({psi}',p)', where {psi} = (m, k, {alpha}, ß)', can be estimated by maximizing the integrated likelihood using the EM-algorithm (Dempster et al., 1977). The estimation of p is important since it calibrates the posterior probability of change for multiple testing, as seen in (2). Such estimation is also part of some multiple testing procedure such as Storey's q-value (Storey, 2003). Estimation of the parameter p can be difficult (Smyth, 2004; Bhowmich et al., 2006), and as suggested by Newton et al. (2001) we place a Beta(2,2) prior over p, which avoids numerical issues when p gets close to 0 or 1. Given the large number of genes, the prior on p has essentially no effect on the final estimation, and thus on the number of genes called differentially expressed.

Treating the zg's as missing data, the complete data log-likelihood is given by


Formula 3

(3)
During the E-step, the expectation is obtained by replacing zg by Formula as given by (2) while the M-step consists of maximizing the conditional expectation with respect to the parameter vector {Phi} = ({psi}',p)'. At convergence, the estimated parameters can be substituted into (2) to compute the posterior probability of change for each gene.

Because the prior of the extended GG model is not conjugate to the sampling distribution, only the marginal density conditional on ag is analytically available for each gene. We refer to it as the conditional marginal density. To incorporate information about the prior for the ag's, we propose to estimate the hyperparameters {eta} and {xi} beforehand through an empirical Bayes approach using the method of moments (see Appendix for details), and add log[{pi}(ag|{eta}, {xi})] to the log conditional density as a penalty term. Again, treating the zg's as missing data, the corresponding modified complete data log-likelihood can be written as


Formula 4

(4)
where {psi} = (a0, {nu})'. The vector of parameters to be estimated becomes {Phi} = (a1, a2, ... , aG, {psi}', p)'.

Similar to the extended LNN model, we can use the EM algorithm to maximize the modified marginal likelihood. During the E-step to obtain the conditional expectation of the modified complete data log-likelihood zg in (4) is replaced by Formula as in (2). The M-step consists of maximizing Formula given the current zg's. Such maximization can be difficult given the high-dimensionality of {Phi} and here we suggest to exploit the conditional structure of the model during the maximization step, namely that given {psi} and p, the genes are conditionally independent and each ag can be maximized over separately. Let us split the unknown parameters into two groups, namely, {Phi}1 = (a1, a2, ... , aG)' (gene-specific shape parameters) and {Phi}2 = ({psi}',p)' (global parameters). Then the M-step would consist of iteratively maximizing over {Phi}1 given {Phi}2 and {Phi}2 given {Phi}1. Here, we decided to maximize over {Phi}1 only once during the first iteration to reduce the computational burden, and then take EM-iterations with respect to {Phi}2 only until convergence. It turns out that the estimates obtained were very similar to the ones obtained when maximizing over both {Phi}1 and {Phi}2, while significantly reducing the computing time.

Details about the estimation of ({eta}, {xi}) and initialization of the EM algorithm can be found in Appendix.


    3 APPLICATION TO EXPERIMENTAL DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 
3.1 Data description
To illustrate our methodology we use three publicly available microarray datasets: one cDNA experiment and two Affymetrix spike-in experiments. All three have the advantage that in each case the true state (differentially expressed or not) of all or some of the genes is known.

The HIV-1 data. The expression levels of 4608 cellular RNA transcripts were measured 1 h after infection with human immunodeficiency virus type 1 (HIV-1) using four replicates on four different slides. 13 HIV-1 genes have been included in the set of RNA transcripts to serve as positive controls, i.e. genes known in advance to be differentially expressed. Meanwhile, 29 non-human genes have also been included and act as negative controls, i.e. genes known to be not differentially expressed. Another dataset was obtained by repeating the four aforementioned experiments but with an RNA preparation different from that for the first dataset. For easy reference, in this paper we label the two datasets as HIV-1A and HIV-1B, respectively. See van't Wout et al. (2003) for more details of the HIV-1 data. The data were lowess normalized using a global lowess normalization step (Yang et al., 2002).

The HGU95A Spike-in data. This dataset was obtained from a spike-in study by Affymetrix used to develop and validate the MAS 5.0 (Affymetrix Manual, 2001) platform. The concentrations of 14 spiked-in human gene groups in 14 groups of HGU95A GeneChip© arrays were arranged in a Latin square design. The concentrations of the 14 groups in the first array group are 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pM, respectively. Each subsequent array group rotates the spike-in concentrations by one group such that each human gene was spiked-in at a particular concentration level on exactly one array group, and each concentration level came with exactly one spiked-in gene group in each array group. There are three technical replicates in each array group. The third array group has been removed from the analysis as one of its replicates was missing. We use a set of 16 spiked-in genes in our list in recognition of the extras reported by Hsieh et al. (2003) and Cope et al. (2004). Analysis is performed on each set of probe summary indices computed using gcRMA (Wu et al., 2004), RMA (Irizarry et al., 2003b), MAS 5 and dChip (Li and Wong, 2001), respectively.

The HGU133A Spike-in data. This dataset was obtained from another spike-in study done with HGU133A arrays. A total of 42 spiked-in genes were organized in 14 groups, and the concentration used were 0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512 pM. The arrangement of the spike-in concentrations was similar to the Latin square design stated above. Again, there are three technical replicates in each array group. For more information see Irizarry et al. (2003a). In addition to the original 42, we claim that another 20 genes should also be included in the spiked-in gene list as they consistently show significant differential expression across the array groups in the exploratory data analysis. Similar observations have been made by Sheffler et al. (2005). Moreover, the probe sets of three genes contain probe sequences exactly matching those for the spiked-ins. These probes should be hybridized by the spike-ins as well. As a result, our expanded spiked-in gene list contains 65 entries in total.

3.2 Results
We compare our proposed methods—extended GG (eGG) and extended LNN (eLNN) models—to five other methods, namely, EBarrays GG and LNN models, the popular Significance Analysis of Microarrays (SAM) (Tusher et al., 2001), Linear Models for Microarray data (LIMMA) (Smyth, 2004), and a fully Bayesian approach named BRIDGE (Gottardo et al., 2006a). The results have been organized in Tables 13.


View this table:
[in this window]
[in a new window]

 
Table 1 Analysis of differential expression with the HIV-1 data

 


View this table:
[in this window]
[in a new window]

 
Table 2 Analysis of differential expression with the HGU95A spike-in data

 


View this table:
[in this window]
[in a new window]

 
Table 3 Analysis of differential expression with the HGU133A spike-in data

 
In the analysis of the HIV-1 data, we obtain the number of genes called differentially expressed (DE) for each method. Among those genes called DE, we look at the number of true positives (TP), i.e. genes known to be DE in advance, and the number of false positives (FPs), i.e. genes known to be not DE. Gottardo et al. (2006b) showed that one of the HIV genes, which was expected to be highly differentially expressed had a very small estimated log ratio and did not properly hybridize in the second experiment (HIV-1B). We removed the corresponding gene from the list of known differentially expressed genes. Thus there are 13 genes known to be DE in the first experiment and 12 in the second. To compare the performance between the seven methods, we intend to control the false discovery rate (FDR) at a fixed level of 0.1. The FDR cutoffs can be selected using a direct posterior probability calculation as described in Newton et al. (2004). For the HIV-1A dataset, when the FDR is controlled at 0.1, all methods can identify the 13 positive controls. Meanwhile, EBarrays LNN has made one FP. Similar result is observed when the HIV-1B dataset is considered. All methods detect 11 out of the 12 positive controls but both versions of EBarrays (GG and LNN) have made one FP. Concluded from the HIV-1 datasets, along with LIMMA, SAM and BRIDGE our proposed eGG and eLNN methods appear to perform the best as they recognize the most positive controls and do not get any FP.

For the HGU95A spike-in data, after removing the array group with one missing replicate, we have a set of 13 array groups. To evaluate the different methods we compare the first array group to the other array groups, leading to 12 comparisons. Since dChip may return negative probe summary indices, which cannot be processed by the aforementioned methods, those genes with negative summary indices were filtered out. This excluded 5.5 spike-ins on average. This time, since we know the actual status of each gene, we can check the true FDR of each method against the desired FDR. In addition, we look at the number of false negative (FNs) as a power assessment.

Unlike the results on the HIV-1 data, SAM does not show a competitive performance. A large number of FN (>11) have been observed with SAM for both gcRMA and RMA summary indices, considering that there are only 16 entries in our spiked-in gene list. eLNN and LIMMA have the actual FDR closest to the desired FDR in general, though they have a relatively large number of FN cases regarding MAS 5 and dChip summary indices. The actual FDRs for EBarrays GG and LNN methods are too high compared to the other methods, and our proposed extended versions have lowered the rates by a wide margin while keeping relatively small FN rates.

The HGU133A spike-in data have a set of 14 array groups, and therefore 13 comparisons have been made. A total of 14 out of 65 spiked-in genes on average have been filtered from the analysis with dChip due to negative summary indices. The relative performance of the six methods is similar to that for the HGU95A data. It is worth mentioning that eGG is the only method that can sustain the FN cases to a low number for all four types of probe summary indices, though its FDR is higher than the desired one. SAM has considerably more FN cases than the other methods for gcRMA and RMA, while its FDR is close to the desired one. Similarly, eLNN and LIMMA exhibit good FDR performance but with better FN rates. Again, the FDRs for EBarrays GG and LNN methods are at quite a high-level, while their extended versions (eGG and eLNN) have significantly reduced the rates while keeping relatively small FN rates.


    4 SIMULATION STUDIES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 
4.1 Data generation
We now use a series of simulation to study the performance of our empirical Bayes framework under different model specifications compared to the original EBarrays framework and the methods presented in Section 3.2. In order to do so, we generated data from the following models: EBarrays GG (a = 5, a0 = 0.8, {nu} = 15), EBarrays LNN (m = 5, {sigma}2 = 2, {tau}–1 = 0.25, {sigma}2 being the variance parameter of the prior of µgx or µgy), extended GG ({eta} = 2, {xi} = 1, a0 = 1, {nu} = 20) and extended LNN (m = 5, k = 12, {alpha} = 2, ß = 0.5). The values of the parameters are set in the proximity of the estimates from the HIV-1 data. We fixed the number of genes to 500, the number of replicates to three in each group and generated 100 datasets under each of the above models for two different values of p = {0.1,0.2}.

4.2 Results
The seven methods mentioned in Section 3.2 are applied to each simulated dataset to make inference about differential expression. Results are summarized graphically in two ways: a plot of the actual FDR against the desired FDR, and a plot of the number of FP against the number of FN. The curves show the average results across the 100 simulated datasets. For each dataset, results are collected by setting the cutoffs for the posterior probabilities or p-values at different points in turn in detecting differential expression.

As expected, the EBarrays GG and LNN models perform quite poorly compared to the eGG and eLNN models when the variance is not constant and clearly under estimate the FDR (Figs 1 and 2). On the other hand, the eGG and eLNN models are comparable to EBarrays when the variance is constant, showing that strength borrowing across genes is working well (Figs 3 and 4). Finally, both GG and eGG (resp. LNN and eLNN) appear to perform relatively well under LNN and eLNN (resp. GG and eGG) model specifications respectively. This confirms previous simulation studies (Kendziorski et al., 2003).


Figure 1
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Simulation results generated from the extended GG model.

 


Figure 2
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Simulation results generated from the extended LNN model.

 


Figure 3
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Simulation results generated from the EBarrays GG model.

 


Figure 4
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Simulation results generated from the EBarrays LNN model.

 
Overall, SAM is not performing very well and tend to under estimate the FDR by a large amount. Meanwhile, LIMMA and BRIDGE consistently show good performance for data generated from the four models, suggesting that they are good candidates for identifying differential expression under a wide variety of settings.


    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 
We have extended the EBarrays empirical Bayes framework for differential gene expression by releasing the constant coefficient of variation assumption, and introducing two algorithms that can be used for parameter estimation. Using both experimental and simulated data we have shown that the extended framework clearly improves the original framework. In addition, it appears that the eLNN model performs better than the eGG one as shown with the spike-in data, and that it is comparable to BRIDGE, a more computational fully Bayesian approach. This is not the case for the original EBarrays framework, where the GG model generally performs better. This confirms previous findings of Gottardo et al. (2006a) and suggests that EBarrays GG is more robust to the model misspecification of a constant coefficient of variation compared to the LNN formulation. However, when the EBarrays model formulations are extended and the constant coefficient of variation assumption is released, the LNN model seems more appropriate.

In spite of the complications accompanying the model enhancements relative to the original EBarrays framework, the proposed methodology remains to be highly competitive in terms of processing time. In the analysis with the HGU133A data of >20 000 genes, it takes about 5 min to complete the eGG or eLNN analysis of one comparison between the two array groups each with three replicates on the R platform.

In this paper, we have compared our approach with five alternatives, but there are many other methods for detecting differentially expressed genes with gene expression data. We chose these five because they are either obvious baseline methods or widely used; they are also representative of other methods. More comparisons between statistical tests can be found in Cui and Churchill (2003). Among explicit adjustments for multiple testing, we considered only the FDR control method as it is interpretable under each method.

For simplicity and ease of comparison, we assumed that we were in a situation with only two conditions of interest. However, the methodology could easily be extended to the multiple condition case (Kendziorski et al., 2003) or more complex ANOVA type designs (Cui and Churchill, 2003; Smyth, 2004).


    APPENDIX
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 
Marginal densities of measured intensities
Under the extended GG model, the joint marginal densities of measured intensities of a given gene g are developed without integrating ag away, i.e. they are conditional on ag. Denote by G(x; a,b) the Gamma density function with shape a and rate b. The explicit forms of the conditional marginal densities are given by


Formula 5

(5)
and


Formula 6

(6)
where {psi} = (a0, {nu})'.

The joint marginal densities of measured intensities under the extended LNN model are developed in a similar fashion, this time by integrating µg and {tau}g away. Denote by LN(x; a,b) the Lognormal density function with mean and variance parameters a and b, respectively, and by N(x; a,b) the normal density function. The marginal densities are developed as follows:


Formula 7

(7)
and


Formula 8

(8)
where {psi} = (m, k, {alpha}, ß)'.

Estimation of {eta} and {xi} for the prior of ag
As mentioned in Section 2.3, to make use of the modified complete data log-likelihood (4) in the extended GG model we need to provide estimates of the hyperparameters for the Lognormal({eta}, {xi}) prior of ag beforehand. Here we propose to use the method of moments (MMs) to estimate {eta} and {xi}. First we would like to come up with simple estimates of the ag's. On noting that the coefficient of variation is given by Formula for each gene, a robust empirical estimate of ag may be provided by


Formula

where med and mad stand for median and median absolute deviation, respectively. Note that a robust counterpart to mean and SD is adopted since there are usually relatively few replicates. With these crude estimates of ag's, we can then obtain the estimates of {eta} and {xi}:


Formula 9

(9)
Again, a robust version of MM is proposed here.

Initialization of the EM algorithm
We need to initialize the parameters to be estimated before the EM type algorithm described in Section 2.3 can be applied. Similar to the estimation for {eta} and {xi} above, robust MM estimates of (a, a0, {nu}) are obtained for the extended GG model. Similar measure is taken for (m, {alpha}, ß) if the data are modeled under the extended LNN framework, while k is empirically chosen to be 30. After the crude estimation step, updated estimates of the aforementioned parameters are obtained on maximizing the corresponding marginal null log-likelihood under either model formulation. This step is taken in order to bring the initial estimates closer to the estimates returned by the EM algorithm. Using these initial estimates together with p set as 0.5, the most likely value under the Beta (2,2) prior, initial estimates of zg's are obtained, which are then used to update the parameter estimates in the EM algorithm.


    Acknowledgments
 
The authors thank Adrian Raftery, Ka Yee Yeung and Roger Bumgarner for helpful discussion, and the two referees and the associate editor for suggestions that clearly improved an earlier draft of the article. This work was supported by a grant from the National Sciences and Engineering Research Council of Canada.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on October 1, 2006; revised on November 21, 2006; accepted on November 26, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A BAYESIAN FRAMEWORK...
 3 APPLICATION TO EXPERIMENTAL...
 4 SIMULATION STUDIES
 5 DISCUSSION
 APPENDIX
 REFERENCES
 

    Affymetrix Manual. Affymetrix Microarray Suite User Guide version 5.0, (2001) , Santa Clara, CA.

    Baldi, P. and Long, A. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17, 509–519[Abstract/Free Full Text].

    Bhowmick, D., et al. (2006) A Laplace mixture model for identification of differential expression in microarray experiments. Biostatistics, 7, 630–641[Abstract/Free Full Text].

    Chen, Y., Dougherty, E.R., Bittner, M.L. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Opt, . 2, 364–374[CrossRef].

    Cope, L.M., et al. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323–331[Abstract/Free Full Text].

    Cui, X. and Churchill, G. (2003) Statistical tests for differential expression in cDNA microarray experiments. Genome Biol, . 4, 210[CrossRef][Medline].

    Cui, X., et al. (2005) Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics, 6, 59–75[Abstract].

    Dempster, A., Laird, N., Rubin, D. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B, 39, 1–38.

    Dudoit, S., et al. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statist. Sinica, 12, 111–139.

    Durbin, B.P., et al. (2002) A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics, 18, S105–S110[Abstract].

    Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc, . 96, 1151–1160[CrossRef][Web of Science].

    Gottardo, R., et al. (2003) Statistical analysis of microarray data: a Bayesian approach. Biostatistics, 4, 597–620[Abstract].

    Gottardo, R., et al. (2006a) Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics, 62, 10–18[Web of Science][Medline].

    Gottardo, R., et al. (2006b) Quality control and robust estimation of cDNA microarray with replicates. J. Am. Stat. Assoc, . 11, 30–40.

    Hsieh, W.P., Chu, T.M., Wolfinger, R., et al. (2003) Who are those strangers in the latin square? In Johnson, K.F. and Lin, S.M. (Eds.). Methods of Microarray Data Analysis III: Papers from CAMDA'02, , Boston Kluwer, pp. 199–208.

    Huber, W., et al. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, S96–S104[Abstract].

    Irizarry, R.A., et al. (2003a) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res, . 31, e15[Abstract/Free Full Text].

    Irizarry, R.A., et al. (2003b) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 2, 249–264.

    Kendziorski, C.M., et al. (2003) On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statist. Med, . 22, 3899–3914.

    Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 31–36[Abstract/Free Full Text].

    Lönnstedt, I. and Speed, T. (2002) Replicated microarray data. Statist. Sinica, 12, 31–46.

    Newton, M.A., et al. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol, . 8, 37–52[CrossRef][Web of Science][Medline].

    Newton, M., et al. (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5, 155–176[Abstract].

    Rocke, D.M. and Durbin, B. (2001) A model for measurement error for gene expression arrays. J. Comput. Biol, . 8, 557–569[CrossRef][Web of Science][Medline].

    Schena, M., et al. (1995) Quantitative monitoring of gene-expression patterns with a complementary-DNA microarray. Science, 270, 467–470[Abstract/Free Full Text].

    Sheffler, W., et al. (2005) A learned comparative expression measure for Affymetrix GeneChip DNA microarrays. Proc. Comput. Syst. Bioinform. Conf, . 144–154.

    Smyth, G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol, . 3, Article 3.

    Storey, J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist, . 31, 2013–2035[CrossRef].

    Tusher, V.G., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    van't Wout, A.B., et al. (2003) Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+ T-cell lines. J. Virol, . 77, 1392–1402.

    Wu, Z., et al. (2004) A model based background adjustment for oligonucleotide expression data. J. Am. Stat. Assoc, . 99, 909–917[CrossRef][Web of Science].

    Yang, Y.H., et al. (2002) Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res, . 30, e15[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
J ANIM SCIHome page
C. Diaz, N. Moreno-Sanchez, J. Rueda, A. Reverter, Y. H. Wang, and M. J. Carabano
Model selection in a global analysis of a microarray experiment
J Anim Sci, January 1, 2009; 87(1): 88 - 98.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
A. Reverter and E. K. F. Chan
Combining partial correlation and an information theory approach to the reversed engineering of gene co-expression networks
Bioinformatics, November 1, 2008; 24(21): 2491 - 2497.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/3/328    most recent
btl612v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lo, K.
Right arrow Articles by Gottardo, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lo, K.
Right arrow Articles by Gottardo, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?