Bioinformatics Advance Access originally published online on September 28, 2004
Bioinformatics 2005 21(6):723-729; doi:10.1093/bioinformatics/bti051
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Weighted analysis of microarray gene expression using maximum-likelihood
1Cancer Research UK Beatson Laboratories Garscube Estate, Bearsden, Glasgow G61 1BD, UK
2Department of Statistics, University of Glasgow Glasgow G12 8QW, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: The numerical values of gene expression measured using microarrays are usually presented to the biological end-user as summary statistics of spot pixel data, such as the spot mean, median and mode. Much of the subsequent data analysis reported in the literature, however, uses only one of these spot statistics. This results in sub-optimal estimates of gene expression levels and a need for improvement in quantitative spot variation surveillance.
Results: This paper develops a maximum-likelihood method for estimating gene expression using spot mean, variance and pixel number values available from typical microarray scanners. It employs a hierarchical model of variation between and within microarray spots. The hierarchical maximum-likelihood estimate (MLE) is shown to be a more efficient estimator of the mean than the conventional estimate using solely the spot mean values (i.e. without spot variance data). Furthermore, under the assumptions of our model, the spot mean and spot variance are shown to be sufficient statistics that do not require the use of all pixel data.
The hierarchical MLE method is applied to data from both Monte Carlo (MC) simulations and a two-channel dye-swapped spotted microarray experiment. The MC simulations show that the hierarchical MLE method leads to improved detection of differential gene expression particularly when outlier spots are present on the arrays. Compared with the conventional method, the MLE method applied to data from the microarray experiment leads to an increase in the number of differentially expressed genes detected for low cut-off P-values of interest.
Availability: The Matlab code is available at http://www.stats.gla.ac.uk/~microarray/software/
Contact: d.bakewell{at}beatson.gla.ac.uk
| 1 INTRODUCTION |
|---|
|
|
|---|
Microarray spot summary statistics, such as mean, median and mode are commonly used as gene expression indicators from image processed microarray experiments. Ideally, all pixels of each spot located on each optically scanned microarray should be recorded to yield maximum available information but this would further exasperate existing data storage requirements and/or analysis capabilities. As such, the use of the abovementioned summary statistics is a trade-off between data storage capability and requirements of the end-user, and often one statistic is chosen by the analyst as the basis for quantifying gene expression. For example, the pixel median statistic for each spot is often chosen since it is resilient to a strongly skewed distribution of pixel intensities compared with, e.g. the mean. Recent work (Nuñez-Garcia et al., 2003) has shown that the choice of pixel mode statistic is a more accurate statistic for estimating gene expression for spots exhibiting a donut spot shape spatial pixel intensity distribution. Intuitively, the use of more than one summary statistic may lead to improvements in the estimation of gene expression and elucidation of the sources of data variability. For example, recent work has shown that using the mean, median and variance, combined together, can correct for signal saturation effects (Wit and McClure, 2003).
In this paper, we consider the use of the spot mean and standard deviation (SD) for estimating the mean expression across all spots of each gene under the same condition. These improved estimates will lead, for example, to a more accurate detection of differential expression. The take-away message is to evaluate weighted spot means where the weights are roughly inversely proportional to the spot variance.
It is important to note that the technique developed is applicable to all types of microarrays and experimental designs for which a measure of variation is available. Thus, for example, although the microarray study in Section 5.4 uses a dye-swap, by no means is the technique confined to this particular type of experimental design using two channels. In fact, the technique can be applied to any type of experimental measurement (genomic, proteomic, etc.) for which the error variance is known.
| 2 MODEL AND METHODS |
|---|
|
|
|---|
Quantitative evaluation of the expression of a single gene is achieved by optical measurement of spot fluorescence. Typically, for each of the spots the fluorescence of a large number of pixels is recorded. There are many sources of gene expression variation, both biological and technical (as discussed in Section 6), but a key characteristic of spotted microarrays is the physical, spatial separation of spots defined by printing. This motivates representation of gene expression variation on a microarray as a directed acyclic graph with two layers, as shown in Figure 1.
|
The hierarchical model pertains to a single gene with a mean expression value µ measured by n spots, which contain mi independent pixels (i = 1,...,n). The between-spot and within-spot deviations are denoted by quantities
b and
, respectively. The top layer represents the variation between spot values. Spots can vary from each other, for example, because they represent cDNA from different biological samples. In this case, the top layer principally describes the biological variation of the expression of a single gene within the population of interest. The second layer represents the variation between pixels within each spot and tends to arise from technical variation, for example, printing and hybridization effects (Wit and McClure, 2003, Table 3.1). An example of within-spot pixel variation for two spots with the same mean is shown in Figure 2. The left spot Figure 2a has a considerably higher pixel variation than the right spot, Figure 2b. As such, the right spot is more uniform and can be regarded as a more reliable indicator of gene expression.
|
2.1 Determining effective sample size of correlated spatial data
The i-th spot consists of a number of
pixels and for cDNA microarray data this number is typically in the order of 200300. These pixels do not constitute independent observations as the pixels are heavily spatially correlated. Therefore, we propose a correction term and to replace the sample size
for the i-th spot by the effective sample size of mi pixels that are independent.
There is an extensive literature on effective sample sizes in spatial autoregressive (SAR) processes (Clifford et al., 1989; Dutilleul, 1993). In matrix formulation, a SAR model is given as
![]() |
is the average spot intensity, 1 a
vector of ones, Y is the
vector of spot-pixel observations, W the
-contiguity matrix,
a vector of i.i.d. errors and
a scalar that indicates the spatial correlation between two neighbouring pixels. A SAR model is flexible enough to capture most simple spatially correlated observations. It was shown (Griffith and Zhang, 1999) that in this model the effective sample size can be approximated as,
![]() | (1) |
For one cDNA printed microarray, we estimated a value of
on average across all the spots. This means that for ordinary spots sizes (number of pixels > 50), the effective sample size is only 67% of the original sample size, i.e. mi
0.06m*i. We use this relation in Section 5.4.
2.2 Hierarchical variation between- and within-spots
The log-expression in the absence of measurement error can be modelled as
where
and
is the biological, or more generally, between-spot, variance. This constitutes the top level of the hierarchy in Figure 1. These between-spot log-expression parameters,
i, are estimated intermediately through the log-pixel values, xij, where the indices denote the j-th pixel for the i-th spot. The log-pixel values can be modelled as additional deviations from the spot log-expression,
![]() | (2) |
represents the within-spot variation and
is the variance within the i-th spot consisting of mi independent pixels.
The pixel values are in principle available, but in practice one tends to prefer to work with summary parameters, such as, the log-pixel sample average and sample variance,
![]() | (3) |
![]() | (4) |
values. Note that since we have assumed the pixel intensities are log-normally distributed, xij is the logarithm of each pixel value and not the raw intensity. Therefore, currently, these xij values are not provided by most scanner imaging software. An estimation method is presented in Section 3.2 and the assumption of the log-normal pixel distribution is addressed in Section 6.
The log-mean of the i-th spot for the hierarchical model has the following distribution if the pixels are independent,
![]() | (5) |
is used as a plug-in estimate of
for pragmatic reasons (otherwise the problem becomes intractable). Thus,
is given by Equations (3) and (4).
2.3 Maximum-likelihood estimate (MLE) of µ and
for the hierarchical model
Assuming
i and mi are known, the log-likelihood l of µ and
b given xi for n spots, i=1,2,...,n, is expressed as follows:
![]() | (6) |
![]() | (7) |
i = xi µ and
and the last equality results from (5).
Expressions for the maximum log-likelihood are found by setting the partial derivatives of (7)) to zero:
and
This results in
![]() | (8) |
, in the weighted sum is
![]() |
![]() | (9a) |
, leading to
![]() | (9b) |
![]() |
The subscript w on the estimates of
and
signifies these parameters are weighted sums originating from Equation (7). The solutions depend implicitly on the unknown µ and
b. Either iteration or an appropriate computational algorithm will yield estimates for µ and
b. The coefficients weight the spots inversely to their variance, i.e. giving high weight to high-quality spots with low
, and vice versa.
2.4 Naive gene expression estimates
The estimates for µw and
can be compared with estimates from a maximum-likelihood analysis that uses solely the spot mean xi data for the microarray scanner and does not include the within-spot,
data. The approach is called naive. Standard expressions for MLE unbiased estimates of the mean and variance of xi are as follows (Dougherty, 1990, p. 342):
![]() | (10) |
![]() | (11) |
are naive.
2.5 Comparison of hierarchical and naive MLE expressions for mean and variance
Rearranging Equation (9b)
![]() | (12) |
A useful comparison of µ and
MLEs between the hierarchical and naive models is achieved by considering the special case for the hierarchical model where the spot variances are all equal in Equations (8) and (12),
for all
and
where cv is a constant. Consequently, the spots have equal weighting and Equations (8) and (12) simplify to
![]() |
![]() |
assume equal weighting for each spot.
Comparing the hierarchical model where the within-spot variances are all equal to cv with the naive model [given by Equations (10) and (11)] shows that the estimates are the same for the mean,
and differ by cv for the variance,
. This illustrates that the naive estimate of the spot mean variance
contains the within-spot and between-spot components. Furthermore, without including the use of the within-spot information
i and mi (contained in this case within cv), it would not be possible to estimate the between-spot variance,
.
| ALGORITHM |
|---|
|
|
|---|
This section describes the algorithms used to maximize the log-likelihood function, given by Equation (7), followed by a description of algorithms required for log-transforming microarray scanner data. We implement the t-test to determine the genes that are differentially, or non-differentially, expressed under a particular treatment (Chen et al., 1997; Yang et al., 2000; Kerr et al., 2000).
3.1 Numerical considerations on finding a global maximum
The bounds of the maximum log-likelihood
are found by applying Cauchy's inequality to Equations (8) and (9b),
![]() | (13) |
![]() | (14) |
is found by numerically maximizing the function
![]() |
over the interval defined in Equation (13)),
![]() |
The maximum-likelihood
is rapidly evaluated by taking advantage of the properties of the double partial derivative with respect to µ that is negative for finite
and
,
![]() |
µw and
. This property ensures l(µ,
b) is a maximum with respect to µ wherever
µl(µ,
b) = 0. Substituting (8) into (7) simplifies the maximization of l(µ,
b) to
, i.e. maximization is only with respect to the explicit variable
,
![]() | (15) |
The procedure for finding a global maximum of the log-likelihood,
, can be described as a two-step process. The first step is to find the approximate location of
denoted as
. This is determined by maximizing Equation (15) with respect to
over the interval defined in (14)
![]() |
is to progressively refine (or polish) the starting point,
using a suitable numerical algorithm.
A satisfactory refinement of the starting point was achieved by joint iteration of Equations Equations (8) and (9b). Another choice of numerical optimization was to apply the partial derivatives to a quasi-Newton (or variable metric) method with globally convergent properties as described in the next section. The NelderMead simplex method (Press et al., 1992) that does not require the explicit use of partial derivatives was also used as a optimization method. A comparison of these methods for finding the global maximum
showed that they all agreed for starting values
. However, for starting points arbitrary located [i.e. not
] within the bounds given by Equations (13) and (14), the NelderMead was found to be the most robust numerical optimization method.
3.2 Transformation of data to log-scale
Current scanners provide spot mean, yi, and variance,
, raw summary statistical data that can be written as follows:
![]() |
i and
for each i-th spot.
These estimates can found using the Method of Moments (MoMs) (Dougherty, 1990, pp. 345348). The raw data
are assumed to have a true mean
i and true variance
,
![]() |
and
are defined as solutions of
and
.
Hence, it can be easily deduced that,
![]() | (16) |
![]() | (17) |
i. In Section 5.4, we set
and
.
3.3 t-test
Typically the mean expression, µt, for a particular treated gene is compared with the same parameter corresponding to a control (or wild-type) gene, µc. The t-test is often used to test the null hypothesis that the means are the same, Ho: µt = µc or different, H1: µt
µc.
After log-transforming the data and applying the MLE algorithms from Sections 2 and 3.1, the t-test constitutes the final step for identifying differentially expressed genes. The test subsequently also offers a way of evaluating the performance of the naive and weighted µ estimation methods as discussed later in Section 5.3.
The t-statistic for the difference between the treatment and control means for gene g estimated using the weighted method can be written as follows:
![]() | (18) |
![]() |
,
and the last equality results from applying (5).
Similarly, the t-statistic for the difference between the treatment and control means estimated using the naive method is,
![]() | (19) |
![]() |
Each gene can then be classified as non-differentially, or differentially, expressed (NDE or DE) according to acceptance or rejection of the null hypothesis. For example, the two-sided t-test for the weighted method comprises
![]() |
is the level of confidence or cut-off P-value (typically 5%) and
= 2(n 1) specifies the degrees-of-freedom. | 4 IMPLEMENTATION |
|---|
|
|
|---|
A Monte Carlo (MC) simulation of MLE was implemented using MATLAB 6© (Math Works, Inc., MA). The parameters specified were n,
b,
i, mi as defined previously, and the number of trials. Spot log-mean values xi were computed using a pair of normally distributed random number generators serially connected to achieve the hierarchical variation illustrated in Figure 1. The quasi-Newton method routine in MATLAB 6© used a DavidsonFletcherPowell (DFP) algorithm (see documentation) but another option was the BroydenFletcherGoldfarbShanno (BFGS) method.
| 5 RESULTS |
|---|
|
|
|---|
There are four results that stem from the MLE model presented in the preceding Sections 24. The first result is that the weighted estimate given by (8) is a more efficient estimate of µ than (10). An example illustration is given. The second result shows the sufficiency of spot mean and variance statistics for this model. The third and fourth results show application of the weighted and naive methods to data from both MC simulations and a two-channel dye-swapped spotted microarray experiment. Comparisons of these simulations and microarray studies show the advantages of the weighted approach.
5.1 Relative efficiency
The relative efficiency for unbiased estimators can be expressed as the ratio of the variance of the weighted and naive estimates of µ (Dougherty, 1990, p. 334; Casella and Berger, 2001, p. 476). Applying additivity properties for the variance of a sum of a Gaussian R.V. given in Equation (5) to
and
it can be shown using Cauchy's inequality (Abramowitz and Stegun, 1965) and
given in Equation (8) that the relative efficiency is,
![]() | (20) |
given in Equation (8). Excluding the trivial case where
, Equation (20) shows
is a more efficient estimator than
.
Figure 3 illustrates example distributions for estimates of the mean and SD using the weighted hierarchical MLE method and naive method, Equations (8) and (10). The data were generated using a MC simulation with 1000 trials and n = 10 spots using parameter values µ = 10,
b = 1, mi = 40
i = 1,...,10 and
i = 2i = 2, 4, 6,...,20. The histograms indicate that the means of the distributions are close to the correct value µ = 10 and they are the same,
, whereas the spread of
is less than
. The ratio of the SD was
(
denotes sample variance).
|
5.2 Sufficiency
The spot means and variances can be shown to be sufficient, that is, if the spot means and variances are known, then the pixels do not give any additional information for this model. This can be easily shown as follows. The likelihood using pixel values is given by
![]() | (21) |
and
, and
.
After some algebra, the second integrand in Equation (21),
can be evaluated and re-expressed
![]() |
b and
i and can be considered to be a constant. Hence,
![]() |
is the likelihood using the spot means and variance summary statistics, and
simply acts as a constant of proportionality. Hence, it follows
+ constant and the maximum log-likelihood using pixels leads to the same
and
as the maximum log-likelihood using summary statistics.
5.3 Simulation study
The performance of the MLE weighted and naive methods for estimating gene expression were compared using Monte Carlo simulated gene expression data. A set of 400 genes was split into one-half NDE: with gene identities g = 1,...,200 with µ simulated parameter values (µt = 1, µc = 1), and second-half DE: g = 201,...,400 (µt = 3, µc = 1). The between-spot SD was held constant
b = 1 for all genes for both control and treatment categories and the number of microarray spots (for each gene) likewise remained constant at n = 5.
Two microarray experiments were simulated with the number of independent pixels per spot, mi = 4, and within-spot SD
i applied to both control and treatment categories as follows:
- Constant variance (CV) with
i = 4 for all spots i = 1,...,5 and gene identities g = 1,...,400, that is 
- Constant variance with the exception of occasional high variance spots, or outliers (CV-O),

t-values evaluated using (18) and (19) for each gene were re-ordered to generate lists of genes identified as either true positive, or false positive, for a particular cut-off P-value. Combining the counts of true positive, or false positive, genes for a range of P-value cut-offs forms a Receiver Operating Characteristic (ROC)well-known as a performance indicator in digital communications (Van Trees, 1968, pp. 3646).
Figure 4 shows two sets of ROCs that enable a direct comparison of the weighted and naive estimation methods for each of the two simulated microarray experiments. ROCs that follow the left-hand and top borders show a more accurate test than those that follow a 45° diagonal. Figure 4 shows that in both microarray simulations the weighted method for estimating gene expression performs better than the naive method, in particular, for the case of constant variance with outliers.
|
5.4 A microarray study
The MLE weighted and naive methods for estimating gene expression were applied to a two-channel dye-swapped spotted microarray dataset (four separate technical replicates) used for studying skin cancer (Wit and McClure, 2003, 2004). To enable a fair comparison of the estimation method, the dataset excluded genes exhibiting extreme expression levels (very low or very high) indicative of unwanted technical effects, such as, fluorescence saturation.
The spot mean,
(spot i, gene g), and variance,
, data for both treatment genes (extracted from cancerous fibroblast cells) and control (normal) genes, were initially log-transformed using MoM Equations (16) and (17) in Section 3.2. The number of independent pixels per spot was estimated from Equation (1), Section 2.1,
, where
is the number of pixels in the i-th spot listed by the scanner software. Estimates of
and
for treatment and normal genes, and corresponding t-statistics were evaluated as described in Sections 2, 3 and 4. Figure 5 shows the proportion of genes (out of 928) where the null hypothesis is rejected (i.e. accept differential expression hypothesis, H1) versus the cut-off level over the range of typical interest, 0
0.2. The plots for this dataset show that the weighted method detects a higher proportion of differentially expressed genes compared with the naive method. Both the weighted and naive procedures can be compared with the plot of proportion of genes expected when differential expression does not occur.
|
| 6 DISCUSSION AND CONCLUDING REMARKS |
|---|
|
|
|---|
The preceding sections have shown that the MLE hierarchical model leads to a more efficient estimate of gene expression and subsequent improvement in differential detection. There are, however, areas where improvements can be made. One example concerns MoM estimation (prior to use in the MLE model) of the spot mean and variance of the log-pixel values from the corresponding raw quantities provided by the scanner software (i.e. xi and
from yi and
. MC simulations of differential expression using the MoM estimates indicate that there would be a significant benefit in differential detection if the scanner software provided the spot mean and variance of the log-pixel values thereby avoiding the need for MoM estimates. We also remark that the pixel distributions are not well represented in the literature compared with spot distributions, and that our use of a log-normal pixel distribution approximates the normal distribution using a square-root transformation (Glasbey and Ghazal, 2003). Implementing a square-root normal distribution would only require modifications to the MoM relations in Section 3.2.
In general, we note that the hierarchical model (Fig. 1) representing variations between-spots and within-spots may, possibly be generalized to investigate biological and technical variation. Biological variation arises from a number of factors in the process of gene expression, including, for example, spatial inhomogeneity of cells in the growth medium, poor cell-cycle synchronization, variations in DNA transcription, etc. Technical variation, in contrast, is attributed to uncertainties in the microarray, sample preparation and measurement process. Spotted microarrays, for example, exhibit technical variation arising from the DNA hybridization and labelling of the fluorescent dyes, spot printing effects, spatial variations in coatings of the slides, etc.
Considering all the available information, we can improve the quality of microarray gene expression estimation. Algorithms based on MLE are used to weight the measured mean of each spot inversely to the pixel variance measured on a gene-by-gene basis and we have shown the resulting improvements on the estimation of microarray expression values.
| Acknowledgments |
|---|
The authors thank Dr N. Barr for kindly making available the skin cancer data. D.J.B. would like to thank staff at the Bacterial Microarray Group, St Georges Hospital Medical School, London, and Dr K. Vass at the Beatson Laboratories (CR UK), for their guidance and encouragement, and providing office and computing facilities. E.W. would like to thank the Dipartimento di Scienze Statistiche Paolo Fortunati of the University of Bologna for its hospitality from January until July 2004. D.J.B. would also like to thank The Wellcome Trust for financial support (Project 062511).
Received on April 23, 2004; revised on September 17, 2004; accepted on September 20, 2004
| REFERENCES |
|---|
|
|
|---|
Abramowitz, M. and Stegun, I.A. Handbook of Mathematical Functions, (1965) , NY Dover Publications.
Casella, G. and Berger, R.L. Statistical Inference, (2001) 2nd edn. , Pacific Grove, CA Duxbury.
Chen, Y., Dougherty, E.R., Bittner, M.L. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Opt., 2, , pp. 364374[CrossRef].
Clifford, P., Richardson, S., Hemon, D. (1989) Assessing the significance of the correlation between two spatial processes. Biometrics, 45, 123134[CrossRef][Web of Science][Medline].
Dougherty, E.R. Probability and Statistics for the Engineering, Computing, and Physical Sciences, (1990) , Upper Saddle River, NJ Prentice-Hall, Inc.
Dutilleul, P. (1993) Modifying the t-test for assessing the correlation between two spatial processes. Biometrics, 49, , pp. 305314.
Glasbey, C.A. and Ghazal, P. (2003) Combinatorial image analysis of DNA microaray features. Bioinformatics, 19, 194203
Griffith, D. and Zhang, M. (1999) Computational simplifications needed for efficient implementation of spatial statistical techniques in a GIS. Int. J. Geograph. Inform. Sci., 5, 97105.
Kerr, M.K., Martin, M., Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819837[CrossRef][Web of Science][Medline].
Nuñez-Garcia, J., Mersinias, V., Cho, K.H., Smith, C.P., Wolkenhauer, O. (2003) The statistical distribution of the intensity of pixels within spots of DNA microarrays: what is the appropriate single-valued representative?. Appl. Bioinformatics, 2, 229239[Medline].
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. Numerical Recipes in FORTRAN, (1992) 2nd edn. , Cambridge, UK Cambridge University Press.
Van Trees, H.L. Detection, Estimation, and Modulation Theory, (1968) , NY John Wiley and Sons.
Wit, E. and McClure, J. (2003) Statistical adjustment of signal censoring in gene expression experiments. Bioinformatics, 19, , pp. 10551060
Wit, E. and McClure, J. Statistics for Microarrays: Design, Analysis and Inference, (2004) , NY John Wiley and Sons.
Technical Report 584. Yang, Y.H., Buckley, M.J., Dudoit, S., Speed, T.P. (2000) Comparison of methods for image analysis on cDNA microarrays. , Berkeley, CA Department of Statistics, University of California at Berkeley.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

, and within-spot, 
, and (b) higher quality spot with low pixel variance,
.







































