Skip Navigation


Bioinformatics Advance Access originally published online on September 28, 2004
Bioinformatics 2005 21(6):723-729; doi:10.1093/bioinformatics/bti051
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/6/723    most recent
bti051v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bakewell, D. J.
Right arrow Articles by Wit, E.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bakewell, D. J.
Right arrow Articles by Wit, E.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Weighted analysis of microarray gene expression using maximum-likelihood

David J. Bakewell 1,* and Ernst Wit 2

1Cancer Research UK Beatson Laboratories Garscube Estate, Bearsden, Glasgow G61 1BD, UK
2Department of Statistics, University of Glasgow Glasgow G12 8QW, UK

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 

Motivation: The numerical values of gene expression measured using microarrays are usually presented to the biological end-user as summary statistics of spot pixel data, such as the spot mean, median and mode. Much of the subsequent data analysis reported in the literature, however, uses only one of these spot statistics. This results in sub-optimal estimates of gene expression levels and a need for improvement in quantitative spot variation surveillance.

Results: This paper develops a maximum-likelihood method for estimating gene expression using spot mean, variance and pixel number values available from typical microarray scanners. It employs a hierarchical model of variation between and within microarray spots. The hierarchical maximum-likelihood estimate (MLE) is shown to be a more efficient estimator of the mean than the ‘conventional’ estimate using solely the spot mean values (i.e. without spot variance data). Furthermore, under the assumptions of our model, the spot mean and spot variance are shown to be sufficient statistics that do not require the use of all pixel data.

The hierarchical MLE method is applied to data from both Monte Carlo (MC) simulations and a two-channel dye-swapped spotted microarray experiment. The MC simulations show that the hierarchical MLE method leads to improved detection of differential gene expression particularly when ‘outlier’ spots are present on the arrays. Compared with the conventional method, the MLE method applied to data from the microarray experiment leads to an increase in the number of differentially expressed genes detected for low cut-off P-values of interest.

Availability: The Matlab code is available at http://www.stats.gla.ac.uk/~microarray/software/

Contact: d.bakewell{at}beatson.gla.ac.uk


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 
Microarray spot summary statistics, such as mean, median and mode are commonly used as gene expression indicators from image processed microarray experiments. Ideally, all pixels of each spot located on each optically scanned microarray should be recorded to yield maximum available information but this would further exasperate existing data storage requirements and/or analysis capabilities. As such, the use of the abovementioned summary statistics is a trade-off between data storage capability and requirements of the end-user, and often one statistic is chosen by the analyst as the basis for quantifying gene expression. For example, the pixel median statistic for each spot is often chosen since it is resilient to a strongly skewed distribution of pixel intensities compared with, e.g. the mean. Recent work (Nuñez-Garcia et al., 2003) has shown that the choice of pixel mode statistic is a more accurate statistic for estimating gene expression for spots exhibiting a donut spot shape spatial pixel intensity distribution. Intuitively, the use of more than one summary statistic may lead to improvements in the estimation of gene expression and elucidation of the sources of data variability. For example, recent work has shown that using the mean, median and variance, combined together, can correct for signal saturation effects (Wit and McClure, 2003).

In this paper, we consider the use of the spot mean and standard deviation (SD) for estimating the mean expression across all spots of each gene under the same condition. These improved estimates will lead, for example, to a more accurate detection of differential expression. The ‘take-away’ message is to evaluate weighted spot means where the weights are roughly inversely proportional to the spot variance.

It is important to note that the technique developed is applicable to all types of microarrays and experimental designs for which a measure of variation is available. Thus, for example, although the microarray study in Section 5.4 uses a dye-swap, by no means is the technique confined to this particular type of experimental design using two channels. In fact, the technique can be applied to any type of experimental measurement (genomic, proteomic, etc.) for which the error variance is known.


    2 MODEL AND METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 
Quantitative evaluation of the expression of a single gene is achieved by optical measurement of spot fluorescence. Typically, for each of the spots the fluorescence of a large number of pixels is recorded. There are many sources of gene expression variation, both biological and technical (as discussed in Section 6), but a key characteristic of spotted microarrays is the physical, spatial separation of spots defined by printing. This motivates representation of gene expression variation on a microarray as a directed acyclic graph with two layers, as shown in Figure 1.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 1 Directed acyclic graph of the hierarchical model of data variation and parameter observation. The hierarchical variation comprises the between-spot, , and within-spot, {varepsilon}ij, components pertaining to the i-th spot and j-th pixel. The observed quantities are the spot mean, xi, and spot variance, .

 
The hierarchical model pertains to a single gene with a mean expression value µ measured by n spots, which contain mi independent pixels (i = 1,...,n). The between-spot and within-spot deviations are denoted by quantities {varepsilon}b and {varepsilon}, respectively. The top layer represents the variation between spot values. Spots can vary from each other, for example, because they represent cDNA from different biological samples. In this case, the top layer principally describes the biological variation of the expression of a single gene within the population of interest. The second layer represents the variation between pixels within each spot and tends to arise from technical variation, for example, printing and hybridization effects (Wit and McClure, 2003, Table 3.1).

An example of within-spot pixel variation for two spots with the same mean is shown in Figure 2. The left spot Figure 2a has a considerably higher pixel variation than the right spot, Figure 2b. As such, the right spot is more uniform and can be regarded as a more reliable indicator of gene expression.



View larger version (32K):
[in this window]
[in a new window]
 
Fig. 2 Example of two microarray spots with the same observed spot mean xi = 5 but different within-spot pixel variation: (a) poorer quality spot with relatively high pixel variance, , and (b) higher quality spot with low pixel variance, .

 
2.1 Determining effective sample size of correlated spatial data
The i-th spot consists of a number of pixels and for cDNA microarray data this number is typically in the order of 200–300. These pixels do not constitute independent observations as the pixels are heavily spatially correlated. Therefore, we propose a correction term and to replace the sample size for the i-th spot by the effective sample size of mi pixels that are independent.

There is an extensive literature on effective sample sizes in spatial autoregressive (SAR) processes (Clifford et al., 1989; Dutilleul, 1993). In matrix formulation, a SAR model is given as

where is the average spot intensity, 1 a vector of ones, Y is the vector of spot-pixel observations, W the -contiguity matrix, {eta} a vector of i.i.d. errors and {rho} a scalar that indicates the spatial correlation between two neighbouring pixels. A SAR model is flexible enough to capture most simple spatially correlated observations. It was shown (Griffith and Zhang, 1999) that in this model the effective sample size can be approximated as,

(1)

For one cDNA printed microarray, we estimated a value of on average across all the spots. This means that for ordinary spots sizes (number of pixels > 50), the effective sample size is only 6–7% of the original sample size, i.e. mi {cong} 0.06m*i. We use this relation in Section 5.4.

2.2 Hierarchical variation between- and within-spots
The log-expression in the absence of measurement error can be modelled as where and is the biological, or more generally, between-spot, variance. This constitutes the top level of the hierarchy in Figure 1. These between-spot log-expression parameters, {theta}i, are ‘estimated’ intermediately through the log-pixel values, xij, where the indices denote the j-th pixel for the i-th spot. The log-pixel values can be modelled as additional deviations from the spot log-expression,

(2)
and are shown as the second level of the the hierarchy. In Equation (2)), represents the within-spot variation and is the variance within the i-th spot consisting of mi independent pixels.

The pixel values are in principle available, but in practice one tends to prefer to work with summary parameters, such as, the log-pixel sample average and sample variance,

(3)

(4)
and the values. Note that since we have assumed the pixel intensities are log-normally distributed, xij is the logarithm of each pixel value and not the raw intensity. Therefore, currently, these xij values are not provided by most scanner imaging software. An estimation method is presented in Section 3.2 and the assumption of the log-normal pixel distribution is addressed in Section 6.

The log-mean of the i-th spot for the hierarchical model has the following distribution if the pixels are independent,

(5)
In Section 2.3 and subsequent calculations, the sample variance is used as a plug-in estimate of for pragmatic reasons (otherwise the problem becomes intractable). Thus, is given by Equations (3) and (4).

2.3 Maximum-likelihood estimate (MLE) of µ and for the hierarchical model
Assuming {sigma}i and mi are known, the log-likelihood l of µ and {sigma}b given xi for n spots, i=1,2,...,n, is expressed as follows:

(6)

(7)
where {Psi}i = xi µ and and the last equality results from (5).

Expressions for the maximum log-likelihood are found by setting the partial derivatives of (7)) to zero: and This results in

(8)
where the coefficient, , in the weighted sum is

and

(9a)
or the interesting case , leading to

(9b)
where the weight coefficient is

The subscript ‘w’ on the estimates of and signifies these parameters are weighted sums originating from Equation (7). The solutions depend implicitly on the unknown µ and {sigma}b. Either iteration or an appropriate computational algorithm will yield estimates for µ and {sigma}b. The coefficients weight the spots inversely to their variance, i.e. giving high weight to high-quality spots with low , and vice versa.

2.4 Naive gene expression estimates
The estimates for µw and can be compared with estimates from a maximum-likelihood analysis that uses solely the spot mean xi data for the microarray scanner and does not include the within-spot, data. The approach is called ‘naive’. Standard expressions for MLE unbiased estimates of the mean and variance of xi are as follows (Dougherty, 1990, p. 342):

(10)
and

(11)
where the subscript ‘n’ denotes that the estimates of µ and are naive.

2.5 Comparison of hierarchical and naive MLE expressions for mean and variance
Rearranging Equation (9b)

(12)
shows that the weighted sample variance consists of the between- and within-spot components.

A useful comparison of µ and MLEs between the hierarchical and naive models is achieved by considering the special case for the hierarchical model where the spot variances are all equal in Equations (8) and (12), for all and where cv is a constant. Consequently, the spots have equal weighting and Equations (8) and (12) simplify to

and

where the subscript ‘e’ denotes that the estimates of µ and assume equal weighting for each spot.

Comparing the hierarchical model where the within-spot variances are all equal to cv with the naive model [given by Equations (10) and (11)] shows that the estimates are the same for the mean, and differ by cv for the variance, . This illustrates that the naive estimate of the spot mean variance contains the within-spot and between-spot components. Furthermore, without including the use of the within-spot information {sigma}i and mi (contained in this case within cv), it would not be possible to estimate the between-spot variance, .


    ALGORITHM
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 
This section describes the algorithms used to maximize the log-likelihood function, given by Equation (7), followed by a description of algorithms required for log-transforming microarray scanner data. We implement the t-test to determine the genes that are differentially, or non-differentially, expressed under a particular treatment (Chen et al., 1997; Yang et al., 2000; Kerr et al., 2000).

3.1 Numerical considerations on finding a global maximum
The bounds of the maximum log-likelihood are found by applying Cauchy's inequality to Equations (8) and (9b),

(13)
and

(14)
>In Equation (14) the lower and upper bounds arise from (9a) and (9b). The upper bound is found by numerically maximizing the function

with respect to over the interval defined in Equation (13)),

The maximum-likelihood is rapidly evaluated by taking advantage of the properties of the double partial derivative with respect to µ that is negative for finite and ,

where it is understood µ{equiv}µw and . This property ensures l(µ,{sigma}b) is a maximum with respect to µ wherever {partial}µl(µ,{sigma}b) = 0. Substituting (8) into (7) simplifies the maximization of l(µ,{sigma}b) to , i.e. maximization is only with respect to the explicit variable ,

(15)

The procedure for finding a global maximum of the log-likelihood, , can be described as a two-step process. The first step is to find the approximate location of denoted as . This is determined by maximizing Equation (15) with respect to over the interval defined in (14)

The second step for finding the location of the global maximum is to progressively refine (or ‘polish’) the starting point, using a suitable numerical algorithm.

A satisfactory refinement of the starting point was achieved by joint iteration of Equations Equations (8) and (9b). Another choice of numerical optimization was to apply the partial derivatives to a quasi-Newton (or variable metric) method with globally convergent properties as described in the next section. The Nelder–Mead simplex method (Press et al., 1992) that does not require the explicit use of partial derivatives was also used as a optimization method. A comparison of these methods for finding the global maximum showed that they all agreed for starting values . However, for starting points arbitrary located [i.e. not ] within the bounds given by Equations (13) and (14), the Nelder–Mead was found to be the most robust numerical optimization method.

3.2 Transformation of data to log-scale
Current scanners provide spot mean, yi, and variance, , ‘raw’ summary statistical data that can be written as follows:

The aim is to find estimates for the true spot mean and variance of the log pixel values {theta}i and for each i-th spot.

These estimates can found using the Method of Moments (MoMs) (Dougherty, 1990, pp. 345–348). The ‘raw’ data are assumed to have a true mean {gamma}i and true variance ,

where the second equality in both equations follows from standard formulae for log-normal mean and variance (Dougherty, 1990, p. 163). The MoM estimators and are defined as solutions of and .

Hence, it can be easily deduced that,

(16)
and

(17)
Note that in Section 2.2 the sample spot mean of the log-pixel values xi is an estimate of {theta}i. In Section 5.4, we set and .

3.3 t-test
Typically the mean expression, µt, for a particular ‘treated’ gene is compared with the same parameter corresponding to a ‘control’ (or wild-type) gene, µc. The t-test is often used to test the null hypothesis that the means are the same, Ho: µt = µc or different, H1: µt != µc.

After log-transforming the data and applying the MLE algorithms from Sections 2 and 3.1, the t-test constitutes the final step for identifying differentially expressed genes. The test subsequently also offers a way of evaluating the performance of the naive and weighted µ estimation methods as discussed later in Section 5.3.

The t-statistic for the difference between the treatment and control means for gene g estimated using the weighted method can be written as follows:

(18)
where the subscripts ‘g’, ‘c’ and ‘t’ signify ‘gene identity’, ‘control’ and ‘treatment’, respectively. In Equation (18) the spots means are assumed to be independent so the variances can be evaluated,

where , and the last equality results from applying (5).

Similarly, the t-statistic for the difference between the treatment and control means estimated using the naive method is,

(19)
where

Each gene can then be classified as non-differentially, or differentially, expressed (NDE or DE) according to acceptance or rejection of the null hypothesis. For example, the two-sided t-test for the weighted method comprises

where t–1 denotes the inverse student-t function, {alpha} is the level of confidence or cut-off P-value (typically 5%) and {nu} = 2(n – 1) specifies the degrees-of-freedom.


    4 IMPLEMENTATION
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 
A Monte Carlo (MC) simulation of MLE was implemented using MATLAB 6© (Math Works, Inc., MA). The parameters specified were n, {sigma}b, {sigma}i, mi as defined previously, and the number of trials. Spot log-mean values xi were computed using a pair of normally distributed random number generators serially connected to achieve the hierarchical variation illustrated in Figure 1.

The quasi-Newton method routine in MATLAB 6© used a Davidson–Fletcher–Powell (DFP) algorithm (see documentation) but another option was the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method.


    5 RESULTS
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 
There are four results that stem from the MLE model presented in the preceding Sections 2–4. The first result is that the weighted estimate given by (8) is a more efficient estimate of µ than (10). An example illustration is given. The second result shows the sufficiency of spot mean and variance statistics for this model. The third and fourth results show application of the weighted and naive methods to data from both MC simulations and a two-channel dye-swapped spotted microarray experiment. Comparisons of these simulations and microarray studies show the advantages of the weighted approach.

5.1 Relative efficiency
The relative efficiency for unbiased estimators can be expressed as the ratio of the variance of the weighted and naive estimates of µ (Dougherty, 1990, p. 334; Casella and Berger, 2001, p. 476). Applying additivity properties for the variance of a sum of a Gaussian R.V. given in Equation (5) to and it can be shown using Cauchy's inequality (Abramowitz and Stegun, 1965) and given in Equation (8) that the relative efficiency is,

(20)
where the third equality results from given in Equation (8). Excluding the trivial case where , Equation (20) shows is a more efficient estimator than .

Figure 3 illustrates example distributions for estimates of the mean and SD using the weighted hierarchical MLE method and naive method, Equations (8) and (10). The data were generated using a MC simulation with 1000 trials and n = 10 spots using parameter values µ = 10, {sigma}b = 1, mi = 40 {forall}i = 1,...,10 and {sigma}i = 2i = 2, 4, 6,...,20. The histograms indicate that the means of the distributions are close to the correct value µ = 10 and they are the same, , whereas the spread of is less than . The ratio of the SD was (‘’ denotes sample variance).



View larger version (31K):
[in this window]
[in a new window]
 
Fig. 3 Distributions for estimates of the mean using the weighted and naive methods, and . The histograms were generated using a MC simulation with 1000 trials and parameter values n = 10 spots, gene expression log-mean µ = 10 and variance {sigma}b = 1, mi = 40 independent pixels per spot and spot variance {sigma}i = 2i. See text for details.

 
5.2 Sufficiency
The spot means and variances can be shown to be sufficient, that is, if the spot means and variances are known, then the pixels do not give any additional information for this model. This can be easily shown as follows. The likelihood using pixel values is given by

(21)
where the vectors and , and .

After some algebra, the second integrand in Equation (21), can be evaluated and re-expressed

where Ci is independent of µ, {sigma}b and {theta}i and can be considered to be a constant. Hence,

where is the likelihood using the spot means and variance summary statistics, and simply acts as a constant of proportionality. Hence, it follows + constant and the maximum log-likelihood using pixels leads to the same and as the maximum log-likelihood using summary statistics.

5.3 Simulation study
The performance of the MLE weighted and naive methods for estimating gene expression were compared using Monte Carlo simulated gene expression data. A set of 400 genes was split into one-half NDE: with gene identities g = 1,...,200 with µ simulated parameter values (µt = 1, µc = 1), and second-half DE: g = 201,...,400 (µt = 3, µc = 1). The between-spot SD was held constant {sigma}b = 1 for all genes for both control and treatment categories and the number of microarray spots (for each gene) likewise remained constant at n = 5.

Two microarray experiments were simulated with the number of independent pixels per spot, mi = 4, and within-spot SD {sigma}i applied to both control and treatment categories as follows:

  1. Constant variance (CV) with {sigma}i = 4 for all spots i = 1,...,5 and gene identities g = 1,...,400, that is


  2. Constant variance with the exception of occasional high variance spots, or outliers (CV-O),


t-values evaluated using (18) and (19) for each gene were re-ordered to generate lists of genes identified as either true positive, or false positive, for a particular cut-off P-value. Combining the counts of true positive, or false positive, genes for a range of P-value cut-offs forms a Receiver Operating Characteristic (ROC)—well-known as a performance indicator in digital communications (Van Trees, 1968, pp. 36–46).

Figure 4 shows two sets of ROCs that enable a direct comparison of the weighted and naive estimation methods for each of the two simulated microarray experiments. ROCs that follow the left-hand and top borders show a more accurate test than those that follow a 45° diagonal. Figure 4 shows that in both microarray simulations the weighted method for estimating gene expression performs better than the naive method, in particular, for the case of constant variance with outliers.



View larger version (19K):
[in this window]
[in a new window]
 
Fig. 4 Receiver operating characteristics for the MLE hierarchical weighted and naive methods of estimating gene expression under two conditions: (1) constant within-spot variance (CV) and (2) constant variance with the exception of outliers with high variance (CV-O). See text for details.

 
5.4 A microarray study
The MLE weighted and naive methods for estimating gene expression were applied to a two-channel dye-swapped spotted microarray dataset (four separate technical replicates) used for studying skin cancer (Wit and McClure, 2003, 2004). To enable a fair comparison of the estimation method, the dataset excluded genes exhibiting extreme expression levels (very low or very high) indicative of unwanted technical effects, such as, fluorescence saturation.

The spot mean, (spot i, gene g), and variance, , data for both treatment genes (extracted from cancerous fibroblast cells) and control (normal) genes, were initially log-transformed using MoM Equations (16) and (17) in Section 3.2. The number of independent pixels per spot was estimated from Equation (1), Section 2.1, , where is the number of pixels in the i-th spot listed by the scanner software. Estimates of and for treatment and normal genes, and corresponding t-statistics were evaluated as described in Sections 2, 3 and 4. Figure 5 shows the proportion of genes (out of 928) where the null hypothesis is rejected (i.e. accept differential expression hypothesis, H1) versus the cut-off level over the range of typical interest, 0 ≤ {alpha} ≤ 0.2. The plots for this dataset show that the weighted method detects a higher proportion of differentially expressed genes compared with the naive method. Both the weighted and naive procedures can be compared with the plot of proportion of genes expected when differential expression does not occur.



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 5 Proportion of genes detected as being differentially expressed (reject H0) versus cut-off value, {alpha}, for the skin cancer data (kindly provided by Dr N. Barr). The plots show the weighted method detects more genes than the naive method.

 

    6 DISCUSSION AND CONCLUDING REMARKS
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 
The preceding sections have shown that the MLE hierarchical model leads to a more efficient estimate of gene expression and subsequent improvement in differential detection. There are, however, areas where improvements can be made. One example concerns MoM estimation (prior to use in the MLE model) of the spot mean and variance of the log-pixel values from the corresponding ‘raw’ quantities provided by the scanner software (i.e. xi and from yi and . MC simulations of differential expression using the MoM estimates indicate that there would be a significant benefit in differential detection if the scanner software provided the spot mean and variance of the log-pixel values thereby avoiding the need for MoM estimates.

We also remark that the pixel distributions are not well represented in the literature compared with spot distributions, and that our use of a log-normal pixel distribution approximates the normal distribution using a square-root transformation (Glasbey and Ghazal, 2003). Implementing a square-root normal distribution would only require modifications to the MoM relations in Section 3.2.

In general, we note that the hierarchical model (Fig. 1) representing variations between-spots and within-spots may, possibly be generalized to investigate ‘biological’ and ‘technical’ variation. Biological variation arises from a number of factors in the process of gene expression, including, for example, spatial inhomogeneity of cells in the growth medium, poor cell-cycle synchronization, variations in DNA transcription, etc. Technical variation, in contrast, is attributed to uncertainties in the microarray, sample preparation and measurement process. Spotted microarrays, for example, exhibit technical variation arising from the DNA hybridization and labelling of the fluorescent dyes, spot printing effects, spatial variations in coatings of the slides, etc.

Considering all the available information, we can improve the quality of microarray gene expression estimation. Algorithms based on MLE are used to weight the measured mean of each spot inversely to the pixel variance measured on a gene-by-gene basis and we have shown the resulting improvements on the estimation of microarray expression values.


    Acknowledgments
 
The authors thank Dr N. Barr for kindly making available the skin cancer data. D.J.B. would like to thank staff at the Bacterial Microarray Group, St George’s Hospital Medical School, London, and Dr K. Vass at the Beatson Laboratories (CR UK), for their guidance and encouragement, and providing office and computing facilities. E.W. would like to thank the Dipartimento di Scienze Statistiche ‘Paolo Fortunati’ of the University of Bologna for its hospitality from January until July 2004. D.J.B. would also like to thank The Wellcome Trust for financial support (Project 062511).

Received on April 23, 2004; revised on September 17, 2004; accepted on September 20, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 MODEL AND METHODS
 ALGORITHM
 4 IMPLEMENTATION
 5 RESULTS
 6 DISCUSSION AND CONCLUDING...
 REFERENCES
 

    Abramowitz, M. and Stegun, I.A. Handbook of Mathematical Functions, (1965) , NY Dover Publications.

    Casella, G. and Berger, R.L. Statistical Inference, (2001) 2nd edn. , Pacific Grove, CA Duxbury.

    Chen, Y., Dougherty, E.R., Bittner, M.L. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Opt., 2, , pp. 364–374[CrossRef].

    Clifford, P., Richardson, S., Hemon, D. (1989) Assessing the significance of the correlation between two spatial processes. Biometrics, 45, 123–134[CrossRef][Web of Science][Medline].

    Dougherty, E.R. Probability and Statistics for the Engineering, Computing, and Physical Sciences, (1990) , Upper Saddle River, NJ Prentice-Hall, Inc.

    Dutilleul, P. (1993) Modifying the t-test for assessing the correlation between two spatial processes. Biometrics, 49, , pp. 305–314.

    Glasbey, C.A. and Ghazal, P. (2003) Combinatorial image analysis of DNA microaray features. Bioinformatics, 19, 194–203[Abstract/Free Full Text].

    Griffith, D. and Zhang, M. (1999) Computational simplifications needed for efficient implementation of spatial statistical techniques in a GIS. Int. J. Geograph. Inform. Sci., 5, 97–105.

    Kerr, M.K., Martin, M., Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819–837[CrossRef][Web of Science][Medline].

    Nuñez-Garcia, J., Mersinias, V., Cho, K.H., Smith, C.P., Wolkenhauer, O. (2003) The statistical distribution of the intensity of pixels within spots of DNA microarrays: what is the appropriate single-valued representative?. Appl. Bioinformatics, 2, 229–239[Medline].

    Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. Numerical Recipes in FORTRAN, (1992) 2nd edn. , Cambridge, UK Cambridge University Press.

    Van Trees, H.L. Detection, Estimation, and Modulation Theory, (1968) , NY John Wiley and Sons.

    Wit, E. and McClure, J. (2003) Statistical adjustment of signal censoring in gene expression experiments. Bioinformatics, 19, , pp. 1055–1060[Abstract/Free Full Text].

    Wit, E. and McClure, J. Statistics for Microarrays: Design, Analysis and Inference, (2004) , NY John Wiley and Sons.

    Technical Report 584. Yang, Y.H., Buckley, M.J., Dudoit, S., Speed, T.P. (2000) Comparison of methods for image analysis on cDNA microarrays. , Berkeley, CA Department of Statistics, University of California at Berkeley.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/6/723    most recent
bti051v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bakewell, D. J.
Right arrow Articles by Wit, E.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bakewell, D. J.
Right arrow Articles by Wit, E.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?