Skip Navigation


Bioinformatics Advance Access originally published online on July 28, 2006
Bioinformatics 2006 22(20):2547-2553; doi:10.1093/bioinformatics/btl412
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/20/2547    most recent
btl412v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Motakis, E. S.
Right arrow Articles by Rutter, G. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Motakis, E. S.
Right arrow Articles by Rutter, G. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Variance stabilization and normalization for one-color microarray data using a data-driven multiscale approach

E. S. Motakis 1, G. P. Nason 1,*, P. Fryzlewicz 1 and G. A. Rutter 2

1 Department of Mathematics, University of Bristol Bristol, UK
2 Department of Biochemistry, University of Bristol Bristol, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 

Motivation: Many standard statistical techniques are effective on data that are normally distributed with constant variance. Microarray data typically violate these assumptions since they come from non-Gaussian distributions with a non-trivial mean–variance relationship. Several methods have been proposed that transform microarray data to stabilize variance and draw its distribution towards the Gaussian. Some methods, such as log or generalized log, rely on an underlying model for the data. Others, such as the spread-versus-level plot, do not. We propose an alternative data-driven multiscale approach, called the Data-Driven Haar–Fisz for microarrays (DDHFm) with replicates. DDHFm has the advantage of being ‘distribution-free’ in the sense that no parametric model for the underlying microarray data is required to be specified or estimated; hence, DDHFm can be applied very generally, not just to microarray data.

Results: DDHFm achieves very good variance stabilization of microarray data with replicates and produces transformed intensities that are approximately normally distributed. Simulation studies show that it performs better than other existing methods. Application of DDHFm to real one-color cDNA data validates these results.

Availability: The R package of the Data-Driven Haar–Fisz transform (DDHFm) for microarrays is available in Bioconductor and CRAN.

Contact: g.p.nason{at}bristol.ac.uk


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 
Microarrays, in principle and in practice, are extensions of hybridization-based methods (Southern blots, northern blots, SAGE, etc.), which have been used for decades to identify and locate mRNA and DNA sequences that are complementary to a segment of DNA (Alwin et al., 1977; Velculescu et al., 1995). Microarray technology, in the form of either cDNA or high-density oligonucleotide arrays enables molecular biologists to measure simultaneously the expression level of thousands of genes. In a typical microarray experiment the aim is to compare different cell types, e.g. normal versus diseased cells, in order to identify genes that are differentially expressed in the two cell types.

Typically, microarray data analyses consist of several steps ranging from experimental design to the identification of important genes (for a review on the whole process see Sebastiani and Ramoni, 2003). Gene replication is a crucial design feature as it increases the precision of estimation and permits estimation of measurement variance which enables the significance of the final results to be judged.

Rocke and Durbin (2001) identified that the variance of the raw spot intensities increased with their mean and they modeled those intensities in terms of the two-component model:

Formula 1(1)
Here, Formula 1 are the raw single-color intensities for the n genes, each assumed to be replicated p times. Sometimes we will write Yr,i when we are referring to the r-th replicate on the i-th gene (Formula 1). The {alpha} term represents the (common) mean background noise of the n genes on the array, µi is the true expression level for gene i, and {eta}i and {varepsilon}i are the normally distributed error terms with zero mean and variances Formula 1 and Formula 1, respectively. In this way, Formula 1 can be considered as coming from an inhomogeneous process that produces the n gene intensities with finite but different µis and finite but different variances.

At low expression levels (i.e. µi close to 0) the measured expression Yi in (1) can be written as Formula 1 so that Yi is approximately distributed as N({alpha}, Formula 1). On the other hand, for large µis, the middle term in (1) dominates and Yi can be modeled as follows:

Formula 2(2)
with approximate variance

Formula 3(3)
where Formula 3. For moderate values of µi, Yi is modeled as in (1) with variance:

Formula 4(4)

From (3) and (4), we observe that the SD of the Yi increases linearly with their mean. Such mean–variance dependence, implying the presence of heteroscedastic intensities, is a major problem in the statistical analysis of microarrays.

Two methodological approaches have been followed to account for the heteroscedasticity. The first approach involves the estimation of differentially expressed genes directly from the heteroscedastic data by means of penalized t-statistics (e.g. SAM method of (Tusher et al., 2001), mixed or hierarchical Bayesian modeling (e.g. Baird et al., 2004; Hsiao et al., 2004), appropriate maximum-likelihood tests (e.g. Wang and Ethier, 2004) and, recently, gene grouping schemes (e.g. Comander et al., 2004; Delmar et al., 2005a,b). The second approach, which we follow in this article, involves finding appropriate transformations that stabilize the variance of the data. After variance stabilization the data can be analyzed by standard, simple and universally accepted tools, such as ANOVA models.

Section 2 outlines some existing variance stabilizing transforms that have been applied to microarray data. Section 3 proposes a new method called the Data-Driven Haar–Fisz transform for microarrays (DDHFm) and compares its performance with existing methods by means of simulated and real cDNA data in Section 4. We show that DDHFm is superior to existing methods in terms of variance stabilization and Gaussianization of the transformed intensities.


    2 ESTABLISHED VARIANCE STABILIZATION METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 
For brevity we discuss and compare the performance of different variance stabilization techniques without, at this stage, worrying about differential expression. For this reason we consider data obtained from one-color microarrays. Generalization to two-color experiments will be considered in future work.

2.1 Log-based transformations
Smyth et al. (2003) suggest using the log transform for microarray intensities. By assuming that the ‘lognormal distribution is an extremely good approximation to the bulk of the data’ (Hoyle et al., 2002) as in model (2), the log transform Formula 4 should stabilize the variance of the gene intensities and bring their distribution closer to the Gaussian. An extension of this approach then considers background corrected intensities, Formula 4, which may be negative and cannot be handled by the simple log function. Based on this notion, several authors have studied alternative logarithmic-based transformations for microarray data.

Tukey (1977) defines the ‘Started Log’ transformation as Formula 4, where k is a positive constant estimated via Formula 4, so that it minimizes the deviation from variance constancy. Alternatively, Holder et al. (2001) developed the log-linear hybrid transformation as Formula 4, for Formula 4 and Formula 4, for Formula 4. This transformation has also been called Linlog by Cui et al. (2003). As with sLog, the optimal k is estimated by Formula 4.

2.2 The generalized logarithm transformation (glog)
Munson (2001), Durbin et al. (2002) and Huber et al. (2002) independently developed the generalized logarithm transformation (referred to as glog in Rocke and Durbin, 2003). For data that come from model (1) with the mean–variance dependence (4), glog is assumed to produce symmetric transformed gene intensities with stabilized variance. The glog formula is:

Formula 5(5)
where c is estimated by Formula 5. Rocke and Durbin (2001) described algorithms to estimate {alpha} and c from one-color cDNA data. Although estimation of {alpha} can be conducted without replicated genes, estimation of c involves estimation of Formula 5, which requires replication. Maximum-likelihood methods for c estimation only, based on Box and Cox (1964), were also developed by Durbin and Rocke (2003) for the case of two-colors microarrays and thus they are not relevant to the present work.

2.3 Spread-versus-level plot transformation (SVL)
Archer et al. (2004) describes a different variance stabilization approach based on plotting the log-median of the replicated intensities on the x-axis (level) against the log of their fourth-spread (a variant of the interquantile range) on the y-axis (spread). Then the estimated slope of the subsequent linear regression model fit indicates the appropriate Box–Cox power transformation.


    3 DATA-DRIVEN HAAR–FISZ TRANSFORMATION FOR MICROARRAYS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 
This section describes how the recent DDHF transform can be adapted for use with microarray data. Our adaption requires a subtle organization of microarray intensities into a form acceptable for the DDHF transform. We call our adaption the DDHF transform for microarray data, or DDHFm.

Recently, a new class of variance stabilization transforms, generically known as Haar–Fisz (HF) transforms, were introduced by Fryzlewicz and Nason (2004). In that work the HF transform used a multiscale technique to take sequences of Poisson random variables with unknown intensities into a sequence of random variables with near constant variance and a distribution closer to normality. Later Fryzlewicz et al. (2006) introduced the DDHF transform which used a similar multiscale transform but additionally estimated the mean–variance relation as part of the process of stabilization and bringing the distribution closer to normality. See also Fryzlewicz and Delouille (2005). Hence, the DDHF transform can be used where there is a monotone mean–variance relationship but the precise form of the relationship is not known. In other words, DDHFm is ‘distribution-free’ in that the precise data distribution, such as model (1), need not be known or specified. See the Appendix section for further details on the HF and DDHF transforms.

Both the HF and DDHF transforms rely on an input sequence of positive random variables Xi with mean µi and variance Formula 5 with some monotone (non-decreasing) relation between the mean and variance Formula 5. Both HF and DDHF transforms work best when the underlying µi form a piecewise constant sequence. In other words, when consecutive µs are often very close or actually identical in value but large jumps in value are also permitted. However, microarray data are usually not organized in this sequential form. Microarray intensities Yi usually come in replicated blocks: i.e. Formula 5 is the r-th replicate for the i-th gene.

For the i-th gene what we do know is that the underlying intensity Formula 5 for Formula 5 is identical for each replicate r (this is the reason for replication). So, if the intensities for all replicates for a given gene i were laid out into a consecutive sequence we would know that their underlying Formula 5 sequence was constant.

To be able to make efficient use of the DDHF transform we would need to sort our intensities in order of increasing Formula 5 so that the sequence would be as near piecewise constant as possible. In actuality as we do not know the Formula 5 (since that is what we are trying to estimate) we cannot sort the sequence into increasing µ order. So, we do the next best thing in that we order the replicate sets according to their increasing mean observed value where the mean is taken across replicates. The idea is that the observed mean estimates Formula 5 and the observed mean ordering estimates the ‘correct’ true mean ordering. For example, suppose there were four replicates and four genes with observed (raw) intensities.


Rep 1 Rep 2 Rep 3 Rep 4 Means

Gene 1 13 12 13 14 13
Gene 2 10 11 12 11 11
Gene 3 100 102 99 103 101
Gene 4 73 74 74 75 74

Then ordering these replicates according to the means of replicates for each gene (indicated in the last column), and concatenating gives a sequence of

10 11 12 11 13 12 13 14 73 74 74 75 100 102 99 103.

This ordered sequence of intensities within replicate blocks forms the input, denoted Formula 5 in the Appendix section, to the DDHF transform. After transformation any further technique that has been applied previously to variance stabilized and normalized data may be applied here.


    4 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 
Durbin et al. (2002) and Rocke and Durbin (2003) compared the performance of glog with the background-uncorrected log (Log) and the background-corrected log (bcLog) transforms. By considering 18 deterministic µ values, each corresponding to a gene, they simulated Formula 5 with Formula 5 and Formula 5 intensities from the two-component model (1) with parameters Formula 5 and assessed the performance of the methods in terms of the resulting transformed gene intensity variances and skewness coefficients. The two major results of Durbin et al. (2002) state that glog ‘stabilizes the asymptotic variance of microarray data across the full range of the data, as well as making the data more symmetric’ than the other methods under comparison.

In Durbin et al. (2002) though, after simulating the intensities with the parameters mentioned above, the data were subsequently transformed using (5), with the known model parameters ({alpha}, {sigma}{eta}, {sigma}{varepsilon}). This procedure is biased. In practice, the true parameters are not known and have to be estimated, which results in inferior overall variance stabilization performance. Below, we demonstrate this by simulating data from the two-component model and estimating the parameters.

Additionally, in our simulations described next, we also transform our data with the background uncorrected log (Log) method, the log-linear hybrid transform, the spread-versus-level transform and our new DDHFm method. We do not use background corrected log and the started log, because both of them produce negative background corrected intensities, especially for small µs, and we have observed that they result in highly asymmetric data.

4.1 One-color cDNA data acquisition
We simulate from the two-component model (1) with parameters estimated from real cDNA data, obtained from the Stanford Microarray Database (http://smd.stanford.edu/). Two sets of data are considered. The first one comes from McCaffrey et al. (2004) study on mouse cDNA microarrays to investigate gene expression triggered by infection of bone marrow-derived macrophages with cytosol- and vacuole-localized Listeria monocytogenes (Lm). Each gene was replicated four times. The dataset numbers were 40 430, 40 571, 34 905 and 34 912.

The second set comes from Pauli et al. (2006) work to identify genes expressed in the intestine of Caenorhabditis elegans using cDNA microarrays. Student’s t-tests for differential expression were conducted with eight replicates for each gene. The dataset numbers were 36 590, 38 262, 38 265, 39 215, 40 157, 41 833, 41 834 and 41 886.

4.2 Simulations based on McCaffrey et al. (2004) data
We wish to simulate a likely µi signal using our real cDNA data. As in the example of Section 3, we estimate the mean of replicates for each gene from our two datasets. These means are ordered and concatenated in a single vector from which we sample 1024 equispaced values. This sequence of sample means, shown in Figure 1, forms our simulated Formula 5 signal (‘the truth’). This procedure is repeated for both real datasets.


Figure 1
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Simulated µ signal of 1024 genes.

 
From each of the 1024 µi levels we simulate p = 4 replicated raw intensities Yr,i, where r = 1, ..., 4 and i = 1, 2, ..., 1024, using the simdurbin2() function from the DDHFm package which simulates from model (1). To obtain Formula 5, model (1) was considered with parameters Formula 5, Formula 5 and Formula 5 as estimated (and rounded) from the McCaffrey et al. (2004) dataset. These parameters are re-estimated as in Rocke and Durbin (2001), then applied to the transformation methods that require their estimation (glog and Hyb) and the data are subsequently transformed. We iterate the above procedure Formula 5 times, and produce Formula 5 raw intensities, where Formula 5 denotes the r-th replicate of the k-th iterated sequence. Finally, we concatenate the transformed Formula 5 into a single ‘output’ vector for each i, from which we will derive our results. In other words, our output consists of 1024 output vectors Formula of length Formula 5 transformed observations.

The effectiveness of the methods is assessed in terms of adjusted SDs (Formula 5) of the replicated transformed intensities of each Formula 5. Each Formula 5 is computed as follows. The SD, Formula 5, of the stabilized sample of 4000 values is computed for each Formula 5. We noted that each method stabilizes the variance to a different value. So, for each method we compute the mean of Formula 5s over the whole Formula 5 set, denoted as Formula 5, and adjust each Formula 5 by computing Formula 5. In this way the different stabilization methods can be compared directly.

Additionally, we evaluate the Gaussianization properties of each transform by means of D’Agostino–Pearson Formula 5 test for normality (D'Agostino, 1974): the test is appropriate for detecting deviations from normality due to either ‘abnormal’ skewness or kurtosis. Hence, when we subsequently write (not) normal we mean relative to this test. In contrast to the analysis of Durbin et al. (2002) on the means of skewness coefficients over 1000 samples for each µ, we choose this more comprehensive, distribution-based approach.

Figures 2–4 show the variance stabilization results of the transformation methods. Note that ‘glogi’ stands for the generalized logarithm transform with the known (optimal) parameters {alpha}, Formula 5 and Formula 5, while ‘gloge’ is the glog transform with all parameters being estimated. Additionally, ‘Hyb’ = the log-linear hybrid method, ‘Log’ = the background uncorrected log transform, ‘SVL’ = the spread-versus-level transform and, finally, ‘DDHFm’.


Figure 2
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Variance stabilization of glogi (upper panel) and gloge (lower panel) transforms. Dots, Formula 5; crosses, Formula 5. Horizontal line at 1. Each gene is replicated four times.

 
We plot the Formula 5s against the 1024 ‘mean-sorted’ genes of data simulated first from Formula 5 [estimated from McCaffrey et al. (2004) data] and then from Formula 5 in order to show the performance of the methods with different choices of the model parameters. Varying {alpha} and Formula 5 individually in the simulations did not yield different variance stabilization results from the ones reported here.

The more concentrated the Formula 5s are ~1 (the straight line in the figures), the better the stabilization has been performed. Figure 2 evidently shows the superiority of glogi over gloge for both Formula 5 values, indicating the direct effect on variance stabilization when the glog parameters are being estimated. The means of the estimated parameters over the Formula 5 sequences were estimated as Formula 5, Formula 5 and Formula 5. Further analysis has showed that the large difference of the estimate Formula 5 from {alpha}, frequently observed over the k iterations, is the main cause of the degradation in gloge performance.

Figure 3 shows Hyb and log variance stabilization results. Notice that both methods fail to stabilize the adjusted SDs of the transformed intensities and, similarly to gloge, their performance depends on the Formula 5 value: the smaller the Formula 5 gets, the better variance stabilization is achieved. For small Formula 5 though, Log seems to work better than the other two methods.


Figure 3
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Variance stabilization of Hyb (upper panel) and Log (lower panel) transforms. Dots, Formula 5; crosses, Formula 5. Horizontal line at 1. Each gene is replicated four times.

 
In Figure 4 we note that SVL seems to perform well, especially for small Formula 5, but its performance is still inferior to DDHFm. DDHFm clearly outperforms every other method and its variance stabilization results are very similar with those of glogi (but, of course, glogi uses known parameters and cannot be used in practice).


Figure 4
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Variance stabilization of SVL (upper panel) and DDHFm (lower panel) transforms. Dots, Formula 5; crosses, Formula 5. Horizontal line at 1. Each gene is replicated four times.

 
Figures 5 and 6 show the Gaussianization results of SVL and DDHFm, which had the best variance stabilization performances. To produce the respective dotplots, we have estimated the D'Agostino–Pearson Formula 5 p-value for each set of transformed intensities. In the figures we present these 1024 p-values (dots) over the 1024 ‘mean-sorted’ genes. We interpret p-values >0.05 to indicate good Gaussianization and have plotted a horizontal line in the plots to aid interpretation.


Figure 5
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Gaussianization of SVL transform. Upper panel: Formula 5; lower panel: Formula 5. Horizontal line at 5%. Each gene is replicated four times.

 


Figure 6
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Gaussianization of DDHFm transform. Upper panel: Formula 5; lower panel: Formula 5. Horizontal line at 5%. Each gene is replicated four times.

 
We notice that SVL fails to normalize most of the transformed intensities for any Formula 5. At Formula 5, DDHFm normalizes 55% of the transformed intensities but a slight downward trend is apparent, indicating that DDHFm normalization performance degrades as µ gets larger. For Formula 5, though, DDHFm normalizes the 91% of the transformed data with inexistence of a particular trend. DDHF normalizes better than SVL and outperforms the other transforms, owing to its superior variance stabilization properties.

4.3 Simulations based on Pauli et al. (2006) data
We simulate, as before, Formula 5 sequences from Formula 5 genes. Here we replicate each gene Formula 5 times in order to show the performance of selected methods when more replicates are available. We generate the µ signal and then simulate raw intensities from the two component model with parameters Formula 5, Formula 5 and Formula 5 derived from Pauli et al. (2006) cDNA data analysis. We compare gloge, Log, SVL and DDHFm transforms, which for small Formula 5 produced the best results in the previous section.

The top section of Table 1 shows the summary statistics of the adjusted SDs Formula 5 of the transformed data for each method. Better concentration of the Formula 5 around 1 suggests better variance stabilization. We observe that the best performance is achieved by DDHFm with ~3.5 times lower range and four times lower SD from the best competitor (Log transform).


View this table:
[in this window]
[in a new window]

 
Table 1 Summary statistics of the adjusted SDs Formula 5 and K2 p-value (K2) for the various transforms

 
The bottom section of Table 1 shows the Formula 5 p-value summary statistics. Again, DDHFm performs better than any other method. DDHFm also has the first Quantile (Q1) of its p-values distribution >0.05.

4.4 Application to real cDNA data
In this section, we transform the McCaffrey et al. (2004) real cDNA data. The need for data transformation is suggested by a preliminary analysis which indicates that the replicate SD increases with the replicate mean.

We apply DDHFm, Log, SVL and glog transforms to the dataset and compute the adjusted replicate SDs. Ideally, the five sequences of Formula 5 should be as closely concentrated around 1 as possible.

Figure 7 shows the variance stabilization results of the methods. Notice that DDHFm Formula 5s range approximately from 0 to 3.5 (the dotted lines in the lower panel) with estimated SD of Formula 5, Formula 5, while the best competitor glog produces Formula 5s that range from 0 to 3.95 with Formula 5s range from 0 to 5.8 with Formula 5, Formula 5. Log and SVL perform worse than glog (their Formula 5, Formula 5). Since DDHFm produces Formula 5s that are more closely concentrated around one than of any of the competitors, we conclude that this is the best transformation for our dataset.


Figure 7
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 Variance stabilization of glog (Upper panel, black), Log (upper panel, gray), SVL (lower panel, gray) and DDHFm (lower panel, black) transforms. Dashed lines, range of glog (upper panel) and SVL (lower panel) adjusted SDs; dotted lines, range of Log (upper panel) and DDHFm (lower panel) adjusted SDs.

 

    5 CONCLUSIONS AND FURTHER RESEARCH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 
This article has introduced DDHFm, a new method for variance stabilization for replicated intensities that follow a non-decreasing mean–variance relationship. The DDHFm is self-contained and does not require any separate parameter estimation. The DDHFm is also ‘distribution-free’ in the sense that a parametric model for intensities does not need to be pre-specified. Hence, it can be used in situations where there is uncertainty about the precise underlying intensity distribution.

Simulations have shown that DDHFm not only performs very good variance stabilization but also it produces intensities that have distribution much closer to the Gaussian when compared to other established methods.

The superior performance of DDHFm combined with its ability to adapt to a wide range of distributions with non-decreasing mean–variance relationship make it an ideal tool for variance stabilization for microarray data.

This paper has not addressed the separate, but related, issue of calibration (i.e. adapting to the over location and scale of separate slides). This is an issue for DDHFm but to judge from the results on stabilization is not a significant issue. However, it would be possible to use DDHFm in conjunction with a calibration technique in a similar way to the combination of calibration and stabilization available in the vsn package described in Huber et al. (2003). We conjecture that stabilization would be again superior for DDHFm since the use of DDHFm requires somewhat more computational effort than glog type methods. Our future aim is to investigate this more challenging problem as well as develop direct Haar–Fisz methods for calibration.


    APPENDIX
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 
THE DATA-DRIVEN HAAR–FISZ TRANSFORM
Let Formula 5 denote an input vector to the DDHFT. The following list specifies the generic distributional properties of X.

  1. The length n of X must be a power of two. We denote J = log2(n). In practice, if our data is not of length 2J, then we reflect the end of our dataset in a mirror-like fashion so that the ‘padded’ sequence has a length which is a power of two.
  2. Formula 5 must be a sequence of independent, non-negative random variables with finite positive means Formula 5 and finite positive variances Formula 5.
  3. The variance Formula 5 must be a non-decreasing function of the mean Formula 5: we must have Formula 5, where the function h is independent of i.

For example, let Formula 5. In this case, Formula 5 and Formula 5, which yields Formula 5. Naturally, in many practical situations the exact form of h is unknown and needs to be estimated. Below, we describe the Haar–Fisz Transform (HFT) in the cases where h is known and unknown, respectively. (For microarrays the DDHF transform is modified and the Formula 5 are sorted to minimize variation of the function Formula 5, see Section 3.)

We first recall the formula for the Haar transform (HT). The HT is a linear orthogonal transform Formula 5 where Formula 5. Given an input vector Formula 5, the HT is performed as follows:

  1. Let Formula 5.
  2. For each Formula 5, recursively form vectors Formula 5 and Formula 5:

    Formula 5

The operator H, where Formula 5, defines the HT. The inverse HT is performed as follows:

  1. For each Formula 5, recursively form Formula 5:

    Formula 5

  2. Set Formula 5.

The elements of Formula 5 and Formula 5 have a simple interpretation: they can be thought of as ‘smooth’ and ‘detail’ (respectively) of the original vector X at scale Formula 5.

We now introduce the HFT: a multiscale algorithm for (approximately) stabilizing the variance of X and bringing its distribution closer to normality.

The main idea of the HFT is to decompose X using the HT, then ‘Gaussianize’ the coefficients Formula 5 and stabilize their variance, and then apply the inverse HT to obtain a vector which is closer to Gaussianity and has its variance approximately stabilized. We now describe the middle step: the variance stabilization and ‘Gaussianization’ of Formula 5.

Consider first Formula 5. Suppose for now that Formula 5 are identically distributed (i.d.): indeed, this is likely if the underlying mean Formula 5 is, for example, piecewise constant. This implies that Formula 5 is symmetric around zero. We want to stabilize the variance of Formula 5 around Formula 5. To do so, we divide Formula 5 by Formula 5 times its own SD. Using the assumption of independence (item 2, first list of this section above) we have

Formula 5
which gives Formula 5. In practice Formula 5 is unknown and we estimate it locally by Formula 5. The (approximately) variance-stabilized coefficient Formula 5 is given by Formula 5 (where the convention Formula 5 is used).

Turning now to Formula 5, we also first assume that the Formula 5 are i.d. In order to stabilize the variance of Formula 5 around Formula 5, we divide Formula 5 by two times its SD. We have Formula 5 as before, and we estimate Formula 5 locally by Formula 5, which yields an approximately variance-stabilized coefficient Formula 5. Asymptotic Gaussianity and variance stabilization of random variables of a form similar to Formula 5 were studied by Fisz (1955); hence, we label Formula 5 the Fisz coefficients of X, and the whole procedure—the Haar–Fisz transform of X.

We now give the general algorithm for the HFT when the function h is known.

  1. Let Formula 5.
  2. For each Formula 5, recursively form vectors Formula 5 and Formula 5:

    Formula 5

  3. For each Formula 5, recursively modify Formula 5:

    Formula 5

  4. Set Formula 5.

The relation Formula 5 defines a nonlinear, invertible operator Formula 5 which we call the HFT (of X) with link function h.

In practice h is often unknown and needs to be estimated. Since Formula 5, ideally we would wish to estimate h by computing the empirical variances of Formula 5 at points Formula 5, respectively, and then smoothing the observations to obtain an estimate of h. Suppose for the time being that the Formula 5s are known and, as an illustrative example, consider Formula 5. The empirical variance of Formula 5 can be pre-estimated, for example, as Formula 5. Note that on any piecewise constant stretch, our pre-estimate is exactly unbiased. The above discussion motivates the following regression setup:

Formula 5
where Formula 5 and ‘in most cases’ Formula 5. Of course, in practice, the Formula 5s are not known and, since we pre-estimate the variance of Formula 5 using Formula 5 and Formula 5, it also makes sense to pre-estimate Formula 5 by Formula 5. Note that for each Formula 5, we have Formula 5 and Formula 5, which leads to our final regression setup:

Formula 6(6)
In other words, we estimate h from the finest-scale Haar smooth and detail coefficients of Formula 6, where the smooth coefficients serve as pre-estimates of Formula 6 and the squared detail coefficients serve as pre-estimates of Formula 6.

As we restrict h to be a non-decreasing function of {rho}, we choose to estimate it from the regression problem Equation (6) via least-squares isotone regression, using the ‘pool-adjacent-violators’ algorithm described in detail in Johnstone and Silverman (2005, Section 6.3). The resulting estimate, denoted here by Formula 6, is a non-decreasing, piecewise constant function of {rho}.

The DDHFT is performed as above except that Formula 6 replaces h.


    Acknowledgments
 
E.S.M. is the grateful recipient of a Wellcome Prize Studentship awarded to G.A.R. and G.P.N. G.P.N. was partially supported by an EPSRC Advanced Research Fellowship.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on November 4, 2005; revised on July 4, 2006; accepted on July 21, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ESTABLISHED VARIANCE...
 3 DATA-DRIVEN HAAR-FISZ...
 4 RESULTS
 5 CONCLUSIONS AND FURTHER...
 APPENDIX
 REFERENCES
 

    Alwin, J.C., et al. (1977) Methods for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc. Natl Acad. Sci. USA, 74, 5350–5354[Abstract/Free Full Text].

    Archer, K.J., et al. (2004) Graphical technique for identifying a monotonic variance stabilizing transformation for absolute gene intensity signals. BMC Bioinformatics, 5, 60[CrossRef][Medline].

    Baird, D., et al. (2004) Normalization of microarray data using a spatial mixed model analysis which includes splines. Bioinformatics, 20, 3196–3205[Abstract/Free Full Text].

    Box, G.E.P. and Cox, D.R. (1964) An analysis of transformations. J. R. Stat. Soc. B, 26, 211–252.

    Comander, J., et al. (2004) Improving the statistical detection of regulated genes from microarray data using intensity-based variance estimation. BMC Genomics, 5, 17[CrossRef][Medline].

    Cui, X., et al. (2003) Transformations for cDNA microarray data. Stat. App. Gen. Mol. Biol, . 2, 4.

    D'Agostino, R.B. (1971) An omnibus test of normality for moderate and large size samples. Biometrika, 58, 341–348[Abstract/Free Full Text].

    Delmar, P., et al. (2005a) Mixture model on the variance for the differential analysis of gene expression data. J. R. Stat. Soc. C, 54, 31–50[CrossRef].

    Delmar, P., et al. (2005b) VarMixt: efficient variance modelling for the differential analysis of replicated gene expression data. Bioinformatics, 21, 502–508[Abstract/Free Full Text].

    Durbin, B.P. and Rocke, D.M. (2003) Estimation of transformation parameters for microarray data. Bioinformatics, 19, 1360–1367[Abstract/Free Full Text].

    Durbin, B.P., et al. (2002) A variance-stabilizing transformation for gene expression microarray data. Bioinformatics, 18, S105–S110[Abstract].

    Fisz, M. (1955) The limiting distribution of a function of two independent random variables and its statistical application. Colloquium Mathematicum, 3, 138–146[Medline].

    Fryzlewicz, P. and Delouille, V. (2006) A data-driven Haar–Fisz transform for multiscale variance stabilization. Proceedings of the 13th IEEE Workshop on Statistical Signal Processing (to appear).

    Fryzlewicz, P., Delouille, V., Nason, G.P. (2005) GOES-8 X-ray sensor variance stabilization using the multiscale data-driven Haar–Fisz transform. Technical Report 05:06 Statistics Group, Department of Mathematics, University of Bristol, UK.,.

    Fryzlewicz, P. and Nason, G.P. (2004) A Haar–Fisz algorithm for Poisson intensity estimation. J. Comput. Graph. Stat, . 13, 621–638[CrossRef].

    Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., Soper, K. (2001) Statistical analysis of high density oligonucleotide arrays: a SAFER approach. Proceedings of the GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data19 NovemberBethesda, MD.

    Hoyle, D.C., et al. (2002) Making sense of microarray data distributions. Bioinformatics, 18, 576–584[Abstract/Free Full Text].

    Hsiao, A., et al. (2004) Variance-modelled posterior inference of microarray data: detecting gene-expression changes in 3T3-L1 adipocytes. Bioinformatics, 20, 3108–3127[Abstract/Free Full Text].

    Huber, W., et al. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, S96–S104[Abstract].

    Huber, W., et al. (2003) Parameter estimation for the calibration and variance stabilization of microarray data. Stat. Appl. Gen. Mol. Biol, . 2, Article 3.

    Johnstone, I.M. and Silverman, B.W. (2005) EbayesThresh: R programs for empirical Bayes thresholding. J. Stat. Softw, . 12, 1–38.

    McCaffrey, R.L., et al. (2004) A specific gene expression program triggered by Gram-positive bacteria in the cytocol. Proc. Natl Acad. Sci. USA, 101, 11386–11391[Abstract/Free Full Text].

    Munson, P. (2001) A ‘consistency’ test for determining the significance of gene expression changes on replicate samples and two-convenient variance-stabilizing transformations. Proceedings of the GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data19 NovemberBethesda, MD.

    Pauli, F., et al. (2006) Chromosomal clustering and GATA transcriptional regulation of intestine-expressed genes in C.elegans. Development, 133, 287–295[Abstract/Free Full Text].

    Rocke, D.M. and Durbin, B.P. (2001) A model for measurement error for gene expression arrays. J. Comput. Biol, . 8, 557–569[CrossRef][ISI][Medline].

    Rocke, D.M. and Durbin, B.P. (2003) Approximate variance-stabilizing transformations for gene expression microarray data. Bioinformatics, 19, 966–972[Abstract/Free Full Text].

    Sebastiani, P. and Ramoni, M. (2003) Statistical Challenges in Functional Genomics. Statist. Sci, . 18, 33–70[CrossRef].

    Smyth, G.K., Yang, Y.H., Speed, T. (2003) Statistical issues in cDNA Microarray data analysis. In Brownstein, M.J. and Khodursky, A. (Eds.). Functional Genomics: Methods and Protocols, Methods of Molecular Biology, , Totowa, NJ Humana Press Vol. 224, , pp. 111–136.

    Tukey, J.W. Exploratory Data Analysis, (1977) , MA Addison-Wesley, Reading.

    Tusher, V., et al. (2001) Significance analysis of microarrays applied to ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    Velculescu, V.E., et al. (1995) Serial analysis of gene expression. Science, 270, 484–487[Abstract/Free Full Text].

    Wang, S. and Ethier, S. (2004) A generalized likelihood ratio test to identify differentially expressed genes from microarray data. Bioinformatics, 20, 100–104[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/20/2547    most recent
btl412v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Motakis, E. S.
Right arrow Articles by Rutter, G. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Motakis, E. S.
Right arrow Articles by Rutter, G. A.
Social Bookmarking
 Add to CiteULike