Bioinformatics Advance Access originally published online on July 28, 2006
Bioinformatics 2006 22(20):2547-2553; doi:10.1093/bioinformatics/btl412
Variance stabilization and normalization for one-color microarray data using a data-driven multiscale approach
1 Department of Mathematics, University of Bristol, Bristol, UK
2 Department of Biochemistry, University of Bristol, Bristol, UK
*To whom correspondence should be addressed.
ABSTRACT
Motivation: Many standard statistical techniques are effective on data that are normally distributed with constant variance. Microarray data typically violate these assumptions since they come from non-Gaussian distributions with a non-trivial mean–variance relationship. Several methods have been proposed that transform microarray data to stabilize variance and draw its distribution towards the Gaussian. Some methods, such as log or generalized log, rely on an underlying model for the data. Others, such as the spread-versus-level plot, do not. We propose an alternative data-driven multiscale approach, called the Data-Driven Haar–Fisz for microarrays (DDHFm) with replicates. DDHFm has the advantage of being distribution-free in the sense that no parametric model for the underlying microarray data is required to be specified or estimated; hence, DDHFm can be applied very generally, not just to microarray data.
Results: DDHFm achieves very good variance stabilization of microarray data with replicates and produces transformed intensities that are approximately normally distributed. Simulation studies show that it performs better than other existing methods. Application of DDHFm to real one-color cDNA data validates these results.
Availability: The R package of the Data-Driven Haar–Fisz transform (DDHFm) for microarrays is available in Bioconductor and CRAN.
Contact: g.p.nason{at}bristol.ac.uk
1 INTRODUCTION
Microarrays, in principle and in practice, are extensions of hybridization-based methods (Southern blots, northern blots, SAGE, etc.), which have been used for decades to identify and locate mRNA and DNA sequences that are complementary to a segment of DNA (Alwin et al., 1977; Velculescu et al., 1995). Microarray technology, in the form of either cDNA or high-density oligonucleotide arrays enables molecular biologists to measure simultaneously the expression level of thousands of genes. In a typical microarray experiment the aim is to compare different cell types, e.g. normal versus diseased cells, in order to identify genes that are differentially expressed in the two cell types.
Typically, microarray data analyses consist of several steps ranging from experimental design to the identification of important genes (for a review on the whole process see Sebastiani and Ramoni, 2003). Gene replication is a crucial design feature as it increases the precision of estimation and permits estimation of measurement variance which enables the significance of the final results to be judged.
Rocke and Durbin (2001) identified that the variance of the raw spot intensities increased with their mean and they modeled those intensities in terms of the two-component model:
$$Y_i = \alpha + \mu_i e^{\eta_i} + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1}$$
where $Y_i$ are the raw single-color intensities for the n genes, each assumed to be replicated p times. Sometimes we will write $Y_{r,i}$ when we are referring to the r-th replicate on the i-th gene ($r = 1, \ldots, p$). The $\alpha$ term represents the (common) mean background noise of the n genes on the array, $\mu_i$ is the true expression level for gene i, and $\eta_i$ and $\varepsilon_i$ are the normally distributed error terms with zero mean and variances $\sigma_\eta^2$ and $\sigma_\varepsilon^2$, respectively. In this way, $Y_i$ can be considered as coming from an inhomogeneous process that produces the n gene intensities with finite but different $\mu_i$s and finite but different variances.
At low expression levels (i.e. $\mu_i$ close to 0) the measured expression $Y_i$ in (1) can be written as $Y_i \approx \alpha + \varepsilon_i$, so that $Y_i$ is approximately distributed as $N(\alpha, \sigma_\varepsilon^2)$. On the other hand, for large $\mu_i$s, the middle term in (1) dominates and $Y_i$ can be modeled as follows:
$$Y_i \approx \mu_i e^{\eta_i}, \tag{2}$$
with variance
$$\mathrm{Var}(Y_i) \approx \mu_i^2 S_\eta^2, \tag{3}$$
where $S_\eta^2 = e^{\sigma_\eta^2}(e^{\sigma_\eta^2} - 1)$. For moderate values of $\mu_i$, $Y_i$ is modeled as in (1) with variance:
$$\mathrm{Var}(Y_i) = \mu_i^2 S_\eta^2 + \sigma_\varepsilon^2. \tag{4}$$
From (3) and (4), we observe that the SD of the $Y_i$ increases linearly with their mean. Such mean–variance dependence, implying the presence of heteroscedastic intensities, is a major problem in the statistical analysis of microarrays.
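The mean–variance relationship can be checked numerically. The following is a minimal sketch, assuming the two-component model takes the Rocke and Durbin (2001) form Y = α + μ·exp(η) + ε with independent zero-mean normal errors; the parameter values and the function names are illustrative only and are not taken from the article.

```python
import math
import random
import statistics


def simulate_two_component(mu, alpha, sigma_eta, sigma_eps, p, rng):
    """Draw p replicate intensities Y = alpha + mu * exp(eta) + eps."""
    return [
        alpha + mu * math.exp(rng.gauss(0.0, sigma_eta)) + rng.gauss(0.0, sigma_eps)
        for _ in range(p)
    ]


def model_variance(mu, sigma_eta, sigma_eps):
    """Variance implied by the model: mu^2 * S_eta^2 + sigma_eps^2."""
    s_eta_sq = math.exp(sigma_eta ** 2) * (math.exp(sigma_eta ** 2) - 1.0)
    return mu ** 2 * s_eta_sq + sigma_eps ** 2


rng = random.Random(0)
alpha, sigma_eta, sigma_eps = 100.0, 0.2, 10.0  # hypothetical values
for mu in (0.0, 100.0, 1000.0):
    sample = simulate_two_component(mu, alpha, sigma_eta, sigma_eps, 20000, rng)
    # Empirical variance next to the theoretical one for each expression level.
    print(mu, statistics.pvariance(sample), model_variance(mu, sigma_eta, sigma_eps))
```

For small μ the variance is essentially the additive noise variance, while for large μ it grows with μ², which is the heteroscedasticity discussed above.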
Two methodological approaches have been followed to account for the heteroscedasticity. The first approach involves the estimation of differentially expressed genes directly from the heteroscedastic data by means of penalized t-statistics (e.g. the SAM method of Tusher et al., 2001), mixed or hierarchical Bayesian modeling (e.g. Baird et al., 2004; Hsiao et al., 2004), appropriate maximum-likelihood tests (e.g. Wang and Ethier, 2004) and, recently, gene grouping schemes (e.g. Comander et al., 2004; Delmar et al., 2005a,b). The second approach, which we follow in this article, involves finding appropriate transformations that stabilize the variance of the data. After variance stabilization the data can be analyzed by standard, simple and universally accepted tools, such as ANOVA models.
Section 2 outlines some existing variance stabilizing transforms that have been applied to microarray data. Section 3 proposes a new method called the Data-Driven Haar–Fisz transform for microarrays (DDHFm) and compares its performance with existing methods by means of simulated and real cDNA data in Section 4. We show that DDHFm is superior to existing methods in terms of variance stabilization and Gaussianization of the transformed intensities.
2 ESTABLISHED VARIANCE STABILIZATION METHODS
For brevity we discuss and compare the performance of different variance stabilization techniques without, at this stage, worrying about differential expression. For this reason we consider data obtained from one-color microarrays. Generalization to two-color experiments will be considered in future work.
2.1 Log-based transformations
Smyth et al. (2003) suggest using the log transform for microarray intensities. By assuming that the lognormal distribution is an extremely good approximation to the bulk of the data (Hoyle et al., 2002) as in model (2), the log transform $\log(Y_i)$ should stabilize the variance of the gene intensities and bring their distribution closer to the Gaussian. An extension of this approach then considers background corrected intensities, $z_i = Y_i - \hat{\alpha}$, which may be negative and cannot be handled by the simple log function. Based on this notion, several authors have studied alternative logarithmic-based transformations for microarray data.
Tukey (1977) defines the Started Log transformation (sLog) as $\mathrm{sLog}(z_i) = \log(z_i + k)$, where k is a positive constant chosen so that it minimizes the deviation from variance constancy. Alternatively, Holder et al. (2001) developed the log-linear hybrid transformation (Hyb) as $\mathrm{Hyb}_k(z_i) = z_i/k + \log(k) - 1$, for $z_i \le k$ and $\mathrm{Hyb}_k(z_i) = \log(z_i)$, for $z_i > k$. This transformation has also been called Linlog by Cui et al. (2003). As with sLog, the optimal k is chosen to minimize the deviation from variance constancy.
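As an illustration of the two log-based transforms, here is a minimal sketch assuming their usual textbook forms: the started log as log(z + k), and the log-linear hybrid as z/k + log(k) − 1 below the knot k and log(z) above it. The function names and the value of k are hypothetical; how k is estimated is not shown.

```python
import math


def started_log(z, k):
    """Started log: shift the background-corrected intensity by k before logging."""
    return math.log(z + k)


def log_linear_hybrid(z, k):
    """Linear below the knot k, logarithmic above it; continuous at z = k."""
    if z <= k:
        return z / k + math.log(k) - 1.0
    return math.log(z)


k = 10.0  # hypothetical constant, for illustration only
for z in (-5.0, 0.0, 10.0, 1000.0):
    print(z, started_log(z, k), log_linear_hybrid(z, k))
```

Both functions accept negative background-corrected intensities (as long as z + k > 0 for the started log), which the plain log transform cannot do.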
2.2 The generalized logarithm transformation (glog)
Munson (2001), Durbin et al. (2002) and Huber et al. (2002) independently developed the generalized logarithm transformation (referred to as glog in Rocke and Durbin, 2003). For data that come from model (1) with the mean–variance dependence (4), glog is assumed to produce symmetric transformed gene intensities with stabilized variance. The glog formula is:
$$\mathrm{glog}(Y_i) = \log\left( (Y_i - \alpha) + \sqrt{(Y_i - \alpha)^2 + c} \right), \tag{5}$$
where $c = \sigma_\varepsilon^2 / S_\eta^2$ and $S_\eta^2 = e^{\sigma_\eta^2}(e^{\sigma_\eta^2} - 1)$. Rocke and Durbin (2001) described algorithms to estimate $\alpha$ and c from one-color cDNA data. Although estimation of $\alpha$ can be conducted without replicated genes, estimation of c involves estimation of $\sigma_\eta$, which requires replication. Maximum-likelihood methods for c estimation only, based on Box and Cox (1964), were also developed by Durbin and Rocke (2003) for the case of two-color microarrays and thus they are not relevant to the present work.
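A minimal sketch of the glog transform follows, assuming the common form log(z + sqrt(z² + c)) applied to the background-corrected intensity z = Y − α. The parameter values are hypothetical; in practice α and c would have to be estimated.

```python
import math


def glog(y, alpha, c):
    """Generalized log of a raw intensity y with background alpha and constant c > 0."""
    z = y - alpha
    return math.log(z + math.sqrt(z * z + c))


alpha, c = 100.0, 400.0  # hypothetical parameter values
for y in (50.0, 100.0, 500.0, 1.0e6):
    print(y, glog(y, alpha, c))
```

The transform is defined for negative z, equals log(c)/2 at z = 0, and behaves like log(2z) for large z, so it interpolates between a linear and a logarithmic regime.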
2.3 Spread-versus-level plot transformation (SVL)
Archer et al. (2004) describe a different variance stabilization approach based on plotting the log-median of the replicated intensities on the x-axis (level) against the log of their fourth-spread (a variant of the interquartile range) on the y-axis (spread). The estimated slope of the subsequent linear regression fit then indicates the appropriate Box–Cox power transformation.
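The spread-versus-level idea can be sketched as follows. This is an illustration under two assumptions that are not stated in the text: the fourth-spread is approximated by the interquartile range from Python's `statistics.quantiles`, and the power is taken as one minus the fitted slope, the usual spread-versus-level rule (a power of 0 is read as the log transform). Function names are hypothetical.

```python
import math
import statistics


def fourth_spread(values):
    """Approximate the fourth-spread by the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def svl_power(replicates_by_gene):
    """Regress log(spread) on log(median); return (slope, suggested power)."""
    xs = [math.log(statistics.median(v)) for v in replicates_by_gene]
    ys = [math.log(fourth_spread(v)) for v in replicates_by_gene]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, 1.0 - slope


# Toy data whose spread is proportional to the level, so the slope is 1.
genes = [[0.8 * m, 0.9 * m, 1.1 * m, 1.2 * m] for m in (10.0, 50.0, 200.0, 1000.0)]
slope, power = svl_power(genes)
print(slope, power)
```

In this toy case the suggested power is 0, i.e. the log transform, which is what one expects when the SD is proportional to the mean.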
3 DATA-DRIVEN HAAR–FISZ TRANSFORMATION FOR MICROARRAYS
This section describes how the recent DDHF transform can be adapted for use with microarray data. Our adaptation requires a subtle organization of microarray intensities into a form acceptable for the DDHF transform. We call our adaptation the DDHF transform for microarray data, or DDHFm.
Recently, a new class of variance stabilization transforms, generically known as Haar–Fisz (HF) transforms, was introduced by Fryzlewicz and Nason (2004). In that work the HF transform used a multiscale technique to take sequences of Poisson random variables with unknown intensities into a sequence of random variables with near constant variance and a distribution closer to normality. Later Fryzlewicz et al. (2006) introduced the DDHF transform which used a similar multiscale transform but additionally estimated the mean–variance relation as part of the process of stabilization and bringing the distribution closer to normality. See also Fryzlewicz and Delouille (2005). Hence, the DDHF transform can be used where there is a monotone mean–variance relationship but the precise form of the relationship is not known. In other words, DDHFm is distribution-free in that the precise data distribution, such as model (1), need not be known or specified. See the Appendix section for further details on the HF and DDHF transforms.
Both the HF and DDHF transforms rely on an input sequence of positive random variables $X_i$ with mean $\mu_i$ and variance $\sigma_i^2$ with some monotone (non-decreasing) relation between the mean and variance, $\sigma_i^2 = h(\mu_i)$. Both HF and DDHF transforms work best when the underlying $\mu_i$ form a piecewise constant sequence, in other words, when consecutive $\mu_i$s are often very close or actually identical in value but large jumps in value are also permitted. However, microarray data are usually not organized in this sequential form. Microarray intensities $Y_i$ usually come in replicated blocks: i.e. $Y_{r,i}$ is the r-th replicate for the i-th gene.
For the i-th gene what we do know is that the underlying intensity $\mu_i$ for $Y_{r,i}$ is identical for each replicate r (this is the reason for replication). So, if the intensities for all replicates for a given gene i were laid out into a consecutive sequence we would know that their underlying $\mu$ sequence was constant.
To be able to make efficient use of the DDHF transform we would need to sort our intensities in order of increasing $\mu_i$ so that the sequence would be as near piecewise constant as possible. In actuality, as we do not know the $\mu_i$ (since that is what we are trying to estimate), we cannot sort the sequence into increasing $\mu$ order. So, we do the next best thing: we order the replicate sets according to their increasing mean observed value, where the mean is taken across replicates. The idea is that the observed mean estimates $\mu_i$ and the observed mean ordering estimates the correct true mean ordering. For example, suppose there were four replicates and four genes with observed (raw) intensities.
[Table: observed intensities for four genes with four replicates each; the last column gives the mean across replicates.]
Then ordering these replicates according to the means of replicates for each gene (indicated in the last column), and concatenating gives a sequence of
10 11 12 11 13 12 13 14 73 74 74 75 100 102 99 103.
This ordered sequence of intensities within replicate blocks forms the input, denoted $X$ in the Appendix section, to the DDHF transform. After transformation any further technique that has been applied previously to variance stabilized and normalized data may be applied here.
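The ordering step can be sketched in a few lines. The replicate values below are those of the sequence quoted above; the order in which the genes are listed before sorting is hypothetical, since only the sorted sequence is given, and the function name is illustrative.

```python
from statistics import mean


def order_replicate_blocks(blocks):
    """Sort replicate blocks by their mean and concatenate them into one sequence."""
    ordered = sorted(blocks, key=mean)
    return [y for block in ordered for y in block]


# One list of four replicates per gene (gene order here is an assumption).
blocks = [
    [73, 74, 74, 75],
    [13, 12, 13, 14],
    [100, 102, 99, 103],
    [10, 11, 12, 11],
]
sequence = order_replicate_blocks(blocks)
print(sequence)
```

Note that only the blocks are reordered; the replicates within each block keep their original order, which is why the output is not monotone within a block.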
4 RESULTS
Durbin et al. (2002) and Rocke and Durbin (2003) compared the performance of glog with the background-uncorrected log (Log) and the background-corrected log (bcLog) transforms. By considering 18 deterministic µ values, each corresponding to a gene, they simulated intensities from the two-component model (1) and assessed the performance of the methods in terms of the resulting transformed gene intensity variances and skewness coefficients. The two major results of Durbin et al. (2002) state that glog stabilizes the asymptotic variance of microarray data across the full range of the data, as well as making the data more symmetric than the other methods under comparison.
In Durbin et al. (2002) though, after simulating the intensities with the parameters mentioned above, the data were subsequently transformed using (5), with the known model parameters ($\alpha$, $\sigma_\eta$, $\sigma_\varepsilon$). This procedure is biased. In practice, the true parameters are not known and have to be estimated, which results in inferior overall variance stabilization performance. Below, we demonstrate this by simulating data from the two-component model and estimating the parameters.
Additionally, in our simulations described next, we also transform our data with the background uncorrected log (Log) method, the log-linear hybrid transform, the spread-versus-level transform and our new DDHFm method. We do not use background corrected log and the started log, because both of them produce negative background corrected intensities, especially for small µs, and we have observed that they result in highly asymmetric data.
4.1 One-color cDNA data acquisition
We simulate from the two-component model (1) with parameters estimated from real cDNA data, obtained from the Stanford Microarray Database (http://smd.stanford.edu/). Two sets of data are considered. The first one comes from McCaffrey et al. (2004) study on mouse cDNA microarrays to investigate gene expression triggered by infection of bone marrow-derived macrophages with cytosol- and vacuole-localized Listeria monocytogenes (Lm). Each gene was replicated four times. The dataset numbers were 40 430, 40 571, 34 905 and 34 912.
The second set comes from the work of Pauli et al. (2006) to identify genes expressed in the intestine of Caenorhabditis elegans using cDNA microarrays. Student's t-tests for differential expression were conducted with eight replicates for each gene. The dataset numbers were 36 590, 38 262, 38 265, 39 215, 40 157, 41 833, 41 834 and 41 886.
4.2 Simulations based on McCaffrey et al. (2004) data
We wish to simulate a likely $\mu_i$ signal using our real cDNA data. As in the example of Section 3, we estimate the mean of replicates for each gene from our two datasets. These means are ordered and concatenated in a single vector from which we sample 1024 equispaced values. This sequence of sample means, shown in Figure 1, forms our simulated $\mu_i$ signal (the truth). This procedure is repeated for both real datasets.
From each of the 1024 $\mu_i$ levels we simulate p = 4 replicated raw intensities $Y_{r,i}$, where r = 1, ..., 4 and i = 1, 2, ..., 1024, using the simdurbin2() function from the DDHFm package which simulates from model (1). To obtain $Y_{r,i}$, model (1) was considered with parameters $\alpha$, $\sigma_\eta$ and $\sigma_\varepsilon$ as estimated (and rounded) from the McCaffrey et al. (2004) dataset. These parameters are re-estimated as in Rocke and Durbin (2001), then applied to the transformation methods that require their estimation (glog and Hyb) and the data are subsequently transformed. We iterate the above procedure 1000 times, and produce $Y_{r,i}^{k}$ raw intensities, where $Y_{r,i}^{k}$ denotes the r-th replicate of the k-th iterated sequence. Finally, we concatenate the transformed $Y_{r,i}^{k}$ into a single output vector for each i, from which we will derive our results. In other words, our output consists of 1024 output vectors, each of 4 × 1000 = 4000 transformed observations.
The effectiveness of the methods is assessed in terms of adjusted SDs ($\tilde{s}_i$) of the replicated transformed intensities of each $\mu_i$. Each $\tilde{s}_i$ is computed as follows. The SD, $s_i$, of the stabilized sample of 4000 values is computed for each $\mu_i$. We noted that each method stabilizes the variance to a different value. So, for each method we compute the mean of the $s_i$s over the whole $\mu_i$ set, denoted as $\bar{s}$, and adjust each $s_i$ by computing $\tilde{s}_i = s_i / \bar{s}$. In this way the different stabilization methods can be compared directly.
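The adjustment can be sketched as below. This assumes the adjusted SD is the per-gene SD divided by the mean SD over all genes (so that perfect stabilization gives values equal to 1), and uses the sample SD; whether the sample or population SD was used is not stated. The function name and toy data are hypothetical.

```python
import statistics


def adjusted_sds(transformed_by_gene):
    """Divide each gene's SD of transformed values by the mean SD over all genes."""
    sds = [statistics.stdev(values) for values in transformed_by_gene]
    mean_sd = statistics.mean(sds)
    return [s / mean_sd for s in sds]


# Toy example: two genes, the second twice as variable as the first.
adj = adjusted_sds([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
print(adj)
```

By construction the adjusted SDs average to 1 for every method, so only their spread around 1 differs between methods.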
Additionally, we evaluate the Gaussianization properties of each transform by means of the D'Agostino–Pearson $K^2$ test for normality (D'Agostino, 1971): the test is appropriate for detecting deviations from normality due to either abnormal skewness or kurtosis. Hence, when we subsequently write (not) normal we mean relative to this test. In contrast to the analysis of Durbin et al. (2002) on the means of skewness coefficients over 1000 samples for each µ, we choose this more comprehensive, distribution-based approach.
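For readers who want to see what the omnibus statistic combines, here is a sketch of a $K^2$ computation. It assumes the widely used construction in which a skewness z-score (D'Agostino's approximation) and a kurtosis z-score (the Anscombe–Glynn approximation) are squared and summed, with the p-value taken from a chi-square distribution with 2 degrees of freedom. This is not necessarily the exact variant used in the article, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def k2_test(x):
    """Return (K2, p_value) for an omnibus skewness/kurtosis normality test."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2

    # z-score for the sample skewness.
    y = skew * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1.0 + math.sqrt(2.0 * (beta2 - 1.0))
    delta = 1.0 / math.sqrt(0.5 * math.log(w2))
    a = math.sqrt(2.0 / (w2 - 1.0))
    z_skew = delta * math.asinh(y / a)

    # z-score for the sample kurtosis.
    mean_b2 = 3.0 * (n - 1) / (n + 1)
    var_b2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    xk = (kurt - mean_b2) / math.sqrt(var_b2)
    sqrt_beta1 = (6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9))
                  * math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    big_a = 6.0 + 8.0 / sqrt_beta1 * (2.0 / sqrt_beta1
                                      + math.sqrt(1.0 + 4.0 / sqrt_beta1 ** 2))
    denom = 1.0 + xk * math.sqrt(2.0 / (big_a - 4.0))
    cube_root = math.copysign(abs((1.0 - 2.0 / big_a) / denom) ** (1.0 / 3.0), denom)
    z_kurt = (1.0 - 2.0 / (9.0 * big_a) - cube_root) / math.sqrt(2.0 / (9.0 * big_a))

    k2 = z_skew ** 2 + z_kurt ** 2
    return k2, math.exp(-k2 / 2.0)  # survival function of chi-square with 2 df


n = 500
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]  # normal quantiles
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]            # exponential quantiles
print(k2_test(normal_like), k2_test(skewed))
```

The approximations are intended for moderate to large samples (roughly n ≥ 20), which is comfortably met by the 4000 transformed values per gene used here.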
Figures 2–4 show the variance stabilization results of the transformation methods. Note that glogi stands for the generalized logarithm transform with the known (optimal) parameters $\alpha$, $\sigma_\eta$ and $\sigma_\varepsilon$, while gloge is the glog transform with all parameters being estimated. Additionally, Hyb = the log-linear hybrid method, Log = the background uncorrected log transform, SVL = the spread-versus-level transform and, finally, DDHFm.
We plot the
s against the 1024 mean-sorted genes of data simulated first from
[estimated from McCaffrey et al. (2004) data] and then from
in order to show the performance of the methods with different choices of the model parameters. Varying
and
individually in the simulations did not yield different variance stabilization results from the ones reported here.
The more concentrated the adjusted SDs are around 1 (the straight line in the figures), the better the stabilization has been performed. Figure 2 evidently shows the superiority of glogi over gloge for both
values, indicating the direct effect on variance stabilization when the glog parameters are being estimated. The means of the estimated parameters over the
sequences were estimated as
,
and
. Further analysis has shown that the large difference of the estimate
from
, frequently observed over the k iterations, is the main cause of the degradation in gloge performance.
Figure 3 shows Hyb and log variance stabilization results. Notice that both methods fail to stabilize the adjusted SDs of the transformed intensities and, similarly to gloge, their performance depends on the
value: the smaller the
gets, the better variance stabilization is achieved. For small
though, Log seems to work better than the other two methods.
In Figure 4 we note that SVL seems to perform well, especially for small
, but its performance is still inferior to DDHFm. DDHFm clearly outperforms every other method and its variance stabilization results are very similar to those of glogi (but, of course, glogi uses known parameters and cannot be used in practice).
Figures 5 and 6 show the Gaussianization results of SVL and DDHFm, which had the best variance stabilization performances. To produce the respective dotplots, we have estimated the D'Agostino–Pearson $K^2$ p-value for each set of transformed intensities. In the figures we present these 1024 p-values (dots) over the 1024 mean-sorted genes. We interpret p-values >0.05 to indicate good Gaussianization and have plotted a horizontal line in the plots to aid interpretation.
We notice that SVL fails to normalize most of the transformed intensities for any
. At
, DDHFm normalizes 55% of the transformed intensities but a slight downward trend is apparent, indicating that DDHFm normalization performance degrades as µ gets larger. For
, though, DDHFm normalizes 91% of the transformed data, with no apparent trend. DDHFm normalizes better than SVL and outperforms the other transforms, owing to its superior variance stabilization properties.
4.3 Simulations based on Pauli et al. (2006) data
We simulate, as before,
sequences from
genes. Here we replicate each gene
times in order to show the performance of selected methods when more replicates are available. We generate the µ signal and then simulate raw intensities from the two component model with parameters
,
and
derived from Pauli et al. (2006) cDNA data analysis. We compare gloge, Log, SVL and DDHFm transforms, which for small
produced the best results in the previous section.
The top section of Table 1 shows the summary statistics of the adjusted SDs $\tilde{s}_i$ of the transformed data for each method. Better concentration of the $\tilde{s}_i$ around 1 suggests better variance stabilization. We observe that the best performance is achieved by DDHFm, with approximately 3.5 times lower range and four times lower SD than the best competitor (Log transform).
The bottom section of Table 1 shows the $K^2$ p-value summary statistics. Again, DDHFm performs better than any other method. DDHFm also has the first quartile (Q1) of its p-value distribution above 0.05.
4.4 Application to real cDNA data
In this section, we transform the McCaffrey et al. (2004) real cDNA data. The need for data transformation is suggested by a preliminary analysis which indicates that the replicate SD increases with the replicate mean.
We apply DDHFm, Log, SVL and glog transforms to the dataset and compute the adjusted replicate SDs. Ideally, the five sequences of adjusted SDs should be as closely concentrated around 1 as possible.
Figure 7 shows the variance stabilization results of the methods. Notice that DDHFm
s range approximately from 0 to 3.5 (the dotted lines in the lower panel) with estimated SD of
,
, while the best competitor glog produces
s that range from 0 to 3.95 with
s range from 0 to 5.8 with
,
. Log and SVL perform worse than glog (their
,
). Since DDHFm produces adjusted SDs that are more closely concentrated around one than those of any of the competitors, we conclude that this is the best transformation for our dataset.
5 CONCLUSIONS AND FURTHER RESEARCH
This article has introduced DDHFm, a new method for variance stabilization for replicated intensities that follow a non-decreasing mean–variance relationship. The DDHFm is self-contained and does not require any separate parameter estimation. The DDHFm is also distribution-free in the sense that a parametric model for intensities does not need to be pre-specified. Hence, it can be used in situations where there is uncertainty about the precise underlying intensity distribution.
Simulations have shown that DDHFm not only achieves very good variance stabilization but also produces intensities whose distribution is much closer to the Gaussian than that produced by other established methods.
The superior performance of DDHFm, combined with its ability to adapt to a wide range of distributions with a non-decreasing mean–variance relationship, makes it an ideal tool for variance stabilization for microarray data.
This paper has not addressed the separate, but related, issue of calibration (i.e. adapting to the overall location and scale of separate slides). This is an issue for DDHFm but, to judge from the results on stabilization, not a significant one. However, it would be possible to use DDHFm in conjunction with a calibration technique in a similar way to the combination of calibration and stabilization available in the vsn package described in Huber et al. (2003). We conjecture that stabilization would again be superior for DDHFm, although the use of DDHFm requires somewhat more computational effort than glog-type methods. Our future aim is to investigate this more challenging problem as well as to develop direct Haar–Fisz methods for calibration.
APPENDIX
THE DATA-DRIVEN HAAR–FISZ TRANSFORM
Let $X = (X_i)_{i=1}^{n}$ denote an input vector to the DDHFT. The following list specifies the generic distributional properties of X.
- The length n of X must be a power of two. We denote $J = \log_2(n)$. In practice, if our data is not of length $2^J$, then we reflect the end of our dataset in a mirror-like fashion so that the padded sequence has a length which is a power of two.
- $(X_i)_{i=1}^{n}$ must be a sequence of independent, non-negative random variables with finite positive means $\mu_i = E(X_i) > 0$ and finite positive variances $\sigma_i^2 = \mathrm{Var}(X_i) > 0$.
- The variance $\sigma_i^2$ must be a non-decreasing function of the mean $\mu_i$: we must have $\sigma_i^2 = h(\mu_i)$, where the function h is independent of i.
For example, let $X_i \sim \mathrm{Pois}(\lambda_i)$. In this case, $\mu_i = \lambda_i$ and $\sigma_i^2 = \lambda_i$, which yields $h(x) = x$. Naturally, in many practical situations the exact form of h is unknown and needs to be estimated. Below, we describe the Haar–Fisz Transform (HFT) in the cases where h is known and unknown, respectively. (For microarrays the DDHF transform is modified and the $X_i$ are sorted to minimize variation of the function $\mu_i$, see Section 3.)
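We first recall the formula for the Haar transform (HT). The HT is a linear orthogonal transform $\mathbb{R}^n \to \mathbb{R}^n$ where $n = 2^J$. Given an input vector $X = (X_i)_{i=1}^{n}$, the HT is performed as follows:
- Let $s^J_i = X_i$ for $i = 1, \ldots, n$.
- For each $j = J-1, J-2, \ldots, 0$, recursively form vectors $s^j$ and $d^j$:
$$s^j_k = \frac{s^{j+1}_{2k-1} + s^{j+1}_{2k}}{2}, \qquad d^j_k = \frac{s^{j+1}_{2k-1} - s^{j+1}_{2k}}{2}, \qquad k = 1, \ldots, 2^j.$$
The operator H, where $HX = (s^0, d^0, d^1, \ldots, d^{J-1})$, defines the HT. The inverse HT is performed as follows:
- For each $j = 0, 1, \ldots, J-1$, recursively form $s^{j+1}$:
$$s^{j+1}_{2k-1} = s^j_k + d^j_k, \qquad s^{j+1}_{2k} = s^j_k - d^j_k, \qquad k = 1, \ldots, 2^j.$$
- Set $X = s^J$.
The elements of $s^j$ and $d^j$ have a simple interpretation: they can be thought of as smooth and detail (respectively) of the original vector X at scale $2^{J-j}$.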
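The forward and inverse Haar recursions can be sketched as follows, assuming the normalization in which smooth and detail coefficients are half-sums and half-differences of neighbouring smooth coefficients at the next finer scale. Function names are illustrative.

```python
def haar_forward(x):
    """Return (s0, details) with details ordered from coarsest to finest scale."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    smooth = [float(v) for v in x]
    details = []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        details.append([(a - b) / 2.0 for a, b in pairs])  # detail coefficients
        smooth = [(a + b) / 2.0 for a, b in pairs]         # smooth coefficients
    return smooth[0], details[::-1]


def haar_inverse(s0, details):
    """Rebuild the original vector from the coarsest smooth and all details."""
    smooth = [s0]
    for level in details:
        finer = []
        for s, d in zip(smooth, level):
            finer.extend((s + d, s - d))
        smooth = finer
    return smooth


s0, details = haar_forward([10, 11, 12, 11])
print(s0, details)
print(haar_inverse(s0, details))
```

The coarsest smooth coefficient is simply the overall mean of the input, and the inverse recursion reproduces the input exactly.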
We now introduce the HFT: a multiscale algorithm for (approximately) stabilizing the variance of X and bringing its distribution closer to normality.
The main idea of the HFT is to decompose X using the HT, then Gaussianize the coefficients $d^j_k$ and stabilize their variance, and then apply the inverse HT to obtain a vector which is closer to Gaussianity and has its variance approximately stabilized. We now describe the middle step: the variance stabilization and Gaussianization of $d^j_k$.
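Consider first $d^{J-1}_1 = (X_1 - X_2)/2$. Suppose for now that $X_1$, $X_2$ are identically distributed (i.d.): indeed, this is likely if the underlying mean $\mu_i$ is, for example, piecewise constant. This implies that $d^{J-1}_1$ is symmetric around zero. We want to stabilize the variance of $d^{J-1}_1$ around $1/2$. To do so, we divide $d^{J-1}_1$ by $\sqrt{2}$ times its own SD. Using the assumption of independence (item 2, first list of this section above) we have
$$\mathrm{Var}(d^{J-1}_1) = \tfrac{1}{4}\{\mathrm{Var}(X_1) + \mathrm{Var}(X_2)\} = \tfrac{1}{2} h(\mu_1).$$
In practice $\mu_1$ is unknown and we estimate it locally by $\hat{\mu}_1 = (X_1 + X_2)/2 = s^{J-1}_1$. The (approximately) variance-stabilized coefficient $f^{J-1}_1$ is given by $f^{J-1}_1 = d^{J-1}_1 / h^{1/2}(s^{J-1}_1)$ (where the convention $0/0 = 0$ is used).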
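Turning now to $d^{J-2}_1 = (X_1 + X_2 - X_3 - X_4)/4$, we also first assume that the $X_1, \ldots, X_4$ are i.d. In order to stabilize the variance of $d^{J-2}_1$ around $1/4$, we divide $d^{J-2}_1$ by two times its SD. We have $\mathrm{Var}(d^{J-2}_1) = \tfrac{1}{4} h(\mu_1)$ as before, and we estimate $\mu_1$ locally by $s^{J-2}_1 = (X_1 + X_2 + X_3 + X_4)/4$, which yields an approximately variance-stabilized coefficient $f^{J-2}_1 = d^{J-2}_1 / h^{1/2}(s^{J-2}_1)$. Asymptotic Gaussianity and variance stabilization of random variables of a form similar to $f^j_k$ were studied by Fisz (1955); hence, we label $f^j_k$ the Fisz coefficients of X, and the whole procedure the Haar–Fisz transform of X.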
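We now give the general algorithm for the HFT when the function h is known.
- Let $s^J_i = X_i$ for $i = 1, \ldots, n$.
- For each $j = J-1, J-2, \ldots, 0$, recursively form vectors $s^j$ and $f^j$:
$$s^j_k = \frac{s^{j+1}_{2k-1} + s^{j+1}_{2k}}{2}, \qquad f^j_k = \frac{s^{j+1}_{2k-1} - s^{j+1}_{2k}}{2\, h^{1/2}(s^j_k)}, \qquad k = 1, \ldots, 2^j.$$
- For each $j = 0, 1, \ldots, J-1$, recursively modify $s^{j+1}$:
$$s^{j+1}_{2k-1} = s^j_k + f^j_k, \qquad s^{j+1}_{2k} = s^j_k - f^j_k, \qquad k = 1, \ldots, 2^j.$$
- Set $U = s^J$.
The relation $U = \mathcal{F}_h X$ defines a nonlinear, invertible operator $\mathcal{F}_h$ which we call the HFT (of X) with link function h.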
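The algorithm for a known h can be sketched as follows. This is an illustration, assuming that each detail (half-difference) is divided by the square root of h evaluated at the local smooth (half-sum) coefficient, with the convention 0/0 = 0, and that the inverse Haar recursion is then applied to the modified coefficients. The function name is hypothetical.

```python
import math


def haar_fisz(x, h):
    """Haar-Fisz transform of x for a known variance function h (sketch)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    smooth = [float(v) for v in x]
    fisz_levels = []  # finest level first
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        coarser = [(a + b) / 2.0 for a, b in pairs]
        level = []
        for (a, b), s in zip(pairs, coarser):
            d = (a - b) / 2.0
            # Convention 0/0 = 0: a zero detail stays zero whatever h(s) is.
            level.append(0.0 if d == 0.0 else d / math.sqrt(h(s)))
        fisz_levels.append(level)
        smooth = coarser
    for level in reversed(fisz_levels):  # inverse Haar steps, coarsest first
        finer = []
        for s, f in zip(smooth, level):
            finer.extend((s + f, s - f))
        smooth = finer
    return smooth


# Poisson-like case, where the variance equals the mean: h(m) = m.
print(haar_fisz([4, 4, 0, 0], lambda m: m))
```

With h identically equal to 1 the transform reduces to the identity, and for any h the overall mean of the input is preserved, because the coarsest smooth coefficient is left unchanged.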
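In practice h is often unknown and needs to be estimated. Since $\sigma_i^2 = h(\mu_i)$, ideally we would wish to estimate h by computing the empirical variances of $X_1, X_2, \ldots$ at points $\mu_1, \mu_2, \ldots$, respectively, and then smoothing the observations to obtain an estimate of h. Suppose for the time being that the $\mu_i$s are known and, as an illustrative example, consider $X_1$. The empirical variance of $X_1$ can be pre-estimated, for example, as $\hat{\sigma}_1^2 = (X_1 - X_2)^2/2$. Note that on any piecewise constant stretch, our pre-estimate is exactly unbiased. The above discussion motivates the following regression setup:
$$\hat{\sigma}^2_{2i-1} = h(\mu_{2i-1}) + e_i, \qquad i = 1, \ldots, n/2,$$
where $e_i = \hat{\sigma}^2_{2i-1} - \sigma^2_{2i-1}$ and in most cases $E(e_i) = 0$. Of course, in practice, the $\mu_i$s are not known and, since we pre-estimate the variance of $X_{2i-1}$ using $X_{2i-1}$ and $X_{2i}$, it also makes sense to pre-estimate $\mu_{2i-1}$ by $\hat{\mu}_{2i-1} = (X_{2i-1} + X_{2i})/2$. Note that for each $k = 1, \ldots, 2^{J-1}$, we have $\hat{\mu}_{2k-1} = s^{J-1}_k$ and $\hat{\sigma}^2_{2k-1} = 2 (d^{J-1}_k)^2$, which leads to our final regression setup:
$$2 (d^{J-1}_k)^2 = h(s^{J-1}_k) + e_k, \qquad k = 1, \ldots, 2^{J-1}, \tag{6}$$
where the smooth coefficients serve as pre-estimates of $\mu_i$ and the squared detail coefficients serve as pre-estimates of $\sigma_i^2$.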
As we restrict h to be a non-decreasing function of $\mu$, we choose to estimate it from the regression problem (6) via least-squares isotone regression, using the pool-adjacent-violators algorithm described in detail in Johnstone and Silverman (2005, Section 6.3). The resulting estimate, denoted here by $\hat{h}$, is a non-decreasing, piecewise constant function of $\mu$.
The DDHFT is performed as above except that $\hat{h}$ replaces h.
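The estimation of h can be sketched as follows, assuming the pre-estimates are the pair means (X1 + X2)/2 and the squared half-differences scaled to (X1 − X2)²/2, fitted by an equal-weight pool-adjacent-violators pass. Extending the fitted step function to the left of the smallest pair mean by its first value is an assumption of this sketch, and all function names are hypothetical.

```python
import bisect


def pava(y):
    """Least-squares non-decreasing fit to the sequence y (equal weights)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


def estimate_h(x):
    """Return a non-decreasing step-function estimate of the variance function."""
    pairs = zip(x[0::2], x[1::2])
    points = sorted(((a + b) / 2.0, (a - b) ** 2 / 2.0) for a, b in pairs)
    means = [m for m, _ in points]
    fitted = pava([v for _, v in points])

    def h_hat(m):
        idx = bisect.bisect_right(means, m) - 1
        return fitted[max(idx, 0)]  # left of the first knot: use the first value

    return h_hat


x = [10, 11, 12, 11, 13, 12, 13, 14, 73, 74, 74, 75, 100, 102, 99, 103]
h_hat = estimate_h(x)
print(pava([1, 3, 2, 4]), h_hat(10.5), h_hat(50.0), h_hat(200.0))
```

The resulting function could then be passed as the link function to a Haar–Fisz routine in place of a known h, which is the data-driven step of the DDHFT.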
Acknowledgments
E.S.M. is the grateful recipient of a Wellcome Prize Studentship awarded to G.A.R. and G.P.N. G.P.N. was partially supported by an EPSRC Advanced Research Fellowship.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Thomas Lengauer
Received on November 4, 2005; revised on July 4, 2006; accepted on July 21, 2006
REFERENCES
Alwin, J.C., et al. (1977) Methods for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc. Natl Acad. Sci. USA, 74, 5350–5354.
Archer, K.J., et al. (2004) Graphical technique for identifying a monotonic variance stabilizing transformation for absolute gene intensity signals. BMC Bioinformatics, 5, 60.
Baird, D., et al. (2004) Normalization of microarray data using a spatial mixed model analysis which includes splines. Bioinformatics, 20, 3196–3205.
Box, G.E.P. and Cox, D.R. (1964) An analysis of transformations. J. R. Stat. Soc. B, 26, 211–252.
Comander, J., et al. (2004) Improving the statistical detection of regulated genes from microarray data using intensity-based variance estimation. BMC Genomics, 5, 17.
Cui, X., et al. (2003) Transformations for cDNA microarray data. Stat. Appl. Gen. Mol. Biol., 2, 4.
D'Agostino, R.B. (1971) An omnibus test of normality for moderate and large size samples. Biometrika, 58, 341–348.
Delmar, P., et al. (2005a) Mixture model on the variance for the differential analysis of gene expression data. J. R. Stat. Soc. C, 54, 31–50.
Delmar, P., et al. (2005b) VarMixt: efficient variance modelling for the differential analysis of replicated gene expression data. Bioinformatics, 21, 502–508.
Durbin, B.P. and Rocke, D.M. (2003) Estimation of transformation parameters for microarray data. Bioinformatics, 19, 1360–1367.
Durbin, B.P., et al. (2002) A variance-stabilizing transformation for gene expression microarray data. Bioinformatics, 18, S105–S110.
Fisz, M. (1955) The limiting distribution of a function of two independent random variables and its statistical application. Colloquium Mathematicum, 3, 138–146.
Fryzlewicz, P. and Delouille, V. (2006) A data-driven Haar–Fisz transform for multiscale variance stabilization. Proceedings of the 13th IEEE Workshop on Statistical Signal Processing (to appear).
Fryzlewicz, P., Delouille, V., Nason, G.P. (2005) GOES-8 X-ray sensor variance stabilization using the multiscale data-driven Haar–Fisz transform. Technical Report 05:06, Statistics Group, Department of Mathematics, University of Bristol, UK.
Fryzlewicz, P. and Nason, G.P. (2004) A Haar–Fisz algorithm for Poisson intensity estimation. J. Comput. Graph. Stat., 13, 621–638.
Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., Soper, K. (2001) Statistical analysis of high density oligonucleotide arrays: a SAFER approach. Proceedings of the GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data, 19 November, Bethesda, MD.
Hoyle, D.C., et al. (2002) Making sense of microarray data distributions. Bioinformatics, 18, 576–584.
Hsiao, A., et al. (2004) Variance-modelled posterior inference of microarray data: detecting gene-expression changes in 3T3-L1 adipocytes. Bioinformatics, 20, 3108–3127.
Huber, W., et al. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, S96–S104.
Huber, W., et al. (2003) Parameter estimation for the calibration and variance stabilization of microarray data. Stat. Appl. Gen. Mol. Biol., 2, Article 3.
Johnstone, I.M. and Silverman, B.W. (2005) EbayesThresh: R programs for empirical Bayes thresholding. J. Stat. Softw., 12, 1–38.
McCaffrey, R.L., et al. (2004) A specific gene expression program triggered by Gram-positive bacteria in the cytosol. Proc. Natl Acad. Sci. USA, 101, 11386–11391.
Munson, P. (2001) A consistency test for determining the significance of gene expression changes on replicate samples and two convenient variance-stabilizing transformations. Proceedings of the GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data, 19 November, Bethesda, MD.
Pauli, F., et al. (2006) Chromosomal clustering and GATA transcriptional regulation of intestine-expressed genes in C.elegans. Development, 133, 287–295.
Rocke, D.M. and Durbin, B.P. (2001) A model for measurement error for gene expression arrays. J. Comput. Biol., 8, 557–569.
Rocke, D.M. and Durbin, B.P. (2003) Approximate variance-stabilizing transformations for gene expression microarray data. Bioinformatics, 19, 966–972.
Sebastiani, P. and Ramoni, M. (2003) Statistical Challenges in Functional Genomics. Statist. Sci., 18, 33–70.
Smyth, G.K., Yang, Y.H., Speed, T. (2003) Statistical issues in cDNA Microarray data analysis. In Brownstein, M.J. and Khodursky, A. (eds), Functional Genomics: Methods and Protocols, Methods in Molecular Biology, Vol. 224. Humana Press, Totowa, NJ, pp. 111–136.
Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley, Reading, MA.
Tusher, V., et al. (2001) Significance analysis of microarrays applied to ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121.
Velculescu, V.E., et al. (1995) Serial analysis of gene expression. Science, 270, 484–487.
Wang, S. and Ethier, S. (2004) A generalized likelihood ratio test to identify differentially expressed genes from microarray data. Bioinformatics, 20, 100–104.
[Table 1. Summary statistics of the adjusted SDs and $K^2$ p-value (K2) for the various transforms.]