Skip Navigation


Bioinformatics Advance Access originally published online on November 22, 2005
Bioinformatics 2006 22(3):326-331; doi:10.1093/bioinformatics/bti788
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/3/326    most recent
bti788v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by de Ridder, D.
Right arrow Articles by Reinders, M. J. T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by de Ridder, D.
Right arrow Articles by Reinders, M. J. T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Maximum significance clustering of oligonucleotide microarrays

Dick de Ridder 1,2,*, Frank J. T. Staal 2, Jacques J. M. van Dongen 2 and Marcel J. T. Reinders 1

1Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology PO Box 5031, 2600 GA Delft, The Netherlands
2Department of Immunology, Erasmus MC, University Medical Center PO Box 1738, 3000 DR Rotterdam, The Netherlands

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RELATED WORK
 4 EXPERIMENTS
 5 CONCLUSIONS
 REFERENCES
 

Motivation: Affymetrix high-density oligonucleotide microarrays measure the expression of DNA transcripts using probesets, i.e. multiple probes per transcript. Usually, these multiple measurements are transformed into a single probeset expression level before data analysis proceeds; any information on variability is lost. In this paper we demonstrate how individual probe measurements can be used in a statistic for differential expression. Furthermore, we show how this statistic can serve as a criterion for clustering microarrays.

Results: A novel clustering algorithm using this maximum significance criterion is demonstrated to be more efficient with the measured data than competing techniques for dealing with repeated measurements, especially when the sample size is small.

Availability: MATLAB source code can be found at http://ict.ewi.tudelft.nl/~dick

Contact: D.deRidder{at}ewi.tudelft.nl


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RELATED WORK
 4 EXPERIMENTS
 5 CONCLUSIONS
 REFERENCES
 
The high-density oligonucleotide GeneChip microarrays, produced by Affymetrix, Inc. (Lipschutz et al., 1999), contain a large number of probesets (10 000–55 000), each of which measures the presence of transcripts of a certain gene or expressed sequence tag (EST). The individual probes in a probeset (typically 11–20) contain 25mers which match non-overlapping regions in the gene's or EST's nucleotide sequence.

Affymetrix GeneChip microarrays are increasingly used in large-scale studies (Pomeroy et al., 2002; Yeoh et al., 2002). The standard approach followed is to first summarize the probe measurements in a probeset as an expression level, using one of a few standard methods [e.g. Affymetrix MicroArray Suite 5.0 (Affymetrix, Inc., 2002, http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf), dChip (Li and Wong, 2001), VSN (Huber et al., 2002) or robust multichip analysis, RMA (Irizarry et al., 2003)]. These expression levels are subsequently used as measurements on that probeset and used for further data analysis, such as finding differentially expressed probesets, clustering microarrays or probesets, and the construction of predictors.

In this paper, we propose to use the information (on variability) in the individual probe measurements in the data analysis, rather than summarizing it from the start. This should make more efficient use of the measured data and allow us to obtain reliable results using fewer microarrays. The latter is especially useful for molecular biologists interested in performing small-scale, specific microarray experiments, e.g. comparing cell types, treatments, etc. using just 2–3 microarrays per condition.

In Section 2.1 the linear models used as the basis of this approach, as well as their use in testing for differential expression will be discussed. Section 2.2 introduces a criterion for clustering microarrays based on this test and an accompanying clustering algorithm. Some related work is discussed in Section 3. This algorithm is tested on synthetic and real data in Section 4. Section 5 gives some concluding remarks and an outlook.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RELATED WORK
 4 EXPERIMENTS
 5 CONCLUSIONS
 REFERENCES
 
2.1 Linear models
2.1.1 Probeset summarization
In many studies, (e.g. Pomeroy et al., 2002; Yeoh et al., 2002), microarray measurement data are normalized and summarized using software supplied with the Affymetrix system (MicroArray Suite or GeneChip Operating Software). These packages have often been criticized: the normalization included is simply linear and per-array; and the method used to summarize probeset expression is rather heuristic. Furthermore, analyses change between different versions of the software and depend on a relatively large number of parameters to be set by the user.

Recently, a number of more principled methods, based on error models, have been proposed (Huber et al., 2002; Irizarry et al., 2003; Li and Wong, 2001). RMA (Irizarry et al., 2003) is widely used, implemented in the freely available Bioconductor R package (Gautier et al., 2004) and heavily benchmarked (Cope et al., 2004). The RMA approach consists of three steps as follows:

  1. per-array background signal removal over all probes on an array;
  2. quantile normalization over all probes on all arrays;
  3. robustly fitting a linear additive model for each probeset:

Formula 1(1)
In this model, the log2-transformed intensity Formula 1 of each probe p in probeset g on array a, Formula 1, is assumed to be the sum of a probe-effect {alpha}g,p, an array-effect ßg,a and an i.i.d. noise {varepsilon}g,p,a. RMA finds these parameters using a robust method, median polish (Holder et al., 2001), and ßg,a is used as the expression level summary of probeset g on array a.

Given a number of groups of microarrays, e.g. corresponding to conditions, cell types, diagnosis, etc. differential expression between groups is then assessed by performing a statistical test, such as a t-test (Dudoit et al., 2000) or SAM (Tusher et al., 2001), on these summaries. To reliably estimate the statistics used in these tests, at least 5–10 measurements (microarrays) should be available per group. This leads to high expense, both financially and in terms of the required amount of biological material.

In step 3 of the RMA approach, it is also possible to apply a two-way analysis of variance (ANOVA), using a model similar to (1), to calculate both probeset expression levels and an F-statistic for significance of differential expression between the conditions (Dik et al., 2005). A disadvantage in comparison to RMA is a possible lack of robustness1; however, a major advantage is that information on individual probes is used in testing for differential expression. The ANOVA approach models all probes as measuring the same effect, variation in probe measurements therefore corresponds to measurement noise. ANOVA's assumption that this noise is normally distributed does not always hold, e.g. when probesets contain outliers probes, which more easily cross-hybridize than others or do not hybridize well at all. Nevertheless, this approach will be applicable in experiments with smaller sample sizes than required by traditional tests applied to probeset summaries.

A full discussion of the merits of the ANOVA model in testing for differential expression falls outside the scope of this paper; however, as we use it to formulate a clustering criterion, it will be discussed in more detail below.

2.1.2 Two-way ANOVA
Suppose there are K conditions in a microarray study. Let the number of arrays measured under each condition be Ak, k = 1, ..., K and Formula 1. Each probeset g (g = 1, ..., G) is represented on each array by Pg probes. The log2 of the measured intensities for probeset g can then be stored in a matrix X as in Figure 1. This matrix can be divided along several dimensions: rows, columns, or both. For clarity, the shorthand X will be used to denote the entire matrix; Xp.., the submatrix corresponding to row group (probe) p; X:,k the submatrix corresponding to column group (condition) k; and Xp,k. the submatrix corresponding to both at once, containing the data of probe p for all arrays in condition k.


Figure 1
View larger version (4K):
[in this window]
[in a new window]
 
Fig. 1 Probeset data matrix definition.

 
The following linear model can be applied to this data as follows:

Formula 2(2)
where ßg,k is now a condition effect rather than the array-effect it was in (1). The assumptions in this model are that there is no probe–array interaction effect (i.e. that Affymetrix microarrays are identical) and that there is no effect related to arrays (normalization should have taken care of removing any array-specific effect within conditions); arrays are simply repeated measurements on probes. If Ak = 1 {forall} k (k coincides with a) and µg is incorporated in ßg,a, (2) is identical to (1).

The expression level of probeset g under condition k is then eg,k = µg + ßg,k. An F-ratio can be used to assess the significance level at which the null hypothesis of no condition effect can be rejected:

Formula 3(3)
The F-ratio for a gene g is given (Sheskin, 2004) as follows:

Formula 4(4)
in which MSK is the mean-of-squares within conditions and MSWG is the within-group mean-of-squares. Similarly, the SS are sums-of-squares and the df are degrees of freedom, with dfWG = Pg(A – K) and dfK = Pg. For writing (4) in terms of X, assume without loss of generality that the overall average Formula 4 (summations over all elements in a matrix, indicated by a ·, are not indexed); then

Formula 5(5)

Formula 6(6)

Formula 7(7)
where 1 is a vector of 1s. The F-ratio (4) is distributed according to a Fisher distribution with K – 1 and Pg(A – K) df for the numerator and denominator, respectively, as follows:

Formula 8(8)
In summary, then, for each probeset g, we can use a linear model to calculate an average expression value per condition k, eg,k, as well as a p-value for differential expression between conditions [which, when applied to entire microarrays, of course should be properly adjusted for multiple testing (Ge et al., 2003)]. This method is particularly useful to assess differential expression for small sample sizes (Ak = 2 or 3), for which methods operating on summarized expression levels will fail.

2.2 Maximum significance clustering
2.2.1 Criterion
In Equations (5)(7), it is assumed it is known which samples belong to a condition k. However, in clustering, the goal is exactly to find this membership. To this end, we can introduce an AxK membership matrix Q, in which Qak = 1 if array a belongs to cluster k, and 0 otherwise. This allows us to write (X:,k)1 as XQ. For notational purposes, it is more convenient to use a slightly different membership matrix M, with Mak = 1/{surd} Ak if array a belongs to cluster k, and 0 otherwise:

Formula 9(9)
Equations (5) and (7) then become

Formula 10(10)

Formula 11(11)
For unknown membership, the F-ratio (4) can therefore be written as

Formula 12(12)
(again assuming the overall average is zero) with the constraints that M > 0 and M1 = 1.

Maximizing (12) w.r.t. M, subject to the constraints on M, will find a soft assignment of the arrays to clusters, such that the test statistic for probeset g is maximized; (i.e.) the difference in expression between conditions is maximally significant. However, as the number of probes (and hence the df of the test statistic's distribution under H0) may differ between probesets, this cannot easily be extended to assigning clusters based on the data of multiple genes. Therefore, directly minimizing the p-value is more useful as follows:

Formula 13(13)
For entire microarrays, assuming independence between the expression of different probesets, these p-values can be combined as

Formula 14(14)
or, equivalently,

Formula 15(15)
where r is an arbitrarily small non-zero regularization factor; as for large A some p-values may reach machine precision and be rounded to 0 (in all experiments here, r = 10–300).

Note that in assessing differential expression, statistics serve to assess effects between experimentally controlled conditions, not as parameters to be optimized. Moreover, it is unclear to what extent the assumptions of the ANOVA model are still satisfied after optimization. Hence, although (15) is useful for clustering, the resulting p-values should not be interpreted as having any meaning.

2.2.2 Search algorithm
If we would be interested in a soft clustering, we could efficiently minimize (15) w.r.t. M (under the constraints on M) using any non-linear optimization package. For clustering genes, soft clustering is often useful, as individual genes may be involved in several clusters (e.g. pathways). However, interpretation of a soft clustering of microarrays and the associated p-values is problematic— e.g. what does it mean when an array belongs to cluster 1 for 30% and to cluster 2 for 70%? Researchers are often interested in ‘crisp’ assignments, i.e. ones for which Mak = 1/{surd}Ak and Maj = 0, {forall}j != k. Therefore, in the remainder of this paper we focus on crisp clustering. Unfortunately, this makes the optimization problem much harder.

We used a hill-climbing approach, which is guaranteed to converge to at least a local optimum of (15). The algorithm starts with a random assignment M of arrays to clusters and iteratively moves arrays between clusters to minimize (15). In each iteration, each of the A arrays is assigned to each of the K–1 clusters it is not currently assigned to, keeping all other assignments fixed. The criterion Formula 15 (H0|X, M') is then calculated for each of these A(K – 1) new assignments M', and M is replaced by the M' giving the lowest criterion value. The iterations stop when the criterion increases, i.e. when a local minimum has been found. The maximum significance clustering algorithm (MSC) is given in pseudo-code in Algorithm 1.


View this table:
[in this window]
[in a new window]
 
Algorithm 1 MSC: the maximum significance clustering algorithm

 
This approach works well, but is rather inefficient for large K, as in each iteration A(K – 1) assignments have to be tried. The complexity can be brought down by divide-and-conquer, i.e. iteratively clustering into 2, ..., K clusters, as described in Algorithm 2, MSCK. This was found experimentally to lead to quick convergence. The process is similar to hierarchical clustering; however, the algorithm ends by still performing an overall clustering into K clusters starting from the suboptimal solution found thus far [unlike a similar procedure (Ward, 1963), which works purely hierarchically]. Like any hill-climbing algorithm, such as k-means or expectation-maximization (EM), the MSC(K) algorithm finds a local rather than a global optimum. It will therefore have to be restarted a number of times (see also Section 4.2).


View this table:
[in this window]
[in a new window]
 
Algorithm 2 MSCK: a faster maximum significance clustering algorithm for K > 2

 
Note that in principle, a similar algorithm could be developed to cluster probesets rather than microarrays. However, the computational burden will be much larger, since M becomes a G x K matrix instead of an A x K matrix, where in general G>>A.


    3 RELATED WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RELATED WORK
 4 EXPERIMENTS
 5 CONCLUSIONS
 REFERENCES
 
Although we are not aware of any previous model-based approaches to clustering oligonucleotide microarray data based on probe data, there is literature on the use of repeated measurements in clustering in general. In Hughes et al. (2000), an error-weighted clustering method is proposed, which in Yeung et al. (2003) is extended to use weights based on different measures of variability. Let the expression level of probeset g on array a be calculated as Formula 15, with an accompanying weight wg,a. The weighted Euclidean distance2 between two arrays a and b is then given as

Formula 16(16)
The weight wg,a is either

  • constant, i.e. not used (denoted D0);
  • based on the standard deviation (denoted Ds),

    Formula 17(17)

  • or based on the coefficient-of-variation (denoted Dc),

    Formula 18(18)

Alternatively, two possible distance measures based on the Kullback–Leibler divergence between two distributions are given in Wit and McClure (2004) as follows:

Formula 19(19)
and

Formula 20(20)
where the distributions are assumed to be normal so that Formula 20 can be written as

Formula 21(21)
Besides these distance-based methods, a clustering algorithm generally applicable to repeated measurements has been presented (Medvedovic et al., 2004; Yeung et al., 2003). However, the software cannot handle different numbers of probes per probesets, which is required for application to real-world datasets.

In the following section, the MSCK algorithm will be compared with a number of standard clustering algorithms: mixtures-of-Gaussians (MoG) with diagonal covariance matrices, fitted using the EM algorithm; the k-means algorithm; and hierarchical clustering with complete, average and single linkage, all applied to the average expression level for each probeset (corresponding to the use of D0) and to probeset values obtained by applying RMA (Irizarry et al., 2003) and MAS (Affymetrix, Inc., 2002). k-means and hierarchical clustering were also used with the Ds, Dc, DJeffreys and DResistor distance measures outlined earlier.


    4 EXPERIMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RELATED WORK
 4 EXPERIMENTS
 5 CONCLUSIONS
 REFERENCES
 
4.1 Simulated data
To learn if and under which circumstances MSCK performs well, we simulated a large number of datasets, as in de Ridder et al., (2005), with different characteristics. In each dataset, the number of probesets was G = 1000 and the number of probes per probeset was set to Pg = 11, g = 1, ..., G. The number of probesets per condition differentially expressed w.r.t. all other conditions, G{Delta}, the difference of expression between conditions for these probesets, {Delta}, the number of arrays A and the number of conditions K were varied. On each dataset, the Gs probesets with most variation in probeset expression level over the A arrays were selected for use in clustering (‘variation filtering’); in the reported results, Gs was always 50.

For each parameter set, 10 datasets were simulated and MSCK was applied. Next, five distance matrices were calculated using the measures described in Section 3 and clustered using k-means, MoG and hierarchical clustering using complete, average and single linkage. Performance was measured by the Jaccard index J = nss/(nsd + nds + nss), which measures agreement between conditions and cluster assignments (with nss the number of pairs of arrays in the same condition with the same assignment, nsd the number of pairs of arrays in the same condition with different assignments and nds the number of pairs of arrays in different conditions with the same assignment). The Jaccard index lies between 0 and 1, 1 indicating complete agreement and 0 complete disagreement.

Figure 2 shows performance as a function of each of the parameters that were varied, keeping the other parameters fixed (note that in each subgraph, MSCK performance is always the same):

  • For small numbers of differentially expressed probesets G{Delta} (1–2), the MSCK algorithm performs well, as do the MoG. Only the hierarchical methods using Dc perform comparably well. For larger G{Delta}, all methods except k-means perform well.
  • For small {Delta} (0.5), MSCK performs worse than other methods, because it starts fitting the noise in the data rather than the signal. For large {Delta} (1.5), MSCK performs well; only hierarchical clustering using D0 and Dc performs as well. In between these, MSCK performs well, again together with hierarchical methods using Dc.
  • With increasing K, performance decreases for all methods. Relative performance stays largely the same, except for k-means which seems to suffer more from large K. Using DJeffreys and DResistor clearly leads to poor results for large K.
  • MSCK performs best of all methods for small A, as intended, but performance does not decrease for increasing A.
  • The better the performance, the higher the standard deviation, for all methods. For the hierarchical clustering methods, this variation is caused only by the 10 different simulated datasets. For MSCK, MoG and k-means, this is also influenced by inherent variation in results due to random initialization (see also Section 4.2).
Of the five distance measures, DJeffreys and DResistor do not show advantages over Ds and Dc. The clustering methods using D0 perform remarkably well, given that they do not use any information on variability.


Figure 2
View larger version (38K):
[in this window]
[in a new window]
 
Fig. 2 Experiments on simulated data, illustrating performance as a function of a number of parameters, with all other parameters fixed: (a) the number of differentially expressed probesets G{Delta} out of 1000; (b) the level of differential expression {Delta}; (c) the number of conditions and clusters K and (d) the number of microarrays A. Each set of graphs compares performance of MSCK with that of five standard clustering methods: MoG with diagonal covariance matrices fitted to probeset averages (corresponding to D0) and k-means and hierarchical clustering with complete, average and single linkage on five distance measures for repeated measurements: D0, Ds, Dc, DJeffreys and DResistor. Performances are shown as mean ± standard deviation over 10 realizations of the simulated datasets. Note that per row, MSCK and MoG performances are the same in each subgraph.

 
In conclusion, MSCK performs well for small sample sizes A, seemingly combining the advantage of k-means (working well for small A) with that of the hierarchical methods (working well for larger K). It is most useful when the number of differentially expressed probesets G{Delta} is small and their expression difference {Delta} is intermediate. For smaller {Delta}, the method starts fitting noise; for larger {Delta}, using variability information is not necessary.

4.2 Real data
The methods above were next applied to a number of simple real-world problems:

  • a small subset of the Yeoh precursor B-ALL dataset (Yeoh et al., 2002), containing 16 BCR-ABL and 21 MLL samples, measured by HG-U95Av2 microarrays (A = 37, K = 2);
  • Pomeroy dataset A (Pomeroy et al., 2002), containing 42 cases of five types of central nervous system embryonal tumor (8 primitive neuroectodermal tumors, 4 normal, 10 medulloblastomas, 10 malignant gliomas and 10 AT/RTs), measured on Hu6800 microarrays (A = 42, K = 5);
  • a dataset of seven development stages of T-cells (Dik et al., 2005) (CD34+38+1a–, CD34+38+1a+, immature single positive CD4+, double positive CD3–, double positive CD3+, single positive CD8+ and single positive CD4+), measured on two HG-U133A microarrays each (A = 14, K = 7).
For the HG-U133A microarrays in the T-cell dataset, which carry less probes per probeset, we found that a few outlier probes could have a very large influence. We therefore preprocessed this data for all methods by removing three outlier probesets, i.e. those with the highest average absolute deviation from average probe rank over all arrays. For all datasets, array background was first removed and all arrays were quantile normalized as in Irizarry et al., (2003).

Figure 3 shows the results of using MSCK and the methods using distance measures for repeated measurements, as a function of the number of probesets Gs selected for clustering. MSCK and k-means were started 10 times from random initializations. Obviously, results can vary quite significantly, since these algorithms find local rather than global optima. As discussed in Section 2.2, in practice both methods should therefore preferably be restarted a number of times (say, 100) and the clustering resulting in the lowest value of the criterion function should be chosen.


Figure 3
View larger version (30K):
[in this window]
[in a new window]
 
Fig. 3 Experiments on three real-world datasets: (a) a subset of the Yeoh precursor-B ALL data; (b) Pomeroy dataset A and (c) a dataset of T-cell development stages. Each set of graphs compares the performance of MSCK with that of four standard clustering methods (k-means and hierarchical clustering with complete, average and single linkage) on four distance measures for repeated measurements. MSCK and k-means performances are shown as mean ± standard deviation over 10 initializations. Note that per row, MSCK performance is the same in each subgraph.

 
On all datasets, MSCK performs reasonably well, especially for small Gs, showing it to be more efficient with the data than traditional methods. For larger Gs performance decreases, as the method starts to fit noise. Although for each dataset a clustering algorithm/distance measure combination can be found for which performance is comparable with that of MSCK, no single combination consistently outperforms it. On the Pomeroy and T-cell data, MSCK gives the best results; in fact, only MSCK is able to perfectly cluster the T-cell data.

Figure 4 illustrates the effect of using repeated measurements, by comparing MSCK results with those obtained by clustering probeset values summarized by simple averaging, RMA and MAS. Globally, it shows that using MSCK gives an advantage over the more traditional approaches, such as using k-means or hierarchical clustering on probeset summaries. Only for the Yeoh data, hierarchical clustering gives good results for small gene subsets on MAS-derived data, but it starts behaving erratically for larger gene subset sizes.


Figure 4
View larger version (33K):
[in this window]
[in a new window]
 
Fig. 4 Experiments on real-world datasets. Each set of graphs compares performance of MSCK with that of five standard clustering methods (MoG, k-means and hierarchical clustering with complete, average and single linkage) on three different probeset summarization methods: average probe value (equivalent to the use of D0), RMA and MAS. MSCK, MoG and k-means performances are shown as mean ± standard deviation over 10 initializations. Note that per row, MSCK performance is the same in each subgraph.

 

    5 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RELATED WORK
 4 EXPERIMENTS
 5 CONCLUSIONS
 REFERENCES
 
In this paper, we proposed the use of an ANOVA model on probe data obtained with Affymetrix GeneChip microarrays. This allows the use of information on probe variability in assessing differential expression of probesets between conditions, even for extremely small numbers of microarrays. Here, we developed a clustering criterion based on the ANOVA F-statistic, which is optimized by the MSCK algorithm using a hill-climbing approach. Experiments on simulated data showed that this algorithm outperforms a number of competing methods, most clearly when both the number of microarrays and the differences between conditions are small. This makes it a useful tool for the molecular biologist seeking to answer questions by performing limited microarray experiments, comparing a small number of conditions using duplicate or triplicate microarray measurements. The simulation results were confirmed by some experiments on real-world data, where the MSCK algorithm showed the most advantage for the T-cell development dataset with only 2 microarrays per condition.

We intend to further explore the possibilities of reducing sample size requirements using probe variability information, e.g. in developing predictors. This should facilitate the use of microarrays in more types of small-scale experiments in molecular biology. It is also interesting to investigate to what extent even lowerlevel measurement noise (e.g. the standard devation of pixel intensity in the microarray scan) can be used for high-level analyses such as clustering.


    Acknowledgments
 
We would like to thank Karin Pike-Overzet and Floor Weerkamp for the T-cell development data, and David Tax and Lodewyk Wessels for stimulating discussions.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

1A robust alternative is Friedman's test (Sheskin, 2004). On the Affymetrix HG-U133A spike-in dataset (Cope et al., 2004) we found this test to give similar performance in detecting differential expression, but the median expression level to be less well correlated to the actual spike-in concentrations (data not shown). Back

2In Yeung et al. (2003), a similarly weighted correlation measure is given. In our experiments, performance using this measure was never better than using Euclidean distance, so we do not report any of those results here. Back

Received on September 23, 2005; revised on November 3, 2005; accepted on November 16, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RELATED WORK
 4 EXPERIMENTS
 5 CONCLUSIONS
 REFERENCES
 

    Affymetrix, Inc. (2002) Affymetrix Statistical Algorithms Description Document.

    Cope, L.M., et al. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323–331[Abstract/Free Full Text].

    de Ridder, D., et al. (2005) Purity for clarity: the need for purification of tumor cells in DNA microarray studies. Leukemia, 19, 618–627[Medline].

    Dik, W.A., et al. (2005) New insights on human T cell development by quantitative T cell receptor gene rearrangement studies and gene expression profiling. J. Exp. Med, . 201, 1715–1723[Abstract/Free Full Text].

    Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2000) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. , Berkeley, CA Techinical Report 578 UC Berkeley.

    Gautier, L., et al. (2004) Affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307–315[Abstract/Free Full Text].

    Ge, Y., et al. (2003) Resampling-based Multiple Testing for Microarray Data Analysis. TEST, 12, 1–77.

    Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., Soper, K. (2001) Statistical analysis of high density oligonucleotide arrays: a SAFER approach. Proceedings of the ASA Annual MeetingAtlanta, GA.

    Huber, W., et al. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, Suppl 1, , pp. S96–S104[Abstract].

    Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126[CrossRef][Web of Science][Medline].

    Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res, . 31, e15[Abstract/Free Full Text].

    Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 31–36[Abstract/Free Full Text].

    Lipschutz, R.J., et al. (1999) High density synthetic oligonucleotide arrays. Nat. Genet, . 21, 20–24[CrossRef][Web of Science][Medline].

    Medvedovic, M., et al. (2004) Bayesian mixtures for clustering replicated microarray data. Bioinformatics, 20, 1222–1232[Abstract/Free Full Text].

    Pomeroy, S.L., et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436–442[CrossRef][Medline].

    Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures, (2004) 3rd edn , Boca Raton, FL Chapman & Hall/CRC.

    Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    Ward, J.H. (1963) Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc, . 58, 236–244[CrossRef][Web of Science].

    Wit, E. and McClure, J. (2004) Statistics for Microarrays—Design, Analysis and Inference. , Chichester, UK John Wiley & Sons.

    Yeoh, E.J., et al. (2002) Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, 133–143[CrossRef][Web of Science][Medline].

    Yeung, K.Y., et al. (2003) Clustering gene-expression data with repeated measurements. Genome Biol, . 4, R34[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BloodHome page
O. S. Kustikova, H. Geiger, Z. Li, M. H. Brugman, S. M. Chambers, C. A. Shaw, K. Pike-Overzet, D. d. Ridder, F. J. T. Staal, G. v. Keudell, et al.
Retroviral vector insertion sites associated with dominant hematopoietic clones mark "stemness" pathways
Blood, March 1, 2007; 109(5): 1897 - 1907.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/3/326    most recent
bti788v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by de Ridder, D.
Right arrow Articles by Reinders, M. J. T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by de Ridder, D.
Right arrow Articles by Reinders, M. J. T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?