Bioinformatics Advance Access originally published online on November 22, 2005
Bioinformatics 2006 22(3):326-331; doi:10.1093/bioinformatics/bti788
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Maximum significance clustering of oligonucleotide microarrays
1Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology PO Box 5031, 2600 GA Delft, The Netherlands
2Department of Immunology, Erasmus MC, University Medical Center PO Box 1738, 3000 DR Rotterdam, The Netherlands
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Affymetrix high-density oligonucleotide microarrays measure the expression of DNA transcripts using probesets, i.e. multiple probes per transcript. Usually, these multiple measurements are transformed into a single probeset expression level before data analysis proceeds; any information on variability is lost. In this paper we demonstrate how individual probe measurements can be used in a statistic for differential expression. Furthermore, we show how this statistic can serve as a criterion for clustering microarrays.
Results: A novel clustering algorithm using this maximum significance criterion is demonstrated to be more efficient with the measured data than competing techniques for dealing with repeated measurements, especially when the sample size is small.
Availability: MATLAB source code can be found at http://ict.ewi.tudelft.nl/~dick
Contact: D.deRidder{at}ewi.tudelft.nl
| 1 INTRODUCTION |
|---|
|
|
|---|
The high-density oligonucleotide GeneChip microarrays, produced by Affymetrix, Inc. (Lipschutz et al., 1999), contain a large number of probesets (10 00055 000), each of which measures the presence of transcripts of a certain gene or expressed sequence tag (EST). The individual probes in a probeset (typically 1120) contain 25mers which match non-overlapping regions in the gene's or EST's nucleotide sequence.
Affymetrix GeneChip microarrays are increasingly used in large-scale studies (Pomeroy et al., 2002; Yeoh et al., 2002). The standard approach followed is to first summarize the probe measurements in a probeset as an expression level, using one of a few standard methods [e.g. Affymetrix MicroArray Suite 5.0 (Affymetrix, Inc., 2002, http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf), dChip (Li and Wong, 2001), VSN (Huber et al., 2002) or robust multichip analysis, RMA (Irizarry et al., 2003)]. These expression levels are subsequently used as measurements on that probeset and used for further data analysis, such as finding differentially expressed probesets, clustering microarrays or probesets, and the construction of predictors.
In this paper, we propose to use the information (on variability) in the individual probe measurements in the data analysis, rather than summarizing it from the start. This should make more efficient use of the measured data and allow us to obtain reliable results using fewer microarrays. The latter is especially useful for molecular biologists interested in performing small-scale, specific microarray experiments, e.g. comparing cell types, treatments, etc. using just 23 microarrays per condition.
In Section 2.1 the linear models used as the basis of this approach, as well as their use in testing for differential expression will be discussed. Section 2.2 introduces a criterion for clustering microarrays based on this test and an accompanying clustering algorithm. Some related work is discussed in Section 3. This algorithm is tested on synthetic and real data in Section 4. Section 5 gives some concluding remarks and an outlook.
| 2 METHODS |
|---|
|
|
|---|
2.1 Linear models
2.1.1 Probeset summarization
In many studies, (e.g. Pomeroy et al., 2002; Yeoh et al., 2002), microarray measurement data are normalized and summarized using software supplied with the Affymetrix system (MicroArray Suite or GeneChip Operating Software). These packages have often been criticized: the normalization included is simply linear and per-array; and the method used to summarize probeset expression is rather heuristic. Furthermore, analyses change between different versions of the software and depend on a relatively large number of parameters to be set by the user.
Recently, a number of more principled methods, based on error models, have been proposed (Huber et al., 2002; Irizarry et al., 2003; Li and Wong, 2001). RMA (Irizarry et al., 2003) is widely used, implemented in the freely available Bioconductor R package (Gautier et al., 2004) and heavily benchmarked (Cope et al., 2004). The RMA approach consists of three steps as follows:
- per-array background signal removal over all probes on an array;
- quantile normalization over all probes on all arrays;
- robustly fitting a linear additive model for each probeset:
![]() | (1) |
of each probe p in probeset g on array a,
, is assumed to be the sum of a probe-effect
g,p, an array-effect ßg,a and an i.i.d. noise
g,p,a. RMA finds these parameters using a robust method, median polish (Holder et al., 2001), and ßg,a is used as the expression level summary of probeset g on array a. Given a number of groups of microarrays, e.g. corresponding to conditions, cell types, diagnosis, etc. differential expression between groups is then assessed by performing a statistical test, such as a t-test (Dudoit et al., 2000) or SAM (Tusher et al., 2001), on these summaries. To reliably estimate the statistics used in these tests, at least 510 measurements (microarrays) should be available per group. This leads to high expense, both financially and in terms of the required amount of biological material.
In step 3 of the RMA approach, it is also possible to apply a two-way analysis of variance (ANOVA), using a model similar to (1), to calculate both probeset expression levels and an F-statistic for significance of differential expression between the conditions (Dik et al., 2005). A disadvantage in comparison to RMA is a possible lack of robustness1; however, a major advantage is that information on individual probes is used in testing for differential expression. The ANOVA approach models all probes as measuring the same effect, variation in probe measurements therefore corresponds to measurement noise. ANOVA's assumption that this noise is normally distributed does not always hold, e.g. when probesets contain outliers probes, which more easily cross-hybridize than others or do not hybridize well at all. Nevertheless, this approach will be applicable in experiments with smaller sample sizes than required by traditional tests applied to probeset summaries.
A full discussion of the merits of the ANOVA model in testing for differential expression falls outside the scope of this paper; however, as we use it to formulate a clustering criterion, it will be discussed in more detail below.
2.1.2 Two-way ANOVA
Suppose there are K conditions in a microarray study. Let the number of arrays measured under each condition be Ak, k = 1, ..., K and
. Each probeset g (g = 1, ..., G) is represented on each array by Pg probes. The log2 of the measured intensities for probeset g can then be stored in a matrix X as in Figure 1. This matrix can be divided along several dimensions: rows, columns, or both. For clarity, the shorthand X will be used to denote the entire matrix; Xp.., the submatrix corresponding to row group (probe) p; X:,k the submatrix corresponding to column group (condition) k; and Xp,k. the submatrix corresponding to both at once, containing the data of probe p for all arrays in condition k.
|
The following linear model can be applied to this data as follows:
![]() | (2) |
k (k coincides with a) and µg is incorporated in ßg,a, (2) is identical to (1).
The expression level of probeset g under condition k is then eg,k = µg + ßg,k. An F-ratio can be used to assess the significance level at which the null hypothesis of no condition effect can be rejected:
![]() | (3) |
![]() | (4) |
(summations over all elements in a matrix, indicated by a ·, are not indexed); then
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() | (8) |
2.2 Maximum significance clustering
2.2.1 Criterion
In Equations (5)(7), it is assumed it is known which samples belong to a condition k. However, in clustering, the goal is exactly to find this membership. To this end, we can introduce an AxK membership matrix Q, in which Qak = 1 if array a belongs to cluster k, and 0 otherwise. This allows us to write (X:,k)1 as XQ. For notational purposes, it is more convenient to use a slightly different membership matrix M, with Mak = 1/
Ak if array a belongs to cluster k, and 0 otherwise:
![]() | (9) |
![]() | (10) |
![]() | (11) |
![]() | (12) |
Maximizing (12) w.r.t. M, subject to the constraints on M, will find a soft assignment of the arrays to clusters, such that the test statistic for probeset g is maximized; (i.e.) the difference in expression between conditions is maximally significant. However, as the number of probes (and hence the df of the test statistic's distribution under H0) may differ between probesets, this cannot easily be extended to assigning clusters based on the data of multiple genes. Therefore, directly minimizing the p-value is more useful as follows:
![]() | (13) |
![]() | (14) |
![]() | (15) |
Note that in assessing differential expression, statistics serve to assess effects between experimentally controlled conditions, not as parameters to be optimized. Moreover, it is unclear to what extent the assumptions of the ANOVA model are still satisfied after optimization. Hence, although (15) is useful for clustering, the resulting p-values should not be interpreted as having any meaning.
2.2.2 Search algorithm
If we would be interested in a soft clustering, we could efficiently minimize (15) w.r.t. M (under the constraints on M) using any non-linear optimization package. For clustering genes, soft clustering is often useful, as individual genes may be involved in several clusters (e.g. pathways). However, interpretation of a soft clustering of microarrays and the associated p-values is problematic e.g. what does it mean when an array belongs to cluster 1 for 30% and to cluster 2 for 70%? Researchers are often interested in crisp assignments, i.e. ones for which Mak = 1/
Ak and Maj = 0,
j
k. Therefore, in the remainder of this paper we focus on crisp clustering. Unfortunately, this makes the optimization problem much harder.
We used a hill-climbing approach, which is guaranteed to converge to at least a local optimum of (15). The algorithm starts with a random assignment M of arrays to clusters and iteratively moves arrays between clusters to minimize (15). In each iteration, each of the A arrays is assigned to each of the K1 clusters it is not currently assigned to, keeping all other assignments fixed. The criterion
(H0|X, M') is then calculated for each of these A(K 1) new assignments M', and M is replaced by the M' giving the lowest criterion value. The iterations stop when the criterion increases, i.e. when a local minimum has been found. The maximum significance clustering algorithm (MSC) is given in pseudo-code in Algorithm 1.
|
This approach works well, but is rather inefficient for large K, as in each iteration A(K 1) assignments have to be tried. The complexity can be brought down by divide-and-conquer, i.e. iteratively clustering into 2, ..., K clusters, as described in Algorithm 2, MSCK. This was found experimentally to lead to quick convergence. The process is similar to hierarchical clustering; however, the algorithm ends by still performing an overall clustering into K clusters starting from the suboptimal solution found thus far [unlike a similar procedure (Ward, 1963), which works purely hierarchically]. Like any hill-climbing algorithm, such as k-means or expectation-maximization (EM), the MSC(K) algorithm finds a local rather than a global optimum. It will therefore have to be restarted a number of times (see also Section 4.2).
|
Note that in principle, a similar algorithm could be developed to cluster probesets rather than microarrays. However, the computational burden will be much larger, since M becomes a G x K matrix instead of an A x K matrix, where in general G>>A.
| 3 RELATED WORK |
|---|
|
|
|---|
Although we are not aware of any previous model-based approaches to clustering oligonucleotide microarray data based on probe data, there is literature on the use of repeated measurements in clustering in general. In Hughes et al. (2000), an error-weighted clustering method is proposed, which in Yeung et al. (2003) is extended to use weights based on different measures of variability. Let the expression level of probeset g on array a be calculated as
, with an accompanying weight wg,a. The weighted Euclidean distance2 between two arrays a and b is then given as
![]() | (16) |
- constant, i.e. not used (denoted D0);
- based on the standard deviation (denoted Ds),

(17)
- or based on the coefficient-of-variation (denoted Dc),

(18)
![]() | (19) |
![]() | (20) |
can be written as
![]() | (21) |
In the following section, the MSCK algorithm will be compared with a number of standard clustering algorithms: mixtures-of-Gaussians (MoG) with diagonal covariance matrices, fitted using the EM algorithm; the k-means algorithm; and hierarchical clustering with complete, average and single linkage, all applied to the average expression level for each probeset (corresponding to the use of D0) and to probeset values obtained by applying RMA (Irizarry et al., 2003) and MAS (Affymetrix, Inc., 2002). k-means and hierarchical clustering were also used with the Ds, Dc, DJeffreys and DResistor distance measures outlined earlier.
| 4 EXPERIMENTS |
|---|
|
|
|---|
4.1 Simulated data
To learn if and under which circumstances MSCK performs well, we simulated a large number of datasets, as in de Ridder et al., (2005), with different characteristics. In each dataset, the number of probesets was G = 1000 and the number of probes per probeset was set to Pg = 11, g = 1, ..., G. The number of probesets per condition differentially expressed w.r.t. all other conditions, G
, the difference of expression between conditions for these probesets,
, the number of arrays A and the number of conditions K were varied. On each dataset, the Gs probesets with most variation in probeset expression level over the A arrays were selected for use in clustering (variation filtering); in the reported results, Gs was always 50. For each parameter set, 10 datasets were simulated and MSCK was applied. Next, five distance matrices were calculated using the measures described in Section 3 and clustered using k-means, MoG and hierarchical clustering using complete, average and single linkage. Performance was measured by the Jaccard index J = nss/(nsd + nds + nss), which measures agreement between conditions and cluster assignments (with nss the number of pairs of arrays in the same condition with the same assignment, nsd the number of pairs of arrays in the same condition with different assignments and nds the number of pairs of arrays in different conditions with the same assignment). The Jaccard index lies between 0 and 1, 1 indicating complete agreement and 0 complete disagreement.
Figure 2 shows performance as a function of each of the parameters that were varied, keeping the other parameters fixed (note that in each subgraph, MSCK performance is always the same):
- For small numbers of differentially expressed probesets G
(12), the MSCK algorithm performs well, as do the MoG. Only the hierarchical methods using Dc perform comparably well. For larger G
, all methods except k-means perform well.
- For small
(0.5), MSCK performs worse than other methods, because it starts fitting the noise in the data rather than the signal. For large
(1.5), MSCK performs well; only hierarchical clustering using D0 and Dc performs as well. In between these, MSCK performs well, again together with hierarchical methods using Dc.
- With increasing K, performance decreases for all methods. Relative performance stays largely the same, except for k-means which seems to suffer more from large K. Using DJeffreys and DResistor clearly leads to poor results for large K.
- MSCK performs best of all methods for small A, as intended, but performance does not decrease for increasing A.
- The better the performance, the higher the standard deviation, for all methods. For the hierarchical clustering methods, this variation is caused only by the 10 different simulated datasets. For MSCK, MoG and k-means, this is also influenced by inherent variation in results due to random initialization (see also Section 4.2).
|
In conclusion, MSCK performs well for small sample sizes A, seemingly combining the advantage of k-means (working well for small A) with that of the hierarchical methods (working well for larger K). It is most useful when the number of differentially expressed probesets G
is small and their expression difference
is intermediate. For smaller
, the method starts fitting noise; for larger
, using variability information is not necessary.
4.2 Real data
The methods above were next applied to a number of simple real-world problems:
- a small subset of the Yeoh precursor B-ALL dataset (Yeoh et al., 2002), containing 16 BCR-ABL and 21 MLL samples, measured by HG-U95Av2 microarrays (A = 37, K = 2);
- Pomeroy dataset A (Pomeroy et al., 2002), containing 42 cases of five types of central nervous system embryonal tumor (8 primitive neuroectodermal tumors, 4 normal, 10 medulloblastomas, 10 malignant gliomas and 10 AT/RTs), measured on Hu6800 microarrays (A = 42, K = 5);
- a dataset of seven development stages of T-cells (Dik et al., 2005) (CD34+38+1a, CD34+38+1a+, immature single positive CD4+, double positive CD3, double positive CD3+, single positive CD8+ and single positive CD4+), measured on two HG-U133A microarrays each (A = 14, K = 7).
Figure 3 shows the results of using MSCK and the methods using distance measures for repeated measurements, as a function of the number of probesets Gs selected for clustering. MSCK and k-means were started 10 times from random initializations. Obviously, results can vary quite significantly, since these algorithms find local rather than global optima. As discussed in Section 2.2, in practice both methods should therefore preferably be restarted a number of times (say, 100) and the clustering resulting in the lowest value of the criterion function should be chosen.
|
On all datasets, MSCK performs reasonably well, especially for small Gs, showing it to be more efficient with the data than traditional methods. For larger Gs performance decreases, as the method starts to fit noise. Although for each dataset a clustering algorithm/distance measure combination can be found for which performance is comparable with that of MSCK, no single combination consistently outperforms it. On the Pomeroy and T-cell data, MSCK gives the best results; in fact, only MSCK is able to perfectly cluster the T-cell data.
Figure 4 illustrates the effect of using repeated measurements, by comparing MSCK results with those obtained by clustering probeset values summarized by simple averaging, RMA and MAS. Globally, it shows that using MSCK gives an advantage over the more traditional approaches, such as using k-means or hierarchical clustering on probeset summaries. Only for the Yeoh data, hierarchical clustering gives good results for small gene subsets on MAS-derived data, but it starts behaving erratically for larger gene subset sizes.
|
| 5 CONCLUSIONS |
|---|
|
|
|---|
In this paper, we proposed the use of an ANOVA model on probe data obtained with Affymetrix GeneChip microarrays. This allows the use of information on probe variability in assessing differential expression of probesets between conditions, even for extremely small numbers of microarrays. Here, we developed a clustering criterion based on the ANOVA F-statistic, which is optimized by the MSCK algorithm using a hill-climbing approach. Experiments on simulated data showed that this algorithm outperforms a number of competing methods, most clearly when both the number of microarrays and the differences between conditions are small. This makes it a useful tool for the molecular biologist seeking to answer questions by performing limited microarray experiments, comparing a small number of conditions using duplicate or triplicate microarray measurements. The simulation results were confirmed by some experiments on real-world data, where the MSCK algorithm showed the most advantage for the T-cell development dataset with only 2 microarrays per condition.
We intend to further explore the possibilities of reducing sample size requirements using probe variability information, e.g. in developing predictors. This should facilitate the use of microarrays in more types of small-scale experiments in molecular biology. It is also interesting to investigate to what extent even lowerlevel measurement noise (e.g. the standard devation of pixel intensity in the microarray scan) can be used for high-level analyses such as clustering.
| Acknowledgments |
|---|
We would like to thank Karin Pike-Overzet and Floor Weerkamp for the T-cell development data, and David Tax and Lodewyk Wessels for stimulating discussions.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
1A robust alternative is Friedman's test (Sheskin, 2004). On the Affymetrix HG-U133A spike-in dataset (Cope et al., 2004) we found this test to give similar performance in detecting differential expression, but the median expression level to be less well correlated to the actual spike-in concentrations (data not shown). ![]()
2In Yeung et al. (2003), a similarly weighted correlation measure is given. In our experiments, performance using this measure was never better than using Euclidean distance, so we do not report any of those results here. ![]()
Received on September 23, 2005; revised on November 3, 2005; accepted on November 16, 2005
| REFERENCES |
|---|
|
|
|---|
Affymetrix, Inc. (2002) Affymetrix Statistical Algorithms Description Document.
Cope, L.M., et al. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323331
de Ridder, D., et al. (2005) Purity for clarity: the need for purification of tumor cells in DNA microarray studies. Leukemia, 19, 618627[Medline].
Dik, W.A., et al. (2005) New insights on human T cell development by quantitative T cell receptor gene rearrangement studies and gene expression profiling. J. Exp. Med, . 201, 17151723
Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2000) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. , Berkeley, CA Techinical Report 578 UC Berkeley.
Gautier, L., et al. (2004) Affyanalysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307315
Ge, Y., et al. (2003) Resampling-based Multiple Testing for Microarray Data Analysis. TEST, 12, 177.
Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., Soper, K. (2001) Statistical analysis of high density oligonucleotide arrays: a SAFER approach. Proceedings of the ASA Annual MeetingAtlanta, GA.
Huber, W., et al. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, Suppl 1, , pp. S96S104[Abstract].
Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109126[CrossRef][ISI][Medline].
Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res, . 31, e15
Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 3136
Lipschutz, R.J., et al. (1999) High density synthetic oligonucleotide arrays. Nat. Genet, . 21, 2024[CrossRef][ISI][Medline].
Medvedovic, M., et al. (2004) Bayesian mixtures for clustering replicated microarray data. Bioinformatics, 20, 12221232
Pomeroy, S.L., et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436442[CrossRef][Medline].
Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures, (2004) 3rd edn , Boca Raton, FL Chapman & Hall/CRC.
Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 51165121
Ward, J.H. (1963) Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc, . 58, 236244[CrossRef][ISI].
Wit, E. and McClure, J. (2004) Statistics for MicroarraysDesign, Analysis and Inference. , Chichester, UK John Wiley & Sons.
Yeoh, E.J., et al. (2002) Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, 133143[CrossRef][ISI][Medline].
Yeung, K.Y., et al. (2003) Clustering gene-expression data with repeated measurements. Genome Biol, . 4, R34[CrossRef][Medline].
This article has been cited by other articles:
![]() |
O. S. Kustikova, H. Geiger, Z. Li, M. H. Brugman, S. M. Chambers, C. A. Shaw, K. Pike-Overzet, D. d. Ridder, F. J. T. Staal, G. v. Keudell, et al. Retroviral vector insertion sites associated with dominant hematopoietic clones mark "stemness" pathways Blood, March 1, 2007; 109(5): 1897 - 1907. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






















