Bioinformatics Advance Access originally published online on March 5, 2008
Bioinformatics 2008 24(9):1154-1160; doi:10.1093/bioinformatics/btn083

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Merging two gene-expression studies via cross-platform normalization

Andrey A. Shabalin 1,*, Håkon Tjelmeland 2, Cheng Fan 3, Charles M. Perou 3,4,5 and Andrew B. Nobel 1

1Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, USA, 2Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway, 3Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, 4Department of Pathology and Laboratory Medicine, University of North Carolina at Chapel Hill and 5Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 CROSS-PLATFORM NORMALIZATION...
 3 OTHER METHODS
 4 DATASETS AND PREPROCESSING
 5 VALIDATION
 6 FURTHER DISCUSSION OF...
 7 CONCLUSION
 APPENDIX: MAXIMUM LIKELIHOOD...
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Gene-expression microarrays are currently being applied in a variety of biomedical applications. This article considers the problem of how to merge datasets arising from different gene-expression studies of a common organism and phenotype. Of particular interest is how to merge data from different technological platforms.

Results: The article makes two contributions to the problem. The first is a simple cross-study normalization method, which is based on linked gene/sample clustering of the given datasets. The second is the introduction and description of several general validation measures that can be used to assess and compare cross-study normalization methods. The proposed normalization method is applied to three existing breast cancer datasets, and is compared to several competing normalization methods using the proposed validation measures.

Availability: The supplementary materials and XPN Matlab code are publicly available at https://genome.unc.edu/xpn

Contact: shabalin{at}email.unc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
High-throughput gene-expression microarrays are currently being applied in a wide variety of biomedical problems. There are now several widely used, commercially available, microarray platforms that measure gene expression in related, but different, ways. No matter which technology is used, the evaluation of gene-expression experiments usually begins with statistical analyses that take a variety of forms, including exploratory analysis (such as clustering), classification and assessments of differential expression.

The increasing number and availability of large-scale gene-expression studies of human and other organisms provides strong motivation for cross-study analyses that combine existing and/or new datasets. In a cross-study analysis, the data, relevant test statistics or conclusions of several studies are combined. The simultaneous analysis of different studies of a common organism and phenotype has the potential to strengthen and extend the results obtained from the individual studies. Cross-study analyses can be carried out using existing datasets, so their results hold out the promise of comparatively inexpensive, scientific ‘value-added’.

On the other hand, combining data from different expression studies poses a number of statistical difficulties. These difficulties arise from the fact that the constituent datasets have often been produced using different gene-expression platforms and different processing facilities. As a consequence, measurements from different platforms cannot be directly combined. Identifying and removing such systematic effects is the primary statistical challenge in cross-study analysis. We note that technological differences between studies may be confounded with biological differences arising from the choice of patient cohorts (e.g. age, gender or ethnicity). In many cases, technological artifacts are dominant, though care should be taken to verify this, and one can hope to remove them while leaving biological information intact.

There are several potential approaches to cross-study analysis, depending on what information is being synthesized. At the highest level, one may wish to combine, through meta-analysis or other techniques, the broad conclusions of different studies. Most existing work on multi-study gene-expression analysis is focused on an intermediate level, where the goal is to combine information from primary statistics (such as t-statistics or P-values) or secondary statistics (such as gene lists) that are derived from the individual studies (Choi et al., 2003; Garrett-Mayer et al., 2004; Ghosh et al., 2003). Other approaches to meta-analysis of gene-expression data are considered by Garrett-Mayer et al. (2007), Parmigiani et al. (2004), Rhodes et al. (2002, 2004) and Shen et al. (2004). This article deals with the problem of cross-study normalization: how to combine two available datasets in order to produce a single, unified dataset to which standard statistical procedures (such as clustering, classification and measures of differential expression) can be applied.

There has been a great deal of work on the normalization of gene-expression data within a single study (Bolstad et al., 2003; Irizarry et al., 2003a, b; Yang et al., 2002). Much of that work can be applied, with little modification, to normalizing data from multiple studies that are based on the same technological platform. The emphasis here is on the problem of combining data from different array platforms. We will use the term cross-platform normalization when this distinction is important.


    2 CROSS-PLATFORM NORMALIZATION (XPN) METHOD
 
Here we describe the basic idea behind the XPN (cross-platform normalization) method. We restrict our attention to merging two studies; the model and fitting procedure can be extended in a natural way to handle three or more studies.

XPN takes as input the gene-expression measurements from two studies, after appropriate preprocessing and imputation. One may work with the set of common genes in the studies, or on a selected subset of these genes. Once an appropriate set G of genes has been identified, the available data can be represented as two matrices


\[ X_1 = [x_{gs1}]_{g \in G,\; s = 1, \dots, n_1}, \qquad X_2 = [x_{gs2}]_{g \in G,\; s = 1, \dots, n_2}. \tag{1} \]

Here $X_p$ denotes the available data from study $p$, and $x_{gsp}$ is the expression of gene $g$ in sample $s$ of study $p$. Let $n_1$ and $n_2$ denote the number of samples in studies 1 and 2, respectively, and let $m$ denote the number of genes in $G$. The normalized data can be represented similarly, as two matrices $\tilde{X}_1$ and $\tilde{X}_2$ with the same dimensions as $X_1$ and $X_2$.

2.1 Block linear model
The XPN procedure is based on a simple block-linear model. In this model, the observed value $x_{gsp}$ is a scaled and shifted block mean plus noise. The block mean is constant over a range of gene and sample values, and is the same in each platform. The slope and offset of the linear transformation, as well as the variance of the noise, depend on the gene $g$ and the platform $p$. More precisely, we assume that

\[ x_{gsp} = A_{\alpha^*(g),\,\beta_p^*(s)}\, b_{gp} + c_{gp} + \sigma_{gp}\,\varepsilon_{gsp}. \tag{2} \]

The functions $\alpha^* : \{1, \dots, m\} \mapsto \{1, \dots, K\}$ and $\beta_p^* : \{1, \dots, n_p\} \mapsto \{1, \dots, L\}$, $p = 1, 2$, define linked groups of genes and samples, respectively. The numbers $A_{ij}$ are block means, while $b_{gp}$ and $c_{gp}$ represent sensitivity and offset parameters, respectively, that are specific to each gene and platform. The noise variables $\varepsilon_{gsp}$ are independent standard normals, so the final term in (2) has variance $\sigma_{gp}^2$. The model reflects the assumption that the samples of each available study fall roughly into one of $L$ statistically homogenous groups, and that each group is defined by an associated gene profile that is constant within each of $K$ groups of similar genes. The block means $\{A_{ij} : i = 1, \dots, K\}$ represent the profile of the $j$th group. Figure 1 illustrates the underlying block structure. Note that the basic studies may be of different sizes. A heatmap illustrating the same idea on real data is provided in the Supplementary Materials.


Fig. 1. Studies 1 and 2 after row and column clustering of their combined data, with K = 5 gene groups and L = 3 sample groups. Shading indicates linked gene-sample blocks.
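The block structure of Figure 1 can be illustrated by simulating data from a block-linear model of the kind described in Section 2.1. The following is a minimal sketch in Python with NumPy, not the authors' Matlab code. The function name `simulate_block_model` and all parameter ranges (the distributions of the block means, sensitivities, offsets and noise scales) are assumptions chosen only for illustration.

```python
import numpy as np

def simulate_block_model(m, n, K, L, rng):
    """Draw two studies from a block-linear model.

    m: number of genes; n: (n1, n2) sample counts; K, L: gene/sample groups.
    All parameter ranges below are illustrative assumptions.
    """
    alpha = rng.integers(0, K, size=m)              # gene -> gene group
    A = rng.normal(0.0, 1.0, size=(K, L))           # block means, shared by both platforms
    studies, betas = [], []
    for n_p in n:                                   # one pass per platform p
        beta_p = rng.integers(0, L, size=n_p)       # sample -> sample group
        b = rng.uniform(0.5, 2.0, size=(m, 1))      # sensitivity b_gp
        c = rng.normal(0.0, 1.0, size=(m, 1))       # offset c_gp
        sigma = rng.uniform(0.1, 0.5, size=(m, 1))  # noise scale sigma_gp
        eps = rng.standard_normal((m, n_p))         # independent standard normal noise
        block_mean = A[alpha][:, beta_p]            # entry (g, s) is A[alpha(g), beta_p(s)]
        studies.append(block_mean * b + c + sigma * eps)
        betas.append(beta_p)
    return studies, alpha, betas, A

rng = np.random.default_rng(0)
(X1, X2), alpha, (beta1, beta2), A = simulate_block_model(
    m=200, n=(30, 40), K=5, L=3, rng=rng)
print(X1.shape, X2.shape)
```

Sorting the rows by `alpha` and the columns by `beta1` and `beta2` before plotting a heatmap would give a picture comparable to the shaded blocks in Figure 1, with studies of different sizes.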

 
2.2 Description of XPN
Initially, the data from the available studies are sample standardized and gene median centered, in order to remove gross systematic differences, and then combined. Following the model (2), clustering is then used to identify homogenous groups of genes and samples in the combined data matrix. Specifically, k-means clustering is applied independently to the rows and columns of the combined data matrix, using k = K gene clusters and k = L sample clusters, respectively. Application of k-means begins with a random choice of centroids for the clusters. In clustering rows, we select K rows of the data matrix at random, and use these as the initial centroids. Cluster assignments and centroids are then updated iteratively until convergence to a local minimum of the sum of squared Euclidean distances. A similar procedure is used for clustering of the columns.
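The preprocessing and clustering step just described can be sketched as follows. This is a minimal NumPy illustration rather than the XPN implementation: the helper names are hypothetical, the k-means routine is a plain Lloyd iteration started from randomly chosen rows, and the handling of empty clusters (keeping the previous centroid) is an assumption the text does not address.

```python
import numpy as np

def standardize_and_center(X):
    """Standardize each sample (column), then median-center each gene (row)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z - np.median(Z, axis=1, keepdims=True)

def kmeans(points, k, rng, max_iter=100):
    """Lloyd's k-means; initial centroids are k rows chosen at random."""
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((points ** 2).sum(axis=1)[:, None]
              - 2.0 * points @ centroids.T
              + (centroids ** 2).sum(axis=1)[None, :])
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stable: local minimum
            break
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):                     # assumption: keep old centroid if empty
                centroids[j] = members.mean(axis=0)
    return labels

def cluster_combined(X1, X2, K, L, rng):
    """Cluster rows (genes) and columns (samples) of the combined matrix."""
    combined = np.hstack([standardize_and_center(X1), standardize_and_center(X2)])
    alpha = kmeans(combined, K, rng)      # gene clusters, shared by both studies
    beta = kmeans(combined.T, L, rng)     # sample clusters over the combined columns
    n1 = X1.shape[1]
    return alpha, beta[:n1], beta[n1:]    # split sample labels back by study

rng = np.random.default_rng(1)
X1 = rng.normal(size=(60, 12))            # toy data, not from the article
X2 = rng.normal(size=(60, 15))
alpha, beta1, beta2 = cluster_combined(X1, X2, K=4, L=3, rng=rng)
```

Because the row and column clusterings are computed independently on the same combined matrix, the gene labels apply to both studies, and the sample labels only need to be split by study.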

The gene clusters in the combined data matrix are summarized by the assignment function $\alpha : G \mapsto \{1, \dots, K\}$. Gene clusters are naturally linked across studies, as we work with the same genes in each study. The column clusters in the combined data matrix are summarized by assignment functions $\beta_p : \{1, \dots, n_p\} \mapsto \{1, \dots, L\}$ for $p = 1, 2$. Specifically, $\beta_p(s)$ is the index of the combined sample cluster containing sample $s$ from Study $p$. The $\ell$th combined cluster splits into linked clusters $\{s : \beta_1(s) = \ell\}$ in Study 1 and $\{s : \beta_2(s) = \ell\}$ in Study 2.

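From the mappings $\alpha(g)$ and $\beta_p(s)$, estimates $\hat{A}_{ijp}$, $\hat{b}_{gp}$, $\hat{c}_{gp}$ and $\hat{\sigma}^2_{gp}$ of the model parameters are obtained using standard maximum likelihood methods. Details are given in the Appendix. Common model parameters $\hat{\theta}_g = (\hat{b}_g, \hat{c}_g, \hat{\sigma}^2_g)$ and $\hat{A}_{ij}$ are then calculated as weighted averages of the parameters in Study 1 and Study 2:


\[ \hat{\theta}_g = \frac{n_1\,\hat{\theta}_{g1} + n_2\,\hat{\theta}_{g2}}{n_1 + n_2}, \qquad \hat{A}_{ij} = \frac{n_{j,1}\,\hat{A}_{ij1} + n_{j,2}\,\hat{A}_{ij2}}{n_{j,1} + n_{j,2}}, \]

where $\hat{\theta}_{gp} = (\hat{b}_{gp}, \hat{c}_{gp}, \hat{\sigma}^2_{gp})$ and $n_{j,p}$ is the number of samples in the $j$th sample group of platform $p$. The expression values of each platform are then modified in accordance with the estimated model parameters to produce normalized values


\[ \tilde{x}_{gsp} = \hat{A}_{\alpha(g),\,\beta_p(s)}\,\hat{b}_g + \hat{c}_g + \hat{\sigma}_g\, \frac{x_{gsp} - \hat{A}_{\alpha(g),\,\beta_p(s),\,p}\,\hat{b}_{gp} - \hat{c}_{gp}}{\hat{\sigma}_{gp}}. \]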
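As a rough illustration of this last step, the sketch below forms common parameters by weighted averaging and then maps each platform's values onto them. It assumes that the normalized value is obtained by standardizing the residual under the platform-specific estimates and re-expressing it under the common ones, that gene-level parameters are weighted by the study sizes, and that variances (not standard deviations) are averaged. The maximum likelihood fit itself is not shown, and the dictionary layout and function names are hypothetical.

```python
import numpy as np

def weighted_average(v1, v2, w1, w2):
    """Weighted average of two parameter arrays (weights broadcast elementwise)."""
    return (w1 * v1 + w2 * v2) / (w1 + w2)

def common_parameters(est1, est2, beta1, beta2):
    """est_p: dict with 'A' (K x L block means) and 'b', 'c', 'var' (length m)."""
    n1, n2 = len(beta1), len(beta2)
    L = est1["A"].shape[1]
    nj1 = np.bincount(beta1, minlength=L)   # samples per sample group, study 1
    nj2 = np.bincount(beta2, minlength=L)   # samples per sample group, study 2
    common = {k: weighted_average(est1[k], est2[k], n1, n2)
              for k in ("b", "c", "var")}
    # Block means are weighted by group sizes; a group empty in both studies
    # would divide by zero, so every sample group is assumed to be non-empty.
    common["A"] = weighted_average(est1["A"], est2["A"], nj1, nj2)
    return common

def xpn_transform(X, alpha, beta, est, common):
    """Re-express platform data under the common parameters."""
    col = lambda v: v[:, None]              # length-m vector -> m x 1 column
    fitted = est["A"][alpha][:, beta] * col(est["b"]) + col(est["c"])
    resid = (X - fitted) / np.sqrt(col(est["var"]))          # standardized residual
    target = common["A"][alpha][:, beta] * col(common["b"]) + col(common["c"])
    return target + np.sqrt(col(common["var"])) * resid

# Toy check: if both platforms share identical estimates, nothing changes.
rng = np.random.default_rng(2)
alpha = np.array([0, 0, 1, 1])              # 4 genes in K = 2 groups
beta1 = np.array([0, 1, 1])                 # 3 samples in L = 2 groups
beta2 = np.array([0, 0, 1])
est = {"A": rng.normal(size=(2, 2)), "b": rng.uniform(0.5, 2.0, 4),
       "c": rng.normal(size=4), "var": rng.uniform(0.1, 1.0, 4)}
common = common_parameters(est, est, beta1, beta2)
X = rng.normal(size=(4, 3))
X_norm = xpn_transform(X, alpha, beta1, est, common)
```

In the toy check the two sets of estimates coincide, so the common parameters equal the platform parameters and the transform returns the input unchanged. With differing estimates, the block means, slopes, offsets and noise scales of each platform are all moved to the common values.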

The output of the XPN algorithm is based on multiple clusterings of the data. The procedure described above is applied 30 times, with different randomly chosen initial centroids for the row and column clusters. The output of the algorithm is the average of the normalized values obtained over the repeated runs.

There are several reasons for averaging the results of multiple clusterings of the combined data matrix. To start, there is unlikely to be a single, ‘biologically correct’ clustering of the available genes and samples: disease subtypes and gene pathways are not always uniquely defined, and they may exhibit moderate overlap. Multiple clusterings better capture the structure present in this situation. By combining normalization results from multiple clusterings (each of which yields a local minimum of the sum of squares cost function) the XPN algorithm performs a simple form of model averaging. Averaging also controls (minor) instability that may arise from use of the k-means clustering procedure, whose output is dependent on the initial choice of cluster centroids. In this latter respect, XPN is similar in spirit to resampling-based approaches to cluster stability such as those in (Dudoit and Fridlyand, 2002; Tibshirani et al., 2001; Tseng, 2007; Tseng and Wong, 2005).

In principle, the XPN procedure can be used with any clustering method that produces a pre-specified number of clusters from a given set of vectors, or with resampling-based improvements of such methods. We chose to use k-means clustering because of its simplicity and computational efficiency. The validation study below indicates that the XPN method performs well, and generally outperforms competing normalization methods, when it is used with basic k-means clustering. The validation results leave open the possibility of further improvements with alternative clustering methods, but a number of experiments with other clustering methods have not produced better results.

In the current implementation of XPN, the numbers of row and column clusters, K ≥ 1 and L ≥ 1, respectively, are fixed in advance, and will depend on the type and dimension of the data under study. In general, L should be large enough to capture principal sample groups or subtypes, and K should be large enough to capture large, homogenous groups of genes. In the numerical experiments below we chose K = 25 and L = 5. (In practice, XPN is not sensitive to the choice of K and L; see Section 6.1 below.) As a general rule we suggest letting the number L of sample clusters be in the range 5–8, and the number K of row clusters be on the order of 10–30, depending on the number of genes. As an alternative, one may employ a method such as the GAP statistic (Tibshirani et al., 2001), implemented as the R function kmeansGap in the library ‘SLmisc’, to assess the number of row and column clusters in the data. Applied to the dataset used in this article, the GAP statistic suggested 4–8 sample clusters and 8–9 gene clusters.


    3 OTHER METHODS
 
We compare XPN with several other normalization methods in the literature. The other methods have previously been applied to batch correction on single platforms, but are well adapted to more general cross-study situations. As a baseline, we standardized each available column (sample) (CS). Beginning with CS data, we median centered each gene in each study and then combined studies. The resulting procedure is denoted by (MC). The MC method is currently used in practice, and in spite of its simplicity, performs relatively well in our validation experiments. We also consider the Empirical Bayes (EB) method (Johnson et al., 2007). EB is based on the model


\[ x_{gsp} = \alpha_g + Y\beta_g + \gamma_{gp} + \delta_{gp}\,\varepsilon_{gsp}. \]

The platform-specific parameters $\gamma_{gp}$ and $\delta_{gp}$ are estimated using an EB approach, and are essentially equal to least squares estimates shrunken towards their respective cross-platform means. Other parameters are estimated by gene-wise OLS. The data are then transformed to remove the effects of different $\gamma_{gp}$ and $\delta_{gp}$ across platforms. Finally, we considered the Distance Weighted Discrimination (DWD) method for batch correction (Benito et al., 2004), which is based on the DWD method (Marron and Todd, 2004). DWD normalization finds a direction in which the sample-vectors from the two studies are well-separated, and then translates the samples from each study along that direction until their respective families of vectors have significant overlap.

The Probability of Expression (POE) method (Parmigiani et al., 2002; Shen et al., 2004), transforms each data value into a signed probability in the range [– 1, 1]. While this transformation is useful for identifying meta-signatures, the resulting data is difficult to compare with normalized values produced by other methods, and we do not include its analysis here.

We note that each of the alternative normalization methods described above is gene-wise affine, that is, for each gene $g$ there exist constants $a_g$ and $b_g$, with $a_g > 0$, such that $\tilde{x}_{gs} = a_g x_{gs} + b_g$ for every sample $s$. As a result, the correlation between $x_{gs}$ and $\tilde{x}_{gs}$ across samples $s$ is 1 for every $g$. In contrast, XPN seeks to simultaneously borrow strength across genes and samples via linked row and column clusters, and as a result, XPN is not gene-wise affine.
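The gene-wise affine property is easy to check numerically for the simplest of these methods. The sketch below, with hypothetical helper names and toy random data, applies column standardization (CS) followed by gene median centering (MC) within each study and confirms that every gene's values before and after centering are perfectly correlated.

```python
import numpy as np

def column_standardize(X):
    """CS: give every sample (column) mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def median_center(Z):
    """MC step: subtract each gene's (row's) median within one study."""
    return Z - np.median(Z, axis=1, keepdims=True)

def rowwise_corr(U, V):
    """Pearson correlation between matching rows of U and V."""
    Uc = U - U.mean(axis=1, keepdims=True)
    Vc = V - V.mean(axis=1, keepdims=True)
    return (Uc * Vc).sum(axis=1) / np.sqrt((Uc ** 2).sum(axis=1) * (Vc ** 2).sum(axis=1))

rng = np.random.default_rng(0)
X1 = rng.normal(5.0, 2.0, size=(50, 20))   # toy study 1: 50 genes, 20 samples
X2 = rng.normal(8.0, 1.0, size=(50, 30))   # toy study 2: same genes, 30 samples

cs1, cs2 = column_standardize(X1), column_standardize(X2)
mc1, mc2 = median_center(cs1), median_center(cs2)
combined = np.hstack([mc1, mc2])           # merged MC dataset

gene_corr = rowwise_corr(cs1, mc1)         # correlation per gene, CS versus MC
```

Here MC shifts each gene by a constant within a study (slope 1, offset equal to minus the gene median), so `gene_corr` equals 1 up to rounding. A method that is not gene-wise affine, such as XPN, would generally give values below 1.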


    4 DATASETS AND PREPROCESSING
 
We applied XPN and the methods described above to three existing breast cancer datasets. The first dataset, from (Huang et al., 2003), has 89 samples and 8948 genes. Their experiments were performed with Affymetrix GeneChip U95Av2 arrays. The 89 samples were obtained at the Koo Foundation Sun Yat-Sen Cancer Centre (KF-SYSCC), Taipei. The second dataset, which will be referred to as Nederlands Kanker Instituut [Netherlands Cancer Institute (NKI)], comes from (van't Veer et al., 2002). It contains 97 samples and 16 360 genes, and was obtained from Netherlands Cancer Institute and Rosetta Inpharmatics-Merck custom designed 25K Agilent oligonucleotide arrays. Most of the NKI patients had stage I or II breast cancer. The third dataset, referred to as University of North Carolina (UNC), is from (Hu et al., 2006). It contains 114 samples representing 104 patients and 12 065 genes, and was obtained using 22K Agilent oligonucleotide arrays. The UNC sample set represents an ethnically and geographically diverse cohort.

Initially, locally weighted regression (LOWESS) normalization was applied to the NKI and UNC datasets; robust multi-array analysis (RMA) was used to obtain expression values for the Huang dataset. The raw expression values in each study were then log-2 transformed, and missing values were imputed with 1-nearest neighbor imputation (Troyanskaya et al., 2001). Duplicated genes in each dataset were collapsed by median using Entrez Gene ID. There were 6092 common genes among the three platforms. Cross-study normalization methods were applied to this set of common genes, and subsequently to a smaller set of ‘intrinsic genes’ (Perou et al., 2000) identified as playing an active role in the biology of breast cancer.

The next section presents validation results for the set of common genes. The same analysis for the set of intrinsic genes is presented in the supplementary materials. In our experiments, all cross-platform normalization methods worked better on the set of intrinsic genes, and more generally, on smaller gene sets selected using integrative correlation filtering. Prior to cross-study normalization, the log-2 transformed expression values in each platform were column standardized.


    5 VALIDATION
 
Broadly speaking, cross-study normalization methods can be assessed in terms of two competing criteria. Ideally, a normalization method should produce a single unified dataset, in which samples originating in Study 1 are not distinguishable from those originating in Study 2 on the basis of non-biological features. A method that fails to remove systematic differences between studies under-corrects the data. On the other hand, excessive homogenization of the studies (over-correction) can result in a loss of biological information, and the combined dataset may be less useful than its constituents.

The validation results presented below are intended to assess the performance of the methods under study, and their tendency towards over- and under-correction. We begin with the column-standardized datasets $X_1$, $X_2$ and $X_3$. Every method is applied to each pair $X_i$, $X_j$ with $1 \le i < j \le 3$ to produce normalized data $[\tilde{X}_i, \tilde{X}_j]$. Validation measures are applied to each pair, and the average value of the measure over the three pairs is reported. For before and after comparisons, we take as a reference the initial data $[X_i, X_j]$ produced by column-standardization (denoted CS in what follows).

In order to better understand the baseline behavior and biases of the normalization methods under consideration, we also apply them to artificial studies obtained by randomly dividing the arrays in a given platform into two pseudo-studies, similar to the procedure in (Gentleman et al., 2006). To be more precise, from a single column-standardized dataset $X_i$, we produce a pair $(X_i', X_i'')$ of pseudo-studies by randomly assigning each sample to one of two groups. Different normalization methods are then applied to $(X_i', X_i'')$, yielding normalized datasets $(\tilde{X}_i', \tilde{X}_i'')$. Validation measures are applied to compare the pseudo-study and its normalized version. Each of the three available datasets is randomly split 10 times, and the average measure (over splits and studies) is reported.

By design, the data in each pair of pseudo studies come from a common platform and study. Thus we anticipate that a cross-study normalization method should have relatively little effect, beyond its attempt to correct the unavoidable differences that result from splitting the studies in half. While these differences are not negligible, they are typically smaller than the differences between platforms.

5.1 Measures of center and spread
For a given array, the difference between the mean and the median of its values provides a rough measure of its asymmetry with regard to location. After normalization, it is desirable to see a similar distribution of asymmetry across both studies. Figure 2 shows the area between the cumulative distribution functions (CDFs) of mean minus median in the two available studies. Graphs for both standard and split-study validation are shown.


Fig. 2. Area between the CDFs of array mean minus array median across platforms. Lower values indicate greater similarity of datasets after normalization.
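One way to compute such an area is to integrate the absolute difference between the two empirical CDFs, which are step functions. The sketch below does this with NumPy on toy data; the function names are hypothetical, and the use of empirical CDFs is an assumption, since the text does not say how the CDFs are estimated.

```python
import numpy as np

def area_between_ecdfs(a, b):
    """Integral of |F_a(x) - F_b(x)| dx for the empirical CDFs of a and b."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    grid = np.sort(np.concatenate([a, b]))
    # Both ECDFs are constant on each interval [grid[i], grid[i+1]).
    Fa = np.searchsorted(a, grid[:-1], side="right") / len(a)
    Fb = np.searchsorted(b, grid[:-1], side="right") / len(b)
    return float(np.sum(np.abs(Fa - Fb) * np.diff(grid)))

def mean_minus_median(X):
    """Per-array (column) asymmetry: mean minus median of its values."""
    return X.mean(axis=0) - np.median(X, axis=0)

rng = np.random.default_rng(0)
X1 = rng.normal(size=(500, 40))                 # roughly symmetric arrays
X2 = rng.exponential(size=(500, 60)) - 1.0      # right-skewed arrays
area = area_between_ecdfs(mean_minus_median(X1), mean_minus_median(X2))
```

For two single-point samples at 0 and 1 the function returns 1, and for identical samples it returns 0, which matches the reading that lower values indicate greater similarity.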

 
A similar comparison for scale can be carried out by considering the SD ($\sigma$) and the median absolute deviation from the median (MAD). For the standard normal distribution with CDF $\Phi$, we have $\sigma = \mathrm{MAD}/\Phi^{-1}(0.75)$. Figure 3 shows the area between the CDFs of $\sigma - \mathrm{MAD}/\Phi^{-1}(0.75)$ in each of the two available studies. XPN reduces both measures more than the other methods; the split-study results show little bias for all methods.


Fig. 3. Area between the CDFs of $\sigma - \mathrm{MAD}/\Phi^{-1}(0.75)$ for arrays of different platforms. Lower values indicate greater similarity of datasets after normalization.
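The scale statistic can be computed per array as sketched below. The constant is the 0.75 quantile of the standard normal distribution, about 0.6745, obtained here from Python's standard library. The function name and the choice of the sample SD (`ddof=1`) are assumptions.

```python
from statistics import NormalDist
import numpy as np

Q75 = NormalDist().inv_cdf(0.75)    # 0.75 quantile of N(0, 1), about 0.6745

def sd_minus_scaled_mad(X):
    """Per-array (column) difference between the SD and MAD / Q75."""
    sd = X.std(axis=0, ddof=1)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)   # median absolute deviation from median
    return sd - mad / Q75

rng = np.random.default_rng(0)
normal_arrays = rng.normal(0.0, 1.0, size=(200_000, 2))
heavy_arrays = rng.standard_t(df=3, size=(200_000, 2))
diff_normal = sd_minus_scaled_mad(normal_arrays)   # close to 0 for Gaussian data
diff_heavy = sd_minus_scaled_mad(heavy_arrays)     # positive for heavy tails
```

For Gaussian arrays the two scale estimates agree, so the statistic is near zero. Heavy tails inflate the SD more than the MAD and push it upward, which is why its distribution across arrays is a useful fingerprint of a platform.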

 
5.2 Average distance to nearest array in another platform
The set of arrays in a given platform can be viewed as a set of points in m-dimensional Euclidean space. After normalization it is reasonable to expect that the point ‘clouds’ associated with distinct platforms will have substantial overlap. (This is one of the motivations behind the DWD normalization method.) To measure overlap in a pair of normalized studies, we measure the Euclidean distance from each array in the first study to the nearest array in the second study, then repeat, swapping the roles of the studies, and finally average the results. The results are presented in Figure 4, with smaller values indicating greater overlap. XPN and EB reduce the average distance more than other methods. The split-study results show little bias for all the methods.


Fig. 4. Average L2 distance from the samples of one study to the nearest sample from the other study. Lower values indicate greater similarity of the study point ‘clouds’ after normalization.
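A minimal sketch of this overlap measure follows. It assumes that "average the results" means the mean of the two directional averages; the text would also be consistent with pooling all arrays before averaging. The function name is hypothetical.

```python
import numpy as np

def avg_nearest_distance(X1, X2):
    """Symmetric average distance from each array to the nearest array
    of the other study. Columns of X1 and X2 are arrays (samples)."""
    P, Q = X1.T, X2.T                                  # one point per row
    d2 = ((P ** 2).sum(axis=1)[:, None]
          - 2.0 * P @ Q.T
          + (Q ** 2).sum(axis=1)[None, :])
    D = np.sqrt(np.maximum(d2, 0.0))                   # clip tiny negative rounding errors
    one_to_two = D.min(axis=1).mean()                  # study 1 -> nearest in study 2
    two_to_one = D.min(axis=0).mean()                  # study 2 -> nearest in study 1
    return 0.5 * (one_to_two + two_to_one)

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 20))
X2_shifted = rng.normal(size=(100, 25)) + 3.0          # systematic platform offset
X2_aligned = X2_shifted - 3.0                          # offset removed
d_before = avg_nearest_distance(X1, X2_shifted)
d_after = avg_nearest_distance(X1, X2_aligned)
```

In this toy example, removing the constant offset brings the two point clouds together, so `d_after` is smaller than `d_before`.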

 
5.3 Correlation with column standardized data
The previous validation measures assess the similarity of two datasets after normalization. A natural way to see how much the normalization methods affect the data is to calculate correlation between the data matrices before and after normalization, where ‘before’ is represented by CS. This measure does not by itself support a given normalization method, but in choosing between methods that perform similarly across other validation measures, the method that has less effect on the data should clearly be preferred. The average correlation of arrays before and after normalization for the different methods under study is shown in Figure 5. Median centering has the least effect on the data; the other three methods yield average correlations close to 0.8, with XPN lying between DWD and EB. Table 1 shows the average correlation of genes before and after normalization, averaged over both studies. As discussed above, all methods but XPN perform normalization by transforming each gene in an affine fashion; thus the gene correlation for these methods is equal to 1. Similar remarks apply to the integrative correlation and t-statistic measures described below. The gene correlation for XPN is 0.99, with a split-study value of 0.996.


Fig. 5. Average correlation of arrays with their values before normalization (CS). Larger values indicate less modification of the data by the normalization procedure.

 

Table 1. Gene-based correlation measures

 
5.4 Global integrative correlation
Integrative correlation (Cope et al., 2007) is a means of identifying genes with concordant expression in different studies. Let $r_1(g)$ and $r_2(g)$ be the $g$th rows of $X_1$ and $X_2$, respectively. The global integrative correlation (GIC) between $X_1$ and $X_2$ is the correlation between


\[ \{\mathrm{corr}(r_1(g), r_1(g')) : g, g' \in G\} \quad \text{and} \quad \{\mathrm{corr}(r_2(g), r_2(g')) : g, g' \in G\}, \]

here regarded as vectors with $|G|^2$ components. High values of GIC indicate good concordance between the values in Studies 1 and 2. GICs for different normalization methods are shown in Table 1. The results for CS show that the average GIC between halves of the same platform (0.556) is much higher than the average GIC between different studies (0.255). XPN is the only method among those considered that affects GIC. It increases GIC by 33% to 0.338 in cross-study validation, well below the split-study level (0.556). XPN increases GIC between pseudo-studies by a relatively small 7%.
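The GIC can be computed directly from the two gene-by-gene correlation matrices. The sketch below uses NumPy on toy data with a hypothetical function name. It also shows why gene-wise affine methods leave the GIC unchanged: rescaling and shifting each gene with a positive slope does not alter any gene-gene correlation.

```python
import numpy as np

def global_integrative_correlation(X1, X2):
    """Correlation between the flattened gene-gene correlation matrices."""
    C1 = np.corrcoef(X1)        # |G| x |G|; rows of X1 are genes
    C2 = np.corrcoef(X2)
    return float(np.corrcoef(C1.ravel(), C2.ravel())[0, 1])

rng = np.random.default_rng(0)
m = 40
loadings = rng.normal(size=(m, 3))                      # shared gene structure (toy)
X1 = loadings @ rng.normal(size=(3, 25)) + rng.normal(size=(m, 25))
X2 = loadings @ rng.normal(size=(3, 30)) + rng.normal(size=(m, 30))

gic = global_integrative_correlation(X1, X2)

# A gene-wise affine map with positive slopes leaves the GIC unchanged.
a = rng.uniform(0.5, 2.0, size=(m, 1))
b = rng.normal(size=(m, 1))
gic_affine = global_integrative_correlation(a * X1 + b, X2)
```

Note that the flattened vectors include the diagonal entries, which all equal 1, in keeping with the $|G|^2$ components mentioned in the text.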

Each tumor sample in the datasets under consideration has an associated, clinically based ER status (ER+ or ER–). We next consider several validation measures based on this biological information. The Huang dataset has only 15 ER negative samples out of 89, making its split-study results unstable, and is therefore excluded from the split study analysis of the ER-based validation measures.

5.5 Correlation of t-statistics
For each platform, t-statistics measuring the association of gene-expression values with the ER status are calculated. Ideally, the vectors of t-statistics for different platforms should become more concordant after platform normalization. Table 1 shows the Pearson correlation between the t-statistics for ER status for different normalization methods. (Results for rank correlation are similar.) As expected, the average correlation of t-statistics is higher in split study (0.446) than between platforms (0.312). XPN increases the correlation of t-statistics between platforms by 45% to 0.451. In split-study validation it increased correlation by roughly 22%. Overall, XPN has greater effect than the other methods considered. The correlation measurements above show that, on average, XPN does not make dramatic changes in the rows of the data matrices, and we believe that much of the split study increase in t-statistic correlation is due to inherent differences between the randomly selected pseudo studies.
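A sketch of this measure on toy data follows. The article does not state which two-sample t-statistic is used, so the unequal-variance (Welch) form below is an assumption, as are the function name and the simulated ER labels.

```python
import numpy as np

def t_statistics(X, er_positive):
    """Gene-wise two-sample t-statistics (Welch form, an assumption).

    X: genes x samples; er_positive: boolean array, one entry per sample.
    """
    pos, neg = X[:, er_positive], X[:, ~er_positive]
    num = pos.mean(axis=1) - neg.mean(axis=1)
    den = np.sqrt(pos.var(axis=1, ddof=1) / pos.shape[1]
                  + neg.var(axis=1, ddof=1) / neg.shape[1])
    return num / den

rng = np.random.default_rng(3)
m = 300
effect = rng.normal(0.0, 2.0, size=m)      # toy ER effect shared by both studies

def toy_study(n):
    er = np.arange(n) % 2 == 0              # alternating ER+ / ER- labels
    X = rng.normal(size=(m, n)) + effect[:, None] * er
    return X, er

X1, er1 = toy_study(40)
X2, er2 = toy_study(50)
t1, t2 = t_statistics(X1, er1), t_statistics(X2, er2)
r = float(np.corrcoef(t1, t2)[0, 1])        # Pearson correlation of t-statistics
```

Because the t-statistic of a gene is invariant under a positive affine change of that gene's values, gene-wise affine normalizations applied within a study cannot change this correlation, whereas XPN can.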

5.6 Cross-platform prediction of ER status
If we regard ER status as a binary phenotype, we may explore misclassification rates associated with its prediction. Ideally, combining labeled studies via cross-platform normalization should lead to lower misclassification rates on test datasets. To test the compatibility of different studies after normalization in regards to classification, we treated the data from one study as a training set, and the data from the other study as a test set, and vice versa. Lower error rates indicate better concordance. Classification was performed using two methods: nearest shrunken centroids prediction analysis for microarrays (PAM) (Tibshirani et al., 2002) and support vector machines (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995). The results are presented in Figures 6 and 7. As can be seen, all of the normalization methods greatly reduce cross-platform prediction error, with the minimum error achieved by XPN. In the split-study test, none of the methods produces significant reductions in classification error, as expected.


Fig. 6. Cross-platform prediction error of the PAM (nearest shrunken centroids) classifier. Smaller values indicate better concordance between platforms.

 

Fig. 7. Cross-platform prediction error of the SVM (Support Vector Machine) classifier. Smaller values indicate better concordance between platforms.

 
One might also be interested in the 5- or 10-fold cross-validation prediction error rate on the combined studies. However, none of the normalization methods has a significant effect on the cross-validated classification error. This appears to arise from the fact that, in cross validation, the classification methods are trained on elements of both platforms, and the distinguishing features of ER status are strong enough to enable the methods to perform well without prior normalization.

5.7 Preservation of significant genes
Lastly, we consider gene lists produced using ER-based t-statistics at a nominal 0.1% significance threshold. Let $L_i$ be the list of significant genes in Study $i = 1, 2$, and let $L_{1,2}$ be the list produced at the same nominal 0.1% level from the combined data $[\tilde{X}_1, \tilde{X}_2]$. Ideally, genes that are in both $L_1$ and $L_2$ should appear in $L_{1,2}$, and most genes that appear in at least one of the single-study lists should be in the joint list. We assess these two types of overlap by the measures $V_1 = |(L_1 \cap L_2) \cap L_{1,2}| / |L_1 \cap L_2|$ and $V_2 = |(L_1 \cup L_2) \cap L_{1,2}| / |L_1 \cup L_2|$, respectively. The results are presented in Table 2. The value of $V_1$ is 1 for all normalization methods except CS, showing the importance of platform normalization. The $V_2$ measure is increased by all methods, with the greatest increase achieved by MC and DWD.


 
Table 2. Measures of gene-list preservation

 

    6 FURTHER DISCUSSION OF XPN
 
6.1 Stability with respect to K and L parameters
To test the stability of XPN with respect to the numbers K and L of row and column clusters, we applied XPN with a range of parameters. For L = 5 we tried K = 2, 10, 20, 25, 30, 50, 100, 500, and for K = 25 we tried L = 2, 4, 5, 6, 7, 8, 10. The results (presented in the Supplementary Materials) indicate that XPN is generally insensitive to the choice of K and L. However, we do see the expected degradation of performance when K or L is below four, in which case the clustering is too coarse to adequately capture homogeneous blocks of samples or genes. At the other extreme, when L is large, one finds column clusters containing samples from a single platform. For such clusters the algorithm cannot combine information across platforms, and its results will be degraded accordingly. (In its current implementation, XPN excludes such clusterings from the average that forms its output.) Values of K larger than 25 make the algorithm slower and do not substantially improve its performance.
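The exclusion rule for column clusterings can be sketched as a simple check: a clustering is usable only if every column cluster contains samples from both platforms. This is an illustrative sketch, not the XPN implementation; the function names and the toy assignments are assumptions.

```python
def single_platform_clusters(cluster_labels, platform_labels):
    """Column clusters whose samples all come from one platform."""
    platforms_in_cluster = {}
    for cluster, platform in zip(cluster_labels, platform_labels):
        platforms_in_cluster.setdefault(cluster, set()).add(platform)
    return sorted(c for c, ps in platforms_in_cluster.items() if len(ps) < 2)

def is_usable(cluster_labels, platform_labels):
    """True if every column cluster mixes samples from both platforms."""
    return not single_platform_clusters(cluster_labels, platform_labels)

# Hypothetical example: six samples, three from each platform.
platforms = [1, 1, 1, 2, 2, 2]
clustering_a = [0, 0, 1, 0, 1, 1]   # both clusters mix platforms
clustering_b = [0, 0, 2, 1, 1, 2]   # clusters 0 and 1 are single-platform

usable = [c for c in (clustering_a, clustering_b) if is_usable(c, platforms)]
```

Only `clustering_a` survives the filter in this example. As L grows relative to the number of samples, single-platform clusters become more likely, which is consistent with the degradation described above.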

6.2 Stability of XPN output
The XPN algorithm averages the normalization results from B row/column clusterings. To assess the stability of XPN, we calculated the SD of each element in the normalized matrix over the B = 100 runs of the basic procedure. The average SD (over all elements and platform pairs) was 0.004. In contrast, the average SD of the entries of the normalized matrices was 0.79. Thus, the variability of the normalized entries due to random clusterings was, on average, two orders of magnitude less than the variability between the final normalized entries.
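The comparison of the two SDs can be sketched as follows. The sketch assumes the B normalized matrices are available as nested lists of equal shape, uses the population SD, and takes the second quantity to be the SD of the entries of the averaged matrix. These choices, the function name and the simulated data are assumptions, since the exact definitions are not spelled out here.

```python
import random
from statistics import fmean, pstdev

def stability_summary(runs):
    """Return (mean per-element SD across runs, SD of the entries of the averaged matrix)."""
    n_rows, n_cols = len(runs[0]), len(runs[0][0])
    per_element_sd, averaged = [], []
    for i in range(n_rows):
        for j in range(n_cols):
            values = [run[i][j] for run in runs]
            per_element_sd.append(pstdev(values))   # variability due to random clusterings
            averaged.append(fmean(values))          # entry of the final averaged output
    return fmean(per_element_sd), pstdev(averaged)

# Simulated stand-in for B = 25 runs: a fixed signal plus small run-to-run noise.
rng = random.Random(0)
signal = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(20)]
runs = [[[v + rng.gauss(0.0, 0.01) for v in row] for row in signal]
        for _ in range(25)]

run_sd, entry_sd = stability_summary(runs)
```

In this simulation `run_sd` is close to the injected noise level of 0.01 while `entry_sd` is close to 1, mirroring the two-orders-of-magnitude gap reported above.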


    7 CONCLUSION
 
The increasing number and public availability of large-scale gene-expression studies provide impetus for cross-study analyses that combine existing, and potentially new, datasets. Properly combined datasets give researchers more power for biological and statistical analysis. In this article we propose a new, block model-based method, called XPN, for cross-platform normalization. The block model distinguishes XPN from other platform normalization methods such as DWD and EB, which are gene-wise linear.

We propose a set of validation measures for comparing different normalization methods. The validation measures can be roughly split into two groups. One group assesses the ability of normalization methods to remove systematic differences across platforms, while the other measures how much the data are transformed by the normalization procedures. Based on the proposed validation measures, XPN successfully combined three existing breast cancer datasets without incurring substantial overfitting. In particular, cross-platform ER prediction error rates indicate that XPN successfully preserved biological information while removing systematic differences between platforms.

The XPN method has three parameters: the numbers of row and column clusters (K and L) and the number of basic iterations B. Our experiments indicate that the results of XPN are robust to the choice of K and L (see Section 6.1). The analysis in Section 6.2 suggests that setting B = 30 is sufficient for stable output.


    APPENDIX: MAXIMUM LIKELIHOOD ESTIMATION OF THE MODEL
 
The XPN algorithm estimates the parameters of Model (2) using a maximum likelihood approach. The model has distinct sets of parameters for different gene clusters and different platforms. Thus the problem of parameter estimation can be split into 2K smaller tasks. Fix i ∈ {1, ..., K} and p ∈ {1, 2}. The log-likelihood function associated with gene group i and platform p can be expressed as


Formula

To ensure identifiability of the coefficients {Aijp} and {bgp}, we set


Formula

The parameters Aijp, bgp, cgp and Formula are chosen to maximize the log-likelihood. To find them, we take the first derivatives of the log-likelihood with respect to these parameters and set them equal to zero:


Formula

Here and in what follows, each sum is taken over all the genes in the ith cluster. The above equations simplify to


Formula

Define the sample mean and variance of the expression values of a gene in sample block j:


Formula

This allows further simplification of the equations


Formula

There is no closed-form solution for this system of equations. To obtain the estimates, the formulas are applied iteratively until the parameters converge. Each iteration increases the log-likelihood, and the limit values satisfy all first-order conditions.
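The iterative scheme can be illustrated with a small sketch for one gene cluster and one platform. It assumes that Model (2) has the form x_gs = A_j(s) b_g + c_g + sigma_g eps_gs with standard normal noise, where j(s) is the column cluster of sample s. That form, the update order, the initialization and all names below are assumptions made for illustration. The identifiability constraints are not imposed, so the scale of A and b is left free. It is one possible block-coordinate ascent and not necessarily the authors' exact update formulas.

```python
import math
import random

def log_likelihood(X, blocks, A, b, c, s2):
    """Gaussian log-likelihood of the assumed model x_gs = A[j(s)]*b[g] + c[g] + noise."""
    ll = 0.0
    for g, row in enumerate(X):
        for s, x in enumerate(row):
            r = x - A[blocks[s]] * b[g] - c[g]
            ll -= 0.5 * (math.log(2 * math.pi * s2[g]) + r * r / s2[g])
    return ll

def fit_block_model(X, blocks, n_iter=30):
    """Alternate exact maximizations over (b, c, s2) and over A; blocks must be 0..J-1."""
    G, S, J = len(X), len(X[0]), max(blocks) + 1
    members = [[s for s in range(S) if blocks[s] == j] for j in range(J)]
    # Start A at the block means of the gene-averaged sample profile.
    profile = [sum(X[g][s] for g in range(G)) / G for s in range(S)]
    A = [sum(profile[s] for s in m) / len(m) for m in members]
    b, c, s2 = [1.0] * G, [0.0] * G, [1.0] * G
    history = []
    for _ in range(n_iter):
        a = [A[blocks[s]] for s in range(S)]
        a_bar = sum(a) / S
        a_ss = sum((v - a_bar) ** 2 for v in a)
        for g in range(G):
            # Given A: least-squares slope and intercept, then the ML variance.
            x_bar = sum(X[g]) / S
            b[g] = sum((v - a_bar) * (x - x_bar) for v, x in zip(a, X[g])) / a_ss
            c[g] = x_bar - b[g] * a_bar
            s2[g] = sum((x - v * b[g] - c[g]) ** 2 for v, x in zip(a, X[g])) / S
        # Given (b, c, s2): precision-weighted update of each block effect A[j].
        weight = sum(b[g] ** 2 / s2[g] for g in range(G))
        for j, m in enumerate(members):
            num = sum(b[g] / s2[g] * sum(X[g][s] - c[g] for s in m) for g in range(G))
            A[j] = num / (len(m) * weight)
        history.append(log_likelihood(X, blocks, A, b, c, s2))
    return A, b, c, s2, history

# Simulated toy data: 6 genes, 9 samples in 3 column clusters.
rng = random.Random(1)
blocks = [0, 0, 0, 1, 1, 1, 2, 2, 2]
true_A = [-1.0, 0.0, 1.0]
X = []
for _ in range(6):
    slope, shift = rng.uniform(0.5, 1.5), rng.uniform(-1.0, 1.0)
    X.append([true_A[j] * slope + shift + rng.gauss(0.0, 0.1) for j in blocks])

A, b, c, s2, history = fit_block_model(X, blocks)
```

Because each of the two steps maximizes the log-likelihood exactly over its own block of parameters, the values recorded in `history` cannot decrease from one iteration to the next, which matches the monotonicity property stated above.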


    ACKNOWLEDGEMENTS
 
Funding for this work was provided by National Science Foundation Grant (DMS 0406361) to A.B.N. and A.A.S.; National Cancer Institute Breast SPORE program to University of North Carolina at Chapel Hill (P50-CA58223-09A1) to C.M.P. and C.F.; National Cancer Institute (RO1-CA-101227-01) to C.M.P. and C.F.; and by the Breast Cancer Research Foundation. The authors would like to thank J.S. Marron for helpful conversations and suggestions regarding the validation procedures discussed in the article.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: David Rocke

Received on November 28, 2007; revised on February 7, 2008; accepted on March 1, 2008

    REFERENCES
 

    Benito M, et al. Adjustment of systematic microarray data biases. Bioinformatics (2004) 20:105–114.

    Bolstad B, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (2003) 19:185–193.

    Boser B, et al. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. (1992) Berkeley, USA: Berkeley Electronic Press. 144–152.

    Choi J, et al. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics (2003) 19:84–90.

    Cope L, et al. The Integrative Correlation Coefficient: A Measure of Cross-study Reproducibility for Gene Expression Array Data. Working Papers, Department of Biostatistics, Johns Hopkins University (2007) Berkeley, USA: Berkeley Electronic Press. 152.

    Cortes C, Vapnik V. Support-vector networks. Mach. Learn (1995) 20:273–297.

    Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol (2002) 3:1–21.

    Garrett-Mayer E, et al. Cross-study Validation and Combined Analysis of Gene Expression Microarray Data. (2004) Working Papers, Department of Biostatistics, Johns Hopkins University. Berkeley Electronic Press, Berkeley, US, p. 65.

    Garrett-Mayer E, et al. Cross-study validation and combined analysis of gene expression microarray data. Biostatistics (2007) kxm033.

    Gentleman R, et al. Meta-analysis for microarray experiments. Bioconductor (2006) http://www.bioconductor.org/packages/bioc/vignettes/GeneMeta/inst/doc/GeneMeta.pdf.

    Ghosh D, et al. Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Funct. Integr. Genomics (2003) 3:180–188.

    Huang E, et al. Gene expression predictors of breast cancer outcomes. The Lancet (2003) 361:1590–1596.

    Hu Z, et al. The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics (2006) 7:96.

    Irizarry R, et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res (2003a) 31:e15.

    Irizarry R, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (2003b) 4:249–264.

    Johnson WE, et al. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics (2007) 8:118–127.

    Marron JS, et al. Distance weighted discrimination. Journal of the American Statistical Association (2004).

    Parmigiani G, et al. A statistical framework for expression-based molecular classification in cancer. J. R. Stat. Soc. Series B (Stat. Method.) (2002) 64:717–736.

    Parmigiani G, et al. A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clin. Cancer. Res (2004) 10:2922–2927.

    Perou C, et al. Molecular portraits of human breast tumours. Nature (2000) 406:747–752.

    Rhodes D, et al. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res (2002) 62:4427–4433.

    Rhodes D, et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Nat. Acad. Sci (2004) 101:9309–9314.

    Shen R, et al. Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics (2004) 5:94.

    Tibshirani R, et al. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Method.) (2001) 63:411–423.

    Tibshirani R, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Nat. Acad. Sci (2002) 99:6567.

    Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics (2001) 17:520–525.

    Tseng G, Wong W. Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics (2005) 61:10–16.

    Tseng G. Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics (2007) 23:2247.

    van't Veer L, et al. Gene-expression profiling predicts clinical outcome of breast cancer. Nature (2002) 415:530–536.

    Yang Y, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res (2002) 30:e15.




