

Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(4):517-528; doi:10.1093/bioinformatics/bti029

Bioinformatics vol. 21 issue 4 © Oxford University Press 2005; all rights reserved.

Statistical methods of translating microarray data into clinically relevant diagnostic information in colorectal cancer

Byung Soo Kim 1,*, Inyoung Kim 2, Sunho Lee 4, Sangcheol Kim 2,3, Sun Young Rha 2,3 and Hyun Cheol Chung 2,3

1 Department of Applied Statistics, College of Medicine, Yonsei University Seoul, South Korea
2 Cancer Metastasis Research Center, College of Medicine, Yonsei University Seoul, South Korea
3 Brain Korea 21 Project for Medical Science, College of Medicine, Yonsei University Seoul, South Korea
4 Department of Applied Mathematics, Sejong University Seoul, South Korea

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 EXPERIMENT AND DATA...
 3 STATISTICAL METHODS
 4 RESULTS
 5 DISCUSSION

 REFERENCES
 

Motivation: It is a common practice in cancer microarray experiments that a normal tissue is collected from the same individual from whom the tumor tissue was taken. The indirect design is usually adopted for the experiment, in which a common reference RNA is hybridized to both normal and tumor tissues. However, it is often the case that the test material is not large enough for the experimenter to extract enough RNA to conduct the microarray experiment. Hence, collecting n cases does not necessarily yield a matched pair sample of size n. Instead we usually have a matched pair sample of size n 1, and two independent samples of sizes n 2 and n 3, respectively, for ‘reference versus normal tissue only’ and ‘reference versus tumor tissue only’ hybridizations (n = n 1 + n 2 + n 3). Standard statistical methods need to be modified and new statistical procedures need to be developed for analyzing this mixed dataset.

Results: We propose a new test statistic, t 3, as a means of combining all the information in the mixed dataset for detecting differentially expressed (DE) genes between normal and tumor tissues. We employed the extended receiver operating characteristic approach to the mixed dataset. We devised a measure of disagreement between an RT–PCR experiment and a microarray experiment. Hotelling's T 2 statistic is employed to detect a set of DE genes and its prediction rate is compared with the prediction rate of a univariate procedure. We observe that Hotelling's T 2 statistic detects DE genes more efficiently than a univariate procedure and that further research is warranted on the formal test procedure using Hotelling's T 2 statistic.

Contact: bskim{at}yonsei.ac.kr


    1 INTRODUCTION
Microarray technology, with its potential to quantitatively measure the expression levels of thousands of genes simultaneously, holds the promise of becoming a valuable tool in cancer research and clinical diagnostics. Two technologies are widely used, the cDNA microarray and the oligonucleotide array, which differ with respect to the types of nucleic acid probes arrayed for the interrogation of labeled RNA specimens. In the present study, we focus on the cDNA microarray, which uses a dual-label system in which two RNA specimens are separately reverse transcribed, labeled, mixed and hybridized together to each array.

In the cancer microarray experiment based on n patients, the experimenter often wants to compare the tumor tissue with the normal tissue within the same individual using a common reference RNA. This design is often referred to as a reference design or an indirect design. Ideally, this experiment produces n pairs of microarray data, where each pair consists of two sets of microarray data resulting from reference versus normal tissue and reference versus tumor tissue hybridizations. However, for certain individuals either normal tissue or tumor tissue is not large enough for the experimenter to extract enough RNA for conducting the microarray experiment, hence there are missing values in the normal and tumor tissue data. In practice, we have n 1 pairs of complete observations, n 2 ‘normal only’ and n 3 ‘tumor only’ data for the microarray experiment with n patients, where n = n 1 + n 2 + n 3. We refer to this dataset as a mixed dataset, as it contains a mixture of fully observed and partially observed pair data. Such a mixed dataset was actually observed in our microarray experiment based on human tissues, which were obtained during the surgical operations of cancer patients. The reference design offers two major advantages in this type of microarray experiment. First, it permits the parallel comparison of differentially expressed (DE) genes across different cancers. Second, it allows us to utilize the partially observed pair data obtained from n 2 and n 3 patients. If we had employed the direct design where both tumor and normal tissues were put on a single array, no information would have been obtained for the partially observed pair data.

The aim of this study is to develop a statistical method for the mixed dataset. We conducted a cDNA microarray experiment based on 87 colorectal cancer patients using a reference design with cDNA microarrays containing ~17 000 human genes. The primary objectives of the microarray experiment based on colorectal cancers were to rank genes for the development of a biomarker that could be used for population screening, and to detect DE genes for various subtypes of colorectal cancer, e.g. colon cancer versus rectum cancer, stages (B, C and D) and carcinoembryonic antigen (CEA) ≥5 versus CEA <5. Clinicians use CEA for monitoring the progress of colorectal cancer after the initial treatment, and the threshold value is usually set to be 5.

The success of microarray technology in cancer research depends on the development of statistical methods for detecting DE genes among different tissue types. The most widely used statistical methods for detecting DE genes are basically univariate procedures often coupled with adjustment of P-values or a similar concept due to multiple tests (Tusher et al., 2001; Dudoit et al., 2002b; Lönnstedt and Speed, 2002). These univariate methods disregard the multidimensional structure of microarray data. Multivariate approaches are needed that can utilize the correlated structure of the microarray data and therefore capture the hidden information in gene interactions. Li et al. (2001) proposed a genetic algorithm combined with the k-nearest neighbor method to detect DE genes that might have a potential to jointly discriminate different tissues, e.g. normal versus tumor. However, it still remains to be seen whether the genetic algorithm can detect DE genes in multivariate dimension. Szabo et al. (2003) suggested a random search algorithm after defining a computationally tractable distance in the p-dimensional gene space. It is interesting to note that the set of DE genes detected by a univariate procedure, for example, the t-test, and the set of DE genes selected by these two multivariate procedures have ≤50% in common for the biological replicates¹. We propose in this study using Hotelling's T 2 statistic for detecting DE genes. As an initial attempt to employ Hotelling's T 2 statistic as a multivariate approach to microarray data, we restrict ourselves to a random vector of length two, primarily due to the computational limit.

Simon and Dobbin (2003) characterized the objectives of many DNA microarray experiments as class comparison, class prediction or class discovery. The detection of DE genes between two groups, e.g. between tumors that respond to chemotherapy and those that do not (Rosenwald et al., 2002), between normal and tumor tissues (Alon et al., 1999) and among various subtypes of a tumor (Hedenfalk et al., 2001), comprises a class comparison problem. Classifying a new specimen to the known subtype of a cancer (van de Vijver et al., 2002) and developing a molecular prognostic predictor model (Rosenwald et al., 2002) would result in a class prediction problem. Discovering a new subtype of a cancer based on the molecular profile (Alizadeh et al., 2000) would belong to the class discovery problem. In addition to these objectives, ranking candidate genes for the development of a biomarker that can be used for the population screening of a cancer should serve the purpose of the microarray experiment. Pepe et al. (2003) elegantly employed the receiver operating characteristic (ROC) approach for this purpose based on two independent samples that consist of normal and tumor tissues. We employ the extension of Pepe et al.'s approach to the mixed dataset and rank candidate genes involved in the development of a biomarker for the population screening of the colorectal cancer. The extension procedure of Pepe et al.'s approach will be reported in a separate communication.

We first show that three commonly used univariate procedures produce more or less the same set of DE genes between normal and tumor tissues. For the validation of statistical procedures, either univariate or multivariate, used in this study for the detection of DE genes between normal and tumor tissues, the mixed dataset provides a natural way of splitting the sample. We use the n 1 paired data for the training set, and the pool of n 2 ‘normal only’ and n 3 ‘tumor only’ data for the test set. We employed several classifiers including linear and quadratic discriminant analyses, and support vector machine (SVM) for classifying the test set based on the DE genes detected from the training set. Using this split sample approach, we show that Hotelling's T 2 statistic detects DE genes more efficiently than a commonly used univariate procedure. As a means of combining all the information in the mixed dataset, we then propose a t-based statistic, t 3, for the detection of DE genes between normal and tumor tissues. We develop a measure of disagreement between an RT–PCR experiment and a microarray experiment. For the subtype analyses, we employ the split sample approach 10 times repeatedly for the validation of DE genes in each class of subtypes. We use the two-sample t-statistic coupled with a P-value adjustment, or the similar concept due to multiple tests, and SVM to calculate the prediction rate. We could improve the test error of ‘CEA ≥5’ with the help of the regression-based gene selection by taking advantage of the continuous scale of CEA.

In Section 2, we describe patient and experiment data, and microarray data pre-processing including normalization and missing value imputations. Section 3 presents three univariate procedures, Hotelling's T 2 statistic, the t 3 statistic for the detection of DE genes, several classifiers, the ROC approach for ranking genes for the biomarker development, regression-based gene selection for the CEA variable and a measure of disagreement between an RT–PCR assay and a microarray assay. Technical details are discussed in the Appendix section. Section 4 shows the results and Section 5 concludes with a discussion.


    2 EXPERIMENT AND DATA PRE-PROCESSING
Cancer and normal tissues were obtained during the surgical operations from 87 colorectal cancer patients at Severance Hospital (Yonsei Cancer Center, Yonsei University College of Medicine, Seoul, Korea) from May to December 2002. We conducted a cDNA microarray experiment using a common reference design with cDNA microarrays containing ~17 000 human genes. We pooled 11 cancer cell lines and used it for the common reference. These 11 cancer cell lines are as follows: AGS, MDA-MB-231, HCT-116, SK-Hep-1, A549, HL-60, MOLT-4, HeLa, HT-1080, Caki-2 and U87MG (American Type Culture Collection). The fresh specimens of cancer and normal tissues obtained from the colorectal cancer patients during surgery were snap-frozen in liquid nitrogen immediately after the resection and stored at –70°C until further use. Clinical characteristics of the patients are provided in Table 1.


Table 1 Clinical characteristics of 87 colorectal cancer patients

 
We originally attempted to extract total RNAs from the tumor and normal tissues of 87 patients. We obtained RNA specimens both for tumor and normal tissues from 36 patients. However, RNA specimens for normal tissues alone were available from 19 patients. Also, RNA specimens for tumor tissues alone were obtained from the other 32 patients. Thus, we have a matched pair sample² of size 36 and two independent samples of sizes 19 and 32. In terms of the notation provided in Section 1: n 1 = 36, n 2 = 19 and n 3 = 32. These tissues were taken by a single surgeon from 87 patients, and there was no specific clinical or biological reason to expect these three subgroups to differ in their characteristics. Therefore, we assume that these three subgroups are independent samples from a single population. After total RNAs were extracted from fresh frozen tissues, 50 µg of purified RNAs were labeled and hybridized to cDNA microarrays based on the protocol established in Cancer Metastasis Research Center (Yonsei University, Korea) (Park et al., 2004).

We used M = log2(R/G) for the evaluation of relative intensity, where R and G represent the Cy5 and Cy3 fluorescent intensities, respectively.

Quantitative real-time RT–PCR was performed with six selected genes using a Rotor Gene 2072D real-time PCR machine (Corbett Research, Australia) in accordance with the manufacturer's instructions. We used SYBR Green (Qiagen, CA) for the labeling. The amplified fluorescence signal was measured and the level of transcript for each specimen was calculated based on the standard curve. The standard curve was drawn by plotting the measured threshold cycle versus the arbitrary unit of copies/reaction according to the amount of serially diluted standard RNA. The threshold cycle (Ct) value was determined as the cycle number at which the fluorescence exceeded the threshold value.

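Let X and Y denote the log-fluorescent intensity ratios of reference versus normal and reference versus tumor hybridizations, respectively. Let U and V be independent copies of X and Y, respectively. Then, we may observe three data types represented as follows:

$$(X_i, Y_i),\ i = 1, \dots, n_1 \quad \text{(matched pairs of normal and tumor tissues)}$$
$$U_j,\ j = 1, \dots, n_2 \quad \text{(normal tissue only)}$$
$$V_k,\ k = 1, \dots, n_3 \quad \text{(tumor tissue only)}$$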

We first define no missing proportion (NMP) of a gene as the proportion of valid observations out of the total number of arrays. For example, if a gene has valid observations for 32 out of 40 arrays, its NMP is 0.8. We pre-processed the data as follows:

  1. We normalized the log-intensity ratio, log2(R/G), using within-print tip group, intensity-dependent normalization following Yang et al. (2002).
  2. We used 0.8 for the cut point of NMP to delete genes containing missing values for >20% of the total number of observations. This filtering procedure yielded 13 859 genes.
  3. We employed the k-nearest neighbor (k = 10) method for the imputation of missing values.
  4. We averaged values for the multiple spots. The numbers of duplicated, triplicated and quadruplicated spots were 982, 6 and 5, respectively.
  5. Finally, we have a dataset represented by a 12 850 × 123 matrix, where 12 850 is the number of genes (13 859 spots less the 982 + 2 × 6 + 3 × 5 = 1009 removed by averaging multiple spots) and 123 is the number of microarrays (2 × 36 + 19 + 32).
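
The filtering and spot-averaging steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout and the use of None for a missing spot are hypothetical and are not taken from the original analysis; normalization and k-nearest neighbor imputation are not shown.

```python
from statistics import mean

def no_missing_proportion(row):
    """Fraction of arrays with a valid (non-None) value for one gene."""
    return sum(v is not None for v in row) / len(row)

def filter_by_nmp(expr, cutoff=0.8):
    """Keep genes whose NMP is at least `cutoff`, i.e. drop genes
    with more than 20% missing values when cutoff is 0.8."""
    return {g: row for g, row in expr.items()
            if no_missing_proportion(row) >= cutoff}

def average_replicate_spots(spots):
    """Average replicate spots of one gene, array by array.
    `spots` is a list of equal-length rows that are already imputed."""
    return [mean(col) for col in zip(*spots)]

# Toy data (hypothetical): 5 arrays, None marks a missing spot.
expr = {
    "geneA": [0.1, 0.4, None, 0.3, 0.2],   # NMP = 0.8, kept
    "geneB": [None, None, 1.2, 0.9, 1.1],  # NMP = 0.6, dropped
}
kept = filter_by_nmp(expr)
```

In this sketch a gene with NMP exactly equal to 0.8 is retained, which matches the rule of deleting only genes with more than 20% missing values.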

We investigated various box plots (data not shown) after the (location parameter) normalization, and concluded that it was not necessary to have the scale normalization either between blocks within an array or between arrays.

One way of utilizing all the information in the mixed dataset would be to use the matched pair sample as a training set for the detection of DE genes and the two independent samples as a test set for the validation of the chosen DE genes. By detecting a set of DE genes in the training set of matched pairs we could control for between-individual variation. The independence within and between the two samples of sizes 19 and 32 would make them the ideal test set.


    3 STATISTICAL METHODS
3.1 The Detection of DE genes between normal and tumor tissues
We employ the following three univariate procedures for the detection of a set of DE genes from the matched pair sample of size 36.

  1. Paired t-test and Dudoit et al.'s maxT procedure for controlling the family-wise error rate (FWER) (Dudoit et al., 2002b; Ge et al., 2003).
  2. Tusher et al.'s (2001) SAM procedure.
  3. Lönnstedt and Speed's (2002) empirical Bayes procedure using B-statistic.

Dudoit et al. (2002b) employed the FWER for controlling the Type I error and used Westfall and Young's step down procedure for calculating the adjusted P-value. There are 2^36 ways of changing the signs of the paired differences for the 36 patients, from which we can derive the null distribution of the paired t-statistic. Because enumerating all of them exceeds our computational limit, we used 100 000 bootstrap samples of size 36 to derive the null distribution of the paired t-statistic. Tusher et al.'s (2001) SAM procedure is a permutation test with a modified t-statistic. They adopted the false discovery rate (FDR) (Benjamini and Hochberg, 1995) for controlling the Type I error, where FDR is defined as the expectation of the number of false positive genes divided by the number of declared significant genes. Controlling the FDR is less conservative than controlling the FWER and is therefore more sensitive in the detection of significant genes (Ge et al., 2003). Lönnstedt and Speed (2002) used an empirical Bayes method to derive a Bayes log posterior odds, i.e. the B-statistic. The experimenter may consider the top 100 genes in terms of B-values in combination with experimental preference.
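
To make the resampling idea concrete, the following Python sketch draws random sign changes of the paired differences and computes maxT-type adjusted P-values. It is a simplified illustration, not the procedure used in the study: it implements the single-step variant rather than Westfall and Young's step down procedure, and the function names, the number of resamples and the seed are assumptions chosen for the example.

```python
import math
import random

def paired_t(d):
    """One-sample t-statistic of the paired differences d."""
    n = len(d)
    m = sum(d) / n
    var = sum((x - m) ** 2 for x in d) / (n - 1)
    return m / math.sqrt(var / n)

def maxT_adjusted_pvalues(diffs, n_resamples=2000, seed=0):
    """Single-step maxT adjusted P-values from random sign flips.

    diffs[g][i] is the paired difference of gene g for patient i.
    The same sign vector is applied to every gene in a resample so
    that the correlation between genes is preserved under the null.
    """
    rng = random.Random(seed)
    n = len(diffs[0])
    observed = [abs(paired_t(d)) for d in diffs]
    exceed = [0] * len(diffs)
    for _ in range(n_resamples):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        max_t = max(abs(paired_t([s * x for s, x in zip(signs, d)]))
                    for d in diffs)
        for g, t_obs in enumerate(observed):
            if max_t >= t_obs:
                exceed[g] += 1
    return [c / n_resamples for c in exceed]
```

With 36 patients the full set of 2^36 sign vectors is far too large to enumerate, which is why a random subset is drawn here.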

We computed Hotelling's T 2 statistic by pairing two genes in all possible ways from the training set and considered the top 25 pairs in the order of magnitude. We restricted ourselves to the top 50 genes, since 50 is the maximum number of genes for the experimenter to conduct a confirmatory experiment. Hotelling's T 2 statistic can be expressed as follows:

$$T^2 = \frac{t_1^2 + t_2^2 - 2\rho\, t_1 t_2}{1 - \rho^2} \qquad (1)$$

where t 1, t 2 and ρ denote the t-statistics of the two component genes and their correlation coefficient, respectively (Mood et al., 1974). Hotelling's T 2 statistic can pick up some of the genes that are not detected by the univariate t-statistic, mostly when either t 1 or t 2 is large, ρt 1 t 2 has a negative sign and |ρt 1 t 2| is large. To validate sets of DE genes detected by various statistical methods from the training set, we applied several classification methods to the test set including the diagonal linear discriminant analysis (DLDA), the diagonal quadratic discriminant analysis (DQDA) following Dudoit et al. (2002a) and SVM with several kernels, such as linear, quadratic and Gaussian kernels (Christiani and Shawe-Taylor, 2000).
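
The expression of Hotelling's T 2 for a pair of genes in terms of the two t-statistics and their correlation can be checked numerically. The sketch below, with hypothetical function names and toy numbers, computes the statistic both from t 1, t 2 and the sample correlation, and directly as n times the quadratic form of the mean vector in the inverse sample covariance matrix of the paired differences; the two results should agree up to rounding.

```python
import math

def moments(x):
    """Return (t-statistic, mean, sample variance) of one gene."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    return m / math.sqrt(var / n), m, var

def sample_cov(x, y, mx, my):
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def hotelling_t2_pair(d1, d2):
    """T^2 for two genes via t1, t2 and their sample correlation."""
    t1, m1, v1 = moments(d1)
    t2, m2, v2 = moments(d2)
    rho = sample_cov(d1, d2, m1, m2) / math.sqrt(v1 * v2)
    return (t1 ** 2 + t2 ** 2 - 2 * rho * t1 * t2) / (1 - rho ** 2)

def hotelling_t2_direct(d1, d2):
    """Same quantity as n * mean' S^{-1} mean, with the 2x2 inverse
    written out explicitly."""
    n = len(d1)
    _, m1, v1 = moments(d1)
    _, m2, v2 = moments(d2)
    c = sample_cov(d1, d2, m1, m2)
    det = v1 * v2 - c * c
    return n * (m1 * m1 * v2 - 2 * m1 * m2 * c + m2 * m2 * v1) / det

# Toy paired differences for two negatively correlated genes.
d1 = [1.2, 0.8, 1.5, 0.9, 1.1, 1.4]
d2 = [-0.3, 0.1, -0.6, 0.2, -0.1, -0.4]
```

Ranking all gene pairs would amount to evaluating hotelling_t2_pair over every pair of rows, for example with itertools.combinations, which is where the computational cost mentioned in the text arises.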

As a means of utilizing all the information of the mixed dataset for the detection of DE genes, we propose using a t-based statistic, t 3, as follows:


where D = Y − X is the paired difference, $\bar{D}$, $\bar{U}$ and $\bar{V}$ are the sample means and $S_D^2$, $S_U^2$ and $S_V^2$ are the sample variances of D, U and V, respectively, and n H is the harmonic mean of n 2 and n 3. The distributions of D, U and V are almost symmetric around their means under the null hypothesis. Thus, even sample sizes as small as 20 for n 1, n 2 and n 3 should suffice for the central limit theorem to approximate the null distribution of t 3 in Equation (2) by N(0,1).
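
Since the displayed form of Equation (2) is not reproduced in this copy, the following Python sketch should be read as one plausible form of a statistic of this kind, not as the authors' definition. It assumes that the paired mean difference is weighted by n 1 and the difference of the two independent means by n H, and that the denominator is the estimated standard deviation of that weighted sum, so that the ratio is approximately N(0,1) under the null hypothesis. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def harmonic_mean(a, b):
    """Harmonic mean of two sample sizes."""
    return 2 * a * b / (a + b)

def t3_statistic(d, u, v):
    """ASSUMED form of a combined statistic for the mixed dataset.

    d: paired differences (tumor minus normal), size n1
    u: 'normal only' values, size n2
    v: 'tumor only' values, size n3
    """
    n1, n2, n3 = len(d), len(u), len(v)
    nh = harmonic_mean(n2, n3)
    # Weighted sum of the paired and the two-sample mean differences.
    numerator = n1 * mean(d) + nh * (mean(v) - mean(u))
    # Estimated variance of the numerator under independence.
    var_num = n1 * variance(d) + nh ** 2 * (variance(u) / n2 + variance(v) / n3)
    return numerator / math.sqrt(var_num)
```

For the sample sizes of this study, harmonic_mean(19, 32) is 1216/51, roughly 23.8, so under this assumed weighting the two independent samples together would count for somewhat less than the 36 matched pairs.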

3.2 Ranking candidate genes for the biomarker development
Recently, Pepe et al. (2003) employed the ROC approach to rank candidate genes that can be used for the biomarker development with the ultimate purpose of population screening of a cancer. Identifying DE genes in a cancer can be accomplished using various statistical methods, for example, those described in Section 3.1. If a gene is found to be differentially expressed in the cancer, then by developing a suitable biomarker, the corresponding protein product or an antibody to it can be detected in the blood or urine, which forms the basis for the population screening (Pepe et al., 2001).

When we develop a biomarker that can be used for the population screening of a specific cancer from a microarray experiment, we note the following two points. First, clinical bioassays for some gene products may be too difficult to develop for technical reasons. Thus, we need to have a sizable number of candidate genes with development priorities so that if one gene proves to be useless for biomarker development, we may still explore the next gene for the development. The second point is that the bioassay, once it is developed, is applied to the whole population, and hence the false positive rate should be extremely low. Even a small false positive rate yields a large number of healthy people being subjected to diagnostic procedures that are unnecessary, costly and sometimes invasive. For ranking candidate genes under these two perspectives, we need statistical measures that discriminate between normal and tumor tissues. The measure of choice should focus on the minimization of the false positive probability as well as on the separation of these two distributions. Pepe et al. (2003) provide the rationale for using the ROC approach based on two independent samples for the purpose of ranking candidate genes instead of using t- or Mann–Whitney statistics.

We first modified Pepe et al.'s ROC approach of ranking genes for a paired dataset of size n 1 and referred to the corresponding ROC value as ROCpair. Pepe et al.'s approach was directly applicable to calculate the ROC value, denoted by ROCind for two independent samples of sizes n 2 and n 3 in the mixed dataset. Then, we averaged these two ROC values to derive an overall ROC value, denoted by ROCmix, as follows:


where t 0 was the false positive probability which, in turn, was determined by a threshold value and n H was the harmonic mean of n 2 and n 3. Details of the modification and the extension of the ROC approach will be reported in a separate communication.
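
As an illustration only, the Python sketch below computes an empirical ROC value at a fixed false positive probability t 0 for two independent samples, and then combines a paired and an independent ROC value by a weighted average. Two points are assumptions rather than statements of the method: the gene is taken to be over-expressed in tumor tissue (larger values indicate tumor), and the weights n 1 and n H in the average are a guess at the form of Equation (3), which is not reproduced in this copy. All function names are hypothetical.

```python
def roc_at_fpr(normal, tumor, t0):
    """Empirical ROC(t0) for two independent samples.

    The threshold is the smallest normal value such that at most a
    fraction t0 of the normal specimens lie strictly above it; the
    result is the fraction of tumor specimens above that threshold.
    """
    ordered = sorted(normal, reverse=True)
    # Number of normals allowed above the threshold (tolerance guards
    # against floating-point error in t0 * n).
    k = int(t0 * len(ordered) + 1e-9)
    threshold = ordered[k] if k < len(ordered) else float("-inf")
    return sum(x > threshold for x in tumor) / len(tumor)

def roc_mix(roc_pair, roc_ind, n1, n2, n3):
    """ASSUMED weighted average of the paired and independent ROC
    values, with weights n1 and the harmonic mean of n2 and n3."""
    nh = 2 * n2 * n3 / (n2 + n3)
    return (n1 * roc_pair + nh * roc_ind) / (n1 + nh)

# Toy example (hypothetical values).
normal = [0.1, 0.2, 0.3, 0.4]
tumor = [0.35, 0.5, 0.6, 0.05]
value = roc_at_fpr(normal, tumor, 0.25)
```

With few normal specimens the attainable false positive probabilities are coarse multiples of 1/n, which is the practical reason a very small t 0 cannot be estimated precisely.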

3.3 Prediction of subtypes
The colorectal cancer dataset contains three clinical variables of interest, namely location, CEA value and stage. Each of these three clinical variables defines a class of subtypes. The primary interest in the subtype analysis is to detect a set of DE genes between subtypes of a given clinical variable and to validate the chosen set of DE genes for the classification. For example, we are interested in identifying a set of genes that discriminate the colon cancer from the rectum cancer. For the stage variable, we are interested in the pairwise comparison, e.g. B versus C and so on. For the subtype analyses, we deleted the 19 ‘normal only’ cases, for which the microarray experiment did not provide any gene information on tumor tissues, leaving 36 + 32 = 68 tumors. We use the split sample approach for the validation by randomly dividing the 68 tumors into a training set of 45 tumors and a test set of 23 tumors.

We employed the two-sample t-statistic (with unequal variances) coupled with FWER, FDR and pFDR (Storey, 2002) to detect a set of DE genes for each class of subtypes. We used SVM to calculate the prediction rate of each subtype based on the top 10 genes in terms of t-statistic. This process of detecting DE genes and calculating the prediction rate was repeated 10 times each for a pair of training and test sets randomly generated from the data, and these 10 prediction rates were averaged.
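
The gene selection part of this split sample procedure can be sketched as follows. The Python code below ranks genes by the absolute two-sample t-statistic with unequal variances and generates a random training/test split; the SVM classification step and the multiple testing adjustments are deliberately left out, and the function names and toy data are hypothetical.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t-statistic with unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))

def top_genes(expr, labels, k=10):
    """Indices of the k genes with the largest |t| between the
    specimens labelled 1 and those labelled 0.
    expr[g][i] is the expression of gene g in specimen i."""
    scores = []
    for g, row in enumerate(expr):
        a = [x for x, y in zip(row, labels) if y == 1]
        b = [x for x, y in zip(row, labels) if y == 0]
        scores.append((abs(welch_t(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def random_split(n_samples, n_train, rng):
    """Random partition of specimen indices into training and test."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return idx[:n_train], idx[n_train:]

# One of the 10 repetitions: 68 tumors into 45 training and 23 test.
train_idx, test_idx = random_split(68, 45, random.Random(0))
```

In each repetition the genes would be selected from the training specimens only, so that the test specimens play no part in choosing the classifier's inputs.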

So far as CEA is concerned we employed a regression-based gene selection to take advantage of its continuous scale and improve the prediction of CEA ≥5. Details of this approach are described in the Appendix section.

3.4 Assessing accuracy of the microarray gene expression against RT–PCR
We first calculated Pearson's correlation between log-transformed RT–PCR, denoted by log(RT–PCR), and the microarray gene expression measurement based on 17 mRNA specimens (9 tumor and 8 normal tissues). Figure 1 shows a typical scatter diagram of log(RT–PCR) and the microarray gene expression measurement: the normal tissue group is separated from the tumor group, and each group has a rather large within-group variation. Pearson's correlation is not meaningful in this situation of ‘having two clouds located far away’, because a high correlation then mainly reflects the separation between the two groups rather than the agreement of the two assays within each group.



Fig. 1 A typical scatter diagram of the microarray gene expression measurement versus log(RT–PCR) for a gene [Gene id: AA634308, Gene name: ATP-binding cassette, sub-family A (ABC1), member 8]. ‘N’ and ‘T’ denote normal and tumor tissues, respectively. Even though the Pearson correlation is high, it may not be the right measure of association between two assays, since it shows the ‘two clouds located far away’ situation. Horizontal and vertical dotted lines correspond to threshold values for determining normal and tumor tissues, respectively, for RT–PCR and microarray based decisions. For the RT–PCR-based decision, the false negative and false positive rates equal 1/8 and 0, respectively. These two error rates are considered as lower bounds for the corresponding error rates of the microarray-based decision. These two error rates are observed to be 0 for the microarray-based decision in this example. The false negative rate of 0 for the microarray-based decision is regarded as a random variation. Hence, differr of Equation (4) equals 0.

 
The aim of the study is to predict whether a new tissue is normal or tumor based on the microarray gene expression data. We now have single gene expression data measured by using both RT–PCR and the microarray based on 17 mRNA specimens whose tumor statuses are known. As the measure of disagreement between these two assays, in which RT–PCR is considered the standard, we propose the error rate that the microarray-based decision incurs in addition to the error rate produced by the decision based on the RT–PCR assay.

For measuring error rate for each assay with respect to each gene, we can calculate the false positive and false negative rates by determining a cut-off value that minimizes the sum of false positive and false negative rates. Let FPR, FNR, FPM and FNM denote the false positive and false negative rates, respectively, for RT–PCR-based and microarray-based decisions. FPR and FNR are considered to be lower bounds of FPM and FNM, respectively. Then we may propose the following measure, referred to as differr, for the measure of disagreement between these two assays.


For the construction of differr, we consider RT–PCR as the standard and assume that the error rate of microarray-based decision is not less than the error rate of the RT–PCR-based decision. We illustrate the calculation of differr in Figure 1 based on an example dataset.
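
The calculation can be sketched in Python as follows. The cut-off search follows the description above (minimizing the sum of the false positive and false negative rates), but the final combination into differr is an assumption, because Equation (4) is not reproduced in this copy: here the excess of each microarray error rate over the corresponding RT–PCR error rate is truncated at zero and the two excesses are added. The function names and the tumor_high switch are hypothetical.

```python
def error_rates(normal, tumor, tumor_high=True):
    """Return (false positive rate, false negative rate) at the
    cut-off that minimises their sum for one assay and one gene.

    tumor_high=True calls a specimen tumor when its value exceeds
    the cut-off; set it to False for genes under-expressed in tumor.
    """
    values = sorted(set(normal) | set(tumor))
    # Candidate cut-offs: below all values, between neighbours, above all.
    cuts = ([values[0] - 1.0]
            + [(a + b) / 2 for a, b in zip(values, values[1:])]
            + [values[-1] + 1.0])
    best = None
    for c in cuts:
        if tumor_high:
            fp = sum(x > c for x in normal) / len(normal)
            fn = sum(x <= c for x in tumor) / len(tumor)
        else:
            fp = sum(x < c for x in normal) / len(normal)
            fn = sum(x >= c for x in tumor) / len(tumor)
        if best is None or fp + fn < best[0] + best[1]:
            best = (fp, fn)
    return best

def differr(fp_r, fn_r, fp_m, fn_m):
    """ASSUMED form: total excess error of the microarray-based
    decision over the RT-PCR-based decision, truncated at zero."""
    return max(fp_m - fp_r, 0.0) + max(fn_m - fn_r, 0.0)
```

Applied to the rates quoted for Figure 1 (false positive 0 and false negative 1/8 for RT–PCR, both 0 for the microarray), this assumed form gives 0, in agreement with the value stated in the figure caption.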


    4 RESULTS
4.1 Three univariate procedures
Three univariate procedures discussed in Section 3.1 reasonably coincide with each other. There were more than 2000 DE genes screened in the t + FWER procedure for which the adjusted P-values were <0.01. We compared the overlap pattern of the top 100 genes selected from each of these three procedures. For the B-statistic the top 100th gene had 22.95 for the log-posterior odds ratio, which might have had value 0 under the null hypothesis. For SAM procedure, we decreased the delta (of SAM software) until we obtained 100 DE genes for which the FDR was observed to be 0.0031 (0.31%). Three sets of top 100 genes detected by three procedures have 72 genes in common as shown in Figure 2. Thus, for comparing the performance of univariate procedures with Hotelling's T 2 statistic we use just one procedure in this study, the t-statistic.



Fig. 2 The overlap pattern of DE genes detected by three procedures: Dudoit et al.'s t-test and maxT procedure; Tusher et al.'s SAM; and Lönnstedt and Speed's B-statistic. The number in the intersection indicates the number of genes jointly detected by two or three procedures.

 
4.2 Hotelling's T 2 statistic
Hotelling's T 2 statistic in Equation (1) indicates that it can pick up some of the genes that are not detected by the univariate t-statistic, such as G 1' and G 2' listed in the bottom table of Figure 3, but have high correlations with genes of very large t-values. Figure 3 shows scatter plots of these two gene pairs in the test set, from which we note that the Hotelling's T 2 statistic clearly separates the normal tissue from the tumor tissue and its performance is better than a univariate t-statistic alone.



Fig. 3 Scatter plots of the top two gene pairs selected from the training set by Hotelling's T 2 statistic are drawn based on the test set. x- and y-axes represent the log-intensity ratios, and ‘N’ and ‘T’ denote normal and tumor tissues, respectively. Details of the top two gene pairs are provided in the bottom table.

 
4.3 Classifying the test set
We considered the top 50 genes selected by the univariate t-statistic and the top 25 pairs chosen by Hotelling's T 2 statistic for the classification of the test set that comprises 19 normal and 32 tumor specimens. We increased the number of genes one by one in the classifiers used, beginning with the top gene. We observed that 0% test error was achieved with a few genes as shown in Table 2. We further noted that 0% test error was achieved using only one gene, which ranked fourth using the univariate t-statistic. However, this gene, denoted by G 1 in the bottom table of Figure 3, belongs to the top ranked gene pair of Hotelling's T 2 statistic. When we compare the gene expression distribution of the top ranked gene and the fourth ranked gene selected by the t-statistic, we note that the fourth gene has better separated gene expression distributions between normal versus tumor than the top ranked gene. This aspect is shown in Figure 4. Furthermore, when we look at the upper diagram in Figure 3, we note that G 1 clearly separates ‘N’ from ‘T’ in the test set without any error. Thus, it is obvious that the 0% test error is not dependent upon the particular subset of the current test set. All these results strongly indicate that Hotelling's T 2 statistic, in particular, and multivariate analysis, in general, are valuable tools for the detection of DE genes in the microarray analysis.


Table 2 The number of genes (gene pairs) which yield 0% test error

 


Fig. 4 Gene expression distributions for normal versus tumor specimens for the top ranked gene and the fourth ranked gene in terms of univariate t-statistic. We note that the fourth ranked gene has better separated the normal specimens from the tumor specimens than the top ranked gene.

 
4.4 ROC approach to the mixed dataset
Let D denote the differential expression between the tumor and the normal tissues, defined as $D = Y - X$. Let $D_0$ denote the hypothetical version of D under the null hypothesis. Once the distributions of D and $D_0$ are determined following the procedure described in the Appendix, one can calculate $\mathrm{ROC}_{\mathrm{pair}}(t_0)$ with a suitably chosen threshold $c_0$, which in turn determines the false positive probability $t_0$ in the baseline distribution. In general, $c_0$ is chosen to make $t_0$ very small. However, as Pepe et al. (2003) indicate, with a small number of tissue specimens $\mathrm{ROC}(t_0)$ cannot be estimated at very small $t_0$; in real applications one therefore needs to compromise, choosing $t_0$ small but large enough to make $\mathrm{ROC}(t_0)$ reasonably precise. Our choice of $t_0$ is 1/36. The top 20 genes in terms of $\mathrm{ROC}_{\mathrm{pair}}(1/36)$ have a 50% overlap with the top 20 genes in terms of the t-statistic. The top gene in terms of the t-statistic ranked ninth in terms of $\mathrm{ROC}_{\mathrm{pair}}(1/36)$.
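As an illustration of the quantity described above, the following is a minimal empirical sketch of ROC(t0) for a single gene. It assumes that samples of D and of the baseline D0 are available as plain arrays and that over-expression in tumor is of interest (negate both arrays for down-regulation). The function name `roc_at` and the purely empirical quantile rule are assumptions made here for illustration; the paper instead derives the two distributions through the procedure in its Appendix.

```python
import numpy as np

def roc_at(d, d0, t0):
    """Empirical ROC(t0): share of d exceeding the threshold c0, where c0 is
    chosen so that at most a fraction t0 of the baseline values d0 exceed it."""
    d = np.asarray(d, dtype=float)
    d0 = np.sort(np.asarray(d0, dtype=float))
    n0 = d0.size
    # Number of baseline values allowed to lie strictly above c0
    # (the small epsilon guards against floating-point error in t0 * n0).
    k = int(np.floor(t0 * n0 + 1e-9))
    c0 = d0[n0 - 1 - k] if k < n0 else -np.inf
    return float(np.mean(d > c0))

if __name__ == "__main__":
    baseline = np.arange(36.0)            # 36 hypothetical baseline values 0..35
    observed = np.array([33.0, 34.0, 35.0, 36.0, 40.0])
    # With t0 = 1/36, exactly one baseline value may exceed c0, so c0 = 34.
    print(roc_at(observed, baseline, 1 / 36))
```

With 36 baseline values, t0 = 1/36 is the smallest non-zero false positive fraction that the empirical rule can resolve, which is in keeping with the compromise on t0 discussed in the text.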

Table 3 lists the top 20 genes in terms of $\mathrm{ROC}_{\mathrm{mix}}(1/36)$ of Equation (3) and their corresponding ranks in terms of the $t_3$-statistic of Equation (2).


Table 3 Top 20 genes in terms of $\mathrm{ROC}_{\mathrm{mix}}(1/36)$ values and their corresponding ranks in terms of the $t_3$-statistic

 
4.5 Validation of the microarray gene expression
We selected six genes based on their ranks in t, $t_3$ and $\mathrm{ROC}_{\mathrm{mix}}$ values, and performed the RT–PCR experiment. The list of genes, their ranks in terms of t, $t_3$ and $\mathrm{ROC}_{\mathrm{mix}}$, the Pearson correlation between log(RT–PCR) and microarray gene expression, and differr of Equation (4) are given in Table 4. The small values of differr in Table 4 suggest that the two assays can be used interchangeably to predict whether a new tissue is normal or tumor.


Table 4 The list of six genes for which the RT–PCR experiment was performed, their ranks in terms of t, $t_3$ and $\mathrm{ROC}_{\mathrm{mix}}$, the Pearson correlation between log(RT–PCR) and the microarray gene expression measurement, and differr of Equation (4) as a measure of disagreement between these two assays

 
The second gene in Table 4, with gene ID AA634308 [ATP-binding cassette, subfamily A (ABC1), member 8], is the one associated with the 0% test error; it is the gene denoted by $G_1$ in Figure 3. Table 4 and Figure 3 show that it ranks consistently high under the four statistics considered in this study, namely t, $t_3$, Hotelling's $T^2$ statistic and $\mathrm{ROC}_{\mathrm{mix}}$. In the earlier analysis of a subset of 58 cases (20 paired, 16 normal-only and 22 tumor-only cases), this same gene ranked second, first and first by the t, $t_3$ and Hotelling's $T^2$ statistics, respectively (Kim et al., 2005). The RT–PCR result for this gene is shown in Figure 1.

4.6 Subtype analyses
The subtype classes were not as well separated as normal versus tumor tissues. We used the two-sample t-statistic (with unequal variances) and employed FWER, FDR and pFDR to control the Type I error arising from multiple tests. We detected nine DE genes between stages B and D with FDR = 0.33, and 36 DE genes for CEA ≥5 versus CEA <5 with FDR = 0.31. We failed to detect DE genes for colon cancer versus rectum cancer and for stages B versus C. These results are shown in Table 5.
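The FDR control mentioned above can be illustrated with the step-up adjustment of Benjamini and Hochberg (1995). The sketch below is a minimal NumPy version that assumes the per-gene P-values from the two-sample t-tests have already been computed. The function name `bh_adjust` is hypothetical, and whether exactly this procedure produced the FDR figures reported in Table 5 is an assumption.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]       # hypothetical raw P-values
    adj = bh_adjust(raw)
    print(adj)
    # Genes declared DE at a chosen FDR level, e.g. 0.03:
    print(np.flatnonzero(adj <= 0.03))
```

A gene is then called DE when its adjusted P-value falls at or below the chosen FDR level, so the number of detected genes depends directly on how permissive that level is.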


Table 5 The number of DE genes detected in each class of subtypes by employing two-sample t-statistic with FWER, FDR and pFDR for the P-value adjustments

 
Using SVM with the top 10 genes in terms of the t-statistic, based on 10 pairs of training and test sets, the prediction rate for each class of subtypes does not exceed 0.72, the rate for stages B versus D. The prediction rates were 0.49, 0.51, 0.66 and 0.64 for colon cancer versus rectum cancer, stages B versus C, stages C versus D and CEA ≥5 versus CEA <5, respectively. For CEA, however, we could improve the prediction rate by 17% in relative terms through the SVM/regression-based gene selection approach described in Section 3: the prediction rates of the regression approach alone and the SVM/regression-based approach were 0.64 and 0.75, respectively.

In the regression-based gene selection for predicting CEA ≥5, the stepwise regression ended with either a four-gene or a five-gene model of Equation (6) in the Appendix for each training set. Thus, we had 10 sets of finally selected genes, each consisting of four or five genes. We counted how often each gene appeared across the 10 sets. Three genes appeared in seven or more sets: AI088704|AI939310 (expressed sequence tags, ESTs), AA911045 (ESTs) and H68509 [alcohol dehydrogenase 6 (class V)].


    5 DISCUSSION
We have developed statistical methods applicable to the mixed dataset of microarray experiments performed on human cancers by extending standard statistical procedures, such as the t-statistic and the ROC approach. The mixed dataset occurs quite often in clinical practice when the tissue material is not large enough to yield an adequate amount of RNA for the DNA microarray experiment.

Diverse cancer-related genes were among the selected genes, including the commonly up-regulated nuclear factor erythroid 2-like 3 (NFE2L3), wingless-type MMTV integration site family (WNT5A), matrix metalloproteinase 7 (MMP 7), matrix metalloproteinase 11 (MMP 11) and ets variant gene 4 (ETV4), and the commonly down-regulated chemokine (C-X-C motif) ligand 12 (CXCL 12), chromogranin A (CgA) and carbonic anhydrase II (CA II). A high level of MMP 11 is associated with human cancer progression (Basset et al., 1997), and MMP 11 promotes tumorigenesis by decreasing cancer cell death through apoptosis and necrosis (Boulay et al., 2001). In addition, MMP 7 is known to be overexpressed in human colorectal carcinomas (Ougolkov et al., 2002), and ETV4 can activate the promoters of various MMPs (Horiuchi et al., 2003). CA II and CA XII are members of the zinc metalloenzyme family, and the loss of CA II expression is known to accompany the progression to malignant transformation (Kivela et al., 2001). CgA is a neuroendocrine secretory gene whose expression is reduced in gastric and colorectal carcinomas (Indinnimeo et al., 2002). This information supports the biological concept that the selected known genes and ESTs are related to colorectal cancer.

There has been criticism of univariate approaches to the detection of DE genes, since these approaches disregard the multidimensional structure of microarray data (Szabo et al., 2003). As an initial attempt at multivariate analysis of microarray data, we considered a random vector consisting of two genes and computed Hotelling's $T^2$ statistic for all possible gene pairs to detect DE genes between normal and tumor tissues. This multivariate approach provides a prediction rate at least as good as those of univariate approaches, including the t-test. It was also more sensitive than the univariate t-statistic in detecting the gene that alone discriminated between normal and tumor tissues with 0% test error.

However, several issues need to be addressed before Hotelling's $T^2$ statistic is formally applied to the analysis of microarray data. The first is calculating the P-value of the observed $T^2$ statistic. The second is extending Hotelling's $T^2$ statistic to a vector of length k ≥ 3. These two issues might be handled with the random search algorithm suggested by Szabo et al. (2003), given reasonable computing hardware.

We reported that, among the top 50 genes selected by the t-statistic, 44% were detected by Hotelling's $T^2$ statistic. This 44% overlap is in line with Szabo et al. (2003), who noted a 41–47% overlap between gene sets detected by a multivariate method and several univariate approaches in an inhomogeneous dataset. The different gene sets detected by the univariate procedures and Hotelling's $T^2$ statistic need to be further validated using Gene Ontology and molecular pathway findings.

In contrast to the overwhelming number of DE genes detected between normal and tumor tissues, we failed to detect a significant number of DE genes in the subtype analyses. We detected nine DE genes between stages B and D and another 36 genes between CEA ≥5 and CEA <5, as shown in Table 5. However, the small sample size of stage D, the high FDR values and the low prediction rates all indicate that the current sample size, 68, is not large enough to detect the multiple mechanisms that underlie each class of subtypes.

The 87 colorectal cancer patients comprise three subgroups of sizes 36, 19 and 32. We did not find any statistical evidence against independence of these three subgroups with respect to age, gender, stage and CEA ≥5 versus CEA <5. Only for location (colon versus rectum) did we obtain an unadjusted P-value of 0.017, which yields a Bonferroni-adjusted P-value of 0.085 over the five variables tested (0.017 × 5). At this point, we attach no clinical meaning to this marginal association between location and the three subgroups.

The correlation between genes plays a role in the selection of DE genes by Hotelling's $T^2$ statistic. As discussed earlier, we observed large values of Hotelling's $T^2$ statistic mostly when either $|t_1|$ or $|t_2|$ was large, $\rho t_1 t_2 < 0$, and $|\rho t_1 t_2|$ was large in Equation (1). Therefore, one might expect that about half of the genes in the top 25 pairs of Hotelling's $T^2$ statistic could be selected by the univariate t-statistic alone and the other half could not. This may raise the issue of an inherent inconsistency between the two procedures. However, our data show that the top gene pair selected by Hotelling's $T^2$ statistic contains the gene with the smallest test error.
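The role of the correlation can be made concrete with a small numerical sketch. For two independent groups and a pooled covariance matrix, the bivariate Hotelling's T^2 can be rewritten as (t1^2 + t2^2 - 2 rho t1 t2) / (1 - rho^2), where t1 and t2 are the pooled-variance t-statistics of the two genes and rho is their pooled within-group correlation, so a negative product rho t1 t2 inflates the statistic. The code below checks this identity. It assumes the standard independent two-sample form, which may differ from the paper's Equation (1) for the mixed dataset, and the function name `hotelling_t2_pair` is hypothetical.

```python
import numpy as np

def hotelling_t2_pair(x, y):
    """Two-sample Hotelling's T^2 for one gene pair, computed two ways.

    x, y: arrays of shape (n1, 2) and (n2, 2), one row per specimen.
    Returns (T^2 from the matrix form, T^2 from t1, t2 and rho).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    d = x.mean(axis=0) - y.mean(axis=0)
    # Pooled within-group covariance matrix (2 x 2).
    s = ((n1 - 1) * np.cov(x, rowvar=False)
         + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    scale = n1 * n2 / (n1 + n2)
    t2_matrix = scale * (d @ np.linalg.solve(s, d))
    # Univariate pooled-variance t-statistics and pooled correlation.
    t1, t2 = d * np.sqrt(scale / np.diag(s))
    rho = s[0, 1] / np.sqrt(s[0, 0] * s[1, 1])
    t2_from_t = (t1**2 + t2**2 - 2 * rho * t1 * t2) / (1 - rho**2)
    return float(t2_matrix), float(t2_from_t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal = rng.normal(size=(19, 2))          # simulated, not the study data
    tumor = rng.normal(size=(32, 2)) + 1.0
    print(hotelling_t2_pair(normal, tumor))    # the two values agree
```

Because the cross term enters with a minus sign, a pair whose genes are individually unremarkable can still score highly when rho t1 t2 is strongly negative, which is consistent with the partial overlap between the two rankings noted above.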

We noted in this study that different gene sets provided more or less the same prediction rate. This may be viewed as an instance of model multiplicity, and integrating these different models is a topic for future research.


   
6 APPENDIX
6.1 Null distribution of $D = Y - X$
Let $D = Y - X$, and let $D_0$ denote the hypothetical version of D under the null hypothesis of no differential expression. Let $\bar{D}$ denote the sample mean of the D values based on $n_1$ observations. The distribution of D has mean $\delta_a$ and a variance denoted by [...]. The distribution of $D_0$ is represented by [...]. We augment the D notation by adding a superscript ‘(i)’ to represent the i-th gene. Hence $D^{(i)}$ denotes D for the i-th gene, and similarly for [...]. We omit this superscript when the argument applies to each gene. Let [...] denote the order statistics of [...], where p is the number of genes spotted on a cDNA microarray. We assume, for simplicity, that $D_0$ has the same distributional form as D except for the mean and variance. We expect that $\delta_0 \le \delta_a$, and we do not necessarily assume that $\delta_0 = 0$, to allow a small baseline value that may vary depending on the experimental condition. We further assume that [...]. There are several ways of estimating the distribution of D using the matched pair sample data. We choose a set of genes, denoted by [...], for a small $\sigma > 0$. A suitable choice of $\sigma$ can be determined from the plot of [...]. We concluded from the plot (data not shown) that the first 100 order statistics provide information on the non-DE genes. We observed that beyond the 100th smallest value, the variance tended to increase slowly. Based on these 100 genes, we could estimate the mean and the variance of the null distribution.

6.2 Regression-based gene selection procedure for predicting CEA ≥5
Define $\mathrm{CEA}_j$ to be the CEA level for the j-th patient and let $M_{ij}$ denote the expression level of the i-th gene for the j-th patient, for j = 1, ..., 68 and i = 1, ..., p (p = 12850). The procedure is as follows.

  1. For each i, we estimate the regression equation from the training set


    where $\mathrm{Stage}_j$ represents the stage of the j-th patient, and I{A} is an indicator function that equals 1 if A is true and 0 otherwise. After applying Box–Cox transformations to a selected subset of genes, we found the inverse square-root transformation of $\mathrm{CEA}_j$ to be the most appropriate.

  2. We found 27 significant genes with pFDR <0.31.
  3. A stepwise regression on the 27 genes from step 2 resulted in a five-gene model with $R^2 = 0.71$.


    The ranks of these five genes in terms of the t-statistic are 58, 63, 65, 1813 and 5029.

  4. Prediction rates of {CEA ≥5} were calculated using SVM and the regression Equation (6).
  5. We further randomly divided the 68 tumors into a training set of 45 and a test set of 23. For this new pair of training and test sets, we repeated steps 1–4 with the same pFDR level and obtained the regression Equation (6), possibly with a different number of genes. Finally, we predicted the CEA levels for the test set using SVM and the estimated regression equation.

We repeated Steps 1–5 10 times and averaged the prediction rates.
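The repeated random splitting in the steps above can be sketched as follows. This is only an outline of the resampling loop under stated assumptions: a nearest-centroid rule stands in for the SVM, the gene selection and regression of steps 1–3 are omitted, and the names `nearest_centroid_predict` and `repeated_holdout_rate` are hypothetical. In a faithful implementation, the gene selection would be rerun inside every training set before prediction, as the procedure describes.

```python
import numpy as np

def nearest_centroid_predict(x_train, y_train, x_test):
    """Stand-in classifier (not the paper's SVM): assign each test row
    to the class whose training mean is closest in Euclidean distance."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((x_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def repeated_holdout_rate(x, y, n_train=45, n_repeats=10, seed=0,
                          predict=nearest_centroid_predict):
    """Average prediction rate over repeated random training/test splits."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(y))
        train, test = perm[:n_train], perm[n_train:]
        # Any gene selection step belongs here, fitted on x[train] only.
        y_hat = predict(x[train], y[train], x[test])
        rates.append(np.mean(y_hat == y[test]))
    return float(np.mean(rates))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 68 simulated tumors with 3 features; labels mimic CEA >= 5 versus CEA < 5.
    labels = np.repeat([0, 1], 34)
    features = rng.normal(scale=0.1, size=(68, 3)) + 10.0 * labels[:, None]
    print(repeated_holdout_rate(features, labels))
```

Keeping the selection step inside the loop matters because selecting genes on all 68 tumors first would let test-set information leak into the reported prediction rate.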


    Acknowledgments
 
B.S.K. was supported by a grant from the Korea Health 21 R&D Project, Ministry of Health & Welfare, Republic of Korea (02-PJ1-PG3-10411-00-03). S.H.L. was supported by a grant R04-203-000-10145-0 from the Basic Research Program of the Korea Science and Engineering Foundation. H.C.C. and S.Y.R. were supported by the Korea Science and Engineering Foundation (KOSEF) through the Cancer Metastasis Research Center (CMRC) at Yonsei University College of Medicine.


    Footnotes
 
1 Szabo et al. (2003) reported that for the colon cancer cell line data, SAM procedure and the random search algorithm detected two sets of genes that overlapped more than 90%.

2 We use ‘sample’ to denote a random sample in statistics to distinguish it from a biological specimen.

Received on June 29, 2004; revised on August 28, 2004; accepted on September 12, 2004

    REFERENCES

    Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

    Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 6745–6750.

    Basset, P., Bellocq, J.P., Lefebvre, O., Noel, A., Chenard, M.P., Wolf, C., Anglard, P., Rio, M.C. (1997) Stromelysin-3: a paradigm for stroma-derived factors implicated in carcinoma progression. Crit. Rev. Oncol. Hematol., 26, 43–53.

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289–300.

    Boulay, A., Masson, R., Chenard, M.P., El Fahime, M., Cassard, L., Bellocq, J.P., Sautes-Fridman, C., Basset, P., Rio, M.C. (2001) High cancer cell death in syngeneic tumors developed in host mice deficient for the stromelysin-3 matrix metalloproteinase. Cancer Res., 61, 2189–2193.

    Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.

    Dudoit, S., Fridlyand, J., Speed, T.P. (2002a) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statist. Assoc., 97, 77–87.

    Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2002b) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sinica, 12, 111–139.

    Ge, Y., Dudoit, S., Speed, T.P. (2003) Resampling-based multiple testing for microarray data analysis. Test, 12, 1–44.

    Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O-P., et al. (2001) Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med., 344, 539–548.

    Horiuchi, S., Yamamoto, H., Min, Y., Adachi, Y., Itoh, F., Imai, K. (2003) Association of ets-related transcriptional factor E1AF expression with tumour progression and overexpression of MMP-1 and matrilysin in human colorectal cancer. J. Pathol., 200, 568–576.

    Indinnimeo, M., Cicchini, C., Memeo, L., Stazi, A., Provenza, C., Ricci, F., Mingazzini, P.L. (2002) Correlation between chromogranin-A expression and pathological variables in human colon carcinoma. Anticancer Res., 22, 395–398.

    Kim, B.S., Lee, S., Kim, I., Kim, S., Rha, S.Y., Chung, H.C. (2005) Statistical issues in search for bio markers of colorectal cancer using microarray experiments. In Edler, L. and Kitsos, C. (eds), Quantitative Methods for Cancer and Human Health Risk Assessment. Wiley, Chichester (in press).

    Kivela, A.J., Saarnio, J., Karttunen, T.J., Kivela, J., Parkkila, A.K., Pastorekova, S., Pastorek, J., Waheed, A., Sly, W.S., Parkkila, T.S., Rajaniemi, H. (2001) Differential expression of cytoplasmic carbonic anhydrases, CA I and II, and membrane-associated isozymes, CA IX and XII, in normal mucosa of large intestine and in colorectal tumors. Dig. Dis. Sci., 46, 2179–2186.

    Li, L., Weinberg, C.R., Darden, T.A., Pedersen, L.G. (2001) Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 1131–1142.

    Lönnstedt, I. and Speed, T.P. (2002) Replicated microarray data. Stat. Sinica, 12, 31–46.

    Mood, A.M., Graybill, F.A., Boes, D.C. (1974) Introduction to the Theory of Statistics, 3rd edn. McGraw-Hill, NY.

    Ougolkov, A.V., Yamashita, K., Mai, M., Minamoto, T. (2002) Oncogenic beta-catenin and MMP-7 (matrilysin) cosegregate in late-stage clinical colon cancer. Gastroenterology, 122, 60–71.

    Park, C.H., Jeong, H.J., Jung, J.J., Lee, G.Y., Kim, S.C., Kim, T.S., Yang, S.H., Chung, H.C., Rha, S.Y. (2004) Fabrication of high quality cDNA microarray using a small amount of cDNA. Int. J. Mol. Med., 13, 675–679.

    Pepe, M.S., Etzioni, R., Feng, Z., Potter, J.D., Thompson, M.L., Thornquist, M., Winget, M., Yasui, Y. (2001) Phases of biomarker development for early detection of cancer. J. Natl Cancer Inst., 93, 1054–1061.

    Pepe, M.S., Longton, G., Anderson, G.L., Schummer, M. (2003) Selecting differentially expressed genes from microarray experiments. Biometrics, 59, 133–142.

    Rosenwald, A., Wright, G., Chan, W.C., Connors, J.M., Campo, E., Fisher, R.I., Gascoyne, R.D., Konrad Muller-Hermelink, H., Smeland, E.B., Staudt, L.M. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med., 346, 1937–1947.

    Simon, R.M. and Dobbin, K. (2003) Experimental design of DNA microarray experiments. Biotechniques, 34, 16–21.

    Storey, J.D. (2002) A direct approach to false discovery rate. J. R. Statist. Soc. B, 64, 479–498.

    Szabo, A., Boucher, K., Jones, D., Tsodikov, A.D., Klebanov, L.B., Yakovlev, A.Y. (2003) Multivariate exploratory tools for microarray data analysis. Biostatistics, 4, 555–567.

    Tusher, V., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proc. Natl Acad. Sci. USA, 98, 5116–5121.

    van de Vijver, M.J., He, Y.D., van't Veer, L.J., Dai, H., Hart, A.A.M., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999–2009.

    Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., Speed, T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30, e15.

