

Bioinformatics Advance Access originally published online on February 1, 2006
Bioinformatics 2006 22(8):950-958; doi:10.1093/bioinformatics/btl029

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Structured polychotomous machine diagnosis of multiple cancer types using gene expression

Ja-Yong Koo 1,*, Insuk Sohn 1, Sujong Kim 2,3 and Jae Won Lee 1

1 Department of Statistics, Korea University Seoul 136-701, Korea
2 Department of Biochemistry, College of Medicine, Hanyang University Seoul 133-791, Korea
3 Current address: Skin Research Institute, AmorePacific R&D Center Yongin 449-729, Korea

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: The problem of class prediction has received a tremendous amount of attention in the literature recently. In the context of DNA microarrays, where the task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile, a problem of particular importance is the diagnosis of cancer type based on microarray data. One method of classification which has been very successful in cancer diagnosis is the support vector machine (SVM). SVM has been shown (through simulations) to be superior to other methods, such as classical discriminant analysis. However, SVM suffers from the drawback that its solution is implicit and therefore difficult to interpret. To remedy this difficulty, an analysis of variance decomposition using structured kernels is proposed and is referred to as the structured polychotomous machine. This technique uses the Newton–Raphson method to estimate the coefficients, followed by the Rao and Wald tests for the addition and deletion of import vectors, respectively.

Results: The proposed method is applied to microarray data and simulated data. The main advantage of our method is its efficiency: only a small number of genes that accurately predict the classes are selected. It has been verified that the selected genes serve as legitimate markers for cancer classification from a biological point of view.

Availability: All source codes used are available on request from the authors.

Contact: jykoo@korea.ac.kr


    1 INTRODUCTION
 
DNA microarray analysis is a new biotechnological breakthrough, which allows the simultaneous monitoring of thousands of gene expressions in cells (Brown and Botstein, 1999) and has far reaching applications in pharmaceutical and clinical research. By comparing gene expression in normal and tumor tissues, for example, microarrays can be used to identify tumor-related genes and targets for therapeutic drugs (Alizadeh et al., 2000).

Classification of different phenotypes, predominantly cancer types, using microarray gene expression data is abundant in the literature; see e.g. Golub et al. (1999), Alizadeh et al. (2000), Furey et al. (2000) and Dudoit et al. (2002). The methods used in these studies range from classical discriminant analysis to flexible tools from machine learning such as bagging, boosting and support vector machines (SVM). SVM has been successfully applied to the classification of gene expression data; Furey et al. (2000), Ramaswamy et al. (2001) and Guyon et al. (2002) used SVM, and Lee and Lee (2003) adopted multicategory SVM (Lee et al., 2004) for this purpose. Recently, Lee et al. (2005) compared the performance of various classification methods, and provided guidelines for the most appropriate classification tool in various contexts.

SVM is growing in popularity as a classification tool, with many successful and diverse applications (Vapnik, 1998; Schölkopf et al., 2002). Recently, Zhu and Hastie (2001) proposed replacing the hinge loss by the multinomial likelihood, coining the resulting method the import vector machine (IVM). It can be shown that IVM has advantages over SVM in the following aspects:

  1. IVM can naturally handle the polychotomous classification problem;
  2. IVM can provide estimates of the posterior probabilities and
  3. the computational cost of the IVM is typically cheaper than that of SVM (Zhu and Hastie, 2001).

Ordinary SVM suffers from the drawback that the solution is implicit and therefore difficult to interpret. One method for improving the interpretability of SVM is SVM-RFE, introduced by Guyon et al. (2002). In this paper, we propose the structured polychotomous machine (SPM), which extends IVM through a functional analysis of variance (ANOVA) decomposition using structured kernels. We use the Newton–Raphson method to find the coefficient estimates, followed by the Rao and Wald tests for the addition and deletion of import vectors, respectively (Zhu and Hastie, 2001). A computational improvement over IVM is the use of the Rao statistic, which can be calculated faster than the one-step Newton–Raphson method of Zhu and Hastie (2001). The Wald statistic for the stepwise deletion step can be computed with an insignificant amount of extra computation, because the deletion algorithm is much less computationally intensive than the addition algorithm.

A summary of this paper is as follows. Section 2 describes the SPM and the stepwise algorithm for the selection of the import vectors. Section 3 illustrates the performance of the proposed algorithm using real microarray data and simulated data as examples. Concluding remarks and discussions are given in Section 4.


    2 SYSTEMS AND METHODS
 
2.1 Structured polychotomous machines
In this section, we propose the SPM based on a functional ANOVA decomposition using structured kernels.

Consider the training data $L = \{(x_n, y_n) : n = 1, \dots, N\}$, where the input $x_n$ belongs to some domain $X \subset \mathbb{R}^d$ and the label $y_n \in \mathcal{M} = \{0, 1, \dots, M\}$. The set $X$ is the input space from which the inputs $x_n$ (cases, instances, patterns) are taken, and the $y_n$ are called the labels, targets or outputs. Let $x = (x_1, \dots, x_d)$. A polychotomous learning algorithm uses $L$ to construct a function $C(x)$ that assigns a new input $x$ to a class $C(x) \in \mathcal{M}$.

SVM is defined by a positive definite kernel $K$. Popular kernels are polynomial kernels of the form $K(x, x') = (1 + \langle x, x' \rangle)^q$ and Gaussian radial basis functions of the form $K(x, x') = \exp(-\|x - x'\|^2/(2\sigma^2))$. If an SVM is defined by a multivariate kernel such as the polynomial kernel or the Gaussian kernel, then it may be difficult to interpret the effect of each input variable.
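The two kernels named above can be written down directly. The following is a minimal sketch in plain Python; the function names and the default values of `q` and `sigma` are illustrative choices, not taken from the paper.

```python
import math

def polynomial_kernel(x, y, q=2):
    """Polynomial kernel K(x, x') = (1 + <x, x'>)^q."""
    inner = sum(a * b for a, b in zip(x, y))
    return (1.0 + inner) ** q

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian radial basis kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Both kernels mix all coordinates of the input into a single number, which is why the contribution of an individual input variable is hard to read off.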

The ANOVA decomposition of a function h has the form

$$h(x) = b + \sum_{j} h_j(x_j) + \sum_{j < k} h_{jk}(x_j, x_k) + \cdots$$
where $b$ is a constant, the $h_j$ are the main effects, the $h_{jk}$ are the two-factor interactions, and so on. For example, additive models have the form $h(x) = b + h_1(x_1) + \cdots + h_d(x_d)$.

Although we consider only the additive SPM, because the main focus is on feature (gene) selection, the SPM methodology can also handle two-factor interactions. For $1 \le j \le d$, let $n_j$ be a non-negative integer and

Formula
be the set of import vectors for the $j$-th coordinate. Let Formula and Formula. We refer to $S$ as the set of import vectors, following Zhu and Hastie (2001). Suppose the kernel functions for all coordinates are equal, i.e. $K_j = K$ for $1 \le j \le d$. Let $J = 1 + n_1 + \cdots + n_d$ and denote by $B_1, \dots, B_J$ the basis functions consisting of the constant basis function 1 and the univariate functions Formula where Formula. Consider

$$\eta(m \mid x; \beta_m) = \sum_{p=1}^{J} \beta_{mp} B_p(x) \qquad (1)$$
where $\beta_m = (\beta_{mp})$ with $1 \le p \le J$ and $1 \le m \le M$, and $\eta(0 \mid x; \beta) \equiv 0$. Note that the function $\eta(m \mid x; \beta_m)$ is an additive model. Let $\beta$ denote the $MJ$-dimensional vector whose entries are those of $\beta_1, \dots, \beta_M$. The SPM model has the form

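$$P(Y = m \mid x; \beta) = \frac{\exp \eta(m \mid x; \beta)}{\sum_{k=0}^{M} \exp \eta(k \mid x; \beta)}, \quad 0 \le m \le M \qquad (2)$$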
Given $\{B_j\}$, the SPM model (2) is a polychotomous logistic regression model (Hastie et al., 2002). Suppose Formula for some $1 \le u, v \le N$. Then, only $x_1$ and $x_2$ are relevant inputs for classification.
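As an illustration of how an additive model of this kind turns an input into class probabilities, the sketch below assumes the standard polychotomous logistic form with class 0 as the baseline (consistent with $\eta(0 \mid x; \beta) \equiv 0$) and a univariate Gaussian kernel for every coordinate. The function names, the argument layout and the default `sigma` are hypothetical.

```python
import math

def gaussian_kernel_1d(u, v, sigma=1.0):
    """Univariate Gaussian kernel applied to a single coordinate."""
    return math.exp(-(u - v) ** 2 / (2.0 * sigma ** 2))

def basis_vector(x, import_points, sigma=1.0):
    """Constant 1 followed by K(x_j, s) for every coordinate j and every
    import point s in import_points[j], so the length is J = 1 + n_1 + ... + n_d."""
    feats = [1.0]
    for j, points in enumerate(import_points):
        feats.extend(gaussian_kernel_1d(x[j], s, sigma) for s in points)
    return feats

def class_probabilities(x, beta, import_points, sigma=1.0):
    """beta holds M coefficient vectors of length J (classes 1..M);
    class 0 is the baseline with eta = 0."""
    b = basis_vector(x, import_points, sigma)
    etas = [0.0] + [sum(c * f for c, f in zip(beta_m, b)) for beta_m in beta]
    top = max(etas)                      # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]
```

A coordinate whose list of import points is empty contributes no basis function, so the corresponding input variable drops out of the classifier; this is how the additive structure yields gene selection.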

Define the multinomial log-likelihood based on $L$ and the additive SPM by

Formula (3)
where the additive regularization matrix corresponding to Sj is given by

Formula
It can be noted from (3) that the constant basis function is not regularized. The (regularized) maximum likelihood estimator (MLE) Formula is defined as the maximizer of $\ell_\lambda(\beta)$. The SPM classifies a new input $x$ into the class Formula which is defined by

Formula

In order to find Formula, we use the Newton–Raphson method. Let $S_\lambda(\beta) = \partial \ell_\lambda(\beta)/\partial \beta$ denote the score at $\beta$, and let $I_\lambda(\beta)$ denote the information matrix $-\partial^2 \ell_\lambda(\beta)/\partial \beta \, \partial \beta^{\top}$. The maximum likelihood estimate Formula satisfies the likelihood equations $S_\lambda$ Formula. The Newton–Raphson method for computing Formula is to iteratively determine $\beta^{(m+1)}$ from $\beta^{(m)}$ according to the formula

$$\beta^{(m+1)} = \beta^{(m)} + I_\lambda(\beta^{(m)})^{-1} S_\lambda(\beta^{(m)}).$$
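The Newton–Raphson iteration can be sketched for a small polychotomous logistic model as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the penalty is a plain ridge term $(\lambda/2)\sum \beta_{mp}^2$ that leaves the constant basis function unregularized, standing in for the kernel-based regularization matrix of the paper, and all function names are hypothetical. `B` is the list of basis vectors $(B_1(x_n), \dots, B_J(x_n))$ and `y` holds labels in $\{0, \dots, M\}$.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def class_probs(beta, b_row, M):
    """Probabilities of classes 0..M, with eta(0|x) fixed at 0."""
    J = len(b_row)
    etas = [0.0] + [sum(beta[m * J + p] * b_row[p] for p in range(J)) for m in range(M)]
    top = max(etas)
    w = [math.exp(e - top) for e in etas]
    total = sum(w)
    return [v / total for v in w]

def score_and_information(beta, B, y, M, lam):
    """Score and information matrix of the ridge-penalized multinomial log-likelihood."""
    J = len(B[0])
    score = [0.0] * (M * J)
    info = [[0.0] * (M * J) for _ in range(M * J)]
    for b_row, label in zip(B, y):
        p = class_probs(beta, b_row, M)
        for m in range(M):
            resid = (1.0 if label == m + 1 else 0.0) - p[m + 1]
            for a in range(J):
                score[m * J + a] += resid * b_row[a]
                for k in range(M):
                    w = p[m + 1] * ((1.0 if k == m else 0.0) - p[k + 1])
                    for c in range(J):
                        info[m * J + a][k * J + c] += w * b_row[a] * b_row[c]
    # Assumed penalty: (lam / 2) * sum of squared coefficients, constant term excluded.
    for m in range(M):
        for a in range(1, J):
            score[m * J + a] -= lam * beta[m * J + a]
            info[m * J + a][m * J + a] += lam
    return score, info

def newton_raphson(B, y, M, lam=1.0, tol=1e-8, max_iter=50):
    """Iterate beta <- beta + I(beta)^{-1} S(beta) until the step is negligible."""
    beta = [0.0] * (M * len(B[0]))
    for _ in range(max_iter):
        score, info = score_and_information(beta, B, y, M, lam)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta
```

At convergence the score vanishes, which is the likelihood equation stated above. The sketch takes full Newton steps without step-halving, which is adequate for small, well-conditioned problems.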

2.2 The Rao statistic and the Wald statistic
In this section, we explain the Rao (score) statistic and the Wald statistic (Rao, 1973), which will be used in choosing import vectors. The dependency of Formula on $\lambda$ is suppressed for notational convenience.

2.2.1 Rao statistic
Given $B_1, \dots, B_J$, let Formula and Formula be the MLE and the information matrix, respectively. Here, Formula denotes the estimated coefficient corresponding to the $j$-th class and the $k$-th basis function $B_k$. Let $B_{J+1}$ be a candidate basis function for addition. Define Formula by

Formula
Then the Rao statistic is defined by

Formula (4)
with $S_\lambda(\cdot)$ and $I_\lambda$(Formula) corresponding to $B_1, \dots, B_J, B_{J+1}$. The Rao statistic can be computed quickly because only one basis function is added at a time and the enlarged model does not have to be refitted.
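For orientation, the textbook form of a Rao (score) statistic for adding one basis function is sketched below. It is assumed, not confirmed, to agree with the display omitted above; the symbols $\tilde{\beta}$ and $R_{J+1}$ are notation introduced here. The vector $\tilde{\beta}$ is the current MLE padded with zeros for the $M$ coefficients of the candidate $B_{J+1}$.

```latex
\tilde{\beta} = \bigl(\hat{\beta}_{m1}, \dots, \hat{\beta}_{mJ}, 0\bigr)_{1 \le m \le M},
\qquad
R_{J+1} = S_\lambda(\tilde{\beta})^{\top} \, I_\lambda(\tilde{\beta})^{-1} \, S_\lambda(\tilde{\beta}).
```

Because the score components belonging to $B_1, \dots, B_J$ vanish at the current MLE, only the $M$ components belonging to $B_{J+1}$ contribute, so the statistic reduces to a quadratic form in an $M$-vector.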

2.2.2 Wald statistic
Given $B_1, \dots, B_J$, let Formula and Formula be the MLE and the information matrix, respectively. Let $B_J$ be the candidate basis function for deletion and let Formula denote the $M \times 1$ vector of elements Formula, $m = 1, \dots, M$. The Wald statistic is defined by

Formula (5)
where Formula is the $M \times M$ submatrix of Formula whose rows and columns correspond to these $M$ coefficients. The Wald statistic can be used as a measure of the importance of the covariates that are selected in the final SPM.
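The usual form of a Wald statistic for testing whether the $M$ coefficients of a basis function $B_J$ are zero is sketched below. This is the standard definition and is assumed to match the omitted display; the symbols $\hat{\beta}_{\cdot J}$ and $W_J$ are notation introduced here.

```latex
\hat{\beta}_{\cdot J} = \bigl(\hat{\beta}_{1J}, \dots, \hat{\beta}_{MJ}\bigr)^{\top},
\qquad
W_J = \hat{\beta}_{\cdot J}^{\top}
      \Bigl( \bigl[ I_\lambda(\hat{\beta})^{-1} \bigr]_{JJ} \Bigr)^{-1}
      \hat{\beta}_{\cdot J},
```

where $[\,\cdot\,]_{JJ}$ denotes the $M \times M$ submatrix whose rows and columns correspond to the coefficients of $B_J$. A small value of $W_J$ marks $B_J$ as a candidate for deletion, and no refitting is needed to compute it.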

2.3 Choice of import vectors
Finding an optimal set Formula is not an easy problem since an iterative algorithm such as the Newton–Raphson algorithm is necessary to compute Formula and the size of Formula is not known beforehand. One may use a sparse greedy algorithm to find near optimal solutions among the subsets of P as in Zhu and Hastie (2001). Regarding the functions Formula as a set of basis functions, one can extend the stepwise basis selection method of Kooperberg et al. (1997).

We propose the following ANOVA Sparse Stepwise Algorithm (ASSA) for the choice of the import vectors for SPM. In ASSA, Formula and Formula denote the Rao (4) and Wald (5) statistics corresponding to the basis functions Formula and Formula, respectively.

In order to find the final model, one needs a model selection criterion. During the combination of stepwise addition and deletion, we get a sequence of models indexed by $\nu$, with the $\nu$-th model having $M J_\nu$ parameters and MLE Formula. We select the model that minimizes

Formula
which is similar to the BIC criterion of Schwarz (1978). One may also use the test error rate as the model selection criterion for the final SPM model. The results of this paper have been obtained using BIC.
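Selecting among the sequence of stepwise models can be illustrated with the classical BIC of Schwarz (1978). The exact criterion used in the paper is not reproduced in the text, so the form $-2\ell + (\log N) \cdot M J_\nu$ below is an assumption, and the function names are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Classical BIC: -2 * log-likelihood + log(N) * number of parameters."""
    return -2.0 * log_likelihood + math.log(n_samples) * n_params

def select_model(candidates, n_samples):
    """candidates is a list of (log_likelihood, n_params) pairs, one per model
    in the stepwise sequence; the index of the minimizer of BIC is returned."""
    scores = [bic(ll, k, n_samples) for ll, k in candidates]
    return min(range(len(scores)), key=scores.__getitem__)
```

For example, with 63 training samples and candidate models `[(-40.0, 3), (-30.0, 6), (-29.5, 12)]`, the second model is selected: the third model fits only marginally better and pays a much larger penalty for its extra parameters.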

Formula


    3 RESULTS
 
This section presents the selection performance of SPM using real microarray data and simulated data. In order to show the gene selection performance, we adopt the additive SPM defined by (1) and (2) with a Gaussian kernel. Consider the problem of choosing $\sigma$, the tuning parameter of the Gaussian kernel. We observed that the performance of SPM is not very sensitive to the choice of $\sigma$. In order to make SPM data-dependent, we adopt cross-validation for the choice of $\sigma$. Given $\sigma$, let SPM$_\sigma$ denote the SPM fitted to a training dataset according to the ASSA algorithm using BIC as the model selection criterion. The tuning parameter $\sigma$ was chosen by a search over $2^0, \dots, 2^5$ using 5-fold cross-validation. Given a training dataset, the SPM fit is defined by Formula, where Formula is the minimizer of the cross-validated prediction error.
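The cross-validated choice of the kernel width can be sketched generically. The code below assumes integer exponents for the grid (1, 2, 4, 8, 16, 32) and takes the classifier as a user-supplied callable `fit_predict(X_train, y_train, X_test, sigma)`; that callable, the fold construction and all names are hypothetical placeholders, not the authors' code.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(X, y, fit_predict, sigma, k=5, seed=0):
    """k-fold cross-validated misclassification rate for one value of sigma."""
    errors = 0
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out], sigma)
        errors += sum(p != y[i] for p, i in zip(preds, held_out))
    return errors / len(y)

def choose_sigma(X, y, fit_predict, sigmas=tuple(2.0 ** e for e in range(6)), k=5):
    """Return the sigma in the grid with the smallest cross-validated error."""
    return min(sigmas, key=lambda s: cv_error(X, y, fit_predict, s, k))
```

The same folds are used for every value of sigma, so the comparison between grid points is not affected by fold-to-fold variation.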

3.1 Simulation
We carried out a simulation study to evaluate SPM. We simulated an artificial dataset assuming normal distributions of log expression levels, as in Broberg (2002). The means and standard deviations for the simulated data are shown in Table 1. In Table 1, the first three rows represent the insignificant genes (null case), whereas the last three rows represent the significantly differentially expressed genes (significant case). Means and standard deviations were chosen randomly among the three rows in both the null and the significant cases. The sample sizes were 47 and 25 arrays, the same as those of the real leukemia dataset. The simulated data contained 1% significantly differentially expressed genes out of 1000 genes.
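The simulation design can be sketched as follows. The (mean, standard deviation) settings in `NULL_ROWS` and `SIGNIF_ROWS` are hypothetical placeholders, because the values of Table 1 are not reproduced in the text; the assignment of 47 and 25 arrays to two classes and the function name are likewise assumptions.

```python
import random

# Hypothetical placeholder settings, NOT the values of Table 1.
# Null genes: the same (mean, sd) in both classes.
NULL_ROWS = [(0.0, 0.5), (0.0, 1.0), (0.0, 2.0)]
# Significant genes: ((mean, sd) in class 0, (mean, sd) in class 1).
SIGNIF_ROWS = [((0.0, 0.5), (1.0, 0.5)),
               ((0.0, 1.0), (2.0, 1.0)),
               ((0.0, 2.0), (4.0, 2.0))]

def simulate(n_genes=1000, n_signif=10, n_a=47, n_b=25, seed=0):
    """Return (X, labels, true_genes) with X as a samples-by-genes list of lists."""
    rng = random.Random(seed)
    signif = set(rng.sample(range(n_genes), n_signif))
    labels = [0] * n_a + [1] * n_b
    columns = []
    for g in range(n_genes):
        if g in signif:
            params = rng.choice(SIGNIF_ROWS)   # one row chosen at random
        else:
            row = rng.choice(NULL_ROWS)
            params = (row, row)                # identical distribution in both classes
        columns.append([rng.gauss(*params[lab]) for lab in labels])
    X = [list(sample) for sample in zip(*columns)]
    return X, labels, sorted(signif)
```

With the defaults, 10 of the 1000 genes (1%) are differentially expressed, and `true_genes` records which ones, so the recovery rate of a selection method can be computed.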


 
Table 1 The means and standard deviations for the simulation data

 
A simulated dataset was randomly divided into a training sample (67%) and a test sample (33%), where the training sample was used to fit Formula and the test error rates were computed on the test sample. This procedure was repeated 50 times. We compared the average number of genes selected, the average number of true genes selected and the misclassification rate of our method with those of two other well-known methods, PAM (Tibshirani et al., 2002) and SVM-RFE (Guyon et al., 2002). For PAM, we chose the amount of shrinkage that minimizes the test error rate under the restriction that the number of selected genes is <100; for SVM-RFE, the optimal gene set was chosen by minimizing the leave-one-out cross-validation error rate.

Table 2 shows the average number of genes selected, along with the average number of true genes selected from the simulated data. As shown in Table 2, SPM selected a smaller average number of genes with a higher recovery rate. We also display boxplots of the misclassification rates on the simulated data (Fig. 1); SPM gave a lower misclassification rate.


 
Table 2 The average number of genes selected and the average number of true genes selected from the simulated data

 

 
Fig. 1 Boxplots of the misclassification rates on the simulated data.

 
3.2 Real data
3.2.1 Small round blue cell tumor
This dataset came from the small round blue cell tumors (SRBCT) study (Khan et al., 2001). The data, consisting of expression values of 2308 genes, were obtained from cDNA microarrays, which were made according to the standard protocol of the National Human Genome Research Institute. The SRBCTs were categorized as Burkitt lymphoma (BL): 8, Ewing sarcoma (EWS): 23, neuroblastoma (NB): 12 or rhabdomyosarcoma (RMS): 20. The dataset consisted of 63 training samples and 20 test samples. The base 10 logarithm of the expression levels was taken and the arrays were standardized. SPM selected six genes (Table 4) from the 63 training samples. Both the training and test error rates are 0%, which means that SPM can predict the tumor class for both seen and unseen samples with 100% accuracy using the six genes. The Wald statistics indicate the relevance of the variables.
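The preprocessing step can be sketched as below. The text does not say how the arrays were standardized; the sketch assumes that each array (sample) is centered and scaled to zero mean and unit sample standard deviation across genes, which is one common convention. The function name is hypothetical.

```python
import math
import statistics

def standardize_arrays(arrays):
    """Take log10 of every expression value, then standardize each array
    (one list per sample) to mean 0 and sample standard deviation 1."""
    result = []
    for arr in arrays:
        logged = [math.log10(v) for v in arr]
        mu = statistics.mean(logged)
        sd = statistics.stdev(logged)
        result.append([(v - mu) / sd for v in logged])
    return result
```

Expression values must be strictly positive for the logarithm to be defined; any thresholding needed to guarantee that is outside this sketch.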

We compare the number of genes in the optimal set selected by our method with those of PAM and SVM-RFE (Table 3). The result for artificial neural networks (ANNs) was taken from Khan et al. (2001). We used the multi-class SVM-RFE in the rfe package for the multiclass case. Multi-class SVM-RFE is achieved using the one-against-one approach, in which each two-class SVM classifier is described by a weight vector. The k-class classifier based on the one-against-one approach is characterized by k(k–1)/2 weight vectors, where k = M + 1. For each feature, the sum of the absolute values of the corresponding weight vector coordinates is used to characterize its discriminant power. We used the multi-class SVM-RFE method with 200 genes because of limitations in computer memory. Table 3 shows the test error rates and the number of genes selected for the SRBCT data. SPM selected the smallest number of genes that can accurately classify samples.
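The one-against-one ranking criterion described above can be sketched as follows. The weight vectors would come from the fitted two-class SVMs, which are not shown; the function names are hypothetical and the code is not taken from the rfe package.

```python
from itertools import combinations

def one_against_one_pairs(k):
    """All k(k-1)/2 unordered class pairs, one two-class SVM per pair."""
    return list(combinations(range(k), 2))

def ranking_criterion(weight_vectors):
    """For each feature, sum the absolute values of its coordinate
    over all pairwise weight vectors."""
    n_features = len(weight_vectors[0])
    return [sum(abs(w[j]) for w in weight_vectors) for j in range(n_features)]

def least_important_feature(weight_vectors):
    """Index of the feature that recursive elimination would drop next."""
    scores = ranking_criterion(weight_vectors)
    return min(range(len(scores)), key=scores.__getitem__)
```

For the four SRBCT classes (k = 4), `one_against_one_pairs(4)` yields six pairs, hence six weight vectors per elimination round.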


 
Table 3 Test error rates and number of genes selected for SRBCT data

 
We also examined the correspondence of the genes selected by SPM with those selected by the ANN, PAM and SVM-RFE methods. We found that the genes selected by SPM corresponded closely with those of all the other methods. All six genes selected by SPM were also selected by ANNs, and four of them by PAM and SVM-RFE (Table 4).


 
Table 4 The gene list selected for the SRBCT data

 
Whether the selected genes serve as legitimate markers for cancer classification was further verified by cluster analysis and visualization. In this regard, we applied a hierarchical clustering program developed by Eisen (Eisen et al., 1998) to the expression data of the selected genes and then visualized the structure of the data (Fig. 2). Figure 3 illustrates six different patterns of the chosen six genes in the same fashion as Figure 2 in Lee and Lee (2003).


 
Fig. 2 The gene expression maps of the chosen six genes for the SRBCT data. Each row corresponds to a gene, with the columns corresponding to different samples. Expression levels greater than the mean are shaded in red(green), and those below the mean are shaded in green(red). The genes are ordered by hierarchical clustering. We used the CLUSTER and TREEVIEW software, which are publicly available at http://rana.lbl.gov

 

 
Fig. 3 Box plots showing the expression patterns of the chosen six genes for the SRBCT data, each labeled by its IMAGE ID number.

 
The SPM classifier was applied to the original data consisting of 2308 genes. The objective was to detect important genes for classifying SRBCTs common in children, and at the same time to gauge the uncertainty that inherently lies in estimating the effects of the 2308 covariates with only 63 observations. For the assessment of variability, 100 bootstrap samples were drawn from the training data with the same class proportions as in the original sample. Similar to Figure 7 in Lee et al. (2004), our Figure 4 shows the proportion of times each gene was selected in the 100 replicated SPM classifiers based on the bootstrap samples; four of the six selected genes were consistently selected >45% of the time.
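The bootstrap assessment can be sketched as below. Resampling within each class keeps the class proportions of the original sample. The gene selector is passed in as a callable `select_genes(X, labels)` returning the indices of the selected genes; it stands in for the full SPM fit, and all names are hypothetical.

```python
import random
from collections import Counter

def stratified_bootstrap(labels, rng):
    """Resample indices with replacement within each class, so every class
    keeps its original number of samples."""
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    sample = []
    for members in by_class.values():
        sample.extend(rng.choice(members) for _ in members)
    return sample

def selection_proportions(X, labels, select_genes, n_boot=100, seed=0):
    """Proportion of bootstrap replicates in which each gene is selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        idx = stratified_bootstrap(labels, rng)
        chosen = select_genes([X[i] for i in idx], [labels[i] for i in idx])
        counts.update(set(chosen))
    return {gene: c / n_boot for gene, c in counts.items()}
```

Genes that never appear in any replicate are simply absent from the returned dictionary, which corresponds to a proportion of zero.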


 
Fig. 4 The proportion of selecting each gene in SPM for 100 bootstrap samples. Red vertical lines denote the six genes selected for the SRBCT.

 
Previous studies already demonstrated that the six genes were predictive for subtype classification of SRBCT (Khan et al., 2001; Tibshirani et al., 2002). In addition, the relevance of some genes to specific types of tumor was reported in the biological literature. For example, over-expression of cyclin D1 in EWS and NB has been shown by biological methods (Zhang et al., 2004; Dauphinot et al., 2001; Molenaar et al., 2003; Elenitoba-Johnson et al., 2002). Insulin-like growth factor 2 (somatomedin A: IGF2), related to myogenesis, was also previously reported to be highly expressed in RMS (El-Badry et al., 1990; Khan et al., 1999). However, IGF2 is expressed in some other cancers and normal tissues, and thus lacks specificity. Genes that are under-represented in a particular type of tumor compared with other types can also be selected as predictive genes. For instance, the cyclin-dependent kinase 6 (CDK6) gene selected for EWS was under-expressed in this tumor. Previously, CDK6 was reported to restrain proliferation in a certain type of cell, and its loss or down-regulation was implicated in the development and progression of some tumor types (Lucas et al., 2004). However, CDK6 is ubiquitously expressed in a wide variety of cells, and strong expression was described in specific tumors such as acute lymphoblastic leukemia (Fink and LeBein, 2001; Chilosi et al., 1998; Omura-Minamisawa et al., 2000). This gene was identified in Khan et al.'s work and Tibshirani et al.'s work for SRBCT. Meningioma (disrupted in balanced translocation) 1 (MN1) was also under-expressed in the NB subtype. The MN1 gene resides on chromosome 22 and was found to be disrupted by a balanced translocation (4;22) in meningioma, a common benign brain tumor (Lekanne et al., 1995). Absence of functional MN1 protein was suggested to contribute to meningioma pathogenesis. 
In addition, MN1 was shown to be fused to TEL, a member of the family of ETS transcription factor on chromosome 12p13 (12;22) in acute myeloid leukemia. Although cellular function of MN1 oncoprotein has not been investigated in detail, it is suggested to be a transcription coactivator and to be involved in RAR/RXR-mediated transcription. This gene was also identified in Khan et al.'s work and Tibshirani et al.'s work for SRBCT.

3.2.2 Leukemia data
The leukemia dataset was composed of 7129 gene expression values in three classes of leukemia: B-cell and T-cell acute lymphoblastic leukemia (B-cell ALL, 38 patients; T-cell ALL, 9 patients) and acute myeloid leukemia (AML, 25 patients) (Golub et al., 1999). As described in Dudoit et al. (2002), this dataset was preprocessed by thresholding and filtering, and underwent a logarithmic transformation followed by standardization. The dataset consists of 38 training samples and 34 test samples.

SPM selected four genes (Table 5) from the 38 training samples. The training accuracy is 100% and the test predictive accuracy is 91.18% (31/34), which means that SPM can predict the tumor classes for both seen and unseen samples using four genes with reasonably high accuracy. For comparison, we applied the PAM and SVM-RFE methods to the leukemia data. PAM selected 10 genes from the 38 training samples and achieved an accuracy of 94.12% (32/34), and SVM-RFE selected 6 genes from the 38 training samples and achieved an accuracy of 91.18% (31/34). All four genes selected by SPM were also selected by PAM, and one gene by SVM-RFE (Table 5).


 
Table 5 The gene list selected for the leukemia data

 
Whether the selected genes serve as legitimate markers for cancer classification was further verified by applying a hierarchical clustering to the expression data of the selected genes. By visual inspection of the gene expression of the chosen four genes, three clusters were clearly separated (Fig. 5). Figure 6 illustrates three different patterns of the chosen four genes in the same fashion as Figure 2 in Lee and Lee (2003).


 
Fig. 5 The gene expression maps of the chosen four genes for the leukemia data. Each row corresponds to a gene, with the columns corresponding to different samples. Expression levels greater than the mean are shaded in red(green), and those below the mean are shaded in green(red). The genes are ordered by hierarchical clustering. We used the CLUSTER and TREEVIEW software, which are publicly available at http://rana.lbl.gov

 

 
Fig. 6 Box plots showing the expression patterns of the chosen four genes for the leukemia data, each labeled by its gene accession number. (iii) X00437_s_at and (iv) X76223_s_at show identical expression patterns.

 
Previous studies already showed that the cystatin C (CST3) gene was responsible for subtype classification of leukemia as a two-class (ALL/AML) problem (Golub et al., 1999). The CST3 gene was also identified in Antonov et al.'s work for AML/ALL classification, but no plausible conclusion was drawn about the biological function of this gene in relation to AML/ALL pathogenesis (Antonov et al., 2004). The relevance of the MAL gene to T-cell ALLs was reported in the biological literature. The expression of the MAL gene was shown to be significantly high in the PBMC of patients with T-cell ALL, as compared with that of chronic T-cell leukemia patients and normal subjects (Kohno et al., 2000). The MAL mRNA was expressed in T cells at intermediate and late stages of differentiation (Alonso and Weissman, 1987), thyroid epithelial cells (Martin-Belmonte et al., 1998; Zacchetti et al., 1995) and myelin-forming cells (Kim et al., 1995). MAL is a proteolipid that has been identified as a component of glycolipid-enriched membrane (Martin-Belmonte et al., 1998; Zacchetti et al., 1995; Kim et al., 1995). MAL associates with glycosylphosphatidylinositol (GPI)-anchored proteins and the Src-like tyrosine kinase Lck in T lymphocytes. Cross-linking of GPI-anchored proteins triggers signaling pathways leading to Lck activation and T-cell proliferation (Millan and Alonso, 1998; Shenoy-Scaria et al., 1992). Thus, the MAL molecule is closely associated with molecules for T-cell activation, and it probably has roles in activation and proliferation. The MAL gene was also identified in Antonov et al.'s work for AML/ALL classification (Antonov et al., 2004). We also identified some genes not identified in other works for AML/ALL classification. The T-cell leukemia/lymphoma 1 (TCL1) gene and the T-cell receptor active beta chain (TCRB) gene were up-regulated in B-type ALL and T-type ALL, respectively. 
TCL1 was originally cloned in 1994 as a gene on chromosome 14 involved in recurring chromosomal abnormalities of adult T-cell leukemia (Virgilio et al., 1993, 1994). In these abnormalities, TCL1 was juxtaposed to a TCR locus and inappropriately regulated by the TCR cis-regulatory elements. TCL1 has subsequently been shown to be expressed in a variety of B-cell and T-cell neoplasms (Takizawa et al., 1998; Teitell et al., 1999; Narducci et al., 2000; Nakayama et al., 2000), but in neither hematopoietic progenitor cells (CD34+) nor mature lymphocytes. The increased expression of TCL1 was shown to be associated with the progression of immature B-cell ALL (Fears et al., 2002). The rearrangement of the gene coding for the beta chain of the T-cell receptor was found in T-cell leukemias and T-cell lymphomas in the biological literature (O'Connor et al., 1985; Aisenberg et al., 1985; Bertness et al., 1985).

3.2.3 Other real data
We report the classification results and the size of gene sets for the three additional publicly available real datasets (Table 6) which are known to have relatively higher misclassification rates than the above two real datasets.


 
Table 6 Description of real datasets

 
After preprocessing, all the data were base 10 log-transformed and the arrays were standardized. We performed 3-fold cross-validation and observed classification error rates. This procedure was repeated 50 times. We compared the test error rates and the average number of genes selected of our method with those of PAM and SVM-RFE.

Table 7 and Figure 7 display the average number of genes selected and boxplots of the misclassification rates of each classifier, respectively. For the Colon and Lymphoma data, SPM gave smaller misclassification rates with a smaller average number of selected genes; for the Brain data, the misclassification rate of SPM was comparable with those of the other methods, again with a smaller average number of selected genes. These results suggest that the proposed method outperformed the other methods on these datasets.


 
Table 7 The average number of genes selected from the real datasets

 

 
Fig. 7 Boxplots of the misclassification rates on the real datasets. (a) Colon, (b) Lymphoma and (c) Brain.

 
In order to show that variable selection may be helpful in improving prediction accuracy, we included SVM without gene selection for comparison. We used the R implementation svm(), which is based on LIBSVM (Chang and Lin, 2001), and performed C-classification with a Gaussian kernel. The parameter $\sigma$ and the cost $C$ were tuned by a grid search on $\{2^{-12}, \dots, 2^{12}\} \times \{2^{-5}, \dots, 2^{10}\}$ by 10-fold cross-validation on each training dataset, similar to Meyer et al. (2003) and Dettling (2004). As shown in Figure 7, SPM gave smaller misclassification rates than SVM without gene selection for these three datasets.
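The tuning grid can be enumerated as in the sketch below. The text gives only the end points of each range, so the use of consecutive integer exponents is an assumption, as are the variable names.

```python
from itertools import product

# Assumed: consecutive integer exponents between the stated end points.
sigma_grid = [2.0 ** e for e in range(-12, 13)]   # 2^-12, ..., 2^12
cost_grid = [2.0 ** e for e in range(-5, 11)]     # 2^-5, ..., 2^10

# Every (sigma, C) pair to be scored by 10-fold cross-validation.
grid = list(product(sigma_grid, cost_grid))
```

Under this assumption the grid has 25 × 16 = 400 candidate pairs, each of which requires ten SVM fits per training dataset.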


    4 CONCLUSION AND DISCUSSION
 
The goal of this study is to provide an effective method for finding genes that accurately discriminate cancer subtypes. The proposed SPM method gave a satisfactory classification performance with informative genes in the SRBCT and leukemia datasets. For the SRBCT data, SPM built an optimal class predictor consisting of six genes that was able to assign each SRBCT sample to one of four subtypes, BL, EWS, NB and RMS, with a 0% error rate. Our method required the smallest set of genes that can accurately classify samples. Although our method gives a slightly lower accuracy (31/34) than PAM (32/34) for leukemia classification, it is able to select a smaller set of genes that can still successfully classify the samples. Selecting a minimal set of genes associated with optimum predictive performance is more cost-effective in cancer classification. We found four genes that are able to assign leukemia samples to one of three classes (AML, B-cell ALL and T-cell ALL), while PAM found 10 genes as an optimal set. All four genes selected by our method were also selected by PAM.

The efficiency of our method in finding a relatively small number of predictive genes will facilitate the search for new diagnostic markers. The method efficiently finds and ranks genes that can discriminate one subtype of tumor from another. For the SRBCTs and leukemias analyzed here, the predictive genes are attractive candidates for markers in RNA-based diagnostic tests or immunohistochemical staining. In addition, our method may be used to select genes that are most significantly correlated with drug sensitivity or resistance, and to predict responses to chemotherapy according to gene expression profiles.


    Acknowledgments
 
The authors wish to thank Trevor Hastie and Ji Zhu for their helpful comments and referees who informed them of related works and provided constructive comments that greatly improved this article. The research of J.Y.K. was supported by Korea Research Foundation Grant funded by Korea Government (MOEHRD, Basic Research Promotion Fund) (KRF-2005-070-C00020). J.W.L. and I.S. were supported by Korea Science and Engineering Foundation Grant (R14-2003-002-01002-0). Funding to pay the Open Access publication charges for this article was provided by Korea Research Foundation Grant funded by Korea Government (MOEHRD, Basic Research Promotion Fund) (KRF-2005-70-C00020).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on December 21, 2005; revised on January 26, 2006; accepted on January 26, 2006

    REFERENCES
 

    Aisenberg, A.C., et al. (1985) Rearrangement of the gene for the beta chain of the T-cell receptor in T-cell chronic lymphocytic leukemia and related disorders. N. Eng. J. Med., 313, 529–533.

    Alizadeh, A., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

    Alonso, M.A. and Weissman, S.M. (1987) cDNA cloning and sequence of MAL, a hydrophobic protein associated with human T-cell differentiation. Proc. Natl Acad. Sci. USA, 84, 1997–2001.

    Antonov, A.V., et al. (2004) Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics, 20, 644–652.

    Bertness, V., et al. (1985) T-cell receptor gene rearrangements as clinical markers of human T-cell lymphomas. N. Eng. J. Med., 313, 534–538.

    Broberg, P. (2002) Ranking genes with respect to differential expression. Genome Biol., 3, preprint0007.

    Brown, P.O. and Botstein, D. (1999) Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21, Suppl. 1, 33–37.

    Chang, C. and Lin, C. (2001) LIBSVM: a library for support vector machines.

    Chilosi, M., et al. (1998) Differential expression of cyclin-dependent kinase 6 in cortical thymocytes and T-cell lymphoblastic lymphoma/leukemia. Am. J. Pathol., 152, 209–217.

    Dauphinot, L., et al. (2001) Analysis of the expression of cell cycle regulators in Ewing cell lines: EWS-FLI-1 modulates p57KIP2 and c-Myc expression. Oncogene, 20, 3258–3265.

    Dettling, M. (2004) BagBoosting for tumor classification with gene expression data. Bioinformatics, 20, 3583–3593.

    Dudoit, S., et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.

    Eisen, M.B., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.

    El-Badry, O.M., et al. (1990) Insulin-like growth factor II acts as an autocrine growth and motility factor in human rhabdomyosarcoma tumors. Cell Growth Differ., 1, 325–331.

    Elenitoba-Johnson, K.S., Bohling, S.D., Jenson, S.D., Lin, Z., Monnin, K.A., Lim, M.S. (2002) Fluorescence PCR quantification of cyclin D1 expression. J. Mol. Diagn., 4(2), 90–96.

    Fears, S., et al. (2002) Differential expression of TCL1 during pre-B-cell acute lymphoblastic leukemia progression. Cancer Genet. Cytogenet., 135, 110–119.

    Fink, J.R. and LeBien, T.W. (2001) Novel expression of cyclin-dependent kinase inhibitors in human B-cell precursors. Exp. Hematol., 29, 490–498.

    Furey, T.S., et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

    Guyon, I., et al. (2002) Gene selection for cancer classification using support vector machines. Mach. Learn., 46, 389–422.

    Hastie, T., Tibshirani, R., Friedman, J. (2002) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag.

    Khan, J., et al. (1999) cDNA microarrays detect activation of a myogenic transcription program by the PAX3-FKHR fusion oncogene. Proc. Natl Acad. Sci. USA, 96, 13264–13269.

    Khan, J., et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med., 7, 673–679.

    Kim, T., et al. (1995) Cloning and characterization of MVP17: a developmentally regulated myelin protein in oligodendrocytes. J. Neurosci. Res., 42, 413–422.

    Kohno, T., et al. (2000) Identification of genes associated with the progression of adult T cell leukemia (ATL). Jpn. J. Cancer Res., 91, 1103–1110.

    Kooperberg, C., et al. (1997) Polychotomous regression. J. Am. Stat. Assoc., 92, 117–127.

    Lee, J.W., et al. (2005) An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Anal., 48, 869–885.

    Lee, Y. and Lee, C.-K. (2003) Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics, 19, 1132–1139.

    Lee, Y., et al. (2004) Structured multicategory support vector machines with ANOVA decomposition. Technical Report 743, Department of Statistics, The Ohio State University, OH.

    Lekanne Deprez, R.H., et al. (1995) Cloning and characterization of MN1, a gene from chromosome 22q11, which is disrupted by a balanced translocation in a meningioma. Oncogene, 10, 1521–1528.

    Lucas, J.J., et al. (2004) Cyclin-dependent kinase 6 inhibits proliferation of human mammary epithelial cells. Mol. Cancer Res., 2, 105–114.

    Martin-Belmonte, F., et al. (1998) Expression of the MAL gene in the thyroid: the MAL proteolipid, a component of glycolipid-enriched membranes, is apically distributed in thyroid follicles. Endocrinology, 139, 2077–2084.

    Millan, J. and Alonso, M.A. (1998) MAL, a novel integral membrane protein of human T lymphocytes, associates with glycosylphosphatidylinositol-anchored proteins and Src-like tyrosine kinases. Eur. J. Immunol., 28, 3675–3684.

    Molenaar, J.J., et al. (2003) Rearrangements and increased expression of cyclin D1 (CCND1) in neuroblastoma. Genes Chromosomes Cancer, 36, 242–249.

    Myer, D., Leisch, F., Hornik, K. (2003) The support vector machines under test. Neurocomputing, 55, 169–442.

    Nakayama, I., et al. (2000) Activation of the TCL1 protein in B cell lymphomas. Pathol. Int., 50, 191–199.

    Narducci, M.G., et al. (2000) Regulation of TCL1 expression in B- and T-cell lymphomas and reactive lymphoid tissues. Cancer Res., 60, 2095–2100.

    O'Connor, N.T.J., et al. (1985) Rearrangement of the T-cell-receptor beta-chain gene in the diagnosis of lymphoproliferative disorders. Lancet, 8, 1295–1297.

    Omura-Minamisawa, M., et al. (2000) Universal inactivation of both p16 and p15 but not downstream components is an essential event in the pathogenesis of T-cell acute lymphoblastic leukemia. Clin. Cancer Res., 6, 1219–1228.

    Ramaswamy, S., et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154.

    Rao, C.R. (1973) Linear Statistical Inference and Its Applications, 2nd edn. Wiley, New York.

    Schölkopf, B. and Smola, J. (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Massachusetts.

    Schwarz, G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461–464.

    Shenoy-Scaria, A.M., et al. (1992) Signal transduction through decay-accelerating factor. Interaction of glycosyl-phosphatidylinositol anchor and protein tyrosine kinases p56lck and p59fyn 1. J. Immunol., 149, 3535–3541.

    Takizawa, J., et al. (1998) Expression of the TCL1 gene at 14q32 in B-cell malignancies but not in adult T-cell leukemia. Jpn. J. Cancer Res., 89, 712–718.

    Teitell, M., et al. (1999) TCL1 oncogene expression in AIDS-related lymphomas and lymphoid tissues. Proc. Natl Acad. Sci. USA, 96, 9809–9814.

    Tibshirani, R., et al. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, 99, 6567–6572.

    Vapnik, V. (1998) Statistical Learning Theory. Wiley, New York.

    Virgilio, L., et al. (1993) Chromosome walking on the TCL1 locus involved in T-cell neoplasia. Proc. Natl Acad. Sci. USA, 90, 9275–9279.

    Virgilio, L., et al. (1994) Identification of the TCL1 gene involved in T-cell malignancies. Proc. Natl Acad. Sci. USA, 91, 12530–12534.

    Zacchetti, D., et al. (1995) VIP/MAL, a proteolipid in apical transport vesicles. FEBS Lett., 377, 465–469.

    Zhang, J., et al. (2004) Selective usage of D-Type cyclins by Ewing's tumors and rhabdomyosarcomas. Cancer Res., 64, 6026–6034.

    Zhu, J. and Hastie, T. (2001) Kernel logistic regression and the import vector machines. Adv. Neural Inf. Process. Syst., 14.

