Bioinformatics Vol. 18 no. 1 2002
Pages 39-50
© 2002 Oxford University Press
Tumor classification by partial least squares using microarray gene expression data
1 Center for Image Processing and Integrated Computing
2 Department of Applied Science, University of California, Davis, CA 95616, USA
Received on November 23, 2000; revised on March 22, 2001; accepted on June 6, 2001
Motivation: One important application of gene expression microarray data is the classification of samples into categories, such as the type of tumor. Microarrays allow the expression of thousands of genes to be monitored simultaneously in each sample. This ability to measure gene expression en masse has resulted in data in which the number of variables p (genes) far exceeds the number of samples N. Standard statistical methodologies in classification and prediction do not work well, or do not work at all, when N < p. Modification of existing statistical methodologies or development of new methodologies is needed for the analysis of microarray data.
Results: We propose a novel analysis procedure for classifying (predicting) human tumor samples based on microarray gene expressions. This procedure involves dimension reduction using Partial Least Squares (PLS) and classification using Logistic Discrimination (LD) and Quadratic Discriminant Analysis (QDA). We compare PLS to the well-known dimension reduction method of Principal Components Analysis (PCA). Under many circumstances PLS proves superior; we illustrate a condition under which PCA predicts particularly poorly relative to PLS. The proposed methods were applied to five different microarray data sets involving various human tumor samples: (1) normal versus ovarian tumor; (2) Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL); (3) Diffuse Large B-cell Lymphoma (DLBCLL) versus B-cell Chronic Lymphocytic Leukemia (BCLL); (4) normal versus colon tumor; and (5) Non-Small-Cell-Lung-Carcinoma (NSCLC) versus renal samples. The stability of the classification results and methods was further assessed by re-randomization studies.
Availability: The methodology can be implemented using a combination of standard statistical methods, available, for example, in SAS. Illustrative SAS code is available from the first author.
Contact: nguyen{at}wald.ucdavis.edu; dmrocke{at}ucdavis.edu
* To whom correspondence should be addressed.
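The two-stage procedure summarized in the Results (PLS dimension reduction followed by logistic discrimination) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' SAS implementation: it assumes a single binary response (PLS1 computed by sequential deflation), replaces a full logistic discrimination routine with a simple gradient-ascent fit, omits QDA, and uses synthetic data. All function names (`pls_fit`, `pls_transform`, `logistic_fit`, `logistic_predict`, `make_toy_data`) and all parameter values are hypothetical.

```python
import math
import random

def pls_fit(X, y, k):
    """PLS1 by sequential deflation (assumed variant). Returns (model, N x k scores)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centered predictors
    f = [v - y_mean for v in y]                              # centered response
    weights, loadings, scores = [], [], []
    for _ in range(k):
        # Weight vector: direction of maximal covariance with the response.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = [sum(a * b for a, b in zip(row, w)) for row in E]  # component scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        # Deflate X; y need not be deflated because the deflated X is orthogonal to t.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        weights.append(w)
        loadings.append(load)
        scores.append(t)
    model = {"x_mean": x_mean, "weights": weights, "loadings": loadings}
    return model, [list(col) for col in zip(*scores)]

def pls_transform(model, X):
    """Project samples onto the fitted components using the same deflation steps."""
    out = []
    for row in X:
        e = [v - m for v, m in zip(row, model["x_mean"])]
        t_row = []
        for w, load in zip(model["weights"], model["loadings"]):
            t = sum(a * b for a, b in zip(e, w))
            e = [a - t * b for a, b in zip(e, load)]
            t_row.append(t)
        out.append(t_row)
    return out

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logistic_fit(T, y, steps=500, lr=0.1):
    """Fit intercept + slopes by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * (len(T[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for t_row, label in zip(T, y):
            z = [1.0] + t_row
            resid = label - sigmoid(sum(b * v for b, v in zip(beta, z)))
            for j, v in enumerate(z):
                grad[j] += resid * v
        beta = [b + lr * g / len(T) for b, g in zip(beta, grad)]
    return beta

def logistic_predict(beta, T):
    """Estimated probability of class 1 for each row of component scores."""
    return [sigmoid(sum(b * v for b, v in zip(beta, [1.0] + t_row))) for t_row in T]

def make_toy_data(n, p, n_informative, shift, rng):
    """Synthetic stand-in for expression data: the first n_informative 'genes' differ by class."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in range(n_informative):
            row[j] += shift * label
        X.append(row)
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_toy_data(20, 200, 10, 3.0, rng)  # N = 20 samples, p = 200 genes
X_test, y_test = make_toy_data(20, 200, 10, 3.0, rng)
model, T_train = pls_fit(X_train, y_train, k=2)          # reduce 200 genes to 2 components
beta = logistic_fit(T_train, y_train)
probs = logistic_predict(beta, pls_transform(model, X_test))
accuracy = sum((q > 0.5) == bool(lab) for q, lab in zip(probs, y_test)) / len(y_test)
print(f"test accuracy on toy data: {accuracy:.2f}")
```

Because the classifier sees only k component scores rather than all p genes, the logistic fit remains well defined even though N < p. The number of components k and the toy signal strength here are arbitrary choices for illustration.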