Bioinformatics Advance Access originally published online on December 13, 2005
Bioinformatics 2006 22(4):472-476; doi:10.1093/bioinformatics/bti827
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Differential gene expression detection and sample classification using penalized linear regression models
Division of Biostatistics, School of Public Health, University of Minnesota A460 Mayo Building, MMC 303, Minneapolis, MN 55455, USA
| ABSTRACT |
|---|
|
|
|---|
Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p >> n), microarray data analysis poses big challenges for statistical analysis. An obvious problem owing to the large p small n is over-fitting. Just by chance, we are likely to find some non-differentially expressed genes that can classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in the microarray data analysis. The SAM statistics proposed by Tusher et al. and the nearest shrunken centroid proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple, intuitive and prove to be useful in empirical studies.
Recently Wu proposed the penalized t/F-statistics with shrinkage by formally using the
1 penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discussed the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. using the
1 penalized regression models. And we show that the penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
Availability: For the computer programs, detailed analysis results and R functions for the proposed methods, please see http://www.biostat.umn.edu/~baolin/research/L1C-mc.html
Contact: baolin{at}biostat.umn.edu
Supplementary information: http://www.biostat.umn.edu/~baolin/research/L1C-mc.html
| 1 INTRODUCTION |
|---|
|
|
|---|
Microarray monitors the expression levels of hundreds of thousands of genes simultaneously in a whole genome level. The spotted arrays (DeRisi et al., 1996) and the high-density oligonucleotide chips (Lockhart et al., 1996) are two commonly used microarray technologies. There are many statistical problems associated with microarray data. The readers are referred to the books by Parmigiani et al. (2003) and Speed (2003) for comprehensive reviews. In this paper we focus on differential gene expression detection and sample classification using microarray data obtained from different groups or conditions, e.g. stage I, II, III, IV breast cancer and normal tissues.
Consider a G-class microarray data, where we have measured the expression levels of p genes for ng samples from class g, g = 1,..., G. Denote the measured expression values as
![]() |
![]() |
![]() | (1) |
In differential gene expression detection, the basic idea is to compare the expression levels across the different classes. We can do the comparison for gene i using the following linear regression model
![]() | (2) |
across all classes.
We can formally test for no difference across all classes by testing ßi1 = · · · = ßiG, which can be done using the F-statistics obtained from the least squares fitting (Kutner et al., 2004)
![]() |
, and (SSB,SSW) are the between/within-class sum of squares, which are also equal to the regression/error sum of squares (SSR/SSE) of the linear regression model (2). Therefore the F-statistics is proportional to the commonly used ratio of between/within-class sum of squares (Golub et al., 1999). After we cast differential gene expression detection in the framework of linear regression models, it is natural to consider alternative estimation methods other than least squares. In the following we discuss the differential gene expression detection and sample classification with the penalized linear regression models.
| 2 PENALIZED LINEAR REGRESSION |
|---|
|
|
|---|
2.1
penalty: LASSO regressionLinear regression with the
penalty, known as the LASSO regression, has been shown to have shrinkage and variable selection properties (Tibshirani, 1996; Efron et al., 2004). The
penalized regression model corresponds to the following constrained optimization
![]() | (3) |
![]() |
With the estimated parameters the predicted mean value for class g is
![]() | (4) |
penalized linear regression (3) is shrinking the difference between the individual class means and the overall mean. As a result the expression value is shrunken towards the global mean. The intuition is that the overall mean is relatively more stable than the individual class means owing to the bigger sample size. So the shrinkage can help to produce more reliable estimates of the true means (Donoho and Johnstone, 1994). Notice that if we make
g the same across classes, then (4) implicitly shrinks more those class means with smaller sample size.
Similar to the least squares estimation we define the total/error sum of squares (SSTO/SSE) for the penalized regression model as
![]() |
penalized F-statistics for testing equal means across groups as
![]() |
![]() |
![]() |
In the nearest shrunken centroid method (also known as PAM; Tibshirani et al., 2002, 2003), the following ad hoc soft-shrunken t-statistics and centroid are used
![]() |
is the shrinkage parameter and
makes mgsi equal to the estimated standard error of the mean difference,
. Thus dig is a t-statistics for gene i comparing class g with the average class. We can equivalently write the previous equations as
![]() | (5) |
is equivalent to the predicted mean response (4) from the penalized linear model if we take
g = 2ngmg si
. Similar to (4), in (5) the class means with smaller sample size are shrunken more since
.
2.2 Differential gene expression detection and sample classification
Most differential gene expression detection methods, being parametric, non-parametric or Bayesian, are based on thresholding. Usually a score is assigned to each gene. Commonly used scores include the P-values and some variants of t/F-statistics. A cutoff value t0 is chosen and those genes with the score bigger than t0 are declared as differentially expressed. The false discovery rate (FDR; Benjamini and Hochberg, 1995) has become increasingly adopted in large-scale genomic studies to assess the significance of the identified set of genes. FDR is defined as the expected proportion of false positives and used as a guidance for choosing the cutoff value t0.
We can also select a set of interesting genes using sample classification errors, which can be formulated as the following optimization problem
![]() |
is the classification error using genes
with some classifiers, e.g. K-nearest-neighbor, support vector machine (SVM), random forest (RF), etc. (Dudoit et al., 2002; Wu et al., 2003). Several difficulties are associated with this approach. First, since it is a combinatorial optimization, it is very hard to get the global optimum. Second, the results will depend on the classifier used. For microarray data it is common to observe p >> n, so overfitting is very likely to occur. Just by chance we can find many combinations of genes to achieve small classification errors.
One way to overcome some of the difficulties is to use some heuristic optimization with regularization. In the PAM method, the shrinkage parameter
controls the set of genes selected and the classification errors, i.e.
![]() |
. Hence we transform the combinatorial optimization into an one-dimensional optimization problem indexed by parameter
. Microarray is often used as an exploratory tool. The goal of differential gene expression detection is usually to identify a set of interesting genes for follow-up study. How many genes we select as interesting is really a balance of many considerations. The mathematical formulations as minimizing classification error or FDR are really a simplistic approach. Owing to the small sample size, the estimated FDR or classification errors will have large variance. In practice they should be combined with other factors to select genes. For example, owing to cost considerations, only a limited number of genes can be further studied.
2.3 Penalty parameter
g selection
In the PAM method, the tuning parameter
is chosen to shrink the majority of the
to zero, and those genes with at least one non-zero
are selected to build the classifier. Notice that when
g = 2ngmgsi
,
![]() |
For two-class microarray data (G = 2), let
1 =
2 =
and notice that
![]() |
![]() |
![]() |
, which are based on minimizing the dependence between si and
. This can be applied to select the parameters
g for the penalized F-statistics. In addition, analogous to the sample classification approach, we propose to select both
g and differentially expressed genes simultaneously using the FDR: for specific values of
g, we will declare those genes with
as differentially expressed, and the associated FDR can be estimated using permutations.
In the following we first illustrate the selection of penalty parameter
using the small round blue cell tumors (SRBCT) microarray data reported at Khan et al. (2001), which has also been analyzed in Tibshirani et al. (2002). We then compare the proposed method with other alternative classification methods using some public microarray data.
| 3 APPLICATION TO MICROARRAY DATA |
|---|
|
|
|---|
The SRBCT microarray data measured the expression levels of 2308 genes for 63 training samples and 25 testing samples from four tumor types: Burkitt lymphoma (BL, 8 samples), Ewing sarcoma (EWS, 23 samples), neuroblastoma (NB, 12 samples) and rhabdomyosarcoma (RMS, 20 samples). The normalized expression values for these 2308 genes can be downloaded from http://research.nhgri.nih.gov/microarray/Supplement
3.1 Differential gene expression detection and FDR estimation/control
With some selected values of
g, we can shrink the majority of the
1 penalized F-statistics to zero, and those genes with non-zero
can be declared as significant. The number of false positives owing to random chance can be estimated by permutations. Specifically we can randomly permute the class indicator and recalculate the penalized F-statistics. The average number of non-zeros over permutations are used to estimate the number of false positives. Figure 1 shows the false positive estimations for these 2308 genes using
g =
. We can see that the number of false positives is extremely small. One possible explanation is that in Khan et al. (2001), these 2308 genes are the quality filtered subset of the original 6567 genes and possibly lots of non-significant genes have been removed. The extremely small proportion of false positives observed in Figure 1 also partially explains the good classification accuracy for the SRBCT microarray data (Khan et al., 2001; Tibshirani et al., 2002).
|
3.2 Sample classification
Similar to the PAM method, we can apply the nearest centroid classification using the shrunken centroid estimated from the
penalty models (4) to select genes and classify samples simultaneously. Considering the small sample sizes for different classes of tumors (8, 23, 12, 20), we choose 5-fold CV to estimate the classification errors.
There are two levels in terms of the penalty parameter selections: the relation of
g across different classes within the same gene and the relation of penalty parameters across different genes. Notice that in the
penalized linear regression model (3), the penalty parameter
g implicitly depends on the sample size n.
For within-gene penalty parameter selection, we can use common
g across classes:
=
g/n; or use standardized penalty regression: each predictor is standardized first and then the same penalties are applied to all regression coefficients, which corresponds to using
![]() |
is the standard deviation for ygj. Let ßig =
ig/
g and we have
![]() |
For across gene penalty parameter selection, we can use the same
for different genes. Or we can scale
by the standard error of individual gene expressions, i.e. using penalty parameter si
for gene i.
Table 1 summarizes the parameter estimations for the combination of standardization within-gene and scaling across-gene.
|
In 5-fold CV, we randomly partition the 63 samples into five similar groups. Each time one group is held out as a testing set. Notice that in the CV, n and ng will be different for different combination of training groups. Figure 2 shows the classification error estimations from the 5-fold CV. We can see that the scaling helps to decrease the classification error and reduce the number of genes used. Using both standardization and scaling can substantially improve the classification.
|
Next we compare the proposed method (using both standardization and scaling) with some popular classification methods by application to some public microarray data.
| 4 COMPARISON WITH OTHER CLASSIFICATION METHODS |
|---|
|
|
|---|
We evaluate classification performance using the following three public microarray data listed at Table 2.
|
We randomly choose one-third of the samples as testing set and the other samples as training set. For each training set, K-fold CV is used to train the classifier, which is then used to predict the testing set. We use K = 5 for the breast cancer data and K = 3 for the other two data due to their relatively small sample sizes in each group. We repeat the process 50 times to get the average misclassification errors. The same procedures are applied to SVM (Burges, 1998), RF (Breiman, 2001), PAM (Tibshirani et al., 2002) and the proposed method (called RegCL in the following discussion). For SVM multiclass-classification, we employ the one-against-one-approach, and the appropriate class is found by a voting scheme (Chang and Lin, 2001).
Table 3 summarizes the average classification errors over 50 random CVs. We can see that the proposed method has favorable performance compared with other methods (for detailed results please see the Supplementary website).
|
| 5 DISCUSSION |
|---|
|
|
|---|
Due to the large p small n commonly observed for microarray data, regularization can help to prevent over-fitting and produce more reliable estimations. Some ad-hoc shrinkage methods have been proposed to utilize the shrinkage ideas and prove to be useful in empirical studies. In this paper we show that the ad hoc shrinkage methods can be rigorously derived from the
penalized linear regression models, which provide a unified statistical framework for the penalized differential gene expression detection (Wu, 2005) and the nearest shrunken centroid classification methods (Tibshirani et al., 2002). Through applications to public microarray data, we illustrate the importance of the regularization parameter selection. By comparing with some other popular classification methods, we showed the favorable performance of the proposed methods. Recently Storey et al. (2005) proposed the optimal discovery procedure to fully utilize the information from all the genes to improve differentially gene expression detection. It would be interesting to study how to incorporate dependence information among genes into the proposed penalized F-statistics for differential gene expression detection.
| Acknowledgments |
|---|
This research was partially supported by a startup fund from the Division of Biostatistics and a research grant from the Graduate School of University of Minnesota. I would like to thank two referees for their constructive comments, which have greatly improved the presentation of the paper.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on October 11, 2005; revised on December 6, 2005; accepted on December 7, 2005
| REFERENCES |
|---|
|
|
|---|
Alizadeh, A., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503511[CrossRef][Medline].
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B, . 57, 289300.
Bolstad, B., et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185193
Breiman, L. (2001) Random forests. Mach. Learning, 45, 532[CrossRef].
Burges, C.J.C. (1998) A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc, . 2, 121167[CrossRef].
Chang, C.C. and Lin, C.J. (2001) Libsvm : a library for support vector machines. Technical Report, available at http://www.csie.ntu.ed.tw/~cjlin/libsvm.
DeRisi, J., et al. (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet, . 14, 457460[CrossRef][ISI][Medline].
Donoho, D.L. and Johnstone, I.M. (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425455
Dudoit, S., et al. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sinica, 12, 111139.
Efron, B., et al. (2004) Least angle regression. Ann. Stat, . 32, 407499[CrossRef].
Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537
Khan, J., et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med, . 7, 673679[CrossRef][ISI][Medline].
Kutner, M., Nachtsheim, C., Wasserman, W. Applied Linear Regression Models, (2004) 4th edn. , Mcgraw-Hill/Irwin, NY.
Lockhart, D., et al. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol, . 14, 16751680[CrossRef][ISI][Medline].
In Parmigiani, G., Garrett, E.S., Irizarry, R.A., Zeger, S.L. (Eds.). The Analysis of Gene Expression Data: Methods and Software, . (2003) , Springer, NY.
Pomeroy, S.L., et al. (2002) Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature, 415, 436442[CrossRef][Medline].
In Speed, T. (Ed.). Statistical Analysis of Gene Expression Microarray Data, (2003) Chapman & Hall/CRC, USA.
Storey, J.D., Dai, J.Y., Leek, J.T. (2005) The optimal discovery procedure II: applications to comparative microarray experiments. UW Biostatistics Working Paper Series, , pp. 260 available at http://www.bepress.com/uwbiostat/paper260.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Statistical Soc. B, Methodological, 58, 267288.
Tibshirani, R., et al. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA, 99, 65676572
Tibshirani, R., et al. (2003) Class prediction by nearest shrunken centroids, with application to DNA microarrays. Stat. Sci, . 18, 104117[CrossRef][ISI].
Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 51165121
West, M., et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. USA, 98, 1146211467
Wu, B. (2005) Differential gene expression detection using penalized linear regression models: the improved SAM statistics. Bioinformatics, 21, 15651571
Wu, B., et al. (2003) Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 16361643
Yang, Y.H., et al. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res, . 30, e15
This article has been cited by other articles:
![]() |
Y. Lai Genome-wide co-expression based prediction of differential expressions Bioinformatics, March 1, 2008; 24(5): 666 - 673. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Abeel, Y. Saeys, E. Bonnet, P. Rouze, and Y. Van de Peer Generic eukaryotic core promoter prediction using structural features of DNA Genome Res., February 1, 2008; 18(2): 310 - 323. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Tai and W. Pan Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data Bioinformatics, December 1, 2007; 23(23): 3170 - 3177. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Wang and J. Zhu Improved centroids estimation for the nearest shrunken centroid classifier Bioinformatics, April 15, 2007; 23(8): 972 - 979. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

























