Bioinformatics Advance Access originally published online on December 14, 2004
Bioinformatics 2005 21(8):1565-1571; doi:10.1093/bioinformatics/bti217
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Differential gene expression detection using penalized linear regression models: the improved SAM statistics
Division of Biostatistics, School of Public Health, University of Minnesota A460 Mayo Building, MMC 303, Minneapolis, MN, 55455, USA
| Abstract |
|---|
|
|
|---|
Summary: Differential gene expression detection using microarrays has received lots of research interests recently. Many methods have been proposed, including variants of F-statistics, non-parametric approaches and empirical Bayesian methods etc. The SAM statistics has been shown to have good performance in empirical studies. SAM is more like an ad hoc shrinkage method. The idea is that for small sample microarray data, it is often useful to pool information across genes to improve efficiency. Under Bayesian framework Smyth formally derived the test statistics with shrinkage using the hierarchical models. In this paper we cast differential gene expression detection in the familiar framework of linear regression model. Commonly used test statistics correspond to using least squares to estimate the regression parameters. Based on the vast literature of research on linear models, we can naturally consider other alternatives. Here we explore the penalized linear regression. We propose the penalized t-/F-statistics for two-class microarray data based on
penalty. We will show that the penalized test statistics intuitively makes sense and through applications we illustrate its good performance.
Availability: Supplementary information including program codes, more detailed analysis results and R functions for the proposed methods can be found at http://www.biostat.umn.edu/~baolin/research
Contact: baolin{at}biostat.umn.edu
Supplementary information: http://www.biostat.umn.edu/~baolin/research
| 1 INTRODUCTION |
|---|
|
|
|---|
Micoarray enables us to study the expression levels of hundreds of thousands of genes simultaneously in a whole genome level. The commonly used microarray technology includes the spotted arrays (DeRisi et al., 1996) and the high-density oligonucleotide chips (Lockhart et al., 1996).
There are lots of statistical problems associated with microarray data. The readers are referred to Smyth et al. (2003) for a comprehensive review. In this paper we focus on detecting differentially expressed genes using microarray data obtained from different groups or conditions, e.g. cancer and normal tissues.
Consider the two-class microarray data, where we have measured expression levels of m genes for n1 samples from one group and n2 samples from another group. Note the measured gene expression data as the following matrix
![]() |
We introduce the following group indicator
![]() | (1) |
![]() |
In differential gene expression detection, the basic idea is to compare the expression levels across two groups. We can do the comparison for gene i by using the following linear regression models
![]() | (2) |
![]() | (3) |
![]() | (4) |
![]() |
The statistic (3) is the two-sample t-statistics with the pooled variance estimation being
![]() | (5) |
After we cast the differential gene expression detection in the linear model framework, it is natural to consider other alternative estimation methods than the least squares. In the following section we will discuss the use of penalized regression model, which leads to intuitively appealing statistics.
| 2 PENALIZED LINEAR REGRESSION |
|---|
|
|
|---|
penalized linear regression has been shown to have shrinkage and variable selection properties (Tibshirani, 1996; Efron et al., 2004). It can be formally defined as the following optimization problem
![]() | (6) |
We can show that the optimal parameters for the penalized model (6) are (see Appendix for details)
![]() | (7) |
![]() |
![]() |
Similar to the ordinary linear regression model, we define the total/error sum of squares for the penalized regression model as
![]() | (8) |
![]() | (9) |
Analogous to the definition of ordinary t- (3) and F-statistics (4), we propose the penalized t-/F-statistics (t*,F*) defined as
![]() | (10) |
![]() | (11) |
. But for
penalized regression model, they do have some difference.
Notice that the shrinkage is applied to both the nominator and denominator for the penalized t-/F-statistics (11). In the nominator the absolute mean difference is shrunken by
. The addition of the constant
2/(n2) to the denominator has the effect of making individual variance more similar to each other and stabilizing the variance. Compared to the SAM statistics (Tusher et al., 2001)
![]() | (12) |
In the following section we apply the proposed methodology to the ionizing radiation (IR) response microarray data reported at Tusher et al. (2001). Through the application, we will illustrate the selection of penalty parameter
and the estimation/control of false positives.
| 3 APPLICATION TO MICROARRAY DATA |
|---|
|
|
|---|
Tusher et al. (2001) compared the expression levels of 6800 genes from wild-type human lymphoblastoid cell lines, designated as 1 and 2, growing in an unirradiated state (U) to an irradiated state (I) 4 h after exposure to a modest dose of 5 Gy of IR.
The raw and normalized data are downloaded from http://www-stat.stanford.edu/~tibs/tusher. The downloaded data contains 7129 gene-expression levels from eight hybridization conditions: U1A, U1B, U2A, U2B, I1A, I1B, I2A and I2B, where U and I correspond to the irradiation conditions, 1 and 2 are two cell lines and A and B are two replicates. The goal is to identify those differentially expressed genes between four U and I samples. The normalized gene expression data are used for analysis.
For the SAM statistics (12), the s0 was chosen to minimize the coefficient of variation of di computed as a function of si in moving windows across the data. The intuition is to make di independent of si.
3.1 False discovery rate estimation/control
We are testing m (=7129 for this dataset) genes simultaneously to identify differentially expressed genes. In the statistics of multiple testing (Hochberg and Tamhane, 1987; Westfall and Young, 1993) FWER (familywise error rate) refers to the probability of conducting at least one false positive, which is too conservative for the microarray dataset. False discovery rate (FDR), proposed by Benjamini and Hochberg (1995), is defined as the expected proportion of false positives. Controlling FDR proves more powerful than FWER and has become increasingly adopted for genomic studies.
The statistical significance of the penalized t-/F-statistics for each gene is evaluated by permutations. Similar to the analysis of Tusher et al. (2001) balanced permutations are used to minimize the potential confounding effects from differences between the two cell lines, where for each permutation we require that the first group contains two samples from the first cell line and two samples from the second cell line. In total we have 36 balanced permutations. Specifically we can use the following permutation algorithm to select significant genes and estimate FDR:
- For the original data calculate the SAM and penalized t-/F-statistics, denote their ordered values as

- For the k-th balanced permutation, calculate the SAM and penalized t-/F-statistics and denote their ordered values as
Denote their averages across all permutations as

- For a cutoff value
, identify the following genes as significant
Denote
and estimate the expected number of false positives by chance for the SAM statistics as following
where I{·} is the indicator function, and the estimated FDR is
where
is the total number of significant genes. We can similarly calculate the expected number of false positives and FDR for the penalized t-/F-statistics.
3.2 Matching penalty parameter
Tusher et al. (2001) reported the optimal s0 for the IR microarray data as 3.3. Comparing the SAM statistics (12) and the penalized t-statistics (11), we can choose the matching penalty parameter
as
![]() |
|
3.3 Optimal
selectionIn the previous section, the
was chosen to match the optimal s0 for the SAM statistics. We can also minimize the dependence of the penalized t-statistics and si to select an optimal
. We can design a score to measure their dependence by assuming the following linear model
![]() | (13) |
, and is used for testing ß1 = 0, which corresponds to independence of
and si. It is easy to show that R2 equals the square of the sample correlation between
and si. So the optimal
can be chosen by minimizing the correlations. Figure 2 displays the absolute value of R as a function of
. The optimal
is 27.14 for the data.
|
It may be argued that the linear model assumption between
and si (13) is too strong. We can avoid this by using local regression model. The basic idea is to use a moving window across the values of si, and within each local window we will fit a linear or polynomial model between
and si. And the residuals will reflect the dependence between the two variables. Formally, we can maximize the ratio of residual sum of squares (SSE) and total sum of squares (SSTO) to select optimal
, where the SSTO serves as a scaling factor for SSE to make the ratio comparable across different
. Figure 3 shows the square root of the ratios as a function of
calculated using the lowess fitting algorithm in the R program (Ihaka and Gentleman, 1996). The optimal
is 32.83. It is different from 27.14 but the two values produce very close results in terms of the number of false positives for a fixed number of significant genes (Fig. 4). Also the flat curve of the ratios over a big range, say 2837, is reflected in the fact that the performance of the penalized t-statistics is very stable for
within that range (see Supplementary information for details).
|
|
Using s0 = 3.3 for the SAM statistics and
= 27.14,32.83 for the penalized t-statistics, we can compute the number of significant genes and the expected number of false positives for a fixed cutoff. Figure 4 compares the two procedures with optimally selected parameter values. We can see that the penalized t-statistics does perform better than the SAM statistics in terms of minimizing the expected number of false positives for a fixed number of significant genes.
Figure 5 shows the difference in the identified sets of significant genes between the SAM statistics and the penalized t-statistics, where the x-axis shows the number of significant genes corresponding to different thresholding parameter
and the y-axis shows the corresponding number of non-overlapping genes between the two sets of significant genes. This difference reflects the combination of true and false positive differences and is partially responsible for their difference in the estimated number of false positives (and hence FDR) in Figure 4. It would be very interesting to investigate the differences with additional biological information or experiments (the annotations of the genes are not available in the downloaded dataset).
|
3.4 Penalized t-statistics versus the penalized F-statistics
For the proposed penalized t-/F-statistics (11), the penalized t-statistics is shrunken more than the penalized F-statistics since we have
![]() | (14) |
All our previous comparisons are based on the penalized t-statistics. For the penalized F-statistics we can similarly choose the optimal
by minimizing the dependence of
and si. For this dataset, the penalized t-statistics and the penalized F-statistics give similar results (see Supplementary information for details), though they may perform differently for other datasets.
| 4 SIMULATION STUDY |
|---|
|
|
|---|
In the previous section we illustrate our proposed methods by applying them to a public microarray data which has been studied by a well-established method and comparing their performance on the dataset. But, unfortunately, the annotations of the genes are not available in the downloaded dataset. So it is impossible to validate the difference by additional biological information.
Here we used simulations to further verify the validity of the proposed method. First we rank all the 7129 genes, the top 1000 genes are used to simulate differentially expressed genes and the remaining 6129 genes are used to simulate non-differentially expressed genes. The set of top 1000 genes ranked by the t-statistics, the SAM statistics and the penalized t-/F-statistics are very similar. Denote the mean difference and standard error estimation as
![]() |
i,si),i = 1,...,1000 are from the top 1000 genes.
m = 7129 genes with n = 8 replicates are simulated from normal distributions. Specifically m1 = 200 differentially expressed genes are simulated from distribution
![]() |
![]() |
The 8 replicates of the m0 = m m1 = 6929 non-differentially expressed genes are simulated from distribution
![]() |
![]() |
We compare the ordinary t-statistics, the SAM statistics and the penalized t-/F-statistics. The number of true positives and false positives are estimated based on B = 1000 simulations. Figure 6 shows the simulation results, where the x-axis is the average number of true positives divided by m1, and the y-axis is the average number of false positives divided by m0. We can see that the penalized t-statistics does have a small edge over the SAM statistics. Both are better than the penalized F-statistics, while the ordinary t-statistics are worse than the others.
|
Simulations for m1/m equal to other values, e.g. 0.1, 0.2, 0.9, have also been done and the patterns for the proportions of the true and false positives are very similar to Figure 6.
| 5 DISCUSSION |
|---|
|
|
|---|
In this paper we cast differential gene-expression detection into the linear regression framework. And commonly used statistics correspond to fitting the linear model using ordinary least squares. Due to the large number of genes and relatively small number of samples, regularization is useful to prevent overfitting. We discuss the use of
penalty model and propose the penalized t-/F-statistics. We illustrate the usefulness of the proposed models by an application to a public microarray data and show their comparable performance to the SAM statistics proposed by Tusher et al. (2001). Both the SAM statistics and the proposed penalized t-/F-statistics can be seen as a shrinkage of ordinary t-statistics. But the SAM is more like an ad hoc shrinkage method, while the proposed penalized t-/F-statistics are based on rigorous statistical models and prove to be intuitive and perform favorably in applications. In the application FDR is estimated using permutation and thresholding the test statistics. Alternative estimation methods using P-values can also be applied (Storey (2002)). Owing to the limited number of permutations (36), the assigned P-values will be discrete. Using the discrete P-values to estimate FDR will produce very crude results (Tusher et al., 2001). When we have enough samples we may get more accurate estimation of FDR using P-values.
Through the use and generalization of linear models, we propose two automatic methods to select optimal penalty parameter
for the penalized t-/F-statistics. The applications also show good performance.
The proposed penalized t-/F-statistics (11) are symmetric with respect to samples from two classes. It can also be shown that the results are the same if we choose a different coding of y in (1), e.g. using the first group instead of the second group as the reference group. For multiclass microarray dataset, the results will be different depending on our coding of the y variables. Penalized linear model approach to differential gene expression detection using multiclass dataset will be reported elsewhere.
| APPENDIX |
|---|
|
|
|---|
Parameter estimation for the regression model with
penalty [Equation (6)]Our goal is to minimize
![]() |
![]() |
![]() |
Derivation of the t-/F-statistics for the linear model with least squares estimation
For the ordinary regression model (2) we have
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
Derivation of the penalized t-/F-statistics
For the penalized model (6) we have
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
fittings do not satisfy the ANOVA equation, SSTO = SSE+SSR.
| Acknowledgments |
|---|
This research was supported by a startup fund from the Division of Biostatistics, University of Minnesota. The author would like to thank Dr Wei Pan for helpful discussions, and the referees for helpful comments.
Received on November 9, 2004; revised on December 8, 2004; accepted on December 9, 2004
| REFERENCES |
|---|
|
|
|---|
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodological), 57, 289300.
Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185193
DeRisi, J., Penland, L., Brown, P.O., Bittner, M.L., Meltzer, P.S., Ray, M., Chen, Y., Su, Y.A., Trent, J.M. (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet., 14, 457460[CrossRef][Web of Science][Medline].
Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin., 12, 111139.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004) Least angle regression. Ann. Stat., 32, 407499[CrossRef].
Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 11511160[CrossRef][Web of Science].
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537 (http://www.sciencemag.org/cgi/content/abstract/286/5439/531)
Hochberg, Y. and Tamhane, A.C. Multiple Comparison Procedures, (1987) John Wiley & Sons.
Ihaka, R. and Gentleman, R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299314[CrossRef].
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., Brown, E.L. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol., 14, 16751680[CrossRef][Web of Science][Medline].
Pan, W. (2002) A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18, 546554 (http://bioinformatics.oupjournals.org/cgi/content/abstract/18/4/546)
Pan, W. (2003) On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19, 13331340 (http://bioinformatics.oupjournals.org/cgi/content/abstract/19/11/1333)
Smyth, G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol., 3, Article 3.
Smyth, G.K., Yang, Y.H., Speed, T.P. (2003) Statistical issues in microarray data analysis. In Brownstein, M.J. and Khodursky, A.B. (Eds.). Functional Genomics: Methods and Protocols, Human Press Vol. 224, , pp. 111136[CrossRef].
Storey, J.D. (2003) A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B (Methodological), 64, 479498.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc., Ser. B (Methodological), 58, 267288.
Tusher, V.G., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 51165121
Westfall, P.H. and Young, S.S. Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment, (1993) John Wiley & Sons.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., Speed, T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30, e15
This article has been cited by other articles:
![]() |
Y. Lai Genome-wide co-expression based prediction of differential expressions Bioinformatics, March 1, 2008; 24(5): 666 - 673. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Wu Differential gene expression detection and sample classification using penalized linear regression models Bioinformatics, February 15, 2006; 22(4): 472 - 476. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Xie, W. Pan, and A. B. Khodursky A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data Bioinformatics, December 1, 2005; 21(23): 4280 - 4288. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||















































