Bioinformatics Advance Access originally published online on January 18, 2007
Bioinformatics 2007 23(6):732-738; doi:10.1093/bioinformatics/btl663
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Genomic characterization of multiple clinical phenotypes of cancer using multivariate linear regression models
1Department of Pharmacoepidemiology, Graduate School of Public Health, Kyoto University, Kyotom, 2Translational Research Informatics Center, Foundation for Biomedical Research and Innovation, Kobe, 3Department of Urology, Kyoto University Graduate School of Medicine, Kyoto, 4Division of Biostatistics, School of Pharmaceutical Sciences, Kitasato University, Tokyo, 5Department of Pathology, Kyoto University Graduate School of Medicine, Kyoto, Japan, 6Pacific Edge Biotechnology, University of Otago, Dunedin, 7Cancer Genetics Laboratory, Department of Biochemistry, University of Otago, Dunedin, New Zealand and 8Dvision of Clinical Trial Design and Management, Translational Research Center, Kyoto University Hospital, Kyoto, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The development of gene expression microarray technology has allowed the identification of differentially expressed genes between different clinical phenotypic classes of cancer from a large pool of candidate genes. Although many class comparisons concerned only a single phenotype, simultaneous assessment of the relationship between gene expression and multiple phenotypes would be warranted to better understand the underlying biological structure.
Results: We develop a method to select genes related to multiple clinical phenotypes based on a set of multivariate linear regression models. For each gene, we perform model selection based on the doubly-adjusted R-square statistic and use the maximum of this statistic for gene selection. The method can substantially improve the power in gene selection, compared with a conventional method that uses a single model exclusively for gene selection. Application to a bladder cancer study to correlate pre-treatment gene expressions with pathological stage and grade is given. The methods would be useful for screening for genes related to multiple clinical phenotypes.
Availability: SAS and MATLAB codes are available from author upon request.
Contact: matsui{at}pbh.med.kyoto-u.ac.jp
| 1 INTRODUCTION |
|---|
|
|
|---|
The development of comprehensive, gene expression microarray technology has allowed the identification of differentially expressed genes between different clinical phenotypic classes of cancer from a large pool of candidate genes. Although many of such class comparison studies assessed relation of gene expression with a single clinical phenotype, such as stage of cancer, assessment of relationship with multiple clinical phenotypes, e.g. stage and grade of cancer, would be more relevant to better understand the underlying biological structure.
In incorporating multiple phenotypic characteristics of cancer simultaneously, multivariate linear regression models or ANOVA-type models for each gene are a useful analytical tool by introducing phenotypes as covariates for the response variable of gene expression level, although many applications concern modelling of channel-specific intensities to reflect the experimental design using multichannel arrays (e.g., Kerr et al., 2000, 2001; Wolfinger et al., 2001; Dobbin and Simon, 2002) or oligonucleotide arrays (Chu et al., 2002). In correlating gene expression data with multiple phenotypes, several groups of genes with distinct differential patterns may exist. A group of genes may relate to only a single phenotype, e.g. stage or grade, while another group of genes may relate to multiple phenotypes, e.g. both stage and grade, possibly with their interaction.
A commonly used approach for selecting genes related to multiple phenotypes is to assume a full model that incorporates all phenotypes and their possible interactions, e.g. a model with stage, grade and their interaction as covariates, to cover all possible differential patterns for multiple phenotypes. Then, an overall test for the null hypothesis of no effect of all the covariates on gene expression is performed with some control for false positives (Chen et al., 2005; Matsui, 2006). A drawback to this approach is that, although the overall test could detect genes with any differential patterns, the power of this test may be smaller for genes for which reduced models are correct. One may want to detect genes with various differential models equally. Only a single model should not be fitted for all genes exclusively, unless a particular differential pattern is of interest.
One idea to improve the power for various differential models would be to invoke some model selection. Model selection may also be warranted to gain additional insights on selected genes. For example, a group of genes for which the same model is selected have the potential to relate to the same aspect of disease biology. One approach is to perform a structural hypothesis testing. For example, we first assume the full model with stage, grade and their interaction and test if there is an interaction effect between stage and grade for all genes. For those genes where an interaction exists, the full model can be selected. For those where there is not, the reduced model with just stage and grade can be fit and the main effects of stage and grade are tested separately. A critical issue of this approach, however, is how to determine the significance levels for these tests. One possible approach is to determine the significance levels based on a criterion to control an error rate on false positives. However, there could be multiple combinations of significant levels for main and interaction effects for a given level of an overall error rate. Apart from technical issues, one may also argue in the first place that model selection is a different problem of gene selection and hence another criterion should be used for model selection.
In this article, we develop a method to select genes related to multiple clinical phenotypes via model selection based on a set of candidate, multivariate linear regression models. For each gene, all candidate models are fit and a model is selected based on a criterion of prediction error. Specifically we select the model for which the doubly-adjusted R-square proposed by Okuno et al. (1977) is maximized. This model selection procedure can be expressed as model selection based on hypothesis testing described earlier, but the significance levels are determined from the perspective of prediction error (see Section 2). Also, this model selection criterion is approximately equal to popular criteria based on Mallow's Cp (Mallows, 1973) and AIC (Akaike, 1973). Meanwhile, the doubly-adjusted R-square is a reasonable statistic for gene selection because of its link to an overall test for the null hypothesis of no effect of all phenotypic variables on gene expression. Procedures to control the multiplicity from examining the large number of genes can be developed. A simulation study is conducted to evaluate the performance of the proposed method. Lastly, we illustrate the methods using a bladder cancer data set to correlate pre-treatment gene expression levels with pathological stage and grade.
| 2 METHODS |
|---|
|
|
|---|
2.1 Multivariate linear regression models
Assume values Yj,i for j = 1, ... , J genes from sample i = 1, ... , n. For cDNA arrays these are typically normalized log-ratios and for oligonucleotide arrays they are normalized log signals. For each gene, the expression data is linked to multiple phenotypes via multivariate linear regression models. We suppose that there are, in total, p variables, x1, ... , xp, for which their association with gene expression is of interest. Note that the set of variables may include those representing interaction effects of different phenotypes, for example, an interaction of two different phenotypes, xk1xk2, where xk1 and xk2 represent the respective phenotypes (k1
k2). We assume various candidate models using the set of all the p variables and subsets of it as covariates. When k covariates (1
k
p) are linked to the expression level for gene j, we assume the (candidate) linear model,
|
| (1) |
j,i, is assumed independently and identically distributed the normal distribution N(0,
j2).
2.2 The Doubly-adjusted R-square
In this section, we restrict our attention to analysis of a particular gene and subscript j is dropped for brevity. The doubly-adjusted R-square is related to the prediction sum of squares (PRESS) (Hocking, 1976), defined by PRESS =
, where
i is the predicted value of yi obtained after omitting sample i from the fitting process (i = 1, ... , n). The PRESS is a statistic related to prediction error. If the model is correctly specified, the expected value of PRESS reduces to (n + k + 1)
2. Using the commonly used, unbiased estimator for the parameter
2, RSS/(n – k – 1), the expected PRESS can be estimated by (n + k + 1)RSS/(n – k – 1).
The doubly-adjusted R-square is defined by one minus the ratio of estimated expected prediction errors, the expected prediction error under the linear model with k covariates and that under the linear model with no covariate (k = 0) (Okuno et al., 1977),
|
|
|
| (2) |
|
| (3) |
i is the fitted value for yi and
The statistic
has a one-to-one correspondence with an F-statistic for the overall test of the null assumption that all of the k regression coefficients are zero. Note that this null assumption represents no effects of all the p variables because no effects of the other p – k variables are assumed in model (1). The F-statistic is given by
|
| (4) |
|
| (5) |
For each gene, we select the model for which
is maximized over candidate models. Consider comparison of the two models, one with k covariates and the other with (k + 1) covariates after the entry of one additional covariate into the model with k covariates. The model with (k + 1) covariates will be selected if
. Another popular criterion for model selection is based on the partial F-statistic, which can be expressed as
|
|
|
|
|
|
|
|
Cp(k) or AIC(k + 1)
AIC(k). When the unbiased estimate for
2 under the model with (k + 1) covariates, RSS(k+1)/(n – k – 2), is replaced with
2, these criteria may correspond to |
|
2.3 Selection of model and gene
We invoke model selection for each gene before gene selection. For each gene, we perform model selection based on
, i.e. selecting the model for which is maximized. Then, the maximum
is used as the statistic for gene selection, and genes with the greatest values of the maximum
are selected. A rationale for using
for gene selection is related to its correspondence with the overall F-test of no relations between all the p variables and gene expression as noted in the previous section. For genes for which this null assumption is incorrect,
for some candidate models or the maximum of
across candidate models may tend to have larger values, compared with genes for which the null assumption is correct. This tendency may be true for genes with any differential pattern for gene expression as long as this is captured by one of candidate models. Thus, our procedure is to select genes related to multiple phenotypes irrespective of their differential patterns via the optimization in gene selection. Note that other model selection criteria such as Cp and AIC may not have the correspondence with the overall test because of standardization across candidate models for each gene, which would be a drawback for using these statistics for gene selection. For example, if all p variables are used as covariates in estimating
2 for each gene, Cp for the model with k covariates may reduce to 2k – p + 1 for all genes.
The multiplicity issue by repeated examinations across genes can be adjusted by controlling the family-wise error rate (FWER) or the false discovery rate (FDR), where the FWER is the probability of making one or more false positives and the FDR is the expected proportion of false positives among the positives declared (Dudoit et al., 2003). Note that the control of these error rates does not relate to individual effects but to the overall test of the null assumption of no effects for all the p variables. Many authors now favour the FDR over the FWER as the appropriate error measure for exploratory microarray studies because the FWER approach can be very conservative. A practical and conceptually simple approach to providing procedures to control the FDR conservatively is to fix the rejection region beforehand and to estimate the FDR (Storey, 2002). A gene is declared to be significant if the maximum
for the gene is equal to or greater than a cut-off point on this statistic. For a given cut-off point, we can estimate the FDR conservatively by a multivariate permutation procedure with random permutations of the assignment of the set of all p variables to derive the null distribution of the maximum
, from which the average of (false) positives can be calculated as an estimate of the expected false positives. For controlling FWER, the null distribution of the maximum of the maximum
across genes is derived from the random permutations, which is referred to the observed value of the maximum
to obtain an adjusted P-value.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Simulation
We assessed adequacy of the proposed method through a small simulation study. Particularly we compared the power of our method with that of the conventional method that uses a single model with all p variables as covariates exclusively and select genes based on the overall F-test for this model. We considered three distinct phenotypes. The state of each phenotype was dichotomized, and let x1, x2 and x3 be binary variables representing respective phenotypes. Let xh = 1 for a particular state of phenotype and xh = 0 for the other states (h = 1, 2, 3). Thus, there are eight combinations in total for the level of the three phenotypes, (x1, x2, x3) = (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0) and (1, 1, 1). We considered six samples for each combination (the total number of samples was 48).
We assessed the power for selecting a single informative gene which is related to phenotypes. We considered 1000 genes and, of the 1000 genes, 1 gene was informative and the other 999 were non-informative (0.1% of the whole gene is informative). For the informative gene, we assumed the model {1}, {1, 2}, {1, 2, 1x2}, {1, 2, 3} or {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}. For example, the model {1, 2, 1x2} is to a linear model with the three covariates, x1, x2 and x1x2. The error term (
j,i in(1)) is assumed independently and identically distributed the standard normal distribution for all genes. The size of regression coefficients ranged from 0.0 to 2.0. For the informative gene with the model, {1, 2, 1x2}, we assumed the differential patterns where the samples with one of the four combinations for the level of the two binary phenotypic variables, x1 and x2, e.g. samples with (x1, x2) = (0, 0), irrespective of the level of x3, had higher expressions compared with the other samples. Similarly, for the informative gene with the model, {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}, we assumed the differential patterns where the samples with one of the eight combinations for the level of the three binary phenotypic variables, x1, x2 and x3, e.g. samples with (x1, x2, x3) = (0, 0, 0), had higher expressions compared with the other samples.
For each simulation configuration, we performed the proposed method and the conventional method. For the proposed method, we introduced the four variables representing interaction effects, x1x2, x1x3, x2x3 and x1x2x3, in addition to the three variables, x1, x2 and x3, as covariates. For the seven covariates, we assumed the following 12 candidate models, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 1x2}, {1, 3, 1x3}, {2, 3, 2x3}, {1, 2, 3}, {1, 2, 3, 1x2, 1x3, 2x3} and {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}. For the conventional method, we employed only the overall F-test for the full model, {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}, for gene selection. For each procedure, we control the FWER. A null distribution of the maximum of statistic (the maximum
for the proposed method and the F-statistic for the conventional method) was obtained from 500 permutations, which was referred to the observed value of statistic to obtain an adjusted P-value. For each of a selected set of regression coefficients, Table 1 summarizes empirical power for the informative gene obtained from 500 simulations at the significance level of 10% for the adjusted P-value. We first confirmed that the empirical FWER was almost equal to the nominal level (data not shown). The empirical power of the proposed method can be substantially greater than that of the conventional method when models other than the full model are correct for the informative gene. The conventional method outperformed the proposed method when the full model is correct for the informative gene. Although this could be explained by error in model selection, the power loss was relatively small in our simulation setting.
|
3.2 The bladder cancer example
Tumour-biopsy tissues were collected at diagnosis from 48 patients with bladder transitional cell carcinomas at the Department of Urology, Kyoto University Graduate School of Medicine between 1990 and 2003. Patients were selected on the basis of availability of tumour-biopsy samples, without regard to clinical information. Microarray experiments were performed for frozen tissues using a cDNA array that contains printings of approximately 30 000 oligonucleotides fabricated by Pacific Edge Biotechnology Limited in New Zealand. In each hybridization, fluorescent cDNA targets were prepared from a tumour mRNA sample and a reference mRNA sample contracted from a pool of cell lines of different cancers. Two replicated arrays made from the same sample of RNA (technical replicates) without reverse labelling or dye swap were averaged to improve precision of the estimate of the expression profile for a given RNA sample. After image analysis and spot filtering, the data were normalized by a locally weighted linear regression method. We excluded genes that had low variance of log ratio expressions across samples, because the observed variability for genes with low variance is more likely to be due to measurement noise than actual biological variability (e.g. Simon et al., 2004). We selected six thousands genes with variance in the top 20th percentile.
As clinical phenotypes, cancer stage and grade from pathological reports were confirmed by a pathologist. We classified stage into two categories with distinct treatment modalities, superficial (
pT1) and invasive tumours (>pT1), to understand the underlying biological difference related to therapeutics. Generally, superficial tumours are treated with transurethral surgery or intravesical instillation therapy, while invasive tumours are treated with radical cystectomy or systemic chemotherapy. Of 48 patients, we had 31 superficial and 17 invasive patients. We also classified grade into two categories, low grade (grade 1 or 2) and more malignant, high grade (grade 3). Of the 31 superficial tumours, 25 and 6 were low and high grade, respectively. Of the 17 invasive tumours, 5 and 12 were low and high grade, respectively.
Although several authors have been interested in genes differentially expressed between stage classes (e.g. Sanchez-Carbayo et al., 2003; Modlich et al., 2004), it would seem more likely that multiple factors are related to gene expression simultaneously. We assumed a multivariate linear regression model for each gene. For a single gene, let Yi be the log-ratio of gene expression measurement for patient i (i = 1, ... , 48). Let stagei be a binary variable that takes 1 if the stage of patient i is invasive and zero otherwise. Let gradei be a binary variable that takes 1 if the grade of patient i is high and zero otherwise. We also consider an interaction term of stage and grade, stageixgradei. For log-transformed intensity ratios, we assume the following four candidate linear models for differential expression,
|
|
0) in M4, which may yield a linear model with a single interaction effect and no main effects. These differential patterns could be identified based on the estimates of parameter coefficients for genes for which M4 is selected as illustrated below.
Table 2 shows an estimate of FDR for each of several cut-off points for the maximum
obtained by the multivariate permutation procedure with 2000 permutations. We selected the top 100 genes, so that the FDR was around 15%. Table 3 shows classification of 100 selected genes based on selected models and the sign of estimated regression coefficients in selected model. Note that, out of the top 100 genes, the same model was selected for 98 genes based on AIC, as expected by the equality of the criterion on
with that on AIC as noted in Section 2. Seventeen genes were related to only stage via the model M1; 9 (8) had negative (positive) estimates of the regression coefficient β1 for stage in M1. In other words, the 9 (8) genes were under-(over-) expressed for invasive cancer. Thirteen genes were related to only grade via the model M2; 10 (3) were under-(over-) expressed for high grade cancer. Interestingly, the rest 70 genes were related to both stage and grade; 39 genes via the model M3 without the interaction term of stage and grade and 31 genes via the model M4 with the interaction term. Of the 39 genes for which the model M3 was selected, negative values for both β1 and β2 (under-expression for invasive or high grade cancer) were suggested for 37 genes, while positive values for both β1 and β2 (over-expression for invasive or high grade tumours) for two genes. Of the 31 genes for which the model M4 was selected, 15 (16) genes had negative (positive) estimates for β3, i.e. the interaction of stage and grade.
|
|
Figure 1 shows typical mean expression profiles for the selected genes with negative (Fig. 1A and 1C) or positive interaction (Fig. 1B and 1D). Figures 1A and 1B indicate differential expression between invasive cancer with high grade (the worst type) and the others, while Figures 1C and D indicate differential expression between superficial cancer with low grade (the best type) and the others. In general, selected genes for which under-expression was linked to more progressed cancer, i.e. invasive and/or high grade, involved putative tumour suppressor genes. A gene related to tumour protein p53, which is a well-known tumour suppressor, was related to grade via the model M2 and the estimate for β1 (the regression coefficient for grade) was negative. Glutathione peroxidase (GPX) 2, which is a member of oxidoreductase and the destruction of it increased the genesis of bacteria-induced intestinal cancer (Chu et al., 2004), was related to both stage and grade via the model M3 and the estimates for β1 and β2 (the regression coefficient for stage and grade, respectively) were negative. With respect to genes for which the model M4 was selected, putative tumour suppressor genes, RNA binding motif protein 3 (RBM3) (Sutherland et al., 2005) in Figure 1A and CD-81 (Koenig-Hoffmann et al., 2005) in Figure 1D were under-expressed for the worst type of cancer (invasive and high grade) and for cancers other than the best type (superficial and low grade), respectively.
|
On the other hand, selected genes for which over-expression was linked to more progressed cancer are suggested to act like as oncogenes. Amphiregulin in Figure 1B acts as a protease of the extracellular matrix (ECM) with induction of extracellular matrix metalloproteinase (MMP) –2 and –9, and the expression level was associated with survival of bladder cancer patients (Menashi et al., 2003; Thogersen et al., 2001). Highly aggressive invasive cancers are characterized by their activity of inducing break down of cell– cell or cell– matrix adhesion, modulation of ECM proteolysis and induction of angiogenesis. A family of human endogenous retroviruses (HERVs), HERV-H, in Figure 1C was highly expressed in bladder cancer (Stauffer et al., 2004). Amphiregulin in Figure 1B and HERV-H in Figure 1C were over-expressed for the worst type of cancer (invasive and high grade) and for cancers other than the best type (superficial and low grade), respectively.
| DISCUSSION |
|---|
|
|
|---|
We have developed a method to select genes related to multiple clinical phenotypes via model selection based on a set of multivariate linear regression models. As indicated by the simulation study, our methods with model selection based on the doubly-adjusted R-square,
Because of the large number of genes in microarray experiments, there will always be some genes with a very small sum of squares across samples, so that the value of an overall test statistic will be very large whether or not their averages are large. Tusher et al. (2001) have proposed a regularization that avoids this difficulty. For the overall F-test in (4), this regularization may be expressed as
|
|
is a fudge factor whose value is determined to reduce the dependence of F(
) versus
(Tusher et al., 2001). As a simple modification of our methods, we obtain a regularized version of the doubly-adjusted R-square by replacing the usual F-statistic with this regularized F-statistic in (2), (3) and (5). This regularized statistic would be reasonable for gene selection because of its one-to-one correspondence with F(
). However, for model selection, one may prefer the criterion of maximizing the original doubly-adjusted R-square to that of maximizing the regularized one because of the equality with popular criteria based on Cp or AIC. Hence, we first select the model for which the original doubly-adjusted R-square is maximized for each gene, followed by gene selection based on the regularized doubly-adjusted R-square for the selected model. We derived this regularization in a heuristic way and its theoretical justification is subject to future research. The approaches to borrow strength across genes via hierarchical modelling (e.g. Baldi and Long, 2001; Efron et al., 2001; Wright and Simon, 2003) may provide a framework for this. If clinical variables are highly correlated in a set of samples, then the models chosen for individual genes are likely to be highly unstable because of lack of power to discriminate between the different effects. One should select a set of samples to be well balanced especially when specimen banks from large cohort studies or clinical trials are available.
Determination of the set of candidate models is an important issue. One can assume, in principle, many candidate models, for example, by using different categorizing or higher order effects for phenotypic variables or by imposing constraints for regression coefficients to reflect a particular differential pattern such as the constraint in the model M4 described in the bladder cancer example (see Section 3.2). However, a large set of candidate models may suffer from error in model selection seriously. A good strategy would be to prepare a small number of plausible models with biologically different interpretations and one very flexible model to capture other differential patterns (such as the model M4 in the bladder cancer example).
We illustrated the methods using the bladder cancer data. We selected 100 genes with various patterns of differential expression for stage and/or grade (Table 3). Different models or differential expression patterns suggested from Table 3 have the potential to reflect different biologic aspects in stage and grade progressions of bladder cancer. Although some selected genes in our analyses have already been identified and studied for bladder cancer and other cancer, many genes were newly identified by our analyses. These genes have the potential to provide new insights on molecular alternation related to disease biology and progressiveness in bladder cancer. A fraction of these genes are now under investigation. As an initial step, we confirmed reproducibility of differential expression for six selected genes which were related to RNA processing and/or translation (1, 3 and 2 genes for which the model M1, M3 and M4 was selected, respectively), but have not been studied for bladder cancer, for additional 74 samples by quantitative real time PCR assays (M. Ito, unpublished data).
Our methods could also indicate candidate molecular markers for more accurate classification of cancer. For example, genes for which the linear model, M1 or M2, were selected could supplement traditional pathological assays in determining stage or grade. Besides, our methods could help in developing new approaches to cancer diagnosis incorporating multiple diagnostic factors. For example, genes for which the model M4 with the interaction effects of stage and grade was selected (e.g. the four genes in Fig. 1) could be candidate genes useful for identifying the best (superficial and low grade) or the worst (invasive and high grade) type of bladder cancer. In such attempts, link with outcome or prognosis information should also be investigated.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on October 2, 2006; revised on November 27, 2006; accepted on December 25, 2006
| REFERENCES |
|---|
|
|
|---|
Akaike H. Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, —Petrov BN, Caski F, eds. ( (1973) ) Budapest: Akademia Kiado. 267–281..
Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, ( (2001) ) 17, : 509–519.
Chen D, et al. Selecting genes by test statistics. J. Biomed. Biotechnol., ( (2005) ) 2005, : 132–138.[CrossRef][Medline].
Chu FF, et al. Bacteria-induced intestinal cancer in mice with disrupted Gpx1 and Gpx2 genes. Cancer Res., ( (2004) ) 64, : 962–968.
Chu TM, et al. A systematic statistical linear modeling approach to oligonucleotide array experiments. Math. Biosci., ( (2002) ) 176, : 35–51.[CrossRef][ISI][Medline].
Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery. Bioinformatics, ( (2002) ) 18, : 1438–1445.
Dudoit S, et al. Multiple hypothesis testing in microarray experiments. Stat. Sci., ( (2003) ) 18, : 71–103.[CrossRef][ISI].
Efron B, et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., ( (2001) ) 96, : 1151–1160.[CrossRef][ISI].
Hocking RR. The analysis and selection of variables in linear regression. Biometrics, ( (1976) ) 32, : 1–49.[CrossRef][ISI].
Kerr MK, Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet. Res., ( (2001) ) 77, : 123–128.[CrossRef][ISI][Medline].
Kerr MK, et al. Analysis of variance for gene expression microarray data. J. Comput. Biol., ( (2000) ) 7, : 819–837.[CrossRef][ISI][Medline].
Koenig-Hoffmann K, et al. High throughput functional genomics: identification of novel genes with tumor suppressor phenotypes. Int. J. Cancer, ( (2005) ) 113, : 434–439.[CrossRef][ISI][Medline].
Mallows CL. Some comments on Cp. Technometrics, ( (1973) ) 15, : 661–676.[CrossRef][ISI].
Matsui S. Statistical applications using DNA microarrays for cancer diagnosis and prognosis. In: Handbook of Statistics in Clinical Oncology, —Crowley JJ, Ankerst DP, eds. ( (2006) ) 2nd. Boca Raton: CRC Press. 419–436..
Menashi S, et al. Regulation of extracellular matrix metalloproteinase inducer and matrix metalloproteinase expression by Amphiregulin in transformed human breast epithelial cells. Cancer Res., ( (2003) ) 63, : 7575–7580.
Modlich O, et al. Identifying superficial, muscle-invasive, and metastasizing transitional cell carcinoma of the bladder: use of cDNA array analysis of gene expression profiles. Clin. Cancer. Res., ( (2004) ) 10, : 3410–3421.
Okuno T, et al. Prediction sum of squares, Akaike's information criteria and doubly-adjusted multiple correlation coefficient. Bull. Int. Stat. Inst., ( (1977) ) 47, : 370–373..
Sanchez-Carbayo M, et al. Gene discovery in bladder cancer progression using cDNA microarrays. Am. J. Pathol., ( (2003) ) 163, : 505–516.
Simon R, et al. Design and Analysis of DNA Microarray Investigations, ( (2004) ) New York: Springer..
Stauffer Y, et al. Digital expression profiles of human endogenous retroviral families in normal and cancerous tissues. Cancer Immun., ( (2004) ) 4, : 2.[Medline].
Storey JD. A direct approach to false discovery rates. J. Roy. Stati. Soc., ( (2002) ) B64, : 479–498.[CrossRef].
Sutherland LC, et al. RNA binding motif (RBM) proteins: a novel family of apoptosis modulators? J. Cell Biochem., ( (2005) ) 94, : 5–24.[CrossRef][ISI][Medline].
Thogersen VB, et al. A subclass of HER1 ligands are prognostic markers for survival in bladder cancer patients. Cancer Res., ( (2001) ) 61, : 6227–6233.
Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, ( (2001) ) 98, : 5116–5121.
Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics, ( (2003) ) 19, : 2448–2455.
Wolfinger RD, et al. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol., ( (2001) ) 8, : 625–637.[CrossRef][ISI][Medline].
This article has been cited by other articles:
![]() |
H. Kawanishi, Y. Matsui, M. Ito, J. Watanabe, T. Takahashi, K. Nishizawa, H. Nishiyama, T. Kamoto, Y. Mikami, Y. Tanaka, et al. Secreted CXCL1 Is a Potential Mediator and Marker of the Tumor Invasion of Bladder Cancer Clin. Cancer Res., May 1, 2008; 14(9): 2579 - 2587. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Loscalzo Association Studies in an Era of Too Much Information: Clinical Analysis of New Biomarker and Genetic Data Circulation, October 23, 2007; 116(17): 1866 - 1870. [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



