Skip Navigation


Bioinformatics Advance Access originally published online on January 18, 2007
Bioinformatics 2007 23(6):732-738; doi:10.1093/bioinformatics/btl663
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/6/732    most recent
btl663v3
btl663v2
btl663v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Matsui, S.
Right arrow Articles by Ogawa, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Matsui, S.
Right arrow Articles by Ogawa, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Genomic characterization of multiple clinical phenotypes of cancer using multivariate linear regression models

Shigeyuki Matsui 1,2,*, Masaaki Ito 3, Hiroyuki Nishiyama 3, Hajime Uno 4, Hirokazu Kotani 5, Jun Watanabe 3, Parry Guilford 6, Anthony Reeve 7, Masanori Fukushima 8,2 and Osamu Ogawa 3

1Department of Pharmacoepidemiology, Graduate School of Public Health, Kyoto University, Kyotom, 2Translational Research Informatics Center, Foundation for Biomedical Research and Innovation, Kobe, 3Department of Urology, Kyoto University Graduate School of Medicine, Kyoto, 4Division of Biostatistics, School of Pharmaceutical Sciences, Kitasato University, Tokyo, 5Department of Pathology, Kyoto University Graduate School of Medicine, Kyoto, Japan, 6Pacific Edge Biotechnology, University of Otago, Dunedin, 7Cancer Genetics Laboratory, Department of Biochemistry, University of Otago, Dunedin, New Zealand and 8Dvision of Clinical Trial Design and Management, Translational Research Center, Kyoto University Hospital, Kyoto, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 DISCUSSION
 REFERENCES
 

Motivation: The development of gene expression microarray technology has allowed the identification of differentially expressed genes between different clinical phenotypic classes of cancer from a large pool of candidate genes. Although many class comparisons concerned only a single phenotype, simultaneous assessment of the relationship between gene expression and multiple phenotypes would be warranted to better understand the underlying biological structure.

Results: We develop a method to select genes related to multiple clinical phenotypes based on a set of multivariate linear regression models. For each gene, we perform model selection based on the doubly-adjusted R-square statistic and use the maximum of this statistic for gene selection. The method can substantially improve the power in gene selection, compared with a conventional method that uses a single model exclusively for gene selection. Application to a bladder cancer study to correlate pre-treatment gene expressions with pathological stage and grade is given. The methods would be useful for screening for genes related to multiple clinical phenotypes.

Availability: SAS and MATLAB codes are available from author upon request.

Contact: matsui{at}pbh.med.kyoto-u.ac.jp


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 DISCUSSION
 REFERENCES
 
The development of comprehensive, gene expression microarray technology has allowed the identification of differentially expressed genes between different clinical phenotypic classes of cancer from a large pool of candidate genes. Although many of such class comparison studies assessed relation of gene expression with a single clinical phenotype, such as stage of cancer, assessment of relationship with multiple clinical phenotypes, e.g. stage and grade of cancer, would be more relevant to better understand the underlying biological structure.

In incorporating multiple phenotypic characteristics of cancer simultaneously, multivariate linear regression models or ANOVA-type models for each gene are a useful analytical tool by introducing phenotypes as covariates for the response variable of gene expression level, although many applications concern modelling of channel-specific intensities to reflect the experimental design using multichannel arrays (e.g., Kerr et al., 2000, 2001; Wolfinger et al., 2001; Dobbin and Simon, 2002) or oligonucleotide arrays (Chu et al., 2002). In correlating gene expression data with multiple phenotypes, several groups of genes with distinct differential patterns may exist. A group of genes may relate to only a single phenotype, e.g. stage or grade, while another group of genes may relate to multiple phenotypes, e.g. both stage and grade, possibly with their interaction.

A commonly used approach for selecting genes related to multiple phenotypes is to assume a full model that incorporates all phenotypes and their possible interactions, e.g. a model with stage, grade and their interaction as covariates, to cover all possible differential patterns for multiple phenotypes. Then, an overall test for the null hypothesis of no effect of all the covariates on gene expression is performed with some control for false positives (Chen et al., 2005; Matsui, 2006). A drawback to this approach is that, although the overall test could detect genes with any differential patterns, the power of this test may be smaller for genes for which reduced models are correct. One may want to detect genes with various differential models equally. Only a single model should not be fitted for all genes exclusively, unless a particular differential pattern is of interest.

One idea to improve the power for various differential models would be to invoke some model selection. Model selection may also be warranted to gain additional insights on selected genes. For example, a group of genes for which the same model is selected have the potential to relate to the same aspect of disease biology. One approach is to perform a structural hypothesis testing. For example, we first assume the full model with stage, grade and their interaction and test if there is an interaction effect between stage and grade for all genes. For those genes where an interaction exists, the full model can be selected. For those where there is not, the reduced model with just stage and grade can be fit and the main effects of stage and grade are tested separately. A critical issue of this approach, however, is how to determine the significance levels for these tests. One possible approach is to determine the significance levels based on a criterion to control an error rate on false positives. However, there could be multiple combinations of significant levels for main and interaction effects for a given level of an overall error rate. Apart from technical issues, one may also argue in the first place that model selection is a different problem of gene selection and hence another criterion should be used for model selection.

In this article, we develop a method to select genes related to multiple clinical phenotypes via model selection based on a set of candidate, multivariate linear regression models. For each gene, all candidate models are fit and a model is selected based on a criterion of prediction error. Specifically we select the model for which the doubly-adjusted R-square proposed by Okuno et al. (1977) is maximized. This model selection procedure can be expressed as model selection based on hypothesis testing described earlier, but the significance levels are determined from the perspective of prediction error (see Section 2). Also, this model selection criterion is approximately equal to popular criteria based on Mallow's Cp (Mallows, 1973) and AIC (Akaike, 1973). Meanwhile, the doubly-adjusted R-square is a reasonable statistic for gene selection because of its link to an overall test for the null hypothesis of no effect of all phenotypic variables on gene expression. Procedures to control the multiplicity from examining the large number of genes can be developed. A simulation study is conducted to evaluate the performance of the proposed method. Lastly, we illustrate the methods using a bladder cancer data set to correlate pre-treatment gene expression levels with pathological stage and grade.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 DISCUSSION
 REFERENCES
 
2.1 Multivariate linear regression models
Assume values Yj,i for j = 1, ... , J genes from sample i = 1, ... , n. For cDNA arrays these are typically normalized log-ratios and for oligonucleotide arrays they are normalized log signals. For each gene, the expression data is linked to multiple phenotypes via multivariate linear regression models. We suppose that there are, in total, p variables, x1, ... , xp, for which their association with gene expression is of interest. Note that the set of variables may include those representing interaction effects of different phenotypes, for example, an interaction of two different phenotypes, xk1xk2, where xk1 and xk2 represent the respective phenotypes (k1 != k2). We assume various candidate models using the set of all the p variables and subsets of it as covariates. When k covariates (1 ≤ k ≤ p) are linked to the expression level for gene j, we assume the (candidate) linear model,


Formula 1

(1)
where β0, β1, ... , βk are regression coefficients. The error term, {varepsilon}j,i, is assumed independently and identically distributed the normal distribution N(0, {sigma}j2).

2.2 The Doubly-adjusted R-square
In this section, we restrict our attention to analysis of a particular gene and subscript j is dropped for brevity. The doubly-adjusted R-square is related to the prediction sum of squares (PRESS) (Hocking, 1976), defined by PRESS = Formula , where yi is the predicted value of yi obtained after omitting sample i from the fitting process (i = 1, ... , n). The PRESS is a statistic related to prediction error. If the model is correctly specified, the expected value of PRESS reduces to (n + k + 1){sigma}2. Using the commonly used, unbiased estimator for the parameter {sigma}2, RSS/(nk – 1), the expected PRESS can be estimated by (n + k + 1)RSS/(nk 1).

The doubly-adjusted R-square is defined by one minus the ratio of estimated expected prediction errors, the expected prediction error under the linear model with k covariates and that under the linear model with no covariate (k = 0) (Okuno et al., 1977),


Formula

or


Formula 2

(2)
where


Formula 3

(3)
denotes the coefficient of multiple determination or the R-square statistic. Here Formula is the total sum of squares and Formula is the residual sum of squares, where yi is the fitted value for yi and Formula is the average of gene expression across all samples. For Formula , the expected PRESS is used, instead of the usual sum of squares such as RSS and TSS, which are used for R2 in (3).

The statistic Formula has a one-to-one correspondence with an F-statistic for the overall test of the null assumption that all of the k regression coefficients are zero. Note that this null assumption represents no effects of all the p variables because no effects of the other p k variables are assumed in model (1). The F-statistic is given by


Formula 4

(4)
which has the relation with R2,


Formula 5

(5)
As such, Formula has correspondence with the F-statistic through (2).

For each gene, we select the model for which Formula is maximized over candidate models. Consider comparison of the two models, one with k covariates and the other with (k + 1) covariates after the entry of one additional covariate into the model with k covariates. The model with (k + 1) covariates will be selected if Formula . Another popular criterion for model selection is based on the partial F-statistic, which can be expressed as


Formula

The criterion Formula corresponds to


Formula

when n >> k (Okuno et al., 1977). Another popular criterion is based on Mallow's Cp (Mallows, 1973),


Formula

for the model with k covariates. The AIC (Akaike, 1973) is also a popular statistic for model selection, although assumption for the error term distribution is needed. For normally distributed error terms with known variance, AIC can be expressed as


Formula

for the model with k covariates, which is identical with Cp. The model with (k + 1) covariates may be selected if Cp(k + 1) ≤ Cp(k) or AIC(k + 1) ≤ AIC(k). When the unbiased estimate for {sigma}2 under the model with (k + 1) covariates, RSS(k+1)/(nk – 2), is replaced with {sigma}2, these criteria may correspond to


Formula

As such, the criterion based on Formula is approximately equal to the criteria based on Cp or AIC.

2.3 Selection of model and gene
We invoke model selection for each gene before gene selection. For each gene, we perform model selection based on Formula , i.e. selecting the model for which is maximized. Then, the maximum Formula is used as the statistic for gene selection, and genes with the greatest values of the maximum Formula are selected. A rationale for using Formula for gene selection is related to its correspondence with the overall F-test of no relations between all the p variables and gene expression as noted in the previous section. For genes for which this null assumption is incorrect, Formula for some candidate models or the maximum of Formula across candidate models may tend to have larger values, compared with genes for which the null assumption is correct. This tendency may be true for genes with any differential pattern for gene expression as long as this is captured by one of candidate models. Thus, our procedure is to select genes related to multiple phenotypes irrespective of their differential patterns via the optimization in gene selection. Note that other model selection criteria such as Cp and AIC may not have the correspondence with the overall test because of standardization across candidate models for each gene, which would be a drawback for using these statistics for gene selection. For example, if all p variables are used as covariates in estimating {sigma}2 for each gene, Cp for the model with k covariates may reduce to 2kp + 1 for all genes.

The multiplicity issue by repeated examinations across genes can be adjusted by controlling the family-wise error rate (FWER) or the false discovery rate (FDR), where the FWER is the probability of making one or more false positives and the FDR is the expected proportion of false positives among the positives declared (Dudoit et al., 2003). Note that the control of these error rates does not relate to individual effects but to the overall test of the null assumption of no effects for all the p variables. Many authors now favour the FDR over the FWER as the appropriate error measure for exploratory microarray studies because the FWER approach can be very conservative. A practical and conceptually simple approach to providing procedures to control the FDR conservatively is to fix the rejection region beforehand and to estimate the FDR (Storey, 2002). A gene is declared to be significant if the maximum Formula for the gene is equal to or greater than a cut-off point on this statistic. For a given cut-off point, we can estimate the FDR conservatively by a multivariate permutation procedure with random permutations of the assignment of the set of all p variables to derive the null distribution of the maximum Formula , from which the average of (false) positives can be calculated as an estimate of the expected false positives. For controlling FWER, the null distribution of the maximum of the maximum Formula across genes is derived from the random permutations, which is referred to the observed value of the maximum Formula to obtain an adjusted P-value.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 DISCUSSION
 REFERENCES
 
3.1 Simulation
We assessed adequacy of the proposed method through a small simulation study. Particularly we compared the power of our method with that of the conventional method that uses a single model with all p variables as covariates exclusively and select genes based on the overall F-test for this model. We considered three distinct phenotypes. The state of each phenotype was dichotomized, and let x1, x2 and x3 be binary variables representing respective phenotypes. Let xh = 1 for a particular state of phenotype and xh = 0 for the other states (h = 1, 2, 3). Thus, there are eight combinations in total for the level of the three phenotypes, (x1, x2, x3) = (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0) and (1, 1, 1). We considered six samples for each combination (the total number of samples was 48).

We assessed the power for selecting a single informative gene which is related to phenotypes. We considered 1000 genes and, of the 1000 genes, 1 gene was informative and the other 999 were non-informative (0.1% of the whole gene is informative). For the informative gene, we assumed the model {1}, {1, 2}, {1, 2, 1x2}, {1, 2, 3} or {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}. For example, the model {1, 2, 1x2} is to a linear model with the three covariates, x1, x2 and x1x2. The error term ({varepsilon}j,i in(1)) is assumed independently and identically distributed the standard normal distribution for all genes. The size of regression coefficients ranged from 0.0 to 2.0. For the informative gene with the model, {1, 2, 1x2}, we assumed the differential patterns where the samples with one of the four combinations for the level of the two binary phenotypic variables, x1 and x2, e.g. samples with (x1, x2) = (0, 0), irrespective of the level of x3, had higher expressions compared with the other samples. Similarly, for the informative gene with the model, {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}, we assumed the differential patterns where the samples with one of the eight combinations for the level of the three binary phenotypic variables, x1, x2 and x3, e.g. samples with (x1, x2, x3) = (0, 0, 0), had higher expressions compared with the other samples.

For each simulation configuration, we performed the proposed method and the conventional method. For the proposed method, we introduced the four variables representing interaction effects, x1x2, x1x3, x2x3 and x1x2x3, in addition to the three variables, x1, x2 and x3, as covariates. For the seven covariates, we assumed the following 12 candidate models, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 1x2}, {1, 3, 1x3}, {2, 3, 2x3}, {1, 2, 3}, {1, 2, 3, 1x2, 1x3, 2x3} and {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}. For the conventional method, we employed only the overall F-test for the full model, {1, 2, 3, 1x2, 1x3, 2x3, 1x2x3}, for gene selection. For each procedure, we control the FWER. A null distribution of the maximum of statistic (the maximum Formula for the proposed method and the F-statistic for the conventional method) was obtained from 500 permutations, which was referred to the observed value of statistic to obtain an adjusted P-value. For each of a selected set of regression coefficients, Table 1 summarizes empirical power for the informative gene obtained from 500 simulations at the significance level of 10% for the adjusted P-value. We first confirmed that the empirical FWER was almost equal to the nominal level (data not shown). The empirical power of the proposed method can be substantially greater than that of the conventional method when models other than the full model are correct for the informative gene. The conventional method outperformed the proposed method when the full model is correct for the informative gene. Although this could be explained by error in model selection, the power loss was relatively small in our simulation setting.


View this table:
[in this window]
[in a new window]

 
Table 1. Empirical power (%) and its standard error in parenthesis of the proposed method and the conventional method for each of a selected set of regression coefficients in differential model

 
3.2 The bladder cancer example
Tumour-biopsy tissues were collected at diagnosis from 48 patients with bladder transitional cell carcinomas at the Department of Urology, Kyoto University Graduate School of Medicine between 1990 and 2003. Patients were selected on the basis of availability of tumour-biopsy samples, without regard to clinical information. Microarray experiments were performed for frozen tissues using a cDNA array that contains printings of approximately 30 000 oligonucleotides fabricated by Pacific Edge Biotechnology Limited in New Zealand. In each hybridization, fluorescent cDNA targets were prepared from a tumour mRNA sample and a reference mRNA sample contracted from a pool of cell lines of different cancers. Two replicated arrays made from the same sample of RNA (technical replicates) without reverse labelling or dye swap were averaged to improve precision of the estimate of the expression profile for a given RNA sample. After image analysis and spot filtering, the data were normalized by a locally weighted linear regression method. We excluded genes that had low variance of log ratio expressions across samples, because the observed variability for genes with low variance is more likely to be due to measurement noise than actual biological variability (e.g. Simon et al., 2004). We selected six thousands genes with variance in the top 20th percentile.

As clinical phenotypes, cancer stage and grade from pathological reports were confirmed by a pathologist. We classified stage into two categories with distinct treatment modalities, superficial (less double equals pT1) and invasive tumours (>pT1), to understand the underlying biological difference related to therapeutics. Generally, superficial tumours are treated with transurethral surgery or intravesical instillation therapy, while invasive tumours are treated with radical cystectomy or systemic chemotherapy. Of 48 patients, we had 31 superficial and 17 invasive patients. We also classified grade into two categories, low grade (grade 1 or 2) and more malignant, high grade (grade 3). Of the 31 superficial tumours, 25 and 6 were low and high grade, respectively. Of the 17 invasive tumours, 5 and 12 were low and high grade, respectively.

Although several authors have been interested in genes differentially expressed between stage classes (e.g. Sanchez-Carbayo et al., 2003; Modlich et al., 2004), it would seem more likely that multiple factors are related to gene expression simultaneously. We assumed a multivariate linear regression model for each gene. For a single gene, let Yi be the log-ratio of gene expression measurement for patient i (i = 1, ... , 48). Let stagei be a binary variable that takes 1 if the stage of patient i is invasive and zero otherwise. Let gradei be a binary variable that takes 1 if the grade of patient i is high and zero otherwise. We also consider an interaction term of stage and grade, stageixgradei. For log-transformed intensity ratios, we assume the following four candidate linear models for differential expression,


Formula

where µi = E(Yi) is the expected value or the mean value of the log-ratio for patient i and β0, β1, β2 and β3 are regression coefficients. The model M1 was to detect genes just related to stage like in previous studies. For simultaneous relation with stage and grade, M3 and M4 were assumed. Note that M4 is a saturated model (four regression parameters for four combinations of binary classes, stage and grade). It covers any differential patterns other than additive effects of stage and grade. They include differential expressions between two classes based on the combination of stage and grade. For example, a differential expression between superficial cancer with low grade (the best type of cancer) and the others is obtained by introducing the constraint β1 = β2 = –β3 (!= 0) in M4, which may yield a linear model with a single interaction effect and no main effects. These differential patterns could be identified based on the estimates of parameter coefficients for genes for which M4 is selected as illustrated below.

Table 2 shows an estimate of FDR for each of several cut-off points for the maximum Formula obtained by the multivariate permutation procedure with 2000 permutations. We selected the top 100 genes, so that the FDR was around 15%. Table 3 shows classification of 100 selected genes based on selected models and the sign of estimated regression coefficients in selected model. Note that, out of the top 100 genes, the same model was selected for 98 genes based on AIC, as expected by the equality of the criterion on Formula with that on AIC as noted in Section 2. Seventeen genes were related to only stage via the model M1; 9 (8) had negative (positive) estimates of the regression coefficient β1 for stage in M1. In other words, the 9 (8) genes were under-(over-) expressed for invasive cancer. Thirteen genes were related to only grade via the model M2; 10 (3) were under-(over-) expressed for high grade cancer. Interestingly, the rest 70 genes were related to both stage and grade; 39 genes via the model M3 without the interaction term of stage and grade and 31 genes via the model M4 with the interaction term. Of the 39 genes for which the model M3 was selected, negative values for both β1 and β2 (under-expression for invasive or high grade cancer) were suggested for 37 genes, while positive values for both β1 and β2 (over-expression for invasive or high grade tumours) for two genes. Of the 31 genes for which the model M4 was selected, 15 (16) genes had negative (positive) estimates for β3, i.e. the interaction of stage and grade.


View this table:
[in this window]
[in a new window]

 
Table 2. Estimates of FDR for various cut-offs for the maximum Table 2

 

View this table:
[in this window]
[in a new window]

 
Table 3. Sign of estimated regression coefficients in the selected models for the 100 significant genes

 
Figure 1 shows typical mean expression profiles for the selected genes with negative (Fig. 1A and 1C) or positive interaction (Fig. 1B and 1D). Figures 1A and 1B indicate differential expression between invasive cancer with high grade (the worst type) and the others, while Figures 1C and D indicate differential expression between superficial cancer with low grade (the best type) and the others. In general, selected genes for which under-expression was linked to more progressed cancer, i.e. invasive and/or high grade, involved putative tumour suppressor genes. A gene related to tumour protein p53, which is a well-known tumour suppressor, was related to grade via the model M2 and the estimate for β1 (the regression coefficient for grade) was negative. Glutathione peroxidase (GPX) 2, which is a member of oxidoreductase and the destruction of it increased the genesis of bacteria-induced intestinal cancer (Chu et al., 2004), was related to both stage and grade via the model M3 and the estimates for β1 and β2 (the regression coefficient for stage and grade, respectively) were negative. With respect to genes for which the model M4 was selected, putative tumour suppressor genes, RNA binding motif protein 3 (RBM3) (Sutherland et al., 2005) in Figure 1A and CD-81 (Koenig-Hoffmann et al., 2005) in Figure 1D were under-expressed for the worst type of cancer (invasive and high grade) and for cancers other than the best type (superficial and low grade), respectively.


Figure 1
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Typical expression profiles for genes with negative or positive interaction. Each of four plots AD shows the mean expressions of a gene of the 100 significant genes for the four types of cancer combined by stage, superficial and invasive, and grade, low (dashed line) and high (solid line). The estimates of parameters (β1, β2, β3) for the model M4 are also given.

 
On the other hand, selected genes for which over-expression was linked to more progressed cancer are suggested to act like as oncogenes. Amphiregulin in Figure 1B acts as a protease of the extracellular matrix (ECM) with induction of extracellular matrix metalloproteinase (MMP) –2 and –9, and the expression level was associated with survival of bladder cancer patients (Menashi et al., 2003; Thogersen et al., 2001). Highly aggressive invasive cancers are characterized by their activity of inducing break down of cell– cell or cell– matrix adhesion, modulation of ECM proteolysis and induction of angiogenesis. A family of human endogenous retroviruses (HERVs), HERV-H, in Figure 1C was highly expressed in bladder cancer (Stauffer et al., 2004). Amphiregulin in Figure 1B and HERV-H in Figure 1C were over-expressed for the worst type of cancer (invasive and high grade) and for cancers other than the best type (superficial and low grade), respectively.


    DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 DISCUSSION
 REFERENCES
 
We have developed a method to select genes related to multiple clinical phenotypes via model selection based on a set of multivariate linear regression models. As indicated by the simulation study, our methods with model selection based on the doubly-adjusted R-square, Formula , can substantially improve the power for various differential models for multiple phenotypes, compared with the conventional method that uses only a single model for gene selection. Also, the model selection indicates appropriate model for each selected gene, which may be helpful to gain insights on the relationship between gene expression and multiple phenotypes. Hence, our methods would be a useful tool for screening for selecting genes with various differential models for multiple phenotypes. In linear regression models, various types of phenotypic variables including nominal, ordinal and continuous variables with three or more levels can be accommodated. The Formula can be generally adopted for both nested and non-nested models. Our methods are hence applicable to a wide variety of correlative analyses of global gene expression profiles from microarrays with multiple phenotypes. Although we have focused on gene selection, the model building process for prediction, e.g. for diagnostic prediction, preferably by generating hybrid models consisting of different kind of models based on a set of genes, is interesting and thus subject to future research.

Because of the large number of genes in microarray experiments, there will always be some genes with a very small sum of squares across samples, so that the value of an overall test statistic will be very large whether or not their averages are large. Tusher et al. (2001) have proposed a regularization that avoids this difficulty. For the overall F-test in (4), this regularization may be expressed as


Formula

where {phi} is a fudge factor whose value is determined to reduce the dependence of F({phi}) versus {phi} (Tusher et al., 2001). As a simple modification of our methods, we obtain a regularized version of the doubly-adjusted R-square by replacing the usual F-statistic with this regularized F-statistic in (2), (3) and (5). This regularized statistic would be reasonable for gene selection because of its one-to-one correspondence with F({phi}). However, for model selection, one may prefer the criterion of maximizing the original doubly-adjusted R-square to that of maximizing the regularized one because of the equality with popular criteria based on Cp or AIC. Hence, we first select the model for which the original doubly-adjusted R-square is maximized for each gene, followed by gene selection based on the regularized doubly-adjusted R-square for the selected model. We derived this regularization in a heuristic way and its theoretical justification is subject to future research. The approaches to borrow strength across genes via hierarchical modelling (e.g. Baldi and Long, 2001; Efron et al., 2001; Wright and Simon, 2003) may provide a framework for this.

If clinical variables are highly correlated in a set of samples, then the models chosen for individual genes are likely to be highly unstable because of lack of power to discriminate between the different effects. One should select a set of samples to be well balanced especially when specimen banks from large cohort studies or clinical trials are available.

Determination of the set of candidate models is an important issue. One can assume, in principle, many candidate models, for example, by using different categorizing or higher order effects for phenotypic variables or by imposing constraints for regression coefficients to reflect a particular differential pattern such as the constraint in the model M4 described in the bladder cancer example (see Section 3.2). However, a large set of candidate models may suffer from error in model selection seriously. A good strategy would be to prepare a small number of plausible models with biologically different interpretations and one very flexible model to capture other differential patterns (such as the model M4 in the bladder cancer example).

We illustrated the methods using the bladder cancer data. We selected 100 genes with various patterns of differential expression for stage and/or grade (Table 3). Different models or differential expression patterns suggested from Table 3 have the potential to reflect different biologic aspects in stage and grade progressions of bladder cancer. Although some selected genes in our analyses have already been identified and studied for bladder cancer and other cancer, many genes were newly identified by our analyses. These genes have the potential to provide new insights on molecular alternation related to disease biology and progressiveness in bladder cancer. A fraction of these genes are now under investigation. As an initial step, we confirmed reproducibility of differential expression for six selected genes which were related to RNA processing and/or translation (1, 3 and 2 genes for which the model M1, M3 and M4 was selected, respectively), but have not been studied for bladder cancer, for additional 74 samples by quantitative real time PCR assays (M. Ito, unpublished data).

Our methods could also indicate candidate molecular markers for more accurate classification of cancer. For example, genes for which the linear model, M1 or M2, were selected could supplement traditional pathological assays in determining stage or grade. Besides, our methods could help in developing new approaches to cancer diagnosis incorporating multiple diagnostic factors. For example, genes for which the model M4 with the interaction effects of stage and grade was selected (e.g. the four genes in Fig. 1) could be candidate genes useful for identifying the best (superficial and low grade) or the worst (invasive and high grade) type of bladder cancer. In such attempts, link with outcome or prognosis information should also be investigated.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on October 2, 2006; revised on November 27, 2006; accepted on December 25, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 DISCUSSION
 REFERENCES
 

    Akaike H. Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, —Petrov BN, Caski F, eds. ( (1973) ) Budapest: Akademia Kiado. 267–281..

    Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, ( (2001) ) 17, : 509–519.[Abstract/Free Full Text].

    Chen D, et al. Selecting genes by test statistics. J. Biomed. Biotechnol., ( (2005) ) 2005, : 132–138.[CrossRef][Medline].

    Chu FF, et al. Bacteria-induced intestinal cancer in mice with disrupted Gpx1 and Gpx2 genes. Cancer Res., ( (2004) ) 64, : 962–968.[Abstract/Free Full Text].

    Chu TM, et al. A systematic statistical linear modeling approach to oligonucleotide array experiments. Math. Biosci., ( (2002) ) 176, : 35–51.[CrossRef][ISI][Medline].

    Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery. Bioinformatics, ( (2002) ) 18, : 1438–1445.[Abstract/Free Full Text].

    Dudoit S, et al. Multiple hypothesis testing in microarray experiments. Stat. Sci., ( (2003) ) 18, : 71–103.[CrossRef][ISI].

    Efron B, et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., ( (2001) ) 96, : 1151–1160.[CrossRef][ISI].

    Hocking RR. The analysis and selection of variables in linear regression. Biometrics, ( (1976) ) 32, : 1–49.[CrossRef][ISI].

    Kerr MK, Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet. Res., ( (2001) ) 77, : 123–128.[CrossRef][ISI][Medline].

    Kerr MK, et al. Analysis of variance for gene expression microarray data. J. Comput. Biol., ( (2000) ) 7, : 819–837.[CrossRef][ISI][Medline].

    Koenig-Hoffmann K, et al. High throughput functional genomics: identification of novel genes with tumor suppressor phenotypes. Int. J. Cancer, ( (2005) ) 113, : 434–439.[CrossRef][ISI][Medline].

    Mallows CL. Some comments on Cp. Technometrics, ( (1973) ) 15, : 661–676.[CrossRef][ISI].

    Matsui S. Statistical applications using DNA microarrays for cancer diagnosis and prognosis. In: Handbook of Statistics in Clinical Oncology, —Crowley JJ, Ankerst DP, eds. ( (2006) ) 2nd. Boca Raton: CRC Press. 419–436..

    Menashi S, et al. Regulation of extracellular matrix metalloproteinase inducer and matrix metalloproteinase expression by Amphiregulin in transformed human breast epithelial cells. Cancer Res., ( (2003) ) 63, : 7575–7580.[Abstract/Free Full Text].

    Modlich O, et al. Identifying superficial, muscle-invasive, and metastasizing transitional cell carcinoma of the bladder: use of cDNA array analysis of gene expression profiles. Clin. Cancer. Res., ( (2004) ) 10, : 3410–3421.[Abstract/Free Full Text].

    Okuno T, et al. Prediction sum of squares, Akaike's information criteria and doubly-adjusted multiple correlation coefficient. Bull. Int. Stat. Inst., ( (1977) ) 47, : 370–373..

    Sanchez-Carbayo M, et al. Gene discovery in bladder cancer progression using cDNA microarrays. Am. J. Pathol., ( (2003) ) 163, : 505–516.[Abstract/Free Full Text].

    Simon R, et al. Design and Analysis of DNA Microarray Investigations, ( (2004) ) New York: Springer..

    Stauffer Y, et al. Digital expression profiles of human endogenous retroviral families in normal and cancerous tissues. Cancer Immun., ( (2004) ) 4, : 2.[Medline].

    Storey JD. A direct approach to false discovery rates. J. Roy. Stati. Soc., ( (2002) ) B64, : 479–498.[CrossRef].

    Sutherland LC, et al. RNA binding motif (RBM) proteins: a novel family of apoptosis modulators? J. Cell Biochem., ( (2005) ) 94, : 5–24.[CrossRef][ISI][Medline].

    Thogersen VB, et al. A subclass of HER1 ligands are prognostic markers for survival in bladder cancer patients. Cancer Res., ( (2001) ) 61, : 6227–6233.[Abstract/Free Full Text].

    Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, ( (2001) ) 98, : 5116–5121.[Abstract/Free Full Text].

    Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics, ( (2003) ) 19, : 2448–2455.[Abstract/Free Full Text].

    Wolfinger RD, et al. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol., ( (2001) ) 8, : 625–637.[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Clin. Cancer Res.Home page
H. Kawanishi, Y. Matsui, M. Ito, J. Watanabe, T. Takahashi, K. Nishizawa, H. Nishiyama, T. Kamoto, Y. Mikami, Y. Tanaka, et al.
Secreted CXCL1 Is a Potential Mediator and Marker of the Tumor Invasion of Bladder Cancer
Clin. Cancer Res., May 1, 2008; 14(9): 2579 - 2587.
[Abstract] [Full Text] [PDF]


Home page
CirculationHome page
J. Loscalzo
Association Studies in an Era of Too Much Information: Clinical Analysis of New Biomarker and Genetic Data
Circulation, October 23, 2007; 116(17): 1866 - 1870.
[Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/6/732    most recent
btl663v3
btl663v2
btl663v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Matsui, S.
Right arrow Articles by Ogawa, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Matsui, S.
Right arrow Articles by Ogawa, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?