Bioinformatics Advance Access originally published online on January 18, 2005
Bioinformatics 2005 21(9):1950-1957; doi:10.1093/bioinformatics/bti267
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Testing association of a pathway with survival using gene expression data
1Department of Medical Statistics, Leiden University Medical Center PO Box 9604, 2300 RA Leiden, The Netherlands
2Mathematical Institute, Leiden University The Netherlands
3Department of Pathology, Leiden University Medical Center The Netherlands
4Department of Pediatrics, Leiden University Medical Center The Netherlands
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: A recent surge of interest in survival as the primary clinical endpoint of microarray studies has called for an extension of the Global Test methodology to survival.
Results: We present a score test for association of the expression profile of one or more groups of genes with a (possibly censored) survival time. Groups of genes may be pathways, areas of the genome, clusters from a cluster analysis or all genes on a chip. The test allows one to test hypotheses about the influence of these groups of genes on survival directly, without the intermediary of single gene testing. The test is based on the Cox proportional hazards model and is calculated using martingale residuals. It is possible to adjust the test for the presence of covariates. We also present a diagnostic graph to assist in the interpretation of the test result, visualizing the influence of genes. The test is applied to a tumor dataset, revealing pathways from the gene ontology database that are associated with survival of patients.
Availability: The Global Test for survival has been incorporated into the R-package globaltest (version 3.0), available at http://www.bioconductor.org
Contact: j.j.goeman{at}lumc.nl
| 1 INTRODUCTION |
|---|
|
|
|---|
A microarray experiment typically results in many thousands of measurements, each relating to the expression level of a single gene. Single genes, however, are often not the primary theoretical focus of the researcher, who might be more interested in certain pathways or genomic regions that are suspected to be biologically relevant.
For this reason we have introduced the Global Test for groups of genes (Goeman et al., 2004), which allows the unit of analysis of the microarray experiment to be shifted from the single gene level to the pathway level, where a pathway may be any set of genes, e.g. chosen using the Gene Ontology database or from earlier experiments. For every pathway, the Global Test can test (with a single test) whether the expression profile of that pathway is significantly associated with a clinical variable of interest. This allows researchers to immediately test theoretical hypotheses on the clinical importance of certain pathways. Even when such hypotheses are not directly available from biological theory or past research, the Global Test can significantly reduce the multiple testing problem, because there are typically much fewer pathways than genes.
In the original publication of the Global Test, the clinical variable could be either a continuous measurement or a 0/1 group indicator. Recently, however, there has been a surge of interest in the survival time of patients as the primary clinical outcome in a microarray experiment. Many of these studies focus on the prediction of survival, e.g. in breast cancer (Van't Veer et al., 2002; Van de Vijver et al., 2002; Pawitan et al., 2004) and in lung cancer (Wigle et al., 2002; Beer et al., 2002). Other studies use multiple testing methods to find genes which are associated with survival (Jenssen et al., 2002).
The present study extends the Global Test methodology to survival outcomes. It allows the researcher to test whether the expression profile of a given set of genes is associated with survival. More precisely it tests whether individuals with a similar gene expression profile tend to have similar survival times. A significant pathway may be a mix of genes which are upregulated for patients with a short survival time, genes which are downregulated for the same patients, and other genes that show no association with survival at all.
The test of the present paper is based on the Cox proportional hazards model. Therefore it avoids the requirement that many analysis strategies have of choosing an arbitrary cut-off (e.g. 5 years survival), but uses all survival information that is present in the data. Technically, the test is derived from the goodness-of-fit test of Verweij et al. (1998). The original Global Test was derived in a similar way from a goodness-of-fit test for generalized linear models (Le Cessie and Van Houwelingen, 1995). The two Global Tests are therefore highly comparable and allow quite similar interpretations.
In this paper we also show how the test can be adjusted for the presence of covariates (possible confounders or competing risk factors). This allows better use of the Global Test in observational studies. Furthermore, it allows the researcher to establish that the microarray really adds something to the predictive performance of known risk factors, showing that it is not simply an expensive way to measure risk factors already known. It also allows the test to be used on more complex designs than a simple one-sample follow-up study.
The new Global Test method presented in this paper has been incorporated into the R-package globaltest, version 3.0, which is available at http://www.bioconductor.org
The approach will be illustrated on a dataset of 17 osteosarcoma patients, testing pathways from the Gene Ontology database.
| 2 THE MODEL |
|---|
|
|
|---|
The Global Test exploits the duality between association and prediction. By definition, if two things are associated, knowing one improves prediction of the other. Hence, if survival is associated with gene expression profile, this means that knowing the gene expression profile allows a better prediction of survival than not knowing the expression profile.
With this idea in mind we make a prediction model for prediction of survival from the gene expression measurements. The most convenient choice for such a model is the Cox proportional hazards model, which is the most widely used model for survival data in medical research. The Cox model uses the full empirical distribution of the survival times and it can handle censored data, i.e. samples for which the exact survival time is not known, but for which it is only known that the patient is still alive at a certain moment (Klein and Moeschberger, 1997). The use of the Cox model requires a true follow-up study design, meaning that patients are not selected on their survival times in any way. If such a patient selection was made, the methods of this paper may not be appropriate: in Van't Veer et al. (2002) for example, where a selected group of early metastases was compared to a selected group which was metastasis-free for at least 5 years, the original Global Test for a 0/1 outcome is preferable (Goeman et al., 2004).
Suppose the matrix of normalized gene expression measurements for the group of genes of interest is given by the n x m matrix X with elements xij, where n is the sample size and m the number of genes in the group. Suppose also that there is a number p
0 of covariates for each patient, which we put in an n x p data matrix Z with elements zij. It will be assumed that p < n, but no such restriction is put on m.
Cox's proportional hazards model (Klein and Moeschberger, 1997) assumes the hazard function at time t for individual i to relate to the covariates as
![]() | (1) |
, relating to the gene expressions, and
, relating to the covariates. The hazard function determines the survival function Si(t), which gives the probability that individual i survives up to time t, through
![]() |
is the cumulative hazard up to time t.
In this model, showing that the gene expressions are associated with survival is equivalent to rejecting the null hypothesis
![]() |
To obtain a test that works whatever the value of m, we put an extra assumption on the regression coefficients ß1,...,ßm. We assume that the regression coefficients of the genes are random and a priori independent with mean zero and common variance
2. The null hypothesis now becomes simply
![]() |
1, ...,
p of the covariates are not assumed to be random. The Cox model with random coefficients is an empirical Bayesian model and is closely linked to penalized likelihood methods. It should be noted that we have not assumed a specific distributional form for the regression coefficients; the derivation of our test is invariant to the choice of the shape of this distribution. Choosing a Gaussian distribution results in a Cox ridge regression model (Pawitan et al., 2004); choosing a double exponential distribution results in a LASSO model (Tibshirani, 1997). Both models can also be used to predict survival times of patients.
In the context of testing it is most insightful to view the prior distribution of the regression coefficients as the focus of the power of the test. The test that will be derived in the next section will be a score test, which has the property that it has optimal power against alternatives with small values of the parameter
2. This property stems from the fact that the score test is equivalent to the likelihood ratio test in the limit where the alternative
2
0 (Cox and Hinkley, 1974). Alternatives with small values of
2 tend to have small values of
, so that the test can be said to be optimal on average against alternatives with small values of
. These alternatives are mainly alternatives which have all or most regression coefficients non-zero but small. The test can therefore be said to be optimized against alternatives in which all or most genes have some association with the outcome. This alternative is precisely the situation in which we are interested, because we want to say something about the pathway as a whole.
Alternative tests can easily be derived for regression coefficients with a more complex covariance structure. If the vector ß = (ß1,...,ßm)' is assumed a priori to have mean zero and covariance matrix
2
, the resulting test of H0 would be optimal against alternatives with small values of ß'
ß. The standard choice of
= Im distributes power equally over all directions of ß, while a different choice will have more power against deviations from H0 in directions which correspond to the larger eigenvalues of
. This property could be exploited in the derivation of a test for a specific purpose or to incorporate prior knowledge. In this paper we shall restrict ourselves to
= Im.
| 3 DERIVATION OF THE TEST |
|---|
|
|
|---|
Testing the association of a group of genes with survival can therefore be done by testing H0 in the empirical Bayesian model (1) with random regression coefficients. In this section we will derive the test statistic for this test. A score test for the same model has also been studied by Verweij et al. (1998) in the context of testing the fit of the Cox model. Their derivation was based on the partial likelihood of the Cox model. In this paper we give an alternative derivation based on the full likelihood and a simpler martingale argument.
We derive the test in stages. First suppose that all parameters except
2 are known, i.e. the regression coefficients
1,...,
p and the baseline hazard function h(t) are known. In this simplified situation it will be relatively easy to derive the score test, which can be generalized to the situation with unknown parameters later in this section.
3.1 The basic score test
By definition a score test is based on the derivative of the log-likelihood at the value of the parameter to be tested. Suppose for each individual i we have observed a survival time ti and a status indicator di, where di = 1 indicates death (the patient died at ti) and di = 0 indicates censoring (the patient was lost to follow-up at ti). The log-likelihood of
2 in model (1) is
![]() | (2) |
![]() |
ds is the cumulative baseline hazard.
From the assumptions on the distribution of ß1,...,ßm, we can derive the distribution of r = (r1,...,rn)', the vector of the linear effects of the gene expressions. This r has mean zero and covariance matrix
2R, where R = XX'. For the general likelihood (2) and an r of this form, Le Cessie and Van Houwelingen (1995) have used a Taylor approximation to derive
![]() |
![]() | (3) |
For known H(t) and known c1,...,cn, expression (3) can be standardized to have unit variance and be used as the score test statistic. When these parameters are unknown, we must plug in maximum likelihood estimates for them under the null model in which
2 = 0. Standardizing the score test is traditionally done using the Fisher information, calculated from the second derivatives of the log-likelihood. In this case these calculations are very unpleasant, and it turns out to be simpler to standardize using the estimated variance of the test statistic.
3.2 Using estimated baseline hazard
We shall first plug in the estimate for the cumulative hazard H(t), but still assume that
1,...,
p and hence c1,...,cn are known. As the maximum likelihood estimate of H(t) we can take the Breslow estimator (Klein and Moeschberger, 1997, Section 8.6)
![]() |
, i = 1,...,n.
Using twice the estimated derivative of the log-likelihood (3) as the test statistic and writing it in matrix notation, we get the test statistic
![]() | (4) |
The derivation of estimates for the mean and variance of T is quite technical and is given in Section 3.5. The estimated mean is
![]() | (5) |
![]() |
i pij = dj and
j pij = ûi.
The estimated variance of T is
![]() | (6) |
[diag(R) + 2 R (mj pj)]. The diag of a square matrix is the column vector of its diagonal elements; 1 is an n x 1 vector of ones, and mj is the j-th column of the matrix M = (D P)B, where D = diag(d) is a diagonal matrix with Dii = di and B is an n x n matrix with elements bij = 1{ti < tj}. The elements mij of M can be interpreted as the estimated martingale residual of individual i just before time tj.
For purposes of interpretation it is often easier to take
![]() |
, so that it leads to the same standardized test statistic:
![]() |
3.3 Using estimated regression coefficients
In general the regression coefficients
1,...,
p of the covariates are not known but must be estimated. Replacing
1,...,
p by their maximum likelihood estimates will still give a valid score test for H0, but with a different distribution of the test statistic. We use the following approximation to this distribution which is derived by Verweij et al. (1998).
The estimated martingale residuals d
based on the estimated
can be approximated in a first order Taylor approximation by
![]() | (7) |
![]() |
. The expectation of T0 can be estimated using the formulas in Section 3.2. They are approximately
![]() |
![]() |
. To evaluate ÊT0 and
we replace the parameter values of
1,...,
p by their estimates. Simulations in Verweij et al. (1998) show this approximation to be quite accurate.
3.4 The distribution of the test statistic
There are two ways to calculate the P-value of the test: by asymptotic theory and by permutation arguments. We outline both the options and their advantages.
Equation (9) in Section 3.5 it will be shown that the centered test statistic T ÊT can be written as a linear combination of n martingales. Therefore by the martingale central limit theorem (Andersen et al., 1993) the distribution of the standardized Q converges to a standard normal distribution as n
. This fact motivates the use of a normal approximation to the distribution of Q to calculate the one-sided P-value (Verweij et al., 1998). Interesting simulations which give insight in to the power of the score test in a random effects survival model are given in Andersen et al. (1999).
For small samples the asymptotic distribution may not be reliable enough. An alternative is to calculate Q for all, or a random sample of many (10 000), permutations of the martingale residuals of the n samples. This randomly redistributes the vectors of gene expression measurements over the individuals, while keeping the relationship between the fixed covariates and survival the same. The resulting distribution is another approximation to the null distribution of Q, which can be used to find the P-value. Use of the permutation null distribution requires the assumption that there is no relationship between the gene expressions on the one hand and the covariates and the censoring mechanism on the other hand: permuting destroys these associations. This makes the permutation null distribution less useful when covariates are present.
The main advantage of the permutation-based P-value is that it gives an exact P-value, which is guaranteed to keep the alpha level provided enough permutations are used. This is especially useful for smaller sample sizes, where we may not trust the normality of the distribution of Q. The advantage of the asymptotic theory P-valueaside from being much quicker to calculateis that it has more power: the permutation based P-value does not use the full null distribution, but the null distribution conditional on the set of observed martingale residuals. With this conditioning the test loses some power, as the set of observed residuals is informative for the parameter
2.
3.5 Counting process calculations
In this technical section we calculate the mean and variance of the test statistic T under the null hypothesis for known c1,...,cn but estimated H(t), as given in Equations (5) and (6). For this we will use a counting process notation (Fleming and Harrington, 1991; Andersen et al., 1993). The strategy we will use is common in martingale theory: we write our test statistic T as the limit of a process T(t) as t
and decompose T(t) into a martingale and a compensator. The limit of the compensator is the estimator of the mean of T and the limit of the predictable variation process is the estimate of the variance. For an alternative derivation, see Verweij et al. (1998).
Let Y(t) = (Y1(t),...,Yn(t))' be the vector of at-risk processes of individuals 1,...,n and N(t) = (N1(t),...,Nn(t))' the vector of their counting processes. Then N has intensity process
= CY(t)H(t), where C is a diagonal matrix with Cii = eci, i = 1,...,n. Write 1 = (1,...,1)', n x 1 and N(t) = 1'N(t), the total counting process.
In the counting process notation, d = N(
) and
with
, where V = CY(1'CY)1. Wherever possible we will drop the dependence on time for convenience of notation.
Note that the compensator of
is
, which is also the compensator of N. Write
. Then
and
is a martingale vector. Subtracting the intensities and writing M = N
,
![]() |
The statistic T is T(
) with
![]() |
![]() | (8) |
is a predictable process. Using Equation (8) and some linear algebra we can say that, almost surely,
![]() |
Because
is a martingale and
is predictable, the compensator of the process T is
, which we can estimate by
![]() |
![]() | (9) |
![]() |
, which we can estimate by
![]() |
To evaluate ÉT and
we use
![]() |
![]() |
Writing P for the n x n matrix with elements pij and M for the n x n matrix with elements mij, the results (5) and (6) follow.
| 4 INTERPRETATION |
|---|
|
|
|---|
When testing a specific pathway for a specific sample of patients, it is usually not satisfactory to report only the resulting p-value. In this section we will discuss some issues related to the interpretation of the test result. We show how to calculate and visualize the influence of individual genes on the test result. We also propose a diagnostic which can be used when many genes are associated with survival, to assess whether a gene group is exceptional. We only give the theory here; for an example, see Section 5.
4.1 Interpretation of the test statistic
The test of this paper is derived from the Cox model in the same way as the Global Test in Goeman et al. (2004) was derived from the generalized linear model. The functional form of the test statistic is therefore quite similar, with the martingale residuals taking the place of the residuals from the generalized linear model in that paper. Much of the interpretation of the test statistic is therefore also quite similar.
Central to all interpretation of the test outcome is the matrix R = XX' which figures prominently in the formula for the test statistic. It is an n x n matrix which can be seen as describing the similarities in the expression profile between the samples. The entry Rij is relatively large if samples i and j have a relatively similar expression profile over the pathway of interest.
To show the role of the matrix R, we can rewrite the unstandardized test statistic T0 as
![]() |
4.2 Gene plot
To investigate the influence of individual genes on the test outcome we can rewrite
, where xi is the i-th column of X (i = 1,...,m), containing the measurements for the i-th gene. The unstandardized test statistic then becomes
![]() |
is exactly the unstandardized global test statistic for testing whether the pathway containing only gene i is associated with survival. The test statistic of a pathway is therefore a weighted average of the test statistics for the m genes in the pathway.
In a plot we can visualize the influence of the individual genes by showing the values Ti ÉTi, with their standard deviation under the null hypothesis (calculated using the methods of Section 3). An example of such a gene plot is given in Figure 1. In this plot, large positive values indicate genes with a large (positive or negative) association with survival and hence genes that make the pathway more significant. As Ti
||xi||2, genes with more expression variance tend to carry more weight in the pathway.
|
Note that the visualized values of the gene influences Ti in the gene plot are essentially univariate: they depend only on the gene i itself. The multivariate nature of the test statistic Q is therefore not visible in the gene plot. It comes in because, although T0 is the sum of the Ti and ÉT0 is the sum of the ÉTi, the variance of T0 is generally not the sum of the variances of the Ti.
4.3 The comparative p
The Global Test tests the null hypothesis that the pathway is not associated with survival. This null hypothesis only depends on the observed survival and on the genes in the pathway itself: the result is absolute, not relative to the other pathways.
However, there are situations in which one would be more interested in a relative result. If the Global Test on the set of all genes is very significant, we can usually expect a sizeable proportion of the genes on the array to be associated with survival. In that case we can expect many pathways to show association with survival as well. This will hold especially for the larger pathways, which will often include some of the genes which are associated with survival.
In such situations we propose a diagnostic called comparative p, which can help interpret the p-value that comes out of the test. The comparative p for a pathway of size m with P-value
is defined as the proportion of randomly selected sets of genes of the size m that have a Global Test P-value smaller or equal to
. To calculate this comparative p we draw 1000 or 10 000 random gene sets from the array without replacement.
The comparative p fulfills a role different from the P-value and should only be used alongside it. It is a diagnostic, not a P-value in the statistical sense. It tells whether the P-value of a group of genes is much lower than could be expected from a gene group of its size in this dataset.
| 5 APPLICATION: OSTEOSARCOMA DATA |
|---|
|
|
|---|
We applied the above methodology to a dataset of 17 osteosarcoma patients from the Leiden University Medical Center.
5.1 Data
A genome wide screen of gene expression in osteosarcoma was done using Hu133a gene expression chips (Affymetrix, Santa Clara, CA). This chip contains 22 283 genes. A successful hybridization was obtained for 17 osteosarcoma biopsies. Three of the samples were amplified, labelled and hybridized in duplicate, and one sample in triplicate. These technical replicates were averaged after gene expression measures were obtained, which was done using gcrma (Wu et al., 2004). No pre-selection of genes was made.
The 17 patients were followed up for 10 years. Median survival time was 40 months. Available covariates included the presence of metastasis at diagnosis, histology and response to neo-adjuvant chemotherapy. However, as treatment was not uniform over all patients, these covariates were not prognostic and we did not consider them.
Pathway information was obtained from the Gene Ontology (GO) database, using the BioConductor GO package (Zhang, 2004). Pathways that were considered to be of specific interest were cell cycle (GO: 7049), DNA repair (GO: 6281), angiogenesis (GO: 1525), skeletal development (GO: 1501) and apoptosis (GO: 6915).
5.2 Analysis
When testing pathways of interest, it is advisable to also test the pathway of all genes on the chip for association with survival. This shows whether the overall gene expression profile is associated with survival. The results for the pathway of all genes and for the five pathways of primary interest are given in Table 1. We calculated the P-value using both the asymptotic theory method and the permutation method (using 100 000 permutations).
|
The permutation P-values tend to be somewhat more conservative than the asymptotic P-values, reflecting both the slight loss of power for the permutation test and a deviation from asymptotic normality due to the small number of samples.
In this dataset the expression profile over the set of all genes on the chip is significantly associated with survival. Note that this does not mean that every gene on the chip is associated with survival. It means that the patients who die early are relatively similar to each other in terms of their overall expression profile, while patients who live long are likewise relatively similar. It also means that there is some potential for prediction of survival based on gene expression, even before any pre-selection of genes. The cell cycle, DNA repair and apoptosis pathways are clearly associated with survival, while there is no evidence for this association in angiogenesis and skeletal development.
Because the test for all genes was significant, we expect a sizeable proportion of genes to be associated with survival, so that many pathways will be associated with survival. The comparative p gives a measure of whether the p-value found for the pathway is unusually low given that it is a pathway of its size from this dataset (Section 4.3). For the results in Table 1, 10 000 gene sets were sampled for each pathway. We used the asymptotic P-values for the comparative p calculations.
We conclude that cell cycle and DNA repair are more clearly associated than could be expected from a gene set of its size in this dataset: only around 60 out of 10 000 random gene sets of size 1115 have a lower P-value than the cell cycle pathway. The expression profile of the apoptosis pathway is clearly associated with survival, as can be seen from the p-values; however it is not exceptional in that: more than 20% of random gene sets have a lower P-value than apoptosis. The skeletal development pathway is interesting in its own way: it is clearly not associated with survival (p = 0.5) and this is quite exceptional for a pathway of this size in this dataset: only around 20 in 10 000 random gene sets had a higher P-value. The skeletal development pathway seems to include uncommonly few genes which are associated with survival.
It can occur in some datasets that the set of all genes is not significant, while some pathways (e.g. DNA repair) are significant. This occurs in Table 1 for example if we use FDR-adjusted P-values with a threshold of 0.01 (Benjamini and Yekutieli, 2001). The result for all genes can be seen as a false negative test result. However, another valid interpretation is that the prediction of survival without biological pre-selection of genes is uncertain, but if it is known a priori that the genes in the DNA repair pathway are likely to be informative, some prediction of survival is possible.
5.3 Mining the GO database
If it is not a priori known which pathways are of specific interest, one can also use a data-mining approach, trying to find those pathways which are most significantly associated with survival.
For the osteosarcoma data we explored the Gene Ontology database. Of all GO terms, 4032 matched at least one gene on the hu133a chip. We excluded all terms which matched only one gene, because the interesting single gene pathways would already have been found in single gene testing. This left 3080 pathways, all of which we tested for association with survival. We used the asymptotic P-value, because due to the randomness in the the permutation P-value, it does not give a unique list. Table 2 gives the 10 GO-terms with the smallest P-values.
|
To adjust for multiple testing, one can use the Benjamini and Hochberg FDR (Benjamini and Yekutieli, 2001). All 10 pathways in Table 2 are significant on an FDR of 0.05. The P-values of the pathways tend to have positive correlations because of pathway overlap and pathways being subsets of other pathways. An FDR-controlling procedure that would make use of these dependencies would potentially gain much power in this situation.
The literature confirmed the importance of many of these GO-terms in tumorigenesis. For example, both microtubule cytoskeleton (GO:0015630) and phosphorylation of Stat3 protein (GO:0042518) are known to be involved in growth and differentiation signaling, processes which are often disturbed in tumors. Second-messenger mediated signaling (GO:0019932) is a superset of the Stat3 pathway. Protein amino acid methylation (GO:0006479) is involved in protein degradation. Alterations in the stability of proteins is often a hallmark of tumors and may affect the aggressiveness of a tumor and thereby the patient's survival.
5.4 A diagnostic plot
To learn more about the outcome of the Global Test than just the P-value, one can use the diagnostic plot described in Section 4. We illustrate the use of this plot on the microtubule cytoskeleton pathway, which emerged in Section 5.3.
The gene plot for the 21 genes in this pathway is given in Figure 1. Each bar gives the Global Test statistic for testing whether the gene set containing only that single gene is associated with survival. The test statistic for the whole pathway is a weighted average of the bars of the genes (Section 4.2). The color of the bars distinguishes between positive and negative association with survival.
Figure 1 shows that at least only 4 out of 21 genes in the microtubule cytoskeleton pathway show a significant association with survival on their own. Further, the pathway is a mix of genes which are positively and negatively associated with survival. An analysis of the gene plot can form the basis for further investigation of the structure of the pathway, perhaps to formulate hypotheses on interesting subpathways.
| 6 DISCUSSION |
|---|
|
|
|---|
It has often been remarked that the key to successful microarray data analysis lies in an intelligent integration of advanced statistical methods with the vast domain of biological knowledge that is already available. The Global Test for survival presented in this paper is a step forward in this direction, combining known biological pathway information with the statistical sophistication of the Cox proportional hazards model.
Due to its complexity the Cox model has been slow to find its way into microarray methodology. Most methods require survival to be reduced to a two-valued variable, using an arbitrary cut-off, resulting in unnecessary loss of information. By using the Cox model for survival, gene expression analysis can improve performance and also become better connected to traditional medical statistics.
Pathway information is available from many databases and is essential for the understanding of the outcomes of a microarray experiment. The Global Test methodology allows researchers to look directly for important pathways, without first having to go through single gene testing. This may lead to a better use of pathway information and more directly interpretable results.
Received on September 13, 2004; revised on January 4, 2005; accepted on January 6, 2005
| REFERENCES |
|---|
|
|
|---|
Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. Statistical Models Based on Counting Processes, (1993) , NY Springer.
Andersen, P.K., Klein, J.P., Zhang, M.J. (1999) Testing for center effects in multi-center survival studies: a Monte Carlo comparison of fixed and random effects tests. Stat. Med., 18, 14891500[CrossRef][Medline].
Beer, D.G., Kardia, S.L.R., Huang, C.C., Giordano, T.J., Levin, A.M., Misek, D.E., Lin, L., Chen, G., Gharib, T.G., Thomas, D.G., et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816824[ISI][Medline].
Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29, 11651188[CrossRef].
Le Cessie, S. and Van Houwelingen, H.C. (1995) Testing the fit of a regression model via score tests in random effects models. Biometrics, 51, 600614[CrossRef][ISI][Medline].
Cox, D.R. and Hinkley, D.V. Theoretical Statistics, (1974) , Boca Raton, FL Chapman & Hall.
Fleming, T.R. and Harrington, D.P. Counting Processes and Survival Analysis, (1991) , NY John Wiley and Sons.
Goeman, J.J., Van de Geer, S.A., De Kort, F., Van Houwelingen, J.C. (2004) A global test for groups of genes: testing association with a clinical outcome. Bioinformatics, 20, 9399
Jenssen, T.K., Kuo, W.P., Stokke, T., Hovig, E. (2002) Associations between gene expressions in breast cancer and patient survival. Hum. Genet., 111, 411420[CrossRef][ISI][Medline].
Klein, J.P. and Moeschberger, M.L. Survival Analysis: Techniques for Censored and Truncated Data, (1997) , NY Springer.
Pawitan, Y., Bjöhle, J., Wedren, S., Humphreys, K., Skoog, L., Huang, F., Amler, L., Shaw, P., Hall, P., Bergh, J. (2004) Gene expression profiling for prognosis using Cox regression. Stat. Med., 23, 17671780[Medline].
Tibshirani, R. (1997) The LASSO method for variable selection in the Cox model. Stat. Med., 16, 385395[CrossRef][ISI][Medline].
Van't Veer, L.J., Dai, H.Y., Van de Vijver, M.J., He, Y.D.D., Hart, A.A.M., Mao, M., Peterse, H.L., Van der Kooy, K., Marton, M.J., Witteveen, A.T., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530536[CrossRef][Medline].
Verweij, P.J.M., Van Houwelingen, H.C., Stijnen, T. (1998) A goodness-of-fit test for Cox's proportional hazards model based on martingale residuals. Biometrics, 54, 15171526.
Van de Vijver, M.J., He, Y.D., Van't Veer, L.J., Dai, H., Hart, A.A.M., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Eng. J. Med., 347, 19992009
Wigle, D.A., Jurisica, I., Radulovich, N., Pintilie, M., Rossant, J., Liu, N., Lu, C., Woodgett, J., Seiden, I., Johnston, M., et al. (2002) Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res., 62, 30053008
Wu, Z., Irizarry, R.A., Gentleman, R., Martinez-Murillo, F., Spencer, F. (2004) A model based background adjustement for oligonucleotide expression arrays. J. Am. Stat. Assoc., 99, 909917[CrossRef][ISI].
Zhang, J. GO: A Data Package Containing Annotation Data for GO, (2004) R package version 1.6.5.
This article has been cited by other articles:
![]() |
G. G. Van den Eynden, M. Smid, S. J. Van Laere, C. G. Colpaert, I. Van der Auwera, T. X. Bich, P. van Dam, M. A. den Bakker, L. Y. Dirix, E. A. Van Marck, et al. Gene Expression Profiles Associated with the Presence of a Fibrotic Focus and the Growth Pattern in Lymph Node-Negative Breast Cancer Clin. Cancer Res., May 15, 2008; 14(10): 2944 - 2952. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Nam and S.-Y. Kim Gene-set approach for expression pattern analysis Brief Bioinform, May 1, 2008; 9(3): 189 - 197. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Smid, Y. Wang, Y. Zhang, A. M. Sieuwerts, J. Yu, J. G.M. Klijn, J. A. Foekens, and J. W.M. Martens Subtypes of Breast Cancer Show Preferential Site of Relapse Cancer Res., May 1, 2008; 68(9): 3108 - 3114. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. J. Goeman and U. Mansmann Multiple testing on the directed acyclic graph of gene ontology Bioinformatics, February 15, 2008; 24(4): 537 - 544. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Westermann, K.-O. Henrich, J. S. Wei, W. Lutz, M. Fischer, R. Konig, R. Wiedemeyer, V. Ehemann, B. Brors, K. Ernestus, et al. High Skp2 Expression Characterizes High-Risk Neuroblastomas Independent of MYCN Status Clin. Cancer Res., August 15, 2007; 13(16): 4695 - 4703. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Kienle, T. Katzenberger, G. Ott, D. Saupe, A. Benner, H. Kohlhammer, T. F.E. Barth, S. Holler, J. Kalla, A. Rosenwald, et al. Quantitative Gene Expression Deregulation in Mantle-Cell Lymphoma: Correlation With Clinical and Biologic Factors J. Clin. Oncol., July 1, 2007; 25(19): 2770 - 2777. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. M. Kemp, N. R. Nirmala, and J. D. Szustakowski Extending the pathway analysis framework with a test for transcriptional variance implicates novel pathway modulation during myogenic differentiation Bioinformatics, June 1, 2007; 23(11): 1356 - 1362. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. J. Goeman and P. Buhlmann Analyzing gene expression data in terms of gene sets: methodological issues Bioinformatics, April 15, 2007; 23(8): 980 - 987. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. A. Zapala and N. J. Schork Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables PNAS, December 19, 2006; 103(51): 19430 - 19435. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||





































