Skip Navigation


Bioinformatics Advance Access originally published online on January 18, 2005
Bioinformatics 2005 21(9):1950-1957; doi:10.1093/bioinformatics/bti267
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1950    most recent
bti267v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (20)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Goeman, J. J.
Right arrow Articles by van Houwelingen, H. C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Goeman, J. J.
Right arrow Articles by van Houwelingen, H. C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Testing association of a pathway with survival using gene expression data

Jelle J. Goeman 1,2,*, Jan Oosting 3, Anne-Marie Cleton-Jansen 3, Jakob K. Anninga 4 and Hans C. van Houwelingen 1

1Department of Medical Statistics, Leiden University Medical Center PO Box 9604, 2300 RA Leiden, The Netherlands
2Mathematical Institute, Leiden University The Netherlands
3Department of Pathology, Leiden University Medical Center The Netherlands
4Department of Pediatrics, Leiden University Medical Center The Netherlands

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 

Motivation: A recent surge of interest in survival as the primary clinical endpoint of microarray studies has called for an extension of the Global Test methodology to survival.

Results: We present a score test for association of the expression profile of one or more groups of genes with a (possibly censored) survival time. Groups of genes may be pathways, areas of the genome, clusters from a cluster analysis or all genes on a chip. The test allows one to test hypotheses about the influence of these groups of genes on survival directly, without the intermediary of single gene testing. The test is based on the Cox proportional hazards model and is calculated using martingale residuals. It is possible to adjust the test for the presence of covariates. We also present a diagnostic graph to assist in the interpretation of the test result, visualizing the influence of genes. The test is applied to a tumor dataset, revealing pathways from the gene ontology database that are associated with survival of patients.

Availability: The Global Test for survival has been incorporated into the R-package globaltest (version 3.0), available at http://www.bioconductor.org

Contact: j.j.goeman{at}lumc.nl


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 
A microarray experiment typically results in many thousands of measurements, each relating to the expression level of a single gene. Single genes, however, are often not the primary theoretical focus of the researcher, who might be more interested in certain pathways or genomic regions that are suspected to be biologically relevant.

For this reason we have introduced the Global Test for groups of genes (Goeman et al., 2004), which allows the unit of analysis of the microarray experiment to be shifted from the single gene level to the pathway level, where a ‘pathway’ may be any set of genes, e.g. chosen using the Gene Ontology database or from earlier experiments. For every pathway, the Global Test can test (with a single test) whether the expression profile of that pathway is significantly associated with a clinical variable of interest. This allows researchers to immediately test theoretical hypotheses on the clinical importance of certain pathways. Even when such hypotheses are not directly available from biological theory or past research, the Global Test can significantly reduce the multiple testing problem, because there are typically much fewer pathways than genes.

In the original publication of the Global Test, the clinical variable could be either a continuous measurement or a 0/1 group indicator. Recently, however, there has been a surge of interest in the survival time of patients as the primary clinical outcome in a microarray experiment. Many of these studies focus on the prediction of survival, e.g. in breast cancer (Van't Veer et al., 2002; Van de Vijver et al., 2002; Pawitan et al., 2004) and in lung cancer (Wigle et al., 2002; Beer et al., 2002). Other studies use multiple testing methods to find genes which are associated with survival (Jenssen et al., 2002).

The present study extends the Global Test methodology to survival outcomes. It allows the researcher to test whether the expression profile of a given set of genes is associated with survival. More precisely it tests whether individuals with a similar gene expression profile tend to have similar survival times. A significant pathway may be a mix of genes which are upregulated for patients with a short survival time, genes which are downregulated for the same patients, and other genes that show no association with survival at all.

The test of the present paper is based on the Cox proportional hazards model. Therefore it avoids the requirement that many analysis strategies have of choosing an arbitrary cut-off (e.g. 5 years survival), but uses all survival information that is present in the data. Technically, the test is derived from the goodness-of-fit test of Verweij et al. (1998). The original Global Test was derived in a similar way from a goodness-of-fit test for generalized linear models (Le Cessie and Van Houwelingen, 1995). The two Global Tests are therefore highly comparable and allow quite similar interpretations.

In this paper we also show how the test can be adjusted for the presence of covariates (possible confounders or competing risk factors). This allows better use of the Global Test in observational studies. Furthermore, it allows the researcher to establish that the microarray really adds something to the predictive performance of known risk factors, showing that it is not simply an expensive way to measure risk factors already known. It also allows the test to be used on more complex designs than a simple one-sample follow-up study.

The new Global Test method presented in this paper has been incorporated into the R-package globaltest, version 3.0, which is available at http://www.bioconductor.org

The approach will be illustrated on a dataset of 17 osteosarcoma patients, testing pathways from the Gene Ontology database.


    2 THE MODEL
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 
The Global Test exploits the duality between association and prediction. By definition, if two things are associated, knowing one improves prediction of the other. Hence, if survival is associated with gene expression profile, this means that knowing the gene expression profile allows a better prediction of survival than not knowing the expression profile.

With this idea in mind we make a prediction model for prediction of survival from the gene expression measurements. The most convenient choice for such a model is the Cox proportional hazards model, which is the most widely used model for survival data in medical research. The Cox model uses the full empirical distribution of the survival times and it can handle censored data, i.e. samples for which the exact survival time is not known, but for which it is only known that the patient is still alive at a certain moment (Klein and Moeschberger, 1997). The use of the Cox model requires a true follow-up study design, meaning that patients are not selected on their survival times in any way. If such a patient selection was made, the methods of this paper may not be appropriate: in Van't Veer et al. (2002) for example, where a selected group of early metastases was compared to a selected group which was metastasis-free for at least 5 years, the original Global Test for a 0/1 outcome is preferable (Goeman et al., 2004).

Suppose the matrix of normalized gene expression measurements for the group of genes of interest is given by the n x m matrix X with elements xij, where n is the sample size and m the number of genes in the group. Suppose also that there is a number p ≥ 0 of covariates for each patient, which we put in an n x p data matrix Z with elements zij. It will be assumed that p < n, but no such restriction is put on m.

Cox's proportional hazards model (Klein and Moeschberger, 1997) assumes the hazard function at time t for individual i to relate to the covariates as

(1)
where h(t) is an unknown baseline hazard function and ci + ri is a linear function of the covariates, which is split up in our case into , relating to the gene expressions, and , relating to the covariates. The hazard function determines the survival function Si(t), which gives the probability that individual i survives up to time t, through

where is the cumulative hazard up to time t.

In this model, showing that the gene expressions are associated with survival is equivalent to rejecting the null hypothesis

that all regression coefficients relating to the gene expressions are zero. If m were always small, we could test H0 using classical tests which were developed for the Cox model. These tests do not work for general m, however (Klein and Moeschberger, 1997, Section 8.2).

To obtain a test that works whatever the value of m, we put an extra assumption on the regression coefficients ß1,...,ßm. We assume that the regression coefficients of the genes are random and a priori independent with mean zero and common variance {tau}2. The null hypothesis now becomes simply

so that the dimension of H0 does not depend on m anymore. Note that the coefficients {gamma}1, ..., {gamma}p of the covariates are not assumed to be random.

The Cox model with random coefficients is an empirical Bayesian model and is closely linked to penalized likelihood methods. It should be noted that we have not assumed a specific distributional form for the regression coefficients; the derivation of our test is invariant to the choice of the shape of this distribution. Choosing a Gaussian distribution results in a Cox ridge regression model (Pawitan et al., 2004); choosing a double exponential distribution results in a LASSO model (Tibshirani, 1997). Both models can also be used to predict survival times of patients.

In the context of testing it is most insightful to view the prior distribution of the regression coefficients as the focus of the power of the test. The test that will be derived in the next section will be a score test, which has the property that it has optimal power against alternatives with small values of the parameter {tau}2. This property stems from the fact that the score test is equivalent to the likelihood ratio test in the limit where the alternative {tau}2 -> 0 (Cox and Hinkley, 1974). Alternatives with small values of {tau}2 tend to have small values of , so that the test can be said to be optimal on average against alternatives with small values of . These alternatives are mainly alternatives which have all or most regression coefficients non-zero but small. The test can therefore be said to be optimized against alternatives in which all or most genes have some association with the outcome. This alternative is precisely the situation in which we are interested, because we want to say something about the pathway as a whole.

Alternative tests can easily be derived for regression coefficients with a more complex covariance structure. If the vector ß = (ß1,...,ßm)' is assumed a priori to have mean zero and covariance matrix {tau}2{Sigma}, the resulting test of H0 would be optimal against alternatives with small values of ß' {Sigma} ß. The standard choice of {Sigma} = Im distributes power equally over all directions of ß, while a different choice will have more power against deviations from H0 in directions which correspond to the larger eigenvalues of {Sigma}. This property could be exploited in the derivation of a test for a specific purpose or to incorporate prior knowledge. In this paper we shall restrict ourselves to {Sigma} = Im.


    3 DERIVATION OF THE TEST
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 
Testing the association of a group of genes with survival can therefore be done by testing H0 in the empirical Bayesian model (1) with random regression coefficients. In this section we will derive the test statistic for this test. A score test for the same model has also been studied by Verweij et al. (1998) in the context of testing the fit of the Cox model. Their derivation was based on the partial likelihood of the Cox model. In this paper we give an alternative derivation based on the full likelihood and a simpler martingale argument.

We derive the test in stages. First suppose that all parameters except {tau}2 are known, i.e. the regression coefficients {gamma}1,...,{gamma}p and the baseline hazard function h(t) are known. In this simplified situation it will be relatively easy to derive the score test, which can be generalized to the situation with unknown parameters later in this section.

3.1 The basic score test
By definition a score test is based on the derivative of the log-likelihood at the value of the parameter to be tested. Suppose for each individual i we have observed a survival time ti and a status indicator di, where di = 1 indicates death (the patient died at ti) and di = 0 indicates censoring (the patient was lost to follow-up at ti). The log-likelihood of {tau}2 in model (1) is

(2)
where

is the contribution to the log-likelihood of individual i for fixed ri, and ds is the cumulative baseline hazard.

From the assumptions on the distribution of ß1,...,ßm, we can derive the distribution of r = (r1,...,rn)', the vector of the linear effects of the gene expressions. This r has mean zero and covariance matrix {tau}2R, where R = XX'. For the general likelihood (2) and an r of this form, Le Cessie and Van Houwelingen (1995) have used a Taylor approximation to derive

For the Cox model this becomes

(3)
where ui = eciH(ti), i = 1,...,n, is the hazard incurred by individual i up to time ti. Note that diui is the martingale residual of individual i at time ti (Klein and Moeschberger, 1997, Section 11.3).

For known H(t) and known c1,...,cn, expression (3) can be standardized to have unit variance and be used as the score test statistic. When these parameters are unknown, we must plug in maximum likelihood estimates for them under the null model in which {tau}2 = 0. Standardizing the score test is traditionally done using the Fisher information, calculated from the second derivatives of the log-likelihood. In this case these calculations are very unpleasant, and it turns out to be simpler to standardize using the estimated variance of the test statistic.

3.2 Using estimated baseline hazard
We shall first plug in the estimate for the cumulative hazard H(t), but still assume that {gamma}1,...,{gamma}p and hence c1,...,cn are known. As the maximum likelihood estimate of H(t) we can take the Breslow estimator (Klein and Moeschberger, 1997, Section 8.6)

and write , i = 1,...,n.

Using twice the estimated derivative of the log-likelihood (3) as the test statistic and writing it in matrix notation, we get the test statistic

(4)
where d = (d1,...,dn)', û = (û1,...,ûn)' and Û = diag(û), an n x n diagonal matrix with Ûii = ûi.

The derivation of estimates for the mean and variance of T is quite technical and is given in Section 3.5. The estimated mean is

(5)
where P is an n x n matrix with i,j-th element

Each pij is the increment of the cumulative hazard incurred by individual i at time tj, so that {sum}i pij = dj and {sum}j pij = ûi.

The estimated variance of T is

(6)
where pj is the j-th column of P and [diag(R) + 2 R (mjpj)]. The diag of a square matrix is the column vector of its diagonal elements; 1 is an n x 1 vector of ones, and mj is the j-th column of the matrix M = (DP)B, where D = diag(d) is a diagonal matrix with Dii = di and B is an n x n matrix with elements bij = 1{ti < tj}. The elements mij of M can be interpreted as the estimated martingale residual of individual i just before time tj.

For purposes of interpretation it is often easier to take

as the unstandardized test statistic. It has ÊT0 = trace(RÛ – PP') and , so that it leads to the same standardized test statistic:

3.3 Using estimated regression coefficients
In general the regression coefficients {gamma}1,...,{gamma}p of the covariates are not known but must be estimated. Replacing {gamma}1,...,{gamma}p by their maximum likelihood estimates will still give a valid score test for H0, but with a different distribution of the test statistic. We use the following approximation to this distribution which is derived by Verweij et al. (1998).

The estimated martingale residuals du based on the estimated can be approximated in a first order Taylor approximation by

(7)
with V = WZ(ZWZ')–1Z', W = diag(û) – PP' and Z the n x p data matrix of the fixed covariates. Therefore the unstandardized test statistic T0 can be approximated as

with . The expectation of T0 can be estimated using the formulas in Section 3.2. They are approximately

and

with . To evaluate ÊT0 and we replace the parameter values of {gamma}1,...,{gamma}p by their estimates. Simulations in Verweij et al. (1998) show this approximation to be quite accurate.

3.4 The distribution of the test statistic
There are two ways to calculate the P-value of the test: by asymptotic theory and by permutation arguments. We outline both the options and their advantages.

Equation (9) in Section 3.5 it will be shown that the centered test statistic T – ÊT can be written as a linear combination of n martingales. Therefore by the martingale central limit theorem (Andersen et al., 1993) the distribution of the standardized Q converges to a standard normal distribution as n -> {infty}. This fact motivates the use of a normal approximation to the distribution of Q to calculate the one-sided P-value (Verweij et al., 1998). Interesting simulations which give insight in to the power of the score test in a random effects survival model are given in Andersen et al. (1999).

For small samples the asymptotic distribution may not be reliable enough. An alternative is to calculate Q for all, or a random sample of many (10 000), permutations of the martingale residuals of the n samples. This randomly redistributes the vectors of gene expression measurements over the individuals, while keeping the relationship between the fixed covariates and survival the same. The resulting distribution is another approximation to the null distribution of Q, which can be used to find the P-value. Use of the permutation null distribution requires the assumption that there is no relationship between the gene expressions on the one hand and the covariates and the censoring mechanism on the other hand: permuting destroys these associations. This makes the permutation null distribution less useful when covariates are present.

The main advantage of the permutation-based P-value is that it gives an ‘exact’ P-value, which is guaranteed to keep the alpha level provided enough permutations are used. This is especially useful for smaller sample sizes, where we may not trust the normality of the distribution of Q. The advantage of the asymptotic theory P-value—aside from being much quicker to calculate—is that it has more power: the permutation based P-value does not use the full null distribution, but the null distribution conditional on the set of observed martingale residuals. With this conditioning the test loses some power, as the set of observed residuals is informative for the parameter {tau}2.

3.5 Counting process calculations
In this technical section we calculate the mean and variance of the test statistic T under the null hypothesis for known c1,...,cn but estimated H(t), as given in Equations (5) and (6). For this we will use a counting process notation (Fleming and Harrington, 1991; Andersen et al., 1993). The strategy we will use is common in martingale theory: we write our test statistic T as the limit of a process T(t) as t -> {infty} and decompose T(t) into a martingale and a compensator. The limit of the compensator is the estimator of the mean of T and the limit of the predictable variation process is the estimate of the variance. For an alternative derivation, see Verweij et al. (1998).

Let Y(t) = (Y1(t),...,Yn(t))' be the vector of at-risk processes of individuals 1,...,n and N(t) = (N1(t),...,Nn(t))' the vector of their counting processes. Then N has intensity process {Lambda} = CY(t)H(t), where C is a diagonal matrix with Cii = eci, i = 1,...,n. Write 1 = (1,...,1)', n x 1 and N(t) = 1'N(t), the total counting process.

In the counting process notation, d = N({infty}) and with , where V = CY(1'CY)–1. Wherever possible we will drop the dependence on time for convenience of notation.

Note that the compensator of is {Lambda}, which is also the compensator of N. Write . Then and is a martingale vector. Subtracting the intensities and writing M = N {Lambda},

The statistic T is T({infty}) with

From the integration by parts formula (Fleming and Harrington, 1991) it follows that, almost surely,

(8)
where is a predictable process. Using Equation (8) and some linear algebra we can say that, almost surely,

Because is a martingale and is predictable, the compensator of the process T is , which we can estimate by

The process S = T – ÊT is a martingale. It can be written in the following way:

(9)
as the integral of the predictable process vector

over the martingale vector M. The predictable variation process of S is therefore , which we can estimate by

To evaluate ÉT and we use

and

Writing P for the n x n matrix with elements pij and M for the n x n matrix with elements mij, the results (5) and (6) follow.


    4 INTERPRETATION
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 
When testing a specific pathway for a specific sample of patients, it is usually not satisfactory to report only the resulting p-value. In this section we will discuss some issues related to the interpretation of the test result. We show how to calculate and visualize the influence of individual genes on the test result. We also propose a diagnostic which can be used when many genes are associated with survival, to assess whether a gene group is exceptional. We only give the theory here; for an example, see Section 5.

4.1 Interpretation of the test statistic
The test of this paper is derived from the Cox model in the same way as the Global Test in Goeman et al. (2004) was derived from the generalized linear model. The functional form of the test statistic is therefore quite similar, with the martingale residuals taking the place of the residuals from the generalized linear model in that paper. Much of the interpretation of the test statistic is therefore also quite similar.

Central to all interpretation of the test outcome is the matrix R = XX' which figures prominently in the formula for the test statistic. It is an n x n matrix which can be seen as describing the similarities in the expression profile between the samples. The entry Rij is relatively large if samples i and j have a relatively similar expression profile over the pathway of interest.

To show the role of the matrix R, we can rewrite the unstandardized test statistic T0 as

which is the sum over the term-by-term product of the entries of R and the entries of the matrix (d – û)(d – û)'. The i,j-th entry of the latter matrix is large whenever samples i and j have similar martingale residuals. The test statistic T0 is therefore relatively large whenever the entries of the matrices R and (d – û)(d – û)' are correlated, which is when similarity in gene expressions tends to coincide with similarity in the martingale residual. Hence, the test statistic is large if individuals who die sooner than expected tend to be relatively similar in their gene expression profiles and individuals who live longer than expected also tend to be similar in their gene expression profiles.

4.2 Gene plot
To investigate the influence of individual genes on the test outcome we can rewrite , where xi is the i-th column of X (i = 1,...,m), containing the measurements for the i-th gene. The unstandardized test statistic then becomes

where is exactly the unstandardized ‘global’ test statistic for testing whether the ‘pathway’ containing only gene i is associated with survival. The test statistic of a pathway is therefore a weighted average of the test statistics for the m genes in the pathway.

In a plot we can visualize the influence of the individual genes by showing the values Ti – ÉTi, with their standard deviation under the null hypothesis (calculated using the methods of Section 3). An example of such a ‘gene plot’ is given in Figure 1. In this plot, large positive values indicate genes with a large (positive or negative) association with survival and hence genes that make the pathway more significant. As Ti {propto} ||xi||2, genes with more expression variance tend to carry more weight in the pathway.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 1 Gene plot of microtubule cytoskeleton pathway, showing the sorted Global Test statistics for testing the 21 single gene pathways which make up the pathway.

 
Note that the visualized values of the gene influences Ti in the gene plot are essentially univariate: they depend only on the gene i itself. The multivariate nature of the test statistic Q is therefore not visible in the gene plot. It comes in because, although T0 is the sum of the Ti and ÉT0 is the sum of the ÉTi, the variance of T0 is generally not the sum of the variances of the Ti.

4.3 The comparative p
The Global Test tests the null hypothesis that the pathway is not associated with survival. This null hypothesis only depends on the observed survival and on the genes in the pathway itself: the result is absolute, not relative to the other pathways.

However, there are situations in which one would be more interested in a relative result. If the Global Test on the set of all genes is very significant, we can usually expect a sizeable proportion of the genes on the array to be associated with survival. In that case we can expect many pathways to show association with survival as well. This will hold especially for the larger pathways, which will often include some of the genes which are associated with survival.

In such situations we propose a diagnostic called ‘comparative p’, which can help interpret the p-value that comes out of the test. The comparative p for a pathway of size m with P-value is defined as the proportion of randomly selected sets of genes of the size m that have a Global Test P-value smaller or equal to . To calculate this comparative p we draw 1000 or 10 000 random gene sets from the array without replacement.

The comparative p fulfills a role different from the P-value and should only be used alongside it. It is a diagnostic, not a P-value in the statistical sense. It tells whether the P-value of a group of genes is much lower than could be expected from a gene group of its size in this dataset.


    5 APPLICATION: OSTEOSARCOMA DATA
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 
We applied the above methodology to a dataset of 17 osteosarcoma patients from the Leiden University Medical Center.

5.1 Data
A genome wide screen of gene expression in osteosarcoma was done using Hu133a gene expression chips (Affymetrix, Santa Clara, CA). This chip contains 22 283 genes. A successful hybridization was obtained for 17 osteosarcoma biopsies. Three of the samples were amplified, labelled and hybridized in duplicate, and one sample in triplicate. These technical replicates were averaged after gene expression measures were obtained, which was done using gcrma (Wu et al., 2004). No pre-selection of genes was made.

The 17 patients were followed up for 10 years. Median survival time was 40 months. Available covariates included the presence of metastasis at diagnosis, histology and response to neo-adjuvant chemotherapy. However, as treatment was not uniform over all patients, these covariates were not prognostic and we did not consider them.

Pathway information was obtained from the Gene Ontology (GO) database, using the BioConductor GO package (Zhang, 2004). Pathways that were considered to be of specific interest were cell cycle (GO: 7049), DNA repair (GO: 6281), angiogenesis (GO: 1525), skeletal development (GO: 1501) and apoptosis (GO: 6915).

5.2 Analysis
When testing pathways of interest, it is advisable to also test the ‘pathway’ of all genes on the chip for association with survival. This shows whether the overall gene expression profile is associated with survival. The results for the pathway of all genes and for the five pathways of primary interest are given in Table 1. We calculated the P-value using both the asymptotic theory method and the permutation method (using 100 000 permutations).


View this table:
[in this window]
[in a new window]
 
Table 1 Global test results for the osteosarcoma data and the pathways of primary interest

 
The permutation P-values tend to be somewhat more conservative than the asymptotic P-values, reflecting both the slight loss of power for the permutation test and a deviation from asymptotic normality due to the small number of samples.

In this dataset the expression profile over the set of all genes on the chip is significantly associated with survival. Note that this does not mean that every gene on the chip is associated with survival. It means that the patients who die early are relatively similar to each other in terms of their overall expression profile, while patients who live long are likewise relatively similar. It also means that there is some potential for prediction of survival based on gene expression, even before any pre-selection of genes. The cell cycle, DNA repair and apoptosis pathways are clearly associated with survival, while there is no evidence for this association in angiogenesis and skeletal development.

Because the test for all genes was significant, we expect a sizeable proportion of genes to be associated with survival, so that many pathways will be associated with survival. The comparative p gives a measure of whether the p-value found for the pathway is unusually low given that it is a pathway of its size from this dataset (Section 4.3). For the results in Table 1, 10 000 gene sets were sampled for each pathway. We used the asymptotic P-values for the comparative p calculations.

We conclude that cell cycle and DNA repair are more clearly associated than could be expected from a gene set of its size in this dataset: only around 60 out of 10 000 random gene sets of size 1115 have a lower P-value than the cell cycle pathway. The expression profile of the apoptosis pathway is clearly associated with survival, as can be seen from the p-values; however it is not exceptional in that: more than 20% of random gene sets have a lower P-value than apoptosis. The skeletal development pathway is interesting in its own way: it is clearly not associated with survival (p = 0.5) and this is quite exceptional for a pathway of this size in this dataset: only around 20 in 10 000 random gene sets had a higher P-value. The skeletal development pathway seems to include uncommonly few genes which are associated with survival.

It can occur in some datasets that the set of all genes is not significant, while some pathways (e.g. DNA repair) are significant. This occurs in Table 1 for example if we use FDR-adjusted P-values with a threshold of 0.01 (Benjamini and Yekutieli, 2001). The result for all genes can be seen as a false negative test result. However, another valid interpretation is that the prediction of survival without biological pre-selection of genes is uncertain, but if it is known a priori that the genes in the DNA repair pathway are likely to be informative, some prediction of survival is possible.

5.3 Mining the GO database
If it is not a priori known which pathways are of specific interest, one can also use a data-mining approach, trying to find those pathways which are most significantly associated with survival.

For the osteosarcoma data we explored the Gene Ontology database. Of all GO terms, 4032 matched at least one gene on the hu133a chip. We excluded all terms which matched only one gene, because the interesting single gene pathways would already have been found in single gene testing. This left 3080 pathways, all of which we tested for association with survival. We used the asymptotic P-value, because due to the randomness in the the permutation P-value, it does not give a unique list. Table 2 gives the 10 GO-terms with the smallest P-values.


View this table:
[in this window]
[in a new window]
 
Table 2 Global Test results for the osteosarcoma data on 3080 Gene Ontology pathways, showing the top 10 FDR-adjusted P-values

 
To adjust for multiple testing, one can use the Benjamini and Hochberg FDR (Benjamini and Yekutieli, 2001). All 10 pathways in Table 2 are significant on an FDR of 0.05. The P-values of the pathways tend to have positive correlations because of pathway overlap and pathways being subsets of other pathways. An FDR-controlling procedure that would make use of these dependencies would potentially gain much power in this situation.

The literature confirmed the importance of many of these GO-terms in tumorigenesis. For example, both microtubule cytoskeleton (GO:0015630) and phosphorylation of Stat3 protein (GO:0042518) are known to be involved in growth and differentiation signaling, processes which are often disturbed in tumors. Second-messenger mediated signaling (GO:0019932) is a superset of the Stat3 pathway. Protein amino acid methylation (GO:0006479) is involved in protein degradation. Alterations in the stability of proteins is often a hallmark of tumors and may affect the aggressiveness of a tumor and thereby the patient's survival.

5.4 A diagnostic plot
To learn more about the outcome of the Global Test than just the P-value, one can use the diagnostic plot described in Section 4. We illustrate the use of this plot on the microtubule cytoskeleton pathway, which emerged in Section 5.3.

The gene plot for the 21 genes in this pathway is given in Figure 1. Each bar gives the Global Test statistic for testing whether the gene set containing only that single gene is associated with survival. The test statistic for the whole pathway is a weighted average of the bars of the genes (Section 4.2). The color of the bars distinguishes between positive and negative association with survival.

Figure 1 shows that at least only 4 out of 21 genes in the microtubule cytoskeleton pathway show a significant association with survival on their own. Further, the pathway is a mix of genes which are positively and negatively associated with survival. An analysis of the gene plot can form the basis for further investigation of the structure of the pathway, perhaps to formulate hypotheses on interesting subpathways.


    6 DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 
It has often been remarked that the key to successful microarray data analysis lies in an intelligent integration of advanced statistical methods with the vast domain of biological knowledge that is already available. The Global Test for survival presented in this paper is a step forward in this direction, combining known biological pathway information with the statistical sophistication of the Cox proportional hazards model.

Due to its complexity the Cox model has been slow to find its way into microarray methodology. Most methods require survival to be reduced to a two-valued variable, using an arbitrary cut-off, resulting in unnecessary loss of information. By using the Cox model for survival, gene expression analysis can improve performance and also become better connected to traditional medical statistics.

Pathway information is available from many databases and is essential for the understanding of the outcomes of a microarray experiment. The Global Test methodology allows researchers to look directly for important pathways, without first having to go through single gene testing. This may lead to a better use of pathway information and more directly interpretable results.

Received on September 13, 2004; revised on January 4, 2005; accepted on January 6, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 THE MODEL
 3 DERIVATION OF THE...
 4 INTERPRETATION
 5 APPLICATION: OSTEOSARCOMA DATA
 6 DISCUSSION
 REFERENCES
 

    Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. Statistical Models Based on Counting Processes, (1993) , NY Springer.

    Andersen, P.K., Klein, J.P., Zhang, M.J. (1999) Testing for center effects in multi-center survival studies: a Monte Carlo comparison of fixed and random effects tests. Stat. Med., 18, 1489–1500[CrossRef][Medline].

    Beer, D.G., Kardia, S.L.R., Huang, C.C., Giordano, T.J., Levin, A.M., Misek, D.E., Lin, L., Chen, G., Gharib, T.G., Thomas, D.G., et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816–824[ISI][Medline].

    Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29, 1165–1188[CrossRef].

    Le Cessie, S. and Van Houwelingen, H.C. (1995) Testing the fit of a regression model via score tests in random effects models. Biometrics, 51, 600–614[CrossRef][ISI][Medline].

    Cox, D.R. and Hinkley, D.V. Theoretical Statistics, (1974) , Boca Raton, FL Chapman & Hall.

    Fleming, T.R. and Harrington, D.P. Counting Processes and Survival Analysis, (1991) , NY John Wiley and Sons.

    Goeman, J.J., Van de Geer, S.A., De Kort, F., Van Houwelingen, J.C. (2004) A global test for groups of genes: testing association with a clinical outcome. Bioinformatics, 20, 93–99[Abstract/Free Full Text].

    Jenssen, T.K., Kuo, W.P., Stokke, T., Hovig, E. (2002) Associations between gene expressions in breast cancer and patient survival. Hum. Genet., 111, 411–420[CrossRef][ISI][Medline].

    Klein, J.P. and Moeschberger, M.L. Survival Analysis: Techniques for Censored and Truncated Data, (1997) , NY Springer.

    Pawitan, Y., Bjöhle, J., Wedren, S., Humphreys, K., Skoog, L., Huang, F., Amler, L., Shaw, P., Hall, P., Bergh, J. (2004) Gene expression profiling for prognosis using Cox regression. Stat. Med., 23, 1767–1780[Medline].

    Tibshirani, R. (1997) The LASSO method for variable selection in the Cox model. Stat. Med., 16, 385–395[CrossRef][ISI][Medline].

    Van't Veer, L.J., Dai, H.Y., Van de Vijver, M.J., He, Y.D.D., Hart, A.A.M., Mao, M., Peterse, H.L., Van der Kooy, K., Marton, M.J., Witteveen, A.T., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536[CrossRef][Medline].

    Verweij, P.J.M., Van Houwelingen, H.C., Stijnen, T. (1998) A goodness-of-fit test for Cox's proportional hazards model based on martingale residuals. Biometrics, 54, 1517–1526.

    Van de Vijver, M.J., He, Y.D., Van't Veer, L.J., Dai, H., Hart, A.A.M., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Eng. J. Med., 347, 1999–2009[Abstract/Free Full Text].

    Wigle, D.A., Jurisica, I., Radulovich, N., Pintilie, M., Rossant, J., Liu, N., Lu, C., Woodgett, J., Seiden, I., Johnston, M., et al. (2002) Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res., 62, 3005–3008[Abstract/Free Full Text].

    Wu, Z., Irizarry, R.A., Gentleman, R., Martinez-Murillo, F., Spencer, F. (2004) A model based background adjustement for oligonucleotide expression arrays. J. Am. Stat. Assoc., 99, 909–917[CrossRef][ISI].

    Zhang, J. GO: A Data Package Containing Annotation Data for GO, (2004) R package version 1.6.5.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Clin. Cancer Res.Home page
G. G. Van den Eynden, M. Smid, S. J. Van Laere, C. G. Colpaert, I. Van der Auwera, T. X. Bich, P. van Dam, M. A. den Bakker, L. Y. Dirix, E. A. Van Marck, et al.
Gene Expression Profiles Associated with the Presence of a Fibrotic Focus and the Growth Pattern in Lymph Node-Negative Breast Cancer
Clin. Cancer Res., May 15, 2008; 14(10): 2944 - 2952.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
D. Nam and S.-Y. Kim
Gene-set approach for expression pattern analysis
Brief Bioinform, May 1, 2008; 9(3): 189 - 197.
[Abstract] [Full Text] [PDF]


Home page
Cancer Res.Home page
M. Smid, Y. Wang, Y. Zhang, A. M. Sieuwerts, J. Yu, J. G.M. Klijn, J. A. Foekens, and J. W.M. Martens
Subtypes of Breast Cancer Show Preferential Site of Relapse
Cancer Res., May 1, 2008; 68(9): 3108 - 3114.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. J. Goeman and U. Mansmann
Multiple testing on the directed acyclic graph of gene ontology
Bioinformatics, February 15, 2008; 24(4): 537 - 544.
[Abstract] [Full Text] [PDF]


Home page
Clin. Cancer Res.Home page
F. Westermann, K.-O. Henrich, J. S. Wei, W. Lutz, M. Fischer, R. Konig, R. Wiedemeyer, V. Ehemann, B. Brors, K. Ernestus, et al.
High Skp2 Expression Characterizes High-Risk Neuroblastomas Independent of MYCN Status
Clin. Cancer Res., August 15, 2007; 13(16): 4695 - 4703.
[Abstract] [Full Text] [PDF]


Home page
JCOHome page
D. Kienle, T. Katzenberger, G. Ott, D. Saupe, A. Benner, H. Kohlhammer, T. F.E. Barth, S. Holler, J. Kalla, A. Rosenwald, et al.
Quantitative Gene Expression Deregulation in Mantle-Cell Lymphoma: Correlation With Clinical and Biologic Factors
J. Clin. Oncol., July 1, 2007; 25(19): 2770 - 2777.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. M. Kemp, N. R. Nirmala, and J. D. Szustakowski
Extending the pathway analysis framework with a test for transcriptional variance implicates novel pathway modulation during myogenic differentiation
Bioinformatics, June 1, 2007; 23(11): 1356 - 1362.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. J. Goeman and P. Buhlmann
Analyzing gene expression data in terms of gene sets: methodological issues
Bioinformatics, April 15, 2007; 23(8): 980 - 987.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
M. A. Zapala and N. J. Schork
Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables
PNAS, December 19, 2006; 103(51): 19430 - 19435.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1950    most recent
bti267v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (20)
Right arrowRequest Permissions
Google Scholar
Right arrow