Skip Navigation


Bioinformatics Advance Access originally published online on October 10, 2006
Bioinformatics 2006 22(21):2643-2649; doi:10.1093/bioinformatics/btl450
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/21/2643    most recent
btl450v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rajicic, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rajicic, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Survival analysis of longitudinal microarrays

Natasa Rajicic 1,2, Dianne M. Finkelstein 1,2,*, David A. Schoenfeld 1,2 and the Inflammation Host Response to Injury Research Program Investigators{dagger}

1 Massachusetts General Hospital, Biostatistics Unit 50 Staniford Street, Suite 560, Boston, MA, USA
2 Department of Biostatistics, Harvard School of Public Health Boston, MA, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATIONS
 4 RESULTS OF TRAUMA...
 5 DISCUSSION
 REFERENCES
 

Motivation: The development of methods for linking gene expressions to various clinical and phenotypic characteristics is an active area of genomic research. Scientists hope that such analysis may, for example, describe relationships between gene function and clinical events such as death or recovery. Methods are available for relating gene expression to measurements that are categorized or continuous, but there is less work in relating expressions to an observed event time such as time to death, response or relapse. When gene expressions are measured over time, there are methods for differentiating temporal patterns. However, methods have not yet been proposed for the survival analysis of longitudinally collected microarrays.

Results: We describe an approach for the survival analysis of longitudinal gene expression data. We construct a measure of association between the time to an event and gene expressions collected over time. Statistical significance is addressed using permutations and control of the false discovery rate. Our proposed method is illustrated on a dataset from a multi-center research study of inflammation and response to injury that aims to uncover the biological reasons why patients can have dramatically different outcomes after suffering a traumatic injury (www.gluegrant.org).

Contact: dfinkelstein{at}partners.org


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATIONS
 4 RESULTS OF TRAUMA...
 5 DISCUSSION
 REFERENCES
 
Scientists are turning to the microarray technology for insights into the mechanisms of the human body that were poorly understood previously. We think of genes as units of heredity as they record the genetic makeup of organisms. Although it is believed that a large number of genes remain inactive for most of our lives, there are those genes for which the activity can be associated with various physiological or environmental effects. In simple terms, a gene is considered to be activated, or expressed, if its coded information is converted into proteins which are the main instigators of functions and processes in our bodies. An interesting problem in the analysis of the human genome is to relate changes in gene activity to clinical or phenotypic information. For example, scientists have been able to relate gene expression to the clinical implications of different types of cancer (Alizadeh et al., 2000; Gloub et al., 1999; van't Veer et al., 2002; van't de Vijver et al., 2002).

The nature of the data generated from a microarray experiment poses specific challenges to the statistical analysis. In a typical experiment, data from a relatively small number of subjects is available on thousands, even tens of thousands of genes, whichmakes many of the classical statistical procedures unapplicable. The classical data types, on the other hand, are usually such where the number of explanatory variables does not exceed the number of subjects on which the data are collected. Time-to-an-event data poses additional challenges owing to the presence of censoring.

We propose a method to study relationships between repeatedly collected gene expressions and time to an event of interest. Our problem is motivated by the data from Inflammation and Host Response to Injury research project (also referred to as the Glue Grant, http://www.gluegrant.org). This multi-center and multi-disciplinary collaboration aims at better understanding of processes involved in the immune system's response to an injury, as well as uncover the biological reasons why seemingly alike patients can have dramatically different outcomes after suffering a traumatic injury or burns. Doctors hope that identifying the genetic factors will help in predicting the course of recovery of severely injured patients.

Numerous methods have been proposed and developed for relating gene microarrays to either continuous or categorical measurements such as comparison of treatment groups or the level of a known biomarker. Ring and Ross (2002) offer a comprehensive review of methodsthat use microarrays for tumor classification. In comparison, fewer methods have been suggested for the analysis of gene expressions in relation to time to an event (e.g. death, response or relapse), or for detecting differences in genes over time (Luan and Li, 2002; Yeung et al., 2003; Storey et al., 2005). Methods for the use of longitudinal microarray in either describing or predicting survival outcomes are currently unavailable.

Methods for relating gene activity to the occurrence of an event have been proposed when microarrays are collected at a single point of time (e.g. baseline). One approach is to first use an unsupervised classification method, e.g. hierarchical clustering, to generate two or more groups of patient samples (Rosenwald et al., 2002; Makretsov et al., 2004). The survival distributions within such generated clusters are then compared using the log-rank test and displayed by Kaplan–Meier curves. A second related approach is to first cluster genes based on their expressions across different patient samples, and then use cluster averages of the gene expressions as explanatory variables in a Cox proportional hazard regression model (Li and Luan, 2003). However, both these approachesdo not capture the marginal relationship between gene expressions and time to an event (Sorlie et al., 2001; Jung et al., 2005). One may end up with gene classes that do not represent any meaningful grouping in terms of the survival, or the results may vary due to a particular clustering algorithm used.

A number of published approaches apply partial least squares (PLS) method to generate linear combinations of gene expressions as predictors in the proportional hazards model (Nguyen and Rocke, 2002). PLS is a method related to principal components analysis (PCA), but while PCA creates combinations that maximize the explained variability among predictors only, PLS aims at maximizing correlations between predictors and the response variable. Bair and Tibshirani (2004) first calculated the Cox score for each gene (a statistic based on a proportional hazards partial likelihood) in order to select a subset of genes, then employed the PLS method to a reduced set of genes to arrive at the best model. Park et al., (2002) first reformulated the survival outcomes problem into a generalized linear (Poisson) regression, then applied the PLS algorithm to derive a parsimonious model. In a related approach, Li and Gui (2004) and Gui and Li (2005) reduced the dimensionality of the microarray predictor space by either partial or penalized Cox regression. Although in all of these approaches the high-dimensionality of the microarray is reduced, direct interpretation of the fitted parameters in terms of individual genes is not possible. In contrast, we are interested in developing a method that can be used as a first step in identifying individual genes for further investigation.

When thousands of genes are measured in a single experiment, a single question of interest can be formulated as simultaneous testing of numerous individual hypotheses. Testing many statistical hypotheses at once increases the possibility of a Type I error, as a significant result may occur purely by chance, regardless of the nature's true state. The control of the false discovery rate (FDR) has become a widely used method of error control in the analysis of gene microarrays (Storey and Tibshirani, 2003). The control of FDR involves an estimate of the proportion of falsely positive genes among all genes found positive (i.e. exhibit differential expression in different samples or states under investigation). Westfall and Young (1989) promoted the use of permutations to allow for dependences among test statistics. In this paper, we propose a permutation-based method that is related to the method of Storey and Tibshirani (2003) and to the popular SAM method (Tusher et al., 2001).

A gene-specific test statistic is defined in Section 2.2. A multiple testing algorithm that controls the number of false positive findings is described in Section 2.4. The results of a series of simulations are presented in Section 3, whereas the analysis on the data from the study of inflammation and response to trauma is presented in Section 4. All programs for the analysis presented in this paper were performed using R statistical package (2005; http://www.R-project.org) and can be obtained from the first author (N.R.) or from http://hedwig.mgh.harvard.edu/biostatistics/software.php.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATIONS
 4 RESULTS OF TRAUMA...
 5 DISCUSSION
 REFERENCES
 
2.1 Notation
For each of the n patients enrolled in a clinical study, we record the elapsed time from the beginning of the study to the occurrence of the event of interest (e.g. death or recovery). Since we are interested in examining whether there is an association between the time to an event and changes in gene expression, we also observe patients' microarray over time. A microarray here is a collection of p expression values on p different genes. For a given gene, let Xi(t) denote expressionsfor patient i at time t. Time to occurrence of the event is recorded using two indicator variables, {delta} and Y. Say subject i had an event at time t, then {delta}i(t) = 1 and {delta}i(t) = 0 for times prior to time. Similarly, Yi(t) = 1 if i is still at risk at time t, meaning that the patient is under observation and has not experienced the event by time t. A set of subjects remaining at risk at time t is of size Formula. Let Formula denote a total number of subjects who experienced an event at time t. For patients to whom the event does not occur for the duration of the study, or for whom other reasons have discontinued the study follow-up, we say that their event time is censored. We further make a distinction between the observed event times, {tau}k, k = 1, ... , m, and scheduled (i.e. planned) visit times, tj, j = 1, ... , J, as the planned timing of visits may not coincide with the observed event times. Here, m is the total number of observed events, and J is the total number of scheduled visits. Since events are considered to be ‘terminal’, each subject can experience it only once during the study follow-up, and the total number of observed events, m, is less or equal to the total number of subjects, m ≤ n.

2.2 Test statistic
To test the association between thousands of longitudinally collected gene expressions and time to an event of interest, we want to use a test statistic that is not only intuitive but also simple to calculate. This is primarily because we intend to use a permutation-based testing procedure which is computationally intensive. The approach we take is to first calculate one test statistic for each gene, then determine the significance of each association using permutations. We begin by examining a general class of nonparametric tests for survival data, formulated by Jones and Crowley (1989). We assume that for each subject i at risk at time t, it is possible to define a (quantitative) value Zi(t) that represents subject's covariate measurement, and denote by Formula the average value of Z for subjects at risk at time t. It is also assumed that the increasing covariate values correspond to either increasing or decreasing chances of event occurrence. Let m denote the total number of observed events, then the test statistic can be written as follows:

Formula 1(1)
Note that {omega}(t) are optional weights chosen to emphasize either early or late events. The score statistic based on the partial likelihood from a Cox regression model (Cox, 1972) is a member of this general class. Jones and Crowley (1989) investigate various choices for weights and labels ({omega}(t), Zi(t)), where using (1, Xi(t)) results in the following test statistic:

Formula 2(2)
where, Formula 2 is the average gene expression for subjects at risk at time t, Formula 2. This test statistic captures the difference between the observed covariate values for subjects that had an event at a given time-point, and the average covariate value for subjects still at risk an instant before the event occurred. The differences are then summed up over all observed event times.

In our approach, however, we cannot implement (2) without further modification. This is because the structure of our problem presents several challenges that need to be addressed. We want the outer summation in both (1) and (2) to be over the unique observed event times, {tau}k, which results in summands involving current gene expression values at the time of an observed event. Ideally, if we had observed expression values for all subjects currently at risk at time of an observed event, the statistic would be easily and correctly calculated. In clinical trials, however, data are often collected according to some schedule of study visits. The data on time-varying covariates for all subjects at risk may not be available at the time an event occurred but rather at more than one prior scheduled visit times. For example, in the Glue study, gene expression is obtained on seven scheduled visits over a period of 28 days but respiratory recovery event can occur and be recorded on any day during that time. A simple approach to deal with intermittent covariate data is the ‘last observation carried forward’ (LOCF) approach, where the most recent available observation is used in place of the missing data. If the event had occurred at some timek between the two scheduled visits tj–1 and tj, so that Formula 2 is not available, microarray collected at time tj–1 would be used in place of Formula 2. Although the LOCF approach would be simple to implement, it has been traditionally heavily criticized as it produces biased results. We, therefore, explore a different approach to handle intermittent microarray data.

2.3 Semi-parametric test of association
Another way to deal with the intermittently available data in a time-to-event study is to model unknown values using measurements available up to that time. Taking into account our limited knowledge about the longitudinal behavior of microarrays, we search for ways to model the gene expression over time without assuming strict distributional properties. In order to do this, we follow the approach outlined in Tsiatis and Davidian (2001), who extend and apply the concept of a conditional score (Stefanski and Carroll, 1987) to the joint modeling of the longitudinal and event data. Namely, we would like to consider a random effects model for the longitudinal expression data, as this model provides a way to incorporate subject-specific random intercept and slopes: Formula 2. In this way, we also address the issue of between-subject gene expression variability which can be substantial (Cheng Li, personal communication).

However, the model estimate of the unknown gene expression value Xi(t) at time t would require a distributional assumption for Formula 2. As noted earlier, owing to the unknown longitudinal behavior of genes, we would like to avoid specifying the exact distribution of the random effects. Fortunately, if we know the sufficient statistic for {alpha}i, we can use it to avoid making assumptions on the distribution of the random effects. This is because conditioning on the sufficient statistic, Si (t), would remove the dependence of the conditional distribution on the random effects {alpha}i. Namely, the sufficient statistic for {alpha}i, conditional on subject i being at risk at time t, [i.e. Yi(t) = 1], is

Formula 2
Here, {sigma}2(t) accounts for the uncertainty when using Formula 2 as an estimate of the unknown covariate value Xi(t). We describe the estimation of {sigma}2(t) below. More importantly, assuming subject i is at risk at time t, and using all available data up to and including t, Formula 2 can simply be the ordinary least square estimate. The joint likelihood of the events {delta}i(t) and the model estimate Formula 2 can then be factored into two parts, one of which does not involve a random variable {alpha}i, and an other that does not involve information on the event:

Formula 2
The conditional likelihood Formula 2 does not depend on the random effect {alpha}i. It arguably contains all the relevant information about the parameter of interest ß, and thus can be used to construct estimating equations for ß (Tsiatis and Davidian, 2001):

Formula 3(3)
where Formula 3. Here we emphasize that the outer summation is taken over the observed event times Formula 3.

We now proceed to construct a score test for testing Formula 3 versus Formula 3. Let Formula 3 be the test statistic for such test. The numerator, U(0) is found by evaluating (3) when H0 is true:

Formula 4(4)
Also note that U(0) has a form familiar to that in (2). The difference between the two statistics is that (4) estimates the unknown value of the covariate at the observed event time using the available covariate history.

The estimate of variance is obtained by finding a first derivative of (3), evaluated for ß = 0:


Formula 5

(5)
The estimate Formula 4 is a product of two quantities. First is the estimate of the unknown variability in measuring Xi(t) at time t, which is estimated by the pooled estimate of the residual sums of squares over all subjects. The second quantity is the estimate of variance of the predicted value Formula 4 using available values up to and including time t. Specifically, variance of the predicted value Formula 4 at time t is Formula 4, where mi(t) is the number of available observations for subject i up to time t, and SSi(t) is the corresponding sum of squared differences from the mean, using values up to and including time t. As before, Y(t) is the total numberof subjects at risk at time t, and n(t) is the number of events at time t. Formula 4 is the sample variance of the ordinary least square estimates, Formula 4, among subjects at risk at t. The resulting score test closely resembles the score test from the Cox's proportional hazards model for non-time-varying covariates or for completely known time-varying covariate histories. Our proposed test can also be viewed as a test of form Formula 4 for all X, where {lambda}(t) is a hazard function.

2.4 Calculation of the significance levels
The statistic in (4) is best suited for examining a single covariate, whereas we need to test thousands of genes (covariates) to determine which gene's changes over time are associated with a clinical event. The testing procedure controls for the number of false positive findings, as this is a standard approach for the genomewide studies. Simply put, to determine whether the calculated test statistic Tg is significant or not, we want to consider the number of falsely positive findings (FP) among the total number called significant (TP) when that test statistic is used as a cutoff value,

Formula 4

The expected value of this ratio is defined as the FDR for statistic Tg (Storey and Tibshirani, 2003)

Formula 4

One simple way to obtain an estimate of FDR(Tg) is to directly estimate the numerator and the denominator (Xie et al., 2005). We call this estimator the false positive ratio (FPR) in order to emphasize its derivation. To estimate the denominator, we use the total number of test statistics called significant when Tg is used as a cut-off value, i.e. #(|T| > Tg). The numerator is estimated using permutation-based estimate of the null distribution of the test statistics. Given the nature of ourlongitudinal data with survival endpoints, the actual permutation needs to be clearly defined. At each observed event time, we permute the event indicators among subjects at risk at that time. In other words, the number of subjects with events is kept fixed at each event time, with their event indicators randomly exchanged among those currently at risk. Let T = [T1, ... , Tp] be test statistics calculated on each of the p-genes in the original data. Here, I(·) is the usual indicator function, where I(a) = 1, if a true.

  1. At each observed event time k, permute indicators of events among those subjects still at risk at that time. This is equivalent to choosing n(t) elements out of Y(t), at time t.
  2. Using such perturbed data, calculate a set of p test statistics, Formula 4
  3. Compare each original Tg with all permutation-based T* and call the number of false positives the number among T* that are greater than Tg,


    Formula


  4. Repeat Steps 1–3 many times, say, hundred times. For each gene g, Formula 4, this produces a sequence of hundred numbers. Denote by Formula the mean value of such sequence for test statistic Tg.
  5. For each gene, the estimated proportion of false positives is the ratio of Formula over the total number of statistics called significant when Tg is used as a cut-off value, i.e. Formula 4. Thus, the estimate of the FPR for Tg is:

    Formula

If a test statistic has an estimated proportion of false positives below a desired, pre-specified level, e.g. 10%, then the hypothesis is rejected and the observed test statistic is declared statistically significant. Our testing procedure is similar tothe approach proposed by Storey and Tibshirani (2003), when the estimated proportion of null hypotheses Formula 4 is set to 1, and the results are described in terms of the test statistic (rather than the appropriately defined P-value). The presented algorithm can also be viewed as a form of the Empirical Bayes calculation of the FDR (Efron et al., 2001).

We calculate the permutation distribution by permuting the event indicators among subjects at risk at each time. Alternatively, the true permutation distribution for the test statistics would ostensibly be found by permuting the event times among the patients. The problem with this approach is that no samples were collected after the event for each patient, and if they were, the gene expression values after the event may have been affected by the event which would preclude their use. One way to fill-in such missing data, would be to define a distance metric in order to select among subjects with complete covariate series those that are ‘close’ or ‘similar’ to the subject with the missing observation. The algorithm will then proceed as follows: (1) cluster subjects according to their microarray at time t – 1, (2) note the cluster membership of the subject with the missing t array, and (3) impute the missing array by calculating some sample measure (e.g. mean array) using the remaining members of the cluster and their expressions at time t. The testing algorithm can then continue with Step 2. One potential limitation with this approach is that the small size of a cluster of subjects determined to be ‘close’ or ‘similar’ to the subject with the missing observation may introduce bias when calculating the imputed value.


    3 SIMULATIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATIONS
 4 RESULTS OF TRAUMA...
 5 DISCUSSION
 REFERENCES
 
We performed a series of simulations to assess validity and performance of our proposed method. The following describes an algorithm to generate longitudinal expressions along with survival outcomes that emulate the data-generating mechanisms presented bythe actual problem.

  1. Obtain an estimate of a variance–covariance matrix, Formula 4, of the random effects by fitting a random effects model to a selection of genes from the actual data.
  2. Sample from a bivariate normal distribution with zero mean and variance–covariance matrix obtained in Step 1 to get a set of random effects (intercepts and slopes): Formula 4, for Formula 4.
  3. Generate individual gene trajectories for each subject and for all genes, using the generated random effects: Formula 4, where Formula 4.
  4. Choose a value of the association parameter b. Assuming an exponential hazard function Formula 4, the parameter b captures the strength of association between time to an event and the time-dependent covariate trajectory X(t). Using the exponential form of the hazard function {lambda}(t), and an estimate of an underlying event hazard, {lambda}0, which uses the number of observed events, we derive an inverse of the cumulative hazard distribution:

    Formula 4

  5. By knowing the form of the inverse cumulative hazard function, we can use the Probability Integral Transformation to sample survival times, T. We first generate replicates of the uniformly distributed random variable U ~ Unif (0,1), and then generate survival times as Formula 4.

In the final step of the algorithm, generated survival times that exceed 28 days are considered as censored. The produced survival times are ‘linked’ to the trajectories generated in Step 3 through the random effects {alpha}. Namely, since the same random effects generated in Steps 1–2 are used in Step 5 to generate survival times T, subjects with comparable random effects get assigned similar event times (e.g. early or late).

Using the above algorithm, we generated 600 samples of data. Each sample consists of 100 subjects with 500 longitudinal gene expressions over seven time-points. Out of 500, 50 genes were set to be significantly associated with the time to an event. Testing was carried out for three choices of the association parameter b, as well as two values for the measurement error, Formula 4. Within each simulation, the false positive ratio of 10% was used as a cut-off value for determining significance. Simulations were executed using R statistical software and the results are presented in Table 1.


View this table:
[in this window]
[in a new window]

 
Table 1 Simulation results n = 100 subjects; p = 500 genes; 600 replications

 
To understand these results, let us examine the case when the association parameter b is set to 2.5, the measurement error of individual genes is Formula 4, and 50 out of a total of 500 simulated genes have trajectories associated with the time to an event (i.e. 10% of genes are significant). A median number of genes found positive over 600 simulations is 55 genes. The median false positive proportion over 600 simulations is 0.092, with an interquartile range of (0.056,0.137). Similar results are found for the remainder of the cells. When b = 0, all genes are expected to be nonsignificant under Formula 4. If the association parameter b is set to zero, any significant genes should be found purely by chance, and we would expect the total number of significant genes to be zero.

Inspection of the simulation results shows that our method performs reasonably well. The proportion of false positive findings remains close to the pre-specified 10% mark for both choices of the association parameter b and the two levels of measurement error. Also note that the proportion of false positives remains similar across the two columns. A change in the pre-specified association parameter b does not dramatically change the proportion of false positive. However, the assumed level ofthe measurement error seems to influence the estimated number of false positive findings. Estimates of false positive proportions are higher for Formula 4.


    4 RESULTS OF TRAUMA DATA ANALYSIS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATIONS
 4 RESULTS OF TRAUMA...
 5 DISCUSSION
 REFERENCES
 
We apply our method to the data from Inflammation and Host Response to Injury research project (the Glue study). This collaborative program examines the biological reasons why patients can have dramatically different clinical outcomes after experiencing a traumatic injury. Among many scientific questions posed by the Glue investigators is whether we can identify genes whose temporal changes relate to the time until a specific clinical event. It is reasonable to assume that genes exhibiting greater variation are more likely to be associated with the time to an event. Patients in the Glue study are followed for 28 days from the time they experience a serious traumatic injury. This is an example of a right-censored data, which we assume for the development of our method. Genomic data collected on Days 0, 1, 4, 7, 14, 21 and 28 are generated using commercially available oligonucleotide array technology (Affymetrix Inc., 2001). Each microarray includes expressions on 54 674 probe sets (which we will call ‘genes’ for the purposes of our analysis). Gene expressions were extracted from oligonucleotide probes by employing a PM-only analysis of Li and Wong (2001) and normalized across arrays to achieve comparable levels, using the ‘Invariant Set’ method in the dChip software (Li and Wong, 2003). Finally, the gene expression values were log-transformed prior to any calculations.

To reduce the overwhelming dimensionality of a microarray, we first excluded those genes labeled ‘Absent’ over all arrays by the Affymetrics software. We then performed a simple filtering of genes and included only those genes whose estimated coefficientof variation (CV) exceeds a certain threshold. Although more complex filtering can be used, We can also proceed without filtering at all. Our choice of threshold is somewhat ad hoc as we aimed at having a couple of thousands of genes to work with, instead of over 54 000, in this hypothesis-generating approach. This brought the number of genes to < 4000 (P = 3,914). Data on 56 subjects with complete entries were included in the analysis. For the purposes of survival analysis, we define the event of interest as ‘respiratory recovery’ which occurs when a patient no longer depends on a machine to breathe. The response is thus defined as the time from injury (and entry into the ICU) to getting off the ventilator. Of the 56 patients, 52 experienced recovery of respiratory function before the end of 28- day follow-up. Four patients remained on the ventilator by Day 28 and thus had recovery time censored at 28 days.

The test statistic that measures the association between each gene and the time to an event is calculated for each gene separately. We performed described permutation-based testing procedure to determine significance of each test statistic. Of the 3 914 investigated genes, 154 were identified as statistically significant when we used 0.10 as a cut-off value for determining the significance of each individual gene. As a comparison, a total of 694 genes were identified as significantly associated with the time to a recovery when their test statistics are compared to the 10th-percentiles of the normal distribution.

A sample Gene Ontology for the selection of genes for which change in expression over time is associated with the time to respiratory recovery is given in Table 2. The sample genes are grouped in those that exhibited positive association with the time to respiratory response, and those with a negative association. For example, increased expression for NM_153701 [GenBank] (interleukin 12 receptor) is associated with the shorter time to recovery.


View this table:
[in this window]
[in a new window]

 
Table 2 Description of a subset of significant genes

 
Results for a sample of four significant genes are presented in Figure 1, with plots of individual patient trajectories over time. The four panels are identified by accession numbers. In order to further illustrate our results, patients are distinguished by whether they had a ‘late’ respiratory recovery (those occurring after Day 16). Solid lines represent the patients who either experienced a recovery after Day 16 or their times were censored at Day 28. The two plots on the right-hand side are for genes negatively related to the time to a recovery. A decrease in gene expression on these two genes is related to a shorter time to recovery. The opposite is true for the other two plots where an increase in gene expression relates to a shorter time to recovery.


Figure 1
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Plot of patients trajectories for four genes found significant. Solid lines represent patients with recovery events that occurred after day 16, or patients censored at day 28. The test statistics corresponding to the two plots to the right are negative, indicating an inverse relationship between gene expressions and the time to a recovery. Conversely, the test statistics corresponding to the two plots to the left are positive, an increase in gene expression is associated with a shorter time to recovery.

 
The tens of thousands of gene expressions measured repeatedly over time on tens or hundreds of patients can create a considerable computational difficulty owing to the enormity of the resulting datasets. To further help in reducing the time needed for lengthy computations, we collaborated with an application specialist at the Massachusetts General Hospital Biostatistics Unit in order to obtain, develop, test and employ a parallel computing system (Lazar and Schoenfeld, 2004). The 30+ node computer cluster helped us greatly to reduce the time needed for lengthy computations.


    5 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATIONS
 4 RESULTS OF TRAUMA...
 5 DISCUSSION
 REFERENCES
 
We provide a method for the survival analysis of longitudinally collected microarrays. We address the issues of intermittently collected microarray data as well as the unknown longitudinal behavior of a single gene expression. A limitation of our approach is that the one-dimensional construction of the test statistic does not necessarily address the high-dimensionality of the problem as it does not use information across all genes simultaneously.

More complex longitudinal models can easily be incorporated into our approach. A minimum set of assumptions regarding the functional relationship between longitudinal gene expression and timing of the events will depend on an individual biological problemat hand. For example, a natural extension would be to implement the approach of Song et al., (2002), which requires only the assumption that the random effects have a smooth density. Another modification, which may be relevant in some applications, is to devisea multiple imputation procedure for the unknown covariate values. Although this will certainly add to the overall computational complexity, it would be interesting to explore whether it can be incorporated so as to take advantage of the computations already inplace and the high-dimensionality of the data. Finally, in order to make the proposed test statistics more robust to potential outliers, the actual values of gene expressions may have to be replaced by ranks or some function of the ranks.

In the Glue study, patients were closely monitored at all times for a period of 28 days. The study subjects either experienced an event or their time is censored at the end of the study follow-up. This is an example of Type I censoring where there is no possibility of missing data owing to a dropout. However, in a typical clinical trial where study participants are followed for a longer period of time, it is likely that the censoring due to early dropout would be an issue. It is straightforward to accommodate this type of censoring in our approach.

The continual advancement of the microarray technology will ultimately result in many large studies routinely including longitudinal genomic observations as part of the study follow-up. The longitudinal microarray and event time data will thus become morecommon. Also, other applications of high-dimensional data such as proteomics and metabolomics data will arise. Therefore, further development of efficient methodologies to handle these high-dimensional event time datasets are needed.


    Acknowledgments
 
This work was partially supported by a Large-Scale Collaborative Project Award (U54-GM62119) from The National Institute of General Medical Sciences, National Institutes of Health. We wish to thank Dr Cheng Li for valuable insights on the subject. We also extend our thanks to Drs John Storey and John Quackenbush for their helpful suggestions and review. The application specialist, Peter Lazar, engineered and maintained the parallel system in R used in our analysis. Finally, we are especially grateful to the four reviewers whose careful review and constructive criticism greatly improved the quality of the manuscript. The following is a list of participating investigators of the Inflammation and the Host Response to Injury research project: Brett D. Arnoldo, MD; Henry V. Baker, PhD; Paul Bankey, MD; Timothy R. Billiar, MD; Bernard H. Brownstein, PhD; Steve E. Calvano, PhD; David G. Camp II, PhD; Celeste Campbell-Finnerty, PhD; George Casella, PhD; Irshad H. Chaudry, PhD; J. Perren Cobb, MD; Ronald W. Davis, PhD; Asit K. De, PhD; Constance Elson, PhD; Bradley Freeman, MD; Richard L. Gamelli, MD; Nicole S. Gibran, MD; Brian G. Harbrecht, MD; Douglas L. Hayden, MA; Laura Hennessy, RN; David N. Herndon, MD; Jureta W. Horton, PhD; Marc G. Jeschke, MD, PhD; Jeffrey Johnson, MD; Matthew B. Klein, MD; James A. Lederer, PhD; Stephen F. Lowry, MD; Ronald V. Maier, MD; John A. Mannick, MD; Philip H. Mason, PhD; Grace P. McDonald-Smith, Med; Carol L. Miller-Graziano, PhD; Michael N. Mindrinos, PhD; Joseph P. Minei, MD; Lyle L. Moldawer, PhD; Ernest E. Moore, MD; Avery B. Nathens, MD, PhD, MPH; Grant E. O'Keefe, MD, MPH; Laurence G. Rahme, PhD; Daniel G. Remick Jr, MD; D.A.S. PhD; Michael B. Shapiro, MD; Geoffrey M. Silver, MD; Richard D. Smith, PhD; John Storey, PhD; Ronald G. Tompkins, MD; Sc.D., Mehmet Toner, PhD; H. Shaw Warren, MD; Michael A. West, MD; Wenzhong Xiao, PhD.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger}Contributing members of the Inflammation and the Host Response to Injury investigators are listed in Acknowledgements. Back

Associate Editor: Martin Bishop

Received on May 11, 2006; revised on August 15, 2006; accepted on August 17, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 SIMULATIONS
 4 RESULTS OF TRAUMA...
 5 DISCUSSION
 REFERENCES
 

    Affymetrix, Inc. (2001) Microarray Suite 5.0. Santa Clara, CA.

    Alizadeh, A.A., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511[CrossRef][Medline].

    Bair, E. and Tibshirani, R. (2004) Semi-supervised methods for predicting patient survival from gene expression papers. PLoS Bio, . 2, 5011–5022.

    Cox, D.R. (1972) Regression models and life tables. J. R. Stat. Soc. B, 34, 187–120.

    Davidov, O. and Zelen, M. (1998) Urn sampling and the proportional hazard model. Lifetime Data Anal, . 4, 309–327[CrossRef][ISI][Medline].

    Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc, . 96, 1151–1160[CrossRef][ISI].

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537[Abstract/Free Full Text].

    Gui, L. and Li, H. (2005) Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21, 3001–3008[Abstract/Free Full Text].

    Jones, M. and Crowley, J. (1989) A general class of nonparametric tests for survival analysis. Biometrics, 45, 15770.

    Jung, S.H., et al. (2005) A multiple testing procedure to associate gene expression levels with survival. Stat. Med, . 24, 3077–3088[CrossRef][ISI][Medline].

    Lazar, P. and Schoenfeld, D. (2004) Self-contained parallel system for R. , Boston, MA MGH Biostatistics.

    Li, H. and Luan, Y. (2003) Kernel Cox regression models for linking gene expression profiles to censored survival data. Proc. Pac. Symp. Biocomput, . 65–76.

    Li, H. and Gui, L. (2004) Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics, 20, Suppl. 1, I208–I215.

    Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 31–36[Abstract/Free Full Text].

    Li, C. and Wong, W.H. (2003) DNA-Chip analyzer (dChip). In Parmigiani, G., Garrett, E.S., Irizarry, R., Zeger, S.L. (Eds.). The Analysis of Gene Expression Data: Methods and Software, , New York Springer, pp. 120–41.

    Luan, Y. and Li, H. (2002) Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics, 19, 474–482.

    Makretsov, N.A., et al. (2004) Hierarchical clustering analysis of tissue microarray immunostaining data identifies prognostically significant groups of breast carcinoma. Clin. Cancer Res, . 10, 6143–6151[Abstract/Free Full Text].

    Nguyen, D. and Rocke, D. (2002) Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics, 18, 1625–1632[Abstract/Free Full Text].

    Park, P., et al. (2002) Linking gene expression data with patient survival times using partial least squares. Bioinformatics, 18, S120–S127[Abstract].

    R Development Core Team. (2005) R: a language and environment for statistical computing. , Vienna, Austria R Foundation for Statistical Computing.

    Ring, B.Z. and Ross, D.T. (2002) Microarrays and molecular markers for tumor classification. Genome Biol, . 3, 1–6[Medline].

    Rosenwald, A., et al. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med, . 346, 1937–1947[Abstract/Free Full Text].

    Song, X., et al. (2002) A Semiparametric likelihood approach to joint modeling of longitudinal and time-to-event data. Biometrics, 58, 742–753[CrossRef][ISI][Medline].

    Sorlie, T., et al. (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl Acad. Sci. USA, 98, 10869–10874[Abstract/Free Full Text].

    Stefanski, LA. and Carroll, R.J. (1987) Conditional scores and optimal scores in generalized linear measurement error models. Biometrika, 74, 703–716[Abstract/Free Full Text].

    Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA, 100, 9440–9445[Abstract/Free Full Text].

    Storey, J.D., et al. (2005) Significance analysis of time course microarray experiments. Proc. Natl Acad. Sci. USA, 102, 12837–12842[Abstract/Free Full Text].

    Tsiatis, A.A. and Davidian, M. (2001) A semiparametric estimator for the proportional hazards model with longitudinal covariates measured with error. Biometrika, 88, 447–458[Abstract/Free Full Text].

    Tusher, V., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    van't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536[CrossRef][Medline].

    van de Vijver, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med, . 347, 1999–2009[Abstract/Free Full Text].

    Westfall, P. and Young, S. (1989) p-value adjustments for multiple tests in multivariate binomial models. J. Am. Stat. Assoc, . 84, 780–786[CrossRef].

    Xie, Y., et al. (2005) A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data. Bioinformatics, 21, 4280–4288[Abstract/Free Full Text].

    Yeung, K.Y., et al. (2003) Clustering gene-expression data with repeated measurements. Genome Biol, . 4, R34[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/21/2643    most recent
btl450v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rajicic, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Rajicic, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?