Skip Navigation


Bioinformatics Advance Access originally published online on June 16, 2005
Bioinformatics 2005 21(16):3377-3384; doi:10.1093/bioinformatics/bti544
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3377    most recent
bti544v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liu, H.
Right arrow Articles by Wong, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liu, H.
Right arrow Articles by Wong, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Use of extreme patient samples for outcome prediction from gene expression data

Huiqing Liu *, Jinyan Li and Limsoon Wong

Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODS
 EXPERIMENTS AND RESULTS
 DISCUSSION
 REFERENCES
 

Motivation: Patient outcome prediction using microarray technologies is an important application in bioinformatics. Based on patients' genotypic microarray data, predictions are made to estimate patients' survival time and their risk of tumor metastasis or recurrence. So, accurate prediction can potentially help to provide better treatment for patients.

Results: We present a new computational method for patient outcome prediction. In the training phase of this method, we make use of two types of extreme patient samples: short-term survivors who got an unfavorable outcome within a short period and long-term survivors who were maintaining a favorable outcome after a long follow-up time. These extreme training samples yield a clear platform for us to identify relevant genes whose expression is closely related to the outcome. The selected extreme samples and the relevant genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a risk score that falls into one of the special pre-defined risk groups. We apply this method to several public datasets. In most cases, patients in high and low risk groups stratified by our method have clearly distinguishable outcome status as seen in their Kaplan–Meier curves. We also show that the idea of selecting only extreme patient samples for training is effective for improving the prediction accuracy when different gene selection methods are used.

Contact: huiqing{at}i2r.a-star.edu.sg

Supplementary information: http://research.i2r.a-star.edu.sg/huiqing/supplementaldata/survival/survival.html


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 EXPERIMENTS AND RESULTS
 DISCUSSION
 REFERENCES
 
Currently, the risk of a cancer patient is mostly measured by various clinical factors such as size of the original tumor, extent of local invasion and spread to distant organs. However, in many cases, patients with a similar clinical diagnosis may have different responses to the same treatment. For example, for patients suffering from acute myeloid leukemia, the most common acute leukemia in adults, chemotherapy can induce a complete remission in 70–80% of younger patients, but many of them have a relapse and die of their disease (Bullinger et al., 2004). Though a stem-cell transplantation approach can prevent the relapse of this disease, this approach is associated with a high treatment-related mortality (Bullinger et al., 2004). Therefore, accurate outcome prediction methods are needed to personalize the treatment plan for each individual patient. Thus, improper treatment and subsequent severe sufferings of patients (e.g. decline in IQ, hormonal deficiency problems) or inefficient treatment that causes relapse can be avoided.

Microarray technology enables monitoring of disease progression and prediction of patient outcome at the molecular level. A few previous studies have shown promising results for survival prediction from gene expression profiles and clinical data for certain diseases (Rosenwald et al., 2002; Beer et al., 2002; van de Vijver et al., 2002; Yeoh et al., 2002; Bullinger et al., 2004). These studies have demonstrated to be useful for optimizing treatment plans for individual patient, and also have recommended candidate genes that may be useful for developing innovative therapies and generating opportunities for drug discovery.

In early approaches proposed for outcome prediction from gene expression profiles, the traditional Cox proportional hazards model (Cox, 1972; Lunn and McNeil, 1995) is usually used to select genes. Using this model, genes most related to survival are identified by a univariate Cox analysis and a risk score is then defined as a linear weighted combination of the expression values of the identified genes (Beer et al., 2002; Rosenwald et al., 2002). Recently, machine learning technologies are involved. For example, Ando and Katayama (2002) have proposed a fuzzy neural network system to predict the survival of patients using gene expression profiles as input; Park et al. (2002) have developed a method to co-relate gene expression data to patient survival time using a partial least squares regression technique; and Shipp et al. (2002) have employed a weighted voting algorithm to identify cured versus fatal for outcome prediction of diffuse large B-cell lymphoma (dataset of Rosenwald et al., 2002). In a more recent work (LeBlanc et al., 2003), a gene index technique has been introduced to identify the associations between gene expression levels and patient outcome. The core idea of their method is to combine the correlation between genes with patient outcome as well as class membership for the ranking. Very recently, a semi-supervised method has been proposed to make use of both clinical information and gene expression profile for outcome prediction (Bair and Tibshirani, 2004). In this method, a subset of genes whose Cox score exceeds a certain threshold are chosen, and then unsupervised learning techniques (clustering or principal components analysis) are applied to these genes to group patients into different risk categories. An important suggestion made by Bair and Tibshirani (2004) is that analyzing patients with different survival rates based on gene expression data would help in identifying subtypes of cancer.

In this paper, we present a new computational method for outcome prediction based on gene expression profiles. Different from all previous works, our idea to form the training data is novel. Our training data consists of only two types of extreme patient samples: short-term survivors who got an unfavorable outcome within a short period and long-term survivors who were maintaining a favorable outcome after a long follow-up time. We do not consider patient samples between the two extreme cases in the training. A reason to select these extreme patient samples for training is that the sharp contrast between short-term and long-term survivors should be more informative and reliable (than those medium-term cases) for building and understanding the relation between genes and outcome. The addition of the medium-term patient samples would bring in more noise and confusion signals.

To identify genes most associated with the outcome, we apply a two-phase feature selection method to the selected training data. The two-phase feature selection method combines an entropy measurement (Fayyad and Irani, 1993) and the Wilcoxon rank sum test method (Wilcoxon, 1945) for identifying those sharp discriminative genes. To construct a model for survival risk estimation, we train a linear kernel support vector machine (SVM) based on the selected training samples and the selected genes. When a patient sample is given for outcome prediction, we calculate a risk score by feeding the patient's expression profiles to the established model. Based on this score, we then assign this patient to one of pre-defined risk groups such as high risk, intermediate risk or low risk group. Explicit threshold values to categorize different risk groups can be easily obtained based on the training results, so that outcome prediction for new patients is possible.

We apply our method to three large datasets: a dataset consisting of 240 patients with diffuse large-B-cell lymphoma (Rosenwald et al., 2002), a dataset of 116 patients with adult acute myeloid leukemia (Bullinger et al., 2004) and a dataset of 295 patients with breast cancer (van de Vijver et al., 2002). The corresponding Kaplan–Meier plots illustrate that the patients assigned to different risk groups based on our risk score have significantly different outcome. To further examine the idea of the training sample selection, we use a different feature selection method, called SAM (significance analysis of microarrays) (Tusher et al., 2001), to find genes associated with outcome. Comparisons with results of this approach demonstrate again the effectiveness of our extreme training sampleselection idea.


    METHODS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 EXPERIMENTS AND RESULTS
 DISCUSSION
 REFERENCES
 
We first present the new idea of selecting extreme training samples. Then we describe how to identify outcome-relevant genes from these training samples. Then, we introduce a scoring function for patients' risk estimation and outcome prediction.

Selection of extreme patient samples for training
Since our focus is on the relationship between gene expression and outcome, two types of extreme cases—short-term survivors who got an unfavorable outcome within a short period and long-term survivors who were maintaining a favorable outcome after a long follow-up time—should be more valuable than those in the ‘medium-term’ status. i.e. we do not expect reliable prediction can come out from analyzing living patients whose available follow-up time is short. So, the training data used in our method consists of only these two types of samples. This idea is different from all previous approaches that always used all training samples.

Specifically for an experimental sample T, if its follow-up time is F(T) and status at the end of follow-up time is E(T), then

(1)
where, E(T) = 1 stands for ‘dead’ or an unfavorable outcome, E(T) = 0 stands for ‘alive’ or a favorable outcome, and c1 and c2 are two thresholds of survival time for selecting short-term and long-term survivors, respectively. Note that long-term survivors also include those patients who died after a very long period. The two thresholds, c1 and c2, can vary from disease to disease and from dataset to dataset. Our basic guide line for the selection of c1 and c2 is that the selected training data should contain enough training samples for learning algorithms: generally, we require that each class should have at least 10 samples, and the total number of extreme samples should be between one fourth and one half of all available samples.

Identification of relevant genes
We propose a two-phase feature selection method to identify genes expressed differentially between short-term and long-term survivors. In the first phase, we use an entropy-based feature selection method to identify those features whose expression statistically differ between the two types of extreme patient samples. In this phase, motivated by the study by Liu and Setiono (1995) that discretization has the potential to perform feature selection among numeric features, we apply a supervised discretization algorithm (Fayyad and Irani, 1993) to each of the genes. This algorithm partitions a value range of a numeric feature in a way such that each of the resulting intervals contains the same class of samples, as many as possible. For those features whose values are relatively randomly distributed between different classes of samples, the algorithm does not partition the value range, or, in other words, the feature can be only discretized to a single value. This means that those features do not make any significant contribution to the separation of the different classes of samples. Therefore, we discard them from our analysis. On the other hand, if a resulting value interval induced by the cut points of a feature contains only the same class of samples, then the partitioning of this feature has an entropy value of 0. This is an ideal case since the feature can clearly distinguish samples in the different classes. Figure 1 briefly illustrates the entropy measure, cut points and intervals. From our previous study (Li et al., 2003), this systematic method usually discretizes <10% of the original features. For details of the algorithm, interested readers are referred to Fayyad and Irani (1993) and our supplementary web site.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 1 We place the values of a feature on the horizontal axis. There are 13 samples in two classes, class 1 and class 2. (a) shows a feature that is a poor signal and there is no cut point found to distinguish samples in the different classes; (b) shows a feature that is a potentially good signal and indicates a possible cut point. (c) shows a feature that is the strongest signal and indicates a cut point—different resulting intervals contain samples of different class.

 
In the second phase, we use the Wilcoxon rank sum test (Wilcoxon, 1945) to narrow down the features selected in the first phase by selecting only those more sharply discriminating features. It is a kind of non-parametric test since it is based on the rank of samples rather than distribution parameters such as mean and standard deviation. Wilcoxon rank sum test is an alternative to t-test but has several advantages such as its good tolerance to outliers and its robustness to data transformation. These characteristics make the Wilcoxon rank sum test a favorable feature selection method in gene expression profile study (Park et al., 2001; Troyanskaya et al., 2002). Given a gene X with its test statistical measure w(X) calculated by the Wilcoxon rank sum test, if w(X) falls in the interval [clower, cupper], where clower and cupper are the lower and upper critical test values, then X is removed from further consideration. Otherwise, gene X is selected because it rejects the null hypothesis, and thus its expression values are significantly different between the two classes. In the calculation of the two critical values clower and cupper, a significant level of 5 or 1% is generally used. We use 5% in this paper. A description of the method can be found at our supplementary web site.

Construction of an SVM scoring function
We propose a new scoring function to estimate the outcome for every patient. This scoring function is based on SVM (Vapnik, 1995). The implementation of our SVM is by Weka (version 3.2), available at http://www.cs.waikato.ac.nz/ml/weka. In our case, the SVM regression function G(T) is a linear combination of the expression values of the identified genes:

(2)
where the vectors x(i) are the support vectors (samples), yi are the class labels (1 and –1 used here) of x(i), vector T represents a test sample, and {alpha}i and b are numeric parameters that can be determined from the training data.

We map the class label ‘short-term survivors’ to 1 and ‘long-term survivors’ to –1. If G(T) > 0, then the sample T is more likely to be a ‘short-term survivor’. If G(T) < 0, then the sample T is more likely to be a ‘long-term survivor’. To transform the output of G(T) into probability-like values, we use a standard sigmoid function S(T) defined as

(3)
So, S(T) is in the range (0,1). Also note that the smaller the S(T) value is, the better outcome the patient T will have. We term S(T) as the risk score of T.

If one only categorizes patients into high risk or low risk groups, the value 0.5 is a natural cutoff for S(T). In other words, if S(T) > 0.5 then the patient T will be assigned to the high risk group; otherwise, the patient will belong to the low risk group. If more than two risk groups are considered—such as high, intermediate and low—then other cutoffs can be determined based on the risk scores of the training samples. E.g. in training set, if most of the short-term survivors have a risk score greater than r1 and most of the long-term survivors have a risk score smaller than r2, then,

(4)
In general, we require r1 > 0.5, r2 < 0.5; the selection of the precise values of r1 and r2 can be guided by the risk scores of the training samples.

To visualize the probability of survival of all patients in different risk groups, we draw Kaplan–Meier curves (Altman, 1991) for all the groups. A point in a survival curve indicates the survival fraction (or percentage) of the patients in the group at a specific time. In this study, the Kaplan–Meier survival curves are generated by GraphPad Prism (http://www.graphpad.com). To compare the survival characteristics between different risk groups, log-rank test is used. The log-rank test generates a P-value to test the null hypothesis that the survival curves are not different between two groups. The meaning of P-value is that ‘if the null hypothesis is true, what is the probability of randomly selected samples whose survival curves are not different from those actually obtained’? So if the P-value is small, the difference between groups is statistically significant. In this paper, we report P-value at 95% confidence interval.


    EXPERIMENTS AND RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 EXPERIMENTS AND RESULTS
 DISCUSSION
 REFERENCES
 
This section reports our results on three public datasets. To demonstrate the high effectiveness of our extreme sample selection method, we also report good outcome prediction results obtained by using a different feature selection method (instead of our two-phase feature selection method) to pick up important features from extreme training samples for constructing the SVM model. This feature selection method is called SAM (Significance Analysis of Microarrays) (Chu et al., 2004), which is a software developed at Stanford University (http://www-stat.stanford.edu/~tibs/SAM/). See our supplementary web page for more information about SAM.

Diffuse large-B-cell lymphoma
Survival for diffuse large-B-cell lymphoma (DLBCL) patients after chemotherapy has been previously studied by Rosenwald et al. (2002) based on gene expression profiling and a Cox proportional hazards model. In that study, expression profiles of biopsy samples from 240 patients are used. The data include a preliminary group consisting of 160 patients and a validation group of 80 patients, each of them is described by 7399 microarray features.

We set c1 = 1 year and c2 = 8 years in Formula (1) to select short-term and long-term survivors from the preliminary 160-patient group. There are in total 47 short-term survivors and 26 long-term survivors. So, our training set consists of only 73 samples. From these 73 extreme patient samples, we identified 84 features that are related to patient survival status by using our two-phase feature filtering method. Interestingly, some of the selected genes are also listed in the Table 1 of the study by Rosenwald et al. (2002) where significant survival-associated genes are reported and previously studied. The common ones include AA805575 (GenBank accession no.) in germinal-center B-cell signature, X00452 and M20430 in MHC class II signature, and D87071 in lymph-node signature. Some other top-ranked genes (with smaller entropy value) in our list also have clear gene signatures. For example, BC012161, AF061729 and U34683 are in proliferation signature, BF129543 is in germinal-center B-cell signature, and K01144 and M16276 are in MHC class II signature.


View this table:
[in this window]
[in a new window]
 
Table 1 Comparison of the P-value (of log-rank test) obtained by applying different gene identification schemes and sample selection methods to the DLBCL data

 
We constructed a good SVM model based on the 73 extreme training samples and the 84 discriminative features. This SVM can completely separate the 47 short-term survivors and 26 long-term survivors; the lowest risk score assigned to the short-term survivors by the model is >0.7, and most of the long-term survivors has a risk score <0.3. In Rosenwald et al. (2002) the 80 validation samples are stratified according to the quartiles of the scores with each of quartiles consisting of 20 patients (P < 0.001). To compare our results with those achieved by Rosenwald et al. (2002) we also partition patients into four risk groups but in a different way, defined as

(5)
where the threshold 0.5 is the mean value of 0.7 and 0.3. The overall survival Kaplan–Meier curves of the four risk groups are plotted in Figure 2 for the 80 validation samples.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 2 Kaplan–Meier plots illustrate the estimation of overall survival among four different patient risk groups for a DLBCL study. A tick mark on the plot indicates that one sample is censored at the corresponding time.

 
We can see that the 5-year survival rates for the high risk and low risk groups are clearly distinguishable. Though there is no distinct overall survival between the two intermediate groups, the 5-year survival rates of these two groups are obviously different from that in the high risk group or the low risk group. This suggests that three groups would be appropriate for these DLBCL samples. So in the rest of this study, we merge intermediate-high and intermediate-low risk patients into a single intermediate risk category. Figure 3 shows Kaplan–Meier curves where we can see a significant survival difference for patients in each of our risk categories, for the 80 testing samples, for the 167 validation samples [80 testing samples plus 87 (160–73) ‘non-extreme’ samples in the original training set] and for the total 240 samples.



View larger version (8K):
[in this window]
[in a new window]
 
Fig. 3 Kaplan–Meier plots illustrate the estimation of survival among different risk groups for a DLBCL study. (A) for the 80 testing samples, (B) for the 167 validation samples (80 testing samples plus 87 ‘non-extreme’ samples in the original training set) and (C) for the entire 240 samples.

 
Other results on the validation samples obtained from this dataset are reported in Table 1. These include P-values of the following tests: (1) using all features, features selected in Phase I, features selected by our two-phase method and features selected by SAM based on the extreme patient samples; (2) the above feature selection methods but based on the original training samples (taking status as class labels). From these results, we can see that the use of all training samples irrespective of the extreme cases cannot achieve a good P-value no matter which feature selection method is applied. For more information, interested readers are referred to Figure F1 of our supplementary information to see Kaplan–Meier survival curves of these experiments (only for those with training sample selection). By the way, on the same dataset, Bair and Tibshirani (2004) achieved P = 0.00124 by categorizing the patients into two risk groups using a semi-supervised machine learning approach.

Breast cancer
Currently, breast cancer patients with the same stage of disease have markedly different treatment responses and overall outcome (van't Veer et al., 2002). The widely used clinical predictors for metastases, such as lymph node status and histological grade, cannot provide accurate classifications for the tumors. Thus, more accurate methods of prognostication in breast cancer are needed to improve the selection of patients for adjuvant systemic therapy (van de Vijver et al., 2002).

A comprehensive study for the prediction of the time to metastasis in breast cancer has been conducted by van't Veer et al. (2002) where a 70-gene prediction model has been developed using gene expression profile of 78 breast cancer patients. Those important genes were identified from >5000 genes using a complicated method. The method has the following steps: (1) calculating the correlation coefficient of the expression for each gene with disease outcome and sorting the magnitude of the correlation coefficient to form a rank-ordered list, (2) sequentially adding subsets of five genes from the top of the list to the classification model, (3) evaluating the model using leave-one-out cross validation and (4) repeating the steps (2) and (3) until an optimal number of marker genes is reached. This 70-gene prediction model was re-used later in a separate study by van de Vijver et al. (2002) for analyzing a bigger dataset of 295 breast cancer patients.

In our study, we use van de Vijver's dataset. Note that this dataset contains the 61 lymph-node-negative patients of van't Veer's dataset. We conduct two kinds of analyses: metastasis and survival.

Metastasis prediction for breast cancer patients
Distant metastases are defined as a first event to indicate a treatment failure, and the data on patients is usually analyzed from the date of surgery to the time of the first event (i.e. distant metastases or death) or the date when the data is censored (van de Vijver et al., 2002). Patients involved in metastasis study include those who had had distant metastases as a first event within 5 years and those who had remained free of disease for at least 5 years.

To select extreme cases, we set c1 = 3 years and c2 = 10 years in Formula (1). A total of 52 short-term survivors (i.e. who had distant metastases within 3 years) and 76 long-term survivors (i.e. who remained free of disease at least 10 years) are among the 295 patients. As there is no independent validation data in this dataset, we randomly selected 40 samples from each of these two types extreme cases to form our training set, and all the remaining samples (215 samples) are treated as validation data. We identified 9 genes from the 70 available genes based on the 80 training samples. The nine genes are all selected by the entropy method in Phase I, and the Wilcoxon rank sum test does not remove any of them in Phase II. The constructed SVM model assigns a validation sample T to the high risk group if the risk score S(T) > 0.5, or otherwise to the low risk group if S(T) ≤ 0.5. From the Kaplan–Meier curves drawn in Figure 4, we can see a significant difference in the probability that patients would remain free of distant metastases between the high and low risk groups of patients.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 4 Kaplan-Meier plots show the probability that patients would remain metastasis-free among different risk groups for a breast cancer study. (A) for the 80 training samples, (B) for the 215 validation samples, and (C) for all the 295 samples.

 
Results of using different gene identification schemes and our selected training samples are also good for this study—the use of all 70 genes or 31 SAM-selected genes can also achieve very small P-value (<0.0001) on the validation samples. Please refer to Figure F2 of our supplementary information to see Kaplan–Meier survival curves of these two tests. By the way, we find six common features selected by both our method and SAM. They are Contig38288_RC, Contig55725_RC, NM_020974, NM_003981, NM_016359 and X05610.

Our result obtained on this dataset is not directly comparable to that obtained by van de Vijver et al. (2002) because the 70-gene model they used was built on van't Veer's data. They have reported a good result of P < 0.001 on these 295 samples. As mentioned, these 295 patients include 61 (of 78) training samples with lymph-node-negative. In a study on the same data, Bair and Tibshirani (2004) selected only 5 genes from 70 candidates using those 78 training samples. With a proposed supervised principal components method, they have achieved P < 0.001 for all the 295 patients and P = 0.00328 for 234 patients excluding those used for training. The reason that we do not follow the same training and validating strategy is that we cannot find clear indications in van de Vijver's dataset for the 61 training samples.

Survival prediction for breast cancer patients
Besides the probability of remaining free of distant metastases, we also analyze the overall survival of breast cancer patients using gene expression profile of these 295 samples.

To select extreme cases, we set c1 = 5 years and c2 = 10 years in Formula (1). Thus 48 short-term survivors and 83 long-term survivors are found among the 295 patients. Similar to the metastases study, we randomly selected 30 samples from each of these two types of extreme cases to form our training set, all the remaining samples (235 samples) are treated as validation data. We identified 16 genes based on the 60 selected training samples. The constructed SVM model assigns a validation sample T to the high risk group if the risk score S(T) > 0.5, or otherwise to the low risk group if S(T) ≤ 0.5. From the Kaplan–Meier curves drawn in Figure 5, we can see a significant difference in overall survival between the high and low risk groups of patients.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 5 Kaplan–Meier plots show the overall survival among different risk groups for a breast cancer study. (A) for the 60 training samples, (B) for the 235 validation samples and (C) for all the 295 samples.

 
Results of using different gene identification schemes and our selected training samples are also good for this study—the use of all 70 genes or 35 SAM-selected genes can also achieve very small P-value (<0.0001) on the validation samples. Please refer to Figure F3 of our supplementary information to see Kaplan–Meier survival curves of these two tests. In addition, we find 14 common genes selected by both our method and SAM. They are NM_007203, NM_005915, Contig38288_RC, Contig55725_RC, Contig46223_RC, NM_020974, NM_016577, Contig35251_RC, NM_014791, NM_003981, NM_006681, X05610, NM_000849 and Contig56457_RC.

Adult acute myeloid leukemia
Currently, the prognostic indicators to identify the appropriate therapy for acute myeloid leukemia (AML) patients include age, cytogenetic findings, the white-cell count and the presence or absence of recurrent cytogenetic aberrations (Bullinger et al., 2004). However, these factors do not fully reflect the molecular heterogeneity of the disease and treatment stratification is difficult. Thus, predictors built on gene expression profiles are expected to accurately predict the clinical outcome at molecular level so that appropriate treatment can be tailored for individual patient.

Bullinger et al. (2004) have studied gene expression in peripheral-blood samples or bone marrow samples from 116 adults with AML and identified new molecular subtypes of AML by unsupervised hierarchical clustering analysis. They have randomly divided these 116 samples into a training set containing 59 samples and a test set containing 57 samples—with a similar number of samples in each set are from patients who had died. In the training set, 26 patients were alive with follow-up time from 138 to 1625 days, and 33 were dead with follow-up time from 1 to 730 days.

To select extreme cases, we set c1 = 1 year and c2 = 2 years in Formula (1). A total of 29 short-term survivors and 8 long-term survivors are found among the 59 training samples. So, our training set consists of only these 37 samples. From these 37 extreme patient samples, we identified 50 features that are related to patient survival status by using our feature filtering method. The constructed SVM model assigns a validation sample T to the high risk group if the risk score S(T) > 0.5, or otherwise to the low risk group if S(T) ≤ 0.5. The Kaplan–Meier curves in Figure 6 shows a significant difference in overall survival between the high and low risk groups of patients: for the 57 testing samples, for the 79 validation samples (including 22 ‘non-extreme’ cases in the original training set) and for the entire 116 samples.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 6 Kaplan–Meier plots estimate the overall survival among different risk groups for an adult acute myeloid leukemia study. (A) for the 57 testing samples, (B) for the 79 validation samples (57 testing samples plus 22 ‘non-extreme’ cases in the original training set) and (C) for the entire 116 samples.

 
For this dataset, we also obtained the results of using different feature selection schemes or without training sample selection. In Table 2, P-value for each of these tests are listed. Kaplan–Meier survival curves for some of these experiments can be found in Figure F4 of our supplementary information. Generally, we use similar number of SAM-selected genes as that selected by our method.


View this table:
[in this window]
[in a new window]
 
Table 2 Comparison of the P-value (of log-rank test) obtained by applying different gene identification schemes and sample selection methods to theAML data

 
On the same dataset, Bullinger et al. (2004) has applied 149 SAM-selected cDNAs that were identified from all training samples, and a clustering method to estimate the outcome. They have reported a good result (P = 0.006) on overall survival of the patients in their poor-outcome and good-outcome groups. We tried to feed the same number of SAM-selected features to our model, but we only obtained a result of p = 0.0133 by using all training samples (see results in Table 2). However, we achieved better performance by using our selected training samples. In addition, our results are also better than that (P = 0.00136) reported by Bair and Tibshirani (2004) on the same dataset.


    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 EXPERIMENTS AND RESULTS
 DISCUSSION
 REFERENCES
 
Recall that we selected only long-term and short-term survivors for training the prediction models. Table 3 lists the size change from the original training samples to our selected training set for the DLBCL and AML applications. Oberve that our method uses roughly half of the samples as training. As already shown in Table 1 and Table 2, these informative training samples can indeed make performance improvement, even using SAM for gene selection.


View this table:
[in this window]
[in a new window]
 
Table 3 Number of samples in original training data and selected training set of the DLBCL and AML datasets

 
As discussed in the section METHOD, we have some basic guidelines to determine the thresholds c1 and c2 that are defined in Formula (1). Bearing these minimum constraints in mind, we have tried several different c1 and c2 values in our study. In Table 4, P-values (of the log-rank test) associated with the Kaplan–Meier survival curves on validation samples under different selections of the c1 and c2 from DLBCL study are listed. All results are based on the selected genes using our gene identification scheme. We can see that for a range of c1 and c2 (i.e. c1 < 3 years and c2 ≥ 7 years), we can achieve better predictions by selecting extreme samples. In any case, the selection of c1 and c2 can be further refined by running cross-validation on training samples.


View this table:
[in this window]
[in a new window]
 
Table 4 Results of using different thresholds c1 (years) and c2 (years) in training sample selection on DLBCL study.

 
To demonstrate the effectiveness of selecting extreme samples, we have also done following tests. (1) The use of only ‘non-extreme’ samples in the original training set to build prediction model. As expected, the results are not good. For example, in DLBCL study, there are 87 ‘non-extreme’ samples left after we select 73 extreme cases from 160 samples in the preliminary group of the data. When we use these samples to train model, we get P = 0.4481 (40 features selected by our method) and P = 0.5887 (all genes) on 80 validation cases. (2) Incorporation of the idea of transductive SVM (tSVM) to include those "non-extreme" cases into training data as unlabeled samples to build prediction model. The results are not better than those we presented above. In the AML study, tSVM achieves P = 0.0574 (50 features selected by our method), P = 0.0487 (top 100 features selected by SAM) and P = 0.0468 (all genes) on the 57 validation samples. In the DLBCL study, tSVM achieves P = 0.0044 (84 features selected by out method), and P = 0.0113 (all genes) on the 80 validation samples. The software we used is SVMlight (version 6.01, http://svmlight.joachims.org/).

According to our experience on gene expression data analysis, the entropy measure can filter out about 90–95% of total number of genes (Li et al., 2003). This point has been verified again in our outcome prediction: for example, the entropy measure retains only 61 features (out of total 6283 candidates) in AML study. After further being filtered by Wilcoxon rank sum test, only 50 of them are kept to build the prediction model. Most importantly, these selected genes achieve better experimental performance—the use of only Wilcoxon rank sum test to select features (i.e. no Phase I) will lead to a larger number of selected genes but worse results. For example, if we only apply rank sum test to the 70 genes provided by the breast cancer data, 40 genes will be selected for metastasis prediction and 42 genes selected for survival prediction. With these larger number of genes, the P-values for testing on validation samples are only 0.0004 (metastasis) and 0.0007 (survival). This observation indicates that only applying rank sum test may not be powerful enough to reduce the number of genes and thus, we use it at second phase after more than 90% genes have been removed by the entropy-based algorithm. Table 5 shows in DLBCL and AML studies, the number-change trend of features from original data, to the entropy selection (Phase I) and to Wilcoxon rank sum test (Phase II), as well as to applying only the rank sum test. It can be seen that the feature reduction is mostly by the entropy selection.


View this table:
[in this window]
[in a new window]
 
Table 5 Number of genes left after feature filtering for each phase of our gene identification scheme and for only applying Wilcoxon rank sum test (i.e. RSTOnly) in the DLBCL and AML studies

 
In the current study, a simple linear kernel SVM is trained on the selected samples and genes to build a scoring model. The model then assigns each validation sample a risk score to predict the patient outcome. Based on the training results, we can derive explicit thresholds (e.g. 0.5, 0.3 and 0.7) of our risk score to categorize patients into different risk groups. Thus, when a new case comes, we are able to assign it to the corresponding risk group easily according to its risk score.

In summary, we have applied statistical and machine learning technologies to predict patient outcome from gene expression profiles and clinical information. Different from other works, we pick out extreme cases to form the training set, consisting of only short-term survivors who died within a short period and long-term survivors who were still alive after a relevant long follow-up time. Our in-silico experimental results on three public gene expression datasets have demonstrated the high effectiveness ofour idea.

We have some ongoing studies. (1) Datasets from other tumors are under analysis. (2) In order to obtain a more refined set of genes, some matrices to measure the correlation between the genes selected by our two-phase filtering method are under testing—correlation test will be conducted within each of the two groups of genes left by the Wilcoxon rank sum test (i.e. one group contains genes whose statistical measure is less than the lower bound critical value and one group contains genes whose statistical measure is larger than the upper bound critical value). (3) Direct prediction of patient survival time using regression algorithms. But one concern is that those alive patients with short-term follow-ups may not be useful for this direct regression approach.


    Acknowledgments
 
The authors would like to acknowledge the two anonymous reviewers for their many valuable comments on the manuscript.

Conflict of Interest: none declared.

Received on January 4, 2005; revised on May 12, 2005; accepted on June 14, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODS
 EXPERIMENTS AND RESULTS
 DISCUSSION
 REFERENCES
 

    Altman, D.B. Practical Statistics for Medical Research, (1991) Chapman and Hall.

    Ando, T. and Katayama, M. (2002) Selection of causal gene sets from transcriptional profiling by FNN modeling and prediction of lymphoma outcome. Proceedings of the 13th Intl. Conf. Genome Informatics , pp. , pp. 278–279 http://www.jsbi.org/journal/GIW02/GIW02P041.pdf.

    Bair, E. and Tibshirani, R. (2004) Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol., 2, 0511–0522.

    Beer, D.G., et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816–823[ISI][Medline].

    Bullinger, L., et al. (2004) Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. NEJM, 350, 1605–1616[Abstract/Free Full Text].

    SAM user guide and technical document Chu, G., Narasimhan, B., Tibshirani, R., Tusher, V.G. (2004) .

    Cox, D.R. (1972) Regression models and life-tables (with discussion). J.R. Stat. Soc., B34, 187–220.

    Fayyad, U. and Irani, K. (1993) Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence , pp. 1022–1029.

    LeBlanc, M., et al. (2003) Directed indices for exploring gene expression data. Bioinformatics, 19, 686–693[Abstract/Free Full Text].

    Li, J., Liu, H., Wong, L. (2003) Mean-entropy discretized features are effective for classifying high-dimensional biomedical data. Proceedings of the 3rd ACM SIGKDD Workshop on Data Mining , pp. 17–24 http://www.cs.rpi.edu/~zaki/BIOKDD03/proceedings/4-li.pdf.

    Liu, H. and Setiono, R. (1995) Chi2: feature selection and discretization of numeric attributes. Proceedings of 7th IEEE International Conference on Tools with Artificial Intelligence , pp. 338–391 http://ieeexplore.ieee.org/iel2/3474/10227/00479783.pdf.

    Lunn, M. and McNeil, D.R. (1995) Applying Cox regression to competing risks. Biometrics, 51, 524–532[CrossRef][ISI][Medline].

    Park, P.J., et al. (2001) A non-parametric scoring algorithm for identifying informative genes from microarray data. Pac. Symp. Biocomput., 52–63.

    Park, P.J., et al. (2002) Linking gene expression data with patient survival times using partial least squares. Bioinformatics, 18, Suppl 1, S120–S127[Abstract].

    Rosenwald, A., et al. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. NEJM, 346, 1937–1947[Abstract/Free Full Text].

    Shipp, M.A., et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med., 8, 68–74[CrossRef][ISI][Medline].

    Troyanskaya, O.G., et al. (2002) Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18, 1454–1461[Abstract/Free Full Text].

    Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    Vapnik, V.N. The Natural of Statistical Learning Theory, (1995) Springer.

    van de Vijver, M.J., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. NEJM, 347, 1999–2009[Abstract/Free Full Text].

    van't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536[CrossRef][Medline].

    Wilcoxon, F. (1945) Individual comparisons by ranking methods. Biometrics, 1, 80–83[CrossRef][ISI].

    Yeoh, E.-J., et al. (2002) Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, 133–143[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
JCOHome page
A. M. Glas, L. Knoops, L. Delahaye, M. J. Kersten, R. E. Kibbelaar, L. A. Wessels, R. van Laar, J. H. J.M. van Krieken, J. W. Baars, J. Raemaekers, et al.
Gene-Expression and Immunohistochemical Study of Specific T-Cell Subsets and Accessory Cell Types in the Transformation and Prognosis of Follicular Lymphoma
J. Clin. Oncol., February 1, 2007; 25(4): 390 - 398.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3377    most recent
bti544v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Liu, H.
Right arrow Articles by Wong, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Liu, H.
Right arrow Articles by Wong, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?