Bioinformatics Advance Access originally published online on August 22, 2006
Bioinformatics 2006 22(21):2635-2642; doi:10.1093/bioinformatics/btl442

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Eigengene-based linear discriminant model for tumor classification using gene expression microarray data

Ronglai Shen 1, Debashis Ghosh 1, Arul Chinnaiyan 2 and Zhaoling Meng 3,*

1 Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-0602, USA
2 Department of Pathology and Urology, University of Michigan, Ann Arbor, MI 48109-0602, USA
3 Biostatistics and Programming, Sanofi-Aventis, PO Box 6800, Bridgewater, NJ 08807-0800, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: The nearest shrunken centroids classifier has become a popular algorithm in tumor classification problems using gene expression microarray data. Feature selection is an embedded part of the method to select top-ranking genes based on a univariate distance statistic calculated for each gene individually. The univariate statistics summarize gene expression profiles outside of the gene co-regulation network context, leading to redundant information being included in the selection procedure.

Results: We propose an Eigengene-based Linear Discriminant Analysis (ELDA) to address gene selection in a multivariate framework. The algorithm uses a modified rotated Spectral Decomposition (SpD) technique to select ‘hub’ genes that associate with the most important eigenvectors. Using three benchmark cancer microarray datasets, we show that ELDA selects the most characteristic genes, leading to substantially smaller classifiers than the univariate feature selection based analogues. The resulting de-correlated expression profiles make the gene-wise independence assumption more realistic and applicable for the shrunken centroids classifier and other diagonal linear discriminant type of models. Our algorithm further incorporates a misclassification cost matrix, allowing differential penalization of one type of error over another. In the breast cancer data, we show false negative prognosis can be controlled via a cost-adjusted discriminant function.

Availability: R code for the ELDA algorithm is available from the authors upon request.

Contact: zhaoling.meng@sanofi-aventis.com

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Gene expression microarrays have provided a high-throughput platform to discover genomic biomarkers for cancer diagnosis and prognosis. Many gene expression signatures have been identified in recent years for accurate classification of tumor subtypes or for prognosis of patient survival outcome (Alizadeh et al., 2000; Khan et al., 2001; van't Veer et al., 2002; Rhodes et al., 2004; Chang et al., 2005). In drug development, microarrays have also become a useful tool to screen for surrogate markers that are indicative of early drug efficacy (Gunther et al., 2003). A variety of classification methods have been used in this context. Linear discriminant analysis (LDA), which is based on finding linear combinations of the gene expression profiles, has recently been applied to such high-dimensional data. Studies have demonstrated favorable classification performance of LDA models when compared with more complicated and computationally intensive algorithms such as neural networks and aggregated tree methods (Tibshirani et al., 2002; Dudoit et al., 2002). One well-known example is the nearest Shrunken Centroids (SC) algorithm (Tibshirani et al., 2002). The classifier assigns a test sample to the class whose shrunken centroid is nearest in standardized squared distance, providing a highly interpretable and efficient algorithm for microarray data. It is equivalent to a linear discriminant method that further assumes a diagonal covariance matrix (gene-wise independence), which reduces the data dimension and simplifies the discriminant function for computational efficiency. The authors showed that the SC classifier outperformed the neural network method of Khan et al. (2001) on the Small Round Blue Cell Tumor (SRBCT) data.

For classification using gene expression microarray data, feature selection is a necessary step to deal with the large p (number of genes) and small n (sample size) problem. Many studies use univariate statistics, considering one gene at a time, to exclude non-informative genes for classification. For instance, the SC algorithm excludes noisy genes if the standardized between-class distance measure is below a shrinkage threshold value. Dudoit et al. (2002) pre-select a set of top-ranking genes based on the between- and within-group variance ratios to build the classifier. However, univariate selection can be an inadequate approach from both statistical and biological points of view. First, the most discriminatory genes identified individually do not necessarily constitute the best classifier when put together (Ein-Dor et al., 2005; Dabney and Storey, 2005). Second, it is biologically unreasonable to examine gene expression profiles in isolation, given that genes function in a co-regulatory network. A greedy search algorithm that ignores the gene–gene correlations, such as a univariate selection, tends to include elements contributing highly redundant information.

In contrast to using feature selection to exclude non-informative genes for classification, a principal component analysis (PCA) reduces dimensionality by capturing the variance of the entire dataset in terms of the first few principal components (PCs). The method projects the original expression data onto a substantially reduced space, where the resulting eigenvectors are orthonormal superpositions of the genes or arrays and are used as the ‘new’ variables for analyses (Alter et al., 2000). The disadvantages of PCA are that (1) it achieves little dimension reduction in the original gene space, since all the genes are used, (2) each PC is a linear combination of all the genes and thus carries no immediate biological interpretation and poses difficulties for downstream gene annotation and validation studies and (3) the resulting PCs are obtained in an unsupervised fashion and therefore may be irrelevant for classifying the outcome.

We propose an Eigengene-based Linear Discriminant Analysis (ELDA) to address feature selection in a multivariate framework for linear discriminant models. For a set of discriminatory genes, we apply a modified rotated Spectral Decomposition (SpD) technique (Meng et al., 2003) to select genes contributing primarily to the most important eigenvectors. Meng et al. (2003) employed an analogous idea in genotyping studies of single-nucleotide polymorphism (SNP) markers closely positioned on the chromosome. The authors demonstrated, using both simulations and experimental datasets, that the genotyping requirement was significantly reduced by such a dimension reduction approach while maintaining the genetic information content throughout the region. In a gene co-expression analysis context, Horvath et al. (2006) defined a ‘hub’ gene of an expression module to be a gene with strong intramodule connectivity, a definition analogous to that of a ‘hub’ node in protein–protein interaction networks. The connectivity of a gene was computed as the sum of its pair-wise weighted correlations with all other genes in the same module, and used to evaluate the capability of each gene to summarize the expression variation of that functional module. We will discuss in the Methods section how the ELDA feature selection relates to a ‘hub’ gene selection.

In the algorithm, we have also incorporated a misclassification cost matrix to allow differential penalization of various types of classification errors. For a highly heterogeneous disease such as breast cancer, patients vary significantly in treatment responses and overall outcome. As pointed out by van't Veer et al. (2002), when assigning patients to adjuvant therapy, a false negative error (predicting poor prognosis as good prognosis) is a more costly misclassification than a false positive error (predicting good prognosis as poor prognosis). False negative prognoses lead to postponed treatment or even failure to treat patients with poor outcome, as the patient is erroneously predicted to do well. Classification with unequal costs is also important in drug development studies. For compound screening using a gene expression signature, errors made in classifying a non-efficacious response into the efficacious response group will cause false positive hits to be carried on to large, expensive clinical trials. On the other hand, errors made in classifying a sample from the efficacious response group into the non-efficacious response group will lead to false negatives, so that a promising compound is passed over without further investigation. Under different circumstances, one type of error will be considered more expensive than the other. We will show that a cost-adjusted classification algorithm allows flexibility in these contexts.

We consider three cancer microarray datasets: the SRBCT data used in Tibshirani et al. (2002), the leukemia data analyzed in Golub et al. (1999), and the breast cancer data from van't Veer et al. (2002). All three studies employed a diagonal linear discriminant type of method coupled with a univariate feature selection scheme. We report in the Results section that ELDA classifiers use a substantially smaller number of genes than their univariate selection-based counterparts. Using the breast cancer prognosis data as a motivating example, we show that a desired sensitivity and specificity can be obtained by adjusting the misclassification cost matrix.


    2 METHODS
2.1 The shrunken centroids method
We start by reviewing the SC method of Tibshirani et al. (2002). Let $x_{ij}$ denote the (pre-processed) expression value of gene i in sample j. For a K-class problem, let $\bar{x}_{ik}$ be the k-th class centroid and $\bar{x}_i$ be the overall centroid for gene i. A standardized distance between the class k centroid and the overall centroid is computed as

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)} \qquad (1)$$
where $s_i$ is the pooled within-class standard deviation of gene i, $s_0$ is an ad hoc constant added to deal with the large number of genes, and $m_k$ is the class-size standardizing constant of Tibshirani et al. (2002). The SC method imposes a shrinkage on $d_{ik}$ to obtain a shrunken distance

$$d'_{ik} = \mathrm{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_+ \qquad (2)$$
where + means positive part ($t_+ = t$ if $t > 0$ and zero otherwise). The soft-thresholding parameter $\Delta$ covers a wide range of values for a varying amount of noise reduction. In particular, the method selects a subset of genes with non-zero $d'_{ik}$, denoted as $S_\Delta$, whose size decreases as the shrinkage threshold $\Delta$ increases. Replacing $d_{ik}$ with $d'_{ik}$ in (1) and solving for the class centroid, the shrunken centroid for gene i in class k can be written as

$$\bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0) \, d'_{ik} \qquad (3)$$
which is considered a de-noised version of the original centroid $\bar{x}_{ik}$, shrunken toward the overall mean $\bar{x}_i$. The classification rule is then to assign a test sample $x^*$ to the class k for which the standardized squared distance of the test sample expression vector on $S_\Delta$ to the k-th shrunken centroid is the minimum, i.e.

$$\delta_{SC}(x^*) = \arg\min_k \left\{ \sum_{i \in S_\Delta} \frac{(x_i^* - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k \right\} \qquad (4)$$
where $\pi_k$ is the sample proportion of class k. The soft-thresholding approach serves a dual purpose: (1) it builds in an inherent feature selection that excludes noisy genes with non-significant mean differences across classes and (2) it shrinks the class centroids toward the overall centroid, which is intended to avoid overfitting.
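
The soft-thresholding steps reviewed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' R code: the function names are hypothetical, the class-size factor `m` uses the form sqrt(1/n_k + 1/n) as printed in Tibshirani et al. (2002), and `s0` defaults to the median of the pooled standard deviations, which is one common choice rather than something stated in this article.

```python
import numpy as np

def shrunken_centroids(X, y, delta, s0=None):
    """Fit a shrunken-centroid model. X: genes x samples, y: class label per sample."""
    y = np.asarray(y)
    classes = np.unique(y)
    n = X.shape[1]
    overall = X.mean(axis=1)                                   # overall centroid per gene
    cents = np.stack([X[:, y == k].mean(axis=1) for k in classes], axis=1)
    resid = X - cents[:, np.searchsorted(classes, y)]          # deviation from own class centroid
    s = np.sqrt((resid ** 2).sum(axis=1) / (n - len(classes))) # pooled within-class sd
    if s0 is None:
        s0 = np.median(s)                                      # assumed default for s_0
    n_k = np.array([(y == k).sum() for k in classes])
    m = np.sqrt(1.0 / n_k + 1.0 / n)                           # assumed form of m_k
    scale = s + s0
    d = (cents - overall[:, None]) / (m[None, :] * scale[:, None])
    d_shr = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)    # soft-thresholding
    shrunk = overall[:, None] + m[None, :] * scale[:, None] * d_shr
    selected = np.flatnonzero(np.any(d_shr != 0, axis=1))      # genes with a non-zero shrunken distance
    return {"classes": classes, "centroids": cents, "shrunken": shrunk,
            "scale": scale, "selected": selected, "priors": n_k / n}

def sc_predict(model, x_new):
    """Assign x_new (one value per gene) to the class with the nearest shrunken centroid."""
    idx = model["selected"]
    z = (x_new[idx, None] - model["shrunken"][idx]) / model["scale"][idx, None]
    score = (z ** 2).sum(axis=0) - 2.0 * np.log(model["priors"])
    return model["classes"][np.argmin(score)]
```

Raising `delta` empties more rows of the shrunken distance matrix, so the selected set shrinks while the retained centroids move toward the overall mean.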

2.2 The nearest centroids method coupled with shrinkage selection ($NCS_\Delta$)
In the SC method, one function of the soft-thresholding is to impose an increasing amount of shrinkage on the class centroids. In particular, $\Delta$ can range from zero, which reduces $\bar{x}'_{ik}$ to the original class centroid $\bar{x}_{ik}$, to a large value such that all K class centroids are forced to equal the overall mean $\bar{x}_i$. As $\Delta$ increases, the class centroids are forced to become more similar to each other for the purpose of ‘de-noising’ $\bar{x}_{ik}$ given a large number of genes. However, this can result in conservatism in the choice of $\Delta$ and hence an intrinsic preference for a larger classifier. Dabney (2005) made a similar argument and proposed that shrinking centroids across genes rather than across classes could offer a greater gain for classification. Given these considerations, we propose to use a nearest centroids (NC) based discriminant score coupled with the shrinkage feature selection, denoted as $NCS_\Delta$, as the base classification method for the ELDA algorithm. The discriminant function is

$$\delta_{NCS_\Delta}(x^*) = \arg\min_k \left\{ \sum_{i \in S_\Delta} \frac{(x_i^* - \bar{x}_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k \right\} \qquad (5)$$
where no shrinkage is put on the class centroid $\bar{x}_{ik}$.
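
As a rough illustration of the difference from SC, the sketch below applies a nearest-centroid rule on a pre-selected gene set while keeping the original (unshrunken) class centroids. The function name and argument layout are hypothetical; the centroids, per-gene scale (pooled standard deviation plus the ad hoc constant), class proportions and selected indices are assumed to have been computed beforehand.

```python
import numpy as np

def ncs_predict(x_new, centroids, scale, priors, selected):
    """Nearest-centroid rule restricted to the selected genes.

    centroids: genes x classes array of unshrunken class centroids
    scale:     per-gene standardizing value (assumed s_i + s_0)
    priors:    class proportions
    selected:  indices of genes kept by the shrinkage-based selection
    """
    z = (x_new[selected, None] - centroids[selected]) / scale[selected, None]
    score = (z ** 2).sum(axis=0) - 2.0 * np.log(priors)   # standardized squared distance
    return int(np.argmin(score))                           # index of the nearest class
```

Genes outside `selected` do not enter the score at all, so an extreme value on an unselected gene has no effect on the assignment.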

2.3 Eigengenes: using a modified rotated spectral decomposition for optimal feature selection
In both SC and $NCS_\Delta$, feature selection is based on the shrunken distance $d'_{ik}$ computed for each gene i individually. As mentioned before, such univariate selection tends to overselect and results in redundant information. The selected $S_\Delta$ set contains highly correlated genes, so a gene-wise independence assumption is unrealistic.

We propose to use a modified rotated SpD to diagonalize $S_\Delta$ and select a relatively uncorrelated subset $S_{\Delta,rSpD}$ that captures the majority of the variation in the $S_\Delta$ set. In the first step, the correlation matrix R of dimension $L \times L$ is computed for the L genes in $S_\Delta$ at shrinkage threshold $\Delta$. An SpD is then applied to R,

$$R = \sum_{i=1}^{L} \lambda_i e_i e_i^T \qquad (6)$$
where $\lambda_i$ is the i-th eigenvalue with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_L \ge 0$, and $e_i$ is the i-th eigenvector. This is the same step taken in a PCA. The PCs are defined by the eigenvectors in the order of decreasing eigenvalues. They are subsequently used as the ‘new’ variables in the reduced data space (Alter et al., 2000; Yeung and Ruzzo, 2000). As mentioned earlier, the PCs retain all the genes and therefore are not directly interpretable. They achieve no dimension reduction in the original gene × array space.

To deal with this, we adopt a rotation and selection step to link the actual gene expression vectors $\{x_i : i = 1, \ldots, L\}$ to the eigenvectors $\{e_i : i = 1, \ldots, L\}$. In particular, a varimax rotation (Kaiser, 1958) is applied to the original set of eigenvectors, $E = (e_1, \ldots, e_L)$. It finds an orthogonal transformation T, $E^* = ET$, such that the rotated vectors $e_i^*$ have sparse loadings, thereby clarifying the association of the genes with the eigenvectors.

Define an eigengene to be a gene with the corresponding expression profile contributing primarily to the most important eigenvectors (PCs), and the eigengene based feature selection is achieved in the following sequential steps:

  1. Fix the classifier size p, $p \le L$, such that the p eigenvectors with the largest eigenvalues combined explain $100 \times \theta$% of the variability of the set. In other words, find p such that

    $$\frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{L} \lambda_i} \ge \theta \qquad (7)$$
    where $\theta$ is chosen to be 0.95 in the algorithm.

  2. Based on the rotated vectors $e_i^*$, compute the contribution of gene l to the p largest loadings and to the remaining $L - p$ loadings in the following forms,

    Formula 8(8)
    Here $\Gamma_l$ represents the average association of gene l with the first p PCs and $\gamma_l$ is the average association of gene l with the rest.

  3. Select gene l for inclusion in $S_{\Delta,rSpD}$ if $\Gamma_l > \gamma_l$, until p such genes are selected.
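
The three steps above can be sketched as follows. This is a loose illustration, not the authors' modified rotated SpD. Several choices are assumptions made only so the code is concrete: the varimax rotation is applied to the p leading eigenvectors alone, the average associations are taken as mean squared loadings (the exact form of equation 8 is not recoverable here), and among genes whose association with the leading block exceeds that with the rest, one gene per rotated component is picked by largest absolute loading. All function names are hypothetical.

```python
import numpy as np

def varimax(A, max_iter=200, tol=1e-8):
    """Varimax rotation (Kaiser, 1958) of a loading matrix A (rows: genes)."""
    L, p = A.shape
    T = np.eye(p)
    crit_old = 0.0
    for _ in range(max_iter):
        B = A @ T
        G = A.T @ (B ** 3 - B @ np.diag((B ** 2).sum(axis=0)) / L)
        U, sv, Vt = np.linalg.svd(G)
        T = U @ Vt                          # orthogonal update of the rotation
        crit = sv.sum()
        if crit_old > 0 and crit / crit_old < 1 + tol:
            break
        crit_old = crit
    return A @ T

def select_eigengenes(X, theta=0.95):
    """X: L genes x n samples (the univariately selected set). Returns (gene indices, p)."""
    L = X.shape[0]
    R = np.corrcoef(X)                      # L x L gene-gene correlation matrix
    lam, E = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1]           # decreasing eigenvalues
    lam, E = np.clip(lam[order], 0.0, None), E[:, order]
    frac = np.cumsum(lam) / lam.sum()
    p = min(int(np.searchsorted(frac, theta)) + 1, L)
    E_rot = varimax(E[:, :p])               # assumption: rotate the leading block only
    rest = E[:, p:]
    Gamma = (E_rot ** 2).mean(axis=1)       # assumed form of the leading association
    gamma = (rest ** 2).mean(axis=1) if rest.shape[1] else np.zeros(L)
    candidates = {int(l) for l in np.flatnonzero(Gamma > gamma)}
    chosen = []
    for j in range(p):                      # one gene per rotated component
        for l in np.argsort(-np.abs(E_rot[:, j])):
            if int(l) in candidates and int(l) not in chosen:
                chosen.append(int(l))
                break
    return chosen, p
```

Because the rotation is orthogonal, it changes which genes load on which component but not the total squared loading of a gene on the leading block; the per-component pick is what makes the sparse rotated loadings matter in this sketch.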

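The discriminant function based on $S_{\Delta,rSpD}$ is written as

$$\delta_{ELDA}(x^*) = \arg\min_k \left\{ \sum_{i \in S_{\Delta,rSpD}} \frac{(x_i^* - \bar{x}_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k \right\} \qquad (9)$$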

Horvath et al. (2006) defined a hub gene in a gene co-expression network as one with high intramodule connectivity. Under a factorizable network, they show that such connectivity is proportional to the correlation between the expression vector of the i-th gene and the most important eigenvector(s). In that sense, by selecting genes associated largely with the first p eigenvectors, which explain the majority of the variation in the data, we are targeting the ‘hub’ genes that are the most characteristic components of those cancer-induced expression modules.

In summary, our method reduces the univariately selected $S_\Delta$ set to a relatively uncorrelated $S_{\Delta,rSpD}$ set using a modified rotated data decomposition technique. The eigengene-based feature selection yields a small set of ‘hub’ components that captures the variability of the original set. Classifiers based on the gene-wise independence assumption are expected to perform better using $S_{\Delta,rSpD}$ than using its univariate selection-based counterpart $S_\Delta$, because the highly correlated elements have been eliminated.

2.4 Unequal misclassification costs
When using a classifier to make decisions on a future case, some types of error can be more costly than others. For instance, false negative diagnoses carry heavier consequences for patients who in fact have cancer; whereas in drug compound screening, false positive hits cause a more severe problem when a non-efficacious compound is carried on to large expensive clinical trials. An important feature of our algorithm is therefore an incorporation of a cost matrix to allow differential penalization of different types of errors. As defined in Johnson and Wichern (1998), the expected cost of misclassification is

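$$\mathrm{ECM} = \sum_{k'=1}^{K} \pi_{k'} \sum_{k \ne k'} P(k|k') \, c(k|k') \qquad (10)$$
where $\pi_{k'}$ is the prior probability of class k', k' = 1, ..., K. The quantity P(k|k'), for k ≠ k', is the probability of classifying a sample to class k when it actually belongs to class k', and is equal to $\int_{R_k} f(x|k') \, dx$, where $f(x|k')$ is the density of class k' and $R_k$ is the region of the expression space assigned to class k. Accordingly, c(k|k') is the cost of such misclassification, with c(k|k') = 0 when k = k'.

The classification rule that minimizes the ECM is to assign a sample expression vector x to the class k for which the cost-adjusted probability of misclassification

$$\sum_{k' \ne k} \pi_{k'} \, c(k|k') \, f(x|k') \qquad (11)$$
is the smallest. Assuming a multivariate normal density function for f, with class means $\mu_{ik'}$ and a common diagonal covariance matrix with diagonal entries $\sigma_i^2$, the ECM classification rule takes a simple form

$$\arg\min_k \sum_{k' \ne k} \pi_{k'} \, c(k|k') \exp\left\{ -\frac{1}{2} \sum_{i} \frac{(x_i - \mu_{ik'})^2}{\sigma_i^2} \right\} \qquad (12)$$
Incorporating (12) in the ELDA framework, we have the following cost-adjusted discriminant function,

$$\delta^{c}_{ELDA}(x^*) = \arg\min_k \sum_{k' \ne k} \pi_{k'} \, c(k|k') \exp\left\{ -\frac{1}{2} \sum_{i \in S_{\Delta,rSpD}} \frac{(x_i^* - \bar{x}_{ik'})^2}{(s_i + s_0)^2} \right\} \qquad (13)$$
When given equal costs, say c(k|k') ≡ 1 for k ≠ k', it is straightforward to show that (13) reduces to (9), as minimizing

$$\sum_{k' \ne k} \pi_{k'} \, f(x|k')$$
is equivalent to maximizing $\pi_k f(x|k)$ for class k.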
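
A minimal sketch of the minimum expected-cost rule may help make the role of the cost matrix concrete. It assumes a Gaussian density with a common diagonal covariance, as in the text; the function name, the argument layout and the convention that `cost[k, kp]` holds c(k|k') are illustrative choices, not taken from the authors' implementation.

```python
import numpy as np

def ecm_classify(x, centroids, scale, priors, cost):
    """Assign x to the class with the smallest cost-adjusted misclassification score.

    centroids: genes x classes array of class means
    scale:     per-gene standard deviation (common across classes)
    priors:    prior probability of each class
    cost:      cost[k, kp] = cost of assigning class k to a sample from class kp
               (zero on the diagonal)
    """
    z = (x[:, None] - centroids) / scale[:, None]
    log_f = -0.5 * (z ** 2).sum(axis=0)           # log density up to a shared constant
    w = priors * np.exp(log_f - log_f.max())      # rescaled prior times density, per class
    expected_cost = cost @ w                      # entry k sums cost[k, kp] * w[kp] over kp
    return int(np.argmin(expected_cost))
```

With a 0/1 cost matrix this picks the class with the largest prior-weighted density; raising the cost of one error type moves the decision boundary away from the class that is expensive to miss.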

2.5 Cross validation
We consider two commonly used cross-validation schemes to obtain CV error rates in the training data. A leave-one-out CV is known to produce nearly unbiased prediction error estimates, but the estimate is often criticized as highly variable. A 10-fold cross-validation, on the other hand, reduces the variability, but can introduce bias in the error estimates (Braga-Neto and Dougherty, 2004). To complement the two cross-validation schemes, we determine the classification accuracy rate by repeating a 10-fold cross-validation 10 times, such that the training samples are randomly divided 100 times into training groups consisting of 90% of the samples and test groups consisting of the remaining 10% of the samples. The error rate estimates and the number of genes used for a classifier are then averaged over all the 10-fold cross-validations. Such a repeated 10-fold CV estimator has been recommended as an overall error estimator of choice in terms of reduced variance (Kohavi, 1995). Our final classifier, which is validated in the test cohort, is constructed by ranking the genes by their appearance frequencies over all the CV steps.
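
The repeated 10-fold scheme and the frequency-based ranking can be sketched as below. The helper names are hypothetical, and the sketch uses unstratified random folds, which is an assumption; the article does not say whether folds were stratified by class.

```python
from collections import Counter

import numpy as np

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repeat."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)            # fresh random partition per repeat
        for fold in np.array_split(perm, n_folds):
            yield np.setdiff1d(perm, fold), fold     # roughly 90% train, 10% test

def rank_genes_by_frequency(selected_per_split):
    """Rank genes by how often they were selected across the CV splits."""
    counts = Counter(g for sel in selected_per_split for g in sel)
    return [g for g, _ in counts.most_common()]
```

In use, feature selection and classifier fitting would be rerun on each training split, with the selected gene indices collected per split and passed to `rank_genes_by_frequency`.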


    3 RESULTS
3.1 The small round blue cell tumor data
The first example we consider is the small round blue cell tumor (SRBCT) data (Khan et al., 2001). The tumors are classified as Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB) or rhabdomyosarcoma (RMS). The dataset consists of expression measurements on 2308 genes using cDNA microarrays in a total of 83 SRBCT samples, which are divided into 63 training and 20 test samples. As shown in Table 1, at $\Delta$ = 3.81 for the shrunken distance-based univariate selection, the SC and $NCS_\Delta$ classifiers require an $S_\Delta$ set of 78 genes to keep the misclassification error rates below 5%. In contrast, ELDA needs a substantially smaller $S_{\Delta,rSpD}$ set of 25 genes, which achieves a 1% cross-validation error rate in the training samples and zero errors in the test set.


Table 1 Comparison of the classifier size and error rates on the SRBCT data

 
As discussed earlier, the SC method has a preference for relatively large classifiers owing to the shrinkage imposed on the class centroids. Increasing the shrinkage threshold results in (1) a decreasing number of genes selected and (2) progressively less discriminating shrunken centroids, leading to poor classification performance for small classifiers. In Table 1, at a threshold value $\Delta$ = 7.28, the five-gene SC classifier gives significantly higher error rates (24 and 25% in the training and test sets, respectively) than the other two classifiers, which use the original centroids.

3.2 The leukemia data
In the next example we use the leukemia dataset of Golub et al. (1999). There are 38 training samples consisting of 27 acute lymphoblastic leukemia (ALL) and 11 acute myeloid leukemia (AML) samples. A test cohort of 20 ALL and 14 AML samples is also available. Expression measurements are obtained on 7129 genes using Affymetrix oligonucleotide microarrays. In Table 2, at $\Delta$ = 2.33, ELDA uses only 17 genes to keep the CV and test error rates under 5%, whereas SC and $NCS_\Delta$ each require 129 genes.


Table 2 Comparison of the classifier size and error rates on the Leukemia data

 
3.3 The breast cancer data
The breast cancer prognosis study (van't Veer et al., 2002) includes a training set comprising 78 breast carcinomas, of which 34 are in a poor prognosis group (distant metastases within 5 years) and 44 are in a good prognosis group (remaining disease-free for at least 5 years). The test cohort has 7 good prognosis and 12 poor prognosis samples. Expression levels are measured on 25 000 genes using cDNA microarrays synthesized by inkjet technology.

As shown in Figure 1, ELDA consistently outperforms SC and $NCS_\Delta$ in classification accuracy. Table 3 summarizes the performance of the best classifiers at an optimal $\Delta$ value trained in cross-validation under each method. In this dataset, not only does the ELDA classifier use a substantially smaller set of genes (44 versus 231 required by SC and $NCS_\Delta$), it also demonstrates significantly better classification accuracy than its univariate feature selection-based analogues (training: 84 versus 76%; test: 89 versus 79%). Obtained in a slightly different framework, the van't Veer 70-gene signature comprises genes selected based on a univariate correlation with the outcome computed using the entire training set. Although the study has been criticized for not cross-validating the feature selection step, which yields over-optimistic CV error rates (Simon et al., 2003), the 70-gene signature has nonetheless been successful in validation studies (van de Vijver et al., 2002). It is therefore worth comparing the ELDA classifier with this benchmark signature. Table 3 shows that the ELDA classifier achieves classification performance similar to that of the van't Veer signature (albeit overfitted), yet uses 40% fewer genes.


Fig. 1 Classification accuracy of the SC, $NCS_\Delta$ and ELDA classifiers in the breast cancer data. The top axis indicates both the number of genes in the univariately selected $S_\Delta$ set at each threshold level (upper side) and the number of genes in the corresponding $S_{\Delta,rSpD}$ set (lower side). Equal costs c(0|1) = c(1|0) are assumed.

 


Table 3 Breast cancer gene expression classifiers

 
Van de Vijver et al. (2002) validated the van't Veer 70-gene signature in an independent cohort. We sought to validate the ELDA signature in a similar fashion. Tumors were taken from 295 consecutive invasive breast cancer cases selected from the frozen tissue bank of the Netherlands Cancer Institute. We excluded the 60 node-negative cases previously used in the van't Veer study to avoid overfitting. A Kaplan-Meier survival analysis (Supplementary Figure 1) revealed that the ELDA classifier is as predictive of patient survival probabilities as the benchmark 70-gene signature.

3.3.1 Gene annotation analysis
Of the 44 genes selected by ELDA, only 12 are also found in the van't Veer 70-gene signature. The difference probably comes from a combination of the different discriminant functions, feature selection methods, and cross-validation schemes implemented. To obtain insights on how the two classifiers demonstrate comparable results despite such dramatically different choices of genes, we did a gene annotation analysis. In particular, we map the signature genes with known functional entries in the combined protein information databases (Swiss-Prot, TrEMBL and PIR) to their UniProt ID and then use the high-throughput GoMiner (Zeeberg et al., 2005) to carry out a Gene Ontology (GO) annotation and functional enrichment analysis. Of the 99 most enriched GO categories with at least two mapped genes from the 70-gene signature, 77 (78%) of them are in common with those mapped from the ELDA 44-gene signature (Supplementary Figure 2). The large overlap of the GO functional categories explains the similar prognostic performances of the two signatures. On the other hand, the ELDA classifier has an average 52% decrease in the number of components in each enriched functional group—a result of eliminating redundant information through de-correlating the gene expression profiles.


Fig. 2 ELDA with a differential cost matrix applied to the breast cancer data. Classification accuracies are split into sensitivity and specificity in the training set (a–c) and in the test set (d–f), and plotted as a function of the shrinkage threshold.

 
3.3.2 Finding ‘Hub’ elements of the gene regulatory pathways
Table 4 lists the top six enriched molecular functions annotated from the two classifiers. As an illustration, four components from the 70-gene signature, WISP1, ESM1, IGFBP5 (GenBank accession AF055033) and IGFBP5 (GenBank accession NM_000599), representing three unique genes, are mapped to a cell growth regulation group. Up-regulation of these genes has previously been shown to correlate with prognostic features of human breast cancer (Xie et al., 2001; McCaig et al., 2000; Aitkenhead et al., 2002). In contrast, the same cell growth regulation module is represented by a single element, IGFBP5 (AF055033), in the ELDA classifier. The effect of ‘de-correlating’ using an eigengene-based selection is apparently 2-fold: (1) multiple components representing the same gene (IGFBP5) are ‘dereplicated’ and (2) less representative elements of the module (ESM1 and WISP1) are excluded. The latter hypothesis has a biological basis. IGFBP5 has been shown to associate with endothelial cell (EC) monolayers, where ESM1, an EC-specific gene involved in tumor angiogenesis, is induced (Booth et al., 1995). On the other hand, both WISP1 and IGFBP5 have been shown to be involved in apoptosis, a process known to play a major role in controlling cell proliferation. This suggests that the common component IGFBP5 very likely represents the ‘hub’ element in related molecular pathways that collectively contribute to cell growth regulation. It reinforces the notion that the eigengene-based feature selection targets the ‘hub’ elements in the gene co-expression network.


Table 4 Gene Ontology analysis

 
3.4 Differential misclassification applied to the breast cancer dataset
We use the breast cancer dataset as the motivating example for differential misclassification. The van't Veer et al. (2002) study discussed the importance of obtaining high sensitivity for the purpose of selecting patients to receive adjuvant therapy and proposed adjusting the discriminant score cutoff value in an ad hoc manner to control false negative errors. With more than two outcome classes, manually changing threshold values to obtain desired error rates in certain categories of interest becomes impractical. We demonstrate using the same dataset that classification with an assigned misclassification cost matrix is a much more efficient way to trade off one type of error against another.

As shown in Figure 2a, with equal misclassification costs, sensitivities are consistently lower than specificities in the training set. The higher false negative error rate can be partly attributed to the substantial heterogeneity among the poor prognosis patients (Dai et al., 2005). In our algorithm, we address this problem by specifying the cost for false negatives to be r times the cost for false positives (c(0|1) = r c(1|0)), where a range of values (1–10) has been tried for the cost ratio r. Figure 2b and e show that for r = 3, the classifier is effectively adjusted to give consistently higher sensitivities than specificities over all threshold values in both the training and test sets. At r = 6, we obtain an 89.1% sensitivity in the training samples compared with the 78.8% sensitivity with equal cost assignment (Table 5). The signature further made zero false negative errors in the test set. Note from Figure 2 that one can achieve even higher sensitivity using a cost ratio larger than the optimal r value we propose, but only at the cost of too many false positive errors.


Table 5 Differential misclassification applied to the breast cancer prognosis outcome

 

    4 DISCUSSION
We have proposed an eigengene-based feature selection scheme for LDA applied in high-dimensional gene expression microarray data settings. Compared with the commonly applied univariate selection methods, ELDA has several advantages: (1) it de-correlates expression profiles that contain highly redundant information; (2) it selects ‘hub’ genes that are most characteristic of, and best summarize, the expression variability in the data; and (3) the feature selection is applicable to any classification method. Using three cancer microarray datasets, we show that eigengene-based classifiers are substantially smaller in size, a desirable result for downstream validation studies. The approach further improves the interpretability of a signature, as only the most representative genes of the co-expression networks are included. In terms of classification performance, ELDA classifiers show significantly better results in classifying heterogeneous outcomes such as patient survival status in the breast cancer data. The rationale is as follows: compared with their univariate selection-based counterparts, ELDA builds on a set of less-correlated gene expression profiles that better satisfy the gene-wise independence assumption in diagonal linear discriminant models. The improvement in classification performance is evident in the breast cancer dataset (Table 3). The SRBCT and the leukemia datasets, however, showed only subtle changes, as there is little room for further reducing the error rates achieved by the SC method (Tables 1 and 2).

In the breast cancer data, we compare an ELDA 44-gene classifier with the benchmark 70-gene expression signature. In spite of the small gene overlap, we argue that the comparable prognostic values demonstrated by the two signatures stem from the similarity of the enriched processes. This is not a surprising result from an evolutionary point of view. As comparative genomic studies have suggested, gene duplications provide raw material for genome variation and functional diversification, leading to an increasingly sophisticated biological network in which one gene may be involved in multiple biological pathways, or one cellular function may be carried out by many genes. Investigating gene expression profiles in isolation therefore often leads to inconsistent findings, given disparate experimental designs, varying feature selection strategies and cohort-specific features. Multivariate feature selection, such as the ELDA method proposed in this paper, is thus a more sensible choice than the widely used univariate selection in tumor classification studies using gene expression microarrays.


    Acknowledgments
 
The authors would like to thank people from BARDS, Merck Research Laboratory, especially Bret Musser, Peggy Wong, Richard Raubertas, Dan Holder, Peter Hu and Xiang Yu for their valuable comments and suggestions. This research was supported in part by the Merck Research Laboratory summer intern program 2005, and NIH grant GM72007.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Satoru Miyano

Received on May 2, 2006; revised on July 20, 2006; accepted on August 14, 2006

    REFERENCES
 

  1. Aitkenhead, M., et al. (2002) Identification of endothelial cell genes expressed in an in vitro model of angiogenesis: induction of esm-1, (beta)ig-h3, and nrcam. Microvasc. Res., 63, 159–171.

  2. Alizadeh, A., et al. (2000) Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

  3. Alter, O., et al. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA, 97, 10101–10106.

  4. Booth, B., et al. (1995) Igfbp-3 and igfbp-5 association with endothelial cells: role of c-terminal heparin binding domain. Growth Regul., 5, 1–17.

  5. Braga-Neto, U. and Dougherty, E. (2004) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374–380.

  6. Chang, H., et al. (2005) Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc. Natl Acad. Sci. USA, 102, 3738–3743.

  7. Dabney, A. and Storey, J. (2005) Optimal feature selection for nearest centroid classifiers, with applications to gene expression microarrays. UW Biostatistics Working Paper Series. Working Paper 267, http://www.bepress.com/uwbiostat/paper267.

  8. Dabney, A. (2005) Classification of microarrays to nearest centroids. Bioinformatics, 21, 4148–4154.

  9. Dai, H., et al. (2005) A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. J. Natl Cancer Inst., 65, 4059–4066.

  10. Dudoit, S., et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.

  11. Ein-Dor, L., et al. (2005) Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21, 171–178.

  12. Golub, T., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

  13. Gunther, E., et al. (2003) Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc. Natl Acad. Sci. USA, 100, 9608–9613.

  14. Horvath, S., et al. (2006) Connectivity, Module-Conformity, and Significance: Understanding Gene Co-Expression Network Methods. UCLA Technical Report, www.genetics.ucla.edu/labs/horvath/ModuleConformity/.

  15. Johnson, R. and Wichern, D. (1998) Applied Multivariate Statistical Analysis, 4th edn. Prentice Hall, New York.

  16. Kaiser, H. (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23, 187–200.

  17. Khan, J., et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med., 7, 673–679.

  18. Kohavi, R. (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), San Mateo, CA, pp. 1137–1143.

  19. McCaig, C., et al. (2000) Signalling pathways involved in the direct effects of igfbp-5 on breast epithelial cell attachment and survival. J. Cell Biochem., 84, 784–794.

  20. Meng, Z., et al. (2003) Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am. J. Hum. Genet., 73, 115–130.

  21. Rhodes, D., et al. (2004) Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Natl Acad. Sci. USA, 101, 9309–9314.

  22. Simon, R., et al. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl Cancer Inst., 95, 14–18.

  23. Tibshirani, R., et al. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, 99, 6567–6572.

  24. van’t Veer, L., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–535.

  25. van de Vijver, M., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999–2009.

  26. Xie, D., et al. (2001) Elevated levels of connective tissue growth factor, wisp-1, and cyr61 in primary breast cancers associated with more advanced features. Cancer Res., 61, 8917–8923.

  27. Yeung, K. and Ruzzo, W. (2000) Principal component analysis for clustering gene expression data. Bioinformatics, 17, 763–774.

  28. Zeeberg, B., et al. (2005) High-throughput gominer, an industrial-strength integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of common variable immune deficiency (cvid). BMC Bioinformatics, 6.





