Bioinformatics Advance Access originally published online on July 31, 2006
Bioinformatics 2006 22(19):2356-2363; doi:10.1093/bioinformatics/btl400
Reliable gene signatures for microarray classification: assessment of stability and performance



Institute of Informatics, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 Munich, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Two important questions for the analysis of gene expression measurements from different sample classes are (1) how to classify samples and (2) how to identify meaningful gene signatures (ranked gene lists) exhibiting the differences between classes and sample subsets. Solutions to both questions have immediate biological and biomedical applications. To achieve optimal classification performance, a suitable combination of classifier and gene selection method needs to be specifically selected for a given dataset. The selected gene signatures can be unstable and the resulting classification accuracy unreliable, particularly when considering different subsets of samples. Both unstable gene signatures and overestimated classification accuracy can impair biological conclusions.
Methods: We address these two issues by repeatedly evaluating the classification performance of all models, i.e. pairwise combinations of various gene selection and classification methods, for random subsets of arrays (sampling). A model score is used to select the most appropriate model for the given dataset. Consensus gene signatures are constructed by extracting those genes frequently selected over many samplings. Sampling additionally permits measurement of the stability of the classification performance for each model, which serves as a measure of model reliability.
Results: We analyzed a large gene expression dataset with 78 measurements of four different cartilage sample classes. Classifiers trained on subsets of measurements frequently produce models with highly variable performance. Our approach provides reliable classification performance estimates via sampling. In addition to reliable classification performance, we determined stable consensus signatures (i.e. gene lists) for sample classes. Manual literature screening showed that these genes are highly relevant to our gene expression experiment with osteoarthritic cartilage. We compared our approach to others based on a publicly available dataset on breast cancer.
Availability: R package at http://www.bio.ifi.lmu.de/~davis/edaprakt
Contact: ralf.zimmer{at}bio.ifi.lmu.de
| INTRODUCTION |
|---|
Microarrays have become a standard means of investigating different states of biological systems on the basis of the expression of genes on a genome-wide level. A broad range of statistical analysis methods have been developed to guide the biological interpretation of these data. Typical first questions on measurements with various sample classes, e.g. tissues or disease states, are (1) how can a particular sample be classified, i.e. assigned to a sample class with high reliability? and (2) what are the molecular features best representing the differences between classes, sample subsets or individual samples? The answers to these questions could help towards a first interpretation of the data, i.e. could improve the analysis of future measurements and their assignment to a specific sample class, which could suggest further work or experiments. Moreover, gene or feature lists could provide a starting point for understanding the processes responsible for the differences between states, possibly indicating and identifying well-known and new target genes.
A wide variety of machine learning methods have been proposed for classification tasks related to microarrays, including support vector machines (SVM), k nearest neighbors (kNN), decision trees (DT) and many others (Dudoit et al., 2002). Compared to the number of samples, the number of features (i.e. genes or spots represented on the microarray) is large, which affects the performance of the classifiers. Therefore, a number of feature subset selection (FSS) methods have been developed for gene selection in microarray data (for a review see Guyon and Elisseeff, 2003). However, using an arbitrarily fixed combination of FSS method and classifier (a model) may sacrifice performance that could have been achieved with another model. The systematic pairwise combination of FSS methods with classifiers produces models with varying performance and, thus, a ranking of possible models.
Often, one is tempted to rely on a best classification method, with a certain performance, or to restrict the analysis to the top ranking genes/features. A method to resolve these issues by comprehensive evaluation and selection of models has been proposed by Statnikov et al. (2005). Another difficulty due to the relatively small number of samples is the instability of the performance of models that are simply built upon fixed gene signatures. This is due to the fact that gene signatures depend heavily on the actual dataset and are especially sensitive to noise (Guyon and Elisseeff, 2003). Repeated random sampling of arrays before feature selection can be used to assess the quality of FSS methods with regard to signature stability (Bi et al., 2003). For a recent discussion of issues in published studies see also Michiels et al. (2005).
The focus of this paper is to combine these strategies by selecting models that exhibit not only good classification performance but also stable signatures.
Our approach begins by selecting an appropriate model from a predefined library of commonly used models. Each model in the library is a combination of a feature selection method, a classifier and fixed parameters. Our model selection procedure then evaluates all models from this library for the given dataset. We repeat this evaluation over several random samplings of expression arrays to select the best model for the gene expression measurements to be analyzed. On the one hand, this permits an estimation of the quality of the FSS methods via the stability of their gene signatures. On the other hand, it enables an estimation of the quality of the classifiers via the consistency of their classification performance. An additional cross validation is performed to estimate the degree of overfitting that occurs within the random samplings during model selection. It provides an estimate of the expected performance as well as an extra indication of the quality of the signature.
| METHODS |
|---|
In order to determine reliable features and stable classification accuracies, together with an estimate of the overall performance, we propose the following procedure (StabPerf), outlined in Figure 1. For given sets of gene expression measurements (arrays) for the respective classes, we use sampling to select gene signatures and classify arrays for all combinations of FSS and classification methods. Parameter combinations for individual models (i.e. the combination of feature subset selection, classifier and fixed parameters) need to be specified beforehand for the given dataset, i.e. an optimization in parameter space is not performed by our method. Nevertheless, sensible parameters need to be selected so that the feature selection methods do not produce empty sets. We define a new score to rank these models.
In the first step of StabPerf, a group of gene expression arrays is randomly split into a training set (6/7 of the arrays) and a validation set (1/7 of the arrays). This splitting is repeated several times (sampling), whereby the split in each iteration is randomized such that the classes are represented in each training set in the same proportions as they are in the complete dataset (balanced sampling).
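The balanced sampling step can be sketched as follows. This is a minimal illustration in Python, not the authors' R implementation: the function name `balanced_split`, the per-class rounding rule and the fixed seed are assumptions made for the example. The class sizes are those reported for the OA dataset in the Data preparation section.

```python
import random
from collections import defaultdict


def balanced_split(labels, train_fraction=6 / 7, rng=None):
    """Split array indices into training and validation sets, class by class.

    Hypothetical helper: each class contributes the same fraction of its
    arrays to the training set, so class proportions are preserved.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Rounding per class is an assumption; the text only gives 6/7 vs 1/7.
        n_train = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:n_train])
        validation.extend(shuffled[n_train:])
    return sorted(train), sorted(validation)


# Class sizes of the OA dataset: healthy 18, early 20, peripheral 21, central 19.
labels = ["healthy"] * 18 + ["early"] * 20 + ["peripheral"] * 21 + ["central"] * 19
train, validation = balanced_split(labels, rng=random.Random(0))
```

Calling `balanced_split` repeatedly with a fresh random state yields one training/validation pair per sampling step.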
Subsequently, for each training set an FSS method determines a list of relevant features (gene signature).
Then, for each gene signature from each FSS method, each classification method is trained on the respective training set, using only the features in the given signature. Accuracies for these combinations of FSS method and classifier are measured by predicting the outcome on the corresponding validation sets.
Next, we rank all possible models based on the stability of sampled gene signatures (signature stability) as well as the classification performance of the model. As multiple classification accuracies were calculated for each model, not only the total accuracy (fraction of correct predictions over all predicted samples) but also the median absolute deviation (MAD) of classification accuracies, over all sampling steps, is computed. We refer to the classification accuracy and its MAD as the performance of the classifier. The gene signature stability and the deviation of the classification accuracies are two important factors, neglected by many current approaches, both of which are used to select the best model (model selection). Finally, the chosen model is retrained on all arrays.
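The two performance quantities named above can be computed as in the following sketch. It assumes that each sampling step reports the number of correct predictions and the size of its validation set, and it uses the unscaled MAD (median of absolute deviations from the median); whether the original implementation applies a scaling constant is not stated in the text. The function names and the toy counts are hypothetical.

```python
from statistics import median


def total_accuracy(correct_counts, validation_sizes):
    """Fraction of correct predictions over all predicted samples."""
    return sum(correct_counts) / sum(validation_sizes)


def accuracy_mad(correct_counts, validation_sizes):
    """Unscaled median absolute deviation of the per-step accuracies."""
    accuracies = [c / n for c, n in zip(correct_counts, validation_sizes)]
    centre = median(accuracies)
    return median(abs(a - centre) for a in accuracies)


# Toy example: five sampling steps with 12 validation arrays each.
correct = [9, 10, 8, 12, 11]
sizes = [12] * 5
acc = total_accuracy(correct, sizes)
mad = accuracy_mad(correct, sizes)
```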
Classification performance of our model selection approach is estimated via stratified 10-fold cross validation (Hastie et al., 2001) since the accuracy on the validation set was used during model selection and, thus, is biased.
Model selection
The optimal model, i.e. combination of an FSS method with a classification method, is chosen based on the stability of the signatures produced by the FSS method as well as the distribution of classification accuracies.
Since classifiers can be sensitive to noise as well as over-fitting, we introduce an additional measure into our model selection score that considers the stability of an FSS method. We prefer FSS methods that produce stable gene signatures, because features which are frequently selected over many different training sets are expected to be the most biologically relevant ones with respect to the differences between sample classes. The model score can be parameterized to select models specific to the needs of particular users.
For a given FSS method S, let F be the list of all features that have been selected in at least one of n sampling steps, i.e. for at least one training subset. Let freq(f) be the number of sampling steps in which a feature f ∈ F has been selected. The non-adjusted stability StabNA of the FSS method is conceptually an inverse variance:
[Equation (1): formula image not recovered]
We introduce the length-adjusted stability Stab, as the non-adjusted stability does not yet account for the artificial increase in stability that occurs with increasingly long gene signatures. For example, consistently producing the signature containing every feature on the array would result in 100% stability. Therefore, non-adjusted stability is penalized by the median number of selected features µ, as a fraction of the total number of features per array |features|, weighted by a penalty factor α. Here, we use α = 10, which, with approximately 7500 features per array, effectively limits signatures to at most 750 features.
[Equation (2): formula image not recovered]
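Since the formulas themselves are not reproduced here, the following Python sketch shows one reading that is consistent with the verbal description. It is an assumption, not the authors' definition. The non-adjusted stability is taken as the mean relative selection frequency over F, and the length adjustment subtracts the penalty factor (called `alpha` in the code) times µ / |features|, clamped at zero. Under this reading, a method that always returns every feature has a non-adjusted stability of 1, and signatures of 750 out of 7500 features receive a penalty of exactly 1, matching the two statements in the text.

```python
from collections import Counter
from statistics import median


def stability(signatures, n_features_total, alpha=10.0):
    """Return (non-adjusted, length-adjusted) stability under an ASSUMED formula.

    signatures: one collection of selected features per sampling step.
    Assumed: StabNA = sum(freq(f) for f in F) / (n * |F|)
             Stab   = max(0, StabNA - alpha * mu / |features|)
    """
    n = len(signatures)
    freq = Counter(f for signature in signatures for f in set(signature))
    if not freq:
        return 0.0, 0.0  # no feature was ever selected
    stab_na = sum(freq.values()) / (n * len(freq))
    mu = median(len(set(signature)) for signature in signatures)
    stab = max(0.0, stab_na - alpha * mu / n_features_total)
    return stab_na, stab


# Toy example with four sampling steps and hypothetical feature names.
toy = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}, {"a", "c", "e"}]
toy_na, toy_adj = stability(toy, n_features_total=7500)

# Always selecting the same 750 of 7500 features: perfectly stable but fully penalized.
long_na, long_adj = stability([set(range(750))] * 3, n_features_total=7500)
```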
Second, we define the classification performance Perf(M) for a given model M = (S, C), i.e. combination of FSS method S and classification method C. Here, we award a model M for a high total accuracy Acc(M) over all sampling steps and penalize a high median absolute deviation MAD(M). This penalty is also adjustable, depending on the objectives of the analysis. We use β = 0.5:
[Equation (3): formula image not recovered]
The model score (in the range of [0,1]):
[Equation (4): formula image not recovered]
is also adjustable. We use γ = 0.5, so that both stability and classification performance contribute equally.
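As a hedged illustration of how the pieces could fit together, the sketch below assumes that the performance is the total accuracy minus a weighted MAD (weight `beta`), clamped at zero, and that the model score is a convex combination of stability and performance (weight `gamma`). Both forms are assumptions chosen to match the description (a score in [0,1], equal contribution at 0.5); they are not taken from the original equations. The input values for models 21 and 31 are the accuracies, MADs and stabilities quoted in the Results section. The resulting numbers are not the scores reported in Table 1.

```python
def performance(accuracy, mad, beta=0.5):
    """ASSUMED form: reward total accuracy, penalize its MAD, clamp at zero."""
    return max(0.0, accuracy - beta * mad)


def model_score(stability, accuracy, mad, beta=0.5, gamma=0.5):
    """ASSUMED form: weighted mean of signature stability and performance."""
    return gamma * stability + (1.0 - gamma) * performance(accuracy, mad, beta)


# Values quoted in the text for model 21 (PV0.01FC2.0 + ORA/NSC)
# and model 31 (SVM0.00001/NSC), expressed as fractions.
score_21 = model_score(stability=0.639, accuracy=0.734, mad=0.124)
score_31 = model_score(stability=0.224, accuracy=0.729, mad=0.124)
```

Under these assumptions model 21 scores higher than model 31, which agrees with the ordering described in the Results section.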
Consensus gene signatures
The complete procedure described here can be considered an FSS method itself, referred to as ConsGS. ConsGS uses the stability measure of an FSS method, calculated from all its gene signatures, and retains only those features that occur in more than a given fraction τ of all signatures. The threshold τ controls the sensitivity and selectivity of the method. Features selected by ConsGS are expected to be more biologically relevant, as it eliminates noise by selecting only those features that are consistently found to be significant. Furthermore, the feature frequencies over all gene signatures provide a relevance ranking for the selected features, which can be used as a starting point for evaluating candidate genes.
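The consensus step can be sketched in a few lines of Python. The function name `consensus_signature` and the toy feature names are hypothetical. The strict "more than" comparison follows the wording of this paragraph, while other passages say "at least", so the boundary case is a choice made for the example.

```python
from collections import Counter


def consensus_signature(signatures, tau=0.75):
    """Keep features found in more than a fraction tau of all signatures.

    Returns (feature, count) pairs ranked by decreasing frequency, which
    doubles as the relevance ranking mentioned in the text.
    """
    n = len(signatures)
    freq = Counter(f for signature in signatures for f in set(signature))
    kept = [(f, count) for f, count in freq.items() if count / n > tau]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


toy = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}, {"a", "c", "e"}]
ranked = consensus_signature(toy, tau=0.7)
strict = consensus_signature(toy, tau=0.75)
```

Lowering `tau` lengthens the consensus signature and raising it shortens the signature, which is the sensitivity/selectivity trade-off described above.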
Feature subset selection
Various FSS methods for producing gene signatures are investigated. Here, we use the distinction made in Inza et al. (2004) between filter methods, which select genes based on statistical correlations between expression values and sample classes, and wrapper methods, which select genes based on their ability to separate sample classes, using a given classifier. The following filter methods were used in this study:
- F-test (FTt). All features with an F-test statistic above a chosen threshold t are selected, e.g. FT30 applies the threshold t = 30.
- Pearson correlation (PC). All features are selected with an absolute Pearson correlation r between expression values and sample groups above a chosen threshold t, i.e. |r| > t. Alternatively, the top n genes with the highest absolute Pearson correlation coefficient (referred to as PC[n]) are selected, e.g. PC[50] to select the top 50 genes.
- P-value combined with fold change (PVt FCf). Genes are selected having both t-test P-values lower than a threshold t (default: t = 0.001) as well as log fold changes |log2(fold-change)| higher than a threshold f (e.g. f = 2.5), between any two sample groups. The fold change is calculated as the ratio of the midmeans (mean of values between the 25th and 75th percentiles) of the expression values for each pairwise combination of patient groups.
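The fold-change part of the last filter can be illustrated as follows. The sketch covers only the midmean and the log2 ratio, not the t-test. The way the quartile boundaries are cut (dropping the lowest and highest quarter of the sorted values) and the function names are assumptions, and expression values are assumed to be positive.

```python
import math
from itertools import combinations
from statistics import mean


def midmean(values):
    """Mean of the values between the 25th and 75th percentiles (interquartile mean)."""
    ordered = sorted(values)
    k = len(ordered) // 4  # assumed cut: drop the lowest and highest quarter
    return mean(ordered[k:len(ordered) - k])


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of midmeans of two sample groups (positive values assumed)."""
    return math.log2(midmean(group_a) / midmean(group_b))


def passes_fold_change(groups, f):
    """True if |log2 fold change| exceeds f for any pair of sample groups."""
    return any(
        abs(log2_fold_change(groups[a], groups[b])) > f
        for a, b in combinations(groups, 2)
    )


# Hypothetical expression values of one gene in three sample groups.
gene = {"healthy": [8, 8, 8, 8], "early": [2, 2, 2, 2], "late": [2, 2, 3, 2]}
```

The midmean makes the fold change robust to single outlying arrays: in `[1, 2, 3, 4, 5, 6, 7, 100]` the value 100 does not contribute.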
Additionally, over-representation analysis (ORA) is used for optimizing both biological as well as statistical relevance (Draghici et al., 2003). ORA is widely used to analyze the distribution of GO (Ashburner et al., 2000) terms within gene signatures. Moreover, approaches use GO annotations to improve the classification, e.g. Lottaz and Spang (2005). Our approach, however, uses GO/ORA only for post-processing gene lists of the aforementioned FSS methods, thereby incorporating additional biological knowledge.
The last two FSS methods are similar to wrapper methods in that they use trained classifiers to estimate the significance of genes with respect to sample class separation. They are different from wrapper methods, however, in that they do not explicitly make use of classification accuracy to determine gene significance. This is intended to avoid basing gene selection solely on maximizing classification accuracy.
- Decision tree-based (DTt). A decision tree is trained and genes are selected starting at the root and moving down the tree, as long as they exceed a minimum significance threshold t (t = 0.01) (Breiman et al., 1984).
- SVM-based (SVMt). A linear SVM is trained, which assigns a normed weight to each gene. Genes having a normed weight above a threshold t (t = 0.00001) are selected.
Classification
We consider a number of classification methods that are commonly used in the literature:
- NSC (Nearest Shrunken Centroid) (Tibshirani et al., 2002). A modification of the nearest centroid classifier, which assigns a sample to the class with the nearest centroid. The modification consists of shrinking the centroids of each class toward the overall centroid of all classes.
- kNN (5NN, with k = 5) (Mitchell, 1997). A sample is assigned to a class based on a voting scheme among its k nearest neighboring samples.
- SVM (Boser et al., 1992). Classes are separated by finding a maximal margin hyperplane between them, either in the original feature space, or in a higher dimensional space, depending on the kernel function used. We considered SVMs both with third order polynomial kernels (SVM-P) and radial basis function kernels (SVM-R). The R library used is based on LibSVM (Chang and Lin, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm).
- DT (Breiman et al., 1984). Each internal node in the tree describes a test on a feature and has one child node for each possible value or range of values of the feature. A class label is associated with each leaf node and the classification of a sample is determined by following the path from the root to a leaf.
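As an illustration of the simplest classifier in this list, the kNN voting scheme can be sketched as below. Euclidean distance and the handling of ties are assumptions, since the text does not specify them, and the toy data are invented for the example.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, sample, k=5):
    """Assign the majority class among the k nearest training samples."""
    # Sort training samples by Euclidean distance to the query sample.
    neighbours = sorted(
        (math.dist(x, sample), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature data with two well-separated classes.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = ["a", "a", "a", "b", "b", "b", "b"]
near_a = knn_predict(train_x, train_y, (0.5, 0.5))
near_b = knn_predict(train_x, train_y, (5.5, 5.5))
```

In the procedure described in this paper, `train_x` would contain only the features of the gene signature selected on the corresponding training set.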
Data preparation
We applied our approach to a large and difficult osteoarthritis (OA) experiment with 78 single-channel cDNA expression arrays (Aigner et al., 2006). Each array contains 7467 spots corresponding to 3468 genes, which represent the input features that are processed by the various FSS methods. The samples were divided into four classes according to the stages of degenerative cartilage progression, leading to a four-way classification problem. The arrays represented healthy (18 patients), early degenerative cartilage (20), peripheral OA (21) or central OA (19) states, whereby healthy and early patients have not been diagnosed with OA, and peripheral OA and central OA describe two variants of late OA. Scaled median absolute deviation (SMAD) (Dudoit and Yang, 2003) was applied as the most suitable between-array normalization method for this dataset (Fundel et al., 2005).
Additionally, we evaluated a publicly available dataset on breast cancer (van't Veer et al., 2002) with 78 two-channel arrays containing some 25 000 human genes.
| RESULTS |
|---|
Pairwise combinations of the aforementioned FSS and classification methods were examined for different numbers of sampling steps. Each of these models is evaluated in each sampling step. We found that the performance of models can be highly variable when the number of sampling steps is low (Figure 2). This shows that performance estimates are unreliable without repeated sampling. For example, the initial classification performance (with only one sampling step) of the model FT30/5NN was over-estimated by about 20 percentage points, compared to its performance after 250 sampling iterations. Performance underestimates can also be observed in Figure 2, where FT50/NSC gains about 10 percentage points, showing that repeated sampling prevents both types of errors caused by variable classifier performance. Based on these observations, we set the number of sampling steps to 400, so that the performance of all classifiers becomes sufficiently stable. These 400 sampling steps produce 400 trained models for each combination of FSS and classification method, whose accuracy distribution provides a better estimate of the reliability of the model by considering the median accuracy and the MAD of the accuracy (see Table 1).
Combining FSS and classification methods
The results of all pairwise combinations of the seven FSS and five classification methods examined here are summarized in Table 1. No single FSS method or classification method is consistently superior with respect to all criteria; rather, the performance of the different models is highly variable. This is the rationale behind the model scoring function. For example, when we compare models 27 (DT0.01/5NN) and 34 (SVM0.00001/SVM-R), we see that both achieve a total accuracy of 68.4%. However, model 34 is penalized for the higher MAD (18.0%) of its accuracy, while model 27 benefits from having a much higher gene signature stability (53.7%) than that of model 34 (22.4%). Model 27 includes more genes (234) on average than model 34 (199) in order to achieve this signature stability. Nonetheless, stable gene signatures are more important than short gene signatures, causing model 27 to be rated superior to model 34 on this data. In another comparison, models 21 (PV0.01FC2.0 + ORA/NSC) and 31 (SVM0.00001/NSC) show very similar accuracies (73.4% and 72.9%, respectively) and identical MADs (12.4%). However, model 21 achieves a higher gene signature stability (63.9% versus 22.4%) and accomplishes this with shorter signatures (139 genes versus 199 genes), leading to a higher rating of model 21 over model 31. Because the weightings of the different factors are configurable, the model ranking can be fine-tuned to the individual goals of the researcher.
When one considers the top model (model 24, highlighted), one sees that it does not have the highest total accuracy, the lowest MAD, the shortest gene signatures or even the highest signature stability. Nonetheless this model is ranked first because it represents the best overall performance for the given dataset, striking the best balance between performance and stability. If we consider an arbitrary fixed model, for example model 6, based on the methods used in Michiels et al. (2005), one sees that systematically evaluating all combinations of FSS and classification methods with StabPerf was able to uncover a model (i.e. model 24) with both higher total accuracy as well as better signature stability for more genes. The fixed threshold of 50 genes in PC[50] excludes many potentially significant genes, which explains its poor stability (37.9%) compared to, e.g. that of PC0.6 (51.1%), which selects genes exceeding a minimum relevance criterion.
Consensus gene signatures
For a given FSS method, some features are more consistently represented across the 400 generated gene signatures than others. In Figure 3 the frequencies of occurrence of all features selected by the chosen FSS method PV0.01FC2.0 + ORA are shown (features occurring in none of the 400 signatures not shown). We see that there is a large number of features that are present in every signature and a fair number of additional features that are present in at least τ = 75% (300) of the signatures.
We applied this procedure to all seven FSS methods. Table 2 shows to what extent the median signature length can be reduced by building consensus signatures. While our analysis is based on spots on the array (i.e. features), one can also map spots to the genes they represent. The frequency of occurrence of a gene in the 400 signatures is defined here as the average frequency of occurrence of its corresponding spots in the 400 signatures. A gene is retained in the consensus gene signature if it was selected in 75% of these signatures. The threshold τ of 75% may be adjusted, extending or shortening the consensus signatures (column Stable in Table 2), allowing control over the sensitivity and selectivity of the method.
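The spot-to-gene mapping described above can be sketched as follows. The spot and gene identifiers are hypothetical, and two details are assumptions: spots that were never selected count with frequency zero, and a gene is retained when its averaged frequency reaches the threshold ("at least" 75%).

```python
from collections import defaultdict
from statistics import mean


def gene_frequencies(spot_counts, spot_to_gene, n_signatures):
    """Average the relative selection frequencies of all spots of each gene."""
    per_gene = defaultdict(list)
    for spot, gene in spot_to_gene.items():
        # Spots absent from spot_counts were never selected (assumed frequency 0).
        per_gene[gene].append(spot_counts.get(spot, 0) / n_signatures)
    return {gene: mean(values) for gene, values in per_gene.items()}


# Hypothetical counts: number of the 400 signatures containing each spot.
spot_counts = {"spot1": 400, "spot2": 200, "spot3": 320, "spot4": 100}
spot_to_gene = {"spot1": "GENE_A", "spot2": "GENE_A", "spot3": "GENE_B", "spot4": "GENE_C"}

freqs = gene_frequencies(spot_counts, spot_to_gene, n_signatures=400)
retained = sorted(gene for gene, value in freqs.items() if value >= 0.75)
```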
A literature search was conducted to compare the 400 individual pooled signatures (215 spots
If sampling is neglected in such a scenario, the ratio of relevant genes to selected genes will be lower. For example, with PV0.01FC2.0 + ORA 90.3% (SD: 2.7%) of the genes in a signature are found to be OA-related, on average. However, the consensus signature contained 97.0% OA-related genes, which is more than two SDs above the mean. As PV0.01FC2.0 + ORA incorporates biological knowledge, the consensus signature produces only a small, but significant, improvement. Less stable FSS methods are expected to benefit even more from these consensus signatures. Additionally, without sampling, it is not clear how much of the observed classification performance is due to random correlations in the expression data. Moreover, the manual validation of the selected genes is more challenging, as the signatures are longer and contain fewer relevant genes.
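The "more than two SDs" statement can be checked directly from the percentages quoted above, expressing the distance of the consensus signature from the mean of the individual signatures in units of their SD:

```latex
z = \frac{97.0 - 90.3}{2.7} = \frac{6.7}{2.7} \approx 2.48
```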
The effectiveness of the procedure depends on the selectivity and sensitivity of individual FSS methods. An FSS method with high sensitivity and low selectivity, such as PC0.6, allows more effective determination of consensus signatures. Overly selective FSS methods, e.g. PC[50] on this dataset, exclude many potentially stable genes, making it more difficult to produce a meaningful consensus signature. As seen in Table 2, only 11 genes are retained in the consensus gene signature for PC[50], compared to 83 genes from PC0.6. This provides fewer stable candidate genes for further examination unless the stability threshold τ is lowered.
Overall classification performance on the OA dataset
The overall classification accuracy of StabPerf was estimated as 73.4% (SD: 11.0%) by 10-fold stratified cross validation for the difficult four-way classification task on the OA dataset. The accuracies can be based on different selected models for different folds. The OA disease classes normal and early as well as peripheral OA and central OA are known to be difficult to separate (Fundel et al., 2005). If one simplifies this classification problem, by combining the two classes normal with early as well as peripheral OA with central OA, we observe an average accuracy of 97.5%.
Data on breast cancer
We applied our methods to a second experiment comprising 78 arrays on breast cancer (van't Veer et al., 2002). The aim of this study was to examine whether breast cancer patients who survive for at least five years free of metastases after treatment could be identified. This dataset has also been analyzed by Michiels et al. (2005) based on the same filter that was used in the original publication. Here, the same protocol as in model 9 (compare Table 1) has been used, differing only in the number of sampling steps (van't Veer et al. use 500). Across all three studies (including our own) a two-class cross-validation accuracy of about 60% was estimated, showing that the patient groups are difficult to separate from gene expression profiles alone. Furthermore, a Modscore of 0.083 shows that signatures generated from this dataset cannot be expected to yield a good classification performance or signature stability. We tested all 35 models that were used for the OA data. The criteria for all FSS methods were relaxed to accommodate the difficult dataset (PC0.3, FT20, PV0.1 FC1.5, SVM0.0001), leading to accuracies of the different models between 50 and 67% (detailed data not shown). Here, the Modscore was anticorrelated with performance, showing that either moderate accuracy or moderate feature stability could be achieved, but not both.
| DISCUSSION |
|---|
The method StabPerf proposed in this article addresses two objectives in classification tasks of microarray data. The first objective focuses on achieving high classification performance by evaluating FSS methods and classification methods in order to determine the most appropriate combination specifically for the given dataset. The second objective requires that FSS methods and the performance of the classifier are stable, i.e. resilient against variations between sets of expression arrays. We addressed these two objectives by evaluating all possible combinations of the examined FSS and classification methods and by subjecting each of these combinations to repeated random sampling.
Of course, such a strategy can only be meaningfully executed if a reasonably large dataset is being analyzed. In our case, we investigated an OA dataset with four classes and approximately twenty measurements per class. This is one of the largest OA datasets available and is large enough to allow accurate stability and reliability analyses with StabPerf. In addition, the dataset is difficult in that distinguishing the four classes is difficult both at the phenotypic and the gene expression level. In particular, this is true for the comparison between normal and early as well as between peripheral and central OA.
Neglecting the first objective, i.e. relying on a particular classification method, leads to suboptimal classification performance. This can, at the cost of increased computational effort, be avoided by evaluating all pairwise combinations of FSS and classification methods. Indeed, different models may be appropriate for different datasets and the StabPerf approach prevents one from being bound to an inappropriate model for the given dataset. Instead, StabPerf selects the best-scoring model in its library for each dataset.
If the second objective of the strategy is compromised, i.e. repeated random sampling is not used, one cannot account for the classification performance that results from learning random correlations. Compared to approaches without sampling, the classification accuracies obtained from our approach may appear inferior in some cases. However, StabPerf provides a more realistic estimation of performance and corrects for overly optimistic and overly pessimistic results. Indeed, it is preferable to estimate the expected performance on future datasets by using a distribution of accuracies, as provided by such a sampling approach.
We also do not follow the approach of standard FSS methods, which simply select genes that exhibit strong differential regulation between sample groups. Such signatures tend to be unstable and contain many genes that may not be relevant to the biological question at hand. Additionally, the detailed manual validation of standard signatures is more time consuming compared to that of a consensus signature (Table 2). This consensus signature is built from the most stable genes identified by sampling a given FSS method. The length of this consensus signature is determined by the chosen frequency threshold, allowing control over the sensitivity and selectivity of the method. The number of features required for further studies can easily be adjusted at this point by adjusting the threshold. This should avoid shortcomings of arbitrarily limiting the features at the FSS step. For the OA dataset, we were able to show that the consensus signature genes are much more likely to be related to the disease than genes occurring less frequently.
It is important to avoid performance sacrifices on the one hand but also to avoid unrealistic performance estimates on the other. This also requires that sensible parameters be chosen manually for the given dataset beforehand, so that FSS methods do not return empty sets of features. We are aware that this might introduce a bias into performance estimation, but we expect this bias to be negligible as we do not perform systematic parameter optimization.
We showed that the proposed strategy is well suited to address both types of problems and that the achieved accuracy of 73.4% for the four-way and of 97.5% for the two-class classification problem, as estimated by 10-fold stratified cross validation, is indeed realistic for the given OA dataset. As in many cases biological classes are indeed difficult to separate, the computational cost associated with our approach is justified.
For comparison, we also analyzed a publicly available dataset on breast cancer (van't Veer et al., 2002) that has also been analyzed by Michiels et al. (2005). They found that the two classes of patients were difficult to separate (accuracy of 60%) and that feature signatures were unstable in different samplings. In terms of model classification performance our results are consistent with Michiels et al. (2005). Signatures extracted from this data are not reliable, which was clearly reflected by our Modscore.
In general, we see that processes that occur early in the analysis of microarray data, such as gene selection, have pronounced effects on classification performance and on the reliability of gene signatures. Therefore, it is imperative that many FSS methods be evaluated in terms of signature stability, and that these methods be parameterized to emphasize sensitivity over selectivity. As we showed in the case of the Pearson correlation, simply selecting the top 50 genes, for example, prevents one from generating meaningful consensus signatures. Indeed, consensus signatures generated from stable FSS methods have been shown to provide concise and reliable gene lists, as confirmed by checking the disease relevance of the corresponding genes.
| CONCLUSION |
|---|
New methods for feature subset selection and classification of microarray data continue to appear in the literature. For a given dataset, our new model selection approach StabPerf finds the most appropriate combination of FSS and classification method out of a predefined model library. Our sampling approach for all models, as implemented in StabPerf, delivers reliable gene signatures and robust classification performance estimates for a given measurement to be analyzed by computing data similar to Table 1.
The systematic analysis performed by StabPerf involves computational costs that depend on the number of sampling steps and the number of FSS and classification methods used. However, the StabPerf procedure, which is available as a free R package, has been designed to exploit parallel processing facilities when run in a LAM/MPI environment (Burns et al., 1994). For the data and parameter settings used to reproduce Table 1, 5.5 h on 20 Intel Xeon CPUs were required (without cross validation). Thus, the computations involved need not delay the analysis process in realistic applications. The advantages with respect to stability and reliability in the analysis of valuable microarray measurements justify the additional computational effort of sampling, systematic model evaluation and cross validation. For a given microarray experiment, the StabPerf procedure allows one to avoid typical mistakes in early analysis steps and to obtain realistic classification performance estimates as well as stable, and therefore more reliable, ranked lists of relevant genes of customizable length for further processing.
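To make the cost argument concrete, the short Python sketch below converts the reported run (5.5 h on 20 CPUs) into CPU hours and shows how the number of model fits grows with the sampling and library sizes. The helper names are hypothetical, the library sizes in the last line are invented for illustration, and the wall-clock estimate assumes ideal linear scaling across CPUs, which real MPI runs will not fully reach.

```python
def cpu_hours(wall_hours, n_cpus):
    """Total CPU time of a run that kept n_cpus busy for wall_hours."""
    return wall_hours * n_cpus


def estimated_wall_hours(total_cpu_hours, n_cpus, efficiency=1.0):
    """Wall-clock estimate; efficiency=1.0 assumes ideal linear scaling."""
    return total_cpu_hours / (n_cpus * efficiency)


def n_model_fits(n_samplings, n_fss_methods, n_classifiers):
    """One fit per sampling step and per (FSS method, classifier) pair."""
    return n_samplings * n_fss_methods * n_classifiers


total = cpu_hours(5.5, 20)               # 110.0 CPU hours for the reported run
print(total)
print(estimated_wall_hours(total, 40))   # 2.75 h on 40 CPUs under ideal scaling
print(n_model_fits(100, 5, 4))           # hypothetical sizes, not taken from the paper
```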
| Acknowledgments |
|---|
We would like to thank Dr. Thomas Aigner (Osteoarticular and Arthritis Research, Institute of Pathology, University of Leipzig), Dr. Klaus Lindauer (Bioinformatics, Sanofi-Aventis, Frankfurt), Dr. Joachim Saas and Dr. Eckart Bartnik (Therapeutic Department Thrombosis and Angiogenesis, Sanofi-Aventis, Frankfurt) for helpful comments and discussions. This work has partially been funded by projects BEX (Sanofi-Aventis) and BFAM (bmbf).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.
Associate Editor: Alvis Brazma
Received on January 10, 2006; revised on June 21, 2006; accepted on July 18, 2006
| REFERENCES |
|---|
Aigner, T., et al. (2006) Large-scale gene expression profiling major pathogenetic pathways of cartilage degeneration in osteoarthritis. Arthritis and Rheum, in press.
Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet., 25, 25–29.
Bi, J., Bennett, K., Embrechts, M., Breneman, C., Song, M. (2003) Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res., 3, 1229–1243.
Boser, B.E., Guyon, I., Vapnik, V.N. (1992) A training algorithm for optimal margin classifiers. COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh. ACM Press, NY, pp. 144–152.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984) Classification and Regression Trees. Wadsworth & Brooks, Monterey.
Burns, G., Daoud, R., Vaigl, J. (1994) LAM: An open cluster environment for MPI. In Ross, J.W. (Ed.). Proceedings of Supercomputing Symposium 94, University of Toronto, pp. 379–386.
Chang, C. and Lin, C. (2001) LibSVM: a library for support vector machines.
Draghici, S., et al. (2003) Global functional profiling of gene expression. Genomics, 81, 98–104.
Dudoit, S. and Yang, J.Y.H. (2003) Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In Parmigiani, G., Garrett, E.S., Irizarry, R.A., Zeger, S.L. (Eds.). The Analysis of Gene Expression Data: Methods and Software. Springer, NY, pp. 73–101.
Dudoit, S., et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.
Fundel, K., Küffner, R., Aigner, T., Zimmer, R. (2005) Data Processing Effects on the Interpretation of Microarray Gene Expression Experiments. In Torda, A., Kurtz, S., Rarey, M. (Eds.). German Conference on Bioinformatics (GCB) 2005, Hamburg, Lecture Notes in Informatics. Gesellschaft für Informatik, Bonn, pp. 77–91.
Guyon, I. and Elisseeff, A. (2003) An introduction to variable and feature selection. J. Mach. Learn. Res., 3, 1157–1182.
Hastie, T., Tibshirani, R., Friedman, J.H. (2001) The Elements of Statistical Learning. Springer-Verlag, NY.
Inza, I., et al. (2004) Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med., 31, 91–103.
Lottaz, C. and Spang, R. (2005) Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics, 21, 1971–1978.
Michiels, S., et al. (2005) Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365, 488–492.
Mitchell, T.M. (1997) Machine Learning. McGraw-Hill, NY.
Statnikov, A., et al. (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631–643.
Tibshirani, R., et al. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, 99, 6567–6572.
van't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.