Skip Navigation


Bioinformatics Advance Access originally published online on May 11, 2007
Bioinformatics 2007 23(13):1702-1704; doi:10.1093/bioinformatics/btm162
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/13/1702    most recent
btm162v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Google Scholar
Right arrow Articles by Boulesteix, A.-L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Boulesteix, A.-L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

WilcoxCV: an R package for fast variable selection in cross-validation

Anne-Laure Boulesteix *

Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1, D-81677 Munich, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: In the last few years, numerous methods have been proposed for microarray-based class prediction. Although many of them have been designed especially for the case n << p (much more variables than observations), preliminary variable selection is almost always necessary when the number of genes reaches several tens of thousands, as usual in recent data sets. In the two-class setting, the Wilcoxon rank sum test statistic is, with the t-statistic, one of the standard approaches for variable selection. It is well known that the variable selection step must be seen as a part of classifier construction and, as such, be performed based on training data only. When classifier accuracy is evaluated via cross-validation or Monte–Carlo cross-validation, it means that we have to perform p Wilcoxon or t-tests for each iteration, which becomes a daunting task for increasing p. As a consequence, many authors often perform variable selection only once using all the available data, which can induce a dramatic underestimation of error rate and thus lead to misleadingly reporting predictive power. We propose a very fast implementation of variable selection based on the Wilcoxon test for use in cross-validation and Monte Carlo cross-validation (also known as random splitting into learning and test sets). This implementation is based on a simple mathematical formula using only the ranks calculated from the original data set.

Availability: Our method is implemented in the freely available R package WilcoxCV which can be downloaded from the Comprehensive R Archive Network at http://cran.r-project.org/src/contrib/Descriptions/WilcoxCV.html

Contact: boulesteix{at}slcmsr.org


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Many applied and methodological articles have been devoted to class prediction based on high-dimensional microarray data with applications to, e.g. molecular cancer diagnosis or prediction of response to therapy. In this context, it is common practice to perform univariate variable selection before constructing a classifier, even if the chosen classification method can handle a large number of predictors. In binary classification, it is usual to rank genes according to the P-value obtained in, e.g. the t-test for two independent samples and related methods or the Wilcoxon rank sum test, also known as the Mann–Whitney test (Boulesteix and Tutz, 2006; Dettling and Bühlmann, 2003). The genes with the smallest P-values are then selected and used for classifier construction. In contrast to the t-test, the Wilcoxon rank sum test is robust against outliers, which are frequent in microarray data, and does not require normal distribution of the expression levels within both classes. This is an important advantage, since normality of gene expression data is often questionable, even after normalization. Wilcoxon-based variable selection is reported to perform very well in one of the most extensive comparison studies on microarray-based classification (Lee et al., 2005).

The performance of classification methods is commonly evaluated by cross-validation (CV) or Monte–Carlo cross-validation (MCCV). In m-fold CV, the n observations are divided into m (approximately) equally sized groups. In the k-th CV iteration, the k-th group is considered as test data set, whereas the remaining m–1 groups form the learning set which is used for classifier construction. This classifier is then used to predict the observations from group k. After the m iterations, the error rate is estimated as the proportion of misclassified observations. An important special case is leave-one-out cross-validation (LOOCV), where m = n. Monte–Carlo cross-validation (also denoted as subsampling or random splitting in the literature) also consists of several iterations in which the data set is split into learning and test sets. In contrast to CV, the test sets are not chosen to form a partition of the whole data set but drawn randomly (without replacement) from the n observations at each iteration. The number of iterations Niter is fixed by the user and can be as high as computationally feasible, leading to a more robust estimation than CV. The size ratio between learning and test data sets is also fixed by the user. Usual choices are, e.g. 2:1, 4:1 or 9:1. Repeated CV is another robust procedure (Braga-Neto and Dougherty, 2004) which consists of averaging the results obtained in CV for different partitions. Braga-Neto and Dougherty (2004) and Molinaro et al. (2005) review and compare procedures for estimating the error rate of a classifier, including those mentioned in the present article and other like bootstrap sampling.

Procedures such as CV and MCCV are commonly applied for both estimation and optimization purposes. When used for estimation, the goal of CV and MCCV is to evaluate the performance of the considered classifier on independent data, which is a major topic in all medical articles on microarray-based prediction. In the context of optimization, CV and MCCV aim at selecting the best combination of method parameters based on a learning set. These parameters are then used to predict observations from the test set. Method parameters may include, e.g. the number of components in PLS (Boulesteix and Strimmer, 2007) and other dimension reduction methods (Dai et al., 2006) or the penalty parameter in penalized logistic regression (Zhu and Hastie, 2004). When reporting the accuracy of a classification method, it is particularly important to perform such a CV-based optimization step, denoted as inner loop by Statnikov et al. (2005).

In many articles using a CV procedure (either for error rate estimation or for parameter optimization), it is unclear whether and when preliminary variable selection was performed, although bias due to too early variable selection are well documented (Ambroise and McLachlan, 2002). When LOOCV is used for error estimation, selecting variables using all n observations instead of considering variable selection as a part of classifier construction leads to downwardly biased estimation. Apparently good performing classifiers may be produced even when predictors are not associated with class membership, yielding ‘noise discovery’ (Ioannidis, 2005). When LOOCV is used to determine the optimal parameter value of a given method, for instance, the number of components in PLS dimension reduction (Boulesteix and Strimmer, 2007; Dai et al., 2006), performing variable selection with all available observations may favor sparse models.

We argue that computational expense is the main reason for variable selection to be often (spuriously) performed only once using all available observations. We propose an implementation of variable selection based on the Wilcoxon rank sum test in the context of CV and MCCV which solves this problem by using a simple mathematical formula. It outputs the Wilcoxon test statistics for all CV or MCCV iterations simultaneously in much less time than if the Wilcoxon tests were applied successively in all iterations.


    2 IMPLEMENTATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Let us consider a sample (xi, yi)i=1,...,n, where yi denotes the binary class membership (yi= 0, 1) and xi the expression level of observation i for the considered gene. For simplicity, we omit the index g (g = 1, ..., p) of the gene. Let Ri denote the rank of observation i. The Wilcoxon rank sum test tests the equality of the medians in two independent samples (here, the samples defined by yi= 0 and yi= 1). The test statistic is given as W = {Sigma}i:yi = 0 Ri, which is the sum of the ranks of observations from class yi= 0. The P-value of the test is derived from the exact null-distribution of W (for very small samples) or from the asymptotic result


Formula 1

(1)
where n0 and n1 are the numbers of observations with yi = 0 and yi = 1, respectively. In CV or MCCV, we denote as Tk (k = 1, ... , Niter) the set of the nTk observations included in the test set for the k-th iteration. For example, we have Niter = m, Formula and Tk1 {cap} Tk2 = {emptyset} {forall}k1 != k2 in m-fold CV. In the special case of LOOCV, Tk is defined as Tk = {k} and Niter = n. Let Wk denote the Wilcoxon rank sum test statistic obtained based on the sample (xi,yi)i {notin} Tk including all observations except those from Tk. We derive a new simple formula allowing to compute Wk, k = 1,..., Niter simultaneously. Let Ri,k denote the rank of observation i in the k-th iteration, i.e. in the sample (xi, yi)i {notin} Tk with the convention Ri,k = 0 = (Ri –Ri) if i isin Tk. For i {notin} Tk, we have



Formula 2

(2)

We obtain


Formula 3

(3)
This formula is based on the Ri (i = 1, ..., n) only. Hence, it allows to compute the Wk simultaneously very efficiently. Computation of the P-values and ordering of the genes can then be carried out based on the standardized statistic Formula which is asymptotically normally distributed:


Formula 4

(4)
where n0,k denotes the number of observations with yi = 0 when observations from Tk are removed. In the k-th iteration, the best genes are those with the highest Formula values.

2.1 Run time comparison
We compared the time needed to order 1000 genes in CV and MCCV by (i) running the function wilcox.test for each CV or MCCV iteration, (ii) using our novel efficient algorithm as implemented in the function wilcox.selection.split from our R package WilcoxCV. Results are given in Table 1 for different values of n and two different procedures: LOOCV and MCCV with size ratio 9:1 and Niter = 100 iterations. As can be seen from Table 1, the new algorithm reduces computation time dramatically (up to a factor 50) compared to the standard approach (carrying out the Wilcoxon rank sum test for each iteration).


View this table:
[in this window]
[in a new window]

 
Table 1. Time (in seconds) needed by the standard approach (i) (normal font) and our novel algorithm (ii) (italic) as output by the function system.time

 


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 ACKNOWLEDGEMENTS
 REFERENCES
 
I thank Martin Daumer, Korbinian Strimmer and Elisabeth Gnatowski for critically reading the manuscript. This work was supported by the Porticus Foundation in the context of the International School for Technical Medicine and Clinical Bioinformatics.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: David Rocke

Received on April 16, 2007; revised on April 16, 2007; accepted on April 22, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Nat. Acad. Sci USA (2002) 99:6562–6566.[Abstract/Free Full Text]

    Boulesteix A-L, Tutz G. Identification of interaction patterns and classification with applications to microarray data. Comput. Stat. Data Anal (2006) 50:783–802.[CrossRef]

    Boulesteix A-L, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinformatics (2007) 8:32–44.[Abstract/Free Full Text]

    Braga-Neto U, Dougherty E. Is cross-validation valid for small-sample microarray classification? Bioinformatics (2004) 20:374–380.[Abstract/Free Full Text]

    Dai JJ, et al. Dimension reduction for classification with gene expression data. Stat. Appl. Genet. Mol. Biol (2006) 5:6.

    Dettling M, Bühlmann P. Boosting for tumor classification with gene expression data. Bioinformatics (2003) 19:1061–1069.[Abstract/Free Full Text]

    Ioannidis JP. Microarrays and molecular research: noise discovery. The Lancet (2005) 365:488–492.

    Lee JW, et al. An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Anal (2005) 48:869–885.[CrossRef]

    Molinaro A, et al. Prediction error estimation: a comparison of resampling methods. Bioinformatics (2005) 21:3301–3307.[Abstract/Free Full Text]

    Statnikov A, et al. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics (2005) 21:631–643.[Abstract/Free Full Text]

    Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics (2004) 5:427–443.[Abstract]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
A.-L. Boulesteix, C. Porzelius, and M. Daumer
Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value
Bioinformatics, August 1, 2008; 24(15): 1698 - 1706.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/13/1702    most recent
btm162v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Google Scholar
Right arrow Articles by Boulesteix, A.-L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Boulesteix, A.-L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?