

Bioinformatics Advance Access originally published online on May 21, 2009
Bioinformatics 2009 25(15):1884-1890; doi:10.1093/bioinformatics/btp331

© The Author 2009. Published by Oxford University Press on behalf of the US government 2009

Predictor correlation impacts machine learning algorithms: implications for genomic studies

Kristin K. Nicodemus 1,2,3,* and James D. Malley 4

1 Department of Statistical Genetics, Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, 2 Department of Clinical Pharmacology, University of Oxford, Old Road Campus Research Building, Roosevelt Road, Oxford, OX3 7DQ, UK, 3 Genes, Cognition and Psychosis Program, Intramural Research Program, National Institute of Mental Health, Room 4S-235, 10 Center Drive and 4 Mathematical and Statistical Computing Laboratory, Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD 20892, USA

* To whom correspondence should be addressed.


Abstract

Motivation: The advent of high-throughput genomics has produced studies with large numbers of predictors (e.g. genome-wide association, microarray studies). Machine learning algorithms (MLAs) are a computationally efficient way to identify phenotype-associated variables in high-dimensional data. There are important results from mathematical theory and numerous practical results documenting their value. One attractive feature of MLAs is that many operate in a fully multivariate environment, allowing variables of small individual importance to be included when they act cooperatively. However, certain properties of MLAs under conditions common in genomic data have not been well studied; in particular, correlations among predictors pose a problem.

Results: Using extensive simulation, we showed that considering correlation among predictors is crucial to making valid inferences from variable importance measures (VIMs) for three MLAs: random forest (RF), conditional inference forest (CIF) and Monte Carlo logic regression (MCLR). Using a case–control illustration, we showed that the RF VIMs, even the permutation-based ones, were less able to detect association than other algorithms at effect sizes encountered in complex disease studies. This reduction occurred when ‘causal’ predictors were correlated with other predictors, and was sharpest when RF tree building used the Gini index. Indeed, RF Gini VIMs were biased under correlation, dependent on the strength of predictor correlation and the number of correlated predictors, and over-trained to random fluctuations in the data when tree terminal node size was small. Permutation-based VIM distributions were less variable for correlated predictors and were unbiased, and thus may be preferred when predictors are correlated. MLAs are a powerful tool for high-dimensional data analysis, but well-considered use of algorithms is necessary to draw valid conclusions.
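
To make the idea of a permutation-based VIM concrete, the following is a minimal sketch, not the simulation design or software used in the article. It assumes a toy data set with one 'causal' predictor (x0), one predictor correlated with it (x1) and one independent noise predictor (x2), and it swaps in a simple nearest-centroid classifier in place of RF, CIF or MCLR. Importance is measured as the drop in accuracy on a held-out sample after shuffling one predictor, whereas RF implementations typically use out-of-bag samples. All function names and parameter values (simulate, run_demo, rho = 0.8, the sample sizes) are hypothetical and chosen only for illustration.

```python
import random


def simulate(n, rho, effect, rng):
    """Toy data (an assumption, not the paper's design): y depends on x0 only;
    x1 has correlation about rho with x0; x2 is independent noise."""
    rows, labels = [], []
    for _ in range(n):
        x0 = rng.gauss(0, 1)
        x1 = rho * x0 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        y = 1 if effect * x0 + rng.gauss(0, 1) > 0 else 0
        rows.append([x0, x1, x2])
        labels.append(y)
    return rows, labels


def fit_centroids(rows, labels):
    """Stand-in learner: per-class mean vectors (nearest-centroid classifier)."""
    p = len(rows[0])
    sums = {0: [0.0] * p, 1: [0.0] * p}
    counts = {0: 0, 1: 0}
    for x, y in zip(rows, labels):
        counts[y] += 1
        for j in range(p):
            sums[y][j] += x[j]
    return {c: [s / counts[c] for s in sums[c]] for c in (0, 1)}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return 0 if sqdist(centroids[0]) <= sqdist(centroids[1]) else 1


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(centroids, rows, labels, rng, n_repeats=5):
    """Mean drop in held-out accuracy after shuffling each predictor in turn."""
    baseline = accuracy(centroids, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [x[j] for x in rows]
            rng.shuffle(column)  # break the link between predictor j and y
            permuted = [x[:j] + [v] + x[j + 1:] for x, v in zip(rows, column)]
            drops.append(baseline - accuracy(centroids, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def run_demo(seed=0, rho=0.8):
    """Fit on one simulated sample, score importances on a second one."""
    rng = random.Random(seed)
    train_x, train_y = simulate(2000, rho, 1.0, rng)
    test_x, test_y = simulate(2000, rho, 1.0, rng)
    model = fit_centroids(train_x, train_y)
    return permutation_importance(model, test_x, test_y, rng)


if __name__ == "__main__":
    names = ["x0 (causal)", "x1 (correlated)", "x2 (noise)"]
    for name, imp in zip(names, run_demo()):
        print(f"{name}: {imp:+.3f}")
```

Varying rho in run_demo gives a rough feel for how a correlated non-causal predictor can pick up importance of its own. The behaviour of this stand-in classifier should not be read as reproducing the RF, CIF or MCLR results reported above.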

Contact: kristin.nicodemus{at}well.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Associate Editor: Martin Bishop


Received on May 14, 2009; revised on May 14, 2009; accepted on May 16, 2009
