Skip Navigation


Bioinformatics Advance Access originally published online on June 29, 2006
Bioinformatics 2006 22(17):2173-2174; doi:10.1093/bioinformatics/btl347
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/17/2173    most recent
btl347v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bush, W. S.
Right arrow Articles by Ritchie, M. D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bush, W. S.
Right arrow Articles by Ritchie, M. D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Parallel multifactor dimensionality reduction: a tool for the large-scale analysis of gene–gene interactions

William S. Bush , Scott M. Dudek and Marylyn D. Ritchie *

Center for Human Genetics Research, Vanderbilt University Nashville, TN, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 REFERENCES
 

Summary: Parallel multifactor dimensionality reduction is a tool for large-scale analysis of gene–gene and gene–environment interactions. The MDR algorithm was redesigned to allow an unlimited number of study subjects, total variables and variable states, and to remove restrictions on the order of interactions being analyzed. In addition, the algorithm is markedly more efficient, with ~150-fold decrease in runtime for equivalent analyses. To facilitate the processing of large datasets, the algorithm was made parallel.

Availability: Parallel MDR is freely available for non-commercial research institutions. For full details see http://chgr.mc.vanderbilt.edu/ritchielab/pMDR. An open-source version of MDR software is available at http://www.epistasis.org.

Contact: ritchie{at}chgr.mc.vanderbilt.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 REFERENCES
 
The development of high-throughput genotyping technologies has dramatically increased the number of polymorphisms considered in a typical genetic association study. Detecting and characterizing interactions between these polymorphisms may be important in discovering genetic contributions to common, complex diseases. Multifactor dimensionality reduction (MDR) has been previously introduced as a non-parametric, model-free method for detecting gene–gene and gene–environment interactions (Ritchie et al., 2001; Hahn et al., 2003). MDR was designed for small candidate gene studies of <500 variables and is quickly overwhelmed by the large-scale datasets with thousands of variables generated by current technologies. In response to these trends, we have redesigned the MDR software and devised a new algorithm that can scale to handle extremely large datasets.

MDR is implemented with a cross-validation framework (typically 10-fold although any number of intervals including 1-fold can be used). In step 1 of the process, the data are divided into a training set (nine-tenth of the data) and a testing set (one-tenth of the data). Next, a set of n genetic and/or environmental factors are selected from the set of all possible factors, N. Each individual in the training set is grouped according to its state at each of the n factors (i.e. for 2 loci with 3 possible genotypes, there are 9 possible combinations; for 3 loci, 27, etc.). Then, each genotype combination is classified as high or low risk depending on the ratio of cases to controls with that genotype combination. If the ratio is <1 (there are less cases than controls) the genotype combination is classified as low risk. If the ratio is ≥1, the genotype combination is classified as high risk. This collection of high and low risk groups forms the MDR model for that combination of factors, and the fitness of that model is determined by classification error (the number of individuals incorrectly labeled by the model). To evaluate the predictive ability of a model, its prediction error is calculated based on the proportion of individuals in the testing set that are labeled incorrectly by the model. Each combination of variables (1 to k-variables) specified by the user is evaluated in this fashion, and the entire procedure is performed 10 times, using each one-tenth as a testing set and each 9 out of 10 as a training set. While several modifications to MDR have been made in this implementation, this procedure forms the basis of the MDR algorithm and has not changed.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 REFERENCES
 
The original MDR software (Hahn et al., 2003) relied on the successive use of the Standard Template Library quicksort function to subset individuals into a multilocus genotype group. This required multiple sorts to retrieve all genotype combinations for a single interaction model. To increase efficiency, we implemented a recursive binning function which combines the properties of a merge-sort and a bin-sort. This function is tree-based, placing each individual in bins according to genotype at locus 1 of the model, then recursively sorting each bin according to genotype at locus 2 (Fig. 1). All sorting operations are applied to pointers to the genotype data stored in a static array, reducing memory writes. This single sorting operation simultaneously groups all genotype combinations of the desired interaction model. In addition to a performance increase, the redesigned sorting mechanism removes hard-coded limitations in the number of study subjects, total variables, variable states and restrictions on the order of interactions being analyzed. The number of genotype states (AA, Aa, aa) corresponds to the number of bins/branches of the tree, limited only by machine memory. The depth of the tree and number of recurses of the binning function correspond to the number of loci in the interaction model: an interaction of two loci with three states each produces 9 multilocus genotype bins; a three-way interaction produces 27 multilocus genotype bins, etc.


Figure 1
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig 1 Recursive Binning Method for MDR Model Generation. In this example, an MDR model is generated from interaction of polymorphisms 3 and 6. Individuals are binned according to status (0, 1 or 2) at polymorphism 3. The individuals in each of these bins are then grouped according to status at polymorphism 6. The ratio of affected to unaffected individuals is computed for each of the nine resulting state-combinations. Each combination is classified as high-risk (if the ratio is >1) or low-risk (if the ratio is <1). A classification error is generated based on the number of individuals misclassified by the model. This process is repeated for each variable combination, and is performed in a cross-validation framework.

 
The framework of this MDR implementation is divided into a data handling class, a model generation and processing class, and a result storage class. The modular class design allows for parallelization. Also, future modifications of individual functions can be easily integrated into the class framework. The data handling class contains functions for loading the dataset into memory and implementing data splitting for cross-validation. Model generation consists of an optimized combination generator that produces a set of integer vectors containing the loci to be evaluated. This generator, designed by Donald Knuth (Knuth, 2005, http://www-cs-faculty.stanford.edu/~knuth/taocp.html), was parameterized to allow partitioning of model generation and evaluation across multiple nodes of a cluster environment. Each node generates and processes models using the recursive binning function until all specified combinations have been evaluated. Nodes then transmit their results to the master node where they are aggregated into the final set of results. Parallel MDR is an efficient implementation in that each node of the cluster processes a subset of the problem, there is little inter-process communication, and the results of each node are collected upon completion.

Parallel MDR was written in C++, using Message Passing Interface for inter-process communication and compiled using the GNU compiler collection (gcc).


    3 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 REFERENCES
 
To accommodate large-scale case-control association studies, several improvements to the original MDR software have been made. The efficiency of the algorithm has been markedly improved, decreasing processing time by ~150-fold. The nature of the new algorithm design places no limitations on the number of genotype variable states, study subjects, or total number of variables. The increased scalability and performance are essential for the analysis of large-scale datasets. Parallel MDR can accept any discrete value data-types, including allele, genotype or haplotype encoding (note it cannot consider haplotype uncertainty estimates). In addition, parallel MDR can accept datasets consisting of hundreds of thousands of variables. In smaller datasets, the performance increase allows the exhaustive evaluation of previously infeasible high-order interaction models. Statistical significance is assigned by permutation testing, where the MDR process is repeated on randomized data to generate an empirical distribution. This process is computationally intensive, and the performance improvements will expedite this process as well.

Empirical permutation testing will not be feasible for extremely large datasets, but in those cases parallel MDR still serves as a useful exploratory tool. In this vein, a user can now specify the number of models to be reported, ranked by classification error. These models can be thought of as data-driven hypotheses that may be analyzed using other analysis tools, such as logistic regression, or pursued in subsequent studies.

To evaluate computational feasibility, we implemented parallel MDR on 24 Pentium 4 processors of a Beowulf cluster (ACCRE, 2005). The 1.25 billion pairwise interactions between 50 000 simulated single nucleotide polymorphisms (SNPs) completed in ~40 h. Because each node of the cluster processes a subset of the overall problem and inter-process communication is minimized, the performance improvement owing to parallelization approaches a linear decrease in run-time as processors are added. As such, we estimate that the 5 billion pairwise interactions between 100 000 SNPs could be analyzed in ~160 h on 24 processors, or in ~40 h on 96 processors.

Large-scale studies of 50 000–100 000 SNPs may call for changes to the basic MDR methodology. In particular, cross-validation is especially costly, requiring multiple levels of processing. Reducing the number of cross-validation intervals from 10 to 5 or fewer may prove to be an effective strategy. Also, rather than keeping a single model as the result of an analysis, retaining multiple models may increase the statistical power of the method, but exposes the possibility of a higher false positive rate. These analysis considerations are currently being evaluated.

Parallel MDR has the capacity to analyze high-order interactions of small datasets and can feasibly analyze the lower end of genome-wide scale analyses (100 000–200 000 variables).


    Acknowledgments
 
The authors thank Nate Barney, Lance Hahn, Jason Moore and Bill White for insightful discussions. The authors would also like to thank Bill White for bringing the Knuth algorithm to our attention. Finally, this work was supported by National Institutes of Health grants AG20135 and HL65962. This work was conducted in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University (http://www.accre.vanderbilt.edu).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on February 11, 2006; revised on June 12, 2006; accepted on June 22, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 DISCUSSION
 REFERENCES
 

    Hahn, L.W., et al. (2003) Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics, 19, 376–382[Abstract/Free Full Text].

    Knuth, D.E. The Art of Computer Programming, Volume 4, Fascicle 3: Generating All Combinations and Partitions, (2005) Addison Wesley Professional, ISBN 0201853949.

    Ritchie, M.D., et al. (2001) Multifactor dimensionality reduction reveals high-order interaction among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet, . 69, 138–147[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/17/2173    most recent
btl347v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bush, W. S.
Right arrow Articles by Ritchie, M. D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bush, W. S.
Right arrow Articles by Ritchie, M. D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?