Bioinformatics Advance Access originally published online on June 29, 2006
Bioinformatics 2006 22(16):2028-2036; doi:10.1093/bioinformatics/btl344
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Pathway analysis using random forests classification and regression
1 Division of Biostatistics, Department of Epidemiology and Public Health, Yale University School of Medicine New Haven, CT 06520, USA
2 W. M. Keck Biotechnology Resource Laboratory, Yale University School of Medicine New Haven, CT 06520, USA
3 Boyer Center for Molecular Medicine, Yale University School of Medicine New Haven, CT 06520, USA
4 Department of Genetics, Yale University School of Medicine New Haven, CT 06520, USA
5 Pfizer Groton Laboratories, Safety Sciences Groton, CT 06340, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Although numerous methods have been developed to better capture biological information from microarray data, commonly used single gene-based methods neglect interactions among genes and leave room for other novel approaches. For example, most classification and regression methods for microarray data are based on the whole set of genes and have not made use of pathway information. Pathway-based analysis in microarray studies may lead to more informative and relevant knowledge for biological researchers.
Results: In this paper, we describe a pathway-based classification and regression method using Random Forests to analyze gene expression data. The proposed methods allow researchers to rank important pathways from externally available databases, discover important genes, find pathway-based outlying cases and make full use of a continuous outcome variable in the regression setting. We also compared Random Forests with other machine learning methods using several datasets and found that Random Forests classification error rates were either the lowest or the second-lowest. By combining pathway information and novel statistical methods, this procedure represents a promising computational strategy in dissecting pathways and can provide biological insight into the study of microarray data.
Availability: Source code written in R is available from http://bioinformatics.med.yale.edu/pathway-analysis/rf.htm
Contact: hongyu.zhao{at}yale.edu
Supplementary Information: Supplementary Data are available at http://bioinformatics.med.yale.edu/pathway-analysis/rf.htm
| 1 INTRODUCTION |
|---|
|
|
|---|
High-throughput microarrays have become one of the most important tools in functional genomics studies and they are commonly used to address various biological questions. Although many statistical methods have been developed to analyze microarray data based on single genes, they do not take into account the dependencies among genes and are often prone to multiple testing problems. One way to address this problem is to look at gene sets rather than all the genes at once or one at a time. For example a number of methods and programs have been developed to consider gene groupings based on Gene Ontology (GO) (e.g. Harris et al., 2004). Pathway-based methods represent another promising approach. Pathways are sets of genes that serve a particular cellular or physiologic function. Ranking pathways relevant to a particular phenotype, e.g. cancer, can help researchers focus on a few sets of genes. They are particularly useful for generating further biological hypotheses of interest. In addition, pathway-based tests, which look at a set of related genes, have the ability to detect subtle expression level changes that commonly used univariate tests cannot (Mootha et al., 2003). A review paper describing the advantages of performing pathway-based tests has been published (Curtis et al., 2005). Furthermore, as pathways are functional subunits of the cellular systems, looking at them may improve the ability to tease out biologically meaningful information from microarray data. These considerations have motivated various research groups to look at gene sets rather than single genes (e.g. Hosack et al., 2003; Rajagopalan and Agarwal, 2005).
Microarray data have been used for disease classification and treatment prognosis through various classification methods including hierarchical clustering (Sotiriou et al., 2003), Bayesian predictors (Wright et al., 2003), and support vector machines (SVMs) (Furey et al., 2000). Although classification has been around for decades, the availability of high dimensional gene expression datasets has sparked new statistical and computational challenges and ideas. Current approaches have predominantly focused on classification based on all the genes in a dataset. Some methods perform better than others in certain datasets but there is no approach superior to others across all datasets. However, the resulting sets of genes that are found to be good classifiers may be hard to interpret. Therefore, there is a need to develop pathway-based approach with efficient classification and regression methods, e.g. tree-based ones, that would allow us to obtain results that are more closely tied with the biological mechanism of diseases.
This paper describes a methodology for ranking informative pathways on the outcome of interest, either a categorical phenotype or a continuous measure. The rationale of this method stems from the fact that the better a pathway is at classifying the phenotype groups or explaining the continuous outcome, the more relevant this pathway is to the phenotype of interest. In this paper, we will make use of Random Forests, a tree-based classification and regression method, developed by late Leo Breiman and Adele Cutler (Breiman et al., 2003, http://oz.berkeley.edu/users/breiman/Using_random_forests_v4.0.pdf). The performance of Random Forests is comparable with SVMs and researchers have shown that Random Forests performed better than other machine learning methods in genomics and proteomics studies (e.g. Wu et al., 2003, Svetnik et al., 2003 and Qi et al., 2006). In addition, both the Random Forests classification and regression procedures can also identify important features, in this case genes, that are useful in either classifying between different groups or explaining the variation in the outcome of interest. These may turn out to provide informative biomarkers for screening or serve as drug targets. It also provides nice illustrations for identifying outlying observations in a pathway-based context. We outline below a new approach for doing pathway-based classification and regression which makes good use of both clinical data and biological information. As an illustration, we applied our approach to an investigative toxicology study in dogs designed to identify mechanisms of drug induced vascular injury, one dataset with a three-level outcome and four datasets with binary phenotypes. Pathways and genesets came from KEGG (Kanehisa et al., 2004), BioCarta (http://www.biocarta.com/) and Broad Institute (http://www.broad.mit.edu/gsea/).
| 2 METHODS AND DATA |
|---|
|
|
|---|
Figure 1 is a flow chart of how to integrate microarray data with pathway information using classification and regression.
Random Forests (Breiman, 2001) is an improved Classification and Regression Trees (CART) method (Breiman et al., 1984). It grows many classification trees or regression trees and thus the name Forests. Every tree is built using a deterministic algorithm and the trees are different owing to two factors. First, at each node, a best split is chosen from a random subset of the predictors rather than all of them. Second, every tree is built using a bootstrap sample of the observations. The out-of-bag (OOB) data, approximately one-third of the observations, are then used to estimate the prediction accuracy. Unlike other tree algorithms, no pruning, trimming of the fully grown tree, is involved. Each observation is assigned to a leaf, the terminal node of a tree, according to the order and values of the predictor variables. For a particular tree, the predictions for observations are given only for the OOB data. The overall prediction is then calculated by averaging over all the trees built.
2.1 Classification method
We propose to use Random Forests for doing classifications in pathway-based analysis. Small classification error based on genes in a given pathway would indicate the pathway as potentially interesting. Random Forests has several unique features. One is that it provides an unbiased estimate of the classification error as the forest is built, which we will use to rank important pathways. It also has a method to adjust the error rate when there is an unbalanced number of samples in the comparison groups. By providing an estimate of the importance of the genes in both classification and regression, we can find potentially informative and biologically relevant genes within a particular pathway. Moreover, output diagrams can suggest relation between variables and the classification or regression. Pathway-based Random Forests classification allows us to learn more from the biological system rather than looking at individual genes, examine how various pathways are interconnected and investigate the shared components among different pathways as we shall see in the latter part of the next section.
2.1.1 Important pathways
We identify important pathways through the OOB estimate of classification error. Each tree of the Random Forests procedure is constructed using a bootstrap sample of the data and about one-third of the original data are left out of the sample in the construction of the i-th tree (Breiman, 2001). These OOB data are then run down the tree to get an unbiased estimate of the classification error for the i-th tree. Thus there is no need for cross-validation. An overall OOB error rate is given when the specified numbers of trees are added to the forest. The OOB error rate obtained for each pathway is used to rank pathways, the smaller the OOB error rate the better a pathway is able to classify between the phenotypes of interest.
Apart from the OOB error rate, Random Forests also provides a proximity measure. To obtain this measure, the entire training set is run down an unpruned tree. If case 1 and case 2 both result in the same terminal node, then the proximity between case 1 and case 2 is increased by one. The normalized proximity measure is the total count divided by the number of trees. Proximities of between pairs of classes allow us to investigate whether individuals are behaving similarly in the sample or not. From these pairwise proximity measures, we can pick out outliers which are cases having small proximities to most other cases. Outlier measure for case nc of class c is defined as follows:
![]() | (1) |
It is the inverse of the sum of squares of proximity (n,k) for all k in the same class as n; therefore, if the value is relatively large, it is a possible outlier. The normalized measure is defined as |out(nc) – median(c)|/deviation(c), where median(c) = medianc[out(nc)] for all n in class c, deviation(c) =
class c|out(nc – median (c)|/[# in class c]. A value >10 is usually considered an outlier (Breiman et al., 2003).
2.1.2 Important genes
It is often of interest to know which of the variables are important in classification. There are two measures of importance in Random Forests, the mean decrease in accuracy and the Gini index. They give possible ways to quantify which genes are most informative, i.e. contribute most to the prediction accuracy, for giving the correct pathway-based classification. Finding this gene as an informative one is an indication of the strength or usefulness in distinguishing between the two phenotype groups of interest. The decrease in the Gini index is not as reliable as the marginal decrease in accuracy (Breiman et al., 2003), therefore, the latter will be used in our examples.
To obtain importance measure for gene G in a particular pathway, the Random Forests algorithm permutes the values of gene G in the OOB individuals for the i-th tree and puts these new gene expression values down the tree to get classifications. The randomly permuted values of gene G in the OOB individuals and the outcome of interest become independent of each other. When we have two phenotypes of interest, i.e. two classes, the margin is defined as (% of votes for true class) – (% of votes for the other class). The larger the margin the more confident the Random Forests prediction is. If the gene is a good predictor, the gene is likely to be close to the origin of the tree and a large proportion of the trees built will have the gene in it. A large proportion of the OOB data are misclassified as a result. This implies that we expect a decrease in the margin compared with the value before the random permutation. And vice versa, if the gene is not a good predictor, there will be little change to the margin. The mean decrease in accuracy MDA(G) for gene G is the lowering of the margin across all individuals divided by the number of cases when gene G is randomly permuted. The margin can also be written as follows:
![]() | (2) |
Now the MDA can be defined in terms of margin:
![]() | (3) |
True class when a correct vote is given to individual j for the i-th tree with gene G randomly permuted. Random Forests outputs variable importance plots for decrease in accuracy. In addition, partial plots provide a way to visualize the marginal effect of gene on class probability in Random Forests classification. Finally, overlapping of genes among the top ranked pathways can be explored using visualization software such as Cytoscape (Shannon et al., 2003).
2.2 Regression method
The Random Forests regression tree is built in a similar fashion as that in the classification method. The goal for the regression method is to find a tree that best predicts the continuous outcome of interest such as lesion scores in a toxicology study.
2.2.1 Important Pathways
Just like the case for classification, we can once again rank the pathways based on a Random Forests regression run. The percent variance explained (% Var Explained) is defined as 1 – (Mean square error)/(Variance (response)), where response of interest is our clinical outcome and mean square error (MSE) is the sum of squared residuals divided by the sample size. It indicates how well the set of gene expressions in a particular pathway is able to explain the variation in the response of interest. The percent variance explained is viewed as a pseudo r-square. A pathway which helps explain the variation in the response variable is an indication of the informativeness of the pathway relative to other pathways with smaller percent variance explained. Again, a diagram of MSE against the Random Forests object can be plotted. In the regression case, one cannot output the outlier plot under the current R implementation as the definition of the outlying measure for different classes no longer applies.
2.2.2 Important Genes
As in the classification case, two importance measures can be obtained for regression as well. They are the mean decrease in accuracy and mean decrease in MSE. The marginal effect of a gene on the response within a particular pathway can be illustrated in the partial plot in the regression case. The set of genes found may be important biomarkers for clinical phenotype prediction.
The library package randomForest v4.5-16 from the R program (http://www.r-project.org/) was used in our analysis (Liaw and Wiener, 2002).
2.3 Other machine learning methods
We compared Random Forests with other classification tools using the implementations in R. The classifiers chosen were (1) Linear Discriminant Analysis lda, (2) Neural Network, which consists of an input-trained hidden layer of non-linear functions, nnet with 3 units in the hidden layer, (3) Bagging, bootstrap aggregation of trees also developed by Breiman, bagging in the ipred package, (4) SVM, a method to fit maximum-margin hyperplanes, svm with C-classification type and the radial basis kernel, (5) k-Nearest Neighbor ipredknn by considering both 1 and 3 neighbors and (6) Naive Bayes, which calculates the conditional a posterior probability from predictor variables based on Bayes rule, naiveBayes.
We used both 5-fold cross validation and 632plus (Efron and Tibshirani, 1997) to assess the error rate estimates of these different machine learning methods. We presented both of these methods because with a sample size of 30 or more there are conflicting views as to which method is better (Braga-Neto and Dougherty, 2004; Fu et al., 2005; Molinaro et al., 2005).
2.4 Datasets
2.4.1 Pathways
A total of 441 pathways were used for the analysis. Some of these pathways are wired diagrams of genes and molecules from KEGG and BioCarta. Others are manually curated. The distribution is as follows:
- A total of 129 pathways were taken from KEGG, a pathway database with the majority responsible for metabolism, degradation and biosynthesis. There are also a few signal processing pathways and others related to human diseases. KEGG has a wide variety of organisms ranging from human, mouse and rat to bacteria like Escherichia coli with direct links to the genes or gene products involved in the biochemical reactions.
- We considered 312 BioCarta pathways, more than twice the number of pathways compared with KEGG. Most of these pathways are related to signal transduction for human and mouse with a smaller set of metabolic pathways.
- Two gene sets are manually curated Leukocyte adhesion and Eicosenoid metabolism.
2.4.2 Genesets
Several genesets available from the Broad Institute website (http://www.broad.mit.edu/gsea/) were used for analyzing the corresponding datasets in Subramanian et al. (2005).
2.4.3 Simulated dataset
In order to understand how well our approach is performing under the null and alternative hypotheses, we did some simulations. We used the simulator within the boost R package (Dettling, 2004) to simulate the pathway-based gene expression data. This simulator allows us to retain the pathway correlation structure from real data. For the null case and the alternative case, pathways from the canine dataset with large OOB error rates and small OOB error rates were chosen respectively. We used the complex interaction model (Dettling, 2004) to obtain the log-odds ratio to determine the class label for simulated subjects under the alternative. Whereas in the null case, class labels for the two classes were generated with equal probability. Each simulation generated 30 iid samples of gene expression with pathways of size 10, 20, 30 or 40.
2.4.4 Real data
A summary of the datasets used in this analysis is given in Table 1.
The Breast dataset has three tumor classes of 49 breast cancer patients (Farmer et al., 2005). Tumors are classified into luminal, basal and apocrine classes. The canine dataset was generated from investigative toxicology studies designed to identify the molecular pathogenesis of drug-induced vascular injury in coronary arteries of dogs treated with adenosine receptor agonist CI-947 (Enerson et al., 2006). Male beagle dogs were treated with CI-947 at low (2 mg/kg) and high (10 mg/kg) doses, and groups (n = 4) of control and treated dogs were euthanized at 6 and 16 h post dosing. Peracute to acute coronary arteriopathy occurred in all drug-treated groups, and the incidence of arterial lesions in each dog was dose and time dependent. Each animal was assigned a histopathology score ranging from 0 to 21 based on the number of lesions observed in 5 mm tissue sections. In addition, a recovery group consisting of three treated (10 mg/kg) and four control dogs was also included in the dataset. Custom Affymetrix GeneChips containing probe sets for 12 473 canine genes were used to profile gene expression of left extramural coronary arteries from all dogs. The canine genes were mapped to human orthologs for pathway analysis. The corresponding human orthologs for dogs were generated by matching the gene sequences using BLAST. The p53 dataset contains 50 NCI-60 cell lines with patients carrying mutations in the p53 gene. A comparison of Female and Male Lymphoblastoid cell lines was the subject of interest for the Gender dataset. Both Lung_a (Bhattacharjee et al., 2001) and Lung_b (Beer et al., 2002) datasets are lung cancer datasets and provide outcomes as good or poor ones.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Applications of RF classification
In this section, we applied Random Forests to assess its ability in giving biological insights in various microarray datasets.
3.1.1 Canine dataset
We used 441 gene sets whose sizes vary between 3 and 151 for Random Forests classification and sought to distinguish between dogs with lesion and those without. As outliers can highly affect the accuracy of classification, we should remove them before ranking the important pathways. Pathways were ranked using the OOB error rate. The best OOB error rate achieved using 100 000 trees was 9.68% using all the 31 dogs in the study which means the best we could do is with 3 dogs being misclassified (see TS1 in Supplementary Materials). Top pathways from the table were chosen for outlier detections. From the outlier plots of those top ranked pathways, we found that animals #10 and #19 were frequently the outliers with outlying measure >10. Animal #10 was an outlier in Circadian Rhythm, LDL Pathway and Msp-Ron Receptor Signaling pathway while animal #19 was an outlier in Leukocyte Adhesion, LDL pathway and Msp-Ron Receptor Signaling pathway. No animals were considered as outliers in the CTCF pathway. The detection of animals #10 and #19 as outliers seems biologically plausible as those two dogs had lesion score either higher/lower from the other three dogs sampled at the same time point and dosage group (see TS2 in Supplementary Materials) and thus they were classified in the other lesion group which is consistent with what we observed in the outlier plots (see FS1 in Supplementary Materials).
Using 50 000 trees and removing animals #10 and #19 gave much better classification results with only 1 animal being misclassified. After removing the outliers the classification error rates dropped from 12.9 to 3.4% for both Low Density Lipoprotein (LDL) pathway and Msp-Ron Receptor Signaling pathway and even more for Hypoxia and p53 in the Cardiovascular system pathway (see TS3 in Supplementary Materials). Our analysis indicates that these pathways were the most informative in classifying between the lesion and non-lesion groups (Table 2).
These analyses yield biologically relevant results. As shown in Table 2, the most significant pathways in the dataset are (1) LDL pathway during atherogenesis, (2) Msp-Ron Receptor Signaling Pathway and (3) Hypoxia and p53 in the Cardiovascular system. The former pathways are associated with inflammatory response and the latter one is related to hypoxic stress and induction of apoptosis (Appella and Anderson, 2001). These are clearly related to vascular injury: (1) contains merely four genes but they have high classification power. It is a collection of signaling low-density lipoprotein, colony stimulating factor, cytokines and macrophage in arterial blood and the neighboring liver and endothelial cells; (2) is a collection of macrophage-stimulating proteins in liver which act through the transmembrane receptor kinase to play a part in inflammation and tissue injury and (3) is a collection of genes induced by hypoxic stress causing p53 apoptotic activity in a different way compared with p53 induction by DNA damage (Iida et al., 2002).
In the set of pathways with OOB error 6.9%, the majority of them are related to apoptosis, immune response, cardiovascular function and sheer stress. They are the Granzyme A mediated Apoptosis pathway (apoptosis, immune surveillance), circadian rhythm (cardiovascular function) and nitric oxide pathway (NO plays an important role in shear stress) (Desail et al., 2003).
Apart from identifying which pathways are important in classification, we can also look at which animals are anomalies. For every pathway, we can find out which dogs are misclassified. For the pathway of interest, the researcher can make use of the proximity measure that can inform us which subjects/dogs are more similar to each other. This is particularly useful in seeing, for example, how a set of recovery dogs are similar to other dogs measured at different time points. An example of a proximity matrix of how similar dogs are to each other for the Hypoxia and p53 pathway is given in the DMS1 (Supplementary Materials).
Apart from the plots to see how the error rate changes with the number of trees, a couple of diagrams, outlier and multi-dimensional scaling (MDS) plots are particularly useful in pathway-based Random Forests output. Outlier plot, see Figure 2, allows us to visualize which animal has extreme values in gene expression measures compared with others. In this case, we see that none of the dogs are considered outliers if we use the cut-off value of 10. Pathway-based analysis allows us to check which dogs are outlying within a pathway with respect to others, and this may help describe other physical conditions/health of the animal. The MDS plot (Figure 3) projects a high-dimensional measure to a 2D surface giving a picture of similarities among dogs and their respective classes. In this plot, the lesion (triangles) and the non-lesion (circles) groups were well separated.
In the top ranked pathways by OOB error, we can look at the importance measure of each of the genes in them. These genes are good classifiers. For the top pathways with OOB error
6.9%, the figure in Supplementary Materials FS3 plots all the genes that have positive decrease in accuracy importance measure. Leukocyte adhesion, LDL pathway and MSP and Pertussis Toxin pathway, all share the component CCL2 (see FS3 of Supplementary Materials for the plot). This gene is ranked #187 in the two sample t-test which is very difficult to identify as there are many others with similar magnitude in terms of p-values. CCL2, chemokine (C–C motif) ligand 2 also known as MCP-1, is related to vascular injury and is shown to be one of the strongly upregulated chemokines in endothelial microvascular injury (Charo et al., 2004; Panzer et al., 2006; Rothenbacher et al., 2005). In figure FS3 shown in Supplementary Materials, the darker red the genes are the more informative these genes are for a particular pathway. A gene like CSNK1A1 is shared between Circadian rhythm, Hypoxia and p53 pathway and WNT Signaling pathway (Zhao et al., 2004). PRKCA2 and PRKCA were also found in three pathways, WNT Signaling, NO Signaling and Pertussis Toxin pathways. This is supported by evidence that PRKCA, protein kinase Ca, forms a complex with active RhoA which is involved in a signaling pathway in the endothelial cell and contributes to vascular inflammation (Bolick et al., 2005). Among all the top pathways with OOB error
6.9%, there are three pathways that do not have any overlaps with the others: Aminosugars metabolism, Aminoacyl-tRNA biosynthesis and Granzyme A mediated Apoptosis pathway. As seen from the figures in FS4 of Supplementary Materials, there are 11 important genes that arecolored red and are connected to at least two pathways, while 24 important genes are unique to one particular pathway.
3.1.2 Breast cancer dataset
We next turn to a study of breast cancer with three tumor subtypes. We used Random Forests classification to see which pathway is good at classifying the patients into the correct subgroups. The top pathway CARM1 and Regulation of Estrogen Receptor makes biological sense, as it has been shown that estrogen receptor plays an important role in normal breast development and is expressed in common cancer subtypes (Shao and Brown, 2004). In addition, the Regulation of BAD phosphorylation containing a widely expressed BCL-2 family member has been studied for epidermal growth factor receptor (EGFR) targeted therapy for breast cancer (Motoyama and Hynes, 2003). A recent paper in Cancer Research (Mehra et al., 2005) has identified GATA3 as a potential prognostic marker for breast cancer. Researchers have demonstrated that tumors have abnormal bioenergetics. Subjects with cancer can show a systematic loss of energy involving the interaction of tumour glycolysis and gluconeogenesis (Permual et al., 2005). This may be the reason why glycolysis and gluconeogenesis pathway also has a low error rate (see DS1 in Supplementary Materials).
3.1.3 Gender dataset—lymphoblastoid cells
Random Forests classification found three biologically informative gene sets: testis expression genes, female reproductive tissue genes, two genesets of X inactivation genes (Carrel and Willard, 2005; Disteche et al., 2002). These are also discovered in GSEA. Since we are comparing between female and male groups, the discovery of X inactivation as one of the top on the lists is expected. As noted by Subramanian et al. (2005), there is a high proportion of mRNA expression in the reproductive tissues (testis and uterus) as in the lymphoblastoid cells (see DS1 in Supplementary Materials).
Applications of RF classification to three other datasets can be found in DS1 in Supplementary Materials.
3.2 Application of RF regression
3.2.1 Canine dataset
In the regression setting, we used pathway-based regression forests to predict the lesion score. In this analysis we excluded the outliers, dogs #10 and #19, that were found during our Random Forests classification run. The same 441 gene sets were used. As in the classification case, the number of trees could affect the MSE which is used to calculate the percent variance explained. We saw once again that the MSE became stable after the number of trees reached 50 000 for most pathways. Some examples of MSE plot versus the number of trees are shown in figures FS5 in Suppl. Materials. For example, for pathway Hypoxia and p53 in the Cardiovascular system, the MSE plot against the number of trees no longer fluctuates. A similar pattern can be seen for pathways, such as the Pertussis toxin-insensitive CCR5 Signaling pathway, which require more trees to get a stable MSE. Therefore, we decided to use 50 000 trees again for the regression case.
The top pathways ranked by percent variance explained are shown in Table 3. Out of 441, 144 were found to have a positive percent variance explained. It was not surprising to see that the majority of the top 15 pathways found by classification were also highly ranked in the regression case. Our analysis shows the following pathways had percent variance explained of >20%: (1) One carbon pool by folate, (2) Antisense pathway, (3) Aminoacyl-tRNA biosynthesis, (4) Hypoxia and p53 in the Cardiovascular system, (5) Adhesion molecules on Lymphocyte and (6) Neutrophil and its Surface Molecules. All were useful in explaining the variation in the lesion scores among the animals.
One carbon pool by folate, the Antisense pathway and Aminoacyl-tRNA biosynthesis may not seem directly related to inflammatory or response to vascular pathology, but as we shall see in the next section, some of the genes within them explained why it was picked up and biologically plausible. As described in the previous section, Hypoxia and p53 in the Cardiovascular system is a pathway related to hypoxic stress and apoptosis. As for (5) Adhesion molecules on Lymphocyte and (6) Neutrophil and its Surface Molecules, these two pathways differ only by one gene and consist of molecules interacting with endothelial cells that are responsible for sending inflammatory signals and triggering immune response.
To investigate which dogs were more similar to each other in a given pathway, a proximity matrix can be output from a Random Forests run. Although an outlier plot cannot be plotted, an MDS plot can be used to see which of the animals are closer to each other in a particular pathway. As for this pathway, we can see three distinct groups when the proximity matrix is projected onto a 2D plane (see FS6 in Supplementary Materials).
When we applied the importance measures to identify important genes, the majority of the genes found have references that support their relation with vascular injury. The most informative gene GARS in Aminoacyl–tRNA pathway has been shown to be a target of autoantibodies in the human autoimmune diseases by Entrez Gene (Maglott et al., 2005). For the Antisense pathway, it contains three genes, (ADAR, SFPQ and MATR3) that are particularly important in explaining the variance among the lesion scores. ADAR, Adenosine deminase, is a maker for T cell activation and is related to the production of neutrophils which has close interaction with endothelial cells (Erkilic et al., 2003). MTHFD2, methylenetetrahydrofolate dehydrogenase, within the one carbon pool by folate pathway, has been shown to be up-regulated in vascular endothelial cells when treated with a chemical (Sato et al., 1998).
Genes that are found to be informative in the same pathway in both classification and regression have very similar genes. Unlike the case for classification, besides the three pathways: (1) Monocyte and its Surface Molecules, (2) Adhesion molecules on Lymphocyte and (3) Neutrophil and its Surface Molecules which actually differs in not more than a couple of genes, the top 15 pathways do not have many top important genes in common. Two other examples are PRKCA2 which is shared between the WNT Signaling pathway and the Pertussis Toxin-insensitive CCR5 Signaling pathway and the CCL2 for MSP-RON and Pertussis Toxin-insensitive CCR5 Signaling pathway (see FS7 in Supplementary Materials).
3.3 Comparison with other classification methods
To compare the classification methods, we looked at three datasets and two methods to estimate error rates. From Tables 4 and 5, we see that Random Forests performs comparably to SVM for the canine dataset and actually does better than SVM in the other two datasets. Random Forests only falls to k-Nearest Neighbor classifiers for the gender dataset and Bagging for the p53 dataset. Surprisingly, naive approaches, such as Linear Discriminant Analysis and Naive Bayes classifiers did pretty well compared with other methods. Random Forests has its pluses over the other classification methods, as it can handle more than two classes as demonstrated in the applications section in addition to its ability to perform regression.
3.4 Simulation studies
To assess the type I error rate, we simulated 100 datasets from the null as described in the Methods and Data section. For every simulated data, we first calculated the observed error rate. We then permuted the label of the iid samples 1000 times, calculated the number of permuted OOB error rates for Random Forests less than the observed and divided that by the number of permutations to obtain a permutation p-value. For both type I error and power, we calculated the number of p-values
0.05 cutoff and divided by the number of simulated datasets.
From Table 6, we can see that the observed Random Forests Type I error was not statistically significantly different from the 0.05 nominal level across different pathway size. SVM had smaller type I error rate and managed to keep below 4% for the four different pathway sizes. On the other hand, in terms of power (Table 7), Random Forests is doing better than SVM across the board, it achieved 80% power with pathway size 20. A grid search for the optimal cost parameter for SVM was performed using (cost = 10–3, 10–2, 10–1, 1, 10, 100, 1000 and 10 000) to maximize the power.
To assess the stability of the OOB error rate, we ran 5-fold cross-validation 50 times for both Random Forests and SVM using the 441 pathways from the canine dataset. The average standard deviations of the error rates were very similar, 5.58 and 5.37% for Random Forests and SVM respectively. The standard deviations of the error rates ranged from 0.83 to 9.87% for Random Forests and lied between 1.63 and 8.56% for SVM. For the pathways with <10% classification error rate for both methods (Table 8), we see that Random Forests achieved smaller standard deviations error rates, three out of four times compared with SVM. These figures strengthen our confidence in the OOB error estimates obtained from Random Forests.
3.5 Comparison with other pathway-based methods
We did a comparison between Random Forests and other pathway-based methods using the canine dataset. To evaluate the benefit of pathway-based analysis, we considered one standard approach commonly used in practice. We first used a simple t-test and found 5000 significant probesets at the 0.01 cutoff. Mapping them back to pathways and then ranking the pathways by the number of genes mapped or the proportions of genes mapped, we were not able to find as many interesting pathways compared to using Random Forests classification. Similar results were noted for the 0.05 cutoff (see TS5 in Supplementary Materials).
Currently, there are two popular pathway-based tests in use. These are Fisher's exact test which uses t-test to find significantly changed genes and then pathways enriched for these genes and Gene Set Enrichment Analysis (GSEA) (Subramanian et al., 2005). Both of these tests only consider binary phenotypes. After correcting for multiple comparisons using FDR, there were 18 pathways that were found to be significant at 0.05 level according to the pathway-based Fisher's exact test (see TS6 in Supplementary Materials). These pathways include several which are not likely to be related to vascular injury, such as the Oxidative phosphorylation, Deregulation of CDK5 in Alzheimers Disease, Glycolysis/Gluconeogenesis and Proteasome. In addition, none of these pathways have keywords Apoptosis and Hypoxia in them. On the other hand, some of the significant pathways such as the Inhibition of Matrix Metalloproteinases and the Metabolism of Anandamide, an Endogenous Cannabinoid are more related with angiogenesis, inflammation and immune response. The overlapping pathways with Random Forests classification are Circadian rhythm and Role of Ran in mitotic spindle regulation. Many of the top ranked pathways from Random Forest classification are found with very large p-values, in fact, half of the 14 pathways have a p-value of 1. This may be owing to the fact that the proportions of genes are significant by univariate test are relatively small compared with the number of non-significant genes within those pathways. Fisher's exact test tends to miss out on pathways with genes that are strong classifiers. It lacks the ability to identify informative genes controlling for other genes for a particular pathway as the significant genes were found by univariate t-tests. In contrast, our Random Forests approach provides convenient ways to do so using the decrease in accuracy or decrease in MSE in the classification and regression case respectively.
For GSEA, we chose to input gene sets of size 500 giving a total of 397 genesets which met the criterion. Again, we were trying to distinguish between the lesion versus no lesion groups. Three different significant measures were given in the program output. The FDR q-values, recommended by the author of the GSEA paper, were all >0.74 and the most naive FWER p-values were at least 0.40. Therefore, we can only use the nominal p-value levels for identifying significant pathways (see TS7 in Supplementary Materials). If we chose the 0.05 cutoff, only nine significant pathways were found. Among those top ranked ones, pathways like the Protein Export, Proteasome, Ganglioside biosynthesis, Regulation of Spermatogenesis by CREM, Spliceosomal Assembly and Ghrelin Regulation of Food Intake and Energy Homeostasis have no apparent relation with vascular injuries. Other pathways such as Eukaryotic protein translation and Regulation of eIF4e and p70 S6 Kinase contain eIF genes that are known to control a few biological processes including Endoplasmic Reticulum Stress, Hypoxia and Growth Factors that are related to vascular injuries. If we relax the cutoff to 0.10, then we are able to detect two pathways, Granzyme A mediated Apoptosis pathway and Hypoxia and p53 in the Cardiovascular system, which were both found in Random Forests with OOB of <6.9%. It appears to pick up pathways that contain genes more relevant to the study compared with Fisher's exact test, although it was not able to pick up Low-density lipoprotein (LDL) pathway owing to the small number of genes it has. The genesets, Leukocyte Adhesion, known to be related to vascular injury has a p-value of 0.143. GSEA is highly dependent on the background expression levels of genes; on the other hand, Random Forests, which solely looks at the effect of the genes in the pathway, may perform better. Both GSEA and Random Forests will take into account the dependencies among genes in some ways, but they tend to capture different signals.
In summary, we found that two widely used approaches are very reliant on the number of genes in the pathway. GSEA by default excludes genesets of size <25. Although both Fisher's exact test and GSEA were able to find a few pathways related to vascular injury, Random Forests was able to detect more pathways that are related to phenotype of interest. It allows a more comprehensive analysis of pathway-based approach too. Not only can Random Forests facilitate the detection of outliers but it also allows the visualization of how similar the subjects of study are through the proximity matrix and MDS plot.
| 4 DISCUSSION |
|---|
|
|
|---|
In this article, we have described a pathway-based approach in analyzing microarray data using Random Forests. This approach is helpful in identifying pathways that are useful in classifying a categorical outcome or predicting a continuous measure and isolating important genes. It has a number of distinct features to help both biologists and bioinformaticians understand biological systems better. Finding important pathways allows researchers to focus on a subset of genes that explain the response of interest. We were able to rank pathways and genes that are related to various diseases and pathologies of interest for six microarray datasets.
As we know, genes are not independent of each other but rather work in pathways. We have demonstrated that the proposed method based on Random Forests may be more useful in identifying genes that are good classifiers and predictors than single gene-based methods. Pathway analysis using Random Forests allows the researcher to make full use of the biological information available by combining microarray data and pathway information from externally available databases such as KEGG and BioCarta. In comparison with other machine learning methods, Random Forests remains within the top two in terms of misclassification rates. Our simulation studies have demonstrated that Random Forests is comparable or even better than SVM in some cases. In addition, Random Forests provides us with a proximity matrix and allows the detection of pathway-based outliers. These can tell us which of the subjects of interest is behaving differently from others in a particular pathway which is often useful in medical studies. The proximity matrix can also be visualized on a multi-dimensional scaling plot. This is definitely a benefit over examining genes on an individual basis. Moreover, we can make use of the importance measure to identify genes that are more informative in doing the pathway-based classification or prediction. This can help pick up important genes of interest and they may turn out to be novel biomarkers. Comparing the overlapping of important genes in different pathways will also allow us to investigate whether most of them lie in intersection of those pathways.
Although it is difficult to conclude which of the pathway methods is superior, we have shown the weaknesses of other pathway tests in the context of the canine data analyzed here. Both Fisher's exact test and GSEA are sensitive to the number of genes in the pathway, whereas our approach is less susceptible to this problem. Random Forests was able to identify pathways that are biologically meaningful. Random Forests also allows classification on two or more subtypes. In reality many clinical outcomes are measured in continuous measure, such as glucose level or white blood cells count. The Random Forests regression approach described is one of the first which can make use of a continuous measure for ranking pathways. It would certainly motivate and draw the interest of other researchers to develop better methods.
|
|
|
|
|
|
|
|
|
|
|
| Acknowledgments |
|---|
This work was supported through a grant from Pfizer. We thank two anonymous reviewers for their valuable comments and suggestions.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on April 8, 2006; revised on June 5, 2006; accepted on June 20, 2006
| REFERENCES |
|---|
|
|
|---|
Appella, E. and Anderson, C.W. (2001) Post-translational modifications and activation of p53 by genotoxic stresses. Eur. J. Biochem, . 268, 2764–2772[Web of Science][Medline].
Beer, D.G., et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med, . 8, 816–824[Web of Science][Medline].
Bhattacharjee, A., et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Aacd. Sci. USA, 98, 13790–13795
Bolick, D.T., et al. (2005) 12/15-lipoxygenase regulates intercellular adhesion molecule-1 expression and monocyte adhesion to endothelium through activation of RhoA and nuclear factor-
B. Arterioscler. Thromb. Vasc. Biol, . 25, 2301
Braga-Neto, U.M. and Dougherty, E.R. (2004) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374–380
Breiman, L., et al. Classification and Regression Trees, (1984) , Belmont, California Wadsworth International Group.
Breiman, L., et al. (2001) Random forests. Mach. Learn, . 45, 5–32[CrossRef].
Breiman, L., et al. (2003) Manual on setting up, using, and understanding Random Forests V4.0.
Carrel, L. and Willard, H.F. (2005) X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature, 434, 400–404[CrossRef][Medline].
Charo, I.F., et al. (2004) Chemokines in the pathogenesis of vascular disease. Circ. Res, . 95, 858
Curtis, R.K., et al. (2005) Pathways to the analysis of microarray data. Trends Biotechnol, . 23, 429–435[CrossRef][Web of Science][Medline].
Desail, A., et al. (2003) Nitric oxide modulates MCP-1 expression in endothelial cells: implications for the pathogenesis of pulmonary granulomatous vasculitis. Inflammation, 27, 213–223[CrossRef][Web of Science][Medline].
Dettling, M. (2004) BagBoosting for tumor classification with gene expression data. Bioinformatics, 20, 3583–3593
Disteche, C.M., et al. (2002) Escape from X inactivation. Cytogenet. Genome Res, . 99, 36–43[CrossRef][Web of Science][Medline].
Efron, B. and Tibshirani, R. (1997) Improvements on cross-validation: The .632+ Bootstrap estimator. J. Am. Stat. Assoc, . 92, 548–560[CrossRef][Web of Science].
Enerson, B.E., et al. (2006) Acute drug-induced vascular injury in beagle dogs: pathology and correlating genomic expression. Toxicol. Pathol, . 34, 27–32
Erkilic, K., et al. (2003) Adenosine deaminase enzyme activity is increased and negatively correlates with catalase, superoxide dismutase and glutathione peroxidase in patients with Behcet's disease: original contributions/clinical and laboratory investigations. Mediators Inflamm, . 12, 107–116[CrossRef][Web of Science][Medline].
Farmer, P., et al. (2005) Identification of molecular apocrine breast tumours by microarray analysis. Oncogene, 24, 4660–4671[CrossRef][Web of Science][Medline].
Fu, W., et al. (2005) Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics, 21, 1979–1986
Furey, T.S., et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914
Harris, M.A., et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, . 32, D258–D261
Hosack, D.A., et al. (2003) Identifying biological themes within lists of genes with EASE. Genome Biol, . 4, (10) R70[CrossRef][Medline].
Iida, T., et al. (2002) HIF-1-induced apoptosis of endothelial cells. Genes Cells, 7, 143–149[Abstract].
Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res, . 32, D277–D280
Liaw, A. and Wiener, M. (2002) Classification and regression by randomForest. R News, 2, 18–22.
Maglott, D., et al. (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res, . 33, D54–D58
Mehra, R., et al. (2005) Identification of GATA3 as a breast cancer prognostic marker by global gene expression meta-analysis. Cancer Res, . 65, 11259–11264
Molinaro, A.M., et al. (2005) Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21, 3301–3307
Mootha, V.K., et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet, . 34, 267–273[CrossRef][Web of Science][Medline].
Motoyama, A.B. and Hynes, N.E. (2003) BAD: a good therapeutic target? Breast Cancer Res, . 5, (1) 27–30[CrossRef][Web of Science][Medline].
Panzer, U., et al. (2006) Compartment-Specific Expression and Function of the Chemokine IP-10/CXCL10 in a model of renal endothelial microvascular injury. J. Am. Soc. Nephrol, . 17, 454–464
Perumal, S.S., et al. (2005) Therapeutic effect of tamoxifen and energy-modulating vitamins on carbohydrate-metabolizing enzymes in breast cancer. Cancer Chemother. Pharmacol, . 56, 105–114[CrossRef][Web of Science][Medline].
Qi, Y., et al. (2006) Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins, 63, 490–500[CrossRef][Web of Science][Medline].
Rajagopalan, D.A. and Agarwal, P. (2005) Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics, 21, 788–793
Rothenbacher, D., et al. (2005) Differential expression of chemokines, risk of stable coronary heart disease, and correlation with established cardiovascular risk markers. Arterioscler. Thromb. Vascular Biol, . 26, 26:194.
Sato, N., et al. (1998) Changes of gene expression by lysophosphatidylcholine in vascular endothelial cells: 12 up-regulated distinct genes including 5 cell growth-related, 3 thrombosis-related, and 4 others. J. Biochem, 123, 1119–1126
Shannon, P., et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, . 13, 2498–2504
Shao, W. and Brown, M. (2004) Advances in estrogen receptor biology: Prospects for improvements in targeted breast cancer therapy. Breast Cancer Res, . 6, 39–52[CrossRef][Web of Science][Medline].
Sotiriou, C., et al. (2003) Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc. Natl Aacd. Sci. USA, 100, 10393–10398
Subramanian, A., et al. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102, 15545–15550
Svetnik, V., et al. (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci, . 43, 1947–1958[CrossRef][Web of Science][Medline].
Wu, B., et al. (2003) Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636–1643
Wright, G., et al. (2003) A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc. Natl Acad. Sci. USA, 100, 9991–9996
Zhao, Y., et al. (2004) Casein kinase 1alpha interacts with retinoid X receptor and interferes with agonist-induced apoptosis. J. Biol. Chem, . 279, 30844–30849
This article has been cited by other articles:
![]() |
J. S. Chang, R.-F. Yeh, J. K. Wiencke, J. L. Wiemels, I. Smirnov, A. R. Pico, T. Tihan, J. Patoka, R. Miike, J. D. Sison, et al. Pathway Analysis of Single-Nucleotide Polymorphisms Potentially Associated with Glioblastoma Multiforme Susceptibility Using Random Forests Cancer Epidemiol. Biomarkers Prev., June 1, 2008; 17(6): 1368 - 1373. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Tai and W. Pan Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data Bioinformatics, December 1, 2007; 23(23): 3170 - 3177. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Tai and W. Pan Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms Bioinformatics, July 15, 2007; 23(14): 1775 - 1782. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Goffard and G. Weiller PathExpress: a web-based tool to identify relevant pathways in gene expression data Nucleic Acids Res., July 13, 2007; 35(suppl_2): W176 - W181. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






16.0% without outliers

