Skip Navigation


Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1530-1537; doi:10.1093/bioinformatics/bti192
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1530    most recent
bti192v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Y.
Right arrow Articles by Pearlman, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Y.
Right arrow Articles by Pearlman, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data

Yuhang Wang 1,*, Fillia S. Makedon 1, James C. Ford 2 and Justin Pearlman 2,3

1Department of Computer Science, Dartmouth College 6211 Sudikoff Laboratory, Hanover, NH 03755-3510, USA
2Department of Medicine, Dartmouth-Hitchcock Medical Center, Dartmouth Medical School 1 Rope Ferry Road, Hanover, NH 03755-1404, USA
3Department of Radiology, Dartmouth-Hitchcock Medical Center One Medical Center Drive, Lebanon, NH 03756, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 

Motivation: Recent studies have shown that microarray gene expression data are useful for phenotype classification of many diseases. A major problem in this classification is that the number of features (genes) greatly exceeds the number of instances (tissue samples). It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. Many approaches have been proposed for this gene selection problem. Most of the previous gene ranking methods typically select 50–200 top-ranked genes and these genes are often highly correlated. Our goal is to select a small set of non-redundant marker genes that are most relevant for the classification task.

Results: To achieve this goal, we developed a novel hybrid approach that combines gene ranking and clustering analysis. In this approach, we first applied feature filtering algorithms to select a set of top-ranked genes, and then applied hierarchical clustering on these genes to generate a dendrogram. Finally, the dendrogram was analyzed by a sweep-line algorithm and marker genes are selected by collapsing dense clusters. Empirical study using three public datasets shows that our approach is capable of selecting relatively few marker genes while offering the same or better leave-one-out cross-validation accuracy compared with approaches that use top-ranked genes directly for classification.

Availability: The HykGene software is freely available at http://www.cs.dartmouth.edu/~wyh/software.htm

Contact: wyh{at}cs.dartmouth.edu

Supplementary information: Supplementary material is available from http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
The DNA microarray is a method to measure the expression level of thousands of genes simultaneously in a cell mixture. It enables clinicians to obtain the gene expression profile of tissue samples rapidly.

Phenotype is the outward, physical manifestation of an organism. The task of phenotype classification using gene expression data is to classify tissue samples into different classes of phenotypes, such as cancer versus normal, using the measured expression levels of thousands of genes in the samples as features. Recent studies have shown that gene expression data are useful for differentiating between cancerous and normal tissues (Alon et al., 1999), between multiple sclerosis patients and normal controls (Bomprezzi et al., 2003), and even among different subtypes of the same cancer (Golub et al. 1999; Armstrong et al., 2002; Alizadeh et al., 2000; Gordon et al., 2002).

Given N tissue samples and expression levels of M genes, we can store the data in a N x (M + 1) matrix as shown in Figure 1.



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 1 The data matrix. Each of the first M columns represents a gene and each row represents a tissue sample. gi,j is a numeric value presenting the gene expression level of gene i in sample j. lj in the last column is the class label for sample j.

 
Classification using gene expression data poses a major challenge because of the following characteristics:
  • M >> N. For typical datasets, M is in the range of 2000–30 000, while N is in the range of 40–200.
  • Most features (genes) are not related to the given phenotype classification problem.

To overcome this curse-of-dimensionality problem, we can use feature selection techniques to select a small subset of genes as features for classification. Gene selection has several advantages:

  1. It can improve classification accuracy. Many studies (Lu and Han, 2003; Li et al., 2004; Liu et al., 2002) have shown that gene selection prior to classification improves the classification accuracy.
  2. It can reduce the cost in a clinical setting. At the moment, expression arrays with thousands of genes are impractical to be employed in medical diagnostic laboratories (Roth, 2001). We need to drastically reduce the number of genes to measure in order to make it affordable.
  3. It could enable biologists to gain significant insight into the genetic nature of the disease and the mechanisms responsible for it (Guyon et al., 2002). This would assist in drug discovery and early diagnosis.

In this paper, we focus on this gene selection problem and propose a novel hybrid approach for finding a very small subset of genes, called marker genes, for phenotype classification. We will review related work in the next section.

1.1 Related work
1.1.1 Gene selection
Feature selection is an important aspect of data mining. When we consider genes as features, we face the problem of gene selection. The most commonly used gene selection approaches are based on gene ranking. In these gene ranking approaches, each gene is evaluated individually and assigned a score reflecting its correlation with the class according to certain criteria. Genes are then ranked by their scores and the top-ranked ones are selected.

These gene ranking methods have been based on t-statistic (Golub et al., 1999), information gain Su et al., 2003; Liu et al., 2002; Li et al., 2004), {chi}2-statistic (Liu et al., 2002), the threshold number of misclassification (TNoM) score (Ben-Dor et al., 2000) and concatenation of several feature filtering algorithms (Xing et al., 2001).

Typically, 50–100 top-ranked genes are selected according to their scores assigned by a gene ranking method. The problem here is that these selected genes are often highly correlated (Hanczar et al., 2003; Jaeger et al., 2003). This correlation may be because these genes belong to the same signaling pathways related to the disease. If a gene is high-ranked, other genes that are highly correlated with it are also likely to be ranked high by the gene ranking method. Besides, being an additional computational burden, this redundancy can also skew the results and lead to misclassifications.

Gene subset selection approaches Xiong et al., 2001; Guyon et al., 2002) have also been proposed. These methods search for an optimal subset of genes for classification through the space of gene subsets. Clearly, this problem has a huge search space. Although many heuristic search strategies (Jain et al., 2000) can be employed, these gene subset ranking methods are still too computationally expensive.

For comparative studies of different gene selection and phenotype classification methods, see Li et al. (2004) and Liu et al. (2002).

1.1.2 Gene clustering
In the classification problem discussed above, genes are treated as features. On the other hand, in the gene clustering problem, samples are treated as features. The goal of gene clustering is to cluster genes with correlated expression patterns across different samples. If a number of genes are expressed in correlation, they may be functionally related. This would provide hints about the structures of the gene network involved. The most prevalent approaches for gene clustering include hierarchical clustering (Eisen et al., 1998), k-means (Herwig et al., 1999), self organizing maps (SOM) (Tamayo et al., 1999) and subspace clustering (Wang et al., 2002a).

Other approaches to mining gene expression data are also possible. For example, Tang and Zhang (2003) tried to discover hidden phenotype structures underlying gene expression data.

1.1.3 Pre-filtering approaches
Most recently, two pre-filtering approaches have been proposed for dealing with the gene redundancy problem.

Jaeger et al. (2003) proposed a pre-filtering approach where fuzzy C-means clustering was applied before gene ranking. The idea is that similar genes found by clustering can be discarded. Gene ranking methods are then applied to the remaining genes to select out the top-ranked genes.

A similar approach was also proposed by Hanczar et al. (2003). In this case, k-means clustering was used as a pre-filter step to select ‘prototype genes’, essentially the means of clusters.

However, a problem with these two approaches is that the number of clusters must be given and they did not address how to find the correct number for clusters. It is prohibitively expensive to try all possible numbers for clusters in order to find a setting that provides the best classification performance.

1.2 Our proposed hybrid approach
In summary, the problem with existing gene ranking approaches for selecting informative genes is that many of these genes are highly correlated. For the reasons discussed above, it would be ideal to have a very small set of non-redundant, but still highly discriminative genes.

In this paper, we propose a novel hybrid approach for selecting such marker genes from microarray data. Our approach sequentially combines gene ranking and clustering analysis. In this approach, we first applied feature filtering algorithms to select a set of top-ranked informative genes, and then applied hierarchical clustering on these genes to generate a dendrogram. Finally, a sweep-line algorithm was used to analyze the dendrogram. After these steps, marker genes are selected by collapsing dense clusters. When applied to two public datasets for cancer classification, our approach was capable of selecting very small sets of marker genes without sacrificing classification accuracy.

Our approach is different from the previous pre-filtering approaches in that:

  • We apply gene ranking methods first.
  • We determine the best number of clusters systematically. We do this by leave-one-out cross-validation (LOOCV) on the training data, trying all of the different options for extracting clusters from the dendrogram. This is affordable in terms of computation time because the genes have been pre-selected by a gene ranking method.

1.3 Outline of the paper
The remainder of the paper is organized as follows. Section 2 gives an overview of the proposed system. Section 3 discusses the gene ranking step and the feature filtering algorithms employed. Section 4 explains the hierarchical clustering step. Section 5 presents the details about how to extract gene clusters from the dendrogram output, and how to find marker genes by collapsing the clusters into cluster representatives. Section 6 introduces the classifiers used in our experiments. Section 7 presents the experimental results of our proposed system on two public datasets. Section 8 concludes the paper.


    2 SYSTEM OVERVIEW
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
An overview of HyKGene, the Hybrid system for marKer Gene selection, is provided in Figure 2. In the first step, we apply feature filtering algorithms on the training data to select a set of top-ranked informative genes. In the second step, hierarchical clustering is applied on the set of top-ranked genes to generate a dendrogram. In the third step, we use a sweep-line algorithm to extract clusters from the dendrogram and then select the marker genes by collapsing clusters. Each marker gene corresponds to a cluster. The best number of marker genes is determined by LOOCV on the training data. Finally, classification of test data is done using only the marker genes as features.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 2 HyKGene: a hybrid system for marker gene selection.

 

    3 STEP 1: GENE RANKING
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
In this step, we apply feature filtering algorithms to rank genes. Feature filters are techniques to compute statistics over the features that indicate which features are better than others. Features (genes) are ranked according to such statistics. In this paper, we use the following three popular feature filtering methods: Relief-F, Information Gain, and the {chi}2-statistic.

3.1 Relief-F
One of the most widely used feature filters is the Relief-F algorithm (Kononenko, 1994). The basic idea of Relief-F is to draw instances at random, compute their nearest neighbors and adjust a feature weighting vector to give more weight to features that discriminate the instance from neighbors of different classes. Specifically, it tries to find a good estimate of the following probability to assign as the weight for each feature f.

This approach has shown good performance in various domains (Robnik-Sikonja and Kononenko, 2003).

3.2 Information Gain
Information Gain (Mitchell, 1997) measures the number of bits of information obtained for class prediction by knowing the value of a feature. Let denote the set of classes. Let V be the set of possible values for feature f. The information gain of a feature f is defined to be:

This method has been used for gene selection by Liu et al. (2002) and Li et al. (2004). Note that Information Gain requires that numeric features be discretized. We use the entropy-based discretization method (Fayyad and Irani, 1993) implemented in Weka (Witten and Frank, 1999). Li et al. (2003) have shown that mean-entropy discretized features are effective for classification using gene expression data.

3.3 {chi}2-statistic
The {chi}2-statistic (Liu and Setiono, 1995) measures the lack of independence between f and c. It is defined as follows:

where V is the set of possible values for feature f, Ai(f = v) is the number of instances in class ci with f = v, Ei(f=v) is the expected value of Ai(f = v). Ei(f = v) is computed with

where N is the total number of instances.

This method has been used for gene selection by Liu et al. (2002). It also requires numeric features to be discretized. We used the same discretization method as above.


    4 STEP 2: HIERARCHICAL CLUSTERING
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
After applying feature filtering algorithms on the whole set of genes using training data, we pre-select the top-ranked genes from the whole set. Typically, 50–100 genes are pre-selected. Then, we apply hierarchical clustering (HC) on this set of pre-selected genes.

HC is a statistical method for finding relatively homogeneous clusters of instances based on their features. In this case, we consider the genes as instances and the samples as features. Given a set of Ms pre-selected genes to be clustered, the agglomerative method for HC works as follows:

  1. Assign each gene to a cluster.
  2. Find the closest pair of clusters and merge them into a single cluster.
  3. Compute the distances between the clusters.
  4. Repeat Steps (ii) and (iii) until all of the Ms genes are clustered into a single cluster.

Step (iii) can be done in different ways. In this paper, the distance between two clusters is computed using the average linkage approach, which uses the average distance between all pairs of genes in the two clusters.

Let Gi = (gi,1,gi,2,...,gi,N), where gi,1,gi,2,...,gi,N are the gene expression values for gene i of sample 1 through N. The distance d(p,q) between the genes p and q is defined as follows:

where r(Gp,Gq) is the Pearson's correlation coefficient between Gp and Gq.

The result of this hierarchical clustering process is a dendrogram, or hierarchical tree where heights represent degrees of dissimilarity. Each merge step in the clustering process becomes a join (U-shaped line) in the tree. Figure 3a shows an example of dendrogram.



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 3 (a) Dendrogram example. The vertical scale corresponds to the distance between clusters. (b) The dendrogram cut into three clusters by the sweep-line.

 
Clusters are not explicit in the output of the HC algorithm and have to be determined from the dendrogram representation. This is done in the next section.


    5 STEP 3: REDUCE GENE REDUNDANCY BY COLLAPSING CLUSTERS
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
Some genes in the pre-selected set of genes are highly correlated. To reduce the gene redundancy and extract marker genes, we extracted clusters by analyzing the dendrogram output of the HC algorithm, and then collapsed each cluster into one representative gene from each cluster. The extracted representative genes were used as marker genes in the classification step. The most important parameter here is the number of clusters to be extracted. In this paper, we determine the best number of clusters by LOOCV on the training data, trying all of the O(Ms) different ways to extract clusters from the dendrogram. This is in contrast to the k-means clustering where k must be pre-specified.

We use the following sweep-line algorithm to find marker genes.

  1. while Sweeping a horizontal line l from top to bottom in the dendrogram do
  2. if l hits a merge point then
  3. Cut through the dendrogram with the line.
  4. Extract resulting connected components as clusters.
  5. Find the representative gene of each cluster.
  6. Using the representative genes from all clusters as features, compute the LOOCV classification accuracy on the training data.
  7. end if
  8. end while
  9. Output as marker genes the smallest set of representative genes corresponding to the best LOOCV accuracy.

In Step 5, we find the representative gene of a cluster by finding the gene with the minimum sum of squares of distances to all other genes in the cluster. Formally, given a cluster C, we find the gene s C, such that {sum}tCd(s,t)2 is minimum.

The running time of the above sweep-line algorithm is ), since Step 5 takes O(Ms) time in the worst case. In practice, this algorithm is reasonably fast. On a modern PC, it takes <8 min to run on dendrograms constructed from 100 genes.


    6 STEP 4: CLASSIFICATION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
After selecting the marker genes, we classify the test data using one of the following classifiers: k-nearest neighbor (k-NN), linear support vector machine (SVM), C4.5 decision tree and Naive Bayes (NB). The classification results of these algorithms are then used to evaluate the effectiveness of our proposed marker gene selection method.

6.1 k-nearest neighbor
The k-NN classifier (Dasarathy, 1991) is a well-known non-parametric classifier. To classify a new input x, the k-NNs are retrieved from the training data. The input x is then labeled with the majority class label corresponding to the k-NNs.

6.2 Support vector machine
The SVM belongs to a new generation of learning system based on recent advances in statistical learning theory (Vapnik, 1998). A linear SVM, which is used in our system, aims to find the separating hyperplane with the largest margin, defined as the sum of the distances from a hyperplane (implied by a linear classifier) to the closest positive and negative exemplars. The expectation is that the larger the margin, the better the generalization of the classifier. In a non-separable case, a linear SVM seeks a trade-off between maximizing the margin and minimizing the number of errors.

6.3 C4.5 decision tree
C4.5 (Quinlan, 1993) is a well-known decision tree based classifier. A decision tree is a tree structure where non-leaf nodes represent tests on one or more features and leaf nodes reflect classification outcomes. An instance is classified by starting at the root node of the decision tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute. This process is then repeated at the nodes on this branch until a leaf node is reached.

In our experiments, we use the J4.8 algorithm in Weka (Witten and Frank, 1999), which is a Java implementation of C4.5 Revision 8.

6.4 Naive Bayes
The NB classifier is a probabilistic algorithm based on Bayes' rule and the simple assumption that the feature values are conditionally independent given the class. Given a new sample observation, NB estimates the conditional probabilities of classes using the joint probabilities of training sample observations and classes.


    7 EXPERIMENTAL RESULTS
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
In this section, we present results on two public datasets. Details about the data, preprocessing, experimental parameters and results are provided in the following sections.

7.1 Datasets used
In this paper, we used the following three public datasets.

The three datasets are summarized in Table 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Summary of datasets used in experiments

 
7.2 Data preprocessing
Data preprocessing is very important for microarray data analysis. Before feeding the datasets to our system, we performed data pre-processing as follows:
  • Missing gene expression values for a gene were replaced by the mean value of that gene.
  • Gene expression levels for each gene were normalized by subtracting the mean and dividing by the SD of that gene. The result is that expression levels for each gene have mean equal to 0 and variance equal to 1.

7.3 Experimental settings
We have implemented the proposed system in Java and MATLAB using Weka 3.4.1 (Witten and Frank, 1999) and PRTools 3.1.7 (Prt, 2003, http://www.ph.tn.tudelft.nl/~bob/PRTOOLS.html). All of the experiments were conducted on a Pentium M 1.7 GHz PC with 768 MB RAM. We consider the performance of our system with all possible combinations of the three feature filtering algorithms and four classifiers discussed above. For the k-NN classifier, we use the Euclidean distance as the distance metric in the experiments, and the best k between 1 and 10 is found by performing LOOCV on the training data.

The goal of our sweep-line algorithm with LOOCV is to find the best number of clusters. However, the SOM method of Kohonen (2001) is also commonly used to choose the best number of clusters (Wang et al., 2002b). To compare the performance of our method with that of SOM clustering, we used the SOM Toolbox for Matlab SOM (2002, http://www.cis.hut.fi/projects/somtoolbox/) to select the best number of clusters. Basically, a SOM was created to cluster the genes to the map units, and then k-means clustering was applied to cluster the map units themselves, where the Davies–Bouldin index was used to find the optimum number of clusters. The cluster representatives (marker genes) from the SOM clustering results were extracted with the same method as in Step 5. In the experiments, the default settings of the SOM Toolbox were used.

The HykGene software and datasets used in this paper are available at the supplementary information site (http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm).

7.4 RESULTS
For each of the three gene ranking methods (Relief-F, Information Gain and {chi} 2-statistic), we used it to select the top-50 and top-100 ranked genes. Then, we compared the LOOCV accuracy of the following three cases:

  1. Use all the top-ranked genes. In this case, all of the top-ranked genes are directly used for classification.
  2. Apply SOM clustering. The SOM clustering is applied to the top-ranked genes to select the best number of clusters.
  3. Apply HykGene. The top-ranked genes are further processed using our HykGene approach.

To estimate the probability that we obtained the selected marker genes from the top-ranked genes by chance, we randomly selected from the top-ranked genes the same number of marker genes as chosen by the HykGene method for 1000 times. For each time, we followed the same procedure as used by HykGene to carry out LOOCV. The P-value was defined as the probability of observing a mis-classification rate for the randomly chosen marker genes that was less than or equal to that observed using the HykGene methods.

Tables 24 show the results on the ALL/AML leukemia dataset; Tables 5–7 show the results on the colon tumor dataset; Tables 8–10 show the results on the MLL dataset. Tables 2 4 are included in this paper, whereas tables 5–10 with hyper-links to lists of selected marker genes are available in the Supplementary information site. In each table, LOOCV accuracy is listed for the above three cases. For the last two cases, the number of selected marker genes was also recorded. The lists of selected genes can be found in the Supplementary information site. Note that our HykGene approach selects different numbers of genes when it uses different classifiers. This is because the HykGene approach, like the wrapper method (John et al., 1994), depends on the classifier to do cross-validation. We can observe from the results that:

  • In all cases, our HykGene approach offered better (or the same) LOOCV accuracy compared to base line approaches that use the top-50 or top-100 ranked genes directly for classification.
  • Our HykGene approach can eliminate most of the genes from the top-50 or top-100 genes. In many cases, only a few (1–10) marker genes were selected.
  • The SOM clustering was able to pick small sets of marker genes, but its LOOCV classification accuracies are consistently worse than our approach, no matter what classifier and gene ranking method were used.


View this table:
[in this window]
[in a new window]
 
Table 2 Results on ALL/AML leukemia dataset using Relief-F

 

View this table:
[in this window]
[in a new window]
 
Table 4 Results on ALL/AML leukemia dataset using {chi}2-statistic

 
The HykGene Software, datasets used in this paper, Tables 2–10 with hyper-links to lists of selected marker genes are available at the Supplementary information site (http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm).


    8 DISCUSSION AND CONCLUSION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 
The main contribution of this paper is to introduce a novel hybrid approach, HykGene, which sequentially combines gene ranking and hierarchical clustering to select marker genes for classification. A prototype system has been implemented and tested on two public datasets. Experiments showed very good LOOCV accuracy using relatively few marker genes (100% on ALL/AML leukemia dataset using 5 genes, 91.9% on colon tumor dataset using 3 genes and 100% on MLL leukemia dataset using 26 genes). Using far fewer genes, our approach has been shown to be able to offer better (or the same) LOOCV accuracy compared with conventional approaches using all of the top-50 or the top-100 ranked genes directly.

Currently, the gene closest to the cluster centroid is selected by our method as the representative for that cluster. We are currently investigating alternative approaches that use Gene Ontology to guide this selection process.


View this table:
[in this window]
[in a new window]
 
Table 3 Results on ALL/AML leukemia dataset using Information Gain

 


    Acknowledgments
 
This work was supported in part by the National Science Foundation under grants ITR-0312629 and IDM-0083423, and also in part by FAMRI.

Received on August 15, 2004; revised on November 25, 2004; accepted on November 25, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEM OVERVIEW
 3 STEP 1: GENE...
 4 STEP 2: HIERARCHICAL...
 5 STEP 3: REDUCE...
 6 STEP 4: CLASSIFICATION
 7 EXPERIMENTAL RESULTS
 8 DISCUSSION AND CONCLUSION
 REFERENCES
 

    Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511[CrossRef][Medline].

    Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 6745–6750[Abstract/Free Full Text].

    Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden, M.D., Sallan, S.E., Lander, E.S., Golub, T.R., Korsmeyer, S.J. (2002) MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet., 30, 41–47[CrossRef][ISI][Medline].

    Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., Yakhini, Z. (2000) Tissue classification with gene expression profiles. J. Comput. Biol., 7, 559–583[CrossRef][ISI][Medline].

    Bomprezzi, R., Ringner, M., Kim, S., Bittner, M.L., Khan, J., Chen, Y., Elkahloun, A., Yu, A., Bielekova, B., Meltzer, P.S., et al. (2003) Gene expression profile in multiple sclerosis patients and healthy controls: identifying pathways relevant to disease. Hum. Mol. Genet., 12, 2191–2199[Abstract/Free Full Text].

    Dasarathy, B. Nearest Neighbor Norms: NN Pattern Classification Techniques, (1991) , Los Alamitos, CA, USA IEEE Computer Society Press.

    Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868[Abstract/Free Full Text].

    Fayyad, U. and Irani, K. (1993) Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence , Portland, OR , pp. 1022–1027.

    Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537[Abstract/Free Full Text].

    Gordon, G.J., Jensen, R.V., Hsiao, L.L., Gullans, S.R., Blumenstock, J.E., Ramaswamy, S., Richards, W.G., Sugarbaker, D.J., Bueno, R. (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res., 62, 4963–4967[Abstract/Free Full Text].

    Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002) Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422[CrossRef][ISI].

    Hanczar, B., Courtine, M., Benis, A., Hennegar, C., Clement, K., Zucker, J.-D. (2003) Improving classification of microarray data using prototype-based feature selection. SIGKDD Explor. Newslett., 5, 23–30.

    Herwig, R., Poustka, A.J., Muller, C., Bull, C., Lehrach, H., O’Brien, J. (1999) Large-scale clustering of cDNA-fingerprinting data. Genome Res., 9, 1093–1105[Abstract/Free Full Text].

    Jaeger, J., Sengupta, R., Ruzzo, W.L. (2003) Improved gene selection for classification of microarrays. Pac. Symp. Biocomput., 53–64.

    Jain, A.K., Duin, R.P.W., Mao, J. (2000) Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell., 22, 4–37[CrossRef].

    John, G., Kohavi, R., Pfleger, K. (1994) Irrelevant features and the subset selection problem. Proceedings of the Eleventh International Conference on Machine Learning (ICML'94) , New Brunswick, NJ , pp. 121–129.

    Kohonen, T. Self-organizing maps, (2001) 3rd edn. , Berlin, New York, 00052663 Teuvo Kohonen Springer series in information sciences; 30. Springer, pp. 403–486.

    Kononenko, I. (1994) Estimating attributes: analysis and extensions of relief. Proceedings of the European conference on machine learning on Machine LearningCatana, Italy , New York Springer-Verlag, pp. , pp. 171–182.

    Li, J., Liu, H., Wong, L. (2003) Mean-entropy discretized features are effective for classifying high-dimensional biomedical data. Proceedings of the 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics , Washington, DC.

    Li, T., Zhang, C., Ogihara, M. (2004) A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20, , pp. 2429–2437[Abstract/Free Full Text].

    Liu, H., Li, J., Wong, L. (2002) A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Inform., 13, 51–60.

    Liu, H. and Setiono, R. (1995) Chi2: feature selection and discretization of numeric attributes. Proceedings of the Seventh International Conference on Tools with Artificial Intelligence , Herndon, VA , pp. 388–391.

    Lu, Y. and Han, J. (2003) Cancer classification using gene expression data. Inform. Syst., 28, 243–268[CrossRef].

    Mitchell, T.M. Machine Learning, (1997) , New York McGraw-Hill.

    Quinlan, J.R. C4.5: Programs for Machine Learning, (1993) , San Francisco, CA, USA Morgan Kaufmann Publishers Inc.

    Robnik-Sikonja, M. and Kononenko, I. (2003) Theoretical and empirical analysis of relieff and rrelieff. Machine Learning, 53, 23–69[CrossRef].

    Roth, F.P. (2001) Bringing out the best features of expression data. Genome Res., 11, 1801–1802[Free Full Text].

    Su, Y., Murali, T., Pavlovic, V., Schaffer, M., Kasif, S. (2003) RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19, 1578–1579[Abstract/Free Full Text].

    Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., Golub, T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912[Abstract/Free Full Text].

    Tang, C. and Zhang, A. (2003) Mining multiple phenotype structures underlying gene expression profiles. Proceedings of the Twelfth International Conference on Information and knowledge managementNew Orleans, LA ACM Press, pp. 418–425.

    Vapnik, V.N. Statistical Learning Theory, (1998) , New York, USA Wiley-Interscience.

    Wang, H., Wang, W., Yang, J., Yu, P.S. (2002a) Clustering by pattern similarity in large data sets. Proceedings of the 2002 ACM SIGMOD International Conference on Management of DataMadison, WI , New York, USA ACM Press, pp. , pp. 394–405.

    Wang, J., Delabie, J., Aasheim, H., Smeland, E., Myklebost, O. (2002b) Clustering of the som easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study. BMC Bioinformatics, 3, 36[CrossRef][Medline].

    Witten, I.H. and Frank, E. Data mining: practical machine learning tools and techniques with Java implementations, (1999) , San Francisco, CA Morgan Kaufmann.

    Xing, E.P., Jordan, M.I., Karp, R.M. (2001) Feature selection for high-dimensional genomic microarray data. Proceedings of the Eighteenth International Conference on Machine LearningWilliams College, MA , San Francisco, USA Morgan Kaufmann Publishers Inc., pp. , pp. 601–608.

    Xiong, M., Fang, X., Zhao, J. (2001) Biomarker identification by feature wrappers. Genome Res., 11, 1878–1887[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
CirculationHome page
T. W. Chittenden, J. A. Sherman, F. Xiong, A. E. Hall, A. A. Lanahan, J. M. Taylor, H. Duan, J. D. Pearlman, J. H. Moore, S. M. Schwartz, et al.
Transcriptional Profiling in Coronary Artery Disease: Indications for Novel Markers of Coronary Collateralization
Circulation, October 24, 2006; 114(17): 1811 - 1820.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1530    most recent
bti192v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Y.
Right arrow Articles by Pearlman, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Y.
Right arrow Articles by Pearlman, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?