Gene selection via the BAHSIC family of algorithms
1National ICT Australia and Australian National University, Canberra, 2University of Sydney, Australia, 3Institute for Informatics, Ludwig-Maximilians-University, Munich and 4Max Planck Institute for Biological Cybernetics, Tübingen, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Identifying significant genes among thousands of sequences on a microarray is a central challenge for cancer research in bioinformatics. The ultimate goal is to detect the genes that are involved in disease outbreak and progression. A multitude of methods have been proposed for this task of feature selection, yet the selected gene lists differ greatly between different methods. To accomplish biologically meaningful gene selection from microarray data, we have to understand the theoretical connections and the differences between these methods. In this article, we define a kernel-based framework for feature selection based on the Hilbert–Schmidt independence criterion and backward elimination, called BAHSIC. We show that several well-known feature selectors are instances of BAHSIC, thereby clarifying their relationship. Furthermore, by choosing a different kernel, BAHSIC allows us to easily define novel feature selection algorithms. As a further advantage, feature selection via BAHSIC works directly on multiclass problems.
Results: In a broad experimental evaluation, the members of the BAHSIC family reach high levels of accuracy and robustness when compared to other feature selection techniques. Experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the data then non-linear kernels can be more suitable.
Availability: Accompanying homepage is http://www.dbs.ifi.lmu.de/~borgward/BAHSIC
Contact: kb{at}dbs.ifi.lmu.de
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Gene selection from microarray data is clearly one of the most popular topics in bioinformatics. To illustrate this, the database for Bibliography on Microarray Data Analysis (Li, 2006) has grown from less than 100 articles in 2000 to 1690 articles in January 2007. What are the reasons for this huge interest in feature selection?
There are two main reasons for this popularity, the first biological, the second statistically motivated. First, by selecting genes from a microarray that result in good separation between healthy and diseased patients, one hopes to find the significant genes affected by the disease, or even causing it. This is a central step towards understanding the underlying biological process.
Second, classifiers on microarray data tend to overfit due to the low number of patients and the high number of observed genes. This means that they achieve high accuracy levels on the training data, but do not generalize to new data. The underlying problem is that if sample size is much smaller than the number of genes, one can distinguish different classes of patients based on the noise present in these measurements, rather than on distinct biological characteristics of their gene expression levels. Via feature selection, one aims to reduce the number of genes by removing meaningless features.
Although feature selection on microarrays is popular, gene selection methods suffer from several problems. First of all, they lack robustness. In Ein-Dor et al. (2006), prognostic cancer gene lists selected from microarrays differ significantly between different methods, and even for different subsets of the same microarray datasets. The authors conclude that thousands of samples are needed for robust gene selection. Given that clinical studies almost exclusively deal with comparatively low sample sizes, this is a very pessimistic view of clinical microarray data analysis. At the other end of the spectrum are recent results in sparse decoding (Candes and Tao, 2005; Wainwright, 2006) which suggest that for a very well defined family of inverse problems, asymptotically only
observations are needed to recover n features accurately from d dimensions.
Besides small sample size and high dimensionality, another crucial problem arises from the plethora of feature selection methods for microarray data. Each approach is endowed with its own theoretical analysis, and the connections between them are so far poorly understood (Stolovitzky, 2003). This makes it difficult to explain why different algorithms generate different prognostic gene lists on the same set of cancer microarray data. A unifying framework for feature selection algorithms would help to understand these relations and to clarify which feature selection algorithms are most helpful for gene selection.
In this article, we present such a unifying framework called BAHSIC. BAHSIC defines a class of backward (BA) elimination feature selection algorithms that make use of (i) kernels and (ii) the Hilbert–Schmidt independence criterion (HSIC) (Gretton et al., 2005). We show that BAHSIC includes several well-known feature selection methods, namely Pearson's correlation coefficient (Ein-Dor et al., 2006; van 't Veer et al., 2002), t-test (Tusher et al., 2001), signal-to-noise ratio (Golub et al., 1999), Centroid (Bedo et al., 2006; Hastie et al., 2001), Shrunken Centroid (Tibshirani et al., 2002, 2003) and ridge regression (Li and Yang, 2005).
By choosing different kernels, one may define new types of feature selection algorithm. We show that several well-known feature selection methods merely differ in their choice of kernel. Furthermore, BAHSIC can be extended in a principled fashion to multiclass and regression problems, in contrast to most competing methods which are exclusively geared towards two-class problems.
In a broad experimental evaluation, we compare feature selection methods that are instances of BAHSIC to several competing approaches, with respect to both the robustness of the selected features and the resulting classification accuracy. Our unified framework assists us in explaining how the kernel used by a particular feature selector determines which genes are preferred. Our experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the gene expression data then non-linear kernels can be more suitable.
| 2 FEATURE SELECTION AND BAHSIC |
|---|
|
|
|---|
The problem of feature selection can be cast as a combinatorial optimization problem. We denote by
|
| (1) |
To describe our feature selection criterion, we begin with the simple example of linear dependence detection, which we then extend to the detection of more general kinds of dependence. Consider spaces
and
, on which we jointly sample observations (x, y) from a distribution
. We may define a covariance matrix
|
| (2) |
i the singular values of |
|
We now define
and
more broadly as two domains from which we draw samples (x, y) as before: these may be real valued, vector valued, class labels, strings (Lodhi et al., 2002), graphs (Gärtner et al., 2003) and so on (see Schölkopf et al., 2004) for further examples in bioinformatics). We define a (possibly non-linear) mapping
from each
to a feature space
, such that the inner product between the features is given by a kernel function
:
is called a reproducing kernel Hilbert space (RKHS).1 Likewise, let
be a second RKHS on
with kernel
and feature map
(y). We may now define a cross-covariance operator between these feature maps, which is analogous to the covariance matrix in (2): this is a linear operator
such that
|
| (3) |
is the tensor product (see Baker, 1973; Fukumizu et al. 2004 for more detail). The square of the Hilbert–Schmidt norm of the cross-covariance operator HSIC, |
| (4) |
|
| (5) |
We now describe two theorems from Gretton et al. (2005) which support our using HSIC as a feature selection criterion. The first (Gretton et al., 2005, Theorem 3) shows that the empirical HSIC converges in probability to its population counterpart with rate
. This implies that if the empirical HSIC is large, then given sufficient samples it is very probable that the population HSIC is also large; likewise, a small empirical HSIC likely corresponds to a small population HSIC. Moreover, the same features should consistently be selected to achieve high dependence if the data is repeatedly drawn from the same distribution. The second result (Gretton et al., 2005, Theorem 4) states that when
are RKHSs with universal (Steinwart, 2002) kernels k , l on respective compact domains
and
, then
if and only if x and y are independent. In terms of our microarray setting, using a universal kernel such as the Gaussian RBF kernel or the Laplace kernel, HSIC is zero if gene expression levels and class labels are independent; clearly we want to reach the opposite result, namely strong dependence between expression levels and class labels. Hence, we try to select genes that maximize HSIC.
2.1 BAHSIC
Having defined our feature selection criterion, we now describe an algorithm that conducts feature selection on the basis of this dependence measure. Using HSIC, we can perform both forward and backward selection of the features. In particular, when we use a linear kernel on both the data and labels, forward selection and backward selection are equivalent: the objective function decomposes into individual coordinates, and thus feature selection can be done without recursion in one go.
In the case of more general kernels, forward selection is computationally more efficient, however, backward elimination (BA) in general yields better features, since the quality of the features is assessed within the context of all other features. Hence, we present the BA version of our algorithm here.
Our feature selection algorithm BAHSIC appends the features from
to the end of a list
so that the elements towards the end of
have higher relevance to the learning task. The feature selection problem in (1) can be solved by simply taking the last t elements from
. Our algorithm produces
recursively, eliminating the least relevant features from
and adding them to the end of
at each iteration. In describing the algorithm, we modify our notation for
to make clearer its dependence on the set of features chosen. Thus, we replace the definition in (5) with
, where
are the features used in computing the data kernel matrix K, and
is the parameter for the data kernel
(for instance, this might be the size of a Gaussian kernel, or the degree of a polynomial kernel). The set
denotes all possible kernel parameters.
|
Step 3 of the algorithm optimizes over the set
. For this reason
is restricted so as to make this search practical (the nature of the restriction depends on both the data and the kernel: for instance, in the case of the size parameter of a Gaussian kernel, we consider an interval of the form
). If we have no prior knowledge regarding the nature of the non-linearity in the data, then optimizing over
is essential: it allows us to adapt to the scale of the nonlinearity present in the (feature-reduced) data. If we have prior knowledge about the type of non-linearity, we can use a kernel with fixed parameters for BAHSIC. In this case, Step 3 can be omitted since there will be no parameter to tune.
Step 4 of the algorithm is concerned with the selection of a set
of features to eliminate. While one could choose a single element of
, this would be highly inefficient when there are a large number of irrelevant features. On the other hand, removing too many features at once risks the loss of relevant features. In our experiments, we found a good compromise between speed and feature quality was to remove 10% of the current features at each iteration.
| 3 FEATURE SELECTORS THAT ARE INSTANCES OF BAHSIC |
|---|
|
|
|---|
In this section, we will show that several feature selection criteria are special cases of BAHSIC, and thus BAHSIC is capable of finding and exploiting dependence of a much more general nature (for instance, dependence between data and labels with graph and string values).
We first define the symbols used in the following sections. Let X be the full data matrix with each row a sample and each column a feature, x be a column of X and xi be the entries in x. Let y be the vector of labels with entries yi. When the labels are multidimensional, we express them as a matrix Y, with each row a datum and each column a dimension. The kth column of Y is then
.
Suppose the number of data points is m. We denote the mean of a particular feature of the data as
, and its SD as sx. For two-class data, let the number of the positive and negative samples be
and
, respectively (
). In this case, denote the mean of the samples from the positive and the negative classes by
and
, respectively, and the corresponding SD by
and
. For multiclass data, we let mi be the number of samples in class i, where
and
. Finally, let
be a column vector of all ones with length k and
be a column vector of all zeros.
3.1 Pearson's correlation
Pearson's correlation is commonly used in microarray analysis (Ein-Dor et al., 2006; van 't Veer et al., 2002), and is defined as
|
| (6) |
|
| (7) |
3.2 Mean difference and its variants
The difference between the sample means of the positive and negative classes,
, is useful for selecting discriminative features. With different normalization of the data and labels, many variants can be derived. For example, the centroid (lin) (Bedo et al., 2006), t-score (t) (Hastie et al., 2001), moderated t-score (m-t), signal-to-noise ratio (snr) and B-statistics (lods) (Smyth, 2004) all belong to this subfamily.
We will start by showing that
is a special case of HSIC. This is straightforward if we assign
as the labels to the positive samples and
to the negative samples. Applying a linear kernel on both domains leads to the equivalence
|
| (8) |
The t-test is defined as
, where
. The square of the t-test is equivalent to HSIC if the data is normalised by
. The signal-to-noise ratio, moderated t-test, and B-statistics are three variants of the t-test. They differ only in their respective denominators, and are thus special cases of HSIC if we normalize the data accordingly. For example, we obtain the signal-to-noise ratio if the data are normalized by
.
3.3 Shrunken centroid
The shrunken centroid (pam) method (Tibshirani et al., 2002, 2003) performs feature ranking using the differences from the class centroids to the centroid of all the data. This is also related to HSIC if specific preprocessing of the data and labels is performed. Here we will focus on constructing appropriate labels, as the normalization of the data is similar to the previous section. For two-class problems, we use the 2D label matrix
|
| (9) |
|
| (10) |
3.4 Multiclass
In addition to scoring features for two-class data, our method can readily be applied to multiclass data, by constructing an appropriate label space kernel using the class label assignments. For instance, we can score a feature for the multiclass classification problem by applying linear kernels to the following label feature vectors (3-class example):
|
| (11) |
|
| (12) |
3.5 Regression
BAHSIC can also be used to select features for regression problems, except that in this case the labels are continuous variables. Again, we can use different kernels on both the data and the labels and apply BAHSIC. In this context, feature selection using ridge regression can also be viewed as a special case of BAHSIC. In ridge regression (Hastie et al., 2001), we predict the outputs
using the predictor
by minimizing the objective function
, where the second term is known as the regularizer. Our discussion encompasses two cases: first, the linear model, in which
; and second, the non-linear case, in which each of the m rows of
is a vector of non-linear features of a particular observation xi, and
. Recursive feature elimination combined as an embedded method with ridge regression removes the feature which causes the smallest increase in
. Equivalently, after minimizing
, this is the feature which has the smallest absolute weight
.
The minimum of this objective function with respect to
is
|
| (13) |
The final case we consider is kernel ridge regression, which differs from the above in that the space of non-linear features of the input may be infinite dimensional, and the regularizer becomes a smoothness constraint on the functions from this space to the output. Specifically, the inputs are mapped to a different feature space
with kernel
, in which a linear prediction is made of the label y. Without going into further detail, we use standard kernelisation methods (Schölkopf and Smola, 2002) to obtain that the minimum objective is
. This is equivalent to defining a feature space
with kernel
on the data, and then selecting features by maximising HSIC.
| 4 ALGORITHMS UNRELATED TO BAHSIC |
|---|
|
|
|---|
In addition to the feature selection algorithms that are related to BAHSIC, we compare against three methods that are not members of the BAHSIC family: mutual information (mi), recursive feature elimination SVM (rfe) and
1-SVM for feature selection (l1). The mutual information is a measure of statistical dependence between two random variables (Cover and Thomas, 1991), and is zero if and only if the variables are independent. To use the mutual information in a filter method for feature selection, Zaffalon and Hutter (2002) compute it between each feature and the labels: the features that correspond to the highest mutual information are selected. Variants of this method can consider several features at a time, but the resulting density estimation problem becomes much harder for increased dimensions. This method is applicable to both two-class and multiclass datasets.
Recursive feature elimination SVM (Guyon et al., 2002) is an embedded method for feature selection. It aims to optimize the performance of a linear SVM by eliminating the least useful features for SVM classification in a backwards greedy fashion. Initially, an SVM using all features is trained. The least important features, estimated by the absolute value of the trained weights, are then dropped from the model and the SVM retrained. The process is carried out recursively until the desired number of features is reached.
The
1-SVM (Tibshirani,1994) is also an embedded method for feature selection. Using an
1 norm as the regularizer in an SVM results in sparse weight vectors (Fan and Li, 2001), where the number of non-zero weights depends on the amount of regularization. It is not easy to specify the exact sparsity of the solution, but in our experiments the typical number of features selected was below 50.
| 5 DATASETS |
|---|
|
|
|---|
We ran our experiments on 28 microarray datasets of gene expression levels, of which 15 are two-class datasets and 13 are multiclass datasets. Samples within one class represent one common phenotype or a subtype thereof. The 28 datasets are assigned a reference number for convenience. Two-class datasets have a reference number less than or equal to 15, and multiclass datasets have reference numbers of 16 and above. Only one dataset, yeast, has feature dimension less than 1000 (79 features), i.e. it contains expression levels for less than 1000 genes. All other datasets have dimensions ranging from
2000 to 25 000. The number of samples varies between
50 and 300 samples. A summary of the datasets and their sources is as follows:- Six datasets studied in (Ein-Dor et al., 2006). Three deal with breast cancer (van 't Veer et al., 2002; van de Vijver et al., 2002; Wang et al., 2005) (numbered 1, 2 and 3), two with lung cancer (Bhattacharjee et al., 2001; Beer et al., 2002) (4, 5), and one with hepatocellular carcinoma (Iizuka et al., 2003) (6). The B cell lymphoma dataset (Rosenwald et al., 2002) is not used because none of the tested methods produce classification errors lower than 40%.
- Six datasets studied in (Warnat et al., 2005). Two deal with prostate cancer (Dhanasekaran et al., 2001; Welsh et al., 2001) (7, 8), two with breast cancer (Gruvberger et al., 2001; West et al., 2001) (9, 10), and two with leukaemia (Bullinger et al., 2004; Valk et al., 2004) (16, 17).
- Five commonly used bioinformatics benchmark datasets on colon cancer (Alon et al., 1999) (11), ovarian cancer (Berchuck et al., 2005) (12), leukaemia (Golub et al., 1999) (13), lymphoma (Alizadeh et al., 2000) (18), and yeast (Brown et al., 2000) (19).
- Nine datasets from the NCBI GEO database. The GDS IDs and reference numbers for this article are GDS1962 (20), GDS330 (21), GDS531 (14), GDS589 (22), GDS968 (23), GDS1021 (24), GDS1027 (25), GDS1244 (26), GDS1319 (27), GDS1454 (28) and GDS1490 (15), respectively.
| 6 EXPERIMENTS |
|---|
|
|
|---|
6.1 Classification error and robustness of genes
We used stratified 10-fold cross-validation and SVMs to evaluate the predictive performance of the top 10 features selected by each method. For two-class datasets, a non-linear SVM with a Gaussian RBF kernel,
were tuned on a grid of The classification results for binary and multiclass datasets are reported in Tables 1 and 2, respectively. In addition to the error rate, we also report the overlap between the top 10 gene lists created in each fold. The multiclass results are presented separately since some older members of the BAHSIC family, and some competitors, are not naturally extensible to multiclass datasets. Our next two sections contain the analysis of these results: in Section 6.2, we discuss the consistency of each method across the various types of data, and in Section 6.3, we analyse the effect of kernel choice on performance, with a particular focus on linear versus non-linear kernels.
|
|
6.2 Performance of feature selectors across datasets
When comparing the overall performance of various gene selection algorithms, it is of primary interest to choose a method which works well everywhere, rather than one which sometimes works well and sometimes performs catastrophically. It turns out that the linear kernel (lin) outperforms all other methods in this regard, both for binary and multiclass problems.
To show this, we measure how the various methods compare with the best-performing one in each dataset in Tables 1 and 2. The deviation between algorithms is taken as the square of the difference in performance. This measure is chosen because gene expression data is relatively expensive to obtain, and we want an algorithm to select the best genes. If an algorithm selects genes that are far inferior to the best possible among all algorithms (catastrophic case), we downgrade the algorithm more heavily. Squaring the performance difference achieves exactly this effect, by penalizing larger differences more heavily. In other words, we want to choose an algorithm that performs homogeneously well in all datasets. To provide a concise summary, we add these deviations over the datasets and take the square root as the measure of goodness. These scores (the
2 distances) are listed in Tables 1 and 2. In general, the smaller the
2 distance, the better the method. It can be seen that the linear kernel has the smallest
2 distance on both the binary and multiclass datasets.
6.3 Impact of kernel on gene selection
In Section 3, we unified several feature selection algorithms in one common framework. In our feature selection evaluation experiment, we showed the linear kernel selects the genes leading to the best classification accuracies on average. From a biological perspective, the interesting questions to ask are: why does the linear kernel select the best genes on average? Why are there datasets on which it does not perform best? Finally, which genes are selected by a linear kernel-based feature selector, and which by a Gaussian kernel-based selector? In this section, we conduct experimental analyses to come up with answers to these questions. These findings have deep implications, because they help us to understand which genes will be selected by which algorithm. We summarize these implications in two rules of thumb at the end of the section.
6.3.1 Artificial genes
To demonstrate the effect of different kernels on gene selection, and the preference of certain kernels for certain genes, we created ten artificial genes and inserted them into two breast cancer datasets (datasets 9 and 10). The genes were created such that the signal-to-noise ratio was higher than those of the real genes. In a sense, we used the original microarray data as realistic noise, and we expect a feature selector to rank the artificial genes on top. We experimented with both non-linearly and linearly separable artificial genes, as shown in Figure 1. To illustrate the differences between these two types of genes, linear separability should arise when different phenotypic classes are clearly linked with certain high or low levels of expression for a group of genes (Fig. 1a). Non-linear separability might occur when one of the phenotypic classes consists of subtypes, such that both subtypes show gene expression levels different from that of a healthy patient, but one subgroup has lower expression levels and the other higher (Fig. 1b).
|
We used the median rank of the 10 artificial genes as our measure of ranking performance. This provides an estimate of the utility of the kernel for selecting the genes with high signal-to-noise ratios. We deem a feature selector competent for the task if this measure is less than 10. Table 3 lists the results of this experiment. We are particularly interested in the two new variants, RBF and dis, of the BAHSIC family. From the table, we observe that
- RBF and dis perform comparably to existing BAHSIC members, such as pc and snr, in detecting artificial genes that are linearly separable. Most methods rank the 10 inserted genes on the top.
- RBF and dis perform much better in detecting artificial genes that are separable only non-linearly. They rank the 10 artificial genes on top in at least 9 out of the 10 folds, while other methods (except mi) fall short.
|
6.3.2 Subtype discrimination using non-linear kernels
We now investigate why it is that non-linear kernels (RBF and dis) provide better genes for classification in three datasets from Table 2 [datasets 18 (Alizadeh et al., 2000) 27 (GDS1319), and 28 (GDS1454)]. These datasets all represent multiclass problems, where at least two of the classes are subtypes with respect to the same supertype.2 Ideally, the selected genes should contain information discriminating the classes. To visualize this information, we plot in Figure 2 the expression value of the top-ranked gene against that of a second gene ranked in the top 10. This second gene is chosen so that it has minimal correlation with the first gene. We use colours and shapes to distinguish data from different classes (datasets 18 and 28 each contain 3 classes, therefore we use 3 different colour and shape combinations for them; dataset 27 has 4 classes, so we use 4 such combinations).
|
We found that genes selected using non-linear kernels provide better separation between the two classes that correspond to the same supertype (red dots and green diamonds), while the genes selected with the linear kernel do not separate these subtypes well. In the case of dataset 27, the increased discrimination between red and green comes at the cost of a greater number of errors in another class (black triangle), however, these mistakes are less severe than the errors made between the two subtypes by the linear kernel. This eventually leads to better classification performance for the non-linear kernels (see Table 2).
The principal characteristic of the datasets is that the blue square class is clearly separated from the rest, while the difference between the two subtypes (red dots and green diamonds) is less clear. The first gene provides information that distinguishes the blue square class, however, it provides almost no information about the separation between the two subtypes. The linear kernel does not search for information complementary to the first gene, whereas non-linear kernels are able to incorporate complementary information. In fact, the second gene that distinguishes the two subtypes (red dots and green diamonds) does not separate all classes. From this gene alone, the blue square class is heavily mixed with other classes. However, combining the two genes together results in better separation between all classes.
6.3.3 Rules of thumb and implication to gene activity
To conclude our experiments, considering the fact that the linear kernel performed best in our feature selection evaluation, yet also taking into account the existence of non-linear interactions between genes (as demonstrated in Section 6.3.2), we can derive the following two rules of thumb for gene selection:
- always apply the linear kernel for general purpose gene selection;
- apply a Gaussian kernel if non-linear effects are present, such as multimodality or complementary effects of different genes.
This result should come as no surprise, due to the high dimensionality of microarray datasets, but we make the point clear by a broad experimental evaluation. These experiments also imply a desirable property of gene activity as a whole: it correlates well with the observed outcomes. Multimodal and highly non-linear situations exist, where a non-linear feature selector is needed (as can be seen in the outcomes on datasets 18, 27 and 28), yet they occur relatively rarely in practise.
| 7 DISCUSSION |
|---|
|
|
|---|
In this article, we have defined the class of BAHSIC feature selection algorithms. We have shown that this family includes several well-known feature selection methods, which differ only in the choice of the preprocessing and the kernel function. Our experiments show that the BAHSIC family of feature selection algorithms performs well in practise, both in terms of accuracy and robustness. In particular, the linear kernel (centroid feature selector) performs best in general, and is thus a reliable first choice that provides good baseline results.
In the artificial gene experiments, we demonstrated non-linear RBF and dis kernels can select better features when there are non-linear interactions. Furthermore, we showed on real multiclass datasets that non-linear kernels can select better genes for discriminating between subtypes. This indicates that non-linear kernels are potentially useful for finding better prognostic markers and for subtype discovery.
The BAHSIC family represents a step towards establishing theoretical links between the huge set of feature selection algorithms in the bioinformatics literature. Only if we fully understand these theoretical connections can we hope to explain why different methods select different genes, and to choose feature selection methods that yield the most biologically meaningful results.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This work was supported in part by National ICT Australia; the German Ministry for Education, Science, Research and Technology (BMBF) under grant no. 031U112F within the BFAM (Bioinformatics for the Functional Analysis of Mammalian Genomes) project which is part of the German Genome Analysis Network (NGFN); and the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
1 A note on the non-linear mapping: if
x, x'
+ c)t. Other kernels, like the Gaussian RBF kernel k(x, x') = exp (–0.5
–2||x–x'||2), correspond to infinitely large feature spaces. We need never evaluate these feature representations explicitly, however.
2 For dataset 18, the 3 subtypes are diffuse large B-cell lymphoma and leukaemia, follicular lymphoma and chronic lymphocytic leukaemia; for dataset 27, the 4 subtypes are various C blastomere mutant embryos: wild type, pie – 1, pie – 1 + pal – 1 and mex – 3 + skn – 1; for dataset 28, the 3 subtypes are normal cell, IgV unmutated B-cell and IgV mutated B-cell. ![]()
| REFERENCES |
|---|
|
|
|---|
Alizadeh A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, ( (2000) ) 403, : 503–511.[CrossRef][Medline].
Alon U, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, ( (1999) ) 96, : 6745–6750.
Baker C. Joint measures and cross-covariance operators. Trans. Am. Math. Soc, ( (1973) ) 186, : 273–289.[CrossRef].
Bedo J, et al. An efficient alternative to svm based recursive feature elimination with applications in natural language processing and bioinformatics. Artificial Intelligence, ( (2006) ) 4304, : 170–180. LNCS..
Beer DG, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med, ( (2002) ) 8, : 816–824.[ISI][Medline].
Berchuck A, et al. Patterns of gene expression that characterize long-term survival in advanced stage serous ovarian cancers. Clin. Cancer Res, ( (2005) ) 11, : 3686–3696.
Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA, ( (2001) ) 98, : 13790–13795.
Brown M, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci, ( (2000) ) 97, : 262–267.
Bullinger L, et al. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N. Engl. J. Med, ( (2004) ) 350, : 1605–1616.
Candes E, Tao T. Decoding by linear programming. IEEE Trans. Info Theory, ( (2005) ) 51, : 4203–4215.[CrossRef].
Cover TM, Thomas JA. Elements of Information Theory, ( (1991) ) New York: John Wiley and Sons..
Dhanasekaran SM, et al. Delineation of prognostic biomarkers in prostate cancer. Nature, ( (2001) ) 412, : 822–826.[CrossRef][Medline].
Ein-Dor L, et al. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl Acad. Sci. USA, ( (2006) ) 103, : 5923–5928.
Fan J, Li R. Variable selection via nonconcave penalized likelihood an its oracle properties. J. Am. Stat. Assoc, ( (2001) ) 96, : 1348–1360.[CrossRef][ISI].
Feuerverger A. A consistent test for bivariate dependence. Int. Stat. Rev, ( (1993) ) 61, : 419–433..
Fukumizu K, et al. Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. J. Mach. Learn. Res, ( (2004) ) 5, : 73–99..
Gärtner T, et al. On graph kernels: hardness results and efficient alternatives. Schölkopf B, Warmuth MK, eds. ( (2003) ) Proceedings of Annual Conference Computational Learning Theory. Springer. 129–143..
Golub TR, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, ( (1999) ) 286, : 531–537.
Gretton A, et al. Measuring statistical dependence with Hilbert-Schmidt norms. ( (2005) ) Proceedings of the International Conference on Algorithmic Learning Theory. 63–78..
Gruvberger S, et al. Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res, ( (2001) ) 61, : 5979–5984.
Guyon I, et al. Gene selection for cancer classification using support vector machines. Mach. Learn, ( (2002) ) 46, : 389–422.[CrossRef].
Hastie T, et al. The Elements of Statistical Learning, ( (2001) ) New York: Springer..
Iizuka N, et al. Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. Lancet, ( (2003) ) 361, : 923–929.[CrossRef][ISI][Medline].
Li F, Yang Y. Analysis of recursive gene selection approaches from microarray data. Bioinformatics, ( (2005) ) 21, : 3741–3747.
Li W. Bibliography on microarray data analysis. ( (2006) )..
Lodhi H, et al. Text classification using string kernels. J. Mach. Learn. Res, ( (2002) ) 2, : 419–444.[CrossRef][ISI].
Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med, ( (2002) ) 346, : 1937–1947.
Schölkopf B, Smola A. Learning with Kernels, ( (2002) ) Cambridge, MA: MIT Press..
Schölkopf B, et al. Kernel Methods in Computational Biology, ( (2004) ) Cambridge, MA: MIT Press..
Smyth G. Linear models and empirical bayes methods for assessing differential expressionin microarray experiments. Stat. Appl. Genet. Mol. Biol, ( (2004) ) 3, ..
Steinwart I. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res, ( (2002) ) 2, : 67–93.[CrossRef][ISI].
Stolovitzky G. Gene selection in microarray data: the elephant, the blind men and our algorithms. Curr. Opin. Struct. Biol, ( (2003) ) 13, : 370–376.[CrossRef][ISI][Medline].
Tibshirani R. Regression selection and shrinkage via the lasso. In: Technical report, ( (1994) ) Department of Statistics, University of Toronto. ftp://utstat.toronto.edu/pub/tibs/lasso.ps..
Tibshirani R, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. National Academy of Sciences, ( (2002) ) vol. 99, : 6567–6572.[CrossRef].
Tibshirani R, et al. Class prediction by nearest shrunken centroids, with applicaitons to DNA microarrays. Stat. Sci, ( (2003) ) 18, : 104–117.[CrossRef][ISI].
Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, ( (2001) ) 98, : 5116–5121.
Valk PJ, et al. Prognostically useful gene-expression profiles in acute myeloid leukemia. N. Engl. J. Med, ( (2004) ) 350, : 1617–1628.
van de Vijver MJ, et al. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med, ( (2002) ) 247, : 1999–2009..
van 't Veer LJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, ( (2002) ) 415, : 530–536.[CrossRef][Medline].
Wainwright M. Sharp thresholds for noisy and high-dimensional recovery of sparsity. In: Technical report, ( (2006) ) UC Berkeley: Department of Statistics..
Wang Y, et al. Gene-expression profiles to predict distant metastasis of lymp








