Bioinformatics Advance Access originally published online on May 18, 2006
Bioinformatics 2006 22(15):1855-1862; doi:10.1093/bioinformatics/btl190
Independent component analysis-based penalized discriminant method for tumor classification using gene expression data
1 Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences PO Box 1130, Hefei, Anhui 230031, China
2 Department of Automation, University of Science and Technology of China, China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Microarrays are capable of determining the expression levels of thousands of genes simultaneously. One important application of gene expression data is the classification of samples into categories. In combination with classification methods, this technology can support clinical management decisions for individual patients, e.g. in oncology. Standard statistical methodologies for classification or prediction do not work well when the number of variables p (genes) far exceeds the number of samples n. Existing statistical methodologies therefore need to be modified, or new ones developed, for the analysis of microarray data.
Results: This paper proposes a new method for tumor classification using gene expression data. In this method, we first employ independent component analysis (ICA) to model the gene expression data, then apply an optimal scoring algorithm to classify them. The approach has three features. First, it makes full use of the high-order statistical information contained in the gene expression data. Second, it employs regularized regression models to handle large numbers of correlated predictor variables. Finally, the predictive models for classifying tumors are built on the entire gene expression profile. To show the validity of the proposed method, we apply it to classify four DNA microarray datasets involving various human normal and tumor tissue samples. The experimental results show that the method is efficient and feasible.
Availability: Matlab scripts are available on request.
Contact: dshuang{at}iim.ac.cn
| INTRODUCTION |
|---|
A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. Current methods for classifying human malignancies rely mostly on a variety of morphological, clinical and molecular variables. Despite recent progress, there are still many uncertainties in diagnosis. Furthermore, it is likely that the existing tumor classes are heterogeneous and comprise diseases that are molecularly distinct. With the recent development of large-scale, high-throughput gene expression technology, it has become possible to diagnose and classify diseases, particularly cancers, directly from DNA microarray data (Alizadeh et al., 2000). This task is termed class prediction in the microarray literature (Golub et al., 1999). By monitoring the expression levels in cells for thousands of genes simultaneously, microarray experiments may lead to a more complete understanding of the molecular variations among tumors, and hence to a finer and more reliable classification.
With the wealth of gene expression data being produced by microarrays, more and more new prediction, classification and clustering techniques are being used to analyze the data. Several studies have reported the application of microarray gene expression data analysis to the molecular classification of cancer (Alon et al., 1999; Bittner et al., 2000; Furey et al., 2000). The analysis of differential gene expression data has also been used to distinguish between different subtypes of lung adenocarcinoma (Bhattacharjee et al., 2001) and colorectal neoplasm (Selaru et al., 2002). In addition, clinical outcomes in breast cancer (van't Veer et al., 2002; West et al., 2001) and lymphoma (Shipp et al., 2002) have been predicted successfully from gene expression data. Golub et al. (1999) used a nearest-neighbor classifier to distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL) in children. Dudoit et al. (2002) performed a systematic comparison of several discrimination methods for the classification of tumors based on microarray experiments. Although linear discriminant analysis was found to perform best, the method could only be used after the number of genes had been drastically reduced from thousands to tens using a univariate filtering criterion.
One feature of microarray data is that the number of tumor samples collected tends to be much smaller than the number of genes. The number of samples tends to be on the order of tens or hundreds, while microarray data typically contain thousands of genes on each chip. In statistical terms, this is called the large p, small n problem (West, 2003), i.e. the number of predictor variables is much larger than the number of samples. In theory, a more recent technique, the support vector machine (SVM), should be more suitable for this problem, and Furey et al. (2000) have applied SVM to classify tumors using microarray data. However, although SVM has been successfully applied to some other problems, it requires more training than linear discriminant analysis. In addition, the generalization of SVM to problems with more than two classes has not been solved satisfactorily.
Ghosh (2003) proposed a methodology using regularized regression models for the classification of tumors, focusing on three types of regularized regression model: ridge regression, principal components regression and partial least squares regression. One drawback of these techniques is that they use only the second-order statistical information of the gene data. However, in tasks such as classification, much of the important information may be contained in the high-order relationships among samples. It is therefore important to investigate whether generalizations of principal component analysis (PCA) that are sensitive to high-order relationships, not just second-order ones, are advantageous. Independent component analysis (ICA) (Bartlett et al., 2002; Teschendorff et al., 2005) is one such generalization. A number of algorithms for performing ICA have been proposed; see Hyvärinen et al. (2001) for details. Here, we employ FastICA, which was proposed by Hyvärinen (1999) and has proven successful in many applications, to address the problem of tumor classification.
In this article, we present a new methodology that combines ICA and regularized regression models (Frank and Friedman, 1993) for analyzing gene expression data. We first perform ICA on the gene expression data, then apply the optimal scoring algorithm (Hastie et al., 1994) to classify them. This approach has three advantages: first, it makes full use of the high-order statistical information contained in the gene expression data; second, regularized regression models can handle large numbers of correlated predictor variables; finally, predictive models for classifying tumors can be developed from the entire gene expression profile. To validate its efficiency, the proposed method is applied to classify four different DNA microarray datasets: colon cancer data (Alon et al., 1999), acute leukemia data (Golub et al., 1999), hepatocellular carcinoma data (Iizuka et al., 2003) and high-grade glioma data (Nutt et al., 2003). The prediction results show that our method is efficient and feasible.
| METHODS |
|---|
Independent component analysis
ICA is a useful extension of PCA that was developed in the context of blind separation of independent sources from their linear mixtures (Comon, 1994). Such blind separation techniques have been used in various applications, e.g. auditory signal separation and medical signal processing. Roughly speaking, rather than requiring that the coefficients of a linear expansion of the data vectors be uncorrelated, as in PCA, ICA requires these coefficients to be mutually independent (or as independent as possible). This implies that higher-order statistics are needed to determine the ICA expansion.
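Consider an n x p data matrix X, whose rows $r_i$ (i = 1, ..., n) correspond to observational variables and whose columns $c_j$ (j = 1, ..., p) are the individuals of the corresponding variables. The ICA model of X can be written as
$X = AS$ | (1) |
where A is an n x n mixing matrix and S is an n x p source matrix whose rows $s_k$ are as statistically independent as possible. The statistical independence of the rows can be quantified by the mutual information $I(S) = \sum_k H(s_k) - H(S)$, where $H(s_k)$ is the marginal entropy of the variable $s_k$ and $H(S)$ is the joint entropy. Estimating the independent components can be accomplished by finding the right linear combinations of the observational variables, since we can invert the mixing matrix as
$S = A^{-1}X = WX$ | (2) |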
There are a number of algorithms for performing ICA (Comon, 1994; Hyvärinen, 1999; Zheng et al., 2005, 2006). In this paper, we employ the FastICA algorithm proposed by Hyvärinen (1999) to address the problem of tumor classification. In this algorithm, the mutual information is approximated by a contrast function:
$J(s_k) = \left(E\{G(s_k)\} - E\{G(\nu)\}\right)^2$ | (3) |
where G is a non-quadratic function and $\nu$ is a Gaussian variable with zero mean and unit variance.
Like PCA, ICA can remove all linear correlations. By introducing a non-orthogonal basis, it also takes into account higher-order dependencies in the data. In this sense ICA is superior to PCA, which is sensitive only to second-order relationships in the data. The ICA model leaves some freedom in scaling and ordering: by convention, the independent components are scaled to unit deviation, while their signs and order can be chosen arbitrarily.
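To make the model and its inversion concrete, the following NumPy sketch runs a small symmetric FastICA iteration on synthetic data. It is only an illustration under stated assumptions: the tanh nonlinearity, whitening by eigendecomposition, the function names `sym_decorrelate` and `fastica_sketch`, and the synthetic Laplace sources are our choices and are not taken from the authors' Matlab scripts.

```python
import numpy as np

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    s, V = np.linalg.eigh(W @ W.T)
    return (V / np.sqrt(s)) @ V.T @ W

def fastica_sketch(X, n_iter=200, tol=1e-8, seed=0):
    """Symmetric FastICA with a tanh nonlinearity (illustrative only).

    X : (n, p) array; rows are observed mixtures, columns are observations.
    Returns (W, U, A) with U = W @ Xc and A = inv(W), where Xc is X with
    each row centred.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    # Whitening: K @ cov @ K.T equals the identity.
    d, E = np.linalg.eigh(Xc @ Xc.T / p)
    K = (E / np.sqrt(d)).T
    Z = K @ Xc
    W = sym_decorrelate(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E{z g(w^T z)} - E{g'(w^T z)} w for every row w.
        W_new = G @ Z.T / p - (1.0 - G ** 2).mean(axis=1)[:, None] * W
        W_new = sym_decorrelate(W_new)
        change = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if change < tol:
            break
    W_full = W @ K                      # unmixing matrix for the centred data
    U = W_full @ Xc                     # estimated independent components
    A = np.linalg.inv(W_full)           # estimated mixing matrix
    return W_full, U, A

# Synthetic example: three super-Gaussian sources mixed linearly.
rng = np.random.default_rng(1)
S_true = rng.laplace(size=(3, 5000))
A_true = rng.standard_normal((3, 3))
X = A_true @ S_true
W, U, A = fastica_sketch(X)
```

By construction the rows of `U` are uncorrelated with unit variance, and `A @ U` reproduces the centred data; the components are recovered only up to sign and order, as noted above.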
ICA models of gene expression data
Now let the n x p matrix X denote the gene expression data (generally, n << p), where $x_{ij}$ is the expression level of the j-th gene in the i-th assay. The i-th row of X, $r_i$ (a p-dimensional vector), denotes the snapshot of the i-th assay (cell sample); in the gene expression literature, the problem is usually formulated using the transposed matrix $X^{T}$. Alternatively, the j-th column of X, $c_j$ (an n-dimensional vector), is the expression profile of the j-th gene. We suppose that the data have already been preprocessed and normalized, i.e. every sample has mean zero and standard deviation one.
Regardless of which algorithm is used to compute ICA, we can apply ICA to model gene expression data as shown in Figure 1. In this model, the snapshots $r_i$ in X are considered to be a linear mixture of statistically independent basis snapshots (eigenassays) S combined by an unknown mixing matrix A. The ICA algorithm learns the weight matrix W, which is used to recover a set of independent eigenassays in the rows of U. In this architecture, the snapshots $r_i$ are the variables and the gene expression profile values provide the observations of these variables. Essentially, this model coincides with the traditional ICA model of the cocktail-party problem (Comon, 1994). Projecting the input snapshots onto the learned weight vectors produces the independent basis snapshots. The corresponding mixing and unmixing models can be represented as follows:
$X = AS$ | (4) |
$U = WX$ | (5) |
In this approach, ICA is used to find a matrix W such that the rows of U are as statistically independent as possible. The independent eigenassays estimated by the rows of U are then used to represent the snapshots. The representation of a snapshot consists of its coordinates with respect to the eigenassays defined by the rows $u_1, \ldots, u_n$ of U, i.e.
$r_j = a_{j1}u_1 + a_{j2}u_2 + \cdots + a_{jn}u_n$ | (6) |
These coordinates are contained in the rows of the mixing matrix $A = W^{-1}$. Clearly, every coordinate vector $a_j$ (row of A) is n-dimensional, while the snapshot $r_j$ is p-dimensional. In general, the number of genes in a single assay is in the thousands, while the number of assays is at most in the hundreds. The above procedure can therefore be used to compress the gene expression data.
From another viewpoint, the gene expression profiles (columns of X) can be regarded as points in a multidimensional space with dimensions corresponding to the number of samples. The linear ICA model X=AS represents the gene expression profiles (the columns of X) by a new set of basis vectors (the columns of A, Fig. 2). This idea is based on the assumptions that, first, the gene expression profiles are determined by a combination of hidden regulatory variables, which were called expression modes. Second, the genes' responses to these variables can be approximated by linear functions (Liebermeister, 2002; Hori et al., 2001). Expression mode k is characterized by its profile over the samples (k-th column of A) and by its linear influences on the genes (k-th row of S). In this paper, we just use this idea to find a good set of basis profiles (eigengenes) to represent gene expression data so that they can be reasonably regularized.
Search for the consensus eigenassays
Chiappetta et al. (2004) pointed out that, unlike PCA, ICA requires searching for the maxima of a target function in a high-dimensional configuration space. One therefore often encounters local maxima in which most algorithms may get stuck, and the result may be sensitive to initialization. We also found in our experiments that, compared with PCA, ICA is not always reproducible when used to analyze gene expression data, a problem also reported by Liebermeister (2002). In addition, the results obtained from an ICA algorithm are not ordered. Chiappetta et al. (2004) attributed the lack of reproducibility to convergence of the ICA algorithm to local optima, and proposed a consensus source (eigenassay) search algorithm that yields extremely stable and robust estimates of the eigenassays, together with indications of their stability.
In this paper, we use the method proposed by Chiappetta et al. (2004) to overcome these difficulties. The procedure is as follows. The independent source estimation is run several times (say, 100 times) with different random initializations, and consensus sources are recorded: eigenassays that are obtained with a frequency larger than a certain threshold are retained, and their frequencies of appearance are recorded and used as credibility indices. As a result, one is led to a (variable, data-driven) number of average consensus eigenassays s1,...,sn.
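The consensus idea can be sketched as follows. This is a deliberately simplified stand-in, not the algorithm of Chiappetta et al. (2004): it assumes that the components of the first run serve as candidates, that two components match when their absolute Pearson correlation reaches a hypothetical `corr_threshold`, and that a candidate is kept when it is matched in more than a fraction `min_frequency` of the runs. The function name and both parameters are illustrative.

```python
import numpy as np

def consensus_components(runs, corr_threshold=0.9, min_frequency=0.5):
    """Greedy consensus over repeated ICA runs (simplified illustration).

    runs : list of (k, p) arrays; the rows of each array are the components
           estimated in one ICA run (arbitrary sign and order).
    Returns (consensus, credibility): sign-aligned averages of the retained
    components and the fraction of runs in which each one was found.
    """
    def unit(v):
        v = v - v.mean()
        return v / np.linalg.norm(v)

    references = [unit(c) for c in runs[0]]      # candidates from the first run
    consensus, credibility = [], []
    for ref in references:
        matches = []
        for run in runs:
            comps = np.array([unit(c) for c in run])
            corr = comps @ ref                   # Pearson correlations with ref
            j = int(np.argmax(np.abs(corr)))
            if abs(corr[j]) >= corr_threshold:
                matches.append(np.sign(corr[j]) * comps[j])   # align the sign
        freq = len(matches) / len(runs)
        if freq > min_frequency:
            consensus.append(np.mean(matches, axis=0))
            credibility.append(freq)
    return np.array(consensus), np.array(credibility)

# Synthetic runs: two reproducible components plus one that changes every run.
rng = np.random.default_rng(0)
base = rng.standard_normal((2, 200))
runs = []
for _ in range(10):
    stable = base + 0.05 * rng.standard_normal(base.shape)
    stable *= rng.choice([-1.0, 1.0], size=(2, 1))   # arbitrary signs
    unstable = rng.standard_normal((1, 200))          # not reproducible
    run = np.vstack([stable, unstable])
    runs.append(run[rng.permutation(3)])              # arbitrary order
consensus, credibility = consensus_components(runs)
```

In this toy setting only the two reproducible components survive, each with credibility 1.0; the component that differs in every run is discarded.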
Finally, the corresponding consensus mixing matrix A is computed as follows:
$A = XS^{T}V$ | (7) |
Here V is the inverse of the n x n matrix C of the scalar products of the consensus eigenassays, $C_{kl} = \langle s_k, s_l \rangle$, i.e. $C = SS^{T}$. For more details, please see Chiappetta et al. (2004).
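As a small numerical check of this step, the sketch below assumes that the consensus mixing matrix is the least-squares solution of X ≈ AS, i.e. A = X Sᵀ (S Sᵀ)⁻¹, which matches the description of V as the inverse of the matrix of scalar products. The function name and the synthetic data are illustrative; with a data-driven number m of consensus eigenassays, C is m x m.

```python
import numpy as np

def consensus_mixing_matrix(X, S):
    """Least-squares mixing matrix A = X S^T (S S^T)^(-1) for X ~ A S."""
    C = S @ S.T                  # scalar products of the consensus eigenassays
    V = np.linalg.inv(C)
    return X @ S.T @ V

# Synthetic check: 6 snapshots built from 4 consensus eigenassays.
rng = np.random.default_rng(0)
S = rng.laplace(size=(4, 300))
A_true = rng.standard_normal((6, 4))
X = A_true @ S
A = consensus_mixing_matrix(X, S)
```

When X is exactly A_true S and the rows of S are linearly independent, the computed A coincides with A_true; on real data it is the best approximation in the least-squares sense.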
Interpretation of ICA results
The ICA model states that different modes exert independent influences on the genes. To interpret in more detail, the first step of the analysis is the study of the mixing matrix A. For a fixed eigenassay, say eigenassay i, the coefficients aji represent the projection of snapshot j on source i, or the importance of eigenassay i in snapshot j. If one believes in the linear mixture of independent eigenassay model and accepts identifying a source with a regulation pathway in first approximation, the coefficients aji would allow one to assert to which extent the eigenassay i was (positively or negatively) active in snapshot j.
In addition, the distribution of the values in a column of the mixing matrix A is often interesting and may reveal specific features of the dataset. Particularly interesting is the situation where the distribution of mixing coefficients for a given eigenassay exhibits bimodal or multimodal behavior. This indicates that the source under consideration discriminates well between two or more different classes of conditions. However, as Chiappetta et al. (2004) pointed out, even though bimodal distributions yield spectacular results, good discrimination may also be obtained without such behavior.
A second step in the interpretation of ICA results is to analyze carefully the behavior of specific genes in different eigenassays. A given independent eigenassay is generally characterized by a number of significantly overexpressed (or underexpressed) genes. Putting such genes into correspondence with snapshots, or with clinical data, can be extremely informative. Because the main aim of this paper is not the biological interpretation of ICA results for microarray data, and because this issue has been addressed in many publications, we do not discuss it in detail here. Interested readers can refer to Liebermeister (2002), Hori et al. (2001), Martoglio et al. (2002) and Chiappetta et al. (2004).
ICA and PCA
PCA can be derived as a special case of ICA that uses Gaussian source models. The assumption of Gaussian sources implicit in PCA makes it inadequate when the true sources are non-Gaussian. In particular, we have empirically observed that many gene expression data are sparse or super-Gaussian signals (all four datasets used in this paper are super-Gaussian signals). When sparse source models are appropriate, ICA has the following potential advantages over PCA: (1) It provides a better probabilistic model of the data, which better identifies where the data concentrate in n-dimensional space. (2) It uniquely identifies the mixing matrix A. (3) It finds a not necessarily orthogonal basis, which may reconstruct the data better than PCA in the presence of noise. (4) It is sensitive to high-order statistics in the data, not just the covariance matrix (Bartlett et al., 2002).
Figure 3 illustrates these points with an example. The figure shows samples from a three-dimensional (3D) distribution constructed by linearly mixing two high-kurtosis sources. The figure shows the basis vectors found by PCA and by ICA on this problem. Since the three ICA basis vectors are non-orthogonal, they change the relative distance between data points. This change in metric may be potentially useful for classification algorithms, like nearest neighbor, that make decisions based on relative distances between points. The ICA basis also alters the angles between data points, which affects similarity measures such as cosines. Moreover, if an undercomplete basis set is chosen, PCA and ICA may span different subspaces. For example, in Figure 3, when only two dimensions are selected, PCA and ICA choose different subspaces (Bartlett et al., 2002).
It should be noted that ICA is a very general technique. When super-Gaussian sources are used, ICA can be seen as doing something akin to non-orthogonal PCA and to cluster analysis. When the source models are sub-Gaussian, however, the relationship between these techniques is less clear. See Lee et al. (1999) for a discussion of ICA in the context of sub-Gaussian sources.
Penalized regression models
In this section, we briefly outline two types of regularized regression models, i.e. ridge regression and principal component regression.
Ridge regression
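Consider the standard regression model
$y = X\beta + \varepsilon$ | (8) |
where y is the n x 1 response vector, $\beta$ is the p x 1 vector of regression coefficients and $\varepsilon$ is a random vector with zero mean and unit variance. Because n is smaller than p in this paper, the usual ordinary least squares estimator is not well defined. An alternative is to use the ridge regression estimator of $\beta$ in Equation (8):
$\hat{\beta}_{\lambda} = (X^{T}X + \lambda I)^{-1}X^{T}y$ | (9) |
where $\lambda > 0$ is a constant. The parameter $\lambda$ controls the amount of shrinkage in the data.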
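A minimal NumPy sketch of ridge regression in the p > n setting follows. It assumes the standard estimator (XᵀX + λI)⁻¹Xᵀy and also shows the algebraically equivalent form Xᵀ(XXᵀ + λI)⁻¹y, which only requires solving an n x n system and is therefore convenient when p is in the thousands. The function names, the value of `lam` and the random data are illustrative.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Ridge estimate via the p x p system (X^T X + lam I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_dual(X, y, lam):
    """Same estimate via the n x n system: beta = X^T (X X^T + lam I)^(-1) y."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Synthetic data with far more variables than samples.
rng = np.random.default_rng(0)
n, p = 20, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta_primal = ridge_primal(X, y, lam=1.0)
beta_dual = ridge_dual(X, y, lam=1.0)
```

Both functions return the same coefficient vector, and increasing `lam` shrinks its norm toward zero.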
Principal component regression
The method of principal components regression can be traced back to Massy (1965). To use this method, we first perform a singular value decomposition of the gene data X:
$X = UDV^{T}$ | (10) |
where U is an n x n matrix with orthonormal columns (the left singular vectors, not the ICA output U above), $D = \mathrm{diag}(d_1, \ldots, d_n)$ contains the singular values, and V is a p x n matrix with orthonormal columns. Plugging this decomposition into Equation (8), we have
$y = UD\gamma + \varepsilon$ | (11) |
where $\gamma = V^{T}\beta$. We can fit the model in Equation (11) using ordinary least squares and obtain an estimate of $\beta$ by multiplying the least squares estimator of $\gamma$ in Equation (11) by V.
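The following sketch illustrates principal components regression through the singular value decomposition. It assumes the thin SVD X = U D Vᵀ and a hypothetical parameter `k` for the number of leading components retained; the function name and the random data are illustrative.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal components regression keeping the k leading components.

    Regresses y on the component scores U_k D_k by ordinary least squares
    and maps the coefficients back to the original variables with V_k.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) Vt
    scores = U[:, :k] * d[:k]                          # n x k matrix U_k D_k
    gamma, *_ = np.linalg.lstsq(scores, y, rcond=None)
    return Vt[:k].T @ gamma                            # beta = V_k gamma

# Synthetic data with more variables than samples.
rng = np.random.default_rng(0)
n, p = 10, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta_full = pcr_fit(X, y, k=n)    # all components: interpolates the data
beta_3 = pcr_fit(X, y, k=3)       # regularized fit with three components
```

With all n components the fit reproduces y exactly, which is why the number of components acts as the regularization parameter; keeping fewer components gives a smoother, more constrained fit.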
Optimal scoring
In the previous section, we described two penalized regression models that have been used successfully in other applications, such as chemometrics. In this paper, however, our goal is classification. We therefore first need to re-express the classification problem as a regression problem. This is done using the optimal scoring algorithm. The point of optimal scoring is to turn categorical variables into quantitative ones by assigning scores to classes (categories).
Let $g_i$ denote the tumor class of the i-th sample (i = 1, ..., n). We assume that there are G tumor classes, so that $g_i$ takes values in {1, ..., G}. We first convert $g = (g_1, \ldots, g_n)^{T}$ into an n x G matrix $Y = (y_{ij})$, where $y_{ij} = 1$ if the i-th sample falls into class j, and 0 otherwise. Let $\theta_k(g) = (\theta_k(g_1), \ldots, \theta_k(g_n))^{T}$ (k = 1, ..., J) be the n x 1 vector of quantitative scores that the k-th scoring map assigns to g. The optimal scoring problem involves finding the coefficients $\beta_k$ and the scoring maps $\theta_k$ that minimize the following average squared residual:
$\mathrm{ASR} = \frac{1}{n}\sum_{k=1}^{J}\sum_{i=1}^{n}\left(\theta_k(g_i) - x_i^{T}\beta_k\right)^2$ | (12) |
where $x_i$ is the i-th row of X written as a column vector. Let $\Theta$ (G x J, J ≤ G − 1) be the matrix of the J score vectors for the G classes, i.e. its k-th row holds the scores $(\theta_1(k), \ldots, \theta_J(k))$ assigned to class k. Assume that the minimization of Equation (12) is subject to the constraint $\Theta^{T}D_p\Theta = I$ with $D_p = Y^{T}Y/n$; then, as mentioned by Hastie et al. (1994), the minimization of this constrained optimization problem leads to estimates of $\beta_k$ that are proportional to the discriminant variables in linear discriminant analysis. Interested readers can refer to Hastie et al. (1994, 1995).
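For readers who prefer matrix notation, the criterion can be written compactly. The block below assumes that Equation (12) is the usual average squared residual of Hastie et al. (1994); the symbols B (the p x J matrix whose columns are the coefficient vectors), Θ (the G x J score matrix), D_p = YᵀY/n and the Frobenius norm are notation adopted here for illustration.

```latex
\mathrm{ASR}(\Theta, B)
  = \frac{1}{n}\sum_{k=1}^{J}\sum_{i=1}^{n}
      \bigl(\theta_k(g_i) - x_i^{T}\beta_k\bigr)^{2}
  = \frac{1}{n}\,\lVert Y\Theta - XB \rVert_F^{2},
\qquad
\Theta^{T} D_p\,\Theta
  = \frac{1}{n}\,(Y\Theta)^{T}(Y\Theta) = I,
\quad
D_p = \frac{Y^{T}Y}{n}.
```

The identity holds because row i of Y is the indicator of class $g_i$, so entry (i, k) of YΘ is exactly $\theta_k(g_i)$, while entry (i, k) of XB is $x_i^{T}\beta_k$. The constraint therefore says that the J score vectors, evaluated on the training samples, are orthonormal up to the factor 1/n.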
Penalized optimal scoring for classification
So far, we have outlined the components necessary for the implementation of our procedure. In this section, we present our algorithm for classifying the tumor samples.
We propose to use ICA for regularizing the gene expression data and then use a penalized optimal scoring procedure for classification. The outline of our method is shown as follows:
Step 1: Use the ICA model X = AS to represent the gene expression data, i.e. use ICA and the consensus sources algorithm to calculate the eigengenes (columns of A) and the independent coefficients (rows of S).
Step 2: Choose an initial score matrix $\Theta_{G \times J}$ with J ≤ G − 1 satisfying $\Theta^{T}D_p\Theta = I$, where $D_p = Y^{T}Y/n$. Let $\Theta_0^{*} = Y\Theta$.
Step 3: Fit a multivariate penalized regression model of $\Theta_0^{*}$ on A, yielding the fitted values $\hat{\Theta}_0^{*}$ and the fitted regression function $\hat{\eta}(A)$. Let $\hat{\eta}(X) = S^{+}\hat{\eta}(A)$ be the vector of the fitted regression function on X, where $S^{+}$ is the pseudoinverse of S.
Step 4: Obtain the eigenvector matrix $\Phi$ of $\Theta_0^{*T}\hat{\Theta}_0^{*}$, and hence the optimal scores $\Theta\Phi$.
Step 5: Let $\hat{\eta}(x) \leftarrow \Phi^{T}\hat{\eta}(x)$.
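Steps 2 to 5 can be sketched in NumPy as follows. This is an illustration under several assumptions that are not taken from the paper: the penalized regression in Step 3 is replaced by a ridge fit with a hypothetical parameter `lam`, the columns of A are centred before fitting, the initial scores are built by a QR orthonormalization that removes the trivial constant score, and the eigenvalues are normalized by n. The function names and the synthetic two-class data are illustrative.

```python
import numpy as np

def initial_scores(Y):
    """G x (G-1) score matrix with Theta^T D_p Theta = I and zero-mean scores."""
    n, G = Y.shape
    Dp = Y.T @ Y / n
    L = np.linalg.cholesky(Dp)               # Dp = L L^T (every class non-empty)
    M = np.eye(G)
    M[:, 0] = 1.0                            # first column: constant score
    Q, _ = np.linalg.qr(L.T @ M)             # orthonormalize in the Dp metric
    Theta = np.linalg.solve(L.T, Q)          # Theta^T Dp Theta = Q^T Q = I
    return Theta[:, 1:]                      # drop the constant score

def optimal_scoring_sketch(A, labels, G, lam=1e-3):
    """Optimal scoring on the representation A (rows = samples)."""
    n = A.shape[0]
    Y = np.eye(G)[labels]                    # n x G indicator matrix
    Theta0 = initial_scores(Y)               # Step 2
    Theta0_star = Y @ Theta0
    mean = A.mean(axis=0)
    Ac = A - mean
    # Step 3 (assumed ridge penalty): regress the scored responses on A.
    B = np.linalg.solve(Ac.T @ Ac + lam * np.eye(A.shape[1]), Ac.T @ Theta0_star)
    fitted = Ac @ B
    # Step 4: eigen-decomposition of Theta0*^T fitted / n.
    M = Theta0_star.T @ fitted / n
    alpha2, Phi = np.linalg.eigh((M + M.T) / 2.0)
    order = np.argsort(alpha2)[::-1]         # largest eigenvalue first
    alpha2, Phi = alpha2[order], Phi[:, order]
    Theta = Theta0 @ Phi                     # optimal scores
    coef = B @ Phi                           # Step 5: eta(a) = coef^T (a - mean)
    return Theta, alpha2, coef, mean

# Synthetic two-class example with well-separated classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 15)
A = rng.standard_normal((30, 4))
A[labels == 1, 0] += 8.0
Theta, alpha2, coef, mean = optimal_scoring_sketch(A, labels, G=2)
eta = (A - mean) @ coef                      # fitted discriminant variable
```

On this toy data the single discriminant variable `eta` separates the two classes completely, and the returned eigenvalue lies between 0 and 1, as expected for a regression of normalized scores.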
The objective function minimized in Step 3 is the following expression:
$\mathrm{ASR} = \frac{1}{n}\sum_{k=1}^{J}\sum_{i=1}^{n}\left(\theta_k(g_i) - a_i^{T}\gamma_k\right)^2$ | (13) |
where $a_i$ is the i-th row of A written as a column vector and $\gamma_k = S\beta_k$. Another problem is how to choose the initial values for $\Theta$; readers can refer to the discussion of this problem in Hastie et al. (1994). In fact, our algorithm is similar to the algorithm proposed by Ghosh (2003), except that we replace principal components with independent components.
Once the algorithm has been run, we have a discriminant rule for classifying new samples. We use the nearest centroid rule to form the classifier, i.e. we assign a new sample $x_{new}$ to the class j that minimizes
$\delta(x_{new}, j) = \lVert D(\hat{\eta}(x_{new}) - \bar{\eta}^{j})\rVert^2$ | (14) |
where $\bar{\eta}^{j}$ denotes the fitted centroid of the j-th class and D is a diagonal matrix with diagonal elements
$D_{kk} = \left[\frac{1}{\alpha_k^2(1 - \alpha_k^2)}\right]^{1/2}$ | (15) |
where $\alpha_k^2$ is the k-th largest eigenvalue calculated in Step 4 of the algorithm.
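The nearest centroid rule can be sketched as below. The weighting of each discriminant coordinate by [α²(1 − α²)]^(-1/2) is the form we assume from Hastie et al. (1994); the function name, argument names and the tiny one-dimensional example are illustrative.

```python
import numpy as np

def nearest_centroid_predict(eta_train, labels, alpha2, eta_new):
    """Assign each row of eta_new to the class with the nearest fitted centroid.

    eta_train : (n, J) fitted discriminant variables of the training samples.
    labels    : (n,) class labels of the training samples.
    alpha2    : (J,) eigenvalues from the optimal scoring step, each in (0, 1).
    eta_new   : (m, J) discriminant variables of the samples to classify.
    Coordinate k is scaled by D_kk = [alpha2_k * (1 - alpha2_k)] ** -0.5
    (assumed form of the weights).
    """
    classes = np.unique(labels)
    D = 1.0 / np.sqrt(alpha2 * (1.0 - alpha2))
    centroids = np.array([eta_train[labels == c].mean(axis=0) for c in classes])
    diff = (eta_new[:, None, :] - centroids[None, :, :]) * D
    return classes[np.argmin((diff ** 2).sum(axis=2), axis=1)]

# Tiny example with a single discriminant variable.
eta_train = np.array([[-1.0], [-1.2], [1.0], [0.8]])
labels = np.array([0, 0, 1, 1])
alpha2 = np.array([0.8])
eta_new = np.array([[-0.9], [0.7]])
predicted = nearest_centroid_predict(eta_train, labels, alpha2, eta_new)
```

With a single discriminant variable the weighting does not change the decision; it matters when several coordinates with different eigenvalues are combined.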
Choosing the optimal amount of regularization
Ridge regression, principal components regression and independent component regression models all involve a regularization parameter that must be selected in advance. In ridge regression, the regularization parameter is $\lambda$, while for principal components regression the parameter to be chosen is the number of components included in the model. For the ICA regression model, both the number and the subset of eigengenes (columns of A) included in the model need to be chosen. In contrast to PCA, where feature (principal component) subset selection is based on an energy criterion, the selection of an ICA basis subset is not immediately obvious, since the energies of the independent components cannot be determined. Furthermore, it is conjectured that a feature selection scheme focused on recognition rather than on reconstruction could improve classification performance. With this goal in mind, we used the sequential floating forward selection (SFFS) technique (Feri et al., 1994) to find the most discriminating ICA features (columns of A, every eigengene corresponding to an eigenassay).
In the SFFS method, features are selected successively by adding the locally best feature, i.e. the one that provides the highest incremental discriminatory information, to the existing feature subset. In addition, the SFFS method goes through cleaning periods, in which features are removed systematically as long as performance improves after pruning. We use leave-one-out cross-validation on the training dataset to determine the number of components to include in the model. For details of SFFS, see Feri et al. (1994).
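A compact sketch of sequential floating forward selection is given below. It is a generic version of the technique, not the authors' implementation: `score` is a hypothetical callback (for example, the leave-one-out accuracy of a classifier restricted to the given columns of A), and the toy score function exists only to exercise the code.

```python
def sffs(n_features, score, k):
    """Sequential floating forward selection of k feature indices.

    score(subset) -> float, larger is better (hypothetical callback).
    Requires 1 <= k <= n_features.
    """
    selected = []
    best = {}                                   # subset size -> (score, subset)
    while len(selected) < k:
        # Forward step: add the single feature that helps most.
        candidates = [f for f in range(n_features) if f not in selected]
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))
        # Floating backward steps: drop a feature only while doing so beats
        # the best subset previously recorded at the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return best[k][1]

# Toy score: additive feature weights, with a penalty when the two
# redundant features 1 and 3 are selected together.
weights = [0.1, 0.9, 0.5, 0.7]

def toy_score(subset):
    penalty = 0.6 if 1 in subset and 3 in subset else 0.0
    return sum(weights[f] for f in subset) - penalty

chosen = sffs(4, toy_score, k=2)
```

In the toy example the strongest feature is picked first and the redundant one is then passed over in favor of a complementary feature, which is the behavior a recognition-oriented criterion is meant to encourage.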
| RESULTS |
|---|
In this section, we shall demonstrate the efficiency and effectiveness of the proposed methodology described above by classifying four datasets with various human tumor samples.
Datasets
In this study, four publicly available microarray datasets are used to study the tumor classification problem. They are colon cancer data (Alon et al., 1999), acute leukemia data (Golub et al., 1999), hepatocellular carcinoma data (Iizuka et al., 2003) and high-grade glioma data (Nutt et al., 2003), respectively. In these datasets, all data samples have already been assigned to a training set or test set.
An overview of the characteristics of all the datasets can be found in Table 1. The acute leukemia data (Golub et al., 1999) have been used frequently in previous microarray data analysis studies. This dataset was preprocessed by thresholding and log-transforming the original data, similar to the procedure in the original publication. Thresholding restricts gene expression levels to be at least 20, i.e. expression levels smaller than 20 are set to 20. For the log-transformation, the natural logarithm of the expression levels is taken. No further preprocessing was applied to the remaining datasets.
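The thresholding and log-transformation described above amount to the following few lines. The function name and the example values are illustrative; only the floor of 20 and the natural logarithm come from the text.

```python
import math

def preprocess(levels, floor=20.0):
    """Raise expression levels below `floor` to `floor`, then take natural logs."""
    return [math.log(max(v, floor)) for v in levels]

raw = [5.0, 20.0, 100.0]          # made-up expression levels
transformed = preprocess(raw)
```

Values below the floor all map to log(20), so very low and noisy intensities no longer dominate after the transformation.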
Experimental results
We now use the proposed methodology to classify the tumor data. Since all data samples in these four datasets have already been assigned to a training set or test set, we built the classification models using the training samples and estimated the classification correct rates using the test set.
To obtain reliable, comparable and repeatable results across the numerical experiments, this study uses not only the original division of each dataset into training and test sets but also random reshufflings of all datasets. That is, all numerical experiments were also performed on 20 random splits of each of the four original datasets. The splits are stratified, which means that each randomized training and test set contains the same number of samples of each class as the original training and test set.
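A stratified random split of this kind can be sketched with the standard library alone. The function name, the use of the seed as split index and the class counts in the example are illustrative and do not correspond to any of the four datasets.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, train_counts, seed):
    """Randomly split sample indices, keeping fixed per-class training counts.

    labels       : list of class labels, one per sample.
    train_counts : dict mapping each class to its number of training samples.
    Returns (train_indices, test_indices), both sorted.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for lab, idxs in by_class.items():
        idxs = idxs[:]
        rng.shuffle(idxs)
        k = train_counts[lab]
        train.extend(idxs[:k])
        test.extend(idxs[k:])
    return sorted(train), sorted(test)

# Made-up example: 12 tumor and 8 normal samples, 20 random splits.
labels = ["tumor"] * 12 + ["normal"] * 8
counts = {"tumor": 8, "normal": 5}
splits = [stratified_split(labels, counts, seed) for seed in range(20)]
```

Every split keeps the per-class composition of the training and test sets fixed, while the assignment of individual samples varies with the seed.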
We used penalized independent component regression (P-ICR) proposed in this paper to analyze the four gene expression datasets. In the experiment, we have sought as many independent components as tumor samples in every run of ICA and as many consensus eigenassays as tumor samples. Also, before choosing the eigenassays with SFFS, the eigenassays with credibility <20% are deleted. For comparison, we also used penalized ridge regression (P-RR), penalized principal component regression (P-PCR) proposed in (Ghosh, 2003) and PAM (Tibshirani et al., 2002) to do the same tumor classification experiment.
The classification results for tumor and normal tissues obtained with our proposed penalized methods are displayed in Table 2. For each classification problem, the table gives the mean and standard deviation of the accuracy over the original split and the 20 randomizations described above. Since the training and test sets of each random split are disjoint, the accuracy estimates in Table 2 are unbiased for each split; they can nevertheless be too optimistic in general.
To show the efficiency and feasibility of the proposed method, the results of nine other methods (Methods 1-9) are also listed in Table 2 for comparison. Each of these nine methods consists of two steps: dimensionality reduction and classification. For dimensionality reduction, classical PCA as well as kernel PCA (with a linear or RBF kernel) are used. Fisher discriminant analysis (FDA) and the least squares support vector machine (LS-SVM) are then used for classification. Note that these methods and results were previously reported by Pochet et al. (2004), who divided each dataset into training and test sets in the same way as we do. See Pochet et al. (2004) for details of these nine methods.
From Table 2 we can see that for the colon, leukemia and glioma datasets, our proposed method is indeed efficient and feasible. For the hepatocellular carcinoma data, however, the classification result of our method is not perfect, and the other two methods used in our experiment (Methods 11 and 13) perform even worse on this dataset. Table 2 also shows that no single method achieves the best classification performance on all four datasets.
The relationship between the credibility and the discrimination of eigenassay
Table 3 shows the credibility and the discrimination of each of the 30 eigenassays (strictly speaking, the discrimination of the corresponding eigengene) extracted in one experiment (running ICA 100 times and deleting eigenassays as described in the Experimental results section). In this experiment on the colon data, the discrimination of an eigenassay is measured as its leave-one-out cross-validation accuracy on the training set. Table 4 shows the credibility of the 10 eigenassays selected, in order, by the SFFS algorithm in each of five experiments (using five random splits of the colon data, running ICA 100 times and choosing 10 eigenassays using SFFS). From Table 3 alone, we cannot find a definite relationship between the credibility and the discrimination of an eigenassay. From Table 4, however, we can see that most of the selected eigenassays have high credibility. We also found that not all high-credibility eigenassays are selected for classification. Similar results were found in experiments on the other three datasets.
| CONCLUSIONS |
|---|
In this paper, we presented an ICA-based method for the classification of tumors from microarray gene expression data. The methodology regularizes gene expression data using ICA and then classifies them with a penalized discriminant method. We compared the experimental results of our method with those of 12 other methods; the comparison shows that our method is effective and efficient in distinguishing normal and tumor samples from four human tissues. Furthermore, these results hold under re-randomization of the samples.
Because we currently have no suitable multiclass gene expression data at hand, we studied only binary tumor classification problems in the experiments. Our method is, however, inherently applicable to multiclass problems and can be used to solve them directly.
We also found in our experiments that, compared with PCA, ICA is not always reproducible when used to analyze gene expression data. This problem was also reported by Chiappetta et al. (2004) and Liebermeister (2002). Chiappetta et al. (2004) attributed it to convergence of the ICA algorithm to local optima and proposed a consensus source search algorithm that yields extremely stable and robust estimates of the sources, together with indications of their stability. In this paper, we use this method to address the instability of ICA.
In future work, we will study the ICA model of gene expression data more broadly, apply the proposed method to multiclass tumor classification problems, and study how to make full use of the information contained in the gene data to constrain ICA models so that more accurate prediction of tumor class can be achieved.
| Acknowledgments |
|---|
The authors are grateful to Hong-Qiang Wang and Zhan-Li Sun for helpful discussions on this paper. This work was supported by the National Science Foundation of China (nos 30570368 and 60472111).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on November 21, 2005; revised on April 27, 2006; accepted on May 11, 2006
| REFERENCES |
|---|
Alizadeh, A.A., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.
Alon, U., et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 6745-6750.
Bartlett, M.S., et al. (2002) Face recognition by independent component analysis. IEEE Trans. Neural Netw., 13, 1450-1464.
Bhattacharjee, A., et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA, 98, 13790-13795.
Bittner, M., et al. (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406, 536-540.
Chiappetta, P., et al. (2004) Blind source separation and the analysis of microarray data. J. Comput. Biol., 11, 1090-1109.
Comon, P. (1994) Independent component analysis, a new concept? Signal Processing, 36, 287-314.
Dudoit, S., et al. (2002) Comparison of discrimination methods for the classification of tumor using gene expression data. J. Am. Stat. Assoc., 97, 77-87.
Feri, F.J., et al. (1994) Comparative study of techniques for large-scale feature selection. In Gelsema, E.S. and Kanal, L.S. (eds), Pattern Recognition in Practice IV, Multiple Paradigms, Comparative Studies and Hybrid Systems. Elsevier, Amsterdam, pp. 403-413.
Frank, I.E. and Friedman, J.H. (1993) A statistical view of some chemometric regression tools. Technometrics, 35, 109-143.
Furey, T.S., et al. (2000) Support vector machines classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906-914.
Ghosh, D. (2003) Penalized discriminant methods for the classification of tumors from microarray experiments. Biometrics, 59, 992-1000.
Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
Hastie, T., et al. (1995) Penalized discriminant analysis by optimal scoring. Ann. Stat., 23, 73-102.
Hastie, T., et al. (1994) Flexible discriminant analysis by optimal scoring. J. Am. Stat. Assoc., 89, 1255-1270.
Hori, G., Inoue, M., Nishimura, S. and Nakahara, H. (2001) Blind gene classification based on ICA of microarray data. Proc. 3rd Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2001), San Diego, USA, pp. 332-336.
Hyvärinen, A. (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw., 10, 626-634.
Hyvärinen, A., Karhunen, J. and Oja, E. (2001) Independent Component Analysis. Wiley, NY.
Iizuka, N., et al. (2003) Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. Lancet, 361, 923-929.
Lee, T.-W., et al. (1999) Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Comput., 11, 417-441.
Liebermeister, W. (2002) Linear modes of gene expression determined by independent component analysis. Bioinformatics, 18, 51-60.
Martoglio, A.-M., et al. (2002) A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics, 18, 1617-1624.
Massy, W.F. (1965) Principal components regression in exploratory statistical research. J. Am. Stat. Assoc., 60, 234-246.
Nutt, C.L., et al. (2003) Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res., 63, 1602-1607.
Pochet, N., et al. (2004) Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction. Bioinformatics, 20, 3185-3195.
Selaru, F.M., et al. (2002) Artificial neural networks distinguish among subtypes of neoplastic colorectal lesions. Gastroenterology, 122, 606-613.
Shipp, M.A., et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med., 8, 68-74.
Teschendorff, A.E., et al. (2005) A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics, 21, 3025-3033.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, 99, 6567-6572.
van't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530-536.
West, M. (2003) Bayesian factor regression models in the large p, small n paradigm. Bayesian Stat., 7, 723-732.
West, M., et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA, 98, 11462-11467.
Zheng, C.H., et al. (2005) Post-nonlinear blind source separation using neural networks with sandwiched structure. LNCS, 3497, 478-483.
Zheng, C.H., et al. (2006) Nonnegative independent component analysis based on minimizing mutual information technique. Neurocomputing, 69, 878-883.