Skip Navigation


Bioinformatics Advance Access originally published online on January 10, 2006
Bioinformatics 2006 22(6):755-761; doi:10.1093/bioinformatics/btk036
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/6/755    most recent
btk036v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Z.
Right arrow Articles by Hoffman, E. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Z.
Right arrow Articles by Hoffman, E. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data

Zuyi Wang 1,2, Yue Wang 3,*, Jianhua Xuan 2, Yibin Dong 3, Marina Bakay 1, Yuanjian Feng 3, Robert Clarke 4 and Eric P. Hoffman 1

1Center for Genetic Medicine, Children's National Medical Center Washington, DC 20010, USA
2Department of Electrical Engineering and Computer Science, The Catholic University of America Washington, DC 20064, USA
3The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University Arlington, VA 22203, USA
4Departments of Oncology, Physiology and Biophysics, Lombardi Comprehensive Cancer Center, Georgetown University Washington, DC 20007, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY AND METHOD
 3 EXPERIMENTAL VERIFICATION
 4 CONCLUSIONS AND DISCUSSIONS
 REFERENCES
 

Motivation: Multilayer perceptrons (MLP) represent one of the widely used and effective machine learning methods currently applied to diagnostic classification based on high-dimensional genomic data. Since the dimensionalities of the existing genomic data often exceed the available sample sizes by orders of magnitude, the MLP performance may degrade owing to the curse of dimensionality and over-fitting, and may not provide acceptable prediction accuracy.

Results: Based on Fisher linear discriminant analysis, we designed and implemented an MLP optimization scheme for a two-layer MLP that effectively optimizes the initialization of MLP parameters and MLP architecture. The optimized MLP consistently demonstrated its ability in easing the curse of dimensionality in large microarray datasets. In comparison with a conventional MLP using random initialization, we obtained significant improvements in major performance measures including Bayes classification accuracy, convergence properties and area under the receiver operating characteristic curve (Az).

Supplementary information: The Supplementary information is available on http://www.cbil.ece.vt.edu/publications.htm

Contact: yuewang{at}vt.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY AND METHOD
 3 EXPERIMENTAL VERIFICATION
 4 CONCLUSIONS AND DISCUSSIONS
 REFERENCES
 
Diagnostic classification with genomic data refers to the assignment of a particular unknown tissue sample to a known disease class based on its quantitative mRNA expression pattern from microarrays. This classification can be performed by a trained predictive classifier, such as a neural network classifier. This approach is particularly helpful for diagnosing complex genetic disease subtypes or stages whose subtle differences may be difficult to recognize by traditional clinical and pathological approaches (Bittner et al., 2000; Brown et al., 2000; Khan et al., 2001; Mjolsness and DeCoste, 2001; Ramaswamy et al., 2001; Shipp et al., 2002; West et al., 2001; Linder et al., 2004; O'Neill and Song, 2003; Wei et al., 2004). A common type of neural network classifier applied to diagnostic classification is feed-forward back-propagation multilayer perceptrons (MLP) (Fig. 1). Input vectors and the corresponding target vectors are used to train an MLP, a process that updates the weights and biases until the MLP can approximate a mapping function that associates input vectors with specific output vectors. The generalization property makes it possible to train an MLP with a representative set of input/target pairs and get good results for predicting unseen input samples. The ability of an MLP to learn complex (non-linear) and multi-dimensional mapping from a collection of examples makes it an ideal classifier for diagnostic classification (Haykin, 1999; Khan et al., 2001; O'Neill and Song, 2003; Wei et al., 2005).


Figure 1
View larger version (27K):
[in this window]
[in a new window]
 
Fig. 1 The general architecture of a two-layer MLP. The inputs and the layers of neurons are connected through sets of synaptic weights, e.g. Formula 4, and each neuron has an individual bias, e.g. b1,1.

 
Despite reported successful studies on applying MLPs to diagnoses with genomic data, such as gene expression microarray data (Khan et al., 2001; Linder et al., 2004; O'Neill and Song, 2003; Wei et al., 2005), the most critical problem, that of the curse of dimensionality, has not been effectively addressed. The curse of dimensionality is caused by the finite amount of training data available relative to the large input feature space. Accordingly, when the dimensionality increases considerably and the available information remains inadequate, the large number of model parameters in the classifier cannot be well-trained (Haykin, 1999, Jain et al. 2000). Consequently, the classifier's performance may degrade beyond a certain point with the increasing inclusion of features or dimensions. In mRNA microarray experiments, there is typically an extremely ill-conditioned ratio of sample number (tens to hundreds) to dimension number (probe or probe sets typically >10 000), which greatly augments the impact of the curse of dimensionality (Fukunaga, 1990; Haykin, 1999). In current studies, the approaches to avoid the curse of dimensionality are generally limited to directly reducing the number of inputs. The commonly applied methods include conventional dimensionality reduction methods, such as principal component analysis (Khan et al. 2001, Wei et al., 2005), t-statistics (Golub et al., 1999), correlation measure (van't Veer et al., 2002) and an MLP training-based gene selection procedure that selects genes with greater influence on the changes of outputs in an MLP (O'Neil and Song, 2003).

The design parameters in training an MLP include initial values of the model parameters (synaptic weights and biases), stopping rules, MLP architecture, etc. Since no effective algorithms are available to search for a global optimum and traditional MLP initialization is done randomly, classification performance depends largely on the initial values of weights and biases. Furthermore, the higher complexity of the classifiers often results in more local minima in the error surface, and the classifier trainings can easily be trapped into such local minima (Raudys and Skurikhina, 1992; Raudys, 1994).

We hypothesized that developing an optimization of MLP initialization will allow the reduction of the curse of dimensionality, and therefore improve performance of the MLP. Our goal was to find an effective non-random initialization scheme that places the initial state of an MLP closer to the optimal solution that is later sought by training (Wang et al., 2004). This approach is bolstered by previous studies in statistical pattern recognition field, where it has been shown that non-random initializations of MLP weights and biases resulted in the MLP with small generalization error even when the number of samples is smaller than the number of features or dimensions (Raudys, 1994; Raudys, 1997; Raudys and Skurikhina, 1992).


    2 THEORY AND METHOD
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY AND METHOD
 3 EXPERIMENTAL VERIFICATION
 4 CONCLUSIONS AND DISCUSSIONS
 REFERENCES
 
2.1 wFC-based MLP Initializations
2.1.1 Linear dimension reduction and MLP feature extraction—hidden layer initialization
The MLP offers an integrated procedure for feature extraction and Bayes classification by learning the decision boundary (Haykin, 1999). Its feed-forward auto-associative architecture can also be used to construct non-linear subspaces in a supervised or unsupervised mode (Haykin, 1999; Jain et al., 2000). The output of the hidden layer may be interpreted as a set of new features presented to the output layer for classification (Haykin, 1999). On the other hand, multi-class linear discriminant analysis (LDA) provides a multivariate prediction by estimating the density function. Its subspaces that are extracted based on the weighted Fisher criterion (wFC), retain most closely the intrinsic Bayes separability (Loog et al., 2001). It can be shown that the determination of the linear dimension reduction (LDR) transformation is equivalent to finding the maximum-likelihood parameter estimates of a standard finite normal mixture (SFNM) model (Loog et al., 2001). This motivates an exploration of the connections between MLP and LDR. A natural hypothesis is that the class labels used as targets during supervised training force the outputs of the hidden layer to capture the most discriminatory components or subspaces for distinguishing the classes. Based on these theoretical observations, we suggest a wFC-based initialization mechanism for the MLP hidden layer (Wang et al., 2004). To limit the complexity of the MLP, we assume that the number of neurons in the hidden layer is smaller than the number of inputs.

Given an m0-dimensional input t-space with K0 classes, the multi-class LDR searches for a linear transformation W that transforms the original input space to a lower m1-dimensional feature x-space (m1 < m0); the extracted x-space should preserve the maximum amount of class discriminatory information. Since it is too complex to directly use the Bayes error as a criterion, the most common technique for finding this transformation is LDR that is based on Fisher criterion (Jain et al., 2000; Haykin, 1999). This method maximizes the ratio of the between-class scatter matrix to the within-class scatter matrix, thereby guaranteeing maximal separability. In this paper, we apply the wFC to the multi-class classification problem (Loog et al., 2001), and the wFC is defined as

Formula 1(1)
where W is the linear transformation matrix, {pi}k and {pi}l are the prior probabilities of classes k and l respectively, Formula 1 is the total within-class scatter matrix, and Formula 1 is the between-class scatter matrix for classes k and l. {omega}({Delta}kl) the weighting function defined as,

Formula 2(2)
where Formula 2 is the Mahanalobis distance between classes k and l with class mean vector µt and covariance matrix Ct.

It has been shown that when there are more than two classes to be classified, the conventional multi-class Fisher criterion (cFC) for deriving dimension-reduced subspace is suboptimal with respect to classification (Loog et al., 2001). The reason is that the cFC treats class pairs with various between-class distances equally. In contrast, the wFC incorporates a weight function that approximates the Bayes error rate between classes, and assigns larger weights to the closer class pairs and smaller weights to the distant pairs. Thus, in the extracted subspace found by wFC, the classes with heavy overlap gain adequate emphases, and the distant pairs remain well separated.

Finding a solution W that maximizes the wFC is essentially a problem of eigenvalue decomposition of the total Fisher scatter matrix.

Formula 3(3)

By taking only the m1 eigenvectors corresponding to the m1 largest eigenvalues (m1 < m0), we can form a transformation that not only reduces the dimensionality of the original input space, but also retains maximal class separability information. We call this procedure wFC-Discriminatory Component Analysis (wFC-DCA)

With the transformation W (m0 x m1) derived from LDR, the dimension-reduced feature subspace (x-space) with m1 dimensions becomes Formula 3, for i = 1, ... , N, where N is the number of samples, xi is the representation of the sample vector ti in the x-space with Formula 3 for r = 1, ... , mi, and bt0 is the global center of the dataset. On the other hand, the outputs of the hidden layer in the MLP (Fig. 1) can be acquired as, Formula 3, where Formula 3 is the set of synaptic weights connecting m0 inputs to neuron n at the hidden layer, an is the output of neuron n, p is the MLP input vector, b1,n is the bias of hidden neuron n and {phi}(·) is an activation function (Haykin, 1999). The connection between the LDR and the MLP feature extraction mechanism now becomes clearer, suggesting that the column vectors of the LDR matrix W can be used to initialize the weights between the input and hidden layer of an MLP, Formula 3, and their biases can be initialized as, Formula 3. The new features are further scaled by the activation function {phi}(·) that could be linear or non-linear. It has been theoretically shown that the minimization of the Bayes error with respect to the synaptic weights and biases of the MLP is equivalent to maximizing the wFC [Equation (1)], and it can be entirely determined by the hidden neurons (Haykin, 1999).

2.1.2 LDA and multi-class perceptrons—output layer initialization
Since the outputs of the hidden layer serve as new features, Fisher LDA determines a linear transformation for converting an m1-dimensional problem to a one-dimensional (1D) problem (Haykin, 1999). Consider the variable Formula 3 that is transformed from x-space to a 1D space via LDA, and the LDA is defined by

Formula 4(4)
that is known as the generalized Rayleigh quotient. The solution that maximizes J(w) is simply Formula 4, which is also a generalized eigenvalue problem.

The neuron in the output layer behaves similarly as a perceptron that can be considered as a decision making element that bears a close resemblance to the Bayes classifier, and has been generalized to multiple classes (Haykin, 1999). Specifically, the outputs of the neurons in the output layer are computed as, Formula 4 for i = 1, ... , m2, where a is the output vector of the hidden layer, wi is the set of weights connecting the hidden layer and the output neuron i, b2.i is the bias of output neuron i and m2 is the number of the output neurons, i.e. the number of classes (Fig. 1). Consider a two-class case with a linear activation function {phi}(·), we have Formula 4 with Formula 4 and Formula 4, where Formula 4. We can use two output neurons to derive a class-dependent representation by rearranging the output as, Formula 4, where Formula 4, Formula 4, Formula 4 and Formula 4, so we have Formula 4 and Formula 4. Figure 2a illustrates such an interpretation. Based on the above derivation, the class-dependent Fisher linear discriminant transformation wi can be again used to initialize the weights between the hidden and output neurons as Formula 4, and the biases of the output neurons can be, Formula 4 for i = 1, 2. Accordingly, for a three-class case, it is straightforward to have Formula 4, Formula 4, Formula 4, Formula 4 and Formula 4, Formula 4, where Formula 4. Figure 2b depicts this case. Notice that such an initialization is readily applicable to single-layer perceptrons.


Figure 2
View larger version (14K):
[in this window]
[in a new window]
 
Fig. 2 The illustrations of the MLP output layer initialization approach.

 
2.1.3 Determining the size of the hidden layer
The wFC-based MLP initialization method may also suggest a suitable number of hidden neurons, a key component of MLP architecture. Neural networks, like other flexible non-linear estimation methods, are vulnerable to problems of under-fitting and over-fitting (Haykin, 1999; Ripley, 1996). The over-fitting problem occurs more easily when the number of samples in the training set is small and the network is relatively large, which is the case for most genomic data. Therefore, it is important to use a network that is just large enough to provide an adequate fit. The resulting subspace represented by the outputs of the hidden layer should maintain as much class separability as possible (Haykin, 1999): the retained partial separability is given by Formula 4 [Equation (1)]. Hence, it is appropriate to let the number of pseudo genes (i.e. m1, the number of hidden neurons) be the number of significant eigenvalues derived from wFC-DCA because the eigenvalues represent class separability in feature space. In this study, we select the dominant eigenvalue subset that contains 99% of the total separability, and let the number of hidden neurons be equal to the number of selected eigenvalues.

2.2 Selection of MLP inputs
Input selection is a prerequisite for diagnostic classification using genomic data; we apply our newly developed two-step wFC-based input selection method (Xuan et al., 2004) that shares the same theoretical basis (wFC) with the proposed MLP initialization approach. First, we rank all genes based on their individual discriminatory power measured by the 1D wFC (Xuan et al., 2004); a gene will be selected as an individually discriminatory gene (IDG) if its discriminatory power is above an empirical threshold. Second, from the IDG pool, we select jointly discriminatory gene (JDG) subsets (with various sizes) whose joint discriminatory power is the maximum among all sets of the same size. The joint discriminatory power is also determined by the multi-dimensional version of wFC [Equation (1)]. Furthermore, the JDG sets are refined by testing on a trained MLP, which ultimately determines the ‘optimal’ diagnostic gene subset that minimizes the generalization error. From the curve of classification rate versus JDG subsets, we pick the optimal JDG subset that corresponds to the maximal classification rate as the final inputs for the MLP. This step boosts the MLP performance, and also determines its number of inputs (m0, Fig. 1).


    3 EXPERIMENTAL VERIFICATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY AND METHOD
 3 EXPERIMENTAL VERIFICATION
 4 CONCLUSIONS AND DISCUSSIONS
 REFERENCES
 
3.1 Data
To highlight the biological and clinical relevance, we chose diagnostic tasks that are difficult for standard clinical and pathological methods alone. The following list summarizes the microarray datasets tested in this study.

  1. Limb-girdle muscular dystrophy (LGMD, provided by Children National Medical Center, Center for Genetic Medicine): four diagnostic groups, Fukutin related protein deficiency (FKRP) (homozygous missense for glycosylation enzyme, limb-girdle muscular dystrophy sub-type, n = 7), Becker muscular dystrophy (BMD, hypomorphic for dystrophin, n = 5), Dysferlin deficiency (putative vesicle traffic defect, n = 10), and Calpain III deficiency (n = 10), total 32 samples, 22 283 genes.
  2. Leukemia (Kohlmann et al. 2004): three diagnostic groups, T-ALL (n = 9), MLL (n = 10) and BCR-ABL (n = 15), total 34 samples, 312 genes.
  3. Central nervous system (CNS) cancer (Pomeroy et al. 2002): five diagnostic groups, Medulloblastomas (n = 60), Malignant glioma (n = 10), Rhabdoid tumours (n = 10), Normal cerebella (n = 4), Supratentorial PNET (n = 6), total 90 samples, 7129 genes.

3.2 Results
The experiments were designed to show the impact of the proposed MLP optimization method on two major aspects of MLP performance: prediction accuracy and training efficiency. For the prediction accuracy, we examined classification rate and area under the curve (Az) from receiver operating characteristic (ROC) analysis; to probe the training property we recorded initial error (mean squared error, MSE) between target and output before training, final error (MSE) after training, total number of epochs needed for convergence and percentage of converged training.

In all experiments with MLP training and testing, we applied 100 iterations of stratified 3-fold cross validations in order to ensure reliability, and all performance measures were calculated based on the results from the cross validations. In the stratified 3-fold cross validation, the dataset is randomly divided into three subsets of equal size, and the proportion of each class in each subset remains the same as that in the entire set. In each fold, one of the subsets is used for testing and the rest are combined for training; in each iteration, the training is repeated until all subsets have been used for testing.

The optimized MLPs (oMLP, wFC-based initialization) consistently outperformed conventional MLPs (cMLP, conventional random initialization) for all different tested JDG subsets (Fig. 3). We selected 200 JDG subsets consisting of 1–200 genes as inputs of the MLPs. Figure 3 plotted the curves of the classification rate from the test set (those samples not used for MLP training) versus JDG subsets, which is part of the step 2 in the two-step input selection procedure. To determine the optimal JDG subset among the 200 candidate subsets, the oMLP and cMLP were trained with the same training set and tested with the same test set in each fold for fairness and reliability. The search of the optimal JDG subset was considered sufficient when the classification rate of the oMLP did not increase substantially and the classification rate of the cMLP decreased consistently over 20 JDG sets. The oMLP was able to maintain high classification rate as the size of the JDG increased, whereas the cMLP performance degraded. Moreover, the smaller standard deviation of the oMLP classification rate across all cross validations indicated that the oMLP provided more stable performance (Table 1).


Figure 3
View larger version (8K):
[in this window]
[in a new window]
 
Fig. 3 The classification rate curves of the oMLP and cMLP with various JDG sets as inputs, (a) LGMD, (b) leukemia and (c) CNS cancer. For all JDG sets, oMLP consistently outperformed cMLP. The classification rate for each JDG set is the average of the 100 iterations of 3-fold cross validations. The JDG set corresponding to the maximal classification rate of the oMLP is considered as the optimal JDG set.

 

View this table:
[in this window]
[in a new window]
 
Table 1 The performance evaluation of the oMLP and cMLP with the optimal JDG set as inputs. MLP structure indicates the numbers of inputs (size of optimal JDG set), hidden neurons, and output neurons

 
Additionally, as ROC analysis (Metz, 1986) has been widely recognized as the most meaningful assessment of medical diagnostic performance (Metz, 1986), we also evaluated relative prediction performance of the oMLP and cMLP using a one-against-rest ROC analysis (Hand and Till, 2001) that was specifically designed for the multi-class classification. The ROC analysis offers a description of the trade-offs between true positive fraction (TPF) and false positive fraction (FPF) of a detection test as the decision threshold varies. In the one-against-rest ROC analysis, the approximated posterior probabilities (the outputs of an MLP) of test samples were recorded, and a two-class ROC analysis was applied to all combinations of one-class against the rest classes. For example, there will be n ROC curves for an n-class classification task. A ROC curve plots TPF versus FPF; generally the larger the index, Az, the better the prediction performance of the classifier. With the optimal JDG subset as inputs, the oMLPs had greater Az values for all one-against-rest combinations than the cMLPs, therefore showed better overall performance (Fig. 4). Within each individual case, the larger difference between the prediction accuracies of the oMLP and cMLP corresponds to the larger differences in Az values (Fig. 4 and Table 1).


Figure 4
View larger version (14K):
[in this window]
[in a new window]
 
Fig. 4 The one-against-rest ROC curves of the oMLP and cMLP with the optimal JDG set as inputs, (a) LGMD, (b) leukemia and (c) CNS cancer. Each dataset has several sets of ROC curves for the oMLP and cMLP, and the number of sets is equal to the number of classes in each dataset. The corresponding Az values for the oMLP and cMLP are displayed with the curves. The oMLP consistently showed superior performance over the cMLP for all datasets with larger Az values.

 
The evaluations of training properties on the oMLP and cMLP with the optimal JDG subset as inputs clearly demonstrated the effectiveness of the proposed initialization approach (Table 1 and Fig. 5). The smaller averages of initial and final MSE and the smaller STD of the final MSE in the oMLP trainings, also shown by the training curves (Fig. 5), provided clear evidence that the proper initialization offered a better starting training point so that the trainings were led to a better and less diverse convergence point. In addition, we monitored whether each training process converged by recording the percentage of converged trainings. Note that a training process is considered as converged only if it meets the error goal or is stopped by a standard early stop procedure we applied in all MLP trainings to prevent over-training. The result showed that 100% of the oMLP trainings converged, but a number of cMLP trainings were eventually terminated by a preset maximal number of epochs (Table 1). Moreover, the smaller average and STD of the number of total epochs needed by the oMLP to achieve convergence further confirmed that the oMLP needed less computational resources to reach higher classification rate (Table 1).


Figure 5
View larger version (13K):
[in this window]
[in a new window]
 
Fig. 5 The training curves of the oMLP and cMLP with the optimal JDG set as inputs, (a) LGMD, (b) leukemia and (c) CNS cancer. The training properties, e.g. the convergence speed, initial and the final errors, can be clearly observed from the figures. The trainings of the oMLP usually started from smaller initial errors and converged to smaller final errors, whereas the cMLP training started from and converged to larger and more diverse errors. All these improved properties supported the advantage of the wFC-based initialization over the conventional random initialization.

 
The two-step input selection procedure is effective and computationally feasible in handling a large number of genes so that the curse of dimensionality problem is significantly reduced to a more manageable scale. The considerable change of the classification rate over the entire curve (Fig. 3) confirmed that the content and size of the inputs strongly influenced MLP performance. Particularly, since it shares the same theoretical criterion with the proposed MLP initialization method (wFC), their joint influence is augmented.

We further compared the oMLP with two of the most commonly applied classifiers, K-nearest neighbor (KNN) and one-versus-rest support vector machine (OVR-SVM) that is a typical type of multi-class SVMs (Ramaswamy et al., 2001; Statnikov et al., 2005). Both KNN and OVR-SVM undertook rigorous optimizations for seeking optimal performance. We determined the parameter K in the KNN model based on 100 iterations of 3-fold cross validations. Each SVM unit in the OVR-SVM was tested for seven different kernel functions (linear, second and third order polynomials, and Gaussians with scale factors, 0.01, 0.1, 0.5 and 1.0), and five penalty values C = 0.001, 0.01, 0.1, 1.0 and 10.0. The KNN took the optimal JDG set as inputs; the OVR-SVM took two types of inputs, optimal JDG set and all genes. In summary, the OVR-SVM with optimal JDG set as inputs and the oMLP provided excellent and comparable accuracies and the OVR-SVM demonstrated small improvements, whereas the KNN and cMLP generally showed inferior performances (Table 2). Detailed results of these comparative experiments can be found in Appendix A.


View this table:
[in this window]
[in a new window]
 
Table 2 The classification rate of the model with the best performance for the KNN and OVR-SVM. The results are listed as average (STD)

 

    4 CONCLUSIONS AND DISCUSSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY AND METHOD
 3 EXPERIMENTAL VERIFICATION
 4 CONCLUSIONS AND DISCUSSIONS
 REFERENCES
 
By suggesting an initialization technique based on the wFC and the link between the MLP mechanism and Fisher LDA, together with the input selection procedure, we offer an efficient and practical MLP prototype that can ease the curse of dimensionality in multi-class high-dimensional genomic data classification and provide excellent generalization performance. The wFC-based initialization procedure initiates the MLP close to the optimal condition for decision making, which increases the likelihood that the MLP may converge to a better local or global optimum. The curse of dimensionality is a significant problem because it can easily lead to poor predictions to test samples; classification using genomic data is more prone to this problem owing to the small ratio of the sample size to dimensionality. The reduction of curse of dimensionality in the oMLP is clearly shown by our experimental results: the oMLP was able to retain its classification rate to a very high level even when the number of the inputs significantly increased, while the cMLP performance degraded drastically. Besides, in the design of the wFC-based initialization, we discussed the close connection between the classification by MLP and by LDA, and made contributions in the theoretical insight and experimental validation on how the MLP actually works.

The improved performance of our oMLP approach does not imply that this method will be effective for any multi-class non-linearly separable problem. Such a classification problem could be an intrinsically non-linear problem, or may become a non-linear problem after dimensionality reduction according to Cover's theorem on the separability of patterns. Therefore, the hidden layer of the MLP needs to perform the additional function of transforming a non-linearly separable problem into a linear classification. This may be achieved by the existing hidden layer through dual-purpose training, or one additional hidden layer may be required. An elegant yet simple method is to apply divide-and-conquer principle to the dataset and accordingly introduce some pseudo-classes to the output layer, such that all class-pairs become linearly separable. Notice that the discrete decision fusion can be readily and effortlessly done without using any combiner, since the pseudo-classes belong to some of the known classes a priori. It is important to note that a net reduction in MLP complexity can still be achieved when m0 is large, since the total number of weights in a two-layer MLP is m1 (m0 + m2) such that the reduction due to m1 surpasses the generally limited increase due to m2. Refinements, allowing a co-determination of m1 and m2, may further reduce the curse of dimensionality and improve the generalization performance.

A complex multi-class classification task is beyond the capability of a single classifier. It is remarkable that the single classifier, oMLP, can compete with the OVR-SVM built with a collection of single binary SVMs and show comparable outstanding performance when the number of classes is relatively small (≤5, more experimental results in Appendix A). However, the OVR-SVM is generally expected to outperform most existing classifiers as the number of classes increases (Statnikov et al., 2005).

As another verification of the effectiveness of the MLP initialization, we tested and compared the untrained oMLP and cMLP, and the results showed that the untrained oMLP considerably outperformed the untrained cMLP (Appendix B). Even without training, the hidden layer of an untrained oMLP is able to extract discriminant features derived from the wFC; then the neurons in the output layer can perform linear one-versus-rest classifications based on these extracted features. We used linear transfer function in the hidden neurons and log-sigmoid transfer function in the output neurons. Hence, an untrained oMLP closely resembles LDA, and the initial condition of the oMLP (i.e. performance of the untrained oMLP) reflects the performance of LDA.

Using simulation experiments, we demonstrated that the capability of the proposed MLP optimization method is not significantly affected by the deviation of the distribution of a diagnostic group from a standard single multivariate Gaussian to a mixture of Gaussian (Appendix C). Although the wFC may only find less precise discriminant components when the distribution of each class cannot be closely modeled by a single Gaussian distribution, such loss of information is expected to be small and can be well-compensated by further training of weights and biases that offers extensive degrees of freedom in modeling decision boundary.


    Acknowledgments
 
This study was supported in part by the National Institutes of Health grants under CA109872, CA096483 and EB000830, and DOD/CDMRP grant under BC030280. Z.W. was also supported by the Crystal Ball of Virginia Beach VAs and the Muscular Dystrophy Association.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on August 23, 2005; revised on November 23, 2005; accepted on December 30, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THEORY AND METHOD
 3 EXPERIMENTAL VERIFICATION
 4 CONCLUSIONS AND DISCUSSIONS
 REFERENCES
 

    Bittner, M., et al. (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406, 536–540[CrossRef][Medline].

    Brown, M.P., et al. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267[Abstract/Free Full Text].

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537[Abstract/Free Full Text].

    Hand, D.J. and Till, R.J. (2001) A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45, 171–186[CrossRef].

    Haykin, S. Neural Networks: a Comprehensive Foundation, (1999) 2nd edn Prentice-Hall, Inc.

    Jain, A.K., et al. (2000) Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell, . 22, 4–37[CrossRef].

    Khan, J., et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med, . 7, 673–679[CrossRef][Web of Science][Medline].

    Kohlmann, K., et al. (2004) Pediatric acute lymphoblastic leukemia (ALL) gene expression signature classify an independent cohort of adult ALL patients. Leukemia, 18, 63–71[CrossRef][Web of Science][Medline].

    Linder, R., et al. (2004) The ‘subsequent artificial neural network’ (SANN) approach might bring more classificatory power to ANN-based DNA microarray analyses. Bioinformatics, 20, 3544–3552[Abstract/Free Full Text].

    Loog, M., et al. (2001) Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Trans. Pattern Anal. Mach. Intell, . 23, 762–766[CrossRef].

    Metz, C. (1986) Statistical analysis of ROC data in evaluating diagnostic performance. Mult. Regression Anal, . 365–384.

    Mjolsness, E. and DeCoste, D. (2001) Machine learning for science: state of the art and future prospects. Science, 293, 2051–2055[Abstract/Free Full Text].

    O'Neill, M.C. and Song, L. (2003) Neural network analysis of lymphoma microarray data: prognosis and diagnosis near-perfect. BMC Bioinformatics, 4, 13[CrossRef][Medline].

    Pomeroy, S., et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436–442[CrossRef][Medline].

    Ramaswamy, S., et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154[Abstract/Free Full Text].

    Raudys, S. (1992) Accuracy of feature selection and extraction in statistical and neural net pattern classification. Proc. Int. Conf. Pattern Recogn, . 2, 62–70.

    Raudys, S. (1994) Why do multilayer perceptrons have favorable small sample properties? Pattern Recognition in Practice IV, Elsevier Science B. V, , pp. 287–298.

    Raudys, S. (1997) On dimensionality, sample size, and classification error of nonparametric linear classification algorithms. IEEE Trans. Pattern Anal. Mach. Intell, . 19, 667–671[CrossRef].

    Raudys, S. and Jain, A.K. (1991) Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans. Pattern Anal. Mach. Intell, . 13, 252–264[CrossRef].

    Raudys, S. and Skurikhina, M. (1992) The role of the number of training samples on weight initialisation of artificial neural net classifier. RNNS/IEEE Symp. Neuroinform. Neurocomput, . 1, 343–353.

    Ripley, B. Pattern Recognition and Neural Networks, (1996) Cambridge University Press.

    Shipp, M.A., et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med, . 8, 68–74[CrossRef][Web of Science][Medline].

    Statnikov, A., et al. (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631–643[Abstract/Free Full Text].

    van't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536[CrossRef][Medline].

    Wang, Y., Wang, Z., Xuan, J., Zhang, J., Hoffman, E., Clarke, R., Khan, J. (2004) Optimizing multilayer perceptrons by discriminatory component analysis. Proc. IEEE Workshop on Machine Learning for Signal Processing, 273–282.

    Wei, J.S., et al. (2004) Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma [Erratum (2005) Cancer Res, 65, 374.]. Cancer Res, . 64, 6883–6891[Abstract/Free Full Text].

    West, M., et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA, 98, 11462–11467[Abstract/Free Full Text].

    Xuan, J., Dong, Y., Khan, J., Hoffman, E.P., Clarke, R., Wang, Y. (2004) Robust feature selection by weighted fisher criterion for multiclass prediction in gene expression profiling. Proc. Int. Conf. Pattern Recogn, . 2, 291–94.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/6/755    most recent
btk036v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Z.
Right arrow Articles by Hoffman, E. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Z.
Right arrow Articles by Hoffman, E. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?