Skip Navigation


Bioinformatics Advance Access originally published online on June 18, 2008
Bioinformatics 2008 24(16):1812-1818; doi:10.1093/bioinformatics/btn316
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/16/1812    most recent
btn316v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Oh, J. H.
Right arrow Articles by Gao, J. X.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Oh, J. H.
Right arrow Articles by Gao, J. X.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Biomarker selection and sample prediction for multi-category disease on MALDI-TOF data

Jung Hun Oh 1, Young Bun Kim 1, Prem Gurnani 2, Kevin P. Rosenblatt 3 and Jean X. Gao 1,*

1Department of Computer Science and Engineering, The University of Texas, Arlington, TX 76019, 2PerkinElmer Life & Analytical Sciences, Waleham, MA 02451 and 3Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, TX 77555, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Diseases normally progress through several stages. Therefore, biomarkers corresponding to each stage may exist. To deal with such a multi-category problem, including sample stage prediction and biomarker selection, we propose methods for classification and feature selection. The proposed classification method is based on two schemes: error-correcting output coding (ECOC) and pairwise coupling (PWC). The final decision for a test sample prediction is an integration of these two schemes. The biomarker pattern for distinguishing each disease category from another one is achieved by the development of an extended Markov blanket (EMB) feature selection method.

Results: In this study, a liver cancer matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry (MS) dataset was used, which comprises hepatocellular carcinoma (HCC), cirrhosis, and healthy spectra. Peak patterns were discovered for distinguishing pairwise categories among the three classes. Importance and reliability of individual peaks were presented by the measurements of certain weight values and frequencies. The classification capability of the proposed approach was compared with classical ECOC, random forest, Naive Bayes, and J48 methods.

Availability: Supplementary materials are available at http://visionlab.uta.edu/biomarker/bioinfo.htm

Contact: gao{at}uta.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
High-resolution matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry (MS) is capable of collecting proteomic biomarkers over a broad mass range (100 to <300 000 Da) in a single acquisition and has less measurement error (mass shift) compared with surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS). The technique also results in a higher accuracy measurement in terms of signal-to-noise ratio. Therefore, high-resolution MALDI-TOF MS has increasingly been used to diagnose early disease, and to monitor disease progression and therapeutic effects of drugs. To discover and identify unique biomarker patterns hidden in the complex and high-dimensional mass spectra, robust computational algorithms have to be developed.

For high-resolution MS data, traditional machine learning algorithms may break down with high-dimensional input (mass-to-charge ratios, features, biomarkers). Very limited research has been carried out to explore the computational challenges. Yu et al. (2005) developed two approaches using different preprocessing and classification methods to analyze high-resolution SELDI-TOF MS ovarian data (Yu and Chen, 2005; Yu et al., 2005). In the first study (Yu et al., 2005), a feature dimension reduction method consisting of a four-step preprocessing was proposed followed by a standard support vector machine (SVM) for classification. In the second paper by Yu and Chen (2005), Bayesian neural network models were used for sample classification. Sequential feature selection was carried out by using a bootstrap technique based on the two-sample Kolmogrov–Smirnov test (KS-test). Ressom et al. (2005) have designed a computational method that combines particle swarm optimization with SVM to distinguish liver cancer patients from healthy individuals in SELDI-QqTOF spectra. Recently, this group proposed to combine ant colony optimization (ACO) with SVM to distinguish hepatocellular carcinoma (HCC) patients from cirrhosis patients via MALDI-TOF mass spectra (Ressom et al., 2007). Particle swarm optimization and ACO are interesting swarm intelligence techniques that have been successfully applied to a number of optimization problems. A comparative study of several well-known classification algorithms, such as linear discriminant analysis (LDA), k-nearest neighbor (KNN), random forest (RF), bagging, boosting and SVM, has been carried out for ovarian cancer diagnosis (Wu et al., 2003). It was demonstrated that the RF approach leads to an overall higher accuracy rate as well as a more stable assessment in terms of classification errors. As a summary for current literature on high-resolution MS data analysis, the work, feature selection and classification, has been focused on binary class samples. However, medical datasets typically contain multiple classes. Therefore, feature selection as well as sample prediction for multi-categories are necessary.

Multi-category classification involves assigning an unknown sample to one of the k classes (k > 2) (Fei and Liu, 2006). There are two main strategies to tackle the multi-class problems (Ie et al., 2005). The first method considers all classes at once by constructing a decision function, which may require high cost and complexity. In the second method, the multi-class problems are broken down into a set of binary classification problems, which are more computationally tractable (Allwein et al., 2002). There exist several methods to reduce multi-class problems to binary class problems. These include one-against-the-rest, one-against-one and error-correcting output coding (ECOC). ECOC has shown a generalization capability for multi-sample classification.

ECOC was inspired from communication theory, where classi-fications wrongly guessed by classifiers can be corrected. In the ECOC multi-class classification problem, there are several factors that affect the performance of the algorithm. Investigations have been carried out on the selection of coding matrix (Dietterich and Bakiri, 1995; Ie et al., 2005; Pujol et al., 2006) and optimization of the decoding function (Passerini et al., 2004; Smith and Windeatt, 2005).

In this article, we develop an ensemble multi-class learning algorithm by integrating a new ECOC scheme and one-against-one pairwise coupling (PWC) scheme. The motivation is to take advantage of each multi-class classification strategy to achieve the most reliable sample prediction. Our contributions come from defining a performance-based weighting function for binary classifiers (dichotomies) in ECOC and a robust decoding function incorporating individual sample properties and the overall dichotomy performance. A unique set of biomarkers to distinguish each pair of categories (classes) is discovered by a new feature selection method, extended Markov blanket (EMB). Figure 1 shows the framework of the proposed biomarker selection and sample category prediction.


Figure 1
View larger version (39K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Framework of the proposed algorithm.

 
The MALDI-TOF dataset used in this study is from Ressom et al. (2007). It consists of spectra from HCC patients, cirrhosis patients and healthy individuals (Ressom et al., 2007). Among the three-class dataset, Ressom et al. used HCC and cirrhosis spectra for two category sample classification leaving healthy spectra for peak screening and outlier detection. In this study, however, we used all three-class spectra for multi-category sample classification and biomarker selection for each class.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Data preprocessing
The binned spectra of MALDI-TOF MS for the liver cancer study were binned with a size of 100 p.p.m. and in total yielded 23 846 bins. Preprocessing by baseline correction and normalization was applied to the binned spectra prior to the next stage pattern finding. For baseline correction, we follow the same method, spline approximation, as what was done by Ressom et al. (2007). After that, each spectrum is normalized by dividing the baseline corrected spectrum by its total ion current (the summed intensity over all m/z values in the baseline corrected spectrum). Because of the small normalized intensity value, all the intensities are multiplied by 10 000 for computational convenience. Ressom et al. generated windows by combining bins after preprocessing. In this study, however, we used the binned data itself to find biomarker candidates in narrow mass regions. To reduce computational burden caused by using all the bins, we rank peaks by a feature ranking method based on the ratio of between-group to within-group peak differences and select the tractable size of features in our algorithm. The feature ranking method was proposed by Dudoit et al. (2002) for feature selection in multi-class problems. For a certain m/z peak j, the ratio is:


Formula 1

(1)
where I(·) is the indicator function, Formula denotes the average intensity of peak j across samples belonging to class k, Formula is the average intensity of peak j across all samples, xij is the intensity of sample i at peak j, and yi is the class label for sample i. The larger the ratio, the more likely m/z peak j will be relevant to the class separation. In the next section, we will describe the basic algorithms used in our biomarker selection and classifier design.

2.2 Fundamental algorithms
2.2.1 Markov blanket feature selection
Markov blanket (MB) filtering is an instance of backward feature elimination algorithms (Koller and Sahami, 1996; Pearl, 1998). Let F be a set of features with size r defined as F=(F1, ..., Fr) and M{subseteq}F be a set of features which does not contain Fi. Feature set M is called MB for Fi if Fi is conditionally independent of FM–{Fi} given M. Therefore, the information contained in feature Fi can be covered by its MB. However, since the full size MB may not be available, an approximate one that subsumes the feature information has to be sought. One MB Mi for Fi can be defined as the one having m highest Pearson correlations with Fi. In general, to reduce computational overhead and to avoid fragmenting the training samples, small value m is used.

To evaluate the closeness between Fi and its MB Mi, the following expected cross-entropy is estimated:


Formula 2

(2)
where fMi and fi are feature values to Mi and Fi, respectively, c is the class label, and D(.||.) represents cross-entropy. For any distributions µ and {sigma}, the cross–entropy of µ to {sigma} is Formula that measures the extent of the difference which is made by using {sigma} instead of µ. In Equation (2), {triangleq}(Fi|Mi)=0 means that Mi is a perfect MB for Fi, therefore Fi does not provide any information about class labels beyond that subsumed by its MB Mi. However, since this case is less likely to happen, we look for a set Mi such that {triangleq}(Fi|Mi) is small. The lower the {triangleq}(Fi|Mi), the closer the approximated MB of Fi is. Feature Fi with the lowest {triangleq}(Fi|Mi) value in the remaining features is considered to be the most redundant and should be eliminated first.

To decide the MB of each feature, the intensity values of m/z after the preprocessing are used in the calculation of Pearson correlation coefficient. The expected cross-entropy {triangleq}(Fi|Mi) based on the discretized values can be calculated as:


Formula 3

(3)
where n is the number of samples (Knijnenburg et al., 2006).

2.2.2 Support vector machine
SVM is a classification algorithm to solve two-class problems (Burges, 1998). In a high dimension space formed by mapping the training data x via a function {Phi}(x), an optimal hyperplane is sought to separate the binary labeled training data by maximizing the margin between the two classes:


Formula 4

(4)
where w is a weight vector and b is a scalar.

The weight vector can be calculated by solving a quadratic programming problem formulated to find the optimal hyperplane. Equation (5) shows the weight vector obtained by using a linear SVM:


Formula 5

(5)
where {alpha}=({alpha}1, {alpha}1, ..., {alpha}l) are Lagrange multipliers and l is the number of support vectors.

2.2.3 SVM-recursive feature elimination (RFE)
The SVM-recursive feature elimination (RFE) proposed by Guyon et al. (2002) is a backward feature elimination algorithm based on SVM. At each iteration, weights for all existing features are obtained by using Equation (5) and a feature corresponding to a smallest absolute weight is eliminated. This procedure continues until only one feature remains so that in the end all features are ranked.

2.3 A redesigned ECOC scheme for multi-class classification
ECOC is a classification method that breaks a k-class prediction problem into several binary classifications. In the ECOC framework, each class is assigned a unique codeword of length h composed of 1 and –1 (Fig. 1), forming a k x h coding matrix. Typically, the rows of the matrix are codewords assigned to the corresponding class ci and the columns Dj represent binary classifiers which partition the samples into two subsets labeled according to the coding matrix. Based on the class partitions, the matrix produces h binary classifiers called dichotomies. For a test sample, a code of length h is obtained as a result of the outputs of the h binary functions. This code is compared with each of the k codewords defined in the coding matrix, and the sample is assigned to a class with the closest codeword using certain distance measure. In doing so, the classification process can be seen as a decoding operation.

In the generic ECOC multi-class classification problem, there are several factors affecting the performance of the algorithm. As previously introduced, investigations have been carried out on the designing of the coding matrix and options for the decoding function. We developed a new ECOC framework from three aspects: a weighting strategy for different dichotomies, feature dimensionality reduction and a performance-based decoding function.

2.3.1 Weighting strategy
In a standard ECOC framework, the influence of all the dichotomies during classification of an unknown sample is equally treated. However, the importance of each dichotomy is different in terms of generating decision boundaries for the training samples. Therefore, we define a weighting function which is similar to the one used in boosting algorithms as shown in Equation (6). The weight value of each dichotomy is computed by using the error rate estimated for the dichotomy with the validation dataset. Therefore, the weight value represents how confident the dichotomy is. In Equation (6), vi and ei are the weight value and the error rate of the i-th dichotomy. In the case where the accuracy of the dichotomy is greater than 50%, the weight value becomes positive; otherwise, a negative value is returned,


Formula 6

(6)
Throughout the study, we choose 10-fold cross-validation (CV) where all samples are randomly split into 10 exclusive folds. Without loss of generality, SVM is used as the dichotomy function. For each of the 10 experiments (iterations), typically 9-folds are used as train data, and the remaining one fold is applied as test data. In our implementation, to estimate the accuracy of each dichotomy function, we further divide the 9-fold train data into 10-folds, i.e. sub-10-fold. In each iteration of the sub-10-fold, 9-folds become sub-train and one fold for sub-test. An averaged error rate resulting from sub-test data is put into Equation (6) to obtain the weight value. This estimation is separately performed for each dichotomy.

2.3.2 Feature reduction
In the training of dichotomies, due to the high dimensionality of mass profiles even after the preprocessing, irrelevant features might still exist. Therefore, using all features will degrade the ability of dichotomies and increase computational cost. Here, we employ a feature reduction algorithm based on information gain to remove irrelevant features in each dichotomy.

Let S be the set of instances from k classes, c1, c2, ..., ck, and P(ci,S) be the fraction of the instances in S that belong to ci. The entropy of the class distribution in S is as follows:


Formula 7

(7)
Suppose feature Fi has m distinct values, fFormula1, fFormula2, ..., fFormulam. Let Sj be the set of instances whose value on attribute Fi is fFormulaj. Then, the information gain of instance set S based on attribute Fi is calculated as


Formula 8

(8)


Formula 9

(9)


Formula 10

(10)
The information gain reflects the reduction in uncertainty about the overall class entropy when a certain feature Fi is given. In other words, features with zero information gain indicate the inability to reduce such uncertainty and should be removed during the training of the dichotomy function.

2.3.3 Decoding function
Based on the binary class predictions from the ensemble dichotomies, an output code is generated for the test sample. To assign the final class label to the test sample, a decoding function is required. The decoding function measures the closeness between the output code and the codewords in the coding matrix M. A new decoding function reflecting the importance of individual dichotomies is defined as:


Formula 11

(11)
where dj is the distance between the test sample and the j-th class, vi is the importance of the i–th dichotomy, xi is the i-th bit of the output code for the test sample and yFormulaj is the i-th bit of the codeword for the j-th class in the M matrix. A test sample is assigned to the class which has the minimum distance,


Formula 12

(12)

2.4 PWC scheme
Though ECOC multi-class sample prediction performs well in most cases, the major limitation of this framework is that it is difficult to use for subspace feature selection. As a result, biomarkers contributing to certain phenotypic categories cannot be estimated within this framework. Therefore, we incorporated another binary classification scheme, the pairwise one-against-one classifier, which provides us the capability to find discriminant biomarkers to distinguish phenotypic differences between classes.

Here we are not applying one-against-the-rest binary classifiers. This selection comes from a legitimate biomedical interpretation. As an example, if we choose one-against-the-rest, we need to find biomarkers to distinguish cirrhosis patients from the rest, which is the combined samples of normal and HCC. Lumping different stage samples like HCC and normal which may be very different from each other at molecular level could obscure biomarkers that truly distinguish HCC patients from cirrhosis. On the other hand, one-against-one comparisons may point out differences between classes of patients that should not be lumped together, such as normal and HCC. Furthermore, with the number of class k in biomedicine usually less than 10, the computation issue raised by one-against-one binary classification will not be a concern.

The number of all possible binary classifiers using one-against-one classification is k(k–1)/2. A common way to determine a final class for the test sample is voting. However in many cases, it is essential that multi-class classification should be a confidence measure, such as posterior probability (Wu et al., 2004). In this study, we use a probability measure based on a PWC scheme to leverage binary prediction results. The probability for a test sample belonging to class i is calculated from combining k(k–1)/2 two-class probabilities (Price et al., 1995):


Formula 13

(13)
where µij=P(y=i | y=i or j, x) corresponds to the posterior probability from the binary classifier.

Since standard SVM does not provide a way to measure the posterior probabilities, this is obtained using a parametric sigmoid model proposed by Platt (1999). For a SVM binary classification, the posterior probability is obtained as:


Formula 14

(14)
where y=1 is the class label for binary class sample x. The parameters A and B are determined by the maximum likelihood estimation from the training set.

A test sample in a PWC scheme is assigned to a class which has the maximum probability,


Formula 15

(15)

2.5 EMB feature selection
To discover the discriminant biomarkers among multi-classes, we propose a new wrapper-based feature selection algorithm called EMB. This feature selection process is embedded in the PWC multi-class classification scheme. The original MB is a filter-based feature selection method. Our algorithm considers reducing redundant features while selecting the most discriminant ones. With the feature subset selected by EMB, we run a linear SVM to calculate probabilities for the pair of classes.

To be more specific, for feature Fi remaining after preprocessing, two feature subsets are considered: the high correlated feature subset (HCFS) and the low correlated feature subset (LCFS) composed of the least correlated d features for Fi. The HCFS feature subset is used to remove redundant features as in classical MB feature selection. Our contribution comes from utilizing LCFS to estimate the classification capability of each feature during the MB process. This is derived from the fact that mutually low correlated features, in general, lead to good classification performance.

We will now describe how the EMB algorithm works out for feature selection. 10-fold CV data split is the same as previously introduced in Section 2.3.1. For each feature Fi, we perform the MB algorithm with its HCFS obtained from the train data to compute the expected cross-entropy value, {triangleq}(Fi|Mi). On the other hand, using LCFS and feature Fi, we run the linear SVM algorithm where the sub-train and sub-test data are used for training and testing, respectively, to obtain a roughly estimated accuracy, denoted as β. Then, we compute the normalized weight for each feature in the LCFS using the following function:


Formula 16

(16)
where


Formula 17

(17)
and β is the accuracy from sub-test samples in the linear SVM, k is the index for features in LCFS including feature Fi, |wk| is the absolute SVM weight obtained using Equation (5), d is the size of LCFS chosen as 10, 20 or 30, and {gamma} is a heuristic performance threshold typically specified as 0.5 (0.4 or 0.6 is also legitimate). After computing |wk| for all features in the LCFS, each |wk| is normalized by the summed absolute SVM weights of all the features in the LCFS and that of feature Fi.

Similar to SVM-RFE where features with the smallest absolute weights are removed, we assume that features with large absolute SVM weights are important in terms of classification power. Not surprisingly to observe, although the normalized SVM weight of certain features in Equation (16) may be high in an LCFS, the classification accuracy of the linear SVM performed with the limited LCFS may be low due to the partial discriminant capacity of one LCFS. Therefore, not only the normalized SVM weight for each feature but also the accuracy with its LCFS are important factors to estimate feature weight. This leads to the multiplication of the normalized SVM weight by the accuracy obtained after the SVM. Through a few experiments, we found that {gamma}=0.5 assures a good performance. If the accuracy is greater than 50%, the proposed weight becomes positive by multiplying 1 as a value of {delta}. Otherwise, we treat it as a penalty using –1 as {delta}-value.

After computing the proposed weights and expected cross-entropy {triangleq}(Fi|Mi) values for all features as introduced above, a certain feature with the smallest {triangleq}(Fi|Mi) value is removed according to MB rule. Then, the HCFS and LCFS for all but the removed feature are rebuilt. In fact, only those subsets that contain the removed feature will be affected and modified. Again, we compute the {triangleq}(Fi|Mi) values and the proposed weights for the survived features. The feature weights will be accumulated during the MB feature removal process. The whole procedure keeps going until a predefined feature number is reached.

2.6 Feature pruning by backward feature selection
After each iteration of 10-fold CV (20 times of 10-fold CV were applied, totalling 200 iterations), a mean weight value for each feature is calculated using the accumulated weight divided by the occurrence counting in all LCFS. All the features are then ranked according to the mean weights. Then, a backward feature elimination method with the top 60 features is performed by using a linear SVM to find a compact feature subset during each 10-fold CV iteration. The selection of top the 50, 60 or 70 features makes not much difference for the final results except that the computational cost is different. However, the selection of top the 40 features tends to give inferior performance from the experiment. The backward feature selection approach removes one feature at a time from the current feature set. Each time a feature without which the accuracy is improved will be excluded. This process continues until one feature remains. Features with the best accuracy form an optimal feature subset for the pair of classes. With the optimal feature subset, a one-against-one binary classifier for certain pair of classes is built. This method is performed for exhaustive pairs of classes.

After finishing all the iterations of the 10-fold CV training, we count how many times a single feature is included in individual optimal feature subsets for each pair of classes. A final feature set is sorted according to the frequency. The higher the frequency, the more reliable a feature is.

2.7 Retraining
In Section 2.5 and 2.6, a method to find important features for each pair of classes was presented. The final sample prediction will be determined from the outputs of ECOC and PWC schemes. If two resultant class labels are the same (ce=cp), the test sample is assigned to the identical class; otherwise (ce!=cp), a retraining will be performed using samples that only belong to classes ce and cp excluding other class samples. Therefore, the retraining comes to be a binary classification problem. In retraining, we use the feature subset which is found for the two classes ce and cp by EMB using the training dataset. As a consequence of the binary classification, the final class is determined for the test sample.


    3 EXPERIMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The liver cancer data consists of 201 spectra containing three categories, HCC (78), cirrhosis (51) and health (72). In this article, all the samples were used to find biomarkers in such a multi-category liver cancer study. We implemented the proposed algorithm based on LIBSVM (Chang and Lin, 2001) and the WEKA library (Witten and frank, 2005). A linear SVM was adopted in retraining and in both ECOC scheme and PWC scheme. For the ECOC scheme, a random coding strategy was used in which values of +1, –1 were selected uniformly at random to generate codes.

According to the BW ratios after preprocessing, we ranked the 23 846 binned peaks (features) and performed experiments with the top 100, the top 200, and so forth, up to the top 500 peaks. The performance of our method was compared with other classification algorithms including standard ECOC, RF, Naive Bayes and J48 methods. In all experiments, 10-fold CV was applied.

Table 1 shows the experimental results in terms of accuracy means and SDs. The individual accuracy for each class is the ratio of correctly labeled samples over the real ones while the overall accuracy is with respect to the total correctly labeled samples for all the classes. The proposed method shows the best accuracy of 88.71% when the experiments were done with the top 300 peaks ranked by BW ratios. The corresponding accuracies for cirrhosis, HCC, and health are 86.86, 89.62 and 89.03%, respectively. As a comparison, J48 achieved the least satisfactory performance in all the trials.


View this table:
[in this window]
[in a new window]

 
Table 1. The means and SDs (in parenthesis) of accuracies in liver cancer dataset

 
In the PWC scheme, for each pair of classes (cirrhosis versus health, cirrhosis versus HCC and health versus HCC), we count how many times each peak is included in the optimal feature subset after the backward feature elimination (See Sections 2.5 and 2.6). Peaks with high frequencies are more reliable than randomly observed biomarkers. In the situation of best overall accuracy when 300 peaks were chosen, the frequencies of all optimal feature sets after backward feature elimination were counted. Among the 60 peaks, the top 20 peaks, sorted by the frequency, were listed in Figure 2a. Furthermore, we grouped the 20 peaks within 3 Da as shown in Figure 2b. Note that some peaks were commonly selected: 1796.8241, 1961.7096, 1961.9057 in cirrhosis versus health and cirrhosis versus HCC; 2373.3603 and 2373.5976 in cirrhosis versus HCC and health versus HCC; and 2389.7923 in cirrhosis versus health and health versus HCC. This further indicates that not a single biomarker but a group of them contributes to the expressions of different phenotypes. We compared peaks selected by our method with those found in cirrhosis versus HCC experiments by Ressom et al. We observed that 1865.4808 and 1797.0038 m/z, corresponding to rank 1 and 2 by our algorithm, belong to m/z windows 1864.0–1870.2 and 1793.1–1797.0, respectively, which were selected by ACO-SVM. In particular, 1865.4808 m/z is top-ranked in our algorithm as well as in both methods of weighting factor and ACO-SVM by Ressom et al.


Figure 2
View larger version (32K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. (a) The m/z values for the top 20 observed peaks out of 300 peaks obtained by BW ratios. (b) Grouping of peaks within 0.3 Da from (a).

 
Figure 3 represents the average intensity for peaks in Figure 2a. We would note that in each pair of classes, a more severe stage of the disease shows a higher intensity distribution. This may imply more protein secretion relevant to the disease or proteins from a secondary effect, such as an inflammatory response. In particular, there are a few peaks which have significant intensity difference in cirrhosis versus HCC and health versus HCC: peaks ranking 2, 9, 15 and 20 corresponding to 1797.0038, 2390.9874, 2373.5976 and 1961.9057 in cirrhosis versus HCC; peaks ranking 6, 9, 10, 13 and 19 corresponding to 2536.0392, 2535.0250, 2535.7856, 2535.2785 and 2373.5976 in health versus HCC. Note that three peaks 2535.0250, 2535.7856 and 2535.2785 among five high intensity peaks in health versus HCC were grouped as seen in Figure 2b. Also, peak 2373.5976 was commonly found in cirrhosis versus HCC (ranking 15) and health versus HCC (ranking 19) experiments, with much higher average intensity in HCC. In cirrhosis versus HCC, however, the top-ranked peak 1865.4808 not only in our algorithm but in both methods by Ressom et al. does not show a considerable difference of intensity. This further evinces that the importance of possible biomarkers is not only purely determined by its absolute volume but also by its correlation or co-regulation with other biomarkers.


Figure 3
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Average intensities for the top 20 peaks observed by the proposed method (corresponding to the peaks listed in Fig. 2). In the graphs, color green, blue and red represent health, cirrhosis and HCC, respectively.

 

    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Disease progresses in several stages. It is important to diagnose the exact current stage of patients in order to provide proper treatments. Therefore, diagnosing multi-category diseases and finding biomarkers corresponding to each category are imperative. The work presented in this article echoes the necessity. We proposed a new multi-class sample classification scheme with simultaneous feature selection. The classification framework is formed by the integration of a redesigned ECOC scheme and a PWC scheme with each scheme producing its best prediction. If the two predictions are the same, the identical class label is assigned for a test sample; otherwise, a retraining is carried out only with the two-class samples excluding samples from other classes. Also, we proposed a feature selection method, EMB, within the multi-class classification framework. EMB chooses features by considering two aspects of biomarkers: redundancy and relevance. The reliability of features is also taken into consideration based on the appearance frequency. A final optimal feature set is discovered for pairwise categories. Experimental results using multi-category liver cancer data demonstrated the performance of the proposed work. A comparison study using different multi-classification approaches was also presented. Distinct biomarker patterns were found between pairwise categories.

For the next stage of biomarker identification, finding exact m/z values is critical to sequencing candidate selection, as small mass differences may lead to different protein candidates. Since our biomaker candidates fall into single bins whose size is usually less than 1 Da as opposed to a large window size spanning several Daltons, the proposed approach offers narrow range to choose the sequencing candidates. Followed by the definition of precise m/z mass, sequencing by tandem MS should be applied to identify the chemical formulae of selected biomarkers.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Funding: This work was supported in part by NSF under grants IIS-0612152 and IIS-0612214.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Jonathan Wren

Received on April 4, 2008; revised on May 25, 2008; accepted on June 12, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EXPERIMENTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Allwein EL, et al. Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res (2002) 1:113–141.[CrossRef][Web of Science]

    Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc (1998) 2:121–167.[CrossRef]

    Chang CC, Lin CJ. LIBSVM: a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

    Crammer K, Singer Y. On the learnability and design of output codes for multiclass problems. Mach. Learn (2002) 47:201–233.[CrossRef]

    Dietterich TG, Bakiri G. Solving multiclass learning problems via errorcorrecting output codes. J. Artif. Intell. Res (1995) 2:263–286.

    Dudoit S, et al. Comparison of discriminant methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc (2002) 97:77–87.[CrossRef][Web of Science]

    Fei B, Liu J. Binary tree of SVM: a new fast multiclass training and classification algorithm. IEEE Trans. Neural Netw (2006) 17:696–704.[CrossRef][Web of Science][Medline]

    Guyon I, et al. Gene selection for cancer classification using support vector machines. Mach. Learn (2002) 46:389–422.[CrossRef]

    Hsu CW, Lin CJ. A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Netw (2002) 13:415–425.[CrossRef][Web of Science][Medline]

    Ie E, et al. Multi-class protein fold recognition using adaptive codes. (2005) Proceedings of the International Conference on Machine Learning. 329–336.

    Knijnenburg T, et al. Artifacts of markov blanket filtering based on discretized features in small sample size applications. Pattern Recogn. Lett (2006) 27:709–714.[CrossRef]

    Koller D, Sahami M. Toward optimal feature selection. (1996) Proceedings of the International Conference on Machine Learning 1996. 284–292.

    Passerini A, et al. New results on error correcting output codes of kernel machines. IEEE Trans. Neural Netw (2004) 15:45–54.[CrossRef][Web of Science][Medline]

    Pearl J. Probabilistic Reasoning in Intelligent Systems. (1998) San Mateo, CA: Morgan Kaufmann.

    Platt J. Probabilistic outputs for SVMs and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers. (1999) MIT Press.

    Price D, et al. Pairwise nerual network classifiers with probabilistic outputs. Neural Inf. Process. Syst (1995) 7:1109–1116.

    Pujol O, et al. Discriminant ECOC: a heuristic method for application dependent design of error correcting output codes. IEEE. Trans. Pattern Anal. Mach. Intell (2006) 28:1007–1012.[CrossRef][Medline]

    Ressom HW, et al. Analysis of mass spectral serum profiles for biomarker selection. Bioinformatics (2005) 21:4039–4045.[Abstract/Free Full Text]

    Ressom HW, et al. Peak selection from MALDI-TOF mass spectra using ant colony optimization. Bioinformatics (2007) 23:619–626.[Abstract/Free Full Text]

    Smith RS, Windeatt T. Decoding rules for error correcting output code ensembles. In: Proceedings of the International Workshop on Multiple Classifer Systems. (2005) 53–63.

    Witten I, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. (2005) 2nd. San Francisco: Morgan Kaufmann.

    Wu B. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics (2003) 19:1636–1643.[Abstract/Free Full Text]

    Wu TF, et al. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res (2004) 5:975–1005.

    Yu JS, Chen XW. Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data. Bioinformatics (2005) 21:i487–i494.[Abstract]

    Yu JS, et al. Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics (2005) 21:2200–2209.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Acta Biochim Biophys SinHome page
Y. Pei, T. Zhang, V. Renault, and X. Zhang
An overview of hepatocellular carcinoma study by omics-based methods
Acta Biochim Biophys Sin, January 1, 2009; 41(1): 1 - 15.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/16/1812    most recent
btn316v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Oh, J. H.
Right arrow Articles by Gao, J. X.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Oh, J. H.
Right arrow Articles by Gao, J. X.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?