Skip Navigation


Bioinformatics Advance Access originally published online on January 19, 2007
Bioinformatics 2007 23(5):619-626; doi:10.1093/bioinformatics/btl678
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/5/619    most recent
btl678v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ressom, H. W.
Right arrow Articles by Goldman, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ressom, H. W.
Right arrow Articles by Goldman, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Peak selection from MALDI-TOF mass spectra using ant colony optimization

H. W. Ressom 1,*, R. S. Varghese 1, S. K. Drake 2, G. L. Hortin 3, M. Abdel-Hamid 4, C. A. Loffredo 1 and R. Goldman 1

1Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC 20057 USA, 2Critical Care Medicine, NIH, Bethesda, MD 20792 USA, 3Clinical Chemistry Service, Department of Laboratory Medicine, NIH, Bethesda, MD 20892 USA and 4Viral Hepatitis Research Laboratory, NHTMRI, Cairo, Egypt

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Results
 4 Conclusions
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Due to the large number of peaks in mass spectra of low-molecular-weight (LMW) enriched sera, a systematic method is needed to select a parsimonious set of peaks to facilitate biomarker identification. We present computational methods for matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) spectral data preprocessing and peak selection. In particular, we propose a novel method that combines ant colony optimization (ACO) with support vector machines (SVM) to select a small set of useful peaks.

Results: The proposed hybrid ACO-SVM algorithm selected a panel of eight peaks out of 228 candidate peaks from MALDI-TOF spectra of LMW enriched sera. An SVM classifier built with these peaks achieved 94% sensitivity and 100% specificity in distinguishing hepatocellular carcinoma from cirrhosis in a blind validation set of 69 samples. Area under the receiver operating characteristic (ROC) curve was 0.996. The classification capability of these peaks is compared with those selected by the SVM-recursive feature elimination method.

Availability: Supplementary material and MATLAB scripts to implement the methods described in this article are available at http://microarray.georgetown.edu/web/files/bioinf.htm

Contact: hwr{at}georgetown.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Results
 4 Conclusions
 ACKNOWLEDGEMENTS
 REFERENCES
 
Several proteomic methods show promise in non-invasive screening of accessible fluids such as serum. The characterization of peptides in serum and plasma by mass spectrometry (MS) is one of the promising strategies for biomarker discovery (Tammen et al., 2005; Villanueva et al., 2006). Biomarker discovery through analysis of mass spectral data requires careful experimental design. It is important to take into account population sampling, matching of controls, protocols for unbiased sample collection, uniform sample preparation methods and appropriate mass spectrometric analysis. Sorace and Zhan (2003) reported the possibility of experimental bias in their assessment of surface enhanced laser desorption/ionization time-of-flight (SELDI-TOF) analysis of ovarian cancer. Ransohoff (2005) indicated that bias will increasingly be recognized as the most important ‘threat to validity’ that must be addressed in the design, conduct and interpretation of such research. Bias can occur if the cancer and non-cancer groups are handled in systematically different ways, introducing an apparent ‘signal’ into one group but not the other.

Mass spectra contain true signal and electronic/chemical noise due to contaminants and matrix; this also causes varying baseline (Malyarenko et al., 2005). In addition, mass spectra reflect variability in sample preparation and sample degradation. Previous quality-control experiments identified properties of mass spectrometric measurements that must be accounted for in the analysis (Fung and Enderwick, 2002; Yasui et al., 2003). The impact of these artifacts can be minimized through data preprocessing steps (smoothing, baseline correction, normalization, peak detection and peak calibration). Previously, we applied these methods to preprocess SELDI-QqTOF and matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) spectra (Ressom et al., 2005, 2006). We observed that normalization of spectra contributed most to the decrease in observed variability (Orvisky et al., 2006). In this study, we add one more data-preprocessing step, which eliminates peaks that are associated with known factors other than disease such as age, gender and smoking status. We call this step peak screening. Spectral preprocessing and peak screening is followed by selection of peaks that are associated with the disease under study.

One can distinguish three main approaches for feature selection: filter, embedded and wrapper. The filter approach is commonly used to select features by applying statistical analyses (e.g. t-test, weighting factor, etc.) that recognize differentially expressed peaks between two groups with multiple subjects. The resulting peaks are then used as inputs to a pattern classification algorithm such as support vector machine (SVM). The filter approach provides generic selection of features, not tuned by a given learning algorithm and it is usually fast and easy to interpret. However, it has the following limitations: (1) it selects features based on ‘relevance’ criterion, not on their classification capability; (2) redundant features can exist; (3) features that have strong discriminating power jointly, but are weak individually are ignored.

In the embedded approach, feature selection is part of the training procedure of a classifier. The implementation of this approach depends on the type of the classifier. Guyon et al. (2002) proposed an SVM recursive feature elimination (SVM-RFE) algorithm to recursively classify samples with SVM and select features according to their weights in the SVM classifier. SVM-RFE ranks the features once using all samples, and uses the top-ranked features in the succeeding cross-validation for the classifier. This generates a biased estimation of errors and limits the search space by allowing only the top-ranked features as candidates.

In the wrapper method, features are selected by taking into account their contribution to the performance of a given type of classifier. Thus, the wrapper method allows selection of features based on ‘usefulness’ criterion. It searches for a combination of useful features from the entire set of features. It tends to find features better suited to the classifier, but it also tends to be more computationally expensive than the filter approach. Due to the large number of peaks involved in mass spectra, a systematic method is required to select the best combination of features without examining all possible combinations. Also, wrapper methods are prone to overfitting unless an internal cross-validation method is used during feature selection. Stochastic global optimization methods such as genetic algorithms (GAs), simulated annealing and swarm intelligence (SI) methods are among the ideal candidates for selecting features from a high-dimensional search space. Recently, researchers have investigated the use of these methods to select features including a recent release of ClinProTools that uses GAs to select features for SVM classifiers. In our previous work, we combined particle swarm optimization (PSO) to select SELDI-QqTOF peaks for SVM classifiers (Ressom et al., 2005).

In this article, we present computational methods to identify candidate biomarkers that distinguish hepatocellular carcinoma (HCC) from cirrhosis through MALDI-TOF analysis of enriched low-molecular-weight (LMW) serum samples. As we described previously, denaturing ultrafiltration enriches the LMW fraction of serum by removal of proteins greater than 50 kDa including albumin (Orvisky et al., 2006). The enrichment improves quality of the spectra and allows the analysis of ~300 peptides. Unlike our previous study (Ressom et al., 2006) that identified peaks distinguishing HCC patients from healthy individuals, in this study our goal is to distinguish HCC patients from cirrhotic patients. Cirrhotic patients are at increased risk of developing HCC and monitoring of cirrhotic patients can potentially decrease HCC related mortality rate. To facilitate the identification of the most useful peaks that would lead to discovery of arkers, we applied data preprocessing methods (outlier screening, baseline correction, normalization, peak detection, peak calibration and peak screening). A novel feature selection method that combines ant colony optimization (ACO) with SVM was applied to select a parsimonious set of peaks that achieved high sensitivity and specificity in distinguishing HCC from cirrhosis. The classification capability of the selected peaks is compared with those selected by the SVM-RFE method. Peptide identification and validation of the selected candidate biomarkers is in progress.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Results
 4 Conclusions
 ACKNOWLEDGEMENTS
 REFERENCES
 
Figure 1 illustrates the steps that we applied to select candidate biomarkers. In the following sections, we briefly describe each of these steps.


Figure 1
View larger version (73K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Methodology for biomarker discovery.

 
2.1 Sample collection
High incidence of viral hepatitis and HCC in Egypt presents a serious health problem. The management of the disease would benefit from identification of biomarkers related to this disease. Serum samples of HCC cases cirrhosis cases, and healthy controls were obtained from Egypt from 2000 to 2002. Controls were recruited among patients from the orthopedic and fracture clinic at the Kasr El-Aini Hospital (Cairo, Egypt) and were frequency-matched to cancer cases by gender, rural versus urban residency and age (Ezzat et al., 2005). All participants signed informed consent, provided a blood sample, and answered a questionnaire with demographic information, personal habits, medical history, occupational history and agricultural activities. Blood samples were collected in red top vacutainer tubes by trained phlebotomist each day around 10 AM and processed within a few hours according to a standard protocol. Aliquots of sera for mass spectrometric analysis were frozen at –80°C immediately after processing. Samples were sub-aliquoted at first thaw and all measurements were performed on samples of second-time thawed serum. Hepatitis C virus (HCV) and hepatitis B virus (HBV) markers in serum samples were determined by EIA for HCV antibodies (anti-HCV), HBV antibodies (anti-HBV) and HBV surface antigen (HBsAg) and by PCR for HCV virus (HCV RNA).

2.2 Sample preparation and data generation
The serum samples were enriched by denaturing ultrafiltration and desalting on C8 magnetic beads (MB) as described earlier (Orvisky et al., 2006). The procedure disrupts protein–protein interactions (Tirumalai et al., 2003) and allows an efficient recovery of a LMW serum fraction starting with 25 µl of serum. Eluted peptides were mixed with a matrix solution (3 mg/ml {alpha}-cyano-4-hydroxycinaminic acid in 50% acetonitrile/0.1% trifluoracetic acid), spotted onto AnchorChip target (Bruker Daltonics, Billerica, MA, USA) and analyzed using an Ultraflex MALDI TOF/TOF mass analyzer (Bruker Daltonics). Each spectrum was detected in linear positive mode and was externally calibrated using a standard mixture of peptides. Using the MALDI-TOF MS, we generated a total of 277 spectra, of which 62 were replicate spectra generated from a serum of one healthy individual. The replicated spectra are useful to assess the run-to-run reproducibility of our sample preparation method and the MALDI-TOF MS technology. The remaining 215 were generated using 84 sera from HCC patients 51 sera from cirrhotic patients, and 80 sera from healthy individuals. Each spectrum consisted of about 136 000 m/z values with the corresponding ion intensities over the mass range 0.9–10 kDa.

2.3 Data preprocessing
Our mass spectral preprocessing method began with outlier screening, where spectra whose total ion current differed by more than two standard deviation (SD) from the median total ion current. This method excluded 14 spectra out of 277 MALDI-TOF spectra generated in this study. Our subsequent analyses were done using the remaining 263 spectra (62 replicate spectra from a serum of one individual, 78 from patients with HCC, 51 from cirrhotic patients and 72 from healthy individuals).

The spectra were binned to reduce their dimension. We used a bin size of 100 ppm. The mean of the intensities within each bin was used as the protein expression variable (Villanueva et al., 2004). This method reduced the dimension of each spectrum from about 136 000 m/z values (over the mass range 0.9–10 kDa) to 23 846 m/z bins.

We used the 62-binned replicate spectra to assess the run-to-run reproducibility of our experiment. We transformed each intensity value in the 23 846 m/z bins by computing the base-two logarithm and found the mean log intensity value and SD of the 62 replicate spectra. The Coefficient of Variation (CV) of the 62 log-transformed intensity values ranged between 4.1 and 22.9% with a mean value of 10.5%.

For each of the other 201-binned spectra (78 HCC, 51 cirrhosis and 72 normal), we estimated the baseline by obtaining the minimum value within a shifting window size of 50 bins and a step size of 50 bins. Spline approximation was applied to regress the varying baseline. The regressed baseline was smoothed using the lowess smoothing method. The resulting baseline was subtracted from the spectrum. Then, each spectrum was normalized by dividing by its total ion current.

Prior to peak detection, we split the 78 HCC and 51 cirrhosis spectra into two data sets, training and testing. We used the training data set that consists of 30 HCC and 30 cirrhosis spectra for peak detection and calibration. The 72 spectra from healthy individuals were used for peak screening. The testing data set that consists of 48 HCC and 21 cirrhosis spectra was set aside for later evaluation as a blind data set. A peak is defined as the location of a mass point, where the sign of the slope of the intensity level (ion count) changes from positive to negative and a reasonable intensity level is measured at the mass point. To address the latter requirement, we defined a threshold line and discarded all peaks below it. To accommodate the fact that the noise level decreases as the mass increases, we defined a threshold line that is higher in the low-mass region than in the high-mass region. This is accomplished by first scaling all spectra to an overall maximum intensity of 100 and then discarding peaks under a threshold line that linearly decreases from intensity level 1–0.5 in the mass range 0.9–10 kDa.

Peak calibration was done based on the method proposed by Coombes et al. (2004). All the detected peaks were aligned by coalescing neighboring peaks within and across all training spectra into m/z windows. First, we selected peaks above a threshold line that decreases linearly from 4 to 1%. Then, we combined these peaks if they differed in location by at most 7 bins or at most 0.09% relative mass. Following this, we considered peaks with intensities between the threshold line that decreases from 4 to 1% and another threshold line, which decreases from 1 to 0.5%. These peaks were added into previously identified m/z windows if they fell within 7 bins or at most 0.09% relative mass. Note that this step may increase the width of an m/z window if a peak is added from outside, otherwise the m/z window size remains unchanged except that the number of peaks in that window will increase. We retained m/z windows that consisted of peaks from at least five spectra and discarded the rest.

Finally, we examined each peak for a potential association with covariates such as age, gender, smoking status, viral infection and residency. This analysis was performed on the samples from healthy individuals to unambiguously identify peaks associated to the covariates. Two approaches were considered in this analysis. The first approach fitted a logistic regression model where the independent variables were the intensities of a given peak across all normal samples with the status of a given covariate as the dependent variable. All covariates in this study have binary values including age (young versus. old). The association of every peak to each covariate was determined on the basis of the corresponding statistical significance (P < 0.05) in fitting a logistic regression model. In the second approach, we fitted a linear regression model for each peak. The peak intensity was the dependent variable of the linear regression model, while all the covariates were used as independent variables. For each peak, covariates that were found to be predictors of the peak intensity with a statistical significance of P < 0.05 were identified. A peak was removed from subsequent analyses, if the above two approaches showed statistically significant association with at least one of the covariates.

2.4 Peak selection
Support Vector Machines are learning kernel-based systems that use a hypothesis space of linear functions in high-dimensional feature spaces. In classification problems that involve two classes, linear SVMs search for the optimal hyperplane that maximizes the margin of separation between the hyperplane and the closest data points on both sides of the hyperplane. Thus, parameters of SVMs are determined on the basis of structural risk minimization, not error-risk minimization. Thus, they have the tendency to overcome the overfitting problem. In high-dimensional data classification problems, SVMs have proven themselves as one of the pattern classification algorithms with great generalization ability.

Ant colony optimization studies artificial systems that take inspiration from the behavior of real ant colonies (Dorigo et al., 1999). The basic idea of ACO is that a large number of simple artificial agents are able to build good solutions to solve hard combinatorial optimization problems via low-level based communications. Real ants cooperate in their search for food by depositing chemical traces (pheromones) on the ground. Artificial ants cooperate by using a common memory that corresponds to the pheromone deposited by real ants. The artificial pheromone is accumulated at runtime through a learning mechanism. Artificial ants are implemented as parallel processes whose role is to build problem solutions using a constructive procedure driven by a combination of artificial pheromone and a heuristic function to evaluate successive constructive steps. Our motivation for using ACO for feature selection is due to its efficiency and capability in identifying a set of interacting variables that are useful for classification. Also, ACO allows the integration of prior information into the algorithm for improved feature selection.

Through the probability function given below, each ant picks n sets of distinct features from L candidate peaks:


Formula 1

(1)
where {tau}i (t) is the amount of pheromone trail at time t for the feature represented by index i; {eta}i represents prior information (e.g. univariate t-statistic) for the feature represented by index i; {alpha} and ß are parameters that determine the relative influence of pheromone trail and prior information.

At t = 0, {tau}i (t) is set to a constant for all features. Thus, at the first iteration, each ant chooses n distinct features (a trail) from L features with probabilities proportional to the existing prior knowledge. Let Sj be the jth ant consisting of n distinct features. Depending on the performance of Sj, the amount of pheromone trail for Sj is updated. The performance function is evaluated on the basis of disease state classification capability of each Sj. We use the features in Sj to build a classifier and estimate the classification accuracy through the cross-validation method. The amount of pheromone trail for each feature in Sj is updated in proportion to the corresponding classification accuracy using


Formula 2

(2)
where {rho} is a constant between 0 and 1, representing the evaporation of pheromone trails; {Delta}{tau}i (t) is an amount proportional to the classification accuracy of Sj, {Delta}{tau}i(t) is set to zero if fi {notin} Sj. This update is made for all N ants (S1, ..., SN). Note that at t = 0, {Delta}{tau}i(t) is set zero for all features. The updating rule allows trails that yield good classification accuracy to have their amount of pheromone trail increased, while others gradually evaporate. As the algorithm progresses, features with larger amounts of pheromone trails and strong prior information influence the probability function to lead the ants towards them.

ACO-SVM combines ACO and SVM to select peaks that are useful for SVM classification of two groups. ACO starts with a population of N peak sets, where each peak set consists of a pre-specified number (n) of distinct peaks. Each peak is selected from a given set of candidate peaks (L) based on its probability function described previously in Equation (1). SVM classifiers are then built for each peak set and the performance of the peak set in distinguishing the two groups is evaluated through the 4-fold cross-validation method. Using Equation (2), we update the amount of pheromone trail for each peak in proportion to the classification accuracy of the peak set, in which the peak is involved. The goal is to provide those peaks that can lead to improved classification accuracy with better probability of being selected in subsequent iterations. In the following, we illustrate the hybrid ACO-SVM algorithm through an example, where we applied the algorithm to select five peaks from a total of 228 candidate peaks.

In this example, the parameters of ACO-SVM were set to the following values: N = 50, L = 228, n = 5, {alpha} = 1, ß = 1 and {rho} = 0.1. Also, the weighting factor proposed by Golub et al. (1999) was considered as prior information. Hence, {eta}i = |µ1i µ2i|/({sigma}1i + {sigma}2i) was used in Equation (1), where µ1i and µ2i represent the mean intensity of peak i in the first and the second group, respectively. Similarly {sigma}1i and {sigma}2i denote the corresponding SDs. We ranked the peaks from 1 to 228 based on decreasing order of weighting factor. Figure 2 depicts the location of each of these peaks in a two-dimensional space. Note that the dimension of the search space and the order of the peaks in the search space do not play a role, because the objective here is to maximize classification accuracy, not the distance between points. At the 1st iteration, each ant chose five peaks from 228 peaks based on Equation (1) that assigns each peak a probability of being selected. This probability function takes only the weighting factor into account during the first iteration. This is because each peak has the same amount of pheromone trail initially. In subsequent iterations, the product of amount of pheromone trail and weighting factor constitute the probability function. Figure 2 (top left figure) shows all 50 ants chosen at the first iteration, where each ant is represented by a trail of five connected circles. The classification capability of each peak set was estimated by the 4-fold cross-validation method. The top right table in Figure 2 shows the ants (peak sets) for the 1st iteration and their corresponding classification accuracy (only the top and bottom three ants are shown, sorted in decreasing order of classification accuracy). {Delta}{tau}i(t) in Equation (2) is determined for each peak in proportion to the classification accuracy of the peak set it belongs to. Based on this value, the amount of pheromone trail for each selected peak will be increased (Equation 2). This implies that the probability function for these peaks will increase proportionally, thereby increasing the chance of being selected in the next iteration. Peaks that were not selected will have zero {Delta}{tau}i(t). Thus, their corresponding amount of pheromone trail will decrease as a result of the evaporation constant ({rho}). The algorithm searches iteratively for more useful peaks guided by the estimated classification capabilities of the peak sets. At the 100th iteration, ants seemed to converge to some trails (Fig. 2, center left figure), primarily to trails that consist of features ranked favorably by weighting factor. At the 236th iteration (bottom left figure), all ants (50 peak sets) converged to one trail that goes through the peaks labeled by indices 1, 2, 8, 10 and 29. Note that at each evaluation, the training spectra are reshuffled. Thus, a repeat of the cross-validation test for the same peak set may result in a different estimate of classification accuracy. In this example, the classification accuracy range across the 50 ants improved from 48.3–83.3% at the 1st iteration to 80.3–96.7% at the 100th iteration, and converged to 96.9% accuracy at the 236th iteration. The algorithm was run for 500 iterations and no change in the peak set was observed after the 236th iteration. The algorithm took about 6.6 min on a Linux machine with dual processor of each Intel Xeon 3 GHz and 6 GB RAM to complete 236 iterations.


Figure 2
View larger version (82K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Pheromone trails for 50 ants at the 1st iteration (top), 100th iteration (middle) and 236th iteration (bottom).

 

    3 Results
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Results
 4 Conclusions
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Data preprocessing
By applying our peak detection and calibration methods to the training data set that consists of 30 HCC and 30 cirrhosis spectra (binned, baseline corrected and normalized), we obtained 249 m/z windows from 23 846 bins. For each spectrum, the maximum intensity within each window was found, yielding a 249 x 60 data matrix. We quantified the peaks in the 72 spectra from healthy individuals at the 249 m/z windows, yielding a 249 x 72 data matrix. Of the 249 peaks, 21 were found to be associated to at least one of the covariates (P < 0.05) in both peak screening methods described previously, linear and logistic regression. Table 1 presents the demographic and viral infection data for the samples from healthy individuals and the number of peaks associated with each covariate. Some of the peaks were associated to more than one covariate.


View this table:
[in this window]
[in a new window]

 
Table 1. Demographic variables and viral infections and the number of peaks associated for the 72 samples from healthy individuals

 
3.2 Peak selection
The aim is to identify candidate biomarkers that distinguish HCC samples from cirrhosis samples. We used the ACO-SVM algorithm to select peaks (m/z windows) from the training HCC and cirrhosis spectra. Only peaks that are not associated with known covariates were considered. Thus, after removing 21 peaks, a training matrix of 228 x 60 was considered for peak selection. We ran the algorithm 100 times, where each run selected five m/z windows out of 228 m/z windows. A 4-fold cross-validation was used to estimate the classification accuracy. At each evaluation, the training spectra were reshuffled. Thus, a repeat of the cross-validation for the same peak may result in a different estimate of classification accuracy. Each run consisted of a maximum of 500 iterations. Figure 3A and B depicts the frequency of occurrence of the m/z windows selected in 100 runs, sorted by frequency and by weighting factor, respectively. Eight m/z windows were selected in more than 20% of the runs.


Figure 3
View larger version (36K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Frequency of occurrence of peaks selected by ACO-SVM in 100 runs. Panel A: peaks sorted by frequency. Panel B: peaks sorted by weighting factor.

 
To evaluate the generalization capability of the peaks and the SVM classifier determined by the training data set, we used them to classify the spectra in the blind validation set, i.e. the testing spectra that were set aside during the process of data preprocessing, peak selection and building the SVM classifier. We binned, baseline corrected and normalized the testing spectra in the same way as the training spectra. Note that the testing spectra were scaled based on the parameters that were used to scale the training spectra. Figure 4 depicts the receiver operating characteristic (ROC) curves and their corresponding area under the ROC (AUROC) for the eight m/z windows, which were evaluated separately and combined in distinguishing HCC patients from cirrhotic samples in the testing data set. This figure demonstrates the generalization capability of the selected peaks and the SVM classifier in a blind validation set and the advantage of a panel of biomarkers in improving the AUROC.


Figure 4
View larger version (42K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. ROC curves of each of the eight m/z window separately and combined based on the testing spectra (i.e. a blind validation set of 38 HCC and 21 cirrhosis samples that was not involved in peak detection, peak calibration, peak selection or building the SVM classifiers). A colour version of this figure is available as supplementary material.

 
To further evaluate the usefulness of the small set of m/z windows selected by ACO-SVM, we built three SVM classifiers using three sets of features (the 23 846 m/z bins, 228 m/z windows, and the selected eight m/z windows). Note that each classifier was built using the training spectra and evaluated on the testing spectra in distinguishing HCC from cirrhosis. Figure 5 shows that the AUROC for the SVM classifier with only eight m/z windows performed as good as those that used all m/z bins or all m/z windows. Figure 6 depicts the box plots for the eight m/z windows using peaks from both training and testing spectra (i.e. 78 HCC and 51 cirrhosis spectra).


Figure 5
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. ROC curves of three SVM classifiers (all bins, all m/z windows and four m/z windows) based on the testing spectra (i.e. a blind validation set of 38 HCC and 21 cirrhosis samples that was not involved in peak detection, peak calibration, peak selection or building the SVM classifiers). A colour version of this figure is available as supplementary material.

 

Figure 6
View larger version (36K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Box plots for eight m/z windows selected by ACO-SVM (78 HCC and 51 cirrhosis samples).

 
For comparison, we used the SVM-RFE method to select eight m/z windows from the 228 candidate peaks. Table 2 presents the top eight m/z windows selected by SVM-RFE, weighting factor and ACO-SVM. Four overlaps exist between weighting factor and ACO-SVM, of which two were also selected by the SVM-RFE. These two m/z windows are the top two m/z windows selected by ACO-SVM. The table also shows the sensitivity and specificity achieved by these windows in distinguishing HCC from cirrhosis in the testing data set.


View this table:
[in this window]
[in a new window]

 
Table 2. Eight m/z windows in the order of their weighting factor, frequency of occurrence in ACO-SVM runs and SVM-RFE ranking. The sensitivity and specificity are calculated based on the testing spectra

 
Finally, we examined the consistency of the ACO-SVM algorithm in multiple runs and the impact of its free parameters such as the number of variables selected in each run (n) and the number of ants (N) by running the algorithm for n = 3,5,7 and 9 with N = 25, 50 and 100. Each combination was run 25 times. Figure 7A and B shows the frequency plot for the total 300 runs, where nine m/z windows were selected in more than 20% of the runs. Seven of the eight m/z windows found earlier were selected again along with two new m/z windows. The sensitivity and specificity of an SVM trained with the nine m/z windows calculated on the testing spectra were both 90%.


Figure 7
View larger version (37K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Frequency of occurrence of peaks selected by ACO-SVM in 300 runs. Panel A: peaks sorted by frequency. Panel B: peaks sorted by weighting factor.

 

    4 Conclusions
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Results
 4 Conclusions
 ACKNOWLEDGEMENTS
 REFERENCES
 
We present a novel algorithm that combines ant colony optimization with support vector machines to select the most useful peaks from a large number of candidate peaks. The candidate peaks were derived by preprocessing MALDI-TOF spectra of low-molecular-weight enriched sera. Prior to peaks selection, we removed peaks associated to covariates such as age, gender, residency, smoking and viral infection. A small set of peaks selected by the hybrid ACO-SVM algorithm achieved high sensitivity and specificity in distinguishing cirrhotic patients from patients with HCC. Identification of the peptides that the selected peaks represent is in progress. Following peptide identification, the authors plan to perform independent laboratory experiments to validate the candidate biomarkers.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Results
 4 Conclusions
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work was supported in part by the NCI grant 2R01CA85888 awarded to C.A.L. and by the US Army Medical Research and Material Command's Prostate Cancer Research Program grant W81XWH-04-1-0294, NCI grant 1R03CA119288-01A1 and Associate Membership from NCI's Early Detection Research Network awarded to R.G.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on November 19, 2006; revised on December 20, 2006; accepted on January 5, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Results
 4 Conclusions
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Coombes KR, Tsavachidis S, Morris JS, Baggerly KA, Hung MC, Kuerer HM. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. In: Technical Report UTMDABTR-001-04., ( (2004) ) The University of Texas M.D. Anderson Cancer Center. Available at http://www.mdanderson.org/pdf/biostats_utmdabtr-001-04.pdf..

    Dorigo M, et al. Ant algorithms for discrete optimization. Artif. Life, ( (1999) ) 5, : 137–172.[CrossRef][ISI][Medline].

    Ezzat S, et al. Associations of pesticides, HCV, HBV, and hepatocellular carcinoma in Egypt. Int. J. Hyg. Environ. Health, ( (2005) ) 208, : 329–339.[CrossRef][ISI][Medline].

    Fung ET, Enderwick C. ProteinChip clinical proteomics: computational challenges and solutions. Biotechniques, ( (2002) ) 32 (Suppl), 34–41..

    Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, ( (1999) ) 286, : 531–537.[Abstract/Free Full Text].

    Guyon I, et al. Gene selection for cancer classification using support vector machines. Machine Learning, ( (2002) ) 46, : 389–422.[CrossRef][ISI].

    Malyarenko DI, et al. Enhancement of sensitivity and resolution of surface-enhanced laser desorption/ionization time-of-flight mass spectrometric records for serum peptides using time-series analysis techniques. Clin. Chem., ( (2005) ) 51, : 65–74.[Abstract/Free Full Text].

    Orvisky E, et al. Enrichment of low molecular weight fraction of serum for MS analysis of peptides associated with hepatocellular carcinoma. Proteomics, ( (2006) ) 6, : 2895–2902.[CrossRef][ISI][Medline].

    Ransohoff DF. Bias as a threat to the validity of cancer molecular-marker research. Nat. Rev. Cancer, ( (2005) ) 5, : 142–149.[CrossRef][ISI][Medline].

    Ressom HW, et al. Analysis of mass spectral serum profiles for biomarker selection. Bioinformatics, ( (2005) ) 21, : 4039–4045.[Abstract/Free Full Text].

    Ressom HW, et al. Biomarker identification and rule extraction from mass spectral serum profiles. ( (2006) ) Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology: Toronto, ON. 164–170..

    Sorace JM, Zhan M. A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics, ( (2003) ) 4, : 24.[CrossRef][Medline].

    Tammen H, et al. Peptidomic analysis of human blood specimens: comparison between plasma specimens and serum by differential peptide display. Proteomics, ( (2005) ) 5, : 3414–3422.[CrossRef][ISI][Medline].

    Tirumalai RS, et al. Characterization of the low molecular weight human serum proteome. Mol. Cell. Proteomics, ( (2003) ) 2, : 1096–1103.[Abstract/Free Full Text].

    Villanueva J, et al. Serum peptide profiling by magnetic particle-assisted, automated sample processing and MALDI-TOF mass spectrometry. Anal. Chem., ( (2004) ) 76, : 1560–1570.[Medline].

    Villanueva J, et al. Differential exoprotease activities confer tumor-specific serum peptidome patterns. J. Clin. Invest., ( (2006) ) 116, : 271–284.[CrossRef][ISI][Medline].

    Yasui Y, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics, ( (2003) ) 4, : 449–463.[Abstract].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
J. H. Oh, Y. B. Kim, P. Gurnani, K. P. Rosenblatt, and J. X. Gao
Biomarker selection and sample prediction for multi-category disease on MALDI-TOF data
Bioinformatics, August 15, 2008; 24(16): 1812 - 1818.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
G. B. Fogel
Computational intelligence approaches for pattern discovery in biological systems
Brief Bioinform, July 1, 2008; 9(4): 307 - 316.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
M. Hilario and A. Kalousis
Approaches to dimensionality reduction in proteomic biomarker studies
Brief Bioinform, March 1, 2008; 9(2): 102 - 118.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
A. Barla, G. Jurman, S. Riccadonna, S. Merler, M. Chierici, and C. Furlanello
Machine learning methods for predictive proteomics
Brief Bioinform, March 1, 2008; 9(2): 119 - 128.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y. Saeys, I. Inza, and P. Larranaga
A review of feature selection techniques in bioinformatics
Bioinformatics, October 1, 2007; 23(19): 2507 - 2517.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/5/619    most recent
btl678v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ressom, H. W.
Right arrow Articles by Goldman, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ressom, H. W.
Right arrow Articles by Goldman, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?