Skip Navigation


Bioinformatics Advance Access originally published online on March 22, 2005
Bioinformatics 2005 21(10):2200-2209; doi:10.1093/bioinformatics/bti370
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
21/10/2200    most recent
bti370v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (15)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yu, J. S.
Right arrow Articles by Trajanoski, Z.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yu, J. S.
Right arrow Articles by Trajanoski, Z.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data

J. S. Yu 1,2, S. Ongarello 2,3, R. Fiedler 2, X. W. Chen 4, G. Toffolo 3, C. Cobelli 3 and Z. Trajanoski 5,*

1School of Electronics Engineering and Computer Science, Peking University China
2Institute for Genomics and Bioinformatics, Graz University of Technology 8010 Graz, Austria
3Department of Information Engineering, University of Padova Italy
4Electrical Engineering and Computer Science Department, Information and Telecommunication Center, University of Kansas USA
5Institute for Genomics and Bioinformatics, Christian-Doppler Laboratory for Genomics and Bioinformatics, Graz University of Technology 8010 Graz, Austria

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 

Motivation: High-throughput and high-resolution mass spectrometry instruments are increasingly used for disease classification and therapeutic guidance. However, the analysis of immense amount of data poses considerable challenges. We have therefore developed a novel method for dimensionality reduction and tested on a published ovarian high-resolution SELDI-TOF dataset.

Results: We have developed a four-step strategy for data preprocessing based on: (1) binning, (2) Kolmogorov–Smirnov test, (3) restriction of coefficient of variation and (4) wavelet analysis. Subsequently, support vector machines were used for classification. The developed method achieves an average sensitivity of 97.38% (sd = 0.0125) and an average specificity of 93.30% (sd = 0.0174) in 1000 independent k-fold cross-validations, where k = 2, ..., 10.

Availability: The software is available for academic and non-commercial institutions.

Contact: zlatko.trajanoski{at}tugraz.at


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
The novel biotechnology of high-throughput and high-resolution MALDI-TOF (matrix-assisted laser desorption and ionization time-of-flight) mass spectrometry (MS) makes it promising to explore the low-molecular-weight (LMW) region of the blood proteome for the diagnosis of significant patterns for various diseases (Lilien et al., 2003; Liotta et al., 2003; Petricoin and Liotta, 2003, 2004; Wulfkuhle et al., 2003). Molecular and statistical approaches to identifying ovarian cancer in the early stage are urgently needed, and much work has already been done (Anderson et al., 2003; Bao-Ling et al., 2002; Petricoin et al., 2002a,b; Qu et al., 2002; Vlahou et al., 2003; Wu et al., 2003; Yu et al., 2004). In this work we considered the SELDI-TOF (surface-enhanced laser desorption and ionization time-of-flight) low-resolution and high-resolution raw MS data provided by National Cancer Institute (NCI),1 relative to a study conducted to discriminate ovarian cancer from normal tissue. The published high-resolution data achieved with extensive quality control and assurance (QC/QA) analysis allow superior classification patterns when compared to those obtained with low-resolution instrumentation (Conrads et al., 2004). Recently NCI has published on its website the ovarian high-resolution QqTOF SELDI data mining results of the concordant m/z regions found by some particular classifications, with both sensitivity and specificity of almost 100%. Even now, universal robust methods for identifying ovarian cancer from MS data are still in development.

One of the best challenges is to keep the discriminatory features between two classes of interest while reducing the intolerable dimensionality (Duda et al., 2001). Petricoin et al. (2002a) used genetic algorithms and self-organizing clustering analysis to extract the discriminatory proteomic pattern from the low-resolution training set, achieving sensitivity and specificity of 100% simultaneously in some particular testing trials. Since the result of genetic algorithm converges to a local optimal solution, distinct random initializations of this iterative searching algorithm may lead to distinct solutions. This brings some problems when trying to identify significant biomarkers. Still, it is believed that there should be some interesting relationships between the extracted discriminatory patterns (Zhu et al., 2003). From the ovarian high-resolution MS data, Conrads et al. (2004) showed that the most frequent m/z ratios extracted by a similar method are 845.0895, 8602.237 and 8709.548. Vlahou et al. (2003) tested the method of classification and regression tree (CART) on the ovarian cancer discrimination from benign diseases and healthy controls, which resulted in a cross-validation accuracy of 81.5%.

In this work we have developed a novel method for dimensionality reduction and tested on a published ovarian high-resolution SELDI-TOF data set. We will show that the accuracy can be improved to 85–90% on the binned MS data. Several statistical methods for the classification of ovarian cancer based on MS spectra have been compared in Wu et al. (2003), and the method of random forest was demonstrated to outperform other methods like linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor (k-NN), bagging (Bauer and Kohavi, 1999) and boosting classification trees.


    2 SYSTEMS AND METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
We make the following assumptions:

  1. If the intensity distributions of control and cancer are distinct at a specific m/z, this m/z ratio is supposed to be a useful feature to the classification.
  2. For the control and cancer data respectively, the m/z at which the random variable of intensity has a small coefficient of variation (CV) is representative.

The MS dataset can be written as S = {(xi,yi)|xi Rm, yi = ±1, i = 1,2,...,n, where xi is an intensity vector according to a sorted sequence of m/z ratios and yi is the class label of xi (–1 for the healthy, +1 for cancer). A binary optimal classifier is a function f:Rm -> ±1 such that f(xi) = yi for both a training subset and testing subset of S. When the feature space is high-dimensional, feature selection becomes crucial as the first step towards pattern recognition. For the raw ovarian high-resolution SELDI-TOF dataset composed of 95 control samples and 121 cancer samples, the dimension of the original feature space is over 370 000.

To improve the performance of identifying ovarian cancer, we make use of a four-step data preprocessing procedure: (1) binning, (2) Kolmogorov–Smirnov (KS)-test based feature selection, (3) restriction of coefficient of variation and (4) discrete wavelet transformation. All the procedures do not depend on the particular classifier that will be used later, since they act just on the numerical characteristics of the MS data. Therefore, various classifiers could be trained and tested on the preprocessed data.


    3 BINNING OF RAW MS DATA
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
In the first step, binning of raw MS data is performed (Fig. 1). Since the length of the observed m/z sequence varies in the raw MS data, we prepared the data as follows:

  1. Align the study sets according to the sorted union of m/z ratios into an intensity frame with missing data.
  2. To reduce the dimensionality, we binned the frame, at a given bin length l > 0, into a matrix A of m-by-n, where n = 216 (121 ovarian cancer samples and 95 control samples) and m is determined by l (Table 1). Each bin is an interval of the form [b,b+l], where b,l R+. For the binned data, without ambiguity, the m/z ratio stands for the left boundary of an interval. After binning with b N, l = 1, the dimension is reduced from 373 401 to 11 301.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 1 Classification tree on the binned MS ovarian data, where xi denotes the intensity at the i-th binned m/z ratio.

 

View this table:
[in this window]
[in a new window]
 
Table 1 Mass spectrometry data matrix of control and cancera

 
The missing data are ignored in binning. Using 0–1 cost and 10-fold cross-validation, the classification tree achieves a precision of 85–90% on the binned MS data. But since CART is a greedy algorithm that decreases the Gini impurity the most at each step (Duda et al., 2001), in general it does not guarantee an optimal reduction of entropy. The precision of the decision tree is regarded as a reference base line for the ovarian cancer diagnosis.


    4 KS-TEST BASED FEATURE SELECTION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
For each m/z ratio ri, we compare the distributions of values in data vectors Xi = (xi,1,...,xi,k) and by a two-sided KS-test (i.e., the null hypothesis H0 is that Xi and have the same distribution) with a given significance level {alpha}. For instance with {alpha} = 5%, H0 cannot be rejected at m/z = 703, but rejected at m/z = 8000. The dimension of feature space is reduced to 8094 after choosing only the features that do not pass the KS-test at 5% level.

More flexible feature selection can be based on the reasonably accurate p-values that are guaranteed by , where ni and are the sample sizes of Xi and respectively (Lehmann, 1975). As a comparison, Figure 2a shows the distributions of raw m/z ratios selected by KS-tests with distinct p-values (ignoring the missing intensity observations) which are quite different from those of the binned data plotted in Figure 2(b).



View larger version (30K):
[in this window]
[in a new window]
 
Fig. 2 (a) The estimated densities of raw m/z ratios selected by the KS-test with different p-values. The most frequent m/z ratios are around 8000, in accordance with m/z regions found by NCI. (b) The same experiments as (a) on the unit-binned m/z values. The dotted lines are the estimated distributions of binned m/z ratios further selected by a coefficient of variation (CV) restriction with threshold t = 0.4. This figure can be viewed in colour on Bioinformatics online.

 
As Liotta et al. (2003) pointed out, the traditional single biomarker for a particular cancer makes little sense from a biological perspective, with poor identification of early-stage cancer and the benign. For instance, the widely used biomarker of cancer antigen 125 (CA125) for ovarian cancer can only detect 50–60% of patients with stage I ovarian cancer. In contrast to the traditional way, the LMW biomarkers foreshow a satisfiable clinical application to early diagnosis of ovarian cancer (Petricoin et al., 2002a, b). From the viewpoint of pattern recognition, the biomarkers are those key features that allow a well-done classification. Usually, classification and feature selection are entangled. To avoid the over-fitting problem, several trials of feature selection are suggested, independent of classifiers as much as possible. Also, in our opinion, the biomarkers could be many selected binned m/z ratios, not necessarily particular m/z ratios if they are not able to yield a satisfiable result (Diamandis, 2004).


    5 RESTRICTION OF COEFFICIENT OF VARIATION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
For a positive random variable X, the coefficient of variation (CV) is defined as c = sd(X)/E(X), which can be estimated by where s and are the sample standard deviation and sample mean respectively. The m/z ratio with relatively small CV is considered as a useful feature for the classification. The CV of intensity for the healthy and cancerous will be considered separately.

Given CV thresholds of intensity, for instance tH = 0.4, tC = 0.4 for the healthy and cancerous, the feature space dimension is reduced from the second to the fourth column in Table 2.


View this table:
[in this window]
[in a new window]
 
Table 2 Feature selection by KS-test and CV restrictiona

 
The estimated distributions of binned m/z ratios selected by the KS-test with distinct p-values and a CV restriction of t = tH = tC = 0.4 are illustrated in Figure 2b. We suggest a threshold such that 85–95% of the binned m/z ratios are included by the respective empirical cumulative distribution function (CDF) of CV for control and cancer. By empirical CDFs of CV, one can choose the probabilities for control and cancer in advance, then get the corresponding CV thresholds (Fig. 3). In the following, we will set p = 0.05 and tH = tC = 0.4, which reduces the dimension of the feature space from the original 373 401 to 6757.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 3 Empirical cumulative distributions of intensity CV for control and cancer on the sets of binned m/z ratios selected by KS-tests with different p-values. Given a CV threshold tH = tC = 0.4, for the healthy, the proportion of m/z features with CV ≤ tH is more than that for the cancerous, especially when – lg p increases. This figure can be viewed in colour on Bioinformatics online.

 
Alternatively, for normally distributed n control (or cancer) intensities at a particular binned m/z ratio, the point estimate of c can be replaced by a (1 – {alpha})-confidence upper bound conservatively, determined by the following equation:

(1)
where is the non-central t distribution function with freedom degree n – 1 and non-central parameter . Especially when c ≤ 0.3, by McKay's theorem (McKay, 1932),

(2)
we have a (1 – {alpha})-confidence upper bound

(3)


    6 WAVELET TRANSFORMATION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
Wavelet analysis has achieved a broad and successful application to pattern recognition in the last decade, such as in image compression, turbulence or earthquake prediction. It is also an efficient way to compress self-similar data, localizing a signal in both time and frequency (Chui, 1992; Daubechies, 1992). Compared with Fourier transformation, wavelets have advantages in analyzing physical situations where the signal contains discontinuities and sharp spikes. Recently, there is a growing interest in applying wavelet analysis to biomolecular related signals (Liò, 2003).

After applying the pyramidal algorithm (Mallat, 1989) of discrete wavelet transformation (DWT, linear complexity), the binned selected MS spectrum is compressed further to a 3382-dimensional vector of approximation coefficients, which contains most key information for classification (Fig. 4). Considering the dimension reduction, the undecimated DWT without restriction on the length of signal (traditionally it must be a power of 2) is not preferred, although it is superior to the ordinary one in many statistical applications (Nason et al., 2000). The mother wavelet used in the experiment is Daubechies family (db4) and the boundary values are symmetrically padded. The vector of approximation coefficients acts as a fingerprint of the original raw MS data, with a compression rate of more than 100. What is more, the two samples are separated in a manner of keeping the main discriminatory information for classification. Theoretically, a heavier compression rate can be achieved, at the risk of losing some useful information, by choosing a higher level of approximation coefficients.



View larger version (31K):
[in this window]
[in a new window]
 
Fig. 4 Single-level DWT of binned m/z data (the X-axis in the first figure is the ordering of binned m/z ratios, not m/z values). This figure can be viewed in colour on Bioinformatics online.

 
When applying the procedure recursively, a KS-test on the dataset of wavelet approximation coefficients, with a significance level {alpha} = 5%, can hardly reduce the dimension of the feature space any more, neither does the reasonable restriction of CV. Therefore, in a sense of data reduction, the vector of approximation coefficients inherits the key discriminatory traits of MS data.

For the high-resolution ovarian data, the vector of detail coefficients contains almost no information for the healthy, since SVMs identify all the data as cancers. In contrast, the detail coefficients calculated from the low-resolution ones lead to acceptable results (Table 3). Without KS-test based feature selection and restriction of CV, the original low-resolution ovarian dataset 8-7-02 (91 controls versus 162 cancers, http://ncifdaproteomics.com/lowresovarian.php) with 15 154 features are well classified by SVMs of Gaussian kernel parameterized by {gamma} = 2, C = 1 (Section 4). Compressed by db2, even the data of detail coefficients yield a good result, especially on the identification of ovarian cancer.


View this table:
[in this window]
[in a new window]
 
Table 3 Average performances of SVMs on the low-resolution and wavelet-reduced ovarian data in 1000 independent 2-fold cross validations

 

    7 CLASSIFICATION BY SVM
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
The SVM method is a widely used classification method of Statistical Learning Theory, originally started by Vapnik and Chervonenkis in the 1960s (Vapnik, 1998). In case that the training set S is linearly separable, the support vector (SV) classifier (Burges, 1998; Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000; Schölkoft, 1997) is the hyperplane (where w Rm,b R) with the maximal margin separating the two classified subsamples of S.

Generally in the linearly non-separable case, we reach at a soft margin allowing training errors (Shawe-Taylor and Cristianini, 2000), where the classifier H is the solution of the optimization problem that is solved by the method of quadratic programming.

Another approach to the linearly non-separable case is the kernel method (Shawe-Taylor and Cristianini, 2004; Schölkoft and Smola, 2002), by which the training data are mapped into a higher dimensional feature space H (may be infinite) and become more separable, provided that the map {phi} makes a kernel function guaranteed by the Mercer's condition (Vapnik, 1998).

Besides these techniques, there are still many significant methods of SVM in the published literature that are far more than what it would be appropriate to include here. Quite a few linkages to free SVM softwares or packages, implemented in C (or C++), Fortran, Java, Perl, R and MatLab are available at http://www.support-vector.net/software.html, for instance some popular ones like Joachims' SVMlight (Joachims, 1999) and Lin's libSVM (Lin, 1999).

Applied to the vectors of normalized wavelet approximation coefficients, the Gaussian radial basis function K(xi,xj) = exp(–{gamma} ||xi xj||2) was adopted as the kernel of the non-linear SV classifier.

The performance of classification, for a more comprehensive review, was examined by (1) k-fold cross-validation, where k = 2,3,...,10; (2) k-fold proportional validation: randomly select 100(1 – 1/k)% controls and 100(1 – 1/k)% cancers as the training set and test the classifier on the remaining samples; and (3) leave-one- out cross-validation. For each k-fold validation, the random experiment was repeated 1000 times independently (Fig. 5).



View larger version (18K):
[in this window]
[in a new window]
 
Fig. 5 (a) Distribution of precision in 1000 independent 2-fold proportional validations. There are totally 106 times that the classifier yields more than 98.3% sensitivity and 97.8% specificity simultaneously in the 1000 random experiments, in which eight times of both 100% sensitivity and 100% specificity. (b) Distribution of SV number in the last experiment. To achieve both 100% sensitivity and 100% specificity, at least 72 support vectors (35 controls and 37 cancers) are needed. The median counts of control SVs and cancer SVs are 35 and 39 respectively. In general, #(Control SVs)/#(Training Controls) < #(Cancer SVs)/#(Training Cancers), except 24 accidents. This figure can be viewed in colour on Bioinformatics online.

 
In leave-one-out cross-validation, six controls and two cancers were misclassified (NCI's high-resolution ovarian data contain 95 controls and 121 cancers). In addition, by the comparison of standard deviation between the classification precisions, all the k-fold (cross, proportional) validations show that the SV classifier is relatively stable to the cancer samples but mutable to those controls (Table 4), which coincides with our intuition about the nature of noise in control data. For each k-fold validation, the average precisions are estimated by the results of 1000 independent random experiments. Obviously, k-fold proportional validation is stricter than k-fold cross-validation, and better at surveying the robustness of classifier for each category. For instance, the worst specificity and sensitivity in 2-fold cross-validations were 85.96 and 90.30%, respectively, while for 2-fold proportional validations they were 76.60 and 86.67% [Fig. 5(a)]. Moreover, the fact of bigger and bigger deviations of control precision in k-fold proportional validations indicates that the SVM overfitting problem seems more serious for the control samples. The same thing also happens to the SV classification on the more reduced data by principal component analysis (PCA) discussed in the next section.


View this table:
[in this window]
[in a new window]
 
Table 4 k-fold (cross, proportional) validation of SVM ({gamma} = 20, C = 0.7) on the preprocessed data

 
Another compelling application is the bagging (Bauer and Kohavi, 1999) of one-hidden-layer neural network. In leave-one-out cross-validation, each training set is resampled 100 times to estimate the output (Fig. 6, Table 5). Totally, only one cancer and four controls are misclassified. However, its complexity is inferior to SVMs.



View larger version (37K):
[in this window]
[in a new window]
 
Fig. 6 The average prediction of each testing sample point in leave-one-out cross-validations, based on the resampled training set.

 

View this table:
[in this window]
[in a new window]
 
Table 5 2,10-fold cross-validations for the bagging of one-hidden-layer perceptron by the backpropagation algorithma

 
Besides decreasing the computational complexity, the procedure of ‘raw MS data -> binned MS data -> KS-test based feature selection -> restriction of CV -> wavelet analysis’ depurates the original data, and explores their category traits for the coming classification. Some other classifiers are able to benefit from the procedure as shown in the next section.


    8 ALTERNATIVE CLASSIFIERS
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
The method illustrated in the previous section leads to a good result in classification, especially when dealing with cancer samples. In an effort to reach a better precision of identifying controls, we tested several algorithms (voted perceptron, discriminant analysis, decision trees, naïve Bayes, some meta learning schemes like bagging and decorate, random forest) on the preprocessed data as described in Sections 2 and 3, but the precisions were generally inferior to those obtained by SVMs (results are shown in Table 6).


View this table:
[in this window]
[in a new window]
 
Table 6 2,10-fold cross-validations of some methods on the preprocessed data

 
Since the high number of features (3382) could affect the performance of the mentioned classification algorithms, we performed a PCA on the preprocessed data, in order to further decrease their dimensionality. Since finding the optimal number of components for accurate classification is a non-trivial task, we applied several algorithms to the PCA-reduced dataset with a different amount of components, and at last selected the first nine as the coordinates of the updated feature space, which explain 90% variance of the data.

In leave-one-out cross-validation, nine controls and two cancers are misclassified by the SVM with parameters {gamma} = 1.7 and C = 0.7. The other experimental results of this SVM are reported in Table 7. If the training set is large enough, for instance as big as those in 4-fold cross-validation, the PCA reduction affects the distribution of sensitivity little. But it is not the case for specificity (Fig. 7).


View this table:
[in this window]
[in a new window]
 
Table 7 k-fold (cross, proportional) validation of SVM ({gamma} = 1.7, C = 0.7) on PCA-reduced data of the first nine componentsa

 


View larger version (42K):
[in this window]
[in a new window]
 
Fig. 7 Estimated densities of precisions in k-fold cross-validations, where k = 2,3,...,10. This figure can be viewed in colour on Bioinformatics online.

 
The classifier of Voted Perceptron (VP) (Freund and Schapire, 1999), a method of combining Rosenblastt's perceptron algorithm (Duda et al., 2001) with Helmbold and Warmuth's leave-one-out method (Helmbold and Warmuth, 1995), is also an approach to small sample analysis, taking advantage of the ‘boundary data’ of largest margin just as SVM does. Using the more reduced data projected on the first nine components, VP achieves a best performance in k-fold cross-validations, compared with quadratic discriminant analysis (QDA), linear discriminant analysis (LDA), Mahalanobis discriminant analysis (MDA), k-NN, naïve Bayes (NB), bagging (bootstrap aggregating), ADtree and J48tree (Table 8). When the pooled covariance matrix of training data is not positive definite, a common solution is to randomly perturb the training data.


View this table:
[in this window]
[in a new window]
 
Table 8 2,10-fold and leave-one-out cross-validations of VP, QDA, LDA, MDA, NB, bagging, k-NN, ADtree and J48tree on PCA-reduced data of the first nine componentsa

 

    9 IMPLEMENTATION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
The software was implemented by OSU SVM toolbox for MATLAB, based on Dr Lin's libSVM-v.2.33. Most of the other classification algorithms were taken or re-adapted from the versions present in WEKA (Witten and Frank, 2000).


    10 DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 
We have developed an efficient method for dimensionality reduction from MS data based on a 4-step strategy: (1) binning; (2) two-sample KS test, (3) restriction of coefficient of variation and (4) wavelet analysis. By efficient preprocessing of high-resolution ovarian MS data, SVMs achieve a satisfiable performance of identifying cancer and the healthy. On the one hand, data preprocessing reduces the dimension of feature space; on the other hand it extrudes the most significant category traits for the coming classification.

Although low-resolution data also lead to a high precision in identifying cancers, a recent study (Baggerly et al., 2004) pointed out many of the problems that seem to characterize this dataset, partly connected with experimental procedure and design, partly with technical problems (especially calibration issues). One of the most relevant observations in our experiments is the fact that it is possible to get very good classification results employing just detail coefficients of DWT. This can either mean that ‘noise’ is in fact still enough for classification purposes, or that the whole data is affected by some form of corruption that prevents achieving a perfect classification.

The classifier-independent data preprocessing of proteomic MS data shows a promising approach to the coming classification. More robust classifiers (such as Bayesian SVM and Bayesian neural network) are still urgently needed, as well as their ensemble. In addition, the precisions could be further improved by some resampling method (Gelman et al., 2004), which assigns every testing sample point a probability of being cancer.


    Acknowledgments
 
This work was supported by the Austrian GEN-AU project BIN (Bioinformatics Integration Network), ÖAD (Austrian Exchange Service) and the EU Marie Curie Training Site grant Genomics of Lipid Metabolism.


    Footnotes
 
1See http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp for details. Back

Received on September 14, 2004; revised on March 1, 2005; accepted on March 1, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 BINNING OF RAW...
 4 KS-TEST BASED FEATURE...
 5 RESTRICTION OF COEFFICIENT...
 6 WAVELET TRANSFORMATION
 7 CLASSIFICATION BY SVM
 8 ALTERNATIVE CLASSIFIERS
 9 IMPLEMENTATION
 10 DISCUSSION
 REFERENCES
 

    Anderson, D.C., et al. (2003) A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J. Proteome Res., 2, 137–146[CrossRef][Web of Science][Medline].

    Baggerly, K.A., et al. (2004) Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics, 20, 777–785[Abstract/Free Full Text].

    Bao-Ling, A., et al. (2003) Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res., 62, 3609–3614.

    Bauer, E. and Kohavi, R. (1999) An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, 36, 105–139[CrossRef].

    Burges, C.J.C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discov., 2, 121–167[CrossRef].

    Chui, C.K. An Introduction to Wavelets, (1992) , New York Academic Press.

    Conrads, T.P., et al. (2004) High-resolution serum proteomic features for ovarian cancer detection. Endocrine-Related Cancer, 11, 163–178[Abstract].

    Cortes, C. and Vapnik, V.N. (1995) Support vector networks. Machine Learning, 20, 273–297.

    Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods), (2000) , Cambridge Cambridge University Press.

    Daubechies, I. Ten Lectures on Wavelets, (1992) , Philadelphia SIAM.

    Diamandis, E.P. (2004) Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: opportunities and potential limitations. Mol. Cell. Proteomics, 3, 367–378[Abstract/Free Full Text].

    Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification, (2001) , New York John Wiley & Son, Inc.

    Freund, Y. and Schapire, R.E. (1999) Large margin classification using the perceptron algorithm. Machine Learning, 37, 277–296[CrossRef].

    Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B. Bayesian Data Analysis, (2004) 2nd edn Chapman & Hall/CRC Press.

    Helmbold, D.P. and Warmuth, M.K. (1995) On weak learning. J. Comput. Syst. Sci., 50, 551–573[CrossRef].

    Joachims, T. (1999) Making large-scale SVM learning practical. In Schölkoft, B., Burges, C.J.C., Smola, A.J. (Eds.). Advances in Kernel Methods—Support Vector Learning, MIT Press, pp. 169–184.

    Lehmann, E.L. Nonparametrics: Statistical Methods Based on Ranks, (1975) , San Francisco Hoden-Day.

    Lilien, R.H., et al. (2003) Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum. J. Comput. Biol., 10, 925–946[CrossRef][Medline].

    Technical report Lin, C.J. (1999) Formulations of support vector machines: a note from an optimization point of view. National Taiwan University, Dept. of Computer Science.

    Liò, P. (2003) Wavelets in bioinformatics and computational biology: state of art and perspectives. Bioinformatics, 19, 2–9[Abstract/Free Full Text].

    Liotta, L.A., et al. (2003) Clinical proteomics: written in blood. Nature, 425, 905[CrossRef][Medline].

    Mallat, S.G. (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Pattern Anal. Machine Intell., 11, 674–693[CrossRef].

    McKay, A.T. (1932) Distribution of the coefficient of variation and the extended ‘t’ distribution. J. R. Statist. Soc., 95, 695–698.

    Nason, G.P., et al. (2000) Wavelet processes and adaptive estimation of the evolutionary spectrum. J. R. Statist. Soc., 62, 271–292[CrossRef].

    Petricoin, E.F. and Liotta, L.A. (2003) Mass spectrometry-based diagnostics: the upcoming revolution in disease detection. Clin. Chem., 49, 533–534[Free Full Text].

    Petricoin, E.F. and Liotta, L.A. (2004) SELDI-TOF-based serum proteomic pattern diagnostics for early detection of cancer. Curr. Opin. Biotechnol., 15, 24–30[CrossRef][Web of Science][Medline].

    Petricoin, E.F., et al. (2002a) Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359, 572–577[CrossRef][Web of Science][Medline].

    Petricoin, E.F., et al. (2002b) Proteomic patterns in serum and identification of ovarian cancer. Lancet, 360, 170–171[Web of Science][Medline].

    Qu, Y., et al. (2002) Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin. Chem., 48, 1835–1843[Abstract/Free Full Text].

    Schölkoft, B. Support Vector Learning, (1997) , Munich R. Oldenbourg Verlag.

    Schölkoft, B. and Smola, A.J. Learning with Kernels, (2002) MIT Press.

    Shawe-Taylor, J. and Cristianini, N. (2000) Margin distribution and soft margin. In Smola, A.J., Bartlett, P.L., Schölkoft, B., Schuurmans, D. (Eds.). Advances in Large Margin Classifiers, MIT Press, pp. 349–358.

    Shawe-Taylor, J. and Cristianini, N. (2004) Kernel Methods for Pattern Analysis. , Cambridge Cambridge University Press.

    Vapnik, V.N. Statistical Learning Theory, (1998) , New York John Wiley & Son, Inc.

    Vlahou, A., et al. (2003) Diagnosis of ovarian cancer using decision tree classification of mass spectral data. J. Biomed. Biotechnol., 5, 308–314.

    Witten, H. and Frank, E. Data Mining: Practical Machine Learning Tools with Java Implementations, (2000) , San Francisco Morgan Kaufmann.

    Wu, B., et al. (2003) Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636–1643[Abstract/Free Full Text].

    Wulfkuhle, J.D., et al. (2003) Proteomic applications for the early dectection of cancer. Nature, 3, 267–275.

    Yu, J.K., et al. (2004) An integrated approach to the detection of colorectal cancer utilizing proteomics and bioinformatics. World J. Gastroenterol., 10, 3127–3131[Medline].

    Zhu, W., et al. (2003) Detection of cancer-specific markers amid massive mass spectral data. PNAS, 100, 14666–14671[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
J. H. Oh, Y. B. Kim, P. Gurnani, K. P. Rosenblatt, and J. X. Gao
Biomarker selection and sample prediction for multi-category disease on MALDI-TOF data
Bioinformatics, August 15, 2008; 24(16): 1812 - 1818.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
M. Hilario and A. Kalousis
Approaches to dimensionality reduction in proteomic biomarker studies
Brief Bioinform, March 1, 2008; 9(2): 102 - 118.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y. Saeys, I. Inza, and P. Larranaga
A review of feature selection techniques in bioinformatics
Bioinformatics, October 1, 2007; 23(19): 2507 - 2517.
[Abstract] [Full Text] [PDF]


Home page
J Biomol ScreenHome page
C. Baumgartner and D. Baumgartner
Biomarker Discovery, Disease Classification, and Similarity Query Processing on High-Throughput MS/MS Data of Inborn Errors of Metabolism
J Biomol Screen, February 1, 2006; 11(1): 90 - 99.
[Abstract] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
21/10/2200    most recent
bti370v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (15)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Yu, J. S.
Right arrow Articles by Trajanoski, Z.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yu, J. S.
Right arrow Articles by Trajanoski, Z.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?