Skip Navigation


Bioinformatics Advance Access originally published online on September 18, 2007
Bioinformatics 2007 23(20):2788-2794; doi:10.1093/bioinformatics/btm442
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/20/2788    most recent
btm442v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Z.
Right arrow Articles by Yang, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Z.
Right arrow Articles by Yang, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A parsimonious threshold-independent protein feature selection method through the area under receiver operating characteristic curve

Zhanfeng Wang 1, Yuan-chin I. Chang 2, Zhiliang Ying 3, Liang Zhu 4,* and Yaning Yang 1,*

1Department of Statistics and Finance, University of Science and Technology of China, Hefei, 230026, China, 2Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, 3Department of Statistics, Columbia University, New York 10027, USA and 4Department of Gastroenterology, Changzheng Hospital, Second Military Medical University, Shanghai 200003, China.

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Protein expression profiling for differences indicative of early cancer holds promise for improving diagnostics. Due to their high dimensionality, statistical analysis of proteomic data from mass spectrometers is challenging in many aspects such as dimension reduction, feature subset selection as well as construction of classification rules. Search of an optimal feature subset, commonly known as the feature subset selection (FSS) problem, is an important step towards disease classification/diagnostics with biomarkers.

Methods: We develop a parsimonious threshold-independent feature selection (PTIFS) method based on the concept of area under the curve (AUC) of the receiver operating characteristic (ROC). To reduce computational complexity to a manageable level, we use a sigmoid approximation to the empirical AUC as the criterion function. Starting from an anchor feature, the PTIFS method selects a feature subset through an iterative updating algorithm. Highly correlated features that have similar discriminating power are precluded from being selected simultaneously. The classification rule is then determined from the resulting feature subset.

Results: The performance of the proposed approach is investigated by extensive simulation studies, and by applying the method to two mass spectrometry data sets of prostate cancer and of liver cancer. We compare the new approach with the threshold gradient descent regularization (TGDR) method. The results show that our method can achieve comparable performance to that of the TGDR method in terms of disease classification, but with fewer features selected.

Availability: Supplementary Material and the PTIFS implementations are available at http://staff.ustc.edu.cn/~ynyang/PTIFS

Contact: ynyang{at}ustc.edu.cn or czzhuliang{at}126.com

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Diagnostic tests are important tools in the development of effective intervention and treatment for major diseases such as cancer. Due to the central role of protein in the living organism, recent advances in proteomics can shed lights on discovering protein biomarkers that are indicative of a particular disease.

Mass spectrometers such as SELDI-TOF (surface enhanced laser desorption/ionization time-of-flight) and MALDI-TOF (matrix assisted laser desorption and ionization time-of-flight) measure relative abundance of different protein ions or protein fragments (peptides) indexed by the mass-to-charge ratio (m/z). Mass spectrum (MS) data can reveal proteomic patterns or features (e.g. biomarker ions) which may be related to specific characteristics of biological samples and, therefore, can be used for diagnostic purposes. They can also be used for monitoring disease progression and for evaluating treatment and intervention. Other applications of mass spectrometry include pharmaceutical analysis, biomolecule characterization, environmental assessment and forensic analysis. Expressions of protein and peptides are regarded as discriminative features related to the phenotype variations (e.g. cancer subtypes). High-throughput mass spectrometry methods generate an overwhelmingly large number of protein expressions. Protein biomarker selection is therefore a challenging subject that has been extensively studied (Adam et al., 2002; Grizzle et al., 2003; Levner, 2005; Pepe et al., 2001; Qu et al., 2002; Yasui et al., 2003; Yu and Chen, 2005).

Search of the optimal subset of features is commonly known as the feature subset selection (FSS) problem. In biomedical studies, the FSS is usually done by minimizing the area under receiver operating characteristic (ROC) curve (AUC). For a specific classifier, its ROC curve represents the true positive rate as a function of the corresponding false positive rate x. Generally, a ROC curve describes the trade-offs between the true positive rate (sensitivity) and the false positive rate (one minus specificity). The area under the ROC curve is a measure that provides an overall assessment for the performance of a classifier in terms of specificity and sensitivity. In fact, it is the probability that the classifier assigns a higher score to a diseased individual over an unaffected individual (Metz, 1978; Swets, 1988).

Comparison of ROC curves of different diagnostic tests can be informative about their discriminating power (Liu et al., 2005; Metz et al., 1984; Su and Liu, 1993; Zhou et al., 2002). Recently, for microarray data, Ma and Huang (2005) used the threshold gradient descent regularization (TGDR) (Friedman and Popescu, 2004) method for biomarker selection. It proposes a sigmoid approximation to AUC as the objective function for classification. The number of iterations is determined by a cross-validation method. This method may also be extended for MS data; but, in doing so, it tends to select more features than necessary.

In this article, we propose a parsimonious FSS procedure, to be called parsimonious threshold-independent feature selection (PTIFS), for MS expression data. It introduces a sigmoid approximation to smooth the empirical AUC, which is used as the objective function, so that computational complexity can be reduced substantially. Starting from an anchor feature and forcing the directions (or signs) of the features to be the same as in the single-feature classifiers as in the least angle regression (LARS) method (Efron et al., 2004), the feature subset is selected through an iterative algorithm until a pre-chosen criterion is fulfilled. The final classification rule is then constructed based on this feature subset.

The focus of this article will be on feature subset selection in the two-class classification problem. The new approach is threshold-independent, as its threshold parameter is determined through cross-validation. Another prominent feature is that, in the iterative updating steps, highly correlated features are less likely to simultaneously enter the selected feature subset. Results show that the new PTIFS method has comparable classification performance with the TGDR method, with the former tending to select fewer protein features.

The rest of this article is organized as follow. The next section describes the new feature subset selection method as well as the corresponding algorithm for selecting features and for estimating relevant parameters. It is followed by simulation studies and applications to a prostate cancer data set and a liver cancer data set. Concluding remarks are given in the final section.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Suppose there are two classes. Our aim is to select an optimal feature subset such that the linear combination of these features can classify the two classes efficiently. To proceed with the PTIFS method, we need a pre-fixed anchor feature as in LARS (Efron et al., 2004; Ma and Huang, 2005). It is determined by choosing the one that maximizes the empirical AUCs of single-feature classifiers. In the sequel, we refer to a selected feature as an active feature, and a non-selected feature as an inactive feature. The search of the optimal subset is conducted through an iterative inclusion–exclusion updating procedure of the active feature set until a threshold of accuracy measure based on AUC is reached. The threshold parameter is determined by cross validation. The anchor feature stays untouched during the iterative updating steps.

2.1 Area under ROC curve
Let Z be a feature measurement. The simple linear classification rule based on such a single feature can usually be expressed as Z > c for a threshold c. A sample is classified as positive (diseased) if Z > c. Let D and Formula denote, respectively, diseased and normal states. For a specific threshold c, let T(c) = Pr(Z > c|D) denote the true positive rate and Formula the false positive rate. Clearly, both T(c) and F(c) are monotone decreasing functions of c. Thus, for a given false positive rate x, we can invert F to obtain the corresponding threshold c = F – 1(x). Furthermore, the true positive rate can be written as T(F 1(x)), which, when plotted against x, is the ROC curve.

Consider the two-group classification problem with sizes n and m for the diseased and normal groups, respectively. Suppose each subject has p features. It is common in such a problem that p is much larger than n + m. Let Y and X denote, respectively, the p dimensional feature vectors of the normal and diseased groups. For a vector of constants l isin Rp, the linear combination of features l 'X and l 'Y are called linear risk scores. We want to construct a classifier for the diseased and normal groups based on such linear risk scores, with the goal of finding the ‘best’ vector l so that performance of the resulting classifier has certain desired properties.

For a given vector l, the empirical AUC is defined as


Formula 1

(1)
where {psi}(u) = 1 if u > 0 = 0.5 if u = 0 and = 0 if u < 0. Since {psi}(·) is a step function, the function (1) can be difficult to handle. In particular, it may have multiple local maxima and standard calculus-based maximization algorithms are not applicable. We therefore introduce a smoothing function K(·) to smooth the criterion function. Specifically, we will use the sigmoid function K(t) = 1/(1 + exp(– t)), though other smoothing functions can also be used. For sufficiently small h, we have K((l'Xl'Y)/h) {approx} {psi}(l'Xl'Y). The tuning parameter h is referred to as the window width which tunes the degree of smoothness. Thus, the criterion function becomes


Formula 2

(2)
The accuracy of the sigmoid approximation relies on choice of h. Following Ma and Huang (2005) and Gammerman (1996), as a rule of thumb, we choose h such that mini,j Formula . Although it is an approximation, we shall hereafter refer to (2) as the empirical AUC.

2.2 PTIFS: a new approach to feature selection
The first step in our new approach is the identification of an anchor feature. Define a p-vector {theta} = ({theta}1, ..., {theta}p)', where {theta}k = 1 if the median of Formula is greater than that for Formula and {theta}k = – 1 otherwise. That is


Formula

for k = 1, ... , p, where I{·} is the indicator function. For k = 1, ... , p, let ßk be a vector of length p with k-th element being {theta}k and others being 0. We denote the empirical AUCs for risk scores using a single feature by Rk = Rk). Without loss of generality, we assume R1 ≥ Rk, k != 1, and we choose feature 1 as the anchor. It is not difficult to see that the effect of multiplying l by a positive constant on R can be offset by adjusting h accordingly. To make it identifiable, we require that |l1| = 1, where l1 corresponds to the anchor feature.

Denote an active feature set by Formula and the corresponding inactive feature set by Formula c. Since feature 1 is the anchor with |l1| = 1, it is always in the active set. Furthermore, since |{theta}1| = 1 by definition, the coefficient of feature 1 in the linear classifier is always {theta}1. For an active set Formula , the coefficient of the corresponding linear classifier will be denoted by Formula , where Formula for j isin Formula c. The active set is updated by the following iterative procedure. Features in the inactive set are evaluated by their AUC for all the subjects, penalized by their AUC based on the misclassified subjects using the current active set. The objective function (2) is then maximized using only the features in the updated active set. If any coefficients of the updated active features disagree with their corresponding elements of {theta} in sign, then these features are removed from the active set. The algorithm iteratively updates the active and inactive sets with this inclusion–exclusion process until the accuracy measure of AUC reaches the threshold {tau} which is determined by cross-validation. Details of this algorithm are described below.

2.2.1 Algorithm

  1. Find the anchor feature which is Formula = {1} as assumed and set Formula = {phi}, the empty set.
  2. For the current active set Formula , calculate the corresponding coefficients, Formula , of the linear classifier by maximizing criterion function (2). For each k isin Formula c, compute its empirical AUC, Formula , with l = ßk and observations on subjects that are misclassified by Formula .
  3. If Formula for all k isin Formula c or Formula c = {phi}, then stop. Otherwise choose Formula , and update the active set Formula by adding the feature k0 and excluding it from the inactive set Formula c, where {lambda} > 0 is a pre-specified constant.
  4. Update Formula by maximizing objective function (2) with respect to the updated active set Formula in step (iii). Remove set Formula from Formula and add it to inactive set Formula c. If k0 isin Formula , then exclude k0 from Formula c and add k0 to Formula , otherwise add the elements of Formula to Formula c and let Formula = {phi}.
  5. Repeat (ii)–(iv) until Formula , where 0.5 ≤ {tau} < 1.

Step (iii) selects the feature in the inactive set that can best classify the two groups, taking into account the misclassified subjects by the current active set. The pre-specified quantity {lambda} is a weight on the misclassified subjects. The larger value {lambda} takes, the more attention is paid to misclassified subjects. This can be seen from the following arguments. The criterion function Rk in (2) can be partitioned into two parts on different index subsets. One, denoted by Rk1, is the sum over the subset of pairs (i, j) with at least one being correctly classified by the linear classifier based on Formula . The other one, denoted by Rk2, is the sum over the remaining pairs (i, j) that are both misclassified. Simple algebra shows that there exists a positive constant {alpha} such that Formula . It follows that Formula .

By taking Formula as the criterion function for selecting features from the inactive set, one can avoid choosing features that are strongly correlated with the active set. This is because highly correlated features tend to have similar classification performance. If there exists a feature in active set Formula which is highly correlated with k isin Formula c, then Formula , the AUC for k isin Formula c calculated based on the subjects misclassified by Formula , tends to be small. As a result, feature k is unlikely to be included into the active set.

Step (iv) re-evaluates the new active subset based on its classification performance. The sign of the coefficient of an active feature remains unchanged during the course of updating. Evidently, if {theta}k equals to 1 (or – 1), the value of feature k for diseased individuals should be greater (or less) than that for normal individuals. In other words, {theta}k represents an efficient direction of feature k in discriminating the two groups. Therefore, signs of coefficients of active features need to be identified in the whole iterative updating procedure.

In step (v), threshold {tau} can be determined by cross-validation. We uses K-fold cross-validation to determine {tau}, where K is a pre-defined integer. Break the samples randomly into K pieces of equal sizes, and choose {tau} to maximize the cross-validation objective function Formula where Formula is the estimate of l by using our above-mentioned algorithm with respect to data without the v-th subset, and R(– v)(·) is the criterion function defined in 2) evaluated without the v-th subset. Note that, in the process of cross-validation, if {tau} is unrealistically large ({tau} > max R(l)) and cannot be reached, then all the features must be either in the active set or in the temporal storage set Formula . The inactive set is thus empty and, according to criterion (iii), the algorithm converges. Since the threshold parameter {tau} is not pre-specified but rather determined from data, our method is consequently threshold independent.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Simulation
Extensive simulation studies are conducted to evaluate the proposed PTIFS method and to compare it with the TGDR method (Ma and Huang, 2005) in terms of the number of selected features and the misclassification rates of linear classifiers constructed from the selected features. We generate vectors of p = 500 features from two classes. The features are generated from normal distributions with variance 1 and with different means. We assume that only a small fraction {pi} of these features are differentially expressed between the groups. We consider two scenarios, {pi} = 0.01 and 0.05 in our simulations and the mean differences for these differential features are taken to be {delta} = 1.5 or 2.

We take sample sizes n = m = 50 for training data and n = m = 20 for testing data. We use a pre-specified {lambda} = 1 in the following simulation. Choice of {lambda} will be discussed later. For TGDR method, three upper bounds for the maximum number of iterations, 100, 200 and 500, are pre-specified, referring to respectively as TGDR100, TGDR200 and TGDR500 in the following context. The threshold parameter of the TGDR method is taken to be 1.0. We use training data to select features subset and construct linear classifier by PTIFS algorithm. For selecting the threshold {tau}, we use 5-fold cross-validation, i.e. K = 5. We then compute the misclassification rates by classifying individuals in the testing data, and use expression (2) to compute the empirical area under ROC curve. We evaluate the proposed method and make comparisons with the TGDR method by examining the number of selected features and their prediction accuracy as measured by misclassification rates.

We run 1000 simulation replicates using settings as described. The features are generated independently from normal distributions. The numbers of features selected by the proposed PTIFS algorithm and TGDR algorithm are presented in Figure 1 and Table 1. From Figure 1, we can see that PTIFS method generally selects fewer number of features than the true number of features (5 and 25 for {pi} = 0.01 and {pi} = 0.05), but TGDR method tends to select more features than expected when {pi} is smaller (0.01). For larger number of features ({pi} = 0.05), although TGDR method does not select more features than expected, it selects more features than PTIFS method does. For example, compared with TGDR100 and TGDR500, the PTIFS method selects, on average, 3.7 and 19.0 fewer features (Table 1). This can also be seen from Table 2 below for correlated features and Tables 3 and 4 for real data applications. When the magnitude of mean difference {delta} is larger, both methods tend to select fewer features. This is intuitively clear since the two classes are well separated for larger {delta} and consequently fewer features are needed to achieve a good classification. Columns 5–6 in Table 1 show respectively the misclassification rates for training data and testing data. The misclassification rates of PTIFS method are in general smaller than those for TGDR100 and TGDR200 but are slightly larger than those for TGDR500. When {pi} is smaller ({pi} = 0.01), PTIFS method has a better classification performance than the TGDR method. When {pi} is larger, the TDGR method with 500 iterations is better than the PTIFS method in prediction accuracy, but at the cost of selecting more features. For the situation of smaller sample sizes, the results are similar (Supplementary Material).


View this table:
[in this window]
[in a new window]

 
Table 1. Performances of the proposed PTIFS method and TGDR in terms of misclassification rate and number of selected features (n = m = 50 for training data, n = m = 20 for testing data)

 

View this table:
[in this window]
[in a new window]

 
Table 2. Performances of the proposed PTIFS method and TGDR for correlated features (n = m = 50 for training data, n = m = 20 for testing data; correlation coefficient between feature i and j is 0.5|i=j|)

 

View this table:
[in this window]
[in a new window]

 
Table 3. Application of PTIFS and TGDR method to prostate cancer data: number of selected features and misclassification rates

 

View this table:
[in this window]
[in a new window]

 
Table 4. Application of PTIFS to prostate cancer data: ranks and selection frequencies for the top five features

 

Figure 1
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Number of selected features [y-axis are the numbers of selected features; symbols a, b, c, d stand for respectively the four scenarios ({pi}, {delta}) = (0.01, 1.5), (0.01, 2.0), (0.05, 1.5) and (0.05, 2.0)].

 
Extensive simulations for correlated features are also conducted. Simulation settings are the same as in the above, but the features are correlated with correlation coefficient 0.5|ij| for feature i and j. Results are shown in Table 2 for {pi} = 0.05. From Table 2, the number of selected features and misclassification rates decrease with {delta}. For example, the number of selected features for PTIFS decreases from 6.479 to 4.681 when the group difference increases from 1.5 to 2.0. Correspondingly, the misclassification rate decreases from 0.053 to 0.025 for training data and from 0.0.147 to 0.082 for testing data. Comparing results in Table 2 with those in Table 1, both methods select a few more features when the features are correlated. As before, compared with TGDR method, PTIFS method selects significantly smaller number of features. But the prediction accuracy of PTIFS is better than TGDR even when the maximal number of iterations is large

In real-world applications, features tend to be correlated. Since more features than expected could lead to over-fitting and, from a practical point of view, it is not economical to use many (correlated) features as biomarkers for clinical diagnosis, the parsimonious nature of the PTIFS method is an advantage for biomedical applications, particularly when the number of features is relatively large. In addition, the PTIFS method is more stable in terms of standard errors than the TGDR method in both classification accuracy and number of selected features. As a whole, the proposed method is comparable to the TGDR procedure in classification but is more parsimonious in selecting features.

The parameter {lambda} is a weight on the subjects who are misclassifiedby the linear classifier constructed from the former iteration. In step (iii) of PTIFS algorithm, a feature is chosen such that expression Formula is maximized. Therefore, the larger value the {lambda} takes, the more weight is put on the misclassified subjects in the training data set. We now explore the prediction accuracy and the number of selected features for various choices of {lambda}. Figure 2 illustrates the relationship between {lambda} and the number of selected features or misclassification rates. Here, the fraction ({pi}) of differential features is 0.05 and the mean differences for these features are 1.5. The sample sizes are n = m = 50 for training data and n = m = 20 for testing data. From this figure, we can see that the number of selected features decreases with {lambda}. The misclassification rate decreases with {lambda} for training data and increases for testing data. However, when {lambda} = 1, the number of selected features and the misclassification rates are rather stable. Choice of {lambda} is thus relatively immaterial and we have chosen {lambda} = 1 in illustrating the proposed method. Based on the same arguments, we shall take {lambda} = 1 in the following real data example, and we recommend taking {lambda} = 1 in practice.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Relationship of number of seleted features and misclassification rate with {lambda}.

 
3.2 Two real examples
We apply the proposed PTIFS method to two MS data sets, one on prostate cancer and the other on liver cancer. Seventy-five percent of the samples are randomly selected as the training data set. The set of the remaining samples is used as the testing data set. The proposed PTIFS method and TGDR method [TGDR(0.5) and TGDR(1.0) with threshold parameter 0.5 and 1.0 and maximum iterations of 500] are then applied to the training data set and prediction performance is assessed by the testing data. We take K = 5, i.e. we determine the threshold parameter {tau} by 5-fold cross-validation. This procedure is repeated for 1000 replicates.

3.2.1 Prostate cancer data
We first apply the PTIFS method to the prostate cancer MS data (Adam et al., 2002). Original serum samples of prostate cancer were retrieved from the Virginia Prostate Center Tissue and Body Fluid Bank. Serum samples are from four different groups. They are late-stage prostate cancer of stage C or D (CCD), early stage prostate cancer with stage A or B (CAB), benign prostate hyperplasia (BPH) and normal controls (NO). Sample sizes of these four groups are 83, 84, 77 and 82, respectively. Each serum sample was measured by surface-enhanced laser desorption/ionization (SELDI) mass spectrometry for protein/peptide abundance. Peak detection and alignment were operated with Ciphergen ProteinChip Software 3.0. The mass/charge range selected is from 2000 to 40 000Da for analysis. Each SELDI spectrum sample has about 80 peaks identified. There are 779 peaks in total for all the samples. In this article, we only consider the two-class classification problem for all pairs of CCD, CAB, BPH and NO. We also investigate classification of the combined group, CABCD, of prostate cancer of CCD and CAB from the normal (NO) group, and the combined group, CABCD-BPH, of CCD, CAB and BPH from normal (NO).

All results are shown in Table 3 for the following 8 two-class classifications: NO vs. CCD, NO vs. CAB, NO vs. BPH, NO vs.CABCD, NO vs. CABCD-BPH, BPH vs. CCD, BPH vs. CAB and CCD vs. CAB. The averaged numbers of features selected are illustrated in Table 3 (column 3). We can observe from this column that TGDR(0.5) selects the most features with the largest variance and the PTIFS method selects the least features with the least variance. Moreover, on average, PTIFS method chooses 5.7 fewer features compared with the TGDR(1.0) method. The TGDR(0.5) and TGDR(1.0) have roughly the same misclassification rates while the latter tends to be more stable. The misclassification rates of PTIFS are comparable with those of TGDR. Although PTIFS method may have slightly larger misclassification rates in some situations, the number of features selected by PTIFS method tends to be substantially smaller.

The five most frequently selected features in the 1000 random partitioning of the samples (into training and testing data) are shown in Table 4. In this table, m/z denotes the ratio of mass to charge and is the feature we need to select. For example, feature m/z=3896.64 is the most frequently selected (961 times) in the 1000 replicates for NO versus CCD classification. Similarly, feature m/z = 8355.56 ranks the third with frequency 0.840 in NO versus CCD classification. This table shows that the main features possessing higher classification power are consistently selected with high frequencies. For example, feature m/z = 9149.12 is almost always selected and ranks high in all normal versus disease classifications (columns 2–6). Feature m/z = 7819.75 is almost always selected for distinguishing BPH group from other groups (columns 4, 7, 8). Therefore feature m/z = 9149.12 may be an important feature for prostate cancer diagnosis and feature m/z = 7819.75 may be informative in differentiating BPH from other types of prostate cancer.

3.2.2 Liver cancer data
The second example is a liver cancer MS data set. The original serum samples of liver cancer were obtained at the Shanghai Changzheng Hospital, China. These Serum samples include 145 subjects, consisting of 54 hepatoma (H) patients, 39 patients with chronic liver disease (LD) and 52 normal individuals (NO). Chronic liver disease includes hepatitis and hepatocirrhosis. Each serum sample was measured by surface enhanced laser desorption/ionization (SELDI) mass spectrometer for protein/peptide abundance. Peak detection and alignment were operated with Ciphergen ProteinChip Software 3.0. The mass/charge range is from 1000 to 40 000Da. For all the samples 236 peaks are identified. As in the prostate cancer example, we only consider the two-class classification problem for all pairs of H, LD and NO, and classification of the disease group (LD and H), denoted as LDH, from the normal (NO) group.

Table 5 shows the averaged numbers of features selected (the third column). We can observe that, compared with TGDR (0.5) and TGDR (1.0), the PTIFS method selects the smallest number of features and with the least variance. The misclassification rates of the PTIFS method are always less than those of TGDR methods for this data set. Ranks of proportion of the top five features being selected are shown in Table 6. This table shows that the main features possessing higher classification power are consistently selected with high frequencies. For example, feature m/z = 11 866.94 is almost always selected and ranks high in all normal versus disease classifications (columns 2–4). Feature m/z = 10 862 .0 is almost always selected for distinguishing the LD group from the H (column 5). Therefore feature m/z = 11 866.94 may be an important feature for liver cancer diagnosis and feature m/z = 10 862.0 may be informative in differentiating the LD and H sub-types.


View this table:
[in this window]
[in a new window]

 
Table 5. Application of PTIFS and TGDR to liver cancer data: number of selected features and misclassification rates

 

View this table:
[in this window]
[in a new window]

 
Table 6. Application of PTIFS to liver cancer data: ranks and selection frequencies for the top five features

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Feature (protein or peptide) selection is central to the search of protein biomarkers. The area under ROC curve provides a useful measure for classification accuracy using the selected features. The ROC curve characterizes the relationship of sensitivity versus false positive rate (1 minus specificity) of a classifier. The area under ROC curve measures the trade-off between specificity and sensitivity and maximization of which can help selecting informative features.

In this study, we proposed a PTIFS method, which is in the spirit of the least angle regression (LARS) method. This method searches the feature subset for two-class classification through an iterative updating algorithm by maximizing the area under the ROC curve, keeping the efficient direction ({theta}k) of the marginal leaner of each feature unaltered. The threshold parameter of the algorithm is determined by cross-validation.

Simulation studies show that the proposed method has favorable prediction accuracy and is parsimonious in selecting features compared with existing methods. We also applied the proposed method to prostate cancer and liver cancer mass spectrometry (SELDI) data. Results show that the PTIFS method selects less features than TGDR, but has the comparable or better prediction accuracy. The PTIFS method precludes highly correlated proteins (or any expression measurements) to be simultaneously selected and this parsimonious characteristic is particularly useful in biomedical applications.

We have proposed the PTIFS method for two-class classification by using ROC curve. We have confined ourselves to pairwise classification for the multi-class data. Extension of the proposed parsimonious feature selection method requires knowledge of high dimensional ROC surface and proper definition of the target function. This is the issue under investigation and we will report the multi-class problem elsewhere.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This research was supported by China NSF Grant 10671189 and Chinese Academy of Science Grant No. KJCX3-SYW-S02 (Z.W. and Y.Y.), the grants NO.0120914 and NO. 034119832 from Science and Technology Commission of Shanghai Municipality of China (L.Z.) and US NSF and NIH grants (Z.Y.).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on June 16, 2007; revised on August 2, 2007; accepted on August 20, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Adam B-L, et al. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res (2002) 62:3609–3614.[Abstract/Free Full Text]

    Efron B, et al. Last angle regression. Ann. Stat (2004) 32:407–499.[CrossRef]

    Friedman JH, Popescu BE. Gradient directed regularization for linear regression and classification. In: Technical report (2004) CA: Department of Statistics, Stanford University.

    Gammerman A. Computational Learning and Probabilistic Reasoning (1996) New York: Wiley.

    Grizzle WE, et al. Serum protein expression profiling for cancer detection: validation of a SELDI-based approach for prostate cancer. Dis. Markers (2003) 19:185–195.[Web of Science][Medline]

    Levner I. Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinformatics (2005) 6:68.[CrossRef][Medline]

    Liu A, et al. On linear combinations of biomarkers to improve diagnostic accuracy. Stat. Med (2005) 24:37–47.[CrossRef][Web of Science][Medline]

    Ma S, Huang J. Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatrics (2005) 21:4356–4362.[CrossRef]

    Metz CE. Basic principles of ROC analysis. Semin. Nucl. Med (1978) 8:283–298.[Web of Science][Medline]

    Metz CE, Wang P-L, Kronman HB. A new approach for testing the significance of differences between the ROC curves measured from correlated data. In. In: Information Processing in Medical imaging VIII—Deconick F, ed. (1984) The Hague: Nijhof. 432–445.

    Pepe MS, et al. Phases of biomarker development for early detection of cancer. J Natl Cancer Inst (2001) 93:1054–1061.[Free Full Text]

    Qu Y, et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin. Chem (2002) 48:1835–1843.[Abstract/Free Full Text]

    Su JQ, Liu JS. Linear combinations of multiple diagnostic markers. J. Am. Stat. Ass (1993) 88:1350–1355.[CrossRef][Web of Science]

    Swets JA. Measuring the accuracy of diagnostic systems. Science (1988) 240:1285–1293.[Abstract/Free Full Text]

    Yasui Y, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics (2003) 4:449–463.[Abstract]

    Yu J, Chen XW. Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data. Bioinformatics (2005) 21(Suppl. 1):i487–i494.[Abstract]

    Zhou CH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine (2002) New York: Wiley.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/20/2788    most recent
btm442v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Z.
Right arrow Articles by Yang, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Z.
Right arrow Articles by Yang, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?