Skip Navigation


Bioinformatics Advance Access originally published online on March 14, 2008
Bioinformatics 2008 24(9):1198-1205; doi:10.1093/bioinformatics/btn089
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/9/1198    most recent
btn089v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xiao, Y.
Right arrow Articles by Segal, M. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xiao, Y.
Right arrow Articles by Segal, M. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Biological sequence classification utilizing positive and unlabeled data

Yuanyuan Xiao and Mark R. Segal *

Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94107, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.

Results: Here, we develop a novel method, likely positive-iterative classification (LP-IC) for this problem, and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies—prediction of HLA binding, and alternative splicing conservation between human and mouse—we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.

Contact: mark{at}biostat.ucsf.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Most sequence-based classification approaches operate under the traditional supervised learning framework, where classifier construction requires a sufficient collection of samples from all predefined classes. However, it has become increasingly common to obtain a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Examples of such properties (labels) include methylation status, transcription factor binding status, microRNA targets and alternative splice sites. The lack of negative data is not uncommon as positive instances are often preferentially investigated and identified. However, their absence precludes use of traditional classification techniques. In such settings, it is seemingly desirable to explore information contained in the large volume of untapped and unlabeled genomic sequence data to enable classification. The utility of unlabeled data in boosting prediction performance has been demonstrated in the field of text classification, where applications have flowed from the explosion of text documents available in digital form. Here, in a similar vein, we investigate potential benefits of integrating unlabeled data into the sequence-based classification setting. We adapt, contrast and improve existing techniques and also devise novel methodologies. The downstream benefit to biologists is allayed cost and labor overhead without sacrificing prediction performance.

Let P designate the positive set and U the unlabeled set, which we assume to have low-positive prevalence. The goal of sequence-based classification utilizing unlabeled data is 2-fold: (i) identify/predict positive instances in U; (ii) devise an accurate classifier despite the small number of labeled instances in P. The learning objective here is therefore ‘non-standard’, in which prediction goals are focused foremost on accurately identifying positives from the existing unlabeled data, rather than future (independent) instances, and obtaining interpretable classification rules is of secondary importance.

The principal strategy employed in handling unlabeled text data (Li and Liu, 2003; Liu et al., 2002, 2003; Yu et al., 2002) involves a two-step process, which progressively identifies strong negatives (SN) from U and iteratively builds classifiers. While this approach, termed SN-IC (strong negative-iterative classification) and detailed in Section 2.1, proves effective and superior to traditional one-class (using only P) or two-class (treating U as negatives) classification methods, there remain several areas for improvement. This is especially true in the context of sequence based prediction, where positive cases in U tend to be much sparser than positives in text classification applications. This mandates high precision in positive identification, which is not prioritized in current algorithmic implementations. In addition, approaches to the critical question of model selection from the series of iterative classifiers are crude: choosing either the first or the last classifier (Li and Liu, 2003) built is far from satisfactory.

Currently, there is just one bioinformatic classification tool for unlabeled data, cluster_boost, developed in the context of hyper-methylation prediction (Goh et al., 2006). The cluster_boost methodology is motivated from imbalanced data classification problems and treats the unlabeled data as negatives. The large inequality of sample sizes of the two classes (63 positive and 14 249 unlabeled instances) is addressed by clustering the unlabeled data into a targeted number of groups and forming an ensemble of classifiers, each of which is trained using P and a derived cluster from U. It then uses a thresholding scheme applied across this ensemble to identify unlabeled instances that are consistently classified as positives. Some general concerns pertain to each of these components: (i) the clustering process relies on many data-dependent heuristic rules, which makes adaption/extension to other settings problematic; and (ii) thresholds are set based on simulations, the applicability of which is unclear; further, sensitivity to simulation specifications and extension to other settings is unexplored. When tested on a synthetic data set consisting of the entire 14 249 actual unlabeled instances spiked with 140 synthetic positives modeled to resemble real positives, cluster_boost identified 1260 instances as positives from U, of which the vast majority (14 249 x (1 – 0.916) = 1197) are false positives, with only 63 being true positives, a meager 5% of all identified positives (see Table 2 in Goh et al., 2006).

Here, we propose LP-IC (likely positive-iterative classification) to effectively utilize unlabeled data for sequence-based prediction. While our proposed approach, LP-IC, also employs an iterative classification scheme (cf SN-IC) and ensemble ideas (cf cluster_boost), it possesses the following novelties:

  • Reflecting the paucity of positives, we iteratively augment P by identifying (likely) positives from U;
  • We conduct model training using P and sub-samples of U, forming an ensemble of classifiers. The sub-sampling addresses class imbalance issue, and the multiplicity of sub-samples allows robust prediction;
  • Most importantly, we introduce a class dispersion measure adopted from unsupervised clustering approaches, to monitor the homogeneity/similarities of the declared positive instances. This enables formulation of rigorous and appropriate stopping criteria for the iteration scheme unlike the arbitrary rules that existing methods employ.

In summary, we propose an iterative classification approach that directly identifies positives from U and employs a novel class dispersion measure to guide model selection. We evaluate the performance of LP-IC by application to two data sets: (i) HLA: identification of HLA (Human Leukocyte Antigen) binding sequences; and (ii) AS: identification of conserved human–mouse alternative splicing sequences. We show that, in both cases, LP-IC compares favorably to existing methods in improving the trade-off between positive recognition rate and false positive rate, significantly reducing downstream validation workload.

This article is organized as follows. In Section 2, we review current methods in unlabeled data classification and, in Section 3, outline details of our proposed method. Section 4 presents details on obtaining and preprocessing of the HLA and the AS data sets. Results obtained from LP-IC along with comparisons with existing approaches are presented in Section 5. Section 6 discusses algorithmic issues, open questions and future directions.


    2 EXISTING METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
So as to position our subsequent development, in this section we introduce a two-step strategy proposed for unlabeled text classification. We also present methods for imbalanced data classification and one-class classification, both of which are also germane.

2.1 SN-IC—strong negative iterative classification
Several algorithms proposed for text classification using unlabeled data were based on a two-step approach. The first step involves identification of a set of strong negatives (SN) from U. Often, this is pursued via classification (Li and Liu, 2003): (all) U is treated as negative, combined with P and a classifier fitted, then instances in U exceeding some (arbitrary) classification threshold are declared SN. Other approaches include spiking in positives (Liu et al., 2002) or utilizing discriminatory univariate features (Yu et al., 2002). Once SN are identified, the next step involves building classifiers based on P and SN. Naive Bayes (McCallum and Nigam, 1998), Support Vector Machines (SVM; Li and Liu, 2003; Yu et al., 2002) and boosting (Goh et al., 2006) have been applied in this context, and SVM and boosting have shown better performance (Liu et al., 2003; Goh et al., 2006); see Christianini and Shaw-Taylor (2000) and Friedman et al. (2000) for details on SVM and boosting, respectively.

In the two-step approach, iterative classification is frequently employed. This starts by training a classifier (SVM, for instance) using P and SN, then using it to predict the remaining instances in U (Yu et al., 2002). The process then proceeds iteratively, each time extracting more negatives from U\SN, combining them with SN, and stopping when no instances in U\SN can be classified as negative. How to select the best classifier from the series of classifiers built at each iteration has yet to be satisfactorily addressed. Yu et al. (2002) assumes the last classifier is the best, whereas Li and Liu (2003) compares the last with the first and chooses the one that classifies P more accurately.

Addressing the imbalance in sample sizes of P and SN at each iteration, we also investigate the effect of applying different training weights to P and SN. Specifically, we weigh training samples inversely proportional to their class (P or SN) sample sizes, and name this approach SN-IWC.

2.2 Imbalanced classification approaches
A salient attribute of the unlabeled data framework in the genomic context is imbalanced class size—labeled positives tend to be sparse whereas unlabeled data are numerous and predominantly negative. A number of techniques have been proposed to handle this. These include over-sampling the minority class under-sampling the majority class or adjusting misclassification costs. We employ a variant of the under-sampling approach similar to cluster_boost, details of which are presented in Section 5.

2.3 One-class classification
Scholkopf et al. (1999) adapted SVM methodology to the one-class classification problem, where only positive training data exists. Essentially, the origin is treated as the only member of the negative class after data transformation via a kernel. A hyper-plane is then determined which separates the transformed data from the orgin with maximum margin. In Section 5, we contrast performance of one class-SVM with LP-IC.


    3 OUR PROPOSED METHOD: LP-IC
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 The algorithm
We now outline our algorithm, LP-IC. Unlike SN-IC, we seek to augment P by successively identifing ‘likely positives’ (LP) from U. This is accomplished by forming an ensemble of classifiers, each trained using updated positives and sub-samples of U. The sub-samples of U are the same size as the updated positive set so as to address class imbalance. Denote the sizes and centroids of P (U) as Np (Nu) and CP (CU), respectively. At the j-th iteration, employing threshold {theta}j (defined below), we have positive set Pj of size Nj, p with centroid CPj, and similarly, we have unlabeled set Uj of size Nj, u with centroid CUj. We conduct the following operations within each iteration:

  1. Obtain k = 1, ..., K sub-samples by randomly selecting sets of size Nj, p from Uj. Treating each sub-sample as negatives and combining it with Pj, perform (two-class) classification, yielding an ensemble of Ck classifiers (k = 1, ..., K);
  2. Classify all instances, i, using Ck and obtain the positive class assignment probabilities Vki and their average Formula .
  3. For each instance u isin Uj, if Vu ≥ {theta}j, then Pj + 1 = Pj {cup} {u} and Uj + 1 = Uj\{u};
  4. Calculate the distance DCj + 1 as the normalized squared euclidean distance of the current positive centroid CPj + 1 to the original positive centroid CP:


    Formula 1

    (1)
    D1 (D2) being the euclidean distance of CPj+1 to the original positive centroid CP (CU).

The thresholds {theta}j govern the rate at which the positive set is augmented by labeling of (putative) positives from U. One method for determining {theta}j is to require that l% of positives in P have probabilities smaller than {theta}j, and prescribe a monotonely decreasing series of l values. When l is set low, more samples from U will be identified as positives, potentially leading to a more accurate classifier for the next step yet, at the same time, risking introducing incorrectly labeled data and so degrading classifier performance. Thus, there is a tradeoff between the number of training samples and noise introduced by labeling error. This gives rise to the important question of model selection discussed next.

3.2 Model selection
We tackle model selection by making recourse to the DC measures detailed previously (Equation (1)). At early iterations, when most of the instances incorporated from U to P are true positives, we anticipate that the updated positive centroid will be very close to the original P centroid, and far from the original U centroid (D1 << D2 in Equation (1)). So, a graph of DC versus either iteration index or number of declared positives will exhibit an early constant ({approx} 0) phase of DC. At later iterations, when negative instances are being included in P, DC will increase markedly. The location of such an elbow or hinge indicates the appropriate stopping point (Tibshirani et al., 2001), which we estimate using hockey stick regression (Bacon and Watts, 1971). In brief, curve-fitting of a piecewise linear model with a connecting quadratic component is effected by non-linear least squares implemented in the R library nls. The iteration corresponding to the junction of the constant and quadratic components is the basis for model selection.

3.3 Performance metric
The performance of unlabeled data classification is measured by its accuracy and precision in identifying positives. To this end, we employ the F score, used frequently in information retrieval. F value is defined as: Formula where p is the precision (positive predictive value) and defined as: Formula where TP and FP are numbers of true and false positives respectively; r is the recall (sensitivity), and defined as: Formula where FN is the number of false negatives. A high precision ensures that the identified positives are predominantly true positives, and a high recall ensures that most of the positives are identified. F values capture the average effect of both precision and recall, and are therefore suitable for our purpose.


    4 GENOMIC DATA EXAMPLE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We evaluate the performance of LP-IC on two different data sets. In both data sets, we spiked positive instances into an unlabeled background, thereby allowing assessment of true positive recovery. These data sets have very different distributional characteristics and discriminative signals between positive and negative instances, allowing evaluation of LP-IC under diverse scenarios.

4.1 HLA data: HLA-A2 binding
HLA-A2 is an allele of MHC (major histocompatibility complex) class I in human, which mostly bind peptides originating from the cytoplasm (endogenous protein or intracellular pathogens) to trigger an immune response that degrades detrimental foreign or self proteins. Peptides that bind to HLA-A2 are 9-mers and frequently possess two hydrophobic amino acid anchor residues: Lysine at position 2 and Valine at position 9. The public database MHCPEP (http://wehih.wehi.edu.au/mhcpep/; Brusic et al., 1996) was used to build a working pool of binders, comprised of known T-cell epitopes and synthetic peptides. Duplicate entries of HLA-A2 binders were deleted as were non-9-mer peptides, resulting in a total of 485 binders. We randomly select 80% (388 cases) as training positives and the remaining 20% (97 cases) constitutes the positives in the test set. Our unlabeled set is generated as follows. First, to mimic naturally occurring peptides, we generated 10 000 9-mers utilizing amino acid frequencies as in SwissProt (Emmert et al., 1994). We assume that the vast majority of such randomly generated 9-mers are unlikely to be MHC binders, therefore the positive prevalence is low. Second, we randomly partition them into training ({approx} 80%) and test ({approx} 20%) data. Third, to monitor the training process and evaluate performance, we spike a set of 78 binders into the training unlabeled instances, constituting a positive prevalence of approximately 1%. So, our training positive set P consists of 310 cases (388–78 = 310), and the unlabeled set U consists of 7998 instances, including 78 spiked-in positives. There is also a set of experimentally confirmed non-binders with 218 instances; as it is relatively sparse, we use it to contrast classification performance with the unlabeled data setting.

We represent amino acid sequences via 12 properties variables, including 10 orthogonal factors obtained from 188 biophysical properties of the 20 amino acids via principal components analysis (Kidera et al., 1985) and two QSAR (quantitative structure–activity relationship) molecular structure descriptors, isotropic surface area and electronic charge index.

4.2 AS data: human–mouse alternative splicing
Alternative splicing of human primary transcripts produces variable mRNA products by the inclusion or exclusion of individual exons. Alternative splicing events conserved across species are likely encoding biological functions of primary importance. Yeo et al. (2005) described sequence features that distinguish exons subject to evolutionarily conserved alternative splicing events from other orthologous human/mouse exons. These include: (i) exon length, (ii) upstream intron length, (iii) downstream intron length, (iv) 5' and (v) 3' splice site score, (vi) nucleotide percent identity between orthologous human and mouse exons, (vii) human–mouse intronic sequence conservation within the last and (viii) the first 150 bases downstream of the exon. Yeo et al. (2005) has curated a training set of 241 exons of the SH,M class and 5066 exons of the Sh,m class, which we use to further evaluate the performance of our LP-IC algorithm. Similar to the HLA-A2 binding data, we first set aside 10% of the instances from both classes (24 positives and 506 negatives) as test data. Among the remaining 217 positives (SH,M exons), we randomly select 47 instances to spike into the negative set (Sh,m), resulting in {approx} 1% positive prevalence. To summarize, our training positive set P consists of 170 instances, and the unlabeled set U is comprised of 47 positives and 4560 negatives.


    5 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
As mentioned, we use two metrics for judging the performance of competing methods in the unlabeled data setting. Foremost is the combined accuracy and precision of positive recovery from U, as measured by the F score. However to ensure that the resultant classifier is not overly adapted solely to positive identification, we also evaluate the overall predictive performance on independent (excluded) test data. A good method should perform well by both criteria. Further, a special feature of the HLA data is the existence of the small negative data set, which we use for comparisons between classifiers built utilizing unlabeled data and those built using sparse experimental negative data.

We compare LP-IC to (i) SN-IC; (ii) SN-IWC; (iii) Undersample-Ensemble, (iv) one-class SVM and (v) traditional two-class SVM. We use SVM as our default classification method. For SN-IC, we identified strong negatives using an ensemble of 40 SVM classifiers, each of which is trained using P and a randomly selected sub-sample from U of the same size of P (Np). The averaged decision values from the SVM predictions are sorted and the instances that have the top Np most negative average decision values are designated as strong negatives, SN. The classifier built in the last iteration of classification model training is deemed the best classifier and used to classify U and the test data. SN-IWC is similar to SN-IC except that training samples are weighted inversely proportional to their sample sizes. Undersample Ensemble adopts a variant of sub-sampling to handle the imbalanced class sizes. It decomposes U into K partitions, where K is selected based on the number of positive instances, and trains K classifiers, each based on P and a derived cluster of U. Each instance in U therefore receives K predictions, and the final prediction is based on collective voting. Undersample-Ensemble is similar to cluster_boost, differing only in the selection of K and the clustering approach. We also compare LP-IC to one-class SVM and the tradition two-class SVM. In two-class SVM, we treat U as negatives and train the classifier using P and U, which is then used to derive predictions on U.

5.1 Identification of positive instances from U
Figure 1 displays the distributions of euclidean distances from instances to centroids for both data sets. It is apparent from the figure that these two data sets have distinct distributional characteristics. In the HLA data, instances in P are more homogeneous than instances in U, giving rise to a smaller mean within-class distance. This behavior is reversed in the AS data where P displays more dissimilarity/dispersion than U.


Figure 1
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Distributions of euclidean distances from positive instances to centroids of P (black solid line) and U (gray dotted line), and unlabeled instances to centroid of U (gray solid line) and P (black dotted line) for the HLA data (a) and the AS data (b).

 
Application of LP-IC We apply LP-IC to the HLA data set, performing 30 iterations of SVM using increasingly relaxed thresholds to declare positives from U: the thresholds ({theta}s) are chosen at the 95, 90, ..., 2% of the decision values of P. At each iteration, 40 sub-samplings of the updated unlabeled data U' were conducted. We use the radial basis function kernel for SVM, which has been recommended as a default (Liu et al., 2003) and shown robust behavior for this data set (Xiao and Segal, 2005). Tuning parameters of the SVM (cost and kernel width) were selected by cross validation. The normalized distance measure DC for the updated positive set (originally labeled + newly declared) was calculated using standardized (mean = 0, variance = 1) features. Traces of DC as a function of iteration index/threshold decrease are displayed in Figure 2. The bottom axis plots threshold decrease as a percentage of the decision values of the original P; left and right axes plot DC and F values, respectively; the top axis indicates numbers of instances identified as positives from U together with attendant (recall, precision) at each iteration. At the first iteration, the threshold is set high, and only eight instances (top axis) from U are predicted as positives, all of which are spike-ins (precision = 1). As the iterations proceed, more positives are identified and included for the next round of classifier training. Note that DC remains largely constant for several iterations, indicating that most of the added positives are similar to the original instances in P, and therefore are potential true positives. However, further relaxing thresholds below the 7–10th percentile of the decision values of P causes the normalized within-group distance to increase; using hockey stick regression (see Figure 1A in the Supplementary Data) places the ‘elbow’ at the 8th percentile, where 109 examples from U are identified as positives. Since we know the identities of the spiked-in positives, we can calculate the F score at each iteration (Figure 2) based on the assumption that only the spiked-in instances are positives. Because it is likely that previously unknown positives are present in the (synthetic) unlabeled set, the recall, precision and F score so calculated are underestimates. The fact that the substantial decrease in F score agrees with the DC elbow indicates that using DC to monitor the extent of positive inclusion from U is effective. Based on our elbow heuristic, we declare that LP-IC identifies 109 examples from U as positives. Among these 109 samples, 63 are spiked-in positives, and 46 are 9-mers from U. Among these 46 identified positives, all but one possess either a Lysine at position 2 or a Valine at position 9, 20 have both anchors, suggesting they are legitimate candidates for binding; see Section 5.2 below for further validation. Application of LP-IC to the AS data produces the DC trace in Figure 2 in the Supplementary Data. Here hockey stick regression (see Figure 1B in the Supplementary Data) selects the 30th percentile of {theta} values giving rise to 189 identified positives from U, and again coinciding with a substantial decrease in F.


Figure 2
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Traces of the normalized euclidean distance from updated positive centroid to the original positive centroid, DC, for the HLA data set. Also plotted in gray and asterisks are F scores.

 
Comparison with other methods We next contrast the performance of LP-IC on both data sets with other existing methodologies. Table 1 gives the results of positive identification using LP-IC, SN-IC, SN-IWC, Undersample-Ensemble, one class-SVM and two-class SVM in the HLA and AS data. The third column indicates the number of identified positives. It is evident from Table 1 that the traditional two-class classification method is ill suited to handle unlabeled data classification, recovering far fewer positives than the other algorithms for both datasets. Two methods, SN-IWC and Undersample-Ensemble, identify far too many positives, giving rise to a low-precision rate (< 15% for HLA), making downstream validation analyses problematic. Our proposed methodology, LP-IC, together with SN-IC, perform the best with F scores of {approx} 0.7 for the HLA data. Positive recovery from U for the AS data proves to be more difficult and all methods fare much worse. SN-IC and LP-IC are still superior, with SN-IC possessing the highest F score, however, this comes at the expense of recovering less than half the spike-ins that LP-IC identifies.


View this table:
[in this window]
[in a new window]

 
Table 1. Results of positive identification and test set classification

 
5.2 Validation of positives identified from the unlabeled set
HLA To validate our 46 LP-IC positive identifications, we made recourse to two different in silico HLA allele specific binding prediction tools: MHCPrep (Guan et al., 2003) and BIMAS (Parker et al., 1994). BIMAS utilizes position weight matrices, whereby a matrix of weights for each amino acid at each peptide position is specified. MHCPred models were trained using IC50 values, obtained from radioligand competition assays characterizing peptide-MHC affinity, as outcome variables, and peptide backbone contribution and a series of pairwise interactions between side chains of increasing sequence separation as predictors. Both tools concatenate the 46 9-mers into one sequence and predict the binding properties of all 406 consecutive (overlapping) 9-mers from this long sequence. This construction allows us to contrast the predicted affinity of the 46 binders identified by LP-IC with the rest (406–46 = 340) of the artifactual 9-mer sequences; the latter we term ‘control’ sequences. MHCPrep predicts 36 of the 46 identified positives by LP-IC (78.3%) to be binders (see Table 1 in the Supplementary Data). In contrast, only 106 of the 340 control sequences (36.1%) are predicted to bind based on the recommended threshold of IC50 ≤ 500 nM. The average IC50 of the 46 predicted binders is 323.7 nM, significantly lower than that of the control sequences (mean > 1115.9 nM; P < 10–16). Further, the 10 9-mers that are categorized as non-binders by MHCPrep are low-confidence predictions by LP-IC, all coming in at the last three rounds (10–8th percentile for thresholding) of the iteration process. BIMAS ranks 9-mers in order of binding affinity based on half time of disassociation calculated using position weights. Among the top 50 9-mers, 33 of them are identified binders by LP-IC from the unlabeled data. These in silico validation results suggest that LP-IC is effective in accurately identifying positives from U.

As among the 189 instances identified as positives by LP-IC from U, 29 are spiked-in positives (out of a total of 47). The remaining 160 instances are from the Sh,m set, for which no conserved cross-species alternative splicing had been observed. To investigate whether these 160 instances were erroneously identified because of a too lenient model selection criterion employed by LP-IC, we compared the distribution of the eight sequence features among the following three sets: 170 training positives from SH,M, 160 identified positives by LP-IC from Sh,m and the rest of the training Sh,m set, labeled P, LP-IC-P and N, respectively. We showcase boxplots of the three most discriminating features in Figure 3, and boxplots for the rest of the features are displayed in Figure 3 in the Supplementary Data. It is apparent that the distribution of each feature in LP-IC-P conforms to those in P, and is divergent from N. Statistical analyses show that sequences in LP-IC-P have shorter exons (P < 2.8 x 10–13), longer upstream and downstream introns (P < 1.8 x 10–5) and higher conservation in the 150-base intron regions immediately upstream and downstream of the exons (P < 2.2 x 10–16) than sequences in Sh,m (see Figure 3 below and Figure 2 in Yeo et al., 2005), all of which are characteristics of the SH,M set. Indeed, testing the three sequence features between P and N also results in highly significant P-values of similar magnitudes as above, whereas testing between P and LP-IC-P yields insignificant P-values in the range of 0.12–0.75. However, the same trend in the sequence features does not exist when positives identified by SN-IWC from Sh,m (SN-IWC-P) were compared to P and N; see Figure 4 in the Supplementary Data. There SN-IWC-P shows evidence of a mixture of P and N, having intermediary medians for all features.


Figure 3
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Feature specific boxplots for the AS data. LP-IC-P represents the 160 identified positives by LP-IC from Sh,m, and P and U denote training positives and negatives respectively.

 
5.3 Predictive performance using the test set
We set aside portions of both the HLA and AS data as test sets (Section 4), enabling computation of test set sensitivity and specificity as given in Table 1. Again, LP-IC exhibits good performance comparable to the best among other algorithms. SN-IWC and Undersample-Ensemble are comparatively accurate, albeit with lower sensitivities. SN-IC, which performed well with respect to positive identification, does poorly here with reduced sensitivity. One class-SVM also fares poorly with sensitivity {approx} 50%. The traditional two-class SVM, by training with mislabeled negatives (positives in U) and by not addressing the class imbalance issue, obtains the worst overall test set performance. Note that sensitivities are under estimated for all methods, since all instances in the test set selected from U are treated as negatives.

5.4 Effect of model selection and size of training positives on SN-IC
The SN-IC methodology builds a series of classifiers and assumes that the last classifier is the best. While this approach proved effective in positive identification for both data examples, there are reasons to suspect that the observed good performance derives, in part, from the existence of sufficient numbers of positive training instances, P. Accordingly, we investigate the impact of reduced P and further explore the benefits of model selection (rather than use of the final classifier) on SN-IC. To this end, for the HLA data, we randomly select 50 and 100 instances from the entire set of P (310 labeled positives) and use them as the training set. The constitution of the unlabeled set is unchanged with 78 spike-ins amongst 7920 synthetic unlabeled instances. In addition, we refined the procedure of negative inclusion in SN-IC to include only the top (by decision values) 5% of the negatives at each iteration so as to examine a wider range of classifiers. The fitness of these 20 classifiers is judged by making recourse to the spike-ins, and hence the derived F scores, again assuming that the spike-ins are the only true positives. The classifier from this series with the highest F score is designated SN-IC:best model, whereas the last classifier is labeled SN-IC:last model. We record, for these two methods and LP-IC, sensitivity and F score of positive recovery from U, sensitivity and specificity of test set. Performance of these three methods is illustrated in Figure 5 of the Supplementary Data. It indicates that performance of SN-IC:last model is sensitive to the size of P. When the size of P is decreased to 100, positive recovery is decreased to 50%, and it drops further to 30% when the size of P is further halved. This sensitivity to the size of P is due to lack of model selection and can be salvaged by using an intermediary, rather than final, model. As expected, LP-IC performed well regardless of the size of P in terms of both positive recovery from U and test set sensitivity and, for the latter, is markedly better than either variant of SN-IC.

5.5 Comparison of predictive performance utilizing unlabeled data and sparse negatives
We next compare the classifier derived by LP-IC to a SVM trained in the traditional two-class setting. Such a setting requires obtaining labeled negative data. This negative data will necessarily be sparse in comparison with genome-scale unlabeled data. To investigate the tradeoffs between utilizing such large unlabeled data sets and small scale additional assay-based labeling, we take advantage of the set of 218 experimentally confirmed HLA-A2 non-binders. We set aside 20% (44) of the non-binders as test data. Among the remaining 80% of non-binders (174 instances), we randomly select 50, 100 or the entire 174 instances for model training using SVM, together with the 310 training binders. To account for the imbalanced class setting, negative and positive instances are weighted inversely to their sample sizes. We compare the predictive performance of models thus built, designated PN-C, to that of LP-IC. Table 2 lists the sensitivity, specificity, aROC and sensitivities at 95% and 99% specificity and the ROC curves of these four models created by thresholding decision values are displayed in Figure 6 of the Supplementary Data. As expected, increasing the size of the non-binder sets yields increasingly improved performance. Nonetheless, LP-IC is still markedly better than the best PN-C model. This superiority is particularly notable when comparing sensitivities at 99% specificity, underscoring LP-IC's capacity to achieve a high positive recognition rate without incurring erroneous negative identifications.


View this table:
[in this window]
[in a new window]

 
Table 2. Comparison of classification models using experimental negative set and unlabeled set

 

    6 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Our focus here was on designing efficient ways to utilize the large volumes of sequence data, generated by various genome projects, for classification problems with only positively labeled instances, for which standard approaches are inapplicable. The characterizing attributes of this setting—imbalance in class sizes, absence of negatives, and primacy of positive identification—mandate novel and custom solutions.

Our proposed method, LP-IC, outperforms existing methods for unlabeled data in positive identification and independent test set prediction performance. Simultaneously, achieving balanced performance on both of these criteria is beyond the majority of current methodologies. By under-sampling instances in U in Undersample-Ensemble or over-weighting instances in P in SN-IWC during the training process, these methods compromise precision in favor of high recall, giving rise to unacceptably high rates of false positives in identified positives from U. On the other hand, by training with an unbalanced set of positives and (identified) negatives, the latter being dominant, SN-IC enjoys high precision but recovers far fewer positives than other methods, leading to poor performance in the category of recall / sensitivity. We showed that this deficiency can be remedied by the implementation of a systematic model selection criterion. However, because of training imbalance the test set performance of SN-IC remains inferior to LP-IC.

The model selection criterion that LP-IC employs differs from those used in conventional regression and classification problems, which often involves some form of penalization for model complexity. Rather, it utilizes class dispersion / distance measures in unsupervised clustering analysis. This reflects the data structure and analysis objective differences imparted by the presence of unlabeled data. The effectiveness of this criterion is suggested by our albeit limited in silico validation results. Our proposal for model selection based on the location of the ‘elbow’ in the DC trace ought be viewed as a guideline. It may be that a series of models is desirable depending on research goals and downstream validation capacity. For instance in the HLA study, if identification of a more stringent set of positives is sought, thresholding {theta} at 15% gives rise to a set of 60 identified positives, 10 of which were originally unlabeled and all of which were predicted as binders by MHCPrep. In both of our exemplary data sets, we utilized spike-ins to examine DC behavior and monitor model selection. F scores were calculated based on the assumption that only spike-ins are true positives in U, which is a clear underestimate of LP-IC's performance in the HLA data. Nonetheless, the fact that the substantial decrease in the F score agrees with the DC elbow indicates that using DC to monitor the extent of positive inclusion from U is effective and the usage of spike-ins can be omitted to preserve positive training samples.

The two cases studies that we have employed in this article have very different distributional characteristics for P and U and, as a result, distinct predictive performance. Classification is challenging for the AS data as evidenced by the overlapping feature distributions shown in Figures 3. The test set performance for AS has specificity at 0.89 and sensitivity at 0.83, slightly lower than the 90% obtained by Yeo et al. (2005), but both had commensurate AUC for the ROC curve. The fact that we have identified a subset of 160 instances in Sh,m, which, despite not showing alternative splicing evidence, have distributional attributes much more similar to SH,M than to Sh,m, suggests the possibility that some data might be mislabeled.

We presented results of LP-IC primarily on SVM, nevertheless performance is not classifier specific and is consistent when boosting is employed in place of SVM (results not shown). Naive Bayes (NB) enhanced by the Expectation-Maximization (EM) algorithm has also been used in text classification. EM is an iterative scheme for parameter estimation in incomplete data problems. Here, the unknown class labels constitute the incomplete information. The iteration between the E-step and the M-step is reminiscent of the iterative classifiers employed by SN-IC and has the drawback of premature convergence. NB in conjunction with EM had been shown to perform worse than SVM because of distributional assumption restrictions (Liu et al., 2003).


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank CBMB, UCSF for funding support. We thank Gene Yeo for providing and reformatting the AS data.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on December 18, 2007; revised on January 31, 2008; accepted on March 4, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 EXISTING METHODS
 3 OUR PROPOSED METHOD:...
 4 GENOMIC DATA EXAMPLE
 5 RESULTS
 6 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bacon DW, Watts DG. Estimating the transition between two intersecting straight lines. Biometrika (1971) 58:525–534.[Abstract/Free Full Text]

    Brusic V, et al. MHCPEP – a database of MHC-binding peptides: update 1995. Nucleic Acids Res (1996) 24:242–244.[Abstract/Free Full Text]

    Christianini N, Shaw-Taylor J. Support Vector Machines. (2000) Cambridge: Cambridge University Press.

    Emmert DB, et al. The European Bioinformatics Institute (EBI) databases. Nucleic Acids Res (1994) 22:3445–3449.[Abstract/Free Full Text]

    Friedman JH, et al. Additive logistic regression: a statistical view of boosting. Ann. Stat (2000) 28:337–407.[CrossRef]

    Goh L, et al. Genomic sweeping for hypermethylated genes. Bioinformatics (2006) 23:281–288.[CrossRef][Web of Science][Medline]

    Guan P, et al. MHCPred: a server for quantitative prediction of peptide-MHC binding. Nucleic Acids Res (2003) 31:3621–3624.[Abstract/Free Full Text]

    Kidera A, et al. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem (1985) 4:23–55.[CrossRef][Web of Science]

    Li X, Liu B. Learning to classify text using positive and unlabeled data. In: Proceedings of Eighteenth International Joint Conference on Artificial Intelligence. (2003).

    Liu B, et al. Building text classifiers using positive and unlabeled examples. In: Proceedings of the Third IEEE International Conference on Data Mining. (2003).

    Liu B, et al. Partially supervised classification of text documents. In: Proceedings of the Nineteenth International Conference on Machine Learning. (2002).

    McCallum A, Nigam K. A comparison of event models for naive bayes text classification. AAAI-98 Workshop on Learning for Text Categorization (1998).

    Parker KC, et al. Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J. Immunol (1994) 152:163–175.[Abstract]

    Scholkopf B, et al. Estimating the support of a high-dimensional distribution. Technical report. (1999).

    Tibshirani R, et al. Estimating the number of clusters in a dataset via the gap statistic. J. Roy. Stat. Soc. Ser. B–Stat. Method (2001) 63:411–423.[CrossRef]

    Xiao Y, Segal MR. Prediction of genomewide conserved epitope profiles of HIV-1: Classifier choice and peptide representation. Stat. Appl. Genetics Mol. Biol (2005) 4. Article 25.

    Yeo GW, et al. Identification and analysis of alternative splicing events conserved in human and mouse. PNAS (2005) 102:2850–2855.[Abstract/Free Full Text]

    Yu H, et al. PEBL: positive example based learning for web page classification using SVM. In: Proceedigns of the ACM ISGKDD International Conference on Knowledge Discovery & Data Mining. (2002).


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/9/1198    most recent
btn089v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xiao, Y.
Right arrow Articles by Segal, M. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xiao, Y.
Right arrow Articles by Segal, M. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?