

Bioinformatics Advance Access originally published online on March 24, 2007
Bioinformatics 2007 23(8):942-949; doi:10.1093/bioinformatics/btm061

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties

Chun-Wei Tung 1 and Shinn-Ying Ho 1,2,*

1Institute of Bioinformatics and 2Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: Modeling the antigen-processing pathway, including major histocompatibility complex (MHC) binding, and predicting the immunogenicity of the resulting MHC-binding peptides are both essential to developing a computer-aided system for peptide-based vaccine design, which is one goal of immunoinformatics. Numerous studies have dealt with modeling the immunogenic pathway, but not with the intractable problem of immunogenicity prediction, which is complicated by the effects of many intrinsic and extrinsic factors. Moderate affinity of the MHC–peptide complex is essential to induce immune responses, but the relationship between affinity and peptide immunogenicity is too weak to use for predicting immunogenicity. This study focuses on mining informative physicochemical properties from known experimental immunogenicity data to understand immune responses and to predict the immunogenicity of MHC-binding peptides accurately.

Results: This study proposes a computational method to mine a feature set of informative physicochemical properties from MHC class I binding peptides to design a support vector machine (SVM) based system (named POPI) for the prediction of peptide immunogenicity. High performance of POPI arises mainly from an inheritable bi-objective genetic algorithm, which aims to automatically determine the best number m out of 531 physicochemical properties, identify these m properties and tune SVM parameters simultaneously. The dataset consisting of 428 human MHC class I binding peptides belonging to four classes of immunogenicity was established from MHCPEP, a database of MHC-binding peptides (Brusic et al., 1998). POPI, utilizing the m = 23 selected properties, performs well with the accuracy of 64.72% using leave-one-out cross-validation, compared with two sequence alignment-based prediction methods ALIGN (54.91%) and PSI-BLAST (53.23%). POPI is the first computational system for prediction of peptide immunogenicity based on physicochemical properties.

Availability: A web server for prediction of peptide immunogenicity (POPI) and the used dataset of MHC class I binding peptides (PEPMHCI) are available at http://iclab.life.nctu.edu.tw/POPI

Contact: syho{at}mail.nctu.edu.tw


    1 INTRODUCTION
 
Developing a computer-aided system to design peptide vaccines is one goal of immunoinformatics. Previous work on peptide vaccine design has focused mainly on identifying cytotoxic T lymphocyte (CTL) epitopes and investigating their immunogenicity. CTLs play a critical role in protective immunity by recognizing and eliminating self-altered cells; they recognize short peptides, derived from intracellular degradation of foreign proteins, in combination with major histocompatibility complex (MHC) class I molecules (Hämmerling et al., 1999). The immunogenicity of MHC class I binding peptides is their ability to induce CTL responses. Accurate prediction of CTL epitopes and their immunogenicity is critical in developing a computer-aided system for vaccine design.

A direct approach to predicting CTL epitopes was studied initially, but its accuracy is fairly low (Deavin et al., 1996). An indirect approach that predicts MHC-binding peptides is useful instead, because peptides must be processed before they can induce cellular immune responses. Recent bioinformatics studies have used information about the antigen-processing pathway to predict CTL epitopes. First, peptides are generated by proteasomal cleavage. Several studies elucidating the specificity of the proteasome have been presented. To predict proteasomal cleavage sites, NetChop uses a neural network method (Kesmir et al., 2002) and Pcleavage is based on a support vector machine (SVM) learning model (Bhasin and Raghava, 2005).

After cleavage, peptide fragments are transported into the endoplasmic reticulum by the transporter associated with antigen processing (TAP). Several studies have investigated TAP transport efficiency, such as the affinity prediction of TAP-binding peptides using a cascade SVM (Bhasin and Raghava, 2004) and the prediction of TAP transport efficiency of epitope precursors using a simple scoring matrix (Peters et al., 2003). Finally, the peptide fragments bound to MHC class I molecules are translocated to the cell surface, where these complexes may activate CTLs. Several methods have been developed to predict MHC class I binding affinity, such as the SVM-based SVMHC (Dönnes and Elofsson, 2002) and a Gibbs sampling method (Nielsen et al., 2004). Moreover, hybrid approaches integrate the above-mentioned predictions of proteasomal cleavage, TAP transport efficiency and MHC binding to improve prediction performance (Dönnes and Kohlbacher, 2005; Larsen et al., 2005).

Once CTL epitopes have been predicted, their immunogenicity must also be predicted accurately for vaccine design. Peptide immunogenicity is influenced by many factors, including intrinsic physicochemical properties and extrinsic factors such as the host immunoglobulin repertoire (Kanduc, 2005; Van Regenmortel, 2001). Several studies aimed to clarify the relationship between the binding affinity of a peptide for the MHC molecule and its immunogenicity (Feltkamp et al., 1994; Ochoa-Garay et al., 1997). These studies revealed that moderate binding affinity of peptide-MHC molecules is essential to induce immune responses, but that the ability of peptides to induce CTL responses does not correlate strongly with their affinity for the MHC molecule.

Physicochemical properties of amino acids have been used extensively and successfully in sequence-based prediction methods (Blythe and Flower, 2005; Cao et al., 2006; Idicula-Thomas et al., 2006; Liu et al., 2006; Nanni and Lumini, 2006). Because of the weak correlation between peptide immunogenicity and peptide-MHC binding affinity, mining informative physicochemical properties is a potentially good approach to designing a classifier for predicting immunogenicity. Because more than 500 physicochemical properties are available, the properties used in previous studies were usually selected according to domain knowledge (Liu et al., 2006) or by a rank-based method (Sarda et al., 2005). These methods cannot be applied effectively to the intractable problem investigated here, because of limited knowledge or because they neglect the correlated effects among multiple properties (Blythe and Flower, 2005). This study aims to design an accurate predictor by efficiently selecting a small set of informative physicochemical properties while considering these correlated effects.

It is well recognized that feature selection and classifier design should be optimized simultaneously to maximize prediction accuracy (Ho et al., 2006). SVM-based learning methods have been shown to be effective for various prediction problems on protein sequences (Bhasin and Raghava, 2005; Dönnes and Elofsson, 2002). However, conventional SVMs do not detect correlations among relevant features internally, and the appropriate setting of their control parameters is often treated as a separate problem (Chang and Lin, 2001). Let there be n candidate physicochemical properties of amino acids. Maximizing prediction accuracy by selecting a small number m out of n properties, in cooperation with the SVM classifier, is equivalent to solving a binary combinatorial optimization problem with a huge search space of size C(n, m) = n!/(m!(n − m)!).
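To give a feel for the size of this search space, the following minimal sketch evaluates C(n, m) for n = 531 and m = 23, the values reported in the abstract. The variable names are illustrative only.

```python
from math import comb, factorial

n = 531  # candidate properties (value reported in the abstract)
m = 23   # selected properties (value reported in the abstract)

# Number of ways to choose m properties out of n: C(n, m).
search_space = comb(n, m)

# The same quantity computed from the factorial definition n! / (m! (n - m)!).
by_definition = factorial(n) // (factorial(m) * factorial(n - m))

print(f"C({n}, {m}) has {len(str(search_space))} decimal digits")
```

Exhaustive enumeration is clearly out of reach at this scale, which is why a heuristic search is used.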

This study proposes an efficient method to mine a feature set of informative physicochemical properties from MHC class I binding peptides to design an SVM-based system (named POPI) for prediction of peptide immunogenicity. High performance of POPI arises mainly from an inheritable bi-objective genetic algorithm (Ho et al., 2004a), which aims to automatically determine the best number m out of n = 531 physicochemical properties, identify these m properties and tune SVM parameters simultaneously by maximizing the prediction accuracy of 10-fold cross-validation (10-CV). In this study, the used dataset consisting of 428 human MHC class I binding peptides belonging to four classes of immunogenicity was established from MHCPEP, a database of MHC-binding peptides (Brusic et al., 1998). POPI, utilizing the m = 23 selected properties, performs well with accuracy of 64.72% using leave-one-out cross-validation, compared with two sequence alignment-based prediction methods ALIGN (54.91%) and PSI-BLAST (53.23%).

In contrast to the existing affinity-based methods of predicting immunogenicity by way of predicting MHC-binding peptides, POPI is the first computational system based on physicochemical properties to predict peptide immunogenicity using epitopes associated with human MHC class I molecules, which has been implemented as a web server (http://iclab.life.nctu.edu.tw/POPI).


    2 METHODS
 
2.1 Dataset and physicochemical properties
Table 1 shows the used dataset PEPMHCI of peptides associated with human MHC class I molecules extracted from MHCPEP. The keywords used to construct the dataset are ‘HLA’ and ‘CLASS-1’ in the ‘MHCMolecule’ field. The immunogenicity of a peptide is determined by measuring the concentration of peptides giving 50% of maximum specific lysis by CTLs of target cells displaying the peptide, and is given a descriptive value. The initial numbers of peptides extracted belonging to the six classes, None, Little, Moderate, High, Immunogenic-not-quantified and Unknown, are 147, 95, 125, 132, 867 and 3251, respectively. The peptides of the classes Immunogenic-not-quantified and Unknown were not considered. After removing 19 duplicate records and 52 inconsistent records, PEPMHCI with no artificial peptide contains 428 peptides, as shown in Table 1. The shortest, averaged and longest lengths of the 428 peptides are 7, 10.26 and 25, respectively.



 
Table 1. The dataset PEPMHCI of peptides associated with human MHC class I molecules extracted from MHCPEP, a database of MHC-binding peptides (Brusic et al., 1998)

 
There are 544 physicochemical properties of amino acids extracted from amino acid index database version 9.0 (AAindex), which is a collection of published amino acid indices representing different physicochemical and biological properties of amino acids (Kawashima and Kanehisa, 2000). Each physicochemical property consists of a set of 20 numerical values for amino acids. The property having the value ‘NA’ in a value set of amino acid index was discarded. Finally, 531 properties were used for the following mining method.

2.2 Support vector machine
Support vector machine (SVM) is a learning model for binary classification problems. An SVM constructs a binary classifier by finding a hyperplane that separates two classes with a maximal margin, which is determined by the support vectors. To make linear separation of samples easier, SVM uses one of various kernel functions to transform the samples into a high-dimensional space. In this work, the commonly used radial basis function is applied to nonlinearly transform the feature space, defined as follows:


$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \qquad (1)$$

The kernel parameter γ determines how the samples are transformed into a high-dimensional search space. The cost parameter C > 0 of SVM adjusts the penalty of total error. These two parameters C and γ must be tuned to get the best prediction performance.
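As a minimal sketch of the radial basis function kernel in its standard form, K(x, y) = exp(−γ‖x − y‖²), the function below evaluates it for two plain Python vectors. The function name and the sample vectors are illustrative and not part of POPI.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Illustrative vectors only: a larger gamma makes similarity decay faster.
u, v = [0.0, 0.0], [1.0, 0.0]
similarity_wide = rbf_kernel(u, v, gamma=0.5)
similarity_narrow = rbf_kernel(u, v, gamma=2.0)
```

Identical vectors always give a kernel value of 1, and the value falls toward 0 as the vectors move apart.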

For multi-class classification problems, ‘one-against-one’ strategy is applied to transform the multi-class problem into several binary classification problems. Given h classes, there are h(h – 1)/2 classifiers constructed and each one trains the samples from two classes. A voting strategy is applied to give a final prediction for test samples. In this study, h = 4 and the used SVM is obtained from LIBSVM package version 2.81 (Chang and Lin, 2001).
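The one-against-one scheme can be sketched as follows. The pairwise classifiers here are toy stand-ins built from hypothetical one-dimensional class prototypes, not trained SVMs, and ties are broken arbitrarily in favour of the class that first reaches the top vote count.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["None", "Little", "Moderate", "High"]  # h = 4 classes, from the text
# Hypothetical 1-D class prototypes, used only to build toy binary classifiers.
PROTOTYPE = {"None": 0.0, "Little": 1.0, "Moderate": 2.0, "High": 3.0}

def make_toy_classifier(a, b):
    """Toy stand-in for a trained binary SVM that separates class a from class b."""
    def classify(x):
        return a if abs(x - PROTOTYPE[a]) <= abs(x - PROTOTYPE[b]) else b
    return classify

# One classifier per unordered pair of classes: h(h - 1)/2 = 6 for h = 4.
pairwise = {pair: make_toy_classifier(*pair) for pair in combinations(CLASSES, 2)}

def predict_by_voting(x):
    """Each pairwise classifier casts one vote; the class with the most votes wins."""
    votes = Counter(classifier(x) for classifier in pairwise.values())
    return votes.most_common(1)[0][0]
```

With four classes, every test sample is therefore scored by six binary classifiers before a label is assigned.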

2.3 Orthogonal experimental design
Statistical design of experiments is a process of planning experiments. Orthogonal experimental design with orthogonal array and factor analysis is an efficient method to analyze the effects of several factors simultaneously (Dey, 1985; Wu, 1978). The factors are the parameters that affect response variables, and a discriminative value of a factor is regarded as a level of the factor. A ‘complete factorial’ experiment would make measurements at every possible level combination. However, the number of level combinations is often so large that this is impractical, and a subset of level combinations must be judiciously selected, resulting in a ‘fractional factorial’ experiment. Orthogonal experimental design utilizes properties of fractional factorial experiments to efficiently determine the best combination of factor levels to use in design problems.

An orthogonal array is a fractional factorial array that assures a balanced comparison of the levels of any factor. It reduces the number of level combinations needed for factor analysis. Each row of an orthogonal array represents the levels of the factors in one combination, and each column represents a specific factor whose level can change from combination to combination. The term ‘main effect’ of a factor designates the effect on response variables that can be traced to that design parameter; its estimation does not interfere with the estimation of the main effect of any other factor. After proper tabulation of experimental results, the summarized data are analyzed using factor analysis to determine the relative-level effects of factors.

Factor analysis can evaluate the effects of individual factors on the evaluation function, rank the most effective factors, and determine the best level for each factor such that the evaluation function is optimized. Table 2 shows an illustrative example of orthogonal experimental design using a two-level orthogonal array $L_M(2^{M-1})$ with M rows and M − 1 columns. In this example of M = 8, there are seven factors, where each corresponds to a physicochemical property and its two levels correspond to exclusion and inclusion of the feature in the proposed feature selection. Let $f_t$ denote a function value (prediction accuracy of 10-CV in this study) of the combination t. Define the main effect of factor j with level k as $S_{jk}$, where j = 1, ... , M − 1 and k = 1, 2:


$$S_{jk} = \sum_{t=1}^{M} f_t \cdot F_t \qquad (2)$$

where $F_t = 1$ if the level of factor j of combination t is k; otherwise, $F_t = 0$. Since the objective function is to be maximized, level 1 of factor j makes a better contribution to the function than level 2 of factor j does when $S_{j1} > S_{j2}$. The main effect reveals the individual effect of a factor. After the better one of two levels of each factor is determined, a good combination consisting of all factors with the better levels can be easily reasoned (Ho et al., 2004b).



 
Table 2. An illustrative example of orthogonal array $L_8(2^7)$ and factor analysis

 
The rank in Table 2 shows the rank of combination t among all 128 (= $2^7$) possible combinations. In this example, the reasoned combination gets the best accuracy with rank 1. Notably, the reasoned combination is not guaranteed to be the best one in general cases. The most effective factor j has the largest main effect difference MED = $|S_{j1} - S_{j2}|$. The 6th factor, which has the largest MED of 36.3, is the most effective factor.
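The factor analysis can be illustrated with a short sketch. It builds the standard two-level $L_8(2^7)$ array and assumes that the main effect of factor j at level k is the sum of the function values over the rows where factor j is at level k, which is one reading of the definition above. The eight accuracies are hypothetical and are not the values of Table 2.

```python
from itertools import product

def l8_orthogonal_array():
    """Standard L8(2^7) array: 8 rows, 7 two-level factors with levels 1 and 2."""
    rows = []
    for a, b, c in product((0, 1), repeat=3):
        bits = (a, b, a ^ b, c, a ^ c, b ^ c, a ^ b ^ c)
        rows.append([bit + 1 for bit in bits])
    return rows

def main_effects(array, f):
    """Return (S_j1, S_j2) per factor, assuming S_jk sums f_t over rows at level k."""
    effects = []
    for j in range(len(array[0])):
        s1 = sum(ft for row, ft in zip(array, f) if row[j] == 1)
        s2 = sum(ft for row, ft in zip(array, f) if row[j] == 2)
        effects.append((s1, s2))
    return effects

array = l8_orthogonal_array()
f = [50.0, 52.5, 48.0, 55.0, 61.0, 47.5, 58.0, 53.0]  # hypothetical accuracies (%)

effects = main_effects(array, f)
med = [abs(s1 - s2) for s1, s2 in effects]               # main effect differences
better_levels = [1 if s1 > s2 else 2 for s1, s2 in effects]  # reasoned combination
most_effective_factor = max(range(len(med)), key=med.__getitem__) + 1
```

Only eight of the 128 possible level combinations are evaluated, yet every factor is compared on a balanced set of four rows per level.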

2.4 Inheritable bi-objective genetic algorithm
Selecting a minimal number of informative features while maximizing prediction accuracy is a bi-objective 0/1 combinatorial optimization problem. An efficient inheritable bi-objective genetic algorithm (IBCGA, Ho et al., 2004a) is utilized to solve this optimization problem. IBCGA consists of an intelligent genetic algorithm (Ho et al., 2004b) with an inheritable mechanism. The intelligent genetic algorithm uses a divide-and-conquer strategy and an orthogonal array crossover to efficiently solve large-scale parameter optimization problems. In this study, the intelligent genetic algorithm can efficiently explore and exploit the search space of C(n, r). IBCGA can efficiently search the space of C(n, r ± 1) by inheriting a good solution in the space of C(n, r) (Ho et al., 2004a). Therefore, IBCGA can economically obtain a complete set of high-quality solutions in a single run, where r is specified in a range of interest such as [5, 45].

The proposed chromosome encoding scheme of IBCGA consists of both binary genes for feature selection and parametric genes for tuning SVM parameters, where the gene and chromosome are commonly used terms of genetic algorithm (GA), named GA-gene and GA-chromosome for discrimination in this article. The GA-chromosome consists of n = 531 binary GA-genes $b_i$ for selecting informative properties and two 4-bit GA-genes for tuning the parameters C and γ of SVM. If $b_i = 0$, the ith property is excluded from the SVM classifier; otherwise, the ith property is included. This encoding method maps the 16 values of γ and C into $\{2^{-7}, 2^{-6}, \ldots, 2^{8}\}$. Figure 1 shows the encoding scheme of GA-chromosome and process of constructing feature vectors for fitness function evaluation using a concise example.
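A minimal decoding sketch for such a GA-chromosome is given below. The gene order (feature bits first, then C, then γ) and the mapping of a 4-bit value v to 2^(v − 7) are assumptions made for illustration; the text only states that the 16 values map into the set from 2^−7 to 2^8.

```python
N_PROPERTIES = 531  # binary GA-genes, one per physicochemical property
PARAM_BITS = 4      # bits per parametric GA-gene (C and gamma)

def decode_parameter(bits):
    """Map a 4-bit gene to one of 16 values 2^-7 ... 2^8 (assumed ordering)."""
    value = int("".join(str(b) for b in bits), 2)  # integer in 0..15
    return 2.0 ** (value - 7)

def decode_chromosome(chromosome):
    """Return (selected property indices, C, gamma) from a flat list of bits."""
    if len(chromosome) != N_PROPERTIES + 2 * PARAM_BITS:
        raise ValueError("unexpected chromosome length")
    selected = [i for i, b in enumerate(chromosome[:N_PROPERTIES]) if b == 1]
    c_bits = chromosome[N_PROPERTIES:N_PROPERTIES + PARAM_BITS]
    gamma_bits = chromosome[N_PROPERTIES + PARAM_BITS:]
    return selected, decode_parameter(c_bits), decode_parameter(gamma_bits)

# Hypothetical chromosome: properties 0 and 5 selected, both parameter genes 1000.
example = [0] * N_PROPERTIES
example[0] = example[5] = 1
example += [1, 0, 0, 0] + [1, 0, 0, 0]
selected, cost, gamma = decode_chromosome(example)
```

Encoding the SVM parameters in the same chromosome is what lets feature selection and parameter tuning be optimized together.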


Fig. 1. An illustrative example of fitness function evaluation from decoding a GA-chromosome.

 
The feature vector for training the SVM classifier is obtained by decoding a GA-chromosome in the following steps. Consider a given peptide sequence, e.g. lysosomal acid lipase (LAL). First, the index vectors for all selected physicochemical properties (residue volume and molecular weight in this example) are constructed from AAindex for each amino acid. The feature vector of a peptide consists of the selected features, whose values are obtained by averaging the values in their corresponding index vectors. Finally, all values of the feature vectors are normalized into [–1, 1] for applying SVM.
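The construction of a feature vector can be sketched as below. The two index tables hold made-up placeholder numbers for three amino acids, not real AAindex entries, and min-max scaling over the dataset is assumed as the normalization into [−1, 1], since the text does not specify the scaling method.

```python
# Hypothetical per-residue index tables (placeholder numbers, not AAindex values).
TOY_VOLUME = {"L": 4.0, "A": 1.0, "K": 5.0}
TOY_WEIGHT = {"L": 3.0, "A": 1.0, "K": 4.0}

def peptide_features(peptide, index_tables):
    """One feature per selected property: the mean index value over all residues."""
    return [sum(table[aa] for aa in peptide) / len(peptide) for table in index_tables]

def scale_to_unit_range(vectors):
    """Min-max scale each feature column into [-1, 1] (assumed normalization)."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0
         for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

tables = [TOY_VOLUME, TOY_WEIGHT]
peptides = ["LAL", "AAK", "KLK"]  # toy sequences of different composition
raw = [peptide_features(p, tables) for p in peptides]
scaled = scale_to_unit_range(raw)
```

Because each feature is an average over residues, peptides of different lengths all yield vectors of the same dimension.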

The fitness function is the only guide for IBCGA to obtain desirable solutions. To avoid prediction bias toward some immunogenic levels, the averaged accuracy (AA) of the four immunogenic levels, defined in (6), is adopted as the fitness function. The performance of the selected properties together with the SVM parameter values is measured by 10-CV. Therefore, the fitness value of a GA-chromosome is obtained by computing the mean accuracy of the 10 runs.

IBCGA with the fitness function f(X) can simultaneously obtain a set of solutions, $X_r$, where r = $r_{start}$, $r_{start}$ + 1, ..., $r_{end}$, in a single run. The algorithm of IBCGA with the given values $r_{start}$ and $r_{end}$ is described as follows:

Step 1. (Initiation) Randomly generate an initial population of $N_{pop}$ individuals. All the n binary GA-genes have r 1's and n − r 0's, where r = $r_{start}$.
Step 2. (Evaluation) Evaluate the fitness values of all individuals using f(X).
Step 3. (Selection) Use the traditional tournament selection that selects the winner from two randomly selected individuals to form a mating pool.
Step 4. (Crossover) Select $P_c \cdot N_{pop}$ parents from the mating pool to perform orthogonal array crossover on the selected pairs of parents, where $P_c$ is the crossover probability.
Step 5. (Mutation) Apply the swap mutation operator to the randomly selected $P_m \cdot N_{pop}$ individuals in the new population, where $P_m$ is the mutation probability. To prevent the best fitness value from deteriorating, mutation is not applied to the best individual.
Step 6. (Termination test) If the stopping condition for obtaining the solution $X_r$ is satisfied, output the best individual as $X_r$. Otherwise, go to Step 2.
Step 7. (Inheritance) If r < $r_{end}$, randomly change one bit in the binary GA-genes for each individual from 0 to 1; increase the number r by one, and go to Step 2. Otherwise, stop the algorithm.
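The operators that keep the number of selected features under control (Steps 1, 3, 5 and 7) can be sketched as follows. This is not the authors' implementation: the orthogonal array crossover is omitted, and swap mutation is assumed to exchange one selected and one unselected bit so that the count r is preserved.

```python
import random

def random_individual(n, r, rng):
    """Step 1: binary genes with exactly r ones and n - r zeros."""
    genes = [0] * n
    for i in rng.sample(range(n), r):
        genes[i] = 1
    return genes

def tournament(population, fitness, rng):
    """Step 3: pick two individuals at random and keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def swap_mutation(genes, rng):
    """Step 5 (assumed form): swap one 1 with one 0, so r stays unchanged."""
    ones = [i for i, g in enumerate(genes) if g == 1]
    zeros = [i for i, g in enumerate(genes) if g == 0]
    mutated = list(genes)
    if ones and zeros:
        mutated[rng.choice(ones)] = 0
        mutated[rng.choice(zeros)] = 1
    return mutated

def inherit(genes, rng):
    """Step 7: flip one randomly chosen 0 to 1, moving from r to r + 1 features."""
    zeros = [i for i, g in enumerate(genes) if g == 0]
    child = list(genes)
    if zeros:
        child[rng.choice(zeros)] = 1
    return child

rng = random.Random(0)  # fixed seed so the sketch is reproducible
parent = random_individual(531, 5, rng)
mutant = swap_mutation(parent, rng)
heir = inherit(parent, rng)
```

The inheritance step is what allows a good solution with r features to seed the search for r + 1 features within the same run.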

2.5 Evaluation of POPI
The selected m physicochemical properties and the associated parameter setting of SVM by IBCGA are used to implement the computational system POPI for prediction of peptide immunogenicity. Four measurements were used to evaluate POPI using leave-one-out cross-validation (LOOCV) on the dataset PEPMHCI, namely percentage accuracy ($ACC_i$) and Matthews correlation coefficient ($MCC_i$) for the ith immunogenicity class, i = 1, ... , 4, and overall accuracy (OA) and averaged accuracy (AA) for all classes:


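$$ACC_i = \frac{TP_i}{TP_i + FN_i} \times 100\% \qquad (3)$$

$$MCC_i = \frac{TP_i \times TN_i - FP_i \times FN_i}{\sqrt{(TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i)}} \qquad (4)$$

$$OA = \frac{\sum_{i=1}^{h} TP_i}{N} \times 100\% \qquad (5)$$

$$AA = \frac{1}{h} \sum_{i=1}^{h} ACC_i \qquad (6)$$

where $TP_i$, $TN_i$, $FP_i$ and $FN_i$ are the numbers of true positives, true negatives, false positives and false negatives, respectively. N (= 428) is the total number of sequences and h (= 4) is the number of immunogenicity classes.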
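The four measurements can be computed from a confusion matrix as sketched below. The sketch assumes that the per-class accuracy is the fraction of class-i samples predicted correctly, that OA is the fraction of all samples predicted correctly, and that AA is the mean of the per-class accuracies; the confusion matrix is hypothetical.

```python
import math

def evaluate(confusion):
    """confusion[i][j] = number of class-i samples predicted as class j.

    Assumes every class has at least one sample.
    Returns (per-class ACC in %, per-class MCC, OA in %, AA in %).
    """
    h = len(confusion)
    n = sum(sum(row) for row in confusion)
    acc, mcc = [], []
    for i in range(h):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[k][i] for k in range(h)) - tp
        tn = n - tp - fn - fp
        acc.append(100.0 * tp / (tp + fn))
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc.append((tp * tn - fp * fn) / denom if denom else 0.0)
    oa = 100.0 * sum(confusion[i][i] for i in range(h)) / n
    aa = sum(acc) / h
    return acc, mcc, oa, aa

# Hypothetical 3-class confusion matrix, for illustration only.
toy_confusion = [[5, 1, 0],
                 [1, 3, 1],
                 [0, 2, 7]]
acc, mcc, oa, aa = evaluate(toy_confusion)
```

OA weights every sample equally, whereas AA weights every class equally, which is why AA is the less biased choice when class sizes differ.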


    3 RESULTS
 
3.1 Mining informative physicochemical properties
IBCGA was run to mine informative physicochemical properties using the whole dataset PEPMHCI. In this study, the parameters of IBCGA are set as $N_{pop}$ = 50, $P_c$ = 0.8, $P_m$ = 0.05, $r_{start}$ = 5 and $r_{end}$ = 45. For each feature set with size r, IBCGA selected a small set of physicochemical properties and parameter values of SVM. Figure 2 shows a potentially good result in terms of averaged accuracy (AA), and the number of used features obtained from a single run of IBCGA using 10-CV. The result reveals that the best number of selected features is m = 23, where the SVM classifier with C = 2 and γ = 2 has the best averaged accuracy AA = 63.67% and overall accuracy OA = 66.12%. The SDs of AA and OA among the 10 cross-validation results are 10.87 and 9.73%, respectively.


Fig. 2. Averaged accuracies (AAs) of 10-CV for IBCGA, rank-based methods (RankD and RankI) and the alignment-based method (ALIGN).

 
To further evaluate the feature selection of IBCGA, a traditional rank-based method for evaluating performance of a single feature is also implemented for comparison. The feature selection of the rank-based method is performed using the following steps. (1) For each physicochemical property, the prediction accuracy AA of 10-CV using SVM and the single feature was computed. (2) All physicochemical properties were ranked according to their AA accuracies. (3) The r properties with the highest ranks and SVM were used to predict peptide immunogenicity. Figure 2 shows the AA accuracies of various feature sets with size r, where r = 5, ... , 45.
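Steps (2) and (3) of the rank-based method amount to sorting properties by their single-feature accuracy and keeping the first r. A minimal sketch follows; the property names and scores are hypothetical, and the single-feature 10-CV evaluation of step (1) is represented only by the precomputed score table.

```python
def top_r_properties(single_feature_scores, r):
    """Rank properties by single-feature accuracy and keep the r best."""
    ranked = sorted(single_feature_scores,
                    key=lambda name: single_feature_scores[name],
                    reverse=True)
    return ranked[:r]

# Hypothetical single-feature AA accuracies (%), standing in for step (1).
toy_scores = {"prop_a": 31.2, "prop_b": 27.5, "prop_c": 33.0, "prop_d": 25.1}
chosen = top_r_properties(toy_scores, r=2)
```

Each property is scored in isolation here, so any benefit that arises only from combining properties is invisible to this ranking.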

The rank-based method cannot by itself find appropriate values of C and γ for training SVM classifiers. To give it a fair chance of high performance, two parameter settings of SVM were tested. The first rank-based method, named RankD, uses the default SVM parameter values C = 1 and γ = 1/r. The best performance of RankD is AA = 36.08% with 21 features. The second rank-based method, named RankI, uses the values C = 2 and γ = 2 obtained from IBCGA. The best performance of RankI is AA = 48.87% with 18 features. Figure 2 shows that RankI performs better than RankD, revealing that the SVM parameter setting derived from IBCGA is effective. Furthermore, the feature selection of IBCGA performs much better than that of the rank-based method. This result supports the view that feature selection which additionally considers the correlated effects among physicochemical properties can advance prediction performance. Table 3 lists the AAindex identities of the 23 physicochemical properties selected by IBCGA.



 
Table 3. The AAindex identities of the 23 physicochemical properties selected by IBCGA, which are ranked according to their effectiveness of prediction

 
3.2 Analyzing individual effects of properties
Estimating the individual effects of selected properties is important for immunologists to understand peptide immunogenicity comprehensively. Orthogonal experimental design used in IBCGA is capable of estimating individual effects of factors according to the value of MED. The property with the largest value of MED is the most effective property. Figure 3 shows the value of MED for each selected property. The property of AAindex identity GEIM800103 is the most effective property with MED = 33.29, which corresponds to ‘Alpha-helix indices for beta-proteins’ (Geisow and Roberts, 1980). The least effective property is MIYS850101 with MED = 0.80, which corresponds to ‘Effective partition energy’ (Miyazawa and Jernigan, 1985).


Fig. 3. Individual effects of 23 selected properties sorted by MED.

 
Since all the properties were selected simultaneously on the basis of prediction performance, the feature set obtained by IBCGA is not always the same from run to run, for three reasons: (1) IBCGA is a non-deterministic algorithm; (2) the selected kernel function and parameter setting of SVM slightly affect the prediction performance and (3) the feature selection is a machine-learning approach and its result depends on the distribution of samples in the dataset. A larger training dataset would make the selected feature set more stable.

In the computer experiment of mining informative features, 72 independent runs of IBCGA were performed. The largest, mean and smallest numbers m of selected features are 45, 29.10 and 8, respectively. The highest, mean and lowest AA accuracies in the training phase are 63.78, 61.11 and 58.56%, respectively. The statistical result reveals that a small set of effective properties is selected more stably across runs of IBCGA. For example, the three properties QIAN880112, MITS020101 and KARP850103 with ranks 8, 15 and 16 shown in Figure 3 have the highest ranks 1, 6 and 6, respectively, according to the selection frequency in the 72 runs.

Table 3 also shows the ranks of the selected properties based on the prediction accuracy of RankI. The best one of selected properties is NADH010106 in terms of the rank by RankI, which has the accuracy of AA = 32.98% and rank 47. On the other hand, the most effective property GEIM800103 has the rank 257 by RankI. Table 3 reveals that the ranks by RankI for the 23 selected properties are uniformly distributed. This scenario indicates that a set of properties should be considered simultaneously rather than single property at a time because of strong correlation among physicochemical properties.

3.3 Prediction system POPI
The prediction system POPI is implemented by adopting the 23 selected informative properties (shown in Table 3) and the established SVM-based classifier in the training phase. To evaluate the ability of POPI in predicting novel peptides, the LOOCV performance is applied on the whole dataset PEPMHCI.

Table 4 shows the performance of POPI in terms of ACC and MCC for the four immunogenicity classes, and the prediction accuracies of OA and AA. The ACC accuracies of the four classes None, Little, Moderate and High are 83.33, 50.60, 55.00 and 59.41%, respectively. The mean of MCC performance is 0.51.



 
Table 4. Performance comparisons of ALIGN, PSI-BLAST and POPI using LOOCV on the whole dataset PEPMHCI

 
The test performance of POPI (OA = 64.72% and AA = 62.09%) is slightly worse than the training performance (OA = 66.12% and AA = 63.67%). This result indicates that no obvious overfitting occurred in selecting informative features.

3.4 Alignment-based prediction
Sequence alignment may be an efficient approach to predicting peptide immunogenicity because similar sequences may have similar peptide immunogenicity. In order to compare the alignment-based prediction methods with POPI, two methods including global sequence alignment tool ALIGN (Myers and Miller, 1988) and advanced sequence comparison method PSI-BLAST that is capable of detecting remote homologs (Altschul et al., 1997) were applied to search for similar sequences. For each tested peptide, ALIGN and PSI-BLAST using three iterations were applied separately to search for its homologs.

For comparison, LOOCV was used to evaluate their prediction performance on the same dataset. The immunogenicity class of the peptide with the highest similarity score was assigned to the test peptide. If multiple peptides share the highest score, a voting strategy is applied. If two or more immunogenicity classes then have equal votes, the candidate classes are ranked by sample size in the dataset and the class with the highest rank is assigned to the test peptide.
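The assignment rule can be sketched as follows, assuming that the alignment tool has already produced a list of (similarity score, class) hits for the test peptide. The class sizes and hits are hypothetical, and the alignment step itself (ALIGN or PSI-BLAST) is not reproduced.

```python
from collections import Counter

def assign_class(hits, class_sizes):
    """hits: list of (similarity_score, immunogenicity_class) for one test peptide.

    Take the top-scoring hits, vote among their classes, and break vote ties
    by choosing the class with the larger sample size in the dataset.
    """
    best_score = max(score for score, _ in hits)
    votes = Counter(label for score, label in hits if score == best_score)
    top_votes = max(votes.values())
    tied = [label for label, count in votes.items() if count == top_votes]
    return max(tied, key=lambda label: class_sizes[label])

# Hypothetical class sizes and hits, for illustration only.
toy_sizes = {"None": 100, "Little": 80, "Moderate": 120, "High": 128}
tie_case = assign_class([(30, "High"), (30, "Little"), (25, "None")], toy_sizes)
vote_case = assign_class([(30, "Little"), (30, "Little"), (30, "High")], toy_sizes)
```

Breaking ties by class size favours the majority class, which is a reasonable default when no other evidence separates the candidates.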

Table 4 shows the results of ALIGN (OA = 54.91% and AA = 52.64%) and PSI-BLAST (OA = 53.23% and AA = 52.35%). Notably, the accuracy of PSI-BLAST shown in Table 4 is measured by considering only the peptides whose homologs can be obtained. If the 118 of 428 peptides with no homolog found were also considered, the accuracy of PSI-BLAST would decrease. The lines shown in Figure 2 represent the performance (AA = 49.23% and OA = 51.17%) of ALIGN using 10-CV for comparison. The results reveal that POPI performs well compared with the alignment-based methods ALIGN and PSI-BLAST when the size of the reference dataset is not sufficiently large. Notably, the performance of all methods using 10-CV is not significantly different from that using LOOCV.

3.5 Affinity-driven prediction
In the past, affinity was considered as an important index to predict peptide immunogenicity. To evaluate the affinity-driven prediction method, an additional dataset was established by extracting MHC class I binding peptides with known activity levels in both fields of ‘BINDING’ and ‘IMMUNOGENICITY’ from the MHCPEP database. However, there are four levels in the field of ‘IMMUNOGENICITY’, but the field of ‘BINDING’ has only three levels without the level ‘none’. To fairly evaluate the prediction performance of the affinity-driven prediction, the immunogenic class None was combined with the class Little. The dataset contains 160 peptides belonging to three classes.

A prediction system named AFFIPRE was implemented to predict peptide immunogenicity from affinity using the following criterion: if the immunogenic level and the affinity level of a peptide are identical, the prediction is regarded as successful; otherwise, it is regarded as a failure. AFFIPRE was evaluated using the same four measurements as those used for IBCGA. Table 5 shows the results of AFFIPRE (OA = 39.38% and AA = 40.09%) and POPI (OA = 60.63% and AA = 50.50%). The poor performance of AFFIPRE reveals that affinity alone cannot be used directly to predict peptide immunogenicity. This result is consistent with previous studies showing that the affinity of the MHC–peptide complex is not the main factor in predicting peptide immunogenicity (Feltkamp et al., 1994; Ochoa-Garay et al., 1997).
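The AFFIPRE criterion and its scoring can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' code: OA is taken to be the fraction of all peptides predicted correctly and AA the unweighted mean of the per-class accuracies, the level names other than None and Little are hypothetical, and the example records are invented.

```python
from collections import defaultdict

def merge_none(level):
    """Combine the immunogenic class None with the class Little."""
    return "Little" if level == "None" else level

def affipre_predict(affinity_level):
    """AFFIPRE criterion: the predicted immunogenic level equals the affinity level."""
    return affinity_level

def overall_and_average_accuracy(true_labels, predicted_labels):
    """Return (OA, AA): overall accuracy and the mean of per-class accuracies (assumed definitions)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == predicted)
    oa = sum(correct.values()) / sum(total.values())
    aa = sum(correct[c] / total[c] for c in total) / len(total)
    return oa, aa

# Invented (BINDING, IMMUNOGENICITY) records, for illustration only.
records = [("High", "High"), ("High", "Moderate"), ("Moderate", "Little"),
           ("Little", "None"), ("Little", "Little")]
truth = [merge_none(immunogenicity) for _, immunogenicity in records]
predicted = [affipre_predict(binding) for binding, _ in records]
oa, aa = overall_and_average_accuracy(truth, predicted)
```

Because AA weights every class equally, it can differ noticeably from OA when class sizes are unbalanced, which is why both measures are reported side by side.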


Table 5. Performance comparisons between AFFIPRE and POPI

 

    4 DISCUSSIONS
In designing peptide-based vaccines, the effectiveness of vaccination depends on peptide immunogenicity. Accurate prediction of peptide immunogenicity can therefore reduce experimental effort considerably. This study investigates the problem of predicting peptide immunogenicity and proposes an efficient prediction system, POPI, to predict the immunogenicity of peptides with variable lengths. POPI is an SVM-based classifier with a set of informative features selected by the proposed IBCGA.

In this study, a dataset named PEPMHCI, consisting of peptides associated with human MHC class I molecules, was extracted from MHCPEP. To account for the correlated effects among physicochemical properties and their interaction with the SVM classifier, feature selection and parameter tuning were optimized simultaneously using IBCGA. A feature set consisting of 23 physicochemical properties was selected to implement the prediction system POPI.

To our knowledge, POPI is the first computational system for predicting peptide immunogenicity based on physicochemical properties. To evaluate POPI comprehensively, the feature selection method was compared with a rank-based selection method, and the selected properties were analyzed using the factor analysis of orthogonal experimental design. Simulation results show that, in comparison with the rank-based method, IBCGA can select a small set of informative properties while taking their correlated effects into account.

To further evaluate POPI, three prediction methods were tested for comparison: the alignment-based methods ALIGN and PSI-BLAST, and the affinity-driven prediction method AFFIPRE. Because the reference dataset is not sufficiently large, ALIGN and PSI-BLAST do not work well. The poor performance of AFFIPRE shows that affinity is not suitable for predicting peptide immunogenicity directly. This result is consistent with previous studies showing that peptide immunogenicity does not strongly correlate with affinity for the MHC molecule (Feltkamp et al., 1994; Ochoa-Garay et al., 1997).

To cope with the small size of the training dataset in mining informative physicochemical properties, the proposed method provides, for each selected property, its effectiveness (measured by its main effect difference in discriminating immunogenic levels) and its robustness (measured by its selection frequency). This information helps in determining the best set of features for an accurate prediction system and in further understanding immune responses through the informative physicochemical properties. Future work is to collect more immunogenicity data by combining biological knowledge and related sources, such as the Immune Epitope Database and Analysis Resource (IEDB; Peters et al., 2005), to advance the prediction performance of IBCGA.

The feature selection method of IBCGA has been shown to be effective in solving large-scale binary combinatorial optimization problems (Ho et al., 2004a, b, 2006). SVM-based learning methods, in turn, have been shown to be effective for protein sequence-based predictions. Consequently, IBCGA with SVM can readily be used to design an SVM-based classifier for sequence-based prediction problems by mining informative features of physicochemical properties from an experimental dataset.


    ACKNOWLEDGEMENT
The authors would like to thank the National Science Council of Taiwan for financially supporting this research under the contract numbers NSC 95-2627-B-009-002 and NSC 95-2221-E-009-116.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on October 28, 2006; revised on February 14, 2007; accepted on February 14, 2007

    REFERENCES

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.

    Bhasin M, Raghava GP. Analysis and prediction of affinity of TAP binding peptides using cascade SVM. Protein Sci. (2004) 13:596–607.

    Bhasin M, Raghava GP. Pcleavage: an SVM based method for prediction of constitutive proteasome and immunoproteasome cleavage sites in antigenic sequences. Nucleic Acids Res. (2005) 33:W202–W207.

    Blythe MJ, Flower DR. Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci. (2005) 14:246–248.

    Brusic V, et al. MHCPEP, a database of MHC-binding peptides: update 1997. Nucleic Acids Res. (1998) 26:368–371.

    Cao Y, et al. Prediction of protein structural class with rough sets. BMC Bioinformatics (2006) 7:20.

    Chang CC, Lin CJ. LIBSVM: a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

    Deavin AJ, et al. Statistical comparison of established T-cell epitope predictors against a large database of human and murine antigens. Mol. Immunol. (1996) 33:145–155.

    Dey A. Orthogonal Fractional Factorial Designs. (1985) New York: Wiley.

    Dönnes P, Elofsson A. Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics (2002) 3:25.

    Dönnes P, Kohlbacher O. Integrated modeling of the major events in the MHC class I antigen processing pathway. Protein Sci. (2005) 14:2132–2140.

    Feltkamp MC, et al. Efficient MHC class I-peptide binding is required but does not ensure MHC class I-restricted immunogenicity. Mol. Immunol. (1994) 31:1391–1401.

    Geisow MJ, Roberts RDB. Amino acid preferences for secondary structure vary with protein class. Int. J. Biol. Macromol. (1980) 2:387–389.

    Hämmerling GJ, et al. Antigen processing and presentation – towards the millennium. Immunol. Rev. (1999) 172:5–9.

    Ho S.-Y, et al. Inheritable genetic algorithm for bi-objective 0/1 combinatorial optimization problems and its applications. IEEE Trans. Syst. Man Cybern. B Cybern. (2004a) 34:609–620.

    Ho S.-Y, et al. Intelligent evolutionary algorithms for large parameter optimization problems. IEEE Trans. Evol. Comput. (2004b) 8:522–541.

    Ho S.-Y, et al. Interpretable gene expression classifier with an accurate and compact fuzzy rule base for microarray data analysis. Biosystems (2006) 85:165–176.

    Idicula-Thomas S, et al. A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli. Bioinformatics (2006) 22:278–284.

    Kanduc D. Peptimmunology: immunogenic peptides and sequence redundancy. Curr. Drug Discov. Technol. (2005) 2:239–244.

    Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. (2000) 28:374.

    Kesmir C, et al. Prediction of proteasome cleavage motifs by neural networks. Protein Eng. (2002) 15:287–296.

    Larsen MV, et al. An integrative approach to CTL epitope prediction: a combined algorithm integrating MHC class I binding, TAP transport efficiency, and proteasomal cleavage predictions. Eur. J. Immunol. (2005) 35:2295–2303.

    Liu W, et al. Quantitative prediction of mouse class I MHC peptide binding affinity using support vector machine regression (SVR) models. BMC Bioinformatics (2006) 7:182.

    Miyazawa S, Jernigan RL. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules (1985) 18:534–552.

    Myers EW, Miller W. Optimal alignments in linear space. Comput. Appl. Biosci. (1988) 4:11–17.

    Nanni L, Lumini A. An ensemble of K-local hyperplanes for predicting protein-protein interactions. Bioinformatics (2006) 22:1207–1210.

    Nielsen M, et al. Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics (2004) 20:1388–1397.

    Ochoa-Garay J, et al. The ability of peptides to induce cytotoxic T cells in vitro does not strongly correlate with their affinity for the H-2Ld molecule: implications for vaccine design and immunotherapy. Mol. Immunol. (1997) 34:273–281.

    Peters B, et al. Identifying MHC class I epitopes by predicting the TAP transport efficiency of epitope precursors. J. Immunol. (2003) 171:1741–1749.

    Peters B, et al. The immune epitope database and analysis resource: from vision to blueprint. PLoS Biol. (2005) 3:e91.

    Sarda D, et al. pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics (2005) 6:152.

    Van Regenmortel MH. Antigenicity and immunogenicity of synthetic peptides. Biologicals (2001) 29:209–213.

    Wu Q. On the optimality of orthogonal experimental design. Acta Math. Appl. Sin. (1978) 1:283–299.



