

Bioinformatics Advance Access originally published online on April 13, 2006
Bioinformatics 2006 22(13):1648-1655; doi:10.1093/bioinformatics/btl141

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Improving MHC binding peptide prediction by incorporating binding data of auxiliary MHC molecules

Shanfeng Zhu 1, Keiko Udaka 2, John Sidney 3, Alessandro Sette 3, Kiyoko F. Aoki-Kinoshita 1 and Hiroshi Mamitsuka 1,*

1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611–0011, Japan
2 Department of Immunology, Kochi Medical School, Nankoku, Kochi 783–8505, Japan
3 La Jolla Institute for Allergy and Immunology, 10335 Science Center Drive, La Jolla, CA 92121, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: Various computational methods have been proposed to tackle the problem of predicting the peptide binding ability for a specific MHC molecule. These methods are based on known binding peptide sequences. However, currently available peptide databases contain relatively few examples and are highly redundant. Existing studies show that MHC molecules can be classified into supertypes in terms of peptide-binding specificities. Therefore, we first give a method for reducing the redundancy in a given dataset based on information entropy, and then present a novel approach for prediction by learning a predictive model from a dataset of binders for not only the molecule of interest but also other MHC molecules.

Results: We experimented on the HLA-A family with the binding nonamers of A1 supertype (HLA-A*0101, A*2601, A*2902, A*3002), A2 supertype (A*0201, A*0202, A*0203, A*0206, A*6802), A3 supertype (A*0301, A*1101, A*3101, A*3301, A*6801) and A24 supertype (A*2301 and A*2402), whose data were collected from six publicly available peptide databases and two private sources. The results show that our approach significantly improves the prediction accuracy of peptides that bind a specific HLA molecule when we combine binding data of HLA molecules in the same supertype. Our approach can thus be used to help find new binders for MHC molecules.

Contact: mami{at}kuicr.kyoto-u.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Major histocompatibility complex (MHC) molecules bind short peptides from antigens and present them on the surface of a cell for recognition by T cell receptors (TCRs) [for general information on MHC, see Janeway et al. (2001)]. The presented peptide and MHC complexes induce naïve T cells to proliferate and differentiate into armed effector T cells that help to remove the antigens. MHC molecules show high diversity in their selectivity of peptides, making it difficult for pathogens to escape the immune response. Each different MHC molecule can bind a different set of peptides. As antigen recognition by MHC molecules is a prerequisite of the cellular immune response, it is of great immunological importance to be able to accurately predict the peptides that bind to specific MHC molecules. The experimental identification of peptide binding affinity to MHC molecules requires a binding assay of each peptide, which is a time-consuming and costly process. Therefore, a number of alternative research efforts have been carried out in an attempt to discover the rules governing binding peptide sequence patterns. Past predictive approaches can be divided into two main groups: MHC molecule structure-based approaches and binding peptide sequence-based approaches. In the former case, the crystal structure of the MHC molecule is required, which may not be possible to obtain for many MHC molecules. In the latter approach, the sequences of peptides are studied in order to ascertain binding patterns. After the discovery of main anchor residues by pooling sequences (Falk et al., 1991), secondary anchors (Ruppert et al., 1993) and peptide sequence motifs (Rammensee et al., 1995), position-specific quantitative matrix methods have been proposed to predict the binding affinity of a given peptide.
Quantitative matrices such as BIMAS, SYFPEITHI and RANKPEP are constructed by analyzing the amino acid frequency in the binding peptides during pool sequencing (Rammensee et al., 1999), by side chain scanning (Hammer et al., 1994; Parker et al., 1994; Gulukota et al., 1997), by revealing positional amino acid preferences with the use of combinatorial peptide libraries (Udaka et al., 2000), and by sequence alignment of binding peptides (Reche et al., 2002). Furthermore, machine learning based approaches, such as artificial neural network (ANN) (Gulukota et al., 1997; Brusic et al., 1998a), hidden Markov model (HMM) (Mamitsuka, 1998; Udaka et al., 2002), classification and regression tree (CART) (Segal et al., 2001) and support vector machine (SVM) (Dönnes and Elofsson, 2002; Riedesel et al., 2004), have been introduced. Several studies comparing the performance of quantitative matrix and machine learning based methods found that machine learning based methods need more training data than matrix based methods to achieve good performance (Yu et al., 2002; Peters et al., 2003).

To develop an effective computer model in a bioinformatics approach, we need to understand the characteristics of the biological data at hand (Brusic et al., 1998c, 1999). There are several concerns regarding the MHC binder databases currently available. First, the number of peptides for each MHC molecule is very limited, due to high experimental costs. In addition, although it may be easy to find a new binder that is similar to an existing one, it is difficult to find a unique and completely new binder. For example, seven or eight amino acid positions out of the nine in peptides' sequences may be the same, such as ‘ALAKAAYAV’, an HLA-A*0201 binder in MHCPEP (Brusic et al., 1998b). Correspondingly, 10 peptides with the pattern ‘ALAKAAXXV’, where X is an amino acid, can easily be found in MHCPEP as well. In short, any currently available binder database is highly redundant. We thus need to reduce the redundancy of the current database, assuming that the true data space of peptide binders is more general.

Our work attempts to overcome these data issues. (1) We give a new method for reducing the redundancy of an MHC binding peptide database. This method is based on entropy, and by reducing the redundancy we can obtain a dataset that represents the general data space of MHC binders better than any existing database. (2) We propose a novel computational method for predicting MHC-binding peptides by learning the predictive model from the binding data of both the MHC molecule of interest and other MHC molecules. Studies show that MHC molecules can be classified into a relatively small number of supertypes (superfamilies) in terms of binding specificities by different criteria, such as motifs (supermotifs) of binding peptides (Sette and Sidney, 1999), amino acid sequence similarities (Cano et al., 1998; McKenzie et al., 1999), functional pockets in the binding groove (Chelvanayagam, 1996; Zhang et al., 1998), structural similarities (Doytchinova et al., 2004) and binding specificity matrices (Lund et al., 2004). Different MHC alleles in the same supertype have highly similar structures in the main peptide binding pocket and bind largely overlapping sets of peptides, which is also recognized in chimpanzees (Bertoni et al., 1998). Cross-reactive peptides are frequently observed in cancers and infectious diseases (Bertoni et al., 1997; Doolan et al., 1997). Specifically, Sette and Sidney (1999) divided HLA class I molecules into nine supertypes: A1, A2, A3, A24, B7, B27, B44, B58 and B62. Brusic et al. found that HLA class I binding data of multiple alleles in the same supertype could accurately predict binding peptides for alleles that have no experimental data available (Brusic et al., 2002; Srinivasan et al., 2004). Sturniolo et al. (1999) made use of pocket profiles to build virtual matrices for predicting promiscuous HLA-DR ligands.
In contrast to these studies, we combine the binding data of the MHC allele of interest with the binding data of another MHC allele, regardless of supertype, to improve prediction accuracy. Through this study, the effect of combining the binding data of two different alleles, in the same or different supertypes, can be examined.

We examine this novel idea for nonameric peptide binding prediction to 16 HLA-A molecules in four different supertypes with respect to several studies (Sette and Sidney, 1999; Lund et al., 2004): A1 (HLA-A*0101, A*2601, A*2902, A*3002), A2 (A*0201, A*0202, A*0203, A*0206, A*6802), A3 (A*0301, A*1101, A*3101, A*3301, A*6801) and A24 (A*2301 and A*2402). The results show that our approach significantly improves the prediction accuracy of peptides that bind a specific HLA molecule when we combine binding data of MHC molecules in the same supertype.


    2 MATERIALS AND METHODS
2.1 Source data
In this study, HLA-A binding nonamers were collected from six public databases, MHCPEP (Brusic et al., 1998b), SYFPEITHI (Rammensee et al., 1999), FIMM (Schönbach et al., 2002), MHCBN (Bhasin et al., 2003), AntiJen (Blythe et al., 2002) and Ligand (Sathiamurthy et al., 2003) in March 2005, as well as from two private data sources (A. Sette, unpublished data; K. Udaka, unpublished data). Because of varying experimental conditions in binding assays, the real-valued binding affinity measurements produced by these different research groups are incompatible. Moreover, these values are often unavailable in these databases. Therefore, we make binary predictions on peptide binding ability to HLA-A molecules. After deleting peptides that have undetermined amino acids in their sequences and removing redundant peptides, our dataset consists of 16 alleles in different supertypes. As shown in Table 1, the 12 alleles that have at least 95 distinct binding nonamers are used in the combination experiment. Experimentally verified nonamers that do not bind MHC molecules are scarce. However, it has been estimated that <1% of all nonameric peptides would bind a particular MHC molecule (Udaka et al., 1995), and randomly generated putative non-binding peptides have been used in other studies (Dönnes and Elofsson, 2002). Therefore, we randomly generated putative non-binding nonamers from proteins in the human genome from the KEGG database (Kanehisa et al., 2004), ensuring that they are distinct from the known binding peptides of the MHC molecules of interest.
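The generation of putative non-binders can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the function name `sample_nonbinders`, its parameters, and the choice to pick a protein uniformly before picking a window are all hypothetical, and the protein sequences are assumed to be plain strings already loaded from a source such as KEGG.

```python
import random


def sample_nonbinders(proteins, binders, n, k=9, seed=0):
    """Draw n distinct k-mers from protein sequences, skipping known binders.

    Hypothetical helper: the sampling scheme (uniform over proteins, then
    uniform over start positions) is an assumption made for illustration.
    """
    rng = random.Random(seed)
    standard = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
    known = set(binders)
    # Only proteins long enough to contain a k-mer can be sampled.
    usable = [p for p in proteins if len(p) >= k]
    if not usable:
        raise ValueError("no protein is long enough to contain a k-mer")
    chosen = set()
    attempts = 0
    max_attempts = 1000 * n  # guard against an endless loop on tiny inputs
    while len(chosen) < n and attempts < max_attempts:
        attempts += 1
        protein = rng.choice(usable)
        start = rng.randrange(len(protein) - k + 1)
        kmer = protein[start:start + k]
        # Reject undetermined residues (e.g. X) and peptides known to bind.
        if set(kmer) <= standard and kmer not in known:
            chosen.add(kmer)
    if len(chosen) < n:
        raise ValueError("could not find enough distinct non-binders")
    return sorted(chosen)
```

With a 1:1 class ratio, `n` would be set to the number of binders in the dataset, so that each binder is matched by one putative non-binder.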


Table 1 The number of binding nonamers in each database and total number of distinct nonamers in all databases for each allele

 
2.2 Predictive model
We utilize the popular inductive learning algorithm C4.5, which generates a decision tree classifier for prediction (Quinlan, 1993). The original decision tree technique was established in the 1970s (Friedman, 1977; Quinlan, 1979), and it was developed and matured in the 1990s. C4.5 is one of the most popular and basic decision tree learning methods. The learned results are comprehensible and easy to interpret, which is important to allow further verification by biologists. We construct the prediction model using C4.5 release 8 (downloadable from http://www.rulequest.com/Personal/). The generation of a decision tree is a recursive partitioning of the data, and at each step C4.5 chooses the split that maximizes the information gained, an entropy-based criterion. We denote the instances in the binding class as positive instances, and those in the non-binding class as negative instances. In each MHC molecule's binding dataset, we set the ratio of positive to negative instances to 1:1 to obtain a balanced training dataset, which has been shown to achieve better classifier performance than natural class distributions (Weiss and Provost, 2003). We note that the main purpose of this research is to determine whether the predictive accuracy of peptide binding to the HLA molecule of interest can be improved by incorporating the binding data of other HLA molecules, rather than to improve the performance of existing computational prediction models. That is to say, other computational methods such as HMM could readily be used to build the predictive model instead.
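To make the entropy-based splitting criterion concrete, the sketch below computes the class entropy of a labeled peptide set and the information gained by splitting on the residue at one position. It is only an illustration of the underlying idea: C4.5 itself uses a normalized variant (the gain ratio) and adds pruning, and the encoding of each nonamer as nine categorical attributes, the function names, and every toy peptide other than 'ALAKAAYAV' are assumptions.

```python
import math
from collections import Counter, defaultdict


def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(peptides, labels, position):
    """Entropy reduction from splitting on the residue at `position` (0-based)."""
    groups = defaultdict(list)
    for peptide, label in zip(peptides, labels):
        groups[peptide[position]].append(label)
    total = len(labels)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * class_entropy(g) for g in groups.values())
    return class_entropy(labels) - remainder


# Toy data: two binders (1) and two non-binders (0); sequences are illustrative.
peptides = ["ALAKAAYAV", "ALAKAAAAV", "GLAKAAYAK", "GEAKAAYAK"]
labels = [1, 1, 0, 0]
gain_c_terminus = information_gain(peptides, labels, 8)  # separates the classes
gain_position_3 = information_gain(peptides, labels, 2)  # same residue everywhere
```

In this toy set the C-terminal residue separates the two classes perfectly, so splitting there removes all class uncertainty, whereas a position that carries the same residue in every peptide yields no gain.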

2.3 Evaluation
For each dataset of binding nonamers to an MHC molecule, we conduct 24 five-fold cross validation experiments. Prediction accuracy is the percentage of correctly identified instances out of all instances in the test set. The average prediction accuracy on the test sets over all 24 rounds is used to evaluate performance. That is, we build the predictive model 120 times (24 rounds of five folds each) and average the resulting prediction accuracies to reduce any bias introduced by the random partitioning. In this way, each dataset S has a corresponding C4.5 model predictive accuracy that is calculated from 24 five-fold cross validations. The paired sample two-tailed t-test is used to compare the performance of two predictive models on the same test sets. If the absolute t-value is larger than a certain critical value, say 3.373 when comparing two sets of 120 prediction accuracy values, then the difference in performance between the two models is statistically significant at the 99.9% confidence level.
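The paired-sample t statistic used for this comparison can be written out in a few lines. The sketch below uses only the Python standard library; the function name `paired_t` and the toy accuracy values are illustrative and are not taken from the experiments.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Paired-sample t statistic for two equally long lists of accuracies."""
    if len(before) != len(after):
        raise ValueError("the two samples must be paired")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n))


# Toy example with four paired accuracy values (illustrative numbers only).
t_value = paired_t([80.0, 81.0, 79.0, 80.0], [82.0, 82.0, 81.0, 83.0])
```

With 120 paired accuracies the statistic has 119 degrees of freedom, and its absolute value would be compared against the critical value of 3.373 quoted in the text.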

2.4 Dataset combination
To describe the dataset combination procedure in detail, we first define two terms: base dataset and auxiliary dataset. The base dataset is the peptide binding dataset of the MHC molecule of interest. The auxiliary dataset is the peptide binding dataset of another MHC molecule, which will be added into the base dataset for improving the prediction accuracy of the MHC molecule of interest. The detailed procedure we used to combine the base and auxiliary datasets is as follows:

Input. A peptide binding dataset Sa for MHC molecule A (base dataset), a peptide binding dataset Sb for MHC molecule B (auxiliary dataset), and C4.5 program to build the predictive model (decision tree).

Output. The prediction accuracy before and after combination, and their corresponding t-value.

  1. Run 24 five-fold cross validation experiments on original data set Sa. That is, divide Sa into a training set Straining and a test set Stesting.
  2. After training the predictive model M on Straining using C4.5 with default settings, obtain the prediction accuracy Ainitial on Stesting with M.
  3. Keeping Stesting unchanged, add all instances of Sb to Straining except for the instances that already exist in Sa, calling this the new training dataset S'training. Then train a new predictive model M' using C4.5 with default settings. Finally, predict the binding ability of peptides in testing set Stesting with M' to obtain a new accuracy Acombine.
  4. Evaluate all 120 pairs of Ainitial and Acombine using the statistical t-test.

Note that it is important to maintain consistency between the base and auxiliary datasets. That is, since we verify accuracy improvement by combining these two datasets, the prediction accuracy of each dataset should be made the same. This prediction accuracy is controlled by reducing the redundancy of each dataset using a new technique which we describe in the next section.
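The combination steps for a single cross-validation fold can be sketched as below. The sketch is hedged in several ways: instances are assumed to be (peptide, label) pairs, "already exist in Sa" is read as a match on the peptide sequence, `fit` is a hypothetical callable that returns a model, and `toy_fit` is a deliberately trivial stand-in for C4.5 that only serves to make the example runnable.

```python
def augmented_training_set(s_training, s_a, s_b):
    """Return Straining plus every instance of Sb whose peptide is not in Sa."""
    base_peptides = {peptide for peptide, _ in s_a}
    extra = [(p, y) for p, y in s_b if p not in base_peptides]
    return list(s_training) + extra


def accuracy(model, s_testing):
    """Percentage of test instances that the model labels correctly."""
    correct = sum(1 for p, y in s_testing if model(p) == y)
    return 100.0 * correct / len(s_testing)


def evaluate_fold(fit, s_training, s_testing, s_a, s_b):
    """Return (Ainitial, Acombine) for one fold; Stesting is never modified."""
    a_initial = accuracy(fit(s_training), s_testing)
    combined = augmented_training_set(s_training, s_a, s_b)
    a_combine = accuracy(fit(combined), s_testing)
    return a_initial, a_combine


def toy_fit(instances):
    """Hypothetical stand-in for C4.5: remember C-terminal residues of binders."""
    seen = {p[-1] for p, y in instances if y == 1}
    return lambda peptide: 1 if peptide[-1] in seen else 0


# Toy base dataset Sa (split into one fold) and auxiliary dataset Sb.
s_a = [("ALAKAAYAV", 1), ("GGGGGGGGK", 0), ("KLAKAAYAL", 1), ("GGGGGGGGR", 0)]
s_b = [("QLAKAAYAL", 1), ("ALAKAAYAV", 1), ("EEEEEEEEK", 0)]
a_initial, a_combine = evaluate_fold(toy_fit, s_a[:2], s_a[2:], s_a, s_b)
```

Because Stesting is a subset of Sa, excluding every peptide of Sa from the added instances also guarantees that no test peptide leaks into the enlarged training set. Repeating `evaluate_fold` over all 120 folds would yield the paired accuracies that enter the t-test in step 4.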

2.5 Redundancy reduction
In general, redundancy reduction techniques help predictive models avoid overfitting and generalize well to unseen data. Different redundancy reduction techniques have already been used in various studies on MHC peptide binding prediction (Yu et al., 2002; Dönnes and Elofsson, 2002; Buus et al., 2003; Nielsen et al., 2004), but they tend to be rather ad hoc and primitive. Yu et al. (2002) simply removed all peptides from the training set that differed by only a single amino acid from the test peptides, Dönnes and Elofsson (2002) ensured that no two peptides shared more than four amino acids in the binding dataset, and Buus et al. (2003) discarded peptides whose pairwise alignment scores exceeded a given threshold. Nielsen et al. (2004) performed homology reduction to make sure that no peptide in the test set shared sequence identity >90% with the peptides in the training set. In contrast, we use entropy, a measure from information theory, to reduce redundancy in the binding peptides. Given a set S of binding peptides (without non-binding peptides) for a specific MHC molecule, we derive a 20-row × 9-column matrix C containing the count of each distinct amino acid occurring at each position in S. Denoting each element nij of matrix C as the number of occurrences of amino acid i at position j among all peptides, and N as the total number of peptides in the set S, the entropy of dataset S is:

$$\mathrm{Entropy}(S) = -\sum_{j=1}^{9}\sum_{i=1}^{20}\frac{n_{ij}}{N}\log\frac{n_{ij}}{N}$$

In our procedure, we repeatedly remove the peptide whose removal maximizes this entropy function. That is, at each step we consider every set of N – 1 peptides obtainable from the current set of N peptides, measure its information content, and keep the set that maximizes this measure. The maximization of information content by C4.5 to recursively partition the data is thus consistent with our approach of reducing the redundancy of a peptide dataset. Since the prediction accuracies need to be kept consistent during our experiments, for each HLA molecule's binding peptide dataset, we produce peptide binding datasets whose predictive accuracies from the 24 cross-validation runs are ~80%.
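A single removal step can be sketched as follows. The sketch assumes that the entropy of a peptide set is the sum over positions of the Shannon entropy of the residue frequencies nij/N, with terms for absent residues counted as zero; the base-2 logarithm, the function names, and the toy sequences are assumptions. The stopping rule based on the C4.5 accuracy level is not implemented here.

```python
import math
from collections import Counter


def positional_entropy(peptides):
    """Sum over positions of the entropy (bits) of the residue frequencies."""
    n = len(peptides)
    length = len(peptides[0])
    total = 0.0
    for j in range(length):
        counts = Counter(p[j] for p in peptides)  # column j of the count matrix
        total -= sum((c / n) * math.log2(c / n) for c in counts.values())
    return total


def remove_most_redundant(peptides):
    """Drop the one peptide whose removal leaves the highest-entropy set."""
    best_index = max(
        range(len(peptides)),
        key=lambda i: positional_entropy(peptides[:i] + peptides[i + 1:]),
    )
    return peptides[:best_index] + peptides[best_index + 1:]


# Toy set: two near-identical peptides and one unrelated peptide.
toy = ["AAAAAAAAA", "AAAAAAAAC", "DEFGHIKLM"]
reduced = remove_most_redundant(toy)
```

In the toy set, one of the two near-identical peptides is discarded and the unrelated peptide is kept, which is the intended effect of the reduction. Each step evaluates N candidate subsets, so a full reduction of a large dataset with this direct approach costs on the order of N² entropy evaluations.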

Given a binding peptide dataset S of size N, the pseudocode for creating peptide binding dataset Set(A) at predictive accuracy level A (say 80%) from S is shown in Figure 1. The specific percentage of peptides that are removed by redundancy reduction depends on the size of the initial dataset, the redundancy in the initial dataset, and the target prediction accuracy. In Table 2, the number of instances in each dataset of each HLA molecule at 80% accuracy levels is listed.


Fig. 1 The pseudocode for creating a peptide binding dataset at C4.5 predictive accuracy level A from a dataset S.

 

Table 2 The number of all (binding and non-binding) nonamers in the datasets for each HLA molecule at 80% accuracy level

 
2.6 Experimental procedure
Our experimental procedure consists of the following:

(1) Preliminary experiments on controlled datasets. We carried out our experiments on two types of preliminary datasets: one combining homogeneous datasets (binding data from the same MHC molecule) and one incorporating a randomly generated dataset.

  • Combining homogeneous datasets. As shown in Figure 2, the base dataset Sa is constructed by randomly selecting 200 nonameric peptides from the dataset S of all available nonamers that bind to the HLA-A*0201 molecule (positive instances) and by randomly generating 200 nonameric peptides from the human genome database (negative instances). Then an auxiliary dataset Sh is created similarly from S and the human genome database, ensuring that no peptide existing in Sh is found in Sa. Sh is then added to Sa, and two prediction accuracies Ainitial and Acombine are obtained. To reduce any bias that may exist in the particular dataset selected, we generate Sa and Sh randomly 50 times to calculate a set of Ainitial and Acombine values, which are then analyzed using the paired sample t-test.
  • Combining random datasets. The base dataset is constructed in the same way as the base dataset in the previous item, but the auxiliary dataset is generated differently. To create an auxiliary random dataset Sr, both the 200 positive and 200 negative instances, none of which occur in Sa, are randomly generated from the human genome database. Furthermore, to investigate the effect of the size of combined random peptide binding datasets, eight random datasets Sr1, Sr2, ... , Sr8 of varying sizes are generated: 200, 400, 800, 1200, 1600, 2000, 3000 and 4000. Each of these random datasets is then individually combined with the base dataset to calculate the prediction accuracy. These procedures are repeated 50 times to reduce bias.


Fig. 2 The creation of homogeneous datasets in our experiment.

 
(2) Main experiments combining peptide binding datasets of different HLA molecules. These datasets are the focus of our study. The peptide binding datasets of different HLA molecules, brought to the same accuracy level by our redundancy reduction technique, are combined in pairs to analyze the effect of combination. All these datasets were examined under the same experimental procedure using our predictive model. We are especially interested in the HLA molecule pairs whose initial predictive accuracies improve after combination.


    3 RESULTS
3.1 Preliminary experiments

  • Combining homogeneous datasets. As illustrated in Table 3, the prediction accuracy improves significantly after incorporating Sh into Sa. That is, when two homogeneous datasets are combined to build our predictive model, the predictive accuracy improves significantly.
  • Combining random datasets. As shown in Table 4, incorporating a random dataset into the base dataset reduces the accuracy of prediction. The prediction accuracy of the combined dataset decreases monotonically as the size of the added random dataset increases. The initial prediction accuracy is 80.6%.
From these preliminary experiments combining homogeneous and random datasets, we can see that adding a similar peptide binding dataset into the base dataset improves the predictive accuracy of our model, while adding random peptide binding datasets decreases performance.


Table 3 The comparison of Ainitial (predictive accuracy before combination) and Acombine (predictive accuracy after combining a homogeneous dataset) over 50 experiments ($\bar{A}_{initial}$ is the mean of Ainitial, and $\bar{A}_{combine}$ is the mean of Acombine).

 

Table 4 The prediction accuracy of the combined peptide binding dataset decreases monotonically as the size of the added random dataset increases.

 
3.2 Combining peptide binding datasets of different HLA molecules
We combined the datasets of two different HLA molecules at the same accuracy level. The result of the combination at accuracy level 80% is shown in Table 5, which can be viewed as a prediction accuracy matrix Aij. In this matrix, cell aii in the principal diagonal of the matrix represents the prediction accuracy of the dataset for the corresponding molecule before adding any auxiliary binding data. The other cells aij represent the prediction accuracy of the HLA molecule at row i after incorporating the auxiliary dataset of the HLA molecule at column j. Each cell (except along the principal diagonal) contains not only the prediction accuracy after combination, but also the t-value in the paired sample t-test comparing prediction accuracy before and after combination over the 24 five-fold cross validation runs. If there exists a statistically significant difference (99.9% or above) in prediction accuracy before and after combination, the t-value is printed in bold. An improvement in the prediction accuracy at a statistically significant level (99.9% or above) is indicated by a bolded prediction accuracy value after combination. Our experimental results show that combining binding data of different types of molecules to improve prediction accuracy works in various combinations. Examining these cases, we are especially interested in HLA molecule pairs A and B such that the addition of the binding data of A to B improves the prediction accuracy of B, and vice versa.


Table 5 The prediction accuracy of the combined dataset from different HLA molecules at the same accuracy level of 80% (corresponding t-values are given in parentheses)

 
From these experimental results, we find that the improvement in prediction accuracy mainly comes from the combination of two alleles in the same supertype that have similar peptide binding specificities. Out of the 40 statistically significantly improved combinations among all 132 possible combinations, 34 (85%) are combinations of two alleles in the same supertype. On the other hand, out of all 40 combinations of two alleles in the same supertype, 38 (95%) improve the original prediction accuracy, and 34 (85%) are statistically significant at the 99.9% level. Thus we next focus on the combinations of two alleles in the same supertype. In our experiment, there are only two supertypes that have more than two alleles, the A3 supertype (A*0301, A*1101, A*3101, A*3301 and A*6801) and the A2 supertype (A*0201, A*0202, A*0203, A*0206 and A*6802).
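The combination counts quoted in this section follow from counting ordered (base, auxiliary) pairs. In the short check below, the group sizes of one allele each for A1 and A24 are an inference rather than a stated fact: they are the only sizes consistent with 12 alleles in total and 40 same-supertype combinations, given five alleles each in A2 and A3.

```python
# Number of alleles per supertype among the 12 alleles used for combination.
# The sizes for A1 and A24 are inferred from the reported totals (assumption).
sizes = {"A1": 1, "A2": 5, "A3": 5, "A24": 1}

n_alleles = sum(sizes.values())

# Each combination is an ordered pair: base allele first, auxiliary second.
all_pairs = n_alleles * (n_alleles - 1)
same_supertype = sum(s * (s - 1) for s in sizes.values())
different_supertype = all_pairs - same_supertype

# Ordered pairs with one allele from A2 and the other from A3.
a2_a3_pairs = 2 * sizes["A2"] * sizes["A3"]
```

These values reproduce the 132 possible combinations, the 40 same-supertype and 92 different-supertype combinations, and the 50 combinations between the A2 and A3 supertypes discussed in the text.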

3.2.1 A*0301, A*1101, A*3101, A*3301 and A*6801
Among the 20 possible combinations of alleles from A*0301, A*1101, A*3101, A*3301 and A*6801 at the 80% level, all improve the original prediction accuracies, and 19 (95%) are statistically significant at the 99.9% level. This indicates that the combination of peptide binding datasets between HLA molecules in this group of HLA-A*0301, A*1101, A*3101, A*3301 and A*6801 always improves the original prediction accuracy.

3.2.2 A*0201, A*0202, A*0203, A*0206 and A*6802
Among the 20 possible combinations of alleles from A*0201, A*0202, A*0203, A*0206 and A*6802 at the 80% accuracy level, 18 (90%) improve the original prediction accuracies and 15 (75%) are statistically significant at the 99.9% level. This indicates that the combination of peptide binding datasets between HLA molecules in this group of HLA-A*0201, A*0202, A*0203, A*0206 and A*6802 usually improves the original prediction accuracy.

On the other hand, the combination of two alleles in different supertypes rarely improves the original prediction accuracy. Among all 92 such combinations, only 6 (6.5%) improve the original prediction accuracies to a statistically significant degree at the 99.9% level. We also find that combining the peptide binding data of two alleles, one from the A2 supertype and one from the A3 supertype, decreases the initial prediction accuracy significantly. One notable example is the combination of A*6801 and A*6802, which decreases the original prediction accuracy significantly. Although A*6801 and A*6802 belong to the same allotype, they have distinctly different peptide binding specificities and are classified into the A3 and A2 supertypes, respectively. Among the 50 possible combinations of two alleles, one from each of the A2 and A3 supertypes, 48 (96%) decrease the initial prediction accuracies to a statistically significant degree at the 99.9% level, implying that the alleles in these two supertypes differ greatly in peptide binding specificities. An interesting result is that the combinations of A*0101 with A*0301, and of A*0101 with A*1101, improve the original prediction accuracy. Since these alleles represent two different supertypes that are associated with somewhat different main anchor specificities, we believe that the improvements observed can be attributed to shared preferences at non-anchor positions. It has indeed been noted that A*0101, A*0301 and A*1101 bear a close evolutionary kinship (McKenzie et al., 1999; Lawlor et al., 1990). It has also been observed that mutation rates are faster at residues forming the main peptide binding pockets than at other sites along the peptide binding region (Sette and Hughes, 2003; Hughes and Hughes, 1995). Together, these observations suggest that the similarities between A*0101, A*0301 and A*1101 reflect their common ancestry.

3.2.3 Summary
Based on the experimental results at accuracy level 80%, we have three basic observations.

  • The combination of peptide binding data of HLA-A alleles in the same supertype, e.g. within the A2 or A3 supertypes, improves original prediction accuracy.
  • The combination of peptide binding data of HLA-A alleles in different supertypes rarely improves the original prediction accuracy, and sometimes decreases it to a statistically significant degree.
  • Even though the alleles belong to different supertypes, the combinations of A*0101 with A*0301 and of A*0101 with A*1101 improve the original prediction accuracy.

We also carried out combination experiments on different HLA molecules at accuracy level 75%, and obtained similar observations (see Supplementary information I). Furthermore, to verify the generality of our method, we examined the performance of incorporating binding data of different HLA alleles in the same supertype using another predictive model, an SVM. Similar observations were obtained, and the experimental results are provided in Supplementary information II. In addition, we explored the sequence similarity of different peptide binding datasets during combination. The experimental results are provided in Supplementary information III. We show that the sequence similarity between binding peptide datasets of alleles in the same supertype is significantly higher than that between datasets of alleles in different supertypes, which helps to explain the improvement in prediction accuracy after incorporating the binding data of alleles in the same supertype.


    4 DISCUSSION
Incorporating new data for predicting MHC binding peptides has also been examined by other researchers. Yu et al. found that with more training data, the performance of a prediction system based on ANN and HMM could be improved in general (Yu et al., 2002). Brusic et al. cyclically refined the predictive model to improve prediction accuracy by including new data (Brusic et al., 2001), i.e. the binding data of the MHC molecule of interest. On the other hand, to predict peptides that bind to an MHC molecule without experimental data, some researchers incorporated binding data of MHC molecules in the same supertype (Brusic et al., 2002; Srinivasan et al., 2004). To verify their proposed supertypes of HLA class I alleles, Lund et al. (2004) used the peptide binding weight matrices of HLA molecules to predict the binding affinity of peptides to other HLA molecules in the same supertype. They found that the predicted values are positively correlated with the experimental values. In contrast to these studies, we start from the original binding data of the molecule of interest and add to it the binding data of another MHC molecule from any supertype, whether the same or a different one. In this way, the effect of combining peptide binding data of two different MHC molecules in the same or different supertypes is examined. From the experimental results, we find that combining binding data of two MHC molecules in the same supertype usually improves the prediction accuracy. Thus the key to improving prediction accuracy is to identify the group of MHC molecules with similar peptide binding specificities.

The development of vaccines that can cover a broad distribution of the human population stimulates researchers to classify HLA alleles into supertypes with similar specificities. Based on supermotifs shared by different HLA molecules, Sette and Sidney (1999) reported nine supertypes (A1, A2, A3, A24, B7, B27, B44, B58, B62) in HLA class I molecules. For example, the HLA alleles in the A3 supertype prefer A, V, I, L, M, S or T at position 2, and R, K or Y at the C-terminus. The HLA alleles in the A2 supertype prefer L, I, V, M, A, T or Q at position 2, and L, I, V, M, A or T at the C-terminal position. These shared preferences can explain why prediction accuracy improved in our experiments when we combined the peptide binding data of alleles within the A2 supertype or within the A3 supertype, and why it decreased when we combined the data of one A2 allele with those of one A3 allele. The improvement of prediction accuracy in combining A*0101 with A*0301, and A*0101 with A*1101, can also be explained in this way. Even though A*0101 is phylogenetically and structurally similar to A*0301 and A*1101 (McKenzie et al., 1999; Doytchinova et al., 2004), A*0101 has different peptide binding specificities and cannot be classified into the A3 supertype. The most telling difference is at position 77, which constitutes a major peptide contact residue in the F pocket. Unlike alleles in the A3 supertype (such as A*0301 and A*1101), A*0101 has an N rather than the acidic residue D. This difference changes the ability of the pocket to accommodate the basic residues (R and K) preferred by A3-supertype molecules. In spite of these obvious differences, A*0101 still shares some features with A*0301 and A*1101 in terms of peptide binding specificities. A*0101 prefers T, S, I, V, L and M at position 2, which is similar to the preferences of A*0301 and A*1101, and it prefers Y at the C-terminus, which A*0301 and A*1101 also prefer. This intermediate level of cross-reactivity is only one-way, as A*0101 does not prefer R or K at the C-terminus. It can also explain the observation in our experiment that adding A*0101 binders to A*0301 (or A*1101) binders improved the accuracy to around 84%, which is higher than the accuracy of around 82% obtained by adding A*0301 (or A*1101) binders to A*0101 binders.
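
The one-way overlap described above can be made concrete with a small sketch. It is an illustration only, not the prediction method used in this article: it assumes that a motif is fully described by its position 2 and C-terminal preferences as quoted in the text, and the helper names (MOTIFS, anchor_pairs, coverage, matches) are hypothetical.

```python
from itertools import product

# Anchor preferences as quoted in the text: (position 2, C-terminus).
# Assumption: only these two anchor positions are modeled.
MOTIFS = {
    "A2 supertype": (set("LIVMATQ"), set("LIVMAT")),
    "A3 supertype": (set("AVILMST"), set("RKY")),
    "A*0101": (set("TSIVLM"), set("Y")),
}


def anchor_pairs(name):
    """All (position-2, C-terminal) residue pairs allowed by a motif."""
    p2, c_term = MOTIFS[name]
    return set(product(p2, c_term))


def coverage(source, target):
    """Fraction of the source motif's anchor pairs also allowed by the target."""
    src = anchor_pairs(source)
    return len(src & anchor_pairs(target)) / len(src)


def matches(peptide, name):
    """True if position 2 and the last residue of the peptide fit the motif."""
    p2, c_term = MOTIFS[name]
    return peptide[1] in p2 and peptide[-1] in c_term


if __name__ == "__main__":
    for source, target in [
        ("A*0101", "A3 supertype"),
        ("A3 supertype", "A*0101"),
        ("A2 supertype", "A3 supertype"),
    ]:
        print(f"{source} -> {target}: {coverage(source, target):.2f}")
```

Under this simplified anchor-only view, all 6 anchor pairs of A*0101 are allowed by the A3 motif, whereas only 6 of the 21 A3 anchor pairs are allowed by A*0101, and the A2 and A3 motifs share no anchor pair because their C-terminal preferences do not overlap. This is consistent with the asymmetry and the A2/A3 decrease discussed above, although real binding also depends on residues outside the two anchor positions.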

This work also sheds light on the evolution of HLA class I genes. In a review article, Klein et al. (1993) indicated that the evolution of MHC molecules has occurred through accumulated mutations, and they compared the evolutionary rate of MHC class I with that of class II. Several studies have found frequent gene conversion in HLA class I molecules (Hughes et al., 1993; Parham et al., 1988). In addition, Hughes et al. (1993) reported that HLA-A*0101, A*0301 and A*1101 are all evolutionarily close to each other. More often than HLA class II genes, HLA class I genes have exploited gene conversion-like recombination events in order to transplant an anchor preference en bloc. Although this appears to have been a rather effective tactic for changing the anchor amino acids, the present analysis of binding similarities between A*0101, A*0301 and A*1101 may show that HLA evolution is still a slow process carried on over a substantial part of the repertoire.

In this article, we have proposed a new approach for predicting binders to an MHC molecule by incorporating auxiliary peptide binding data from other MHC molecules. We have also presented a method for reducing redundancy in a set of binding peptides. Our experimental results show that our approach significantly improves the accuracy of predicting peptides that bind to an MHC molecule, especially when the base and auxiliary molecules belong to the same supertype and thus have similar peptide binding specificities. Future work should explore the effect of combining binding data from multiple alleles in the same supertype.


    Acknowledgments
 
The authors would like to thank Nicolas Majeux for performing preliminary experiments, and the reviewers, who provided many suggestions to improve the original manuscript. This work is supported in part by the Bioinformatics Education Program ‘Education and Research Organization for Genome Information Science’ and the Kyoto University 21st Century COE Program ‘Knowledge Information Infrastructure for Genome Science’ with support from MEXT (Ministry of Education, Culture, Sports, Science and Technology), Japan.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Satoru Miyano

Received on June 23, 2005; revised on March 8, 2006; accepted on April 8, 2006

    REFERENCES

    Bertoni, R., et al. (1997) Human histocompatibility leukocyte antigen-binding supermotifs predict broadly cross-reactive cytotoxic T lymphocyte responses in patients with acute hepatitis. J. Clin. Invest., 100, 503–513.

    Bertoni, R., et al. (1998) Human class I supertypes and CTL repertoires extend to chimpanzees. J. Immunol., 161, 4447–4455.

    Bhasin, M., et al. (2003) MHCBN: a comprehensive database of MHC binding and non-binding peptides. Bioinformatics, 19, 665–666.

    Blythe, M.J., et al. (2002) JenPep: a database of quantitative functional peptide data for immunology. Bioinformatics, 18, 434–439.

    Brusic, V., et al. (1998a) Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network. Bioinformatics, 14, 121–130.

    Brusic, V., et al. (1998b) MHCPEP, a database of MHC-binding peptides: update 1997. Nucleic Acids Res., 26, 368–371.

    Brusic, V., Wilkins, J.S., Stanyon, C.A., Zeleznikow, J. (1998c) Data learning: understanding biological data. In Merrill, G. and Pathak, D.K. (Eds.), Knowledge Sharing Across Biological and Medical Knowledge Based Systems: Papers from the 1998 AAAI Workshop, AAAI Technical Report WS–98–04. AAAI Press, pp. 12–19.

    Brusic, V., Zeleznikow, J., Sturniolo, T., Bono, E., Hammer, J. (1999) Data cleaning for computer models: a case study from immunology. Proceedings of ICONIP99, The Sixth International Conference on Neural Information Processing. IEEE Press, pp. 603–609.

    Brusic, V., et al. (2001) Efficient discovery of immune response targets by cyclical refinement of QSAR models of peptide binding. J. Mol. Graph. Model., 19, 405–411.

    Brusic, V., et al. (2002) Prediction of promiscuous peptides that bind HLA class I molecules. Immunol. Cell Biol., 80, 280–285.

    Buus, S., et al. (2003) Sensitive quantitative predictions of peptide-MHC binding by a ‘Query by Committee’ artificial neural network approach. Tissue Antigens, 62, 378–384.

    Cano, P., et al. (1998) A geometric study of the amino acid sequence of class I HLA molecules. Immunogenetics, 48, 324–334.

    Chelvanayagam, G. (1996) A roadmap for HLA-A, HLA-B, and HLA-C peptide binding specificities. Immunogenetics, 45, 15–26.

    Doolan, D.L., et al. (1997) Degenerate cytotoxic T cell epitopes from P. falciparum restricted by multiple HLA-A and HLA-B supertype alleles. Immunity, 7, 97–112.

    Dönnes, P. and Elofsson, A. (2002) Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics, 3, 25–32.

    Doytchinova, I.A., et al. (2004) Identifying human MHC supertypes using bioinformatics methods. J. Immunol., 172, 4314–4323.

    Falk, K., et al. (1991) Allele-specific motifs revealed by sequencing of self-peptides eluted from MHC molecules. Nature, 351, 290–296.

    Friedman, J.H. (1977) A recursive partitioning decision rule for non-parametric classification. IEEE Trans. Comput., 26, 404–408.

    Gulukota, K., et al. (1997) Two complementary methods for predicting peptides binding major histocompatibility complex molecules. J. Mol. Biol., 267, 1258–1267.

    Hammer, J., et al. (1994) Precise prediction of MHC class II-peptide interaction based on peptide side chain scanning. J. Exp. Med., 180, 2353–2358.

    Hughes, A.L., et al. (1993) Contrasting roles of interallelic recombination at the HLA-A and HLA-B loci. Genetics, 133, 669–680.

    Hughes, A.L. and Hughes, M.K. (1995) Self peptides bound by HLA class I molecules are derived from highly conserved regions of a set of evolutionarily conserved proteins. Immunogenetics, 41, 257–262.

    Janeway, C.A., Travers, P., Walport, M., Shlomchik, M. (2001) Immunobiology: The Immune System in Health and Disease. Garland Publishing, New York.

    Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, 277–280.

    Klein, J., et al. (1993) The molecular descent of the major histocompatibility complex. Annu. Rev. Immunol., 11, 269–295.

    Lawlor, D.A., et al. (1990) Evolution of class-I MHC genes and proteins: from natural selection to thymic selection. Annu. Rev. Immunol., 8, 23–63.

    Lund, O., et al. (2004) Definition of supertypes for HLA molecules using clustering of specificity matrices. Immunogenetics, 55, 797–810.

    Mamitsuka, H. (1998) Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models. Proteins, 33, 460–474.

    McKenzie, L.M., et al. (1999) Taxonomic hierarchy of HLA class I allele sequences. Genes Immun., 1, 120–129.

    Nielsen, M., et al. (2004) Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics, 20, 1388–1397.

    Parham, P., et al. (1988) Nature of polymorphism in HLA-A, -B, and -C molecules. Proc. Natl Acad. Sci. USA, 85, 4005–4009.

    Parker, K., et al. (1994) Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual side chain scanning. J. Immunol., 152, 163–175.

    Peters, B., et al. (2003) Examining the independent binding assumption for binding of peptide epitopes to MHC-I molecules. Bioinformatics, 19, 1765–1772.

    Quinlan, J.R. (1979) Discovering rules by induction from large collections of examples. In Michie, D. (Ed.), Expert Systems in the Micro Electronic Age. Edinburgh University Press, Edinburgh, UK, pp. 168–201.

    Quinlan, J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, USA.

    Rammensee, H.G., et al. (1995) MHC ligands and peptide motifs: 1st listing. Immunogenetics, 41, 178–228.

    Rammensee, H.G., et al. (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50, 213–219.

    Reche, P.A., et al. (2002) Prediction of MHC class I binding peptides using profile motifs. Hum. Immunol., 63, 701–709.

    Riedesel, H., et al. (2004) Peptide binding at class I MHC scored with linear functions and support vector machines. Gen. Inform., 15, 198–212.

    Ruppert, J., et al. (1993) Prominent role of secondary anchor residues in peptide binding to HLA-A2.1 molecules. Cell, 74, 929–937.

    Sathiamurthy, M., et al. (2003) Population of the HLA ligand database. Tissue Antigens, 61, 12–19.

    Schönbach, C., et al. (2002) FIMM, a database of functional molecular immunology update 2002. Nucleic Acids Res., 30, 226–229.

    Segal, A., et al. (2001) Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics, 57, 632–642.

    Sette, A. and Sidney, J. (1999) Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics, 50, 201–212.

    Sette, A., et al. (2003) Class I molecules with similar peptide-binding specificities are the result of both common ancestry and convergent evolution. Immunogenetics, 54, 830–841.

    Srinivasan, K.N., et al. (2004) Prediction of class I T-cell epitopes: evidence of presence of immunological hot spots inside antigens. Bioinformatics, 20, suppl.1, i297–i302.

    Sturniolo, T., et al. (1999) Generation of tissue-specific and promiscuous HLA ligand database using DNA microarrays and virtual HLA class II matrices. Nat. Biotechnol., 17, 555–561.

    Udaka, K., et al. (1995) Decrypting the structure of MHC-I restricted CTL epitopes with complex peptide libraries. J. Exp. Med., 181, 2097–2108.

    Udaka, K., et al. (2000) An automated prediction of MHC class I-binding peptides based on positional scanning with peptide libraries. Immunogenetics, 51, 816–828.

    Udaka, K., et al. (2002) Empirical evaluation of a dynamic experiment design method for prediction of MHC class I-binding peptides. J. Immunol., 169, 5744–5753.

    Weiss, G.M. and Provost, F. (2003) Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intell. Res., 19, 315–354.

    Yu, K., et al. (2002) Methods for prediction of peptide binding to MHC molecules: a comparative study. Mol. Med., 8, 137–148.

    Zhang, C., et al. (1998) Structural principles that govern the peptide-binding motifs of class I MHC molecules. J. Mol. Biol., 281, 929–947.




