
Bioinformatics 2007 23(2):e148-e155; doi:10.1093/Bioinformatics/btl324

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Machine Learning in Computational Biology

Identifying HLA supertypes by learning distance functions

Tomer Hertz 1,2,*,† and Chen Yanover 1,*,†

1 School of Computer Science and Engineering Israel
2 Interdisciplinary Center for Neural Computation, The Hebrew University of Jerusalem Israel

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 LEARNING PEPTIDE DISTANCE...
 3 LEARNING A DISTANCE...
 4 EXPERIMENTS AND RESULTS
 5 DISCUSSION AND CONCLUDING...
 REFERENCES
 

Motivation: The development of epitope-based vaccines crucially relies on the ability to classify Human Leukocyte Antigen (HLA) molecules into sets that have similar peptide binding specificities, termed supertypes. In their seminal work, Sette and Sidney defined nine HLA class I supertypes and claimed that these provide an almost perfect coverage of the entire repertoire of HLA class I molecules.

HLA alleles are highly polymorphic and polygenic and therefore experimentally classifying each of these molecules to supertypes is at present an impossible task. Recently, a number of computational methods have been proposed for this task. These methods are based on defining protein similarity measures, derived from analysis of binding peptides or from analysis of the proteins themselves.

Results: In this paper we define both peptide derived and protein derived similarity measures, which are based on learning distance functions. The peptide derived measure is defined using a peptide–peptide distance function, which is learned using information about known binding and non-binding peptides. The protein derived similarity measure is defined using a protein–protein distance function, which is learned using information about alleles previously classified to supertypes by Sette and Sidney (1999). We compare the classification obtained by these two complementary methods to previously suggested classification methods. In general, our results are in excellent agreement with the classifications suggested by Sette and Sidney (1999) and with those reported by Buus et al. (2004).

The main advantage of our proposed distance-based approach is that it makes use of two different and important immunological sources of information: HLA alleles and peptides that are known to bind or not bind to these alleles. Since each of our distance measures is trained using a different source of information, their combination can provide a more confident classification of alleles to supertypes.

Contact: tomboy{at}cs.huji.ac.il; cheny{at}cs.huji.ac.il


    1 INTRODUCTION
 
The major task of recognizing foreign pathogen proteins is mediated by interactions between Major Histocompatibility Complex (MHC) molecules and short pathogen-derived peptides. When such a peptide binds to an MHC molecule, the complex is transported to the cell surface, where it can be recognized by T-cells that in turn elicit an immune response. Predicting protein–peptide binding is therefore of central importance for developing peptide-based (or epitope-based) vaccines. These vaccines contain only selected sub-sequences, or epitopes, derived from an entire protein, which are known to bind to various MHC molecules. There are several important advantages of epitope-based vaccines: they can induce more potent immune responses and they appear to be safer to use and easier to produce (Sette and Sidney, 1999; Sette et al., 2002; Sette and Fikes, 2003; Lund et al., 2004).

Designing epitope-based vaccines with high population coverage is a challenging problem for two main reasons. First, MHC molecules are highly selective and only bind to specific peptides: each molecule binds to ~1% of all existing peptides (Yewdell and Bennink, 1999). Binding specificity is determined by the molecular structure and the chemical properties of the MHC binding sites. Second, MHC molecules are highly polymorphic and polygenic. Currently, the IMGT/HLA database [Robinson et al. (2003), version 2.9] lists 1245 HLA class I and 744 HLA class II alleles (1046 HLA class I and 604 HLA class II proteins). Each individual carries only a few alleles (up to 6 HLA class I alleles and up to 12 HLA class II alleles) (Janeway et al., 2001). One way to overcome this large degree of polymorphism is to make use of epitopes that bind to many HLA molecules.

Luckily, it turns out that despite the high polymorphism exhibited by HLA alleles, many HLA molecules bind to sets of overlapping peptides. These alleles can be grouped into supertypes, that is, sets of alleles that bind to similar peptides. Identifying these supertypes is therefore an important task with clear implications for the development of epitope-based vaccines. However, experimentally determining the binding specificity of a single allele is a hard task which requires both rigorous experimental validation and theoretical analysis. Tackling this problem for more than 1600 proteins is currently impractical (Doytchinova et al., 2004). In addition, as may be seen in Figure 1, around 200 new HLA alleles are discovered each year. The difficulty of the task at hand and the rapid increase in the number of alleles call for the development of computational tools for HLA supertype classification.


 
Fig. 1 Number of new alleles added to the IMGT/HLA dataset in recent years (Robinson et al., 2003).

 
Recently, a number of computational methods have been proposed for HLA supertype classification (Reche and Reinherz, 2004; Lund et al., 2004; Doytchinova et al., 2004). These methods first define a protein–protein similarity (or equivalently a distance) measure. This similarity measure is then used to classify the proteins to supertypes. Although the supertype classification problem is defined over proteins, its underlying goal of finding peptides which bind to several proteins requires exploration of the ‘peptide space’. This inherent duality leads to two different approaches for defining the similarity between proteins:
  1. ‘Peptide-based’ approaches define a similarity measure between proteins that is based on properties of sets of peptides that bind to these proteins. The binding peptides used can either be experimentally determined binders (Sette and Sidney, 1999) or computationally predicted binders (e.g. based on binding motifs) (Reche and Reinherz, 2004). The similarity between these sets of binding peptides can be defined by counting the overlapping peptides (Sette and Sidney, 1999; Reche and Reinherz, 2004) or by representing each set by a motif, and then defining some similarity measure over these motifs (Lund et al., 2004).
  2. ‘Protein-based’ approaches define a similarity measure that is based on the properties of the proteins themselves. It has been noted that proteins that bind to a set of overlapping peptides have binding sites that are similar to one another (Doytchinova et al., 2004). By using some canonical representation, one can define a similarity measure over these binding sites.

In this paper we present a novel approach for computationally identifying supertypes that is based on learning distance functions. Unlike previous works, we address the supertype classification problem using both of the complementary approaches described above, i.e. both peptide- and protein-based similarities. More specifically, we propose to explicitly learn distance functions of two different types:

  1. A peptide–peptide distance function using information about binding and non-binding peptides. We have recently presented a framework for protein–peptide binding prediction, based on learning a peptide–peptide distance function over an entire family of proteins (e.g. HLA class I) (Yanover and Hertz, 2005). In our current work, we show how this distance function can be naturally used to define a distance measure over proteins.
  2. A protein–protein distance function using information about alleles which have been experimentally determined to belong to the same supertype [e.g. by Sette and Sidney (1999)]. We propose to use the same distance learning algorithm to directly learn protein–protein distance functions.

Using these two complementary methods we classify a set of HLA-A and HLA-B alleles to supertypes. We then compare the results obtained using these two methods with previously proposed methods and characterize their regions of agreement/disagreement.

1.1 Related work
HLA class I supertypes were originally defined by Sette, Sidney and colleagues during the second half of the 1990s (del Guercio et al., 1995; Sidney et al., 1996a, b; Sette and Sidney, 1999). In these seminal works, supertype classification was essentially based on overlapping sets of peptides, known to bind to a subset of HLA alleles. In addition, the classification took into account properties of the binding pockets of these alleles. All in all, nine HLA supertypes were identified (Sette and Sidney, 1999): A1, A2, A3 and A24 for HLA-A alleles; B7, B27, B44, B58 and B62 for HLA-B alleles.

Recently, several computational methods for defining supertypes have been proposed. Both Lund et al. (2004) and Reche and Reinherz (2004) define supertypes based on similarity between sets of binding peptides. Lund et al. (2004) construct hidden Markov models (HMMs) for HLA class I molecules using a Gibbs sampling procedure. They then define a similarity measure between these sequence motifs and use this similarity to cluster alleles into supertypes. Reche and Reinherz (2004) rank a set of 1000 random peptides, using a position specific scoring matrix for each protein, and then consider the top 2% scoring peptides as predicted binders. They define a similarity between alleles that is based on counting the number of peptides that are predicted to bind to both alleles. As in Lund et al. (2004), they then cluster the alleles using the neighbor clustering algorithm from the Phylogeny Inference Package (PHYLIP) (Felsenstein, 1993).

A complementary approach is taken by Doytchinova et al. (2004), who classify supertypes by examining the binding sites of various HLA-A, HLA-B and HLA-C alleles. Each allele is modeled based on a reference X-ray structure and its binding site residues are used to define a set of characteristic properties. They then use both a hierarchical clustering algorithm and principal component analysis (PCA) to cluster these alleles into supertypes.

It is important to note that despite the fact that the supertype problem is of central importance, to date there is no clear ground truth classification of alleles into supertypes. In all the works described above, and in our current work, the results are compared with the supertypes defined by Sette and Sidney (1999). Their classification only includes a small subset of currently known HLA class I alleles, and therefore most alleles are currently unlabeled.


    2 LEARNING PEPTIDE DISTANCE FUNCTIONS FOR SUPERTYPE CLASSIFICATION
 
For the sake of completeness, we first present our novel framework for protein–peptide binding prediction based on learning peptide–peptide distance functions. We then show how these learned distance functions can be naturally used to define a protein–protein similarity measure for supertype classification.

2.1 Learning peptide–peptide distance functions
Previously proposed learning approaches for protein–peptide binding prediction address the binding prediction problem using traditional margin based binary classifiers: for each protein a classifier is trained to distinguish binding peptides from non-binding peptides (Donnes and Elofsson, 2002; Buus et al., 2003; Reche et al., 2004) [for a review see Flower (2003)]. Recently, we proposed PepDist: a novel approach for predicting binding affinity based on learning peptide–peptide distance functions (Yanover and Hertz, 2005; Hertz and Yanover, 2006). Our approach is based on two important observations:

OBSERVATION 1.
Peptides that bind to the same protein are similar to one another, and different from non-binding peptides.

OBSERVATION 2.
Peptides binding to different proteins within the same ‘family’ resemble each other.

Observation 1 implicitly underlies most, if not all, computational prediction methods. A direct implication of this observation is that the distance between a query peptide and a set of known binders can be used to predict the peptide's binding affinity: a peptide that is close to the set of binders would be classified as a binder and a peptide far from this set would be classified as a non-binder. We therefore suggested predicting protein–peptide binding affinity by learning a distance function over pairs of peptides. Moreover, based on Observation 2, we suggested learning a single peptide–peptide distance function over an entire family of proteins (e.g. MHC class I) (Yanover and Hertz, 2005; Hertz and Yanover, 2006). A similar approach was also taken by Heckerman et al. (2006). This distance function can be used to compute the affinity of a novel peptide to any of the proteins in the given family. Note that the very definition of supertypes is based on Observation 2. The learned peptide–peptide distances can therefore also be used to identify supertypes by clustering together proteins that bind to similar peptides.

Recently, there has been a growing interest in the problem of learning distance functions in the machine learning community. Most algorithms that learn distance functions make use of equivalence constraints (Hertz et al., 2004a, b; Bar-Hillel et al., 2003; Xing et al., 2002; Bilenko et al., 2004; Vert and Yamanishi, 2005). Equivalence constraints are relations between pairs of data points, which indicate whether the points in the pair belong to the same category or not. We term a constraint positive when the points are known to be from the same class, and negative in the opposite case. In this setting, the goal of the algorithm is to learn a distance function that complies with the equivalence constraints provided as input. Specifically, in our previous work (Yanover and Hertz, 2005; Hertz and Yanover, 2006) and in this work, we use the DistBoost algorithm, which is a semi-supervised distance learning algorithm (Hertz et al., 2004a, b). DistBoost learns a distance function using a well-known machine learning technique called Boosting (Schapire and Singer, 1999). For a detailed description of the algorithm see Hertz et al. (2004a) and Yanover and Hertz (2005).

We formalize the peptide–peptide distance learning problem as follows: each protein is denoted by some class label. Each pair of peptides that are known to bind to a specific protein (i.e. belong to the same class) defines a positive constraint, while each pair of peptides in which one binds to the protein and the other does not defines a negative constraint. Therefore, for each protein, our training data consist of a list of binding and non-binding peptides, and a set of equivalence constraints that they induce. We collect these sets of peptides and equivalence constraints for several proteins within a protein family into a single dataset. We then use this dataset to learn a peptide–peptide distance function (see Figure 2b). Using this distance function, we can predict the binding affinity of a novel peptide to a specific protein by measuring its average distance to all of the peptides which are known to bind to that protein (see Figure 3b). We have tested our approach on binding prediction of MHC class I and MHC class II datasets and in all cases our method provided excellent results which outperformed most state-of-the-art computational prediction methods (Yanover and Hertz, 2005; Hertz and Yanover, 2006).
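The constraint construction and the average-distance prediction described above can be sketched as follows. This is a minimal illustration, not the PepDist implementation: the function names are hypothetical, the peptides are made-up toy strings rather than real binding data, and a Hamming distance stands in for the learned DistBoost distance.

```python
from itertools import combinations, product

def peptide_constraints(binders, non_binders):
    """Equivalence constraints induced by one protein's peptide lists.

    Positive: every pair of peptides that both bind the protein.
    Negative: every (binder, non-binder) pair.
    """
    positive = list(combinations(binders, 2))
    negative = list(product(binders, non_binders))
    return positive, negative

def mean_distance_to_binders(query, binders, dist):
    """Average distance from a query peptide to a protein's known binders.
    A smaller value suggests a higher predicted affinity."""
    return sum(dist(query, b) for b in binders) / len(binders)

def hamming(p, q):
    """Toy stand-in for the learned distance (an assumption of this sketch)."""
    return sum(a != b for a, b in zip(p, q))

# Made-up 9mers, for illustration only.
binders = ["ACDEFGHIK", "ACDEFGHIR", "SCDEFGHIK"]
non_binders = ["WWWWWWWWW"]

positive, negative = peptide_constraints(binders, non_binders)
score = mean_distance_to_binders("ACDEFGHIL", binders, hamming)
```

In the setting described in the text, the constraints from several proteins of one family would be pooled into a single training set before the distance function is learned.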


 
Fig. 2 Peptide–peptide distance matrices of HLA binding peptides, collected from the MHCPEP dataset (Brusic et al., 1998). The data consist of peptides known to bind to nine different HLA class I proteins (see labels on the y-axis). Peptides that bind to each of the proteins were grouped together and labeled accordingly. In both matrices the value in position (i, j) represents the distance between $\mathrm{Peptide}_i$ and $\mathrm{Peptide}_j$ (the darker the color is, the smaller the distance). A ‘good’ peptide–peptide distance matrix should therefore be block diagonal. (a) Naive peptide–peptide distance matrix (Euclidean distance in $\mathbb{R}^{45}$). (b) The peptide–peptide distance matrix learned using the DistBoost algorithm. DistBoost was trained on binding peptides from all of the proteins simultaneously.

 


 
Fig. 3 The peptide–peptide distance function (a) can be used to provide both protein-specific binding prediction (b) and to identify supertypes (c). (b) A protein–peptide affinity matrix where the value in position (i, j) is the predicted binding affinity of $\mathrm{Peptide}_i$ to $\mathrm{Protein}_j$. Peptides that bind to each of the nine proteins were grouped together. The first three proteins belong to the A2 supertype, the next three proteins to the A3 supertype and the last three to the B27 supertype. As may be seen, the affinity values of peptides that are known to bind to a specific protein are higher than their affinity to other proteins. Also note that the binding affinities of peptides to different proteins within the same supertype are very similar. (c) The protein–protein distance matrix defined using peptide distances [Equation (1)]. Proteins are ordered as described above. As can be seen, the distances between proteins within each of the three supertypes are smaller than the distances between proteins from different supertypes.

 
2.2 Using peptide distance functions to define supertypes
As noted above, the definition of supertypes is based on identifying HLA molecules that bind to sets of overlapping peptides. This definition implies that the distances between binding peptides can be naturally used to define a distance measure over pairs of proteins. This protein–protein distance function can then be used to classify proteins to supertypes. A ‘good’ distance function would assign small distances to proteins which bind to sets of overlapping peptides, and large distances to proteins which bind to different (or non-overlapping) sets of peptides. One intuitive way to formalize this notion is to use the average distance between peptides that bind to two different proteins as a measure of their similarity. More formally, let us denote by $D_{\mathrm{peptides}}(\mathrm{Peptide}_i, \mathrm{Peptide}_j)$ the distance between $\mathrm{Peptide}_i$ and $\mathrm{Peptide}_j$. We define the distance between $\mathrm{Protein}_m$ and $\mathrm{Protein}_n$, $D_{\mathrm{proteins}}(\mathrm{Protein}_m, \mathrm{Protein}_n)$, to be

$$D_{\mathrm{proteins}}(\mathrm{Protein}_m, \mathrm{Protein}_n) = \frac{1}{N_{nm}} \sum_{i \in B_m} \sum_{j \in B_n} D_{\mathrm{peptides}}(\mathrm{Peptide}_i, \mathrm{Peptide}_j) \qquad (1)$$

where $B_n$ and $B_m$ denote the sets of peptides known to bind to $\mathrm{Protein}_n$ and $\mathrm{Protein}_m$, respectively, and $N_{nm} = |B_n| \cdot |B_m|$.
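Equation (1) is simply the mean of all cross-set peptide distances, which the following sketch computes directly. The function names and the toy three-letter "peptides" are hypothetical, and a Hamming distance is used as a placeholder for the learned peptide–peptide distance.

```python
def protein_distance(binders_m, binders_n, peptide_dist):
    """Average peptide-peptide distance over all pairs (i in B_m, j in B_n)."""
    n_pairs = len(binders_m) * len(binders_n)  # N_nm = |B_n| * |B_m|
    total = sum(peptide_dist(p, q) for p in binders_m for q in binders_n)
    return total / n_pairs

def hamming(p, q):
    """Placeholder distance; the paper uses a learned distance instead."""
    return sum(a != b for a, b in zip(p, q))

# Toy binder sets, for illustration only.
B_m = ["AAA", "AAB"]
B_n = ["AAA", "BBB"]

d_mn = protein_distance(B_m, B_n, hamming)  # (0 + 3 + 1 + 2) / 4
d_nm = protein_distance(B_n, B_m, hamming)  # same pairs, reversed order
```

With a symmetric peptide distance the protein distance is symmetric as well. Note that the distance of a protein to itself is generally not zero under this definition, since it averages over distinct binders.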

Figure 3c presents an illustrative example of the protein–protein distance matrix between nine different HLA class I alleles using the peptide–peptide distance matrix in Figure 3a (see also previous section). As can be clearly seen, the distances between proteins from the same supertype are smaller than the distances between proteins from different supertypes. In order to classify alleles into supertypes we can now ‘feed’ this protein distance matrix into any generic hierarchical clustering algorithm (e.g. average-linkage) and identify clusters of alleles as supertypes. We used this peptide-based similarity measure to classify a set of HLA-A and HLA-B proteins to supertypes and present the results obtained in Section 4.
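As a rough illustration of feeding a protein distance matrix into average-linkage clustering, the sketch below implements a naive agglomerative procedure in plain Python. It is not the software used in the paper; the function name, the toy 4x4 matrix and the choice to stop at a fixed number of clusters are assumptions made for the example.

```python
def average_linkage(dist, n_clusters):
    """Naive average-linkage agglomerative clustering.

    dist: symmetric distance matrix as a list of lists.
    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average distance.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Toy protein-protein distances: proteins 0,1 are close, as are 2,3.
D = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
groups = average_linkage(D, n_clusters=2)
```

For real datasets a library implementation would be preferable; this version recomputes all linkages at every step and is only meant to make the procedure concrete.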


    3 LEARNING A DISTANCE FUNCTION OVER PROTEIN BINDING SITES FOR DEFINING SUPERTYPES
 
In the previous section we defined supertypes based on peptide–peptide distance functions, learned using information about binding and non-binding peptides for a family of proteins. An alternative, complementary approach is to directly define a similarity measure between the proteins themselves. As noted by Doytchinova et al. (2004), the binding sites of two proteins that bind to similar peptides share common characteristics. Such similarities were also analyzed in the original works of Sette and Sidney [see e.g. Sette and Sidney (1999)], who used this information to suggest tentative assignment of alleles to various HLA class I supertypes.

Our distance learning algorithm can also be used to learn a distance function over the binding sites of the various HLA alleles. In order to learn such a distance function we need to define a representation of the binding sites for each allele and to show how to make use of experimentally validated supertype classifications to define equivalence constraints. Following Doytchinova et al. (2004), we represented the binding sites of HLA-A alleles using a set of 35 amino acids and HLA-B binding sites using a set of 37 amino acids. We extract equivalence constraints as follows: each pair of proteins which are known to belong to the same supertype [based on the classification of Sette and Sidney (1999)], forms a positive constraint and each pair of proteins which are known to belong to different supertypes forms a negative constraint. Note that since only 64 alleles were classified by Sette and Sidney (1999), we only have information regarding a small subset of all known alleles (currently 874), and therefore the distance learning scenario is semi-supervised.
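The extraction of protein-level constraints from a partial supertype labeling can be sketched as below. The function name and the allele identifiers are hypothetical placeholders; unlabeled alleles (marked `None` here, an assumption of this sketch) contribute no constraints, which is what makes the setting semi-supervised.

```python
from itertools import combinations

def supertype_constraints(labels):
    """Build equivalence constraints from known supertype assignments.

    labels: dict mapping allele name -> supertype, or None if unclassified.
    Returns (positive, negative) lists of allele pairs.
    """
    labeled = sorted(a for a, s in labels.items() if s is not None)
    positive, negative = [], []
    for a, b in combinations(labeled, 2):
        if labels[a] == labels[b]:
            positive.append((a, b))  # same supertype
        else:
            negative.append((a, b))  # different supertypes
    return positive, negative

# Placeholder allele names; not taken from any real classification.
labels = {"a1": "A2", "a2": "A2", "a3": "A3", "a4": None}
positive, negative = supertype_constraints(labels)
```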


    4 EXPERIMENTS AND RESULTS
 
We now present the results obtained by our peptide- and protein-based approaches, both using the same distance learning framework, on classification of HLA class I alleles to supertypes. We begin with a short description of our experimental setup including the datasets we used, data representation and algorithmic details.

4.1 Experimental setup
Datasets. Sequences of nine amino acid long peptides (9mers) that are known to bind to the HLA class I proteins listed in Lund et al. (2004) were collected from the MHCPEP (Brusic et al., 1998) and SYFPEITHI (Rammensee et al., 1999) datasets. Peptides that contain undetermined residues (denoted by the letter code X) were excluded. We then grouped all 9mers that bind to HLA class I molecules (both HLA-A and HLA-B) into a single dataset, called HLA1peptides. The HLA1peptides dataset includes 4273 peptides that bind to 112 HLA class I alleles (42 HLA-A alleles and 70 HLA-B alleles).

Sequences of HLA alleles were acquired from the IMGT/HLA Sequence Database [Robinson et al. (2003), version 2.9]. In this version, 372 HLA-A and 661 HLA-B alleles have been named. We obtained a set of unique proteins, considering only the first allele (denoted by *xxxx01) out of each group of alleles with an identical sequence of amino acids (Doytchinova et al., 2004). We then defined a dataset of HLA-A alleles, called HLA-Aproteins, and a dataset of HLA-B alleles, called HLA-Bproteins. The former dataset consists of 301 HLA-A proteins and the latter of 573 HLA-B proteins. Following Doytchinova et al. (2004), the HLA-A binding site was represented using 35 residues and the HLA-B binding site using 37 residues. These binding site definitions are based on X-ray structures of reference proteins. HLA alleles were then aligned within each locus, using the initial X-ray structure as a template.

To allow a fair comparison between all methods, we used the list of alleles presented by Lund et al. (2004) and compared the classifications reported by all methods on this subset only. We therefore present a comparison of all methods on a subset of 92 HLA class I alleles. In all comparisons with Doytchinova et al. (2004) we used the results reported using a hierarchical clustering algorithm.

Data representation. DistBoost, the distance learning algorithm used in this paper, requires that the data be represented in some continuous vector feature space. As in our previous work (Hertz and Yanover, 2006) we used the representation suggested by Venkatarajan and Braun (2001). Using Venkatarajan and Braun's feature vectors, we represent each nine amino acid long peptide as a point in $\mathbb{R}^{45}$, by simply concatenating its amino acid feature vectors. Similarly, we encode each sequence of residues defining the HLA-A and HLA-B allele binding site as a 175 and 185 dimensional vector, respectively. The vectors representing the binding sites were further processed using PCA.
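The dimensions quoted above (45 for a 9mer, 175 for 35 residues, 185 for 37 residues) correspond to five features per residue. The sketch below shows the concatenation step only. The feature table is a deterministic placeholder invented for this example; it does not contain the Venkatarajan and Braun descriptor values, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FEATURES_PER_RESIDUE = 5  # inferred from 45 / 9 = 175 / 35 = 185 / 37 = 5

# PLACEHOLDER values only: real descriptors would be loaded from the
# published table, which is not reproduced here.
FEATURES = {
    aa: [float(i + k) for k in range(FEATURES_PER_RESIDUE)]
    for i, aa in enumerate(AMINO_ACIDS)
}

def encode(sequence):
    """Concatenate the per-residue feature vectors of a sequence."""
    vector = []
    for aa in sequence:
        vector.extend(FEATURES[aa])
    return vector

peptide_vec = encode("ACDEFGHIK")  # a made-up 9mer
hla_a_site = encode("A" * 35)      # stand-in for a 35-residue binding site
hla_b_site = encode("A" * 37)      # stand-in for a 37-residue binding site
```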

Extracting equivalence constraints. The distance learning algorithm we used is trained using unlabeled data and additional equivalence constraints. Extracting equivalence constraints for our peptide-based approach is rather straightforward: each pair of peptides that are known to bind to the same protein forms a positive equivalence constraint. It should be noted that no explicit information regarding the classification of alleles to supertypes is provided. Extracting equivalence constraints for our proposed protein-based approach was done as described in Section 3: each pair of proteins which were grouped into the same supertype by Sette and Sidney (1999) formed a positive equivalence constraint and each pair of proteins which are known to belong to different supertypes formed a negative constraint. It is important to note that a rather small portion of the binding site datasets was tagged using these constraints: 23 out of 301 in the HLA-Aproteins dataset and 55 out of 573 in the HLA-Bproteins dataset. Out of these, 22 HLA-A and 42 HLA-B alleles appear in the set of 92 alleles on which we provide a detailed comparison with previously suggested methods.

Algorithmic setup. The DistBoost algorithm was run for 100 iterations on the peptide dataset and 30 iterations on the binding site datasets. In order to cluster the alleles into supertypes we used two standard clustering methods: (1) the average-linkage algorithm and (2) the Neighbor program from the PHYLIP package (Felsenstein, 1993) [as in Lund et al. (2004) and Reche and Reinherz (2004)]. Each cluster was automatically assigned a label as follows: we began by labeling each point using the classification in Lund et al. (2004) [with which the labels of Sette and Sidney (1999) are almost in full agreement]. We then labeled each cluster using the most frequent label within the cluster. The number of clusters chosen was equal to the number of supertypes described in Lund et al. (2004). In order to verify that our labeling scheme was sensible, we also drew dendrograms of our peptide- and protein-based methods and visually inspected them after coloring each node with our predicted classifications. Dendrograms were drawn using the TreeView program (Page, 1996).
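The majority-vote labeling of clusters described above might look like the following sketch. The function name and the toy allele names are hypothetical, and the handling of ties and of clusters without any reference label is an assumption, since the text does not specify it.

```python
from collections import Counter

def label_clusters(clusters, reference):
    """Assign each cluster the most frequent reference label among its members.

    clusters: list of lists of allele names.
    reference: dict allele name -> reference supertype (may omit alleles).
    Ties go to the label seen first in the cluster; clusters with no
    reference-labeled member get None (both are assumptions of this sketch).
    """
    labels = {}
    for idx, members in enumerate(clusters):
        counts = Counter(reference[m] for m in members if m in reference)
        labels[idx] = counts.most_common(1)[0][0] if counts else None
    return labels

# Placeholder allele names and labels, for illustration only.
clusters = [["a", "b", "c"], ["d", "e"], ["f"]]
reference = {"a": "A2", "b": "A2", "c": "A3", "d": "B7", "e": "B7"}
cluster_labels = label_clusters(clusters, reference)
```

Under this scheme, every allele in a cluster inherits the cluster label, so the allele "c" above would be reported as A2 even though its reference label is A3.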

4.2 Supertype classification results
Table 1 presents a comparison of the classification of alleles to supertypes obtained by our two proposed methods and the classifications reported by Sette and Sidney (1999), Lund et al. (2004), Doytchinova et al. (2004). Each supertype was colored using a different color.



 
Table 1 Comparative supertype classification results. In general, the results obtained by our proposed peptide-based (Pept. Dist.) and protein-based (BS Dist.) approaches are in agreement with the classifications suggested by Sette and Sidney (1999), Lund et al. (2004), Doytchinova et al. (2004)

 
The percentages of agreement between the various methods are summarized in Table 2. In these calculations we ignore entries in which one of the two methods did not provide a classification (entries marked with '?' or with a dash). As can be seen, our methods are in high agreement with all other compared methods. The agreement between our two methods is 75%. We can also see that our protein-based prediction is in 94% agreement with the classifications of Sette and Sidney (1999). This is not very surprising, since when training our algorithm on the binding site datasets we used the classifications provided by Sette and Sidney as constraints for our distance learning algorithm. It is interesting to note that our peptide-based method, which was not provided with any constraints regarding the proteins, still obtains an 82% agreement with the results of Sette and Sidney.
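A minimal sketch of such a pairwise agreement calculation is given below, assuming each method's output is a mapping from allele to supertype. The function name, the set of missing-value markers and the toy calls are assumptions for illustration and do not reproduce any entry of Tables 1 or 2.

```python
# Markers treated as "no classification provided" (an assumption).
MISSING = {"?", "-", "\u2014", None}

def percent_agreement(calls_a, calls_b):
    """Percentage of alleles on which two methods agree.

    Alleles for which either method gives no classification are ignored.
    Returns None if no allele is classified by both methods.
    """
    shared = [
        allele for allele in calls_a
        if allele in calls_b
        and calls_a[allele] not in MISSING
        and calls_b[allele] not in MISSING
    ]
    if not shared:
        return None
    agree = sum(calls_a[allele] == calls_b[allele] for allele in shared)
    return 100.0 * agree / len(shared)

# Toy calls with placeholder allele names.
method_a = {"x": "A2", "y": "A3", "z": "?", "w": "B7"}
method_b = {"x": "A2", "y": "A2", "z": "B7", "w": "B7"}
agreement = percent_agreement(method_a, method_b)  # 2 of 3 comparable alleles
```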



 
Table 2 Comparative supertype classification results

 
In order to further visualize the results obtained by our peptide-based approach, Figure 4 presents a colored dendrogram of the classifications obtained by our method. Each cluster is associated with a specific supertype. Interestingly, we can see that most of the alleles on which our method disagrees with the classifications of other methods are near cluster boundaries (see e.g. B4001, which our peptide-based approach classified as belonging to the B7 supertype, and which is classified by most other methods to the adjacent B44 supertype). Additionally, many of these involve serologically defined specificities (e.g. A2, B57 etc.), as opposed to genetically defined specificities. Serologically defined specificities are usually based on earlier experiments, and their motifs are not always definitive. For example, the A28 allele consists of two different specificities as a result of a crossover event, and was therefore later divided into A68 and A69 (J. Sidney, personal communication). Since in our current paper we wanted to provide a clear comparison with previously suggested methods, we evaluated our method on the exact same 92 alleles that were used in Lund et al. (2004). In future work, it may be beneficial to exclude the serological alleles from supertype classification studies.


 
Fig. 4 A dendrogram plot showing the classification obtained using our peptide-based approach. Each supertype is marked with a different color and therefore results are best seen in color.

 
A partial visualization of the results obtained by our protein-based method is shown in Figure 5. Since the datasets used to train the binding site distance functions contained a large number of alleles, it is very hard to visualize the clustering of the entire dataset. Two distinct supertypes are shown: A2 and A24. To the best of our knowledge, many of the alleles in these clusters have not been previously classified. Examples are the A0261, A0262, A0263, A6827 and A6828 alleles which our method classifies to the A2 supertype and the A2444 and A2312 alleles which are predicted to belong to the A24 supertype.


Fig. 5 (b) Part of the dendrogram plot showing the classification of 301 HLA-A alleles obtained by our protein-based approach. (a) A blowup of the A2 cluster in (b). (c) A blowup of the A24 cluster in (b).

 

    5 DISCUSSION AND CONCLUDING REMARKS
 
In this paper, we presented a novel framework, based on learning distance functions, for classifying HLA alleles into supertypes. We showed how the same algorithmic framework can be used to learn both peptide and protein distance functions, and hence to provide both peptide- and protein-based approaches to supertype classification. Our method can make use of the two most important sources of immunological information that are continuously collected and published: the HLA alleles themselves and the peptides that are known to bind (or not to bind) to these alleles.

Since most known alleles are not clearly classified to supertypes, it is quite difficult to provide a quantitative measure of performance. A comparison of our results with previously suggested methods showed that our binding site prediction method was in excellent agreement with most other methods, and that our peptide-based approach was in good agreement with most other methods (including the binding-site method). Our protein-based approach also provided supertype classification predictions for new, previously unclassified alleles. Although our peptide-based approach reached only 82% agreement with the results of Sette and Sidney, it is important to note that it did not make use of any information regarding the classification of alleles to supertypes. This result is consistent with the original observations that led to the definition of supertypes as sets of alleles that bind overlapping sets of peptides. Our binding site results show that it is also feasible to classify alleles to supertypes by directly measuring the similarity between their binding sites.
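
As a rough illustration of how an agreement figure of this kind could be computed, the sketch below compares two allele-to-supertype labelings and applies the convention of footnote 2, under which A26 and B39 labels are not counted as disagreeing with A1 and B27. The function names, the dictionary layout and the toy labels are assumptions introduced for illustration only; they are not the implementation or the data behind the reported numbers.

```python
# Supertype pairs treated as equivalent when scoring agreement
# (convention of footnote 2: A26 vs. A1 and B39 vs. B27).
EQUIVALENT = {"A26": "A1", "B39": "B27"}


def canonical(supertype):
    """Map a supertype label to the label it is scored as."""
    return EQUIVALENT.get(supertype, supertype)


def agreement(predicted, reference):
    """Fraction of alleles labeled in both mappings whose supertypes match."""
    shared = sorted(set(predicted) & set(reference))
    if not shared:
        raise ValueError("no alleles are labeled in both classifications")
    matches = sum(
        canonical(predicted[a]) == canonical(reference[a]) for a in shared
    )
    return matches / len(shared)


# Toy labels, invented for illustration (hypothetical allele names).
predicted = {"allele_1": "A2", "allele_2": "A26", "allele_3": "B7", "allele_4": "B39"}
reference = {"allele_1": "A2", "allele_2": "A1", "allele_3": "B44", "allele_4": "B27"}
score = agreement(predicted, reference)
```

Only alleles labeled by both methods enter the score, so alleles that one method leaves unclassified neither raise nor lower the agreement.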

The performance of our binding site approach suggests that it can be used in practice as a tool for guiding experimental binding assays for newly discovered alleles, or for alleles for which no binding peptides are currently known. Furthermore, having both protein- and peptide-based classification methods suggests a combined supertype classification and binding prediction scheme, described in Algorithm 1.


Algorithm 1 Characterizing a newly discovered allele

    Input: A newly discovered allele
        1. Use the protein-based approach to classify the allele to one of the currently identified supertypes.
        2. Use this classification to guide experiments which seek to identify peptides that bind to this novel protein.
        3. These binding peptides can in turn be used to retrain a peptide-based classifier and to obtain an additional classification of the new allele to a supertype.

Since the protein-based and peptide-based approaches rely on different sources of information, we should expect more confident classifications when their predictions agree with one another.
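
The control flow of Algorithm 1, together with the agreement check just described, can be sketched as follows. This is a minimal outline under stated assumptions: the three callables (`classify_by_protein`, `find_binders`, `classify_by_peptides`) are hypothetical stand-ins for the protein-based classifier, the experimental binding assays and the retrained peptide-based classifier, and none of them corresponds to a published interface.

```python
def characterize_allele(allele, classify_by_protein, find_binders, classify_by_peptides):
    """Sketch of Algorithm 1; the three callables are hypothetical stand-ins."""
    # Step 1: protein-based supertype prediction from the allele itself.
    protein_label = classify_by_protein(allele)
    # Step 2: use that prediction to guide the search for binding peptides.
    binders = find_binders(allele, protein_label)
    # Step 3: peptide-based prediction from the newly identified binders
    # (stands in for retraining the peptide-based classifier).
    peptide_label = classify_by_peptides(binders) if binders else None
    return {
        "protein_based": protein_label,
        "peptide_based": peptide_label,
        "binders": binders,
        # The two predictions use different information sources, so their
        # agreement is taken as a sign of a more confident classification.
        "consistent": peptide_label is not None and peptide_label == protein_label,
    }


# Toy usage with placeholder callables and invented values.
result = characterize_allele(
    "new_allele",
    classify_by_protein=lambda allele: "A2",
    find_binders=lambda allele, supertype: ["PEPTIDE_1", "PEPTIDE_2"],
    classify_by_peptides=lambda binders: "A2",
)
```

When no binders are found, the sketch leaves the peptide-based label empty and reports the classification as not yet confirmed, rather than defaulting to the protein-based prediction.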

In future work, we hope to incorporate additional data for training our peptide-based method. Of special interest is the incorporation of experimentally determined non-binders, which are not used at present. In our previous work (Hertz and Yanover, 2006), we have shown that learning a peptide distance function using additional information about non-binders provides better prediction results. We therefore hope that using this additional source of information will further improve our classification results.


    Acknowledgments
 
The authors thank John Sidney for many useful comments and suggestions. C.Y. is supported by Yeshaya Horowitz Association through the Center for Complexity Science.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger} The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

1 A distance function is a function that assigns a non-negative value to each pair of points (or peptides, in our case).

2 When comparing our results to those suggested by Sette and Sidney, we do not consider alleles labeled as A26 and B39 as disagreeing with labels A1 and B27, respectively. The A26 and B39 supertypes are two novel supertypes that were recently suggested by Lund et al. (2004).


    REFERENCES
 

    Bar-Hillel, A., et al. (2003) Learning distance functions using equivalence relations. International Conference on Machine Learning.

    Bilenko, M., et al. (2004) Integrating constraints and metric learning in semi-supervised clustering. International Conference on Machine Learning, Banff, Canada.

    Brusic, V., et al. (1998) MHCPEP, a database of MHC-binding peptides: update 1997. Nucleic Acids Res., 26, 368–371.

    Buus, S., et al. (2003) Sensitive quantitative predictions of peptide-MHC binding by a ‘query by committee’ artificial neural network approach. Tissue Antigens, 62, 378–384.

    del Guercio, M., et al. (1995) Binding of a peptide antigen to multiple HLA alleles allows definition of an A2-like supertype. J. Immunol., 154, 685–693.

    Donnes, P. and Elofsson, A. (2002) Prediction of MHC class I binding. BMC Bioinformatics, 3.

    Doytchinova, I.A., et al. (2004) Identifying human MHC supertypes using bioinformatic methods. J. Immunol., 172, 4314–4323.

    Felsenstein, J. (1993) PHYLIP (phylogeny inference package) version 3.5c. Distributed by the author.

    Flower, D.R. (2003) Towards in silico prediction of immunogenic epitopes. Trends Immunol., 24.

    Heckerman, D., et al. (2006) Leveraging information across HLA alleles/supertypes improves epitope prediction. RECOMB, 296–308.

    Hertz, T. and Yanover, C. (2006) Pepdist: a new framework for protein-peptide binding prediction based on learning peptide distance functions. BMC Bioinformatics, 7.

    Hertz, T., et al. (2004a) Boosting margin based distance functions for clustering. International Conference on Machine Learning.

    Hertz, T., et al. (2004b) Learning distance functions for image retrieval. Proceedings of Computer Vision and Pattern Recognition, Washington, DC.

    Janeway, C., et al. (2001) Immunobiology, 5th edn. New York and London: Garland Publishing.

    Lund, O., et al. (2004) Definition of supertypes for HLA molecules using clustering of specificity matrices. Immunogenetics, 55, 797–810.

    Page, R.D.M. (1996) Treeview: an application to display phylogenetic trees on personal computers. Comp. Appl. Biosci., 12, 357–358.

    Rammensee, H.G., et al. (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50, 213–219.

    Reche, P.A. and Reinherz, E.L. (2004) Definition of MHC supertypes through clustering of MHC peptide binding repertoires. International Conference on Artificial Immune Systems 2004, Vol. 3239, pp. 189–196.

    Reche, P.A., et al. (2004) Enhancement to the RANKPEP resource for the prediction of peptide binding to MHC molecules using profiles. Immunogenetics, 26, 405–419.

    Robinson, J., et al. (2003) IMGT/HLA and IMGT/MHC: sequence databases for the study of the major histocompatibility complex. Nucleic Acids Res., 31, 311–314.

    Schapire, R.E. and Singer, Y. (1999) Improved boosting using confidence-rated predictions. Mach. Learn., 37, 297–336.

    Sette, A. and Fikes, J. (2003) Epitope-based vaccines: an update on epitope identification, vaccine design and delivery. Curr. Opin. Immunol., 15, 461–470.

    Sette, A. and Sidney, J. (1999) Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics, 50, 201–212.

    Sette, A., et al. (2002) Optimizing vaccine design for cellular processing, MHC binding and TCR recognition. Tissue Antigens, 59, 443–443.

    Sidney, J., et al. (1996a) Definition of an HLA-A3-like supermotif demonstrates the overlapping peptide-binding repertoires of common HLA molecules. Hum. Immunol., 45, 79–93.

    Sidney, J., et al. (1996b) Specificity and degeneracy in peptide binding to HLA-B7-like class I molecules. J. Immunol., 157, 3480–3490.

    Venkatarajan, M.S. and Braun, W. (2001) New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J. Mol. Model., 7, 445–453.

    Vert, J.P. and Yamanishi, Y. (2005) Supervised graph inference. NIPS 17, Cambridge, MA: MIT Press, pp. 1433–1440.

    Xing, E.P., et al. (2002) Distance metric learning with application to clustering with side-information. Proceedings of Neural Information Processing Systems, Vol. 15. The MIT Press.

    Yanover, C. and Hertz, T. (2005) Predicting protein-peptide binding affinity by learning peptide-peptide distance functions. Proceedings of Research in Computational Molecular Biology 2005.

    Yewdell, J.W. and Bennink, J.R. (1999) Immunodominance in major histocompatibility complex class I-restricted T-lymphocyte responses. Ann. Rev. Immunol., 17, 51–88.

