

Bioinformatics Advance Access originally published online on December 13, 2005
Bioinformatics 2006 22(4):453-459; doi:10.1093/bioinformatics/bti826

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A simple method to predict protein-binding from aligned sequences: application to MHC superfamily and β2-microglobulin

Elodie Duprat 1, Marie-Paule Lefranc 1,2 and Olivier Gascuel 3,*

1Laboratoire d'ImmunoGénétique Moléculaire IGH (UPR CNRS 1142), 141 rue de la Cardonille, 34396 Montpellier Cedex 5, France
2Institut Universitaire de France 103 Boulevard Saint-Michel, 75005 Paris, France
3Projet Méthodes et Algorithmes pour la Bioinformatique LIRMM (UMR CNRS-UM2 5506), 161 rue Ada, 34392 Montpellier Cedex 5, France

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DATA
 3 SIMPLE-BAYES CLASSIFIER
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 

Motivation: The MHC superfamily (MhcSF) consists of immune system MHC class I (MHC-I) proteins, along with proteins with an MHC-I-like structure that are involved in a large variety of biological processes. Non-covalent binding of β2-microglobulin (B2M) to MHC-I proteins is required for their surface expression and function, whereas MHC-I-like proteins may or may not interact with B2M. This study was designed to predict B2M binding (or non-binding) of newly identified MhcSF proteins, in order to decipher their function, understand the molecular recognition mechanisms and identify deleterious mutations. IMGT standardization of MhcSF protein domains provides a unique numbering of the multiple alignment positions, and thus the conditions needed to develop such a predictive tool.

Method: We combine a simple-Bayes classifier with IMGT unique numbering. Our method involves two steps: (1) selection of discriminant binary features, which associate an alignment position with an amino acid group; and (2) learning of the classifier by estimating the frequencies of selected features, conditionally to the B2M binding property.

Results: Our dataset contains aligned sequences of 806 allelic forms of 47 MhcSF proteins, corresponding to 9 receptor types and 4 mammalian species. Eighteen discriminant features are selected, belonging to B2M contact sites, or stabilizing the molecular structure required for this contact. Three leave-one-out procedures are used to assess classifier performance, which corresponds to B2M binding prediction for: (1) new proteins, (2) species not represented in the dataset and (3) new receptor types. The prediction accuracy is high, i.e. 98, 94 and 70%, respectively. Application of our classifier to lower vertebrate MHC-I proteins indicates that these proteins bind to B2M and should then be expressed on the cellular surface by a process similar to that of mammalian MHC-I proteins. These results demonstrate the usefulness and accuracy of our (simple) approach, which should apply to other function or interaction prediction problems.

Availability: Data and MhcSF multiple alignments are available on the IMGT website (http://imgt.cines.fr).

Contact: gascuel{at}lirmm.fr, duprat{at}ligm.igh.cnrs.fr, lefranc{at}ligm.igh.cnrs.fr

Supplementary information: Supplementary material is downloadable at http://imgt.igh.cnrs.fr/MhcSF-B2M.html.


    1 INTRODUCTION
 
Major histocompatibility complex (MHC) proteins play a key role in the immune system, by displaying self and non-self peptides for recognition by T cell receptors. MHC class I (MHC-I) proteins have a transmembrane heavy chain (I-ALPHA) non-covalently linked to β2-microglobulin (B2M). The interaction between I-ALPHA and B2M is required for peptide display, stabilization of the molecular structure and cell surface expression of the complex (D'Urso et al., 1991; Hill et al., 2003).

The MHC superfamily (MhcSF) (Lefranc et al., 2005a) includes MHC proteins, as well as proteins with a MHC-I-like structure which are involved in a large variety of biological processes. Thirty-four mammalian MHC-I-like proteins have currently been identified, and the 3D structure is available for 12 of them (Kaas and Lefranc, 2004). Among these 34 proteins, only 17 are constitutively bound to B2M, according to the experimental data. This study is designed to predict B2M binding (or non-binding) of newly identified MHC-I or MHC-I-like protein sequences. Such prediction should be useful for deciphering the function of these new sequences, determining their mechanism of molecular recognition, detecting mutations leading to defects in their cell surface expression, or clarifying a number of biological questions, as illustrated below with lower vertebrate MHC.

Description rules for MhcSF protein domains are defined in IMGT, the international ImMunoGeneTics information system® (Lefranc et al., 2005b), and are based on the IMGT-ONTOLOGY concepts (Giudicelli and Lefranc, 1999). The two N-terminal extracellular domains of the heavy chain of MHC-I (Fig. 1) and MHC-I-like proteins are G-DOMAINs and G-LIKE-DOMAINs, respectively. These domains have strikingly similar 3D structures, each being composed of one sheet of four antiparallel beta strands and one long helical region (Kaas and Lefranc, 2005). This high structural similarity is notable because G-DOMAIN and G-LIKE-DOMAIN sequences have low homology (~30% identity). The third (and last) extracellular heavy chain domain is a C-LIKE-DOMAIN (Duprat et al., 2004; Lefranc et al., 2005c). This C-LIKE-DOMAIN is always present in MHC-I, but absent in some MHC-I-like proteins. When the C-LIKE-DOMAIN of an MHC-I heavy chain was experimentally deleted, the protein remained structurally unchanged and B2M and the peptide were still bound conventionally (Collins et al., 1995). The presence of the C-LIKE-DOMAIN thus does not seem to be a useful criterion for discriminating between B2M-bound and unbound MhcSF proteins, and our results are based solely on an analysis of the two G-DOMAINs or G-LIKE-DOMAINs.


 
Fig. 1 MHC-I protein representation on the target cell surface (A) and 3D structure (B). The heavy chain consists of, from the N-terminal to the C-terminal end, the G-ALPHA1 [D1], G-ALPHA2 [D2] and C-LIKE extracellular domains, and (absent in the 3D structure) the connecting (CO), transmembrane (TM) and intracytoplasmic (CY) regions. B2M has a unique extracellular domain, non-covalently bound to the heavy chain. (modified from Lefranc et al., 2005a).

 
Our prediction method combines the IMGT multiple alignment of G-DOMAINs and G-LIKE-DOMAINs (Lefranc et al., 2005a) with experimental knowledge of the B2M bound/unbound properties of these proteins. We use a supervised classification approach (Duda et al., 2001), where classes are known a priori (here bound/unbound) in the learning set, and the goal is to predict the class of new unknown instances. In this context, the simple-Bayes classifier (Good, 1965) is easy to implement, accurate for small datasets (as is the case here), and its results are easily interpretable. Moreover, it has been successfully applied to the prediction of class-specific ligands using functional features (Bandyopadhyay et al., 2002; Cao et al., 2003). Our classifier is based on binary features, each consisting of a multiple alignment position and an amino acid group, which are selected from the dataset for their ability to discriminate between the two sequence classes. Three leave-one-out experiments are used to assess classifier performance; these experiments consider B2M binding prediction for new proteins, species not represented in the dataset or new receptor types.

We next give further details on MhcSF protein sequences, their alignment and the main aspects of our supervised classification problem. We then describe the method for selecting discriminant features, the classifier learning procedure, and experiments to assess its accuracy. The results are analysed in the light of structural interpretation, site-directed mutagenesis literature and B2M binding prediction for lower vertebrate MHC-I proteins.


    2 DATA
 
2.1 MhcSF proteins
In this study, MhcSF consists of 806 protein sequences corresponding to allelic and homologous forms of 47 MHC-I and MHC-I-like proteins from four mammalian species: Homo sapiens, Mus musculus, Rattus norvegicus and Bos taurus (see Supplementary Data 1 for details). Sequences described as ‘allelic’ in this study are sequences which, for a given protein from a given species, differ by at least one amino acid (see Supplementary Data 1 for allele homogeneity). The high number of alleles partly compensates for the small number of proteins. Each MhcSF protein includes two G-DOMAINs or G-LIKE-DOMAINs, and the proteins are grouped here into nine functional receptor types:

  • MHC-I proteins (13 items, 767 alleles, B2M bound, C-LIKE) are highly polymorphic and display a huge diversity of self and non-self peptides to T cell receptors.
  • AZGP1 proteins (3 items, B2M unbound, C-LIKE) regulate fat degradation in adipocytes (Sanchez et al., 1999).
  • CD1 proteins (7 items, B2M bound, C-LIKE) display phospholipid antigens to T cells and participate in immune defence against microbial pathogens (Zeng et al., 1997); these proteins differ from other MhcSF proteins by the high hydrophobicity of their antigen binding sites.
  • EPCR proteins (3 items, B2M unbound, no C-LIKE) interact with activated protein C and are involved in the blood coagulation pathway (Simmonds and Lane, 1999).
  • FCGRT proteins (4 items, B2M bound, C-LIKE) transport maternal immunoglobulins through the placenta and govern neonatal immunity (West and Bjorkman, 2000).
  • HFE proteins (3 items, B2M bound, C-LIKE) interact with the transferrin receptor and consequently take part in iron homeostasis by regulating iron transport through cellular membranes (Feder et al., 1998).
  • MIC proteins (2 items, B2M unbound, C-LIKE) are induced by stress and involved in tumor cell detection (Holmes et al., 2002).
  • MR1 proteins (3 items, B2M bound, C-LIKE) have a currently unknown function (Miley et al., 2003).
  • RAE proteins (9 items, 14 alleles, B2M unbound, no C-LIKE) are inducible by retinoic acid and stimulate cytokine/chemokine production and cytotoxic activity of NK cells (Li et al., 2002).

2.2 IMGT multiple alignment
The IMGT unique numbering for G-DOMAIN and G-LIKE-DOMAIN (Lefranc et al., 2005a) is built by successive alignments of sequences and 3D structures of MHC-I, MHC-II and MHC-I-like proteins. MhcSF sequences that belong to the same receptor type are close (60-90% identity), whereas MhcSF sequences from different receptor types are quite different (15-40% identity). All G-DOMAINs and G-LIKE-DOMAINs are then aligned together using the following strategy: (1) we perform a structural alignment of nine 3D structures, one per receptor type; (2) the remaining sequences are aligned within each receptor type against the previously structurally aligned protein. Finally, IMGT numbering is obtained by attributing a number to each position of the resulting multiple alignment.

Newly identified MhcSF proteins (e.g. amphibian and teleost MHC-I proteins, whose prediction results are presented hereafter) are first described in terms of domains. IMGT numbering of their G-DOMAINs or G-LIKE-DOMAINs is then obtained by sequence alignment with the numbered sequence with highest similarity in the learning set, or by structural alignment when sequence similarity is insufficient.

MUSCLE (Edgar, 2004) and COMPARER (Sali and Blundell, 1990) are used for sequence and 3D structure multiple alignments; Fasta2 (Pearson and Lipman, 1988) is used for pairwise sequence alignments. Sequence and structure consistency of the resulting alignment are validated with NorMD (Thompson et al., 2001) and ProFit (http://bioinf.org.uk/software/profit), respectively.

2.3 Phylogeny
Evolutionary relationships within the MhcSF are established by phylogenetic analysis (Guindon and Gascuel, 2003) of 47 MHC-I and MHC-I-like protein sequences from the IMGT multiple alignment (see Supplementary Data 2 for details). The resulting phylogeny indicates that specialization occurred before speciation. Indeed, each receptor type corresponds to a clade containing all available sequences for that receptor from the species at hand. This seems to indicate that the various functions of MhcSF proteins appeared before the common ancestor of the studied mammalian species. This is in line with the small sequence similarity of G-DOMAINs and G-LIKE-DOMAINs (see above), and is further supported by the high-bootstrap value obtained for every receptor clade.

The second insight gained through this phylogeny is directly related to our prediction problem. Indeed, the two sequence classes (bound versus unbound to B2M) constitute several clades unrelated to the phylogeny, instead of two monophyletic clades. For example, the nearest neighbour of the EPCR clade is CD1, while CD1 binds B2M and EPCR does not. This indicates that nearest neighbour analysis would be inaccurate for predicting B2M binding of any MHC-I-like protein belonging to a new receptor type. However, classification seems to be easier when sequences of the same receptor type as the sequence to be predicted are already known, as all sequences from the same receptor type have the same behaviour regarding B2M [unless they correspond to a pathogenic mutant, as described in Barbosa et al. (1987) and Santos-Aguado et al. (1987), and explained in the Results section].


    3 SIMPLE-BAYES CLASSIFIER
 
The simple-Bayes classifier estimates the probability of classes (B2M bound/unbound) for a new sequence s of MhcSF, given its description with a feature set. Two steps are required to infer this classifier from the learning set: (1) selection of discriminant binary features; and (2) learning of the classifier by estimating the frequencies of selected features, conditionally to the classes. Within this process, protein polymorphism is taken into account by weighting alleles: for a protein with e alleles, the weight of each of them is set equally at 1/e. In the following, each step is first detailed for the standard case and then for polymorphic proteins.

3.1 Feature selection
This step aims at selecting features according to their ability to discriminate between the Cβ and C¬β classes (i.e. bound and unbound to B2M, respectively). Each (binary) feature consists of an alignment position i and an amino acid group g, and denotes the presence/absence of an amino acid from g at i in the studied sequence. Amino acids are grouped based on statistical analysis of immunoglobulin sequences (Pommié et al., 2004) and using standard physicochemical criteria (Wu and Brutlag, 1995): {IVLFCMAW} {DNEQKR} {GTSYPH} {AGILPV} {CDPNT} {MILKR} {EVQH} {AILV} {GAS} {FWY} {ILV} {RHK} {DE} {NQ} {ST} {CM} {AG} (see also Supplementary Data 3). The 20 amino acids are also considered as ‘groups’, leading to a total of 37 possible groups. For a given group g, ¬g represents the amino acids excluded from g plus the gap; e.g. ¬g = {DNEQKRGTSYPH-} when g = {IVLFCMAW}, where ‘-’ denotes the gap. The g and ¬g groups are dealt with in a symmetrical way and tested simultaneously for each position.
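The feature definition above can be sketched in a few lines of Python. This is only an illustration: the group list is copied from the text, but the function names, the 0-based column index (rather than IMGT position labels such as [D1] 27) and the toy aligned fragment are assumptions made for the example.

```python
# Amino acid groups quoted from Section 3.1; the 20 single residues are
# added as singleton groups, giving 37 groups in total.
GROUPS = [
    "IVLFCMAW", "DNEQKR", "GTSYPH", "AGILPV", "CDPNT", "MILKR", "EVQH",
    "AILV", "GAS", "FWY", "ILV", "RHK", "DE", "NQ", "ST", "CM", "AG",
]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALL_GROUPS = [frozenset(g) for g in GROUPS] + [frozenset(a) for a in AMINO_ACIDS]


def complement(group: frozenset) -> frozenset:
    """Return the complement of a group: all other amino acids plus the gap '-'."""
    return (frozenset(AMINO_ACIDS) - group) | {"-"}


def feature_value(sequence: str, position: int, group: frozenset) -> bool:
    """True if the residue at a 0-based alignment column belongs to `group`.

    A gap or any residue outside the group falls in the complement (False).
    """
    return sequence[position] in group


aligned = "GSHW-RN"  # hypothetical aligned fragment, not taken from the dataset
print(len(ALL_GROUPS))                               # 37
print(feature_value(aligned, 3, frozenset("FWY")))   # True  (W is aromatic)
print(feature_value(aligned, 4, frozenset("FWY")))   # False (gap)
```

Representing each group as a `frozenset` keeps the membership test and the complement explicit, which mirrors the symmetrical treatment of g and ¬g described in the text.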

The discrimination capacity of each group is evaluated at every position of the alignment. Occurrences of the amino acids from g and ¬g at position i of the sequences from classes Cß and C¬ß are counted in the contingency table:

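\[
CT = \begin{array}{c|cc}
 & C_{\beta} & C_{\neg\beta} \\ \hline
g & a & b \\
\neg g & c & d
\end{array}
\tag{1}
\]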

In the case of polymorphic proteins, this contingency table is computed according to the allele weights. For example, the contributions for a and c in (1) are set at 2/10 and 8/10, respectively, for a protein from Cβ represented by two alleles having an amino acid from g at i and eight alleles where site i belongs to ¬g. The discrimination capacity of any (i,g) pair is estimated using the χ² measure that is applied to contingency table CT (1):

\[
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}, \qquad n = a + b + c + d
\tag{2}
\]

The higher the χ² value, the larger the difference between the two contingency table diagonals, which are represented by the ad and bc terms. For a given position i, the amino acid group g with the highest discrimination capacity is selected; if several groups have the same χ² value, the one with the smallest size is chosen. The resulting (i,g) pairs are ordered by decreasing χ² value. The first f pairs define the selected features, where f is a parameter that is tuned with data (see Section 3.3). The feature set D = (d1, d2, ..., dk, ..., df) consists of the f most discriminant features dk, each combining a multiple alignment position ik and an amino acid group gk.
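As an illustration of this selection step, the following Python sketch builds the allele-weighted contingency table, computes the standard 2x2 χ² statistic and ranks one (position, group) pair per column. The record layout, function names and toy sequences are assumptions made for the example; they are not the authors' implementation or data.

```python
from collections import defaultdict


def contingency(records, position, group):
    """Weighted 2x2 table (a, b, c, d) for one (position, group) pair.

    `records` is a list of (protein_id, is_bound, aligned_sequence) tuples.
    Each allele of a protein with e alleles contributes a weight of 1/e.
    a: bound, in g; b: unbound, in g; c: bound, not in g; d: unbound, not in g.
    """
    n_alleles = defaultdict(int)
    for protein, _, _ in records:
        n_alleles[protein] += 1
    a = b = c = d = 0.0
    for protein, is_bound, seq in records:
        w = 1.0 / n_alleles[protein]
        in_group = seq[position] in group
        if is_bound and in_group:
            a += w
        elif not is_bound and in_group:
            b += w
        elif is_bound:
            c += w
        else:
            d += w
    return a, b, c, d


def chi2(a, b, c, d):
    """Standard chi-square statistic of a 2x2 table (0 if a margin is empty)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(records, groups, n_positions, f):
    """Keep the best group per position (ties: smallest group), then the top f."""
    ranked = []
    for i in range(n_positions):
        best = max(groups, key=lambda g: (chi2(*contingency(records, i, g)), -len(g)))
        ranked.append((chi2(*contingency(records, i, best)), i, best))
    ranked.sort(key=lambda t: -t[0])
    return [(i, g) for _, i, g in ranked[:f]]


# Toy data: protein P1 has two alleles, so each counts for 1/2.
records = [
    ("P1", True, "WN"), ("P1", True, "WD"), ("P2", True, "FN"),
    ("P3", False, "AN"), ("P4", False, "GD"),
]
groups = [frozenset("FWY"), frozenset("NQ"), frozenset("AG")]
selected = select_features(records, groups, n_positions=2, f=2)
print([(i, "".join(sorted(g))) for i, g in selected])  # [(0, 'AG'), (1, 'NQ')]
```

In the toy data, {FWY} and {AG} separate the classes equally well at column 0 (χ² = 4 for both), so the smaller group {AG} is retained, as the tie-breaking rule prescribes.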

3.2 Simple-Bayes classifier and learning procedure
The probability that a new MhcSF sequence s belongs to class CX given its description with feature set D, is provided by the Bayes formula:

\[
P(C_X \mid D(s)) = \frac{P(C_X)\,P(D(s) \mid C_X)}{P(D(s))}
\tag{3}
\]
where X ∈ {β, ¬β} and D(s) = (d1(s), d2(s), ..., dk(s), ..., df(s)).

The class with the highest probability is then predicted. Note that the two P(CX) P(D(s) | CX) terms can simply be compared to perform this prediction, so computing P(D(s)) is unnecessary. Moreover, the simple-Bayes classifier is based on the assumption that features are independent conditionally on the classes. This is a simplifying assumption which has nevertheless proved reliable for many real datasets, even with strongly correlated features; this property is explained with theoretical arguments by Domingos and Pazzani (1996). The probability that a given sequence s belongs to class CX is then obtained using:

\[
P(C_X \mid D(s)) = \frac{P(C_X)\,\prod_{k=1}^{f} P(d_k(s) \mid C_X)}{P(D(s))}
\tag{4}
\]
The probabilities P(Cβ) and P(C¬β) are estimated a priori by the proportions of proteins in the dataset which do or do not bind B2M, respectively. The probabilities P(dk(s)|CX) are estimated during classifier learning by the frequencies of features dk (presence or absence of gk at position ik) within class CX. These frequencies are corrected by Lidstone's (1920) factor, in order to overcome the problem arising from null frequencies. Indeed, for a feature dk such that dk(s) != dk(s') for all the sequences s' of class CX, the probability (4) of CX given D(s) is null, irrespective of the contribution of the other features. The use of non-corrected frequencies is thus likely to lead to the predominance of a single feature in the classification of a new sequence s. Corrected frequencies are defined by:

\[
P(d_k(s) \mid C_X) = \frac{N(d_k(s) \mid C_X) + \lambda}{|C_X| + 2\lambda}
\tag{5}
\]
where N(dk(s)|CX) is the number of sequences s' of CX with dk(s) = dk(s'). We chose λ = 1/|CX| according to preliminary analyses conducted to adjust λ, and according to Kohavi et al. (1997). Since our features are binary, the factor of λ in the denominator is equal to 2, so that the probabilities of a feature being true and being false sum to 1. Estimation and correction of the frequencies for polymorphic proteins are treated in a way similar to the contingency table calculation, by taking into account the sum of the weights of the sequences s' of CX such that dk(s) = dk(s').
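A minimal Python sketch of this learning and prediction procedure is given below for the standard (non-polymorphic) case. It assumes the usual Lidstone form (N + λ)/(|CX| + 2λ) with λ = 1/|CX|, works in log space to avoid underflow, and ignores allele weights; the function names and toy sequences are hypothetical.

```python
import math


def lidstone(count, class_size, lam):
    """Corrected frequency of a binary feature value: (N + lam) / (|C| + 2*lam)."""
    return (count + lam) / (class_size + 2 * lam)


def train(sequences, labels, features):
    """Estimate class priors and, per feature, P(residue in group | class).

    `features` is a list of (position, group) pairs; lam = 1/|C_X| as in the text.
    """
    model = {}
    for cls in set(labels):
        members = [s for s, y in zip(sequences, labels) if y == cls]
        lam = 1.0 / len(members)
        p_true = [
            lidstone(sum(s[i] in g for s in members), len(members), lam)
            for i, g in features
        ]
        model[cls] = (len(members) / len(sequences), p_true)
    return model


def predict(model, seq, features):
    """Return the class maximizing log P(C_X) + sum_k log P(d_k(s) | C_X)."""
    scores = {}
    for cls, (prior, p_true) in model.items():
        log_score = math.log(prior)
        for (i, g), p in zip(features, p_true):
            log_score += math.log(p if seq[i] in g else 1.0 - p)
        scores[cls] = log_score
    return max(scores, key=scores.get)


# Toy learning set (not from the paper): two alignment columns, two features.
features = [(0, frozenset("FWY")), (1, frozenset("NQ"))]
sequences = ["WN", "FN", "YQ", "AD", "GD"]
labels = ["bound", "bound", "bound", "unbound", "unbound"]
model = train(sequences, labels, features)
print(predict(model, "WQ", features))  # bound
print(predict(model, "AD", features))  # unbound
```

Without the correction, the unbound class would receive a probability of exactly zero for "WQ" from the first feature alone; with it, each feature contributes a small but non-zero factor.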

3.3 Classifier performance and number of features
In order to evaluate the performance of a classifier, the dataset at hand is usually divided into a learning sample and a test sample. The feature selection and classifier learning stages (detailed in Sections 3.1 and 3.2) are carried out using the learning sample, and the classifier built in this way is then applied to sequences of the test sample to predict their membership class. Classifier performance is evaluated by the number of test sequences whose predicted class is equal to the real class. In the case of proteins expressed in various allelic forms, a simple approach involves classifying all allelic sequences of the test sample and then weighting successes and errors by the inverse of the allele number, as in the learning step. In our dataset, preliminary studies showed that the predicted class is identical for all alleles of the same protein. In order to reduce computing time, we thus consider each protein encoded by a given gene as a profile p made up of one or more allelic sequences. Amino acids of gk and ¬gk can be observed jointly at position ik of a profile p. We then estimate P(CX|D(p)) by replacing, in (4), the P(dk(s)|CX) terms by the average over all alleles of the conditional probabilities corresponding to each of the two cases (the residue at ik belongs to gk, or to ¬gk).
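The profile-level term can be illustrated as follows. This sketch reads the averaging rule literally: for one feature, each allele contributes either the probability of "in group" or its complement, and the contributions are averaged. The function name and the numerical values are assumptions chosen for the example.

```python
def profile_likelihood(alleles, position, group, p_true):
    """Average over alleles of P(d_k | C_X) for one feature.

    `p_true` is the (corrected) class-conditional probability that the
    residue at `position` belongs to `group`; alleles outside the group
    contribute 1 - p_true.
    """
    frac_in_group = sum(seq[position] in group for seq in alleles) / len(alleles)
    return frac_in_group * p_true + (1.0 - frac_in_group) * (1.0 - p_true)


# Hypothetical profile: 2 alleles out of 10 carry an aromatic residue at column 0.
alleles = ["W"] * 2 + ["A"] * 8
value = profile_likelihood(alleles, 0, frozenset("FWY"), p_true=0.9)
print(round(value, 2))  # 0.26
```

Because all alleles of a protein are summarized in a single term per feature, each protein is scored once instead of once per allele, which is the computational saving mentioned above.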

Current data on MhcSF relate to a limited number of proteins and cannot be divided into a learning sample and a test sample of sufficient size. We therefore use the leave-one-out procedure (Hand, 1986) to define these samples. When there are n observations, the guiding principle is to learn on n – 1 observations, to test the remaining observation, and to iterate this process n times. The performance is evaluated by the average of the n test results. Here we apply this procedure in three different ways to evaluate the performance of the classifier when the prediction relates to a new protein, a species not represented in the dataset, or a new receptor type. The sequences of each of the 47 proteins, 4 species or 9 receptor types in turn constitute the test sample. For example, with species leave-one-out, we predict human sequences with a classifier built using rat, mouse and cow sequences, then mouse sequences using a classifier built from human, rat and cow sequences, and so on, finally averaging the test results to obtain classifier accuracy in predicting sequences from a species not represented in the database.
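The three leave-one-out variants differ only in the key used to hold out observations, which the following Python sketch makes explicit. It is a generic illustration: the record fields, the pooled per-protein accuracy (in line with the per-protein counts reported in the Results) and, above all, the majority-class stand-in classifier are assumptions; in practice the feature selection and simple-Bayes learning steps would be passed in as `fit` and `predict`.

```python
from collections import Counter


def leave_one_group_out(records, key, fit, predict):
    """Hold out all records sharing one value of `key`, learn on the rest,
    and return the fraction of records predicted correctly overall."""
    correct = 0
    for held_out in sorted({r[key] for r in records}):
        learning = [r for r in records if r[key] != held_out]
        test = [r for r in records if r[key] == held_out]
        model = fit(learning)  # feature selection + learning would go here
        correct += sum(predict(model, r) == r["label"] for r in test)
    return correct / len(records)


# Stand-in classifier (NOT the simple-Bayes classifier): predicts the majority class.
def fit_majority(learning):
    return Counter(r["label"] for r in learning).most_common(1)[0][0]


def predict_majority(model, record):
    return model


# Hypothetical records, one per protein.
records = [
    {"protein": "A1", "species": "human", "receptor": "MHC-I", "label": "bound"},
    {"protein": "A2", "species": "mouse", "receptor": "MHC-I", "label": "bound"},
    {"protein": "A3", "species": "rat", "receptor": "MHC-I", "label": "bound"},
    {"protein": "A4", "species": "mouse", "receptor": "MHC-I", "label": "bound"},
    {"protein": "B1", "species": "human", "receptor": "CD1", "label": "bound"},
    {"protein": "C1", "species": "mouse", "receptor": "RAE", "label": "unbound"},
    {"protein": "C2", "species": "rat", "receptor": "RAE", "label": "unbound"},
]
for key in ("protein", "species", "receptor"):
    print(key, leave_one_group_out(records, key, fit_majority, predict_majority))
```

Even with this trivial stand-in, holding out a whole receptor type is the hardest setting, since the learning sample then contains no sequence of the type to be predicted.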

In order to adjust the number f of features to be used by the classifier, we iteratively build a classifier for each value of f ranging from 1 to 40. For f = 1, the single feature taken into account is thus the first of the list, i.e. the one with the best discrimination capacity according to the χ² measure (2). By increasing the number of features, an increase in performance is expected (evaluated by leave-one-out, as described above), until reaching a plateau corresponding to the optimal size f.
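The tuning loop can be sketched as below. The `evaluate` callback stands for a leave-one-out accuracy estimate; the rule used to pick f (highest accuracy, smallest f in case of ties) and the accuracy values of the toy example are assumptions, since the text only states that performance is expected to reach a plateau.

```python
def tune_feature_count(ranked_features, evaluate, max_f=40):
    """Evaluate the classifier for f = 1..max_f and return (best_f, curve).

    `ranked_features` is ordered by decreasing chi-square value and
    `evaluate(features)` returns an accuracy, e.g. from leave-one-out.
    """
    curve = []
    for f in range(1, min(max_f, len(ranked_features)) + 1):
        curve.append((f, evaluate(ranked_features[:f])))
    # Highest accuracy first; among equal accuracies, prefer the smallest f.
    best_f, _ = max(curve, key=lambda t: (t[1], -t[0]))
    return best_f, curve


# Hypothetical accuracy curve that plateaus from three features onwards.
toy_accuracy = {1: 0.60, 2: 0.70}
best_f, curve = tune_feature_count(
    ranked_features=list(range(10)),
    evaluate=lambda feats: toy_accuracy.get(len(feats), 0.80),
)
print(best_f)      # 3
print(curve[:4])   # [(1, 0.6), (2, 0.7), (3, 0.8), (4, 0.8)]
```

Note that re-running feature selection inside each leave-one-out fold, as the text describes, keeps the accuracy estimate from being biased by features chosen on the test sequences.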


    4 RESULTS
 
4.1 Classifier performance and number of features
The correct classification rates are shown in Figure 2 for all values of f = 1, 2, ..., 40 and for the three leave-one-out procedures. The lowest performance is obtained when all proteins of the same receptor type constitute the test sample, regardless of the number of features. This result was expected, as this leave-one-out procedure corresponds to the classification of test sequences having <40% identity with the learning sequences. The best performance, regardless of the leave-one-out procedure, is obtained by a classifier made up of 18 features. Note that this number is inevitably approximate, due to the small number of available proteins. Such a classifier correctly classifies 70% of the sequences (33 proteins among 47) belonging to a receptor type not represented within the learning sample. Random prediction is about 50% when predictions are well balanced, i.e. when they satisfy class priors as is the case here. Accuracy of 70% is therefore highly significant from a statistical point of view. Moreover, 7 misclassified proteins (among 14) belong to CD1, whose G-LIKE-DOMAINs bind phospholipids (instead of peptides, see Section 2.1) and are much more hydrophobic than those of other MhcSF proteins; misclassification of these proteins was thus expected. Finally, the two other leave-one-out procedures show very high accuracy of 94 and 98%, for test sequences belonging to new species and new proteins, respectively. These results compare favourably with those of the simple score-based approach, which involves outputting the bound/unbound status of the protein that is closest (using FASTA with Blosum62) to the protein to be predicted. Using our three leave-one-out procedures with this score-based approach, we found accuracies of 48, 89 and 100%, for new receptor type, species and protein, respectively.
These results confirm our phylogenetic analysis (see above) indicating that new receptor prediction can hardly be done using protein neighbourhood, while the two other prediction tasks are relatively easy.


 
Fig. 2 Classifier accuracy as a function of the feature number and the leave-one-out procedure. Best accuracy is obtained with 18 features.

 
Final (i.e. without re-sampling) learning of our Bayes classifier is thus carried out for the 18 most discriminant features according to the χ² measure (2). These features are displayed in Table 1. Note that the same feature set is obtained using mutual information, which is another standard association measure (Shannon, 1948). We also evaluated the performance of two classifiers built with 9 and 5 features located within and outside of the potential zone of B2M contact, respectively (this zone is defined hereafter). The number of features of each of these classifiers was adjusted as described above, starting from the initial multiple alignment but restricting it to the potential contact sites or excluding these sites, respectively. Each of these two classifiers proves to be as accurate as the classifier built with the 18-feature set (which includes the nine and five features of the restricted classifiers). This experiment highlights a certain statistical redundancy of our 18 features. However, we shall see in the next section that all of our 18 selected features can be biologically and/or structurally interpreted and are thus useful for understanding MhcSF/B2M interaction.


 
Table 1 The 18 selected features

 
4.2 Structural analysis of selected features
In order to identify potential sites of B2M contact on MHC-I and MHC-I-like heavy chains, we carried out an exhaustive contact analysis for the 165 known 3D structures of complexes between an MhcSF protein and B2M (see Supplementary Data 4 for details). Based on this analysis, selected features can be classified into four types, depending on whether or not they correspond to potential sites of B2M contact, and whether or not they are favourable for B2M binding. The latter distinction (favourable/unfavourable) results from an analysis of the diagonals of contingency table (1) and identifies class-conserved features. For a given feature (ik, gk), a contingency table (1) with a dominant ad diagonal (ad > bc) indicates that an amino acid of group gk at position ik of a protein tends to be favourable for its interaction with B2M. In the same way, a dominant bc diagonal indicates that an amino acid of gk at ik is unfavourable for B2M interaction.

Structural interpretation of selected features must then be carried out independently for each feature type. Nine features are favourable for the interaction with B2M, and are analysed using the 3D structure of Rattus norvegicus FCGRT. Indeed, this protein (with known structure) possesses an amino acid belonging to the conserved group of class Cβ, for each of the nine positions involved. In the same way, nine features are unfavourable for the interaction with B2M and are analysed using the 3D structure of Mus musculus RAE1B (unbound to B2M and representative of C¬β for the nine positions). Figure 3 displays the structural context of the selected features for these two 3D structures, while 3D coordinate files and PyMOL scripts for dynamic visualization are available in Supplementary Data 5. Among the nine features which seem favourable for the interaction with B2M, four correspond to a position located in the potential zone of B2M contact. The same holds for five features among the nine that are unfavourable for the interaction with B2M.


 
Fig. 3 Structural context of selected features for (A) Rattus norvegicus FCGRT and (B) Mus musculus RAE1B proteins. Each MHC-I-like heavy chain consists of the [D1] (in the back) and [D2] (in the front) extracellular domains. B2M is complexed with FCGRT, but virtually placed for RAE1B (see Supplementary Data 5). The C-LIKE extracellular domain of FCGRT heavy chain is not shown. Each feature is labelled with the domain, position and amino acid observed in the 3D structure; the corresponding side chains are represented by spheres. Features located in the potential B2M contact zone are shown in dark grey and the others are in light grey. Coordinate files: (A) 3fru and (B) 1jfm.

 
Overall, features that are favourable for the interaction with B2M and located in the potential zone of contact seem to correspond to a side chain orientation or a physicochemical property favourable for direct contact with B2M, such as a large and aromatic residue F, W or Y at position [D1] 27 (W for Rattus norvegicus FCGRT). The features favourable for the interaction with B2M and located outside of the potential zone of contact seem to maintain a structure suitable for B2M contact; e.g. residues [D1] 51, [D2] 83 and 85 could ensure closure of the groove (by bringing the two helices closer) at one end. On the contrary, the unfavourable features located in the potential zone of contact seem to prevent direct contact by steric hindrance, such as residues N and K at position [D1] 8 and 25 of Mus musculus RAE1B, respectively. Destabilization of the conformation favourable for the interaction by residues such as E, V, Q or H at position [D2] 39 should be analysed in detail.

Definition of the features in terms of position and amino acid group thus facilitates determination of the physicochemical properties whose detection at a given position seems to be favourable or not for direct contact (for those located in the potential zone of B2M contact), or for stabilizing or not the molecular structure (for those located outside of this zone). The determination of these 4 types of features on heavy chains of MHC-I and MHC-I-like proteins should thus be valuable for future site-directed mutagenesis experiments.

4.3 Site-directed mutagenesis
Polymorphism analyses and site-directed mutagenesis of MHC-I genes described in the literature relate mainly to the interaction affinity of MHC-I proteins with proteins required for peptide presentation (Paquet and Williams, 2002). Among them, the two site-directed mutations of asparagine N (to aspartate D and to glutamine Q) at position [D1] 86 of the HLA-A gene are the only ones described as preventing the interaction between an MHC-I protein and B2M (Barbosa et al., 1987; Santos-Aguado et al., 1987). This partly supports the findings of our study, as we found (Table 1) that position [D1] 86 associated with amino acid group NQ (amide) is favourable for the interaction with B2M. Our classifier highlights the importance of this position for B2M binding, but partly fails to identify the exact amino acid required, as it suggests that mutation N>D could be deleterious while overlooking that N>Q is also deleterious. In fact, all sequences in our dataset corresponding to B2M-bound proteins possess an N at [D1] 86, except those of CD1, which possess a Q at this position. Moreover, Q is totally absent at [D1] 86 in sequences corresponding to B2M-unbound proteins. This explains the selection (by our classifier) of the NQ group as being the most discriminant one at this position. However, as said earlier, CD1 proteins are atypical and much more hydrophobic than other MhcSF proteins. Thus, careful analysis of our dataset and of selected features also suggests that N>Q could be deleterious. We must keep in mind, however, that our dataset is limited and that the 18 features selected by our classifier only give statistical trends, which can be interpreted at the structural level, but should be validated by site-directed mutagenesis.

4.4 Prediction for lower vertebrate MHC-I sequences
We also classified 8 MHC-I proteins of lower vertebrates: Salmo trutta (Satr-UBA, Q9GJJ8 in UniProt/Swiss-Prot), Ambystoma mexicanum (P79458), Oncorhynchus kisutch (Onki-UA, Q9GJB4) and Oncorhynchus mykiss (Onmy-UAA, -UBA, -UCA, -UDA and -UEA; Shiina et al., 2005). Although the sequences of amphibian and teleost MHC genes are known and their evolutionary origin is well studied (Sammut et al., 1999; Hansen et al., 1999), few experimental data relate to the cellular expression of their MHC-I proteins or to whether these proteins interact with B2M (Antao et al., 1999). We thus analysed these 8 proteins by aligning them with the IMGT multiple alignment (see above), numbering their positions, and applying our Bayes classifier. The prediction obtained in this way is the same for the 8 MHC-I proteins, which could hardly be due to chance (~5%, given class priors), and indicates that those proteins very likely bind to B2M. This strongly suggests that they should be expressed on the cell surface by the same process as that of mammalian MHC-I proteins.
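
The chance level quoted above can be reasoned about with a short calculation. The Python sketch below assumes that, under the null hypothesis, each of the n predictions is drawn independently according to the class priors; the probability that all n fall in one given class is then the prior raised to the power n. The function names, the independence assumption and the prior value 0.7 are illustrative assumptions; the class priors are not restated in this section, so the number used here is a placeholder, not a value from the study.

```python
def prob_all_in_class(prior, n):
    """Probability that n independent predictions all fall in a class with this prior."""
    return prior ** n


def prob_all_same(priors, n):
    """Probability that n independent predictions all agree, whatever the class."""
    return sum(prob_all_in_class(p, n) for p in priors)


# Hypothetical prior for the B2M-bound class (placeholder, not taken from the paper).
hypothetical_bound_prior = 0.7
n_proteins = 8

print(prob_all_in_class(hypothetical_bound_prior, n_proteins))
print(prob_all_same([hypothetical_bound_prior, 1 - hypothetical_bound_prior], n_proteins))
```

With two classes, the second quantity is only slightly larger than the first whenever one prior dominates, so either reading of "the same prediction for all 8 proteins" gives a similar order of magnitude.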


    5 DISCUSSION
 
This paper addresses the problem of predicting the interaction between MhcSF proteins and B2M using only sequences and a multiple alignment. This problem is difficult, owing to the low sequence similarity of MhcSF proteins, and constitutes a good feasibility test of function and interaction prediction based solely on sequence information. Our method combines a simple-Bayes classifier with a high-quality multiple alignment and unique numbering, as provided by the IMGT information system. Our results show that this method is accurate, even when the sequence to be predicted has low similarity with sequences in the learning set. Moreover, the results of our method are interpretable, as it identifies sites associated with physicochemical properties that are well conserved within one class and avoided in the other. In our interaction problem, the conserved sites of both the bound and unbound classes belong to the potential contact zone, but also stabilize, or not, the structure required for this contact. Finally, we show that the predictions of our method are partly supported by published site-directed mutagenesis results, and we illustrate its usefulness by analysing lower vertebrate MHC-I proteins, which appear to be similar to mammalian MHC-I proteins. This simple method should thus be readily applicable to numerous other problems in which functions or interactions are to be predicted and a learning set of classified and aligned sequences is available. A direction for further research would be to combine our supervised classification approach with other methods, based on unsupervised classification and site conservation (Lichtarge et al., 1996; del Sol Mesa et al., 2003), using simple models of protein interaction (Gomez et al., 2003), or combining both functional and structural attributes of interacting protein sequence pairs (Huang et al., 2004).
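
To make the combination of aligned positions and a simple-Bayes classifier more concrete, the following Python sketch shows a generic textbook formulation over binary features of the form (domain, position, amino acid group). It is not the classifier specified in Section 3: the additive smoothing constant alpha, the toy learning set and the second feature are assumptions introduced only for illustration, and no feature selection step is included.

```python
import math


def train_simple_bayes(samples, labels, features, alpha=1.0):
    """Estimate class priors and smoothed P(feature present | class).

    samples: list of sets holding the features present in each aligned sequence.
    labels: class label of each sample.
    alpha: additive smoothing constant (an assumption of this sketch; keep it > 0).
    """
    model = {}
    for cls in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        prior = len(rows) / len(samples)
        present = {
            f: (sum(f in r for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for f in features
        }
        model[cls] = (prior, present)
    return model


def predict(model, sample):
    """Return the class with the highest log-posterior, treating features as independent."""
    def log_score(cls):
        prior, present = model[cls]
        return math.log(prior) + sum(
            math.log(p if f in sample else 1.0 - p) for f, p in present.items()
        )
    return max(model, key=log_score)


# Toy learning set. Only the [D1] 86 / NQ feature is named in the text;
# the second feature and all five samples are hypothetical.
F_NQ = ("D1", 86, "NQ")
F_OTHER = ("D2", 10, "hypothetical-group")
features = [F_NQ, F_OTHER]
samples = [{F_NQ}, {F_NQ}, {F_NQ, F_OTHER}, {F_OTHER}, set()]
labels = ["bound", "bound", "bound", "unbound", "unbound"]

model = train_simple_bayes(samples, labels, features)
print(predict(model, {F_NQ}))
print(predict(model, {F_OTHER}))
```

Because each feature contributes one additive log term, the per-feature probabilities can be inspected directly, which is the sense in which such a model remains interpretable.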


    Acknowledgments
 
This work was supported by CNRS, MENESR (doctoral grant to E.D.), Université Montpellier II Plan Pluri-Formation, ACI-IMPBIO, GIS AGENAE and BIOSTIC-LR.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Anna Tramontano

Received on August 3, 2005; revised on December 7, 2005; accepted on December 7, 2005

    REFERENCES

    Antao, A.B., et al. (1999) MHC class I genes of the channel catfish: sequence analysis and expression. Immunogenetics, 49, 303–311.

    Bandyopadhyay, R., Tan, X.X., Matthews, K.S., Subramanian, D. (2002) Predicting protein-ligand interactions from primary structure. Technical Report TR02-398, Rice University, Houston, TX.

    Barbosa, J.A., et al. (1987) Site-directed mutagenesis of class I HLA genes. Role of glycosylation in surface expression and functional recognition. J. Exp. Med., 166, 1329–1350.

    Cao, J., et al. (2003) A naive Bayes model to predict coupling between seven transmembrane domain receptors and G-proteins. Bioinformatics, 19, 234–240.

    Collins, E.J., et al. (1995) The three-dimensional structure of a class I major histocompatibility complex molecule missing the alpha 3 domain of the heavy chain. Proc. Natl Acad. Sci. USA, 92, 1218–1221.

    Domingos, P. and Pazzani, M. (1996) Beyond independence: conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning (ICML), Bari, Italy. Morgan Kaufmann, San Mateo, CA, pp. 105–112.

    Duda, R.O., Hart, P.E., Stork, D.G. (2001) Pattern Classification, 2nd edition. Wiley, New York.

    Duprat, E., et al. (2004) IMGT standardization for alleles and mutations of the V-LIKE-DOMAINs and C-LIKE-DOMAINs of the immunoglobulin superfamily. Recent Res. Dev. Hum. Genet., 2, 111–136.

    D'Urso, C.M., et al. (1991) Lack of HLA class I antigen expression by cultured melanoma cells FO-1 due to a defect in B2m gene expression. J. Clin. Invest., 87, 284–292.

    Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

    Feder, J.N., et al. (1998) The hemochromatosis gene product complexes with the transferrin receptor and lowers its affinity for ligand binding. Proc. Natl Acad. Sci. USA, 95, 1472–1477.

    Giudicelli, V. and Lefranc, M.-P. (1999) Ontology for immunogenetics: the IMGT-ONTOLOGY. Bioinformatics, 15, 1047–1054.

    Gomez, S.M., Noble, W.S., Rzhetsky, A. (2003) Learning to predict protein–protein interactions from protein sequences. Bioinformatics, 19, 1875–1881.

    Good, I.J. (1965) The estimation of probabilities: an essay on modern Bayesian methods. Research Monograph 30, MIT Press, Cambridge, MA.

    Guindon, S. and Gascuel, O. (2003) A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol., 52, 696–704.

    Hand, D.J. (1986) Recent advances in error rate estimation. Pattern Recogn. Lett., 4, 335–346.

    Hansen, J.D., et al. (1999) Expression, linkage, and polymorphism of MHC-related genes in rainbow trout, Oncorhynchus mykiss. J. Immunol., 163, 774–786.

    Hill, D.M., et al. (2003) A dominant negative mutant B2-microglobulin blocks the extracellular folding of a major histocompatibility complex class I heavy chain. J. Biol. Chem., 278, 5630–5638.

    Holmes, M.A., et al. (2002) Structural studies of allelic diversity of the MHC class I homolog MIC-B, a stress-inducible ligand for the activating immunoreceptor NKG2D. J. Immunol., 169, 1395–1400.

    Huang, Y., et al. (2004) Predicting protein–protein interactions by a supervised learning classifier. Comput. Biol. Chem., 28, 291–301.

    Kaas, Q., et al. (2004) IMGT/3Dstructure-DB and IMGT/StructuralQuery, a database and a tool for immunoglobulin, T cell receptor and MHC structural data. Nucleic Acids Res., 32, D208–D210.

    Kaas, Q. and Lefranc, M.-P. (2005) T cell receptor/peptide/MHC molecular characterization and standardized pMHC contact sites in IMGT/3Dstructure-DB. In Silico Biology, 5 (4), 0046 (advance access).

    Kohavi, R., et al. (1997) Improving Simple Bayes. Proceedings of the Ninth European Conference on Machine Learning (ECML). Springer Verlag, Heidelberg, pp. 78–87.

    Lefranc, M.-P., et al. (2005a) IMGT unique numbering for MHC groove G-DOMAIN and MHC superfamily (MhcSF) G-LIKE-DOMAIN. Dev. Comp. Immunol., 29, 917–938.

    Lefranc, M.-P., et al. (2005b) IMGT, the international ImMunoGeneTics information system®. Nucleic Acids Res., 33, D593–D597.

    Lefranc, M.-P., et al. (2005c) IMGT unique numbering for immunoglobulin and T cell receptor constant domains and Ig superfamily C-like domains. Dev. Comp. Immunol., 29, 185–203.

    Li, P., et al. (2002) Crystal structures of RAE-1beta and its complex with the activating immunoreceptor NKG2D. Immunity, 16, 77–86.

    Lichtarge, O., et al. (1996) An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257, 342–358.

    Lidstone, G. (1920) Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Trans. Fac. Act., 8, 182–192.

    Miley, M.J., et al. (2003) Biochemical features of the MHC-related protein 1 consistent with an immunological function. J. Immunol., 170, 6090–6098.

    Paquet, M.-E. and Williams, D.B. (2002) Mutant MHC class I molecules define interactions between components of the peptide-loading complex. Int. Immunol., 14, 347–358.

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

    Pommié, C., et al. (2004) IMGT standardized criteria for statistical analysis of immunoglobulin V-REGION amino acid properties. J. Mol. Recogn., 17, 17–32.

    Sali, A. and Blundell, T.L. (1990) Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol., 212, 403–428.

    Sammut, B., et al. (1999) Axolotl MHC architecture and polymorphism. Eur. J. Immunol., 29, 2897–2907.

    Sanchez, L.M., et al. (1999) Crystal structure of human ZAG, a fat-depleting factor related to MHC molecules. Science, 283, 1914–1919.

    Santos-Aguado, J., et al. (1987) Amino acid sequences in the alpha 1 domain and not glycosylation are important in HLA-A2/beta2-microglobulin association and cell surface expression. Mol. Cell. Biol., 7, 982–990.

    Shannon, C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423.

    Shiina, T., et al. (2005) Interchromosomal duplication of major histocompatibility complex class I regions in rainbow trout (Oncorhynchus mykiss), a species with a presumably recent tetraploid ancestry. Immunogenetics, 56, 878–893.

    Simmonds, R.E. and Lane, D.A. (1999) Structural and functional implications of the intron/exon organization of the human endothelial cell protein C/activated protein C receptor (EPCR) gene: comparison with the structure of CD1/major histocompatibility complex alpha1 and alpha2 domains. Blood, 94, 632–641.

    del Sol Mesa, A., et al. (2003) Automatic methods for predicting functionally important residues. J. Mol. Biol., 326, 1289–1302.

    Thompson, J.D., et al. (2001) Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol., 314, 937–951.

    West, A.P., Jr and Bjorkman, P.J. (2000) Crystal structure and immunoglobulin G binding properties of the human major histocompatibility complex-related Fc receptor. Biochemistry, 39, 9698–9708.

    Wu, T.D. and Brutlag, D.L. (1995) Identification of protein motifs using conserved amino acids properties and partitioning techniques. Proc. Int. Conf. Intell. Syst. Mol. Biol., 19, 402–410.

    Zeng, Z.-H., et al. (1997) Crystal structure of mouse CD1: An MHC-like fold with a large hydrophobic binding groove. Science, 277, 339–345.




