Bioinformatics Advance Access originally published online on November 17, 2007
Bioinformatics 2008 24(1):18-25; doi:10.1093/bioinformatics/btm537
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine-Learning approach for feature weighting
1Division of Medical Chemistry, LACDR, Leiden University, P.O. Box 9502, 2300 RA, Leiden and 2Department of Computer Science, IBIVU, Vrije Universiteit, De Boelelaan 1081A, 1081 HV, Amsterdam, The Netherlands
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Identification of residues that account for protein function specificity is crucial, not only for understanding the nature of functional specificity, but also for protein engineering experiments aimed at switching the specificity of an enzyme, regulator or transporter. Available algorithms generally use multiple sequence alignments to identify residue positions conserved within subfamilies but divergent in between. However, many biological examples show a much subtler picture than simple intra-group conservation versus inter-group divergence.
Results: We present multi-RELIEF, a novel approach for identifying specificity residues that is based on RELIEF, a state-of-the-art Machine-Learning technique for feature weighting. It estimates the expected local functional specificity of residues from an alignment divided in multiple classes. Optionally, 3D structure information is exploited by increasing the weight of residues that have high-weight neighbors. Using ROC curves over a large body of experimental reference data, we show that (a) multi-RELIEF identifies specificity residues for the seven test sets used, (b) incorporating structural information improves prediction for specificity of interaction with small molecules and (c) comparison of multi-RELIEF with four other state-of-the-art algorithms indicates its robustness and best overall performance.
Availability: A web-server implementation of multi-RELIEF is available at www.ibi.vu.nl/programs/multirelief. Matlab source code of the algorithm and data sets are available on request for academic use.
Contact: elena{at}few.vu.nl
Supplementary information: Supplemenmtary data are available at Bioinformatics online
| 1 INTRODUCTION |
|---|
|
|
|---|
Many homologous protein families have a common biological function but different specificity towards substrates, ligands, effectors, proteins and other interacting molecules. All these interactions require a certain specificity. Identifying crucial residues for this specificity is a prerequisite for understanding the nature of functional specificity, for planning experiments on functional analysis or protein redesign, and for guiding point mutations aimed at switching the specificity of an enzyme, regulator or transporter.
In order to detect specificity residues, advanced computational techniques are needed, because of a great variety of functional specificities observed in nature and the vast amount of protein sequence data. A number of algorithms have been proposed in recent years for detecting specificity residues from a multiple sequence alignment (MSA) (Bickel et al., 2002; Carro et al., 2006; Del Sol Mesa et al., 2003; Feenstra et al., 2007; Gu, 2006; Hannenhalli and Russell, 2000; Kalinina et al., 2004; Mihalek et al., 2004; Ye et al., 2006). Most algorithms employ information-entropy-related scoring functions (Shenkin et al., 1991) to rank residue positions according to the association with the subfamilies (Whisstock and Lesk, 2003). While many algorithms require a predefined subdivision of the MSA into classes, some induce a grouping on the fly.
The SDPpred method (Kalinina et al., 2004) uses mutual information to identify residue positions in which amino acid distributions correlate with the subfamily grouping (Mirny and Gelfand, 2002).
The Two-entropies analysis algorithm (TEA) (Ye et al., 2006) creates a 2D plot of residue conservation in terms of Shannon entropy at both superfamily and subfamily level. Functional sites such as conserved or specificity residues can be distinguished easily from other residues.
The TreeDet approach introduced by Del Sol Mesa et al. (2003) contains three algorithms for detecting so-called tree-determinant residues from an unpartitioned MSA. The Level Entropy (S) method first uses relative entropy to search for an optimal grouping of the alignment and then considers positions conserved within classes but different among classes as the tree-determinants. The Sequence Space Automatization (SS) method applies principle component analysis to the alignment and computes an optimal number of clusters and the residues that correspond to them. Finally, the Mutational Behavior (MB) method looks for residues whose mutational behavior resembles the phylogeny of the alignment.
The Sequence Harmony (SH) method (Feenstra et al., 2007; Pirovano et al., 2006) scores compositional overlap between two user-specified groups. The algorithm does not exploit the notion of subfamily conservation but focusses on compositional differences between the subfamilies.
In this article, we introduce multi-RELIEF, a new algorithm for identifying specificity residues from a given MSA and predefined multiple classes using local conservation properties. The approach is based on a state-of-the-art Machine-Learning technique for feature weighting, called RELIEF, which exploits the notion of locality for estimating relevance of attributes in discriminating samples from two classes (Kononenko, 1994). In the biological context considered here, locality corresponds to sequence space (Landgraf et al., 2001).
Multi-RELIEF estimates the expected local specificity of residues, by comparing each sequence with the most similar sequence in the same class and with the most similar in opposite classes. The nearest neighbor sequences are selected based on global identity. A residue is considered relevant if it has high local specificity with respect to at least one pair of classes.
While other algorithms consider residue positions independently, multi-RELIEF considers global sequence similarity while scoring each residue. Furthermore, the method can cope with sub-family classifications derived from phylogeny, which generally are heterogeneous. Miss-classification, a general error that can arise from, e.g. misannotation, will result in a close match between some opposing classes. Multi-RELIEF is able to recover the innate specificity of a class, whenever one of the other classes can be contrasted to it. This alleviates the problem of downweighting the relevance of residue positions, e.g. in cases where a single class is polluted with a misplaced sequence.
Multi-RELIEF can optionally include 3D structural information, if available. It does this by employing a new heuristic based on the assumption that a specificity residue does not evolve in isolation, but within a functional cluster in the protein structure. This means that a residue would be more likely to be a specificity residue if its neighboring residues are also specific.
To test our novel approach thoroughly, seven experimentally determined benchmark sets were considered, taken from five widely studied protein families: G protein-coupled receptors (GPCRs), the LacI family of bacterial transcription factor, the Ras-superfamily of small GTP-ases, the MIP-family of integral membrane transporters and the Smad family of transcription factors. The performance of multi-RELIEF was compared with TEA and SDPpred (both acting on multiple classes), TreeDet/MB (no class division required) and SH (acting on two classes). Using ROC curves we show that (a) multi-RELIEF identifies specificity residues, (b) incorporating structural information improves prediction for specificity of interaction with small molecules and (c) comparison of multi-RELIEF with other algorithms indicates its robustness and best overall performance.
| 2 METHODS |
|---|
|
|
|---|
2.1 RELIEF
Many interesting feature weighting algorithms based on different approaches have been introduced in Machine Learning (Guyon and Elisseeff, 2003). One particular class uses a multivariate filter prior to the construction of a model (the classifier) to quantify the relevance of features as to their ability to jointly discriminate between classes. RELIEF is considered one of the most successful filter multivariate feature weighting algorithms (Guyon and Elisseeff, 2003), due to its simplicity and effectiveness (Kononenko, 1994). We recently applied RELIEF for selecting specificity residues (subtype specific functional sites) from protein sequences of the Smad receptor binding family (Marchiori et al., 2006).
Given samples from two classes, RELIEF iteratively assigns weights to features based on how well they separate samples from their nearest neighbor (nnb) within the same class relative to that within the opposite class (Marchiori et al., 2006). To do this, RELIEF employs a feature weight vector. At each iteration, one sequence seq is selected. The weights are updated by adding the difference between seq and its nnb from the opposite class, miss(seq), and subtracting the difference between seq and its nnb from the same class, hit(seq). We define nnb for a sequence seq to class l as nnb(seq) = argmin {d(seq, x), x
Xl} where d denotes the Hamming distance between strings [e.g. d(ALM, VLM) = 1]. The difference between two sequences seq1 – seq2 is a vector representing matches (0) and mismatches (1) between residues (e.g. ALM – VLM = 100). This procedure is iterated over all sequences of the dataset. The computational complexity of RELIEF is
.
A residue position (site) will obtain best weight if it has maximal local specificity over all triplets of a sequence, its nearest neighbor in the same, and that in the opposite class, i.e. local in sequence space. Thus if a residue position is conserved within each class but divergent between classes, then its RELIEF weight will be high. Completely conserved positions and overall divergent positions will get zero weight, while positions that are divergent within subfamilies but conserved between subfamilies will get negative weight.
2.2 Multi-RELIEF
RELIEF is a two-class feature weighting algorithm. However, large protein families with a variety of specificities require algorithms acting on multiple classes. Extensions of RELIEF to handle multiple classes have been proposed (Kononenko, 1994; Robnik-Sikonja and Kononenko, 2003; Sun and Li, 2006). For instance, Kononenko (1994) introduced RELIEF-F where the weight vector is updated by the sum of miss(seq) weighted by the estimated a priori probabilities of the classes. Here, we present a new ensemble approach based on random sub-sampling of pairs of classes. The multi-RELIEF algorithm is illustrated below in pseudo-code.
Multi-RELIEF
%input: X1.,Xm (m classes of aligned proteins)
%parameters: nr_iter, nr_sample
%output: multi_W (weights assigned to positions)
nr_positions = total number of positions;
weights = zero vector of size nr_positions;
for i = 1: nr_iter
select randomly two classes
X = select randomly nr_sample sequences
from each selected class
W_i = apply RELIEF to X end;
for s = 1: nr_positions
multi_W(s) = (average across positive W_i(s)'s);
end;
return multi_W
In multi-RELIEF, multiple runs (nr_ iter) of RELIEF are performed. At each run i, first two classes are randomly selected. Next, nr_ sample sequences from each class are randomly selected. Finally, RELIEF is applied to the resulting two classes, yielding an output vector Wi. When the multiple runs are completed, the weight multi_W(s) of a position s is computed by averaging the positive weights assigned to that position by the nr_iter runs of RELIEF. That is, using N + = |{Wi (s) > 0
i}| and N –= |{Wi (s) < 0
i}|,
|
|
Random sampling of pairs of classes is mainly employed for efficiency reasons, while random sub-sampling of sequences is applied for handling unbalanced classes as well as for gaining efficiency. The computational complexity of multi-RELIEF is
while that of RELIEF-F is
Algorithms that do not consider the context (univariate scoring algorithms), such as TEA and SH, are generally more efficient with complexity O(nr_seq · nr_positions).
Table 1 illustrates the application of multi-RELIEF to a toy dataset. Note that positions b and c both get maximum weight. This is expected for position c, because it fully discriminates each pair of classes. Instead, position b only discriminates a subset of classes, e.g. C1/C3, while it does not separate the remaining pairs of classes, i.e. C1/C2 and C3/C4. So only residue positions that, at least partly, discriminate between pairs of classes have a positive weight assigned by multi-RELIEF. This property of the algorithm is desirable, e.g. in cases where the number of subfamilies is larger than the number of amino acids, such as the GPCR benchmark (see below) that consists of 77 classes.
|
2.3 multi-RELIEF + 3D contacts
As an additional step in multi-RELIEF, 3D structural information can be exploited. We use a simple heuristic based upon the notion that functional specificity generally does not evolve for a single residue but typically involves a cluster of residues in the protein structure. For each position s, we adjust the corresponding multi-RELIEF weight by adding the average weight of its 3D neighbors. Thus, the score of a residue will be boosted if its neighbors have a high average score. 3D neighbors are residues that share surface with a given residue as calculated by the web server at http://ligin.weizmann.ac.il/cma (Sobolev et al., 1999). From the list returned, residue pairs with a sequence distance of two or less are removed.
2.4 Comparison to other algorithms
To assess the performance of multi-RELIEF versus other methods, we included the following four recent state-of-the-art algorithms:
- TEA: Ye et al. (2006)
- SDPpred: http://bioinf.fbb.msu.ru/SDPpred/index.jsp
- TreeDet/MB: http://www.pdg.cnb.uam.es/Servers/treedet/
- SH: http://www/ibi.un.nl/programs/seqharmwww/
|
|
SDPpred was applied with 10000 shuffles for each column, and a maximum allowed percentage (70%) of gaps in a group in each column; these are the highest possible settings allowed through the web interface of SDPpred.
TreeDet/MB was applied with the following setting, in order to obtain a ranking of the residues: advanced run for MB method, cutoff set to 10 – 12 and percentage of High Scoring Residues set to 100%. We could not run TreeDet on the GPCR dataset because its web server accepts a maximum of 200 sequences. For this reason, we compiled a GPCR-190 reduced set (see below), to which TreeDet was applied.
SH has no adjustable parameters, except for the cutoff value that is irrelevant for the generation of the ROC curves used. Note that for a fair comparison between the methods, the tie-braking by sequential groups (Rank) and entropy was excluded from the SH method. A similar mechanism could be added to the other methods in a post-processing step. SH was not applied to the GPCR and LacI datasets since these consist of more than two classes.
Multi-RELIEF was run using parameters nr_iter = 1000 and nr_samples = 10. These values were chosen based on the number of classes and their sizes, albeit no parameter tuning was applied. In general, a high value of nr_iter and a reasonably small value of nr_samples are recommended. Ties were broken by sorting residue positions with equal score in increasing sequence position.
2.5 Benchmark studies
The performance of a method may depend on the type of protein family and functional specificity properties considered. We therefore carried out a benchmark involving seven different protein families with various associated functional specificity properties (Table 2).
|
G Protein-Coupled Receptors (GPCRs) are integral cell membrane proteins involved in signal transduction. Their mediatory role makes them important drug targets (Gether et al., 2002; Pierce et al., 2002). We extracted the MSA of class A GPCRs in the transmembrane region from the latest version of the GPCRDB [Horn et al., 2003, June 2006 release (10.0), www.gpcr.org/7tm], yielding an MSA of 2065 protein sequences with an average identity of 26% over all sequence pairs in the alignment. The MSA was classified into 77 subfamilies according to the recognition of endogenous ligands. An additional reduced MSA was derived by applying a redundancy limit of 65% identity, and subsequently discarding all subfamilies that had only one sequence remaining. This yielded an MSA of 190 protein sequences divided over 39 GPCR families, which was named GPCR-190. Residues are deemed to be ligand binding whenever their mutation affects ligand binding in aminergic receptors, as listed in Table 2.
The LacI family is one of the largest families of bacterial transcription factors. This family was analyzed by Mirny and Gelfand (2002) using a technique based on mutual information. We used a multiple sequence alignment of 54 LacI protein sequences (Mirny and Gelfand, 2002) classified into 15 families. Suckow et al. (1996) mutated positions 2–329 of Lac repressor into 12 or 13 of the 20 natural occurring amino acids. These 4000 well-defined mutants yielded a functional classification for each position. We took the residues in group IX (DNA binding) and XI (IPTG binding) as the specificity residues. Some of these are actually conserved in the alignment and thus cannot contribute to specificity. These were subsequently excluded from the selection. The resulting 28 specificity determining residues are listed in Table 2.
The Ras superfamily of small GTP-ases is implicated in the regulation of growth, survival, differentiation and other processes in haematopoietic cells (Reuther and Der, 2000). It comprises six families, of which experimental evidence for functional sites was available from the literature for the Rab 5 versus Rab 6 subfamilies, and the Ras versus Ral families, as defined in Pirovano et al. (2006). The 28 and 12 true positives, respectively, are listed in Table 2. The MSAs of 4 Rab5 and 6 Rab6, and of 20 Ras and 69 Ral protein sequences described in Pirovano et al. (2006) were used.
The Major Intrinsic Protein (MIP) family of Integral Membrane Transporters is mainly involved in facilitating the transport of both water and small neutral solutes through the cellular membrane in all domains of life. There are about six MIP subfamilies, the two major are the aquaporins (AQPs) and the glycerol-uptake facilitators (GLPs) (Zardoya and Villalba, 2001). The MSA of 12 AQP and 48 GLP protein sequences described in Pirovano et al. (2006) was used. Residues with at least one atom closer than 7 Å to the bound glycerol molecules in the GLP pore channel in the crystal structure 1FX8 (Fu et al., 2000), excluding those that were conserved in the training set of sequences, as defined in Pirovano et al. (2006). This yields a set of 37 sites, which are listed in Table 2.
The Smad family of TGF β-associated transcription factors plays a crucial role in the transforming growth factor-β signaling pathway and is critical for determining the specificity between alternative pathways (Feng and Derynck, 2005; Massague et al., 2005). The family can be subdivided into two major classes: AR-Smads that are mainly induced by TGFβ-type receptors, and BR-Smads that are mainly induced by the BMP-type receptors. The MSA of 8 AR-Smad and 12 BR-Smad non-redundant sequences of the Smad-MH2 domain described in Pirovano et al. (2006) was used. The 29 specificity determining residues as defined in Pirovano et al. (2006) are listed in Table 2.
2.6 Evaluation of the algorithms' performance
The Receiver-operator characteristic (ROC) curve is used for testing the performance of an algorithm for separating true and false positives (Provost and Kohavi, 1998; Swets, 1988). Here known functional specificity residues are considered true positives. The remaining residues are considered true negatives. We use the scoring (weight) values as threshold for generating the ROC curve. For each weight value v, the set of residues with weight higher than or equal to v is considered: the true positive percentage is reported on the y axis (sensitivity, or coverage), and the false positive percentage (1–specificity, or error) on the x axis. The ROC curve thus describes the goodness of a method in giving higher ranking to the given functionally important residues.
3 RESULTS
From the ROC curves in Figure 1A and D, the results of our multi-RELIEF method appear superior to the other methods over all the datasets. The addition of 3D contacts yields a clear improvement for the GPCRs and LacI family, as is shown in the weights plots in figure 1B and E. This is more evident for the GPCRs, for reasons explained in the next section.
|
The Smad, Ras/Ral and Rab5/Rab6 datasets contain two classes, which are rather balanced. In this case nearly all algorithms achieve high performance, but some variations are still observable (see Fig. 2, Supplementary Material). In general, the distributions of true positives with respect to the computed scoring weights are similar for Smad, GPCR and Ras/Ral, and for Rab5/Rab6 to somewhat lesser extent. The true positives occur in the upper part of the curve, i.e. they satisfy the multi-RELIEF condition of being locally specific.
For the LacI and MIP datasets the situation is slightly different. Here, the majority of true positives also occurs on the upper part of the curve, but some are retained at the central or lower parts of the curve. Clearly, some of the LacI true positives do not conform to the model of local specificity exploited by multi-RELIEF. Upon detailed examination, these sites turn out to be largely conserved, a-like positions, as discussed further below.
The overall performance of the methods can be captured by the area under the curve (AUC) in the ROC plots, as listed in Table 3. Here we observe that in five of the datasets, multi-RELIEF or multi-RELIEF + 3D contacts is the best-scoring algorithm. In two others, they are not far below the best. A notable exception is the GPCR-190 reduced set, for reasons that are explained below. Importantly, the other methods are top-scoring only in at most one single dataset. The average scores over all datasets, in the last column of Table 3, also shows that multi-RELIEF and multi-RELIEF + 3D contacts are the top-scoring methods, with a consistent but modest lead for multi-RELIEF + 3D contacts.
|
| 4 DISCUSSION |
|---|
|
|
|---|
4.1 Evolving Specificity Residues
It is well accepted in sequence analysis that conserved residues are likely to be functionally important. Indeed, many early approaches select functional sites by simply picking the most conserved positions in a given MSA. Since vast amounts of sequence data have become available, sequence comparison between paralogous and orthologous proteins is performed routinely in order to identify specificity residues that account for differences between functional subgroups. Most state-of-the-art approaches for functional specificity detection require an MSA with predefined functional classes. They then forward MSA positions conserved within each group but different between groups as functionally specific. However, different degrees of specificity may be relevant. For example, position c in the Table 1 toy example provides a perfect explanation of such subdivision in classes. Although position b is insufficient for differentiating all four classes, it does provide some information about the difference between C1, C2 and C3, C4.
The specificity residues considered in this study include c-like positions, that are fully class-specific, but also partially class-specific b-like positions are present, especially in the GPCR case study. The following evolutionary scenario can explain this observation. After proteins learn how to fold correctly in order to perform their main function, they can start evolving new functional sites in order to interact with other components such as small molecules, DNA, RNA or another protein. Such a process can be conducted in a stepwise fashion, first by establishing general interaction anchor points (conserved, a-like positions), next by evolving to more selective recognition sites (specific, b- and c-like). For example, if proteins want to interact with DNA, they first evolve some positively charged residues in a certain region of the protein just to attract the negatively charged phosphoric acid group(s) of DNA. They then can evolve b-like positions to selectively bind to a specific category of DNA and finally, they can obtain c-like positions to achieve specific recognition of a particular DNA fragment.
Functional specificity sites can therefore contain different types of specificity positions. The proportions of a-, b- and c-like positions (see Table 1) may vary within different protein families. In our benchmark studies, for the GPCRs, Smad and LacI datasets, we defined all residues at the specificity interaction interface according to the experimental evidence and excluded a-like. Such definition is straightforward but results in c- and b-like positions being taken as true positives. If a family contains a high percentage of c-like positions, methods focussing on intra-group conservation will all perform well, while a more varying performance is likely with larger proportions of b-like positions.
4.2 Using 3D contacts
Although multi-RELIEF attains similar or better performance than its counterparts considered here, we have demonstrated that specificity detection can be further enhanced by taking 3D contact information into account (Fig. 1A and D). In this scenario, the score of a residue position will be boosted if its neighboring residues score high, introducing a bias towards spatially clustered residues. Depending on the ligand being a small molecule or a larger protein or DNA structure, employing contact information may affect predictions differently. If the ligand is a small moiety, the specificity residues form a small, compact cavity, such that application of 3D contacts improves prediction. On the other hand, interaction interfaces to large protein or DNA ligands will generally be larger and more planar, often leading to relatively few isolated interface residues providing specificity recognition. This renders 3D contacts potentially less beneficial for datasets associated with proteins interacting with larger ligands.
4.3 Benchmark performance
4.3.1 GPCRs
On the GPCRs dataset, all algorithms except SDPpred perform well. The GPCR ligand binding site is illustrated by retinal, the endogenous ligand of bovine rhodopsin in Figure 1C. Multi-Relief outperforms the other methods substantially, and the use of structure information (multi-Relief + 3D contacts) further improves its performance. There are two factors that contribute to these observations. First, there are 77 subfamilies in the GPCRs dataset that cannot be uniquely differentiated by a single position using the 20 natural amino acids. Thus, in the absence of absolute c-like positions, b-like positions are the best alternative. This gives multi-Relief an obvious advantage in identifying b-like positions. Second, the class A GPCRs evolved to recognize small molecules so that the specificity site is relatively compact and concentrated in a small region of the protein compared to other protein families that recognize DNA, RNA or protein (Fig. 1C). This also explains the relatively large performance increase, compared to the other datasets, of multi-Relief when 3D contacts are used for boosting results for the GPCR dataset.
For the reduced GPCR-190 set, average AUC of all methods is similar to that of the full GPCR set, see Table 3 (also Fig. 2, Supplementary Material). Intriguingly, only multi-RELIEF performs similarly over both datasets, while all other methods perform differently. Multi-RELIEF + 3D contacts give a much smaller improvement over multi-RELIEF than in the full GPCR set, but more strikingly, the performance of TEA and SDPpred are higher. An explanation can be found in the 65% redundancy threshold applied. This retains diversity within a subfamily, i.e. the most divergent members, but multi-RELIEF relies on differences between nearest neighbors, which could be entirely different in the reduced set. Even the 3D information apparently cannot overcome this. TEA and SDPpred, on the other hand, put more emphasis on entropy to measure the overall differences between the subfamily, which may be more pronounced in the reduced set.
4.3.2 LacI
Results on the LacI dataset highlight the difference between specificity-related binding to small molecules compared to binding DNA. LacI transcription factors bind to particular DNA fragments to prevent transcription of downstream genes. After recognition of ligands specific for each subfamily, they change conformation so that RNA polymerase is no longer blocked from binding to DNA. This leads to high expression of the encoded proteins. As illustrated in Figure 1E, multi-RELIEF generally assigns higher weights to residues that recognize the small molecule than those binding to DNA. Moreover, application of the 3D contacts option boosts the weights of these residues.
Figure 1F shows the structure of the transcription factor of LacI. Among the small molecule binding sites, the position of R196 has low weight, even after being boosted by means of the 3D contacts step. When looking at its residue composition (data not shown), we can see that R196 is a b-like position, since amino acid R occurs in 47 out of a total of 54 protein sequences.
For this dataset, the 3D contacts information does not notably improve the detection of the DNA-binding residues, but, importantly, the prediction quality also does not suffer from the 3D contacts. The limited added value may be due to the fact that the DNA binding site is much bigger and more extended than the binding site for small molecules. Thus, interaction between protein and DNA may include several relatively isolated and spatially separated locations. For example, residue S16 interacts with DNA and indeed is assigned a high weight since it contributes to specific recognition. However, neighboring residues do not interact with the DNA and have low weight, so the score of S16 becomes worse after application of the 3D contacts step.
In addition, we identified a specific region of the protein where two residues, S21 and A27 are close to each other and have high weights before and after application of 3D contacts. Although these two residues were not characterized as DNA binding by Suckow et al. (1996), they are located within 5 Å distance to the DNA.
4.3.3 Ras
The two datasets from the Ras family are based upon mutation experiments, three regions of about 10 residues each for Rab5 versus Rab6, and 12 point mutations for Ras versus Ral and Rab. They show best performance for multi-RELIEF and worst for TEA, while other methods perform very similar and only slightly below multi-RELIEF, see Table 3 (and Fig. 2, Supplementary Material for more details). Overall performance of all methods is lower for Rab5/Rab6 than for Ras/Ral (Table 3).
Although specificity in the Ras superfamily is related to recognition of various small-molecule and protein targets, multi-RELIEF is well able to recognize these sites. However, due to the presence of multiple interacting sites (see Fig. 2, Supplemantary Material), addition of 3D contacts information does not lead to a gain in detection of specificity residues.
4.3.4 MIP
The MIP dataset is based on a structural definition of functional residues: those close to the ligand in the crystal structure. Overall performance of all methods is relatively low (see Table 3). Multi-RELIEF + 3D contacts and TEA together are the best-scoring methods. Importantly, multi-RELIEF and multi-RELIEF + 3D contacts show the steepest initial slope in the ROC curves (Fig. 2, Supplementary Material), which is relevant for experimental planning if only top-scoring sites are to be examined.
4.3.5 Smad
The Smad dataset is a special benchmark because the true positive residues have been verified directly by site-directed mutagenesis experiments. It is different from the other datasets in that it contains two classes. The known functional specificity sites are a mix of b/c-like positions, i.e. specific and conserved in each class, and d-like positions, that are specific but not conserved within the classes.
The performance of all methods on the Smads is remarkably good, compared to the other datasets, see Table 3 and Figure 2, Supplementary Material (note the difference in scale of the FPR axis). This is likely due to the comprehensive experimental coverage of true functional Smad sites, reducing the proportion of false negatives and increasing overall performance by all methods. The 3D contact step results in a slightly decreased performance of multi-Relief. This may be due to the fact that three different functional interactions are involved, each involving distinct interaction interfaces on the Smad protein surface.
| 5 CONCLUSION |
|---|
|
|
|---|
In this study, we proposed a novel multi-Relief algorithm for identifying specificity-related functional sites. We provided an option for boosting prediction quality using structural information, if available, for specificity of interaction with small molecules. We tested the performance of multi-RELIEF and other recent algorithms on seven different experimental benchmark cases. The results demonstrate robustness and best overall performance of multi-RELIEF over a wide variety of biological cases.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank our student Bernhard Kammleitner for implementing the multi-RELIEF web server. We also thank Leonid A. Mirny and Mikhail S. Gelfand for making their LacI dataset available for this study. This work was financially supported by the NWO-Bioinformatics Breakthrough Project (050-71-047), The Netherlands.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Burkhard Rost
Received on July 28, 2007; revised on October 18, 2007; accepted on October 18, 2007
| REFERENCES |
|---|
|
|
|---|
Bickel P, et al. Finding important sites in protein sequences. Proc. Natl Acad. Sci. USA (2002) 99:14764–14771.
Carro A, et al. Treedet: a web server to explore sequence space. Nucleic Acids Res (2006) 35:99.
DelSol Mesa A, et al. Automatic methods for predicting functionally important residues. J. Mol. Biol (2003) 326:1289–1302.[CrossRef][Web of Science][Medline]
Feenstra K, et al. Sequence harmony: detecting functional specificity from alignments. Nucleic Acids Res (2007) 35:W495–W498.
Feng X, Derynck R. Specificity and versatility in TGF-beta signaling through Smads. Annu. Rev. Cell Dev. Biol (2005) 21:659–693.[CrossRef][Web of Science][Medline]
Fu D, et al. Structure of a glycerol-conducting channel and the basis for its selectivity. Science (2000) 290:481–486.
Gether U, et al. Structural basis for activation of g-protein-coupled receptors. Pharmacol. Toxicol (2002) 91:304–312.[CrossRef][Web of Science][Medline]
Gu X. A simple statistical method for estimating type-ii (cluster-specific) functional divergence of protein sequence. Mol. Biol. Evol (2006) 23:1937–1945.
Guyon I, Elisseeff A. An introduction to variable and feature selection. J. Mach. Learn. Res (2003) 3:1157–1182.[CrossRef]
Hannenhalli S, Russell R. Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol (2000) 303:61–76.[CrossRef][Web of Science][Medline]
Horn F, et al. Gpcrdb information system for g protein-coupled receptors. Nucleic Acids Res (2003) 31:294–297.
Kalinina O, et al. SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Res (2004) 32:W424–W428.
Kononenko I. Estimating attributes: analysis and extensions of relief. In Bergadano, F. and De Raedt, L. (eds). European Conference on Machine Learning (1994) volume LNCS 784. Springer-Verlag New York, Secaucus, NJ, USA, pp. 171–182. http://portal.acm.org/citation.cfm?id=188427.
Landgraf R, et al. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteinsr. J. Mol. Biol (2001) 307:1487–1502.[CrossRef][Web of Science][Medline]
Marchiori E, et al. A feature selection algorithm for detecting subtype specific functional sites from protein sequences for smad receptor binding. In Proceedings of the Fifth International Conference on Machine Learning and Applications (ICMLA'06) (2006) pp. 168–173. http://doi.ieeecomputersociety.org/10.1109/ICMLA.2006.7.
Massague J, et al. Smad transcription factors. Genes Dev (2005) 19:2783–2810.
Mihalek I, et al. A family of evolution-entropy hybrid methods for ranking protein residues by importance. J. Mol. Biol (2004) 336:1265–1282.[CrossRef][Web of Science][Medline]
Mirny L, Gelfand M. Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J. Mol. Biol (2002) 321:7–20.[CrossRef][Web of Science][Medline]
Pierce K, et al. Seven-transmembrane receptors. Nat. Rev. Mol. Cell Biol (2002) 3:639–650.[CrossRef][Web of Science][Medline]
Pirovano W, et al. Sequence comparison by sequence harmony identifies subtype specific functional sites. Nucleic Acids Res (2006) 34:6540–6548.
Provost F, Kohavi R. Guest editors introduction: on applied research in machine learning. Mach. Learn (1998) 30:127–132.[CrossRef]
Reuther G, Der C. The Ras branch of small GTPases: Ras family members don't fall far from the tree. Curr. Opin. Cell Biol (2000) 12:157–165.[CrossRef][Web of Science][Medline]
Robnik-Sikonja M, Kononenko I. Theoretical and empirical analysis of relieff and rrelieff. Mach. Learn (2003) 53:23–69.[CrossRef]
Shenkin P, et al. Information-theoretical entropy as a measure of sequence variability. Proteins (1991) 11:297–313.[CrossRef][Web of Science][Medline]
Sobolev V, et al. Automated analysis of interatomic contacts in proteins. Bioinformatics (1999) 15:327–332.
Suckow J, et al. Genetic studies of the lac repressor. xv: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J. Mol. Biol (1996) 261:509–523.[CrossRef][Web of Science][Medline]
Sun Y, Li J. Iterative relief for feature weighting. In Proceedings of the 23rd International Conference on Machine Learning (2006) ACM Press, USA, pp. 913–920. http://www.icml2006.org/icml2006/technical/accepted.html.
Swets J. Measuring the accuracy of diagnostic systems. Science (1988) 240:1285–1293.
Whisstock J, Lesk A. Prediction of protein function from protein sequence and structure. Q. Rev. Biophys (2003) 36:307–340.[CrossRef][Web of Science][Medline]
Ye K, et al. A two-entropies analysis to identify functional positions in the transmembrane region of class a g protein-coupled receptors. Proteins (2006) 63:1018–1030.[CrossRef][Web of Science][Medline]
Zardoya R, Villalba S. A phylogenetic framework for the aquaporin family in eukaryotes. J. Mol. Evol (2001) 52:391–404.[Web of Science][Medline]
This article has been cited by other articles:
![]() |
J. E. Donald and E. I. Shakhnovich SDR: a database of predicted specificity-determining residues in proteins Nucleic Acids Res., January 1, 2009; 37(suppl_1): D191 - D194. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

