Bioinformatics Advance Access originally published online on January 18, 2007
Bioinformatics 2007 23(5):597-604; doi:10.1093/bioinformatics/btl660
Protein–protein interaction site prediction based on conditional random fields
Bioinformatics Research Group, ITNLP Lab, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: The fast-growing number of protein structures in the Protein Data Bank provides the information needed to predict protein–protein interaction sites, which motivates the development of methods for identifying residues that participate in protein–protein interactions. We compare a conditional random fields (CRFs)-based method with conventional classification-based methods, which ignore the relation between the labels of neighboring residues, to show the advantages of the CRFs-based method in predicting protein–protein interaction sites.
Results: The prediction of protein–protein interaction sites is formulated as a sequential labeling problem and solved by applying CRFs with features including protein sequence profile and residue accessible surface area. The CRFs-based method achieves performance comparable with state-of-the-art methods when 1276 nonredundant hetero-complex protein chains are used as training and test sets. Experimental results show that the CRFs-based method is a powerful and robust protein–protein interaction site prediction method and can be used to guide biologists in designing specific experiments on proteins.
Availability: http://www.insun.hit.edu.cn/~mhli/site_CRFs/index.html
Contact: mhli{at}insun.hit.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Biological functions and processes are performed through the interactions among proteins, RNA or DNA. Understanding the characteristics of protein interfaces is of great significance for protein mimetic engineering, elucidation of molecular pathways and drug design (Lichtarge et al., 2002; Sowa et al., 2001; Zhou, 2004). Protein–protein interaction is an important factor in determining protein function (Letovsky and Kasif, 2003; Nabieva et al., 2005). Furthermore, identification of interface residues can help in constructing a structural model for a protein complex (Dominguez et al., 2003).
The availability of more and more protein structures in the Protein Data Bank (PDB) (Berman et al., 2000) makes prediction of protein–protein interaction sites possible. Machine learning methods, such as artificial neural networks (ANN) (Chen and Zhou, 2005; Fariselli et al., 2002; Zhou and Shan, 2001) and support vector machines (SVM) (Bradford and Westhead, 2005; Chung et al., 2006; Koike and Takagi, 2004; Res et al., 2005), have been successfully applied in this field. These studies consider sequential, structural or evolutionary features such as amino acid residue composition (Chen and Zhou, 2005; Chung et al., 2006; Koike and Takagi, 2004; Res et al., 2005; Zhou and Shan, 2001), spatially neighboring residues (Wang et al., 2006; Zhou and Shan, 2001), accessible surface area (Koike and Takagi, 2004), structural conservation score (Chung et al., 2006) and residue evolutionary information (Res et al., 2005; Wang et al., 2006). Most of these methods focus on predicting protein–protein interaction sites on the surface of proteins with known structures (Koike and Takagi, 2004; Zhou and Shan, 2001). In contrast, Ofran and Rost (2003) use only local sequence information, and Res et al. (2005) use protein sequential and evolutionary information to predict protein interaction sites without structural information. Recently, Liang et al. (2006) presented an empirical score function, a linear combination of energy score, interface propensity and residue conservation score, for prediction of protein binding sites.
These traditional methods treat protein–protein interaction site prediction as a classification task and study each residue separately, so interface residues are identified one at a time. One drawback of these methods is that the relation between the labels (interface or noninterface) of neighboring residues is not taken into consideration. In fact, sequentially or spatially neighboring residues should have similar characteristics in forming an interface. Chung et al. (2006) noticed this relation and used clustering as a post-processing strategy to remove isolated interface residues predicted by SVMs and to include noninterface residues surrounded by several predicted interface residues.
In order to capture the inter-relation between neighboring residues, we formalized the prediction of protein interaction sites as a sequence labeling task. Sequence labeling tasks are very common in natural language processing, for example part-of-speech tagging (Lafferty et al., 2001; Ratnaparkhi, 1996), named-entity recognition (Chinchor, 1998) and information extraction (Freitag and McCallum, 2000). Recently, conditional random fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006) have been successfully applied to sequence labeling problems and have also proved effective in bioinformatics problems such as protein secondary structure prediction and protein fold recognition (Liu et al., 2004, 2005). The advantage of CRFs is that they can integrate both rich state features and transition features between label states. Furthermore, CRFs have advantages over traditional graphical models such as hidden Markov models (HMMs) (Rabiner, 1989) and maximum entropy Markov models (MEMMs) (McCallum et al., 2000), and they are among the outstanding methods for labeling sequence data. In this study, given a protein sequence with structural information, each residue is to be labeled as an interface residue or a noninterface residue.
CRFs are efficient methods for labeling sequence data and differ from classification methods such as SVMs and the maximum entropy method (ME) (Rosenfeld, 1996). In this article, we compare the performance of CRFs in predicting protein interaction sites with state-of-the-art methods, such as SVMs and ANN. CRFs can be used to label the residues of a whole protein sequence, but only surface residues were chosen for comparison with other methods. Basic features, including the sequence profile and accessible surface area of spatially neighboring residues, were used to compare the performance of CRFs with that of other methods. Experimental results show that the CRFs-based method is comparable with the conventional classification methods on 1276 nonredundant chains of hetero complexes selected from the PDB.
| 2 MATERIAL AND METHODS |
|---|
2.1 Data set
All X-ray diffraction protein structures with multiple chains and a resolution of <3.5 Å were extracted from the PDB (July, 2005) (Berman et al., 2000). Protein chains shorter than 40 residues were removed. For each structure, we selected chain pairs with >20 interfacial residues on each chain. A residue is considered to be an interface residue if the distance between any of its heavy atoms and any heavy atom of its interacting chain is <5 Å (Chen and Zhou, 2005; Koike and Takagi, 2004; Zhou and Shan, 2001). For a PDB structure with more than two chains, each chain was selected at most once. For a protein chain that interacts with multiple partners, only the partner with the most interfacial residues was selected. Finally, a total of 15 264 chain pairs were selected.
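The interface criterion above can be sketched in a few lines of Python. This is a minimal illustration, assuming heavy-atom coordinates have already been parsed from a PDB file into plain lists; the function name `interface_residues`, the data layout and the toy coordinates are hypothetical and not part of the original pipeline.

```python
import math

CUTOFF = 5.0  # distance threshold in angstroms, as stated in the text


def interface_residues(chain, partner, cutoff=CUTOFF):
    """Return indices of residues in `chain` that contact `partner`.

    Each chain is a list of residues; each residue is a list of
    (x, y, z) heavy-atom coordinates (a hypothetical layout).
    """
    partner_atoms = [atom for residue in partner for atom in residue]
    hits = []
    for index, residue in enumerate(chain):
        # A residue counts as interface if any of its heavy atoms lies
        # closer than `cutoff` to any heavy atom of the partner chain.
        if any(math.dist(a, b) < cutoff for a in residue for b in partner_atoms):
            hits.append(index)
    return hits


# Toy example: two residues in chain A, one residue in chain B.
chain_a = [[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
chain_b = [[(4.0, 0.0, 0.0)]]
print(interface_residues(chain_a, chain_b))
```

The all-pairs loop is quadratic in the number of atoms; for real structures a spatial index or grid would normally replace it.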
In order to get nonredundant protein chains of hetero complexes, we adopted the method of Chung et al. (2006). All selected chains were compared using BLAST (McGinnis and Madden, 2004). Two chains were assigned to the same cluster if (i) over 90% of their sequences were aligned and (ii) the sequence identity was at least 30%. All of the above chains were clustered in this way, and one representative chain of each cluster was selected. Hetero complexes with longer chains were selected in this study. Two interacting protein chains were defined as a homo complex if >90% of them were aligned and the sequence identity over the aligned region was >95% (Chen and Zhou, 2005). Thus 1276 chains (312 858 residues) were selected as nonredundant protein chains of hetero complexes.
Surface residues were defined as residues whose relative solvent accessible surface area is at least 15% (Chung et al., 2006; Rost and Sander, 1994). The solvent accessible surface area (ASA) of each residue was calculated using the DSSP program (Kabsch and Sander, 1983). A total of 200 482 residues (about 64.1%) were collected as surface residues from all these chains. A protein chain within a complex of more than one chain may form more than one interface. Among these interfaces, there is generally one large main interface, while residues in the other, minor interfaces can be treated as interface residues, treated as noninterface residues, or excluded from the data set. In our experiment, we considered all three cases and generated three types of data set (Types I, II and III). Their statistics are tabulated in Table 1.
Surface residue sequence segments were then collected. A surface residue sequence segment is a run of sequentially contiguous residues, all of which are surface residues. Each residue within a segment was labeled as an interface or noninterface residue. These segments were used to train and test CRFs.
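The segment extraction described above can be sketched as follows. This is a minimal illustration, assuming per-residue relative ASA values and interface labels are already available as Python lists; the function name `surface_segments` and the toy values are hypothetical.

```python
SURFACE_THRESHOLD = 0.15  # relative ASA cutoff (15%) from the text


def surface_segments(relative_asa, labels, threshold=SURFACE_THRESHOLD):
    """Split a chain into maximal runs of consecutive surface residues.

    relative_asa: per-residue relative accessible surface area (0..1).
    labels: per-residue label, 'I' (interface) or 'N' (noninterface).
    Returns a list of segments; each segment is a list of
    (residue_index, label) pairs.
    """
    segments, current = [], []
    for index, (asa, label) in enumerate(zip(relative_asa, labels)):
        if asa >= threshold:
            current.append((index, label))
        elif current:
            # A buried residue ends the current surface segment.
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


asa = [0.40, 0.22, 0.05, 0.60, 0.10, 0.15]
lab = ["N", "I", "N", "I", "N", "N"]
for segment in surface_segments(asa, lab):
    print(segment)
```

Each returned segment then serves as one training or test sequence for the sequence labeler.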
Because the training set contains more noninterface residues than interface residues, many classifiers such as SVMs and ANN obtain higher precision and lower recall (Chen and Zhou, 2005; Chung et al., 2006; Koike and Takagi, 2004). These researchers therefore used trimmed data sets, in which the ratio of positive to negative examples is set to about 1:1. To evaluate the robustness and performance of different methods, we conducted experiments on both the complete and the trimmed data sets for all three data types. The left dashed-line rounded rectangle in Figure 1 illustrates the process of data preparation.
2.2 Conditional random fields used for labeling sequence data
In order to predict protein–protein interaction sites, we address this problem as a sequence labeling task. Protein surface residues were extracted and the surface residue segments were treated as sequence data. Residues on surface segments were labeled as interface or noninterface residues using CRFs.
Conditional random fields (CRFs) were proposed by Lafferty et al. (2001) for labeling sequence data. Given a sequence of observations X = (x1, x2, ..., xn), we want to find the most probable label sequence Y = (y1, y2, ..., yn), i.e. Y* = arg maxY P(Y | X). CRFs are undirected graphical models (as opposed to directed graphical models such as HMMs), and the conditional probability P(Y | X) is computed directly. Figure 2 shows the structures of CRFs, HMMs and ME. Both CRFs and HMMs are suited to labeling sequences, but they differ in how the probability is formulated. HMMs obtain the target label sequence Y by maximizing the joint probability of X and Y (Rabiner, 1989), but HMMs cannot use long-distance features, which limits their applicability. CRFs are exponential (log-linear) models that can use any kind of feature. By the fundamental theorem of random fields (Lafferty et al., 2001), the distribution over the label sequence Y given X can be written as the following conditional probability:
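$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j\, t_j(y_{i-1}, y_i, X, i) + \sum_{i=1}^{n} \sum_{j} \mu_j\, s_j(y_i, X, i) \right) \tag{1}$$

where tj is a transition feature function and sj is a state feature function. The parameters λj and µj correspond to features tj and sj, respectively, and they are learned by maximizing the conditional likelihood of the training data. Z(X) is a normalization factor. More details about CRFs can be found in Lafferty et al. (2001).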
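To make the roles of the transition weights, the state weights and Z(X) concrete, the following toy Python sketch evaluates P(Y | X) for a three-residue segment by enumerating every label sequence. The weights, the single real-valued observation per residue and all names are hypothetical, chosen only for illustration; they are not the features or parameters learned in this study.

```python
import itertools
import math

LABELS = ("I", "N")

# Hypothetical weights, chosen only for illustration.
TRANSITION_WEIGHT = {("I", "I"): 0.8, ("I", "N"): -0.4,
                     ("N", "I"): -0.4, ("N", "N"): 0.6}
STATE_WEIGHT = {"I": 1.5, "N": -1.5}  # multiplies a real-valued observation


def score(labels, observations):
    """Unnormalized log-score: weighted state plus transition features."""
    total = 0.0
    for i, (label, x) in enumerate(zip(labels, observations)):
        total += STATE_WEIGHT[label] * x  # state term, mu * s(y_i, X, i)
        if i > 0:
            # transition term, lambda * t(y_{i-1}, y_i, X, i)
            total += TRANSITION_WEIGHT[(labels[i - 1], label)]
    return total


def conditional_distribution(observations):
    """Return P(Y | X) for every label sequence by explicit enumeration."""
    sequences = list(itertools.product(LABELS, repeat=len(observations)))
    weights = [math.exp(score(seq, observations)) for seq in sequences]
    z = sum(weights)  # normalization factor Z(X)
    return {seq: w / z for seq, w in zip(sequences, weights)}


x = [0.9, 0.7, -0.8]  # toy per-residue evidence for "interface"
dist = conditional_distribution(x)
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

Enumeration grows exponentially with segment length and is used here only for clarity; practical CRF toolkits compute Z(X) with the forward algorithm and find the best labeling with Viterbi decoding.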
2.3 Prediction of protein–protein interaction sites based on CRFs
Here, sequence segments on the protein surface are labeled by CRFs. The label set for residues is L = {I, N}, where I represents an interface residue and N represents a noninterface residue. Given a segment X = (x1, x2, ..., xn), the most probable label sequence Y = (y1, y2, ..., yn) (yi ∈ L) is obtained using CRFs.
2.4 Definition of features
The features for CRFs include transition and state features. We define several types of state features based on the features most commonly used by other researchers. Two kinds of state features, the profile of spatially neighboring residues and accessible surface area, are taken as basic features for CRFs. Residue conservation is taken as an extended feature to test its effectiveness in CRFs.
2.4.1 Transition feature
A transition feature is defined for each label pair (y and y' ∈ L) as follows:
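$$t_{y',y}(y_{i-1}, y_i, X, i) = \begin{cases} 1 & \text{if } y_{i-1} = y' \text{ and } y_i = y \\ 0 & \text{otherwise} \end{cases} \tag{2}$$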
2.4.2 Profile feature of spatially neighboring residues
The profile feature of spatially neighboring residues was taken from the multiple sequence alignment obtained from three iterations of PSI-BLAST searching against the NCBI nonredundant database (NR, April 2006 release) with E-value = 0.001 and H = 0.001 (Altschul et al., 1997). For each labeled residue, its profile features were taken from the profiles of the 15 spatially nearest residues (including the labeled residue itself). The profile value x was scaled to the [0, 1] range using the following function (Kim and Park, 2003):
|
| (3) |
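A minimal Python sketch of such a scaling is given below. It assumes a piecewise-linear form that clips raw profile scores at ±5 and maps the interval in between linearly onto [0, 1]; the breakpoints, the slope and the function name `scale_profile_value` are assumptions made for illustration and are not confirmed by the text above.

```python
def scale_profile_value(x):
    """Map a raw profile score to [0, 1] (assumed piecewise-linear form)."""
    if x <= -5:
        return 0.0  # strongly negative scores saturate at 0
    if x >= 5:
        return 1.0  # strongly positive scores saturate at 1
    return 0.5 + 0.1 * x  # linear in between, with 0 mapped to 0.5


for raw in (-7, -5, 0, 2, 5, 9):
    print(raw, scale_profile_value(raw))
```

Under this assumption a score of 0 maps to the midpoint of the range, and scores beyond the breakpoints carry no additional information.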
The spatially neighboring residue profile feature is defined for each label–amino acid pair (y ∈ L and aa ∈ amino acid alphabet) as:
|
| (4) |
2.4.3 ASA feature
Accessible surface area (ASA) feature represents the relative accessible surface area (scaled by the nominal maximum area of each residue). For convenience, we use ASA to represent the relative accessible surface area of residues.
|
| (5) |
2.4.4 Residue conservation feature
The residue conservation feature represents the degree of evolutionary conservation at each residue position and was obtained from the conservation score in the ConSurf-HSSP database (Glaser et al., 2005). This score is based on the relative entropy and correlates with the functional importance of the position. According to the conservation score, the residues were classified into nine categories of conservation (from grade 1 to grade 9). The residue conservation feature is expressed as the conservation grade divided by 10:
|
| (6) |
2.4.5 Summary of state feature set
The right dashed-line rounded rectangle in Figure 1 illustrates the process of feature extraction. Table 2 gives the feature type and corresponding dimensions.
2.5 Implementation of conditional random fields
FlexCRFs is a conditional random field toolkit for segmenting and labeling sequence data (Phan and Nguyen, 2005). The current version of FlexCRFs cannot handle continuous real-valued features, so we modified it to support them. In this study, we adopted first-order Markov CRFs. The parameter init_lambda_val was set to 0.05 and the other parameters were left at their defaults. Figure 1 illustrates the whole implementation of our protein interaction labeling system based on CRFs.
| 3 RESULTS AND DISCUSSION |
|---|
3.1 Cross-validation and scoring
The performance of each method is measured using 3-fold cross-validation. The whole data set (hetero-complex chains) was randomly divided into three subsets with an equal number of chains. Each method was trained and tested three times with three different training and test sets. Each time, two subsets were used as training data and the remaining subset was used as test data.
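The chain-level split described above can be sketched as follows. This is a minimal illustration in Python; the function name `three_fold_splits`, the fixed random seed and the placeholder chain identifiers are hypothetical. Since 1276 is not divisible by three, the subsets in this sketch differ in size by at most one chain.

```python
import random


def three_fold_splits(chain_ids, seed=0):
    """Yield (train, test) lists of chain identifiers for 3-fold CV."""
    shuffled = list(chain_ids)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[k::3] for k in range(3)]  # near-equal sized subsets
    for k in range(3):
        test = folds[k]
        # The two remaining subsets together form the training set.
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train, test


chains = [f"chain{n:04d}" for n in range(1276)]  # placeholder identifiers
for train, test in three_fold_splits(chains):
    print(len(train), len(test))
```

Splitting by chain rather than by residue keeps all residues of one chain on the same side of the split.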
All methods are measured according to the evaluation of residue labeling (or classification) based on the following quantities:
- TP is the number of true positives which are residues correctly classified as interface residues;
- TN is the number of true negatives which are residues correctly classified as noninterface residues;
- FP is the number of false positives which are noninterface residues incorrectly classified as interface residues;
- FN is the number of false negatives which are interface residues incorrectly classified as noninterface residues.
Then we used the following measures to evaluate the labeling (and classification) performance:
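$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{8}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

$$CC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{11}$$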
Precision, recall and F1 measure the performance of labeling or classifying interface residues, while accuracy measures the performance of labeling or classifying the whole test data set. The correlation coefficient (CC) measures the correlation between the predictions and the actual test data.
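The five measures can be computed directly from the four counts, as in the minimal Python sketch below. It uses the standard definitions of precision, recall, F1 and accuracy; treating CC as the Matthews correlation coefficient is an assumption, and the function name `evaluate` and the example counts are hypothetical.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute the five measures from the confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # CC is taken here to be the Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "cc": cc}


# Hypothetical counts, for illustration only.
for name, value in evaluate(tp=40, tn=120, fp=20, fn=20).items():
    print(f"{name}: {value:.3f}")
```

The guards against zero denominators matter in practice: a classifier that predicts no interface residues at all has undefined precision and CC, which this sketch reports as 0.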
3.2 Performance of CRFs versus other classification methods
Support vector machines (SVMs), artificial neural networks (ANN) and the maximum entropy model (ME) were selected for comparison with our method. All of them are discriminative classification methods. SVMs and ANN are state-of-the-art methods for predicting protein–protein interaction sites (Chen and Zhou, 2005; Chung et al., 2006; Fariselli et al., 2002; Koike and Takagi, 2004; Res et al., 2005; Zhou and Shan, 2001), and CRFs are an extension of ME (Lafferty et al., 2001; Sutton and McCallum, 2006). LIBSVM (Chang and Lin, 2001) was used as the SVM implementation with a radial basis function kernel. The kernel parameter γ and the regularization parameter C were set to 0.1 and 10, respectively. The Neural Network Toolbox in Matlab was used as the ANN implementation, with a feed-forward, back-propagation neural network (Chen and Zhou, 2005). The neural network contained an input layer with 21 × 15 nodes, a hidden layer with 20 nodes and an output layer with two nodes. The ME implementation of Zhang was used; it can be downloaded freely from http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html.
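For readers unfamiliar with the radial basis function kernel, the short Python sketch below evaluates it in the form K(u, v) = exp(−γ‖u − v‖²) used by LIBSVM. It assumes that the value 0.1 reported above is the kernel parameter usually written γ; the function name `rbf_kernel` and the toy vectors are hypothetical.

```python
import math

GAMMA = 0.1  # assumed to be the kernel parameter reported in the text
C = 10       # regularization parameter from the text (not used by the kernel itself)


def rbf_kernel(u, v, gamma=GAMMA):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


print(rbf_kernel([0.2, 0.5, 1.0], [0.2, 0.5, 1.0]))  # identical vectors
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))            # squared distance 25
```

A larger γ makes the kernel decay faster with distance, so each training residue influences a smaller neighborhood in feature space; C then controls how strongly training errors are penalized.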
First, we tested these methods on the basic feature set: the profile of spatially neighboring residues and the ASA feature. We tested them on six data sets, and the evaluation results are tabulated in Table 3. On the three complete data sets, CRFs perform best according to the F1-measure, which shows that CRFs automatically obtain a better trade-off between precision and recall. The other methods suffer greatly from the unbalanced training data and obtain higher precision and lower recall on the complete data sets, which agrees with the results of Chen and Zhou (2005). The CRFs-based method is more robust with respect to the ratio between positive and negative examples in the training set.
On the three trimmed data sets, the performance of CRFs is second only to that of SVMs, which perform best according to F1-measure and CC. Removing some noninterface residues from the training set (as in the trimmed data sets) reduces the performance of CRFs, since the removed residues still contain useful information for predicting interaction sites. We discuss this phenomenon in the following section.
Both CRFs and ME are exponential models based on the maximum entropy principle. The results show that CRFs greatly outperform ME on most data sets, which indicates that CRFs are more suitable than ME for labeling protein interaction sites. ANN performs worst in our experiments.
3.3 The effect of different ratio of positive and negative examples for CRFs and SVMs
We generated a series of training sets by randomly removing different numbers of negative examples from the original Complete Type I data set. Figure 3 shows how the F1-measure and CC change with the ratio of positive to negative examples (Pos/Neg). The performance of CRFs is stable when Pos/Neg is between 0.3 and 0.7, and CRFs achieve their best performance when Pos/Neg is about 0.4. This means that CRFs obtain their best performance when only a few negative examples are removed. When Pos/Neg is >0.7, the CC of CRFs declines. SVMs cannot identify any interaction sites when Pos/Neg is <0.4, so SVMs are more sensitive to the Pos/Neg ratio than CRFs. This experiment was carried out only on the Type I data set; results on the other two data sets may differ.
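Generating a training set with a chosen Pos/Neg ratio amounts to random undersampling of the negative class. The Python sketch below is a minimal illustration, assuming examples are stored as (features, label) pairs; the function name `subsample_negatives`, the seed and the toy data are hypothetical.

```python
import random


def subsample_negatives(examples, target_ratio, seed=0):
    """Randomly drop negatives until Pos/Neg is about `target_ratio`.

    examples: list of (features, label) pairs with label 'I' or 'N'.
    Returns a new list; all positives are kept.
    """
    positives = [e for e in examples if e[1] == "I"]
    negatives = [e for e in examples if e[1] == "N"]
    # Number of negatives needed so that len(positives) / keep ~ target_ratio.
    keep = min(len(negatives), round(len(positives) / target_ratio))
    kept = random.Random(seed).sample(negatives, keep)
    return positives + kept


data = [(i, "I") for i in range(30)] + [(i, "N") for i in range(100)]
for ratio in (0.3, 0.4, 0.7, 1.0):
    subset = subsample_negatives(data, ratio)
    n_neg = sum(1 for _, label in subset if label == "N")
    print(ratio, 30, n_neg)
```

With a ratio of 1.0 this reduces to the 1:1 trimming used for the trimmed data sets.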
3.4 Some predicted examples by CRFs and SVMs
We give some examples predicted by SVMs and CRFs trained on the trimmed Type I data set. The first example is the SC SMC1HD:SCC1-C complex (Haering et al., 2004). The kleisin chain, the conserved region of a Rad21/Rec8-like protein, has 22 residues located on the interface with its partner according to the above definition of interface residue (Fig. 4b). The CRFs predict 27 residues to be interface residues, covering 20 of the actual interfacial residues (recall: 91%, precision: 74%) (Fig. 4a). The SVMs predict 21 residues to be interface residues, covering 13 of the actual interfacial residues (recall: 59%, precision: 62%) (Fig. 4c). Most of the false positives from SVMs lie outside the actual interface, i.e. within the green circle in Figure 4c. CRFs successfully distinguish interface and noninterface residues for this protein.
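The recall and precision values quoted for this example follow directly from the residue counts, as the short Python check below shows. The counts are those given in the text; the helper name `recall_precision` is hypothetical.

```python
def recall_precision(predicted, correct, actual):
    """Return (recall, precision) as percentages."""
    return 100 * correct / actual, 100 * correct / predicted


# Counts quoted in the text for the SC SMC1HD:SCC1-C example
# (22 actual interface residues on the kleisin chain).
for method, predicted, correct in (("CRFs", 27, 20), ("SVMs", 21, 13)):
    recall, precision = recall_precision(predicted, correct, actual=22)
    print(f"{method}: recall {recall:.0f}%, precision {precision:.0f}%")
```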
The second example is the 30S ribosomal subunit, a complex of 20 polypeptide chains with a 1522-nucleotide 16S RNA (Carter et al., 2000). The S6 chain is in our data set, and we studied the interface between S6 and S18. The prediction results are shown in Figure 5. The interface residues of S6 (binding with S18) are concentrated in its hollow (Fig. 5b). This interface region is accurately identified by CRFs, covering about 86% of the actual binding site with a precision of 73% (Fig. 5a). The prediction by SVMs covers only 68% of the actual binding site with a precision of 56% and includes an erroneous region far away from the binding site, i.e. the residues within the green circle of Figure 5c.
The last example is the complex of streptococcal pyrogenic exotoxin C (SpeC) with a human T cell receptor beta chain (Sundberg et al., 2002). There are 17 residues located on the interface (Fig. 6b). CRFs label the majority of these residues, with a coverage of 65% (Fig. 6a), while SVMs correctly label only 4 interface residues, a coverage of only 23.5% (Fig. 6c). Clearly, it is difficult for SVMs to characterize the interfacial features in this case.
3.5 Test CRFs on extended feature
We added residue conservation features to the CRFs method, which was again trained on the Type I data set. These features are obtained from the conservation score in the ConSurf-HSSP database (Glaser et al., 2005) and differ from those of Chung et al. (2006) and Res et al. (2005). The experimental results are tabulated in Table 4, from which we can see that the CC of CRFs-2 decreases on both data types. According to our experimental results, adding these features to CRFs does not improve performance.
| 4 CONCLUSION AND FUTURE WORK |
|---|
Protein–protein interaction site prediction is tackled here as a sequence labeling problem using conditional random fields, in contrast to conventional classification-based methods. The features used for conditional random fields include the sequence profile and residue accessible surface area of spatially neighboring residues. Comparative experiments between the CRFs-based method and other classification-based methods, including SVMs, ANN and ME, on 1276 nonredundant chains of hetero complexes show that the CRFs-based method achieves the best performance on the complete data sets. On the trimmed data sets, the performance of CRFs is comparable with state-of-the-art methods, such as ANN and SVMs. The CRFs method is more robust than conventional classification methods when the data sets have different ratios of positive and negative examples. Our study indicates the feasibility of using CRFs to predict protein–protein interaction sites and to guide specific experiments for biologists.
In our experiment, the residue conservation feature did not contribute to the performance of CRFs, which shows that simply adding this feature to CRFs is not suitable for this task. Choosing proper features is challenging, and we will investigate more effective features in the future. Information about the binding partner chains will also be considered in our future work.
| ACKNOWLEDGEMENTS |
|---|
The authors would like to thank Cheng-Jie Sun and Lu Li for their support during the research, and the reviewers for their valuable comments. Special thanks go to Xian-Fang Wen for his suggestions about English writing. Thanks also go to Xuan-Hieu Phan from the Japan Advanced Institute of Science and Technology for providing the original version of the FlexCRFs source code, Dr. Chih-Jen Lin from National Taiwan University for providing the LIBSVM tool, and Le Zhang from the University of Edinburgh for providing the Maximum Entropy Modeling Toolkit. This research work is funded by the National Natural Science Foundation of China (60673019).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on September 11, 2006; revised on December 3, 2006; accepted on December 20, 2006
| REFERENCES |
|---|
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25: 3389–3402.
Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. (2000) 28: 235–242.
Bradford JR, Westhead DR. Improved prediction of protein–protein binding sites using a support vector machines approach. Bioinformatics (2005) 21: 1487–1494.
Carter AP, et al. Functional insights from the structure of the 30S ribosomal subunit and its interactions with antibiotics. Nature (2000) 407: 340–348.
Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chen H, Zhou H-X. Prediction of interface residues in protein–protein complexes by a consensus neural network method: test against NMR data. Proteins Struct. Funct. Bioinfo. (2005) 61: 21–35.
Chinchor N. MUC-7 Named Entity Task Definition. In: Proc. of the Seventh Message Understanding Conference (1998).
Chung J-L, et al. Exploiting sequence and structure homologs to identify protein–protein binding sites. Proteins Struct. Funct. Bioinfo. (2006) 63: 630–640.
Dominguez C, et al. HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. (2003) 125: 1731–1737.
Fariselli P, et al. Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem. (2002) 269: 1356–1361.
Freitag D, McCallum A. Information extraction with HMM structures learned by stochastic optimization. Proc. of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence (2000) 584–589.
Glaser F, et al. The ConSurf-HSSP database: the mapping of evolutionary conservation among homologs onto PDB structures. Proteins Struct. Funct. Bioinfo. (2005) 58: 610–617.
Haering CH, et al. Structure and stability of cohesin's Smc1-kleisin interaction. Mol. Cell (2004) 15: 951–964.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers (1983) 22: 2577–2637.
Kim H, Park H. Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng. Des. Sel. (2003) 16: 553–560.
Koike A, Takagi T. Prediction of protein–protein interaction sites using support vector machines. Protein Eng. Des. Sel. (2004) 17: 165–173.
Lafferty J, et al. Conditional random fields: probabilistic models for segmenting and labeling sequence data. 18th International Conference on Machine Learning (ICML) (2001) 282–289.
Letovsky S, Kasif S. Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics (2003) 19: i197–i204.
Liang S, et al. Protein binding site prediction using an empirical scoring function. Nucleic Acids Res. (2006) 34: 3698–3707.
Lichtarge O, et al. Evolutionary traces of functional surfaces along G protein signaling pathway. Methods Enzymol. (2002) 344: 536–556.
Liu Y, et al. Comparison of probabilistic combination methods for protein secondary structure prediction. Bioinformatics (2004) 20: 3099–3107.
Liu Y, et al. Segmentation conditional random fields (SCRFs): a new approach for protein fold recognition. ACM International Conference on Research in Computational Molecular Biology (RECOMB05) (2005) 408–422.
McCallum A, et al. Maximum entropy Markov models for information extraction and segmentation. Proc. of the Seventeenth International Conference on Machine Learning (2000) 591–598.
McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. (2004) 32: W20–W25.
Nabieva E, et al. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics (2005) 21: i302–i310.
Ofran Y, Rost B. Predicted protein–protein interaction sites from local sequence information. FEBS Letters (2003) 544: 236–239.
Phan X-H, Nguyen L-M. FlexCRFs: flexible conditional random field toolkit. (2005) http://www.jaist.ac.jp/~hieuxuan/flexcrfs/flexcrfs.html.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE (1989) 77: 257–286.
Ratnaparkhi A. A maximum entropy model for part-of-speech tagging. Proc. of the Conference on Empirical Methods in Natural Language Processing (1996).
Res I, et al. An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics (2005) 21: 2496–2501.
Rosenfeld R. A maximum entropy approach to adaptive statistical language modeling. Computer, Speech and Language (1996) 10: 187–228.
Rost B, Sander C. Conservation and prediction of solvent accessibility in protein families. Proteins Struct. Funct. Gen. (1994) 20: 216–226.
Sowa ME, et al. Prediction and confirmation of a site critical for effector regulation of RGS domain activity. Nat. Struct. Biol. (2001) 8: 234–237.
Sundberg EJ, et al. Structures of two streptococcal superantigens bound to TCR beta chains reveal diversity in the architecture of T cell signaling complexes. Structure (2002) 10: 687–699.
Sutton C, McCallum A. An introduction to conditional random fields for relational learning. In: Getoor L, Taskar B, eds. Introduction to Statistical Relational Learning. (2006) MIT Press, Cambridge, Massachusetts, USA.
Wang B, et al. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Letters (2006) 580: 380–384.
Zhou H-X. Improving the understanding of human genetic diseases through predictions of protein structures and protein–protein interaction sites. Curr. Med. Chem. (2004) 11: 539–549.
Zhou H-X, Shan Y. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins Struct. Funct. Gen. (2001) 44: 336–343.