Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(3):314-324; doi:10.1093/bioinformatics/bti019
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
Discovery of stable and significant binding motif pairs from PDB complexes and protein interaction datasets
1 Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613
2 School of Computing, National University of Singapore Singapore 119260
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Discovery of binding sites is important in the study of proteinprotein interactions. In this paper, we introduce stable and significant motif pairs to model protein-binding sites. The stability is the patterns resistance to some transformation. The significance is the unexpected frequency of occurrence of the pattern in a sequence dataset comprising known interacting protein pairs. Discovery of stable motif pairs is an iterative process, undergoing a chain of changing but converging patterns. Determining the starting point for such a chain is an interesting problem. We use a protein complex dataset extracted from the Protein Data Bank to help in identifying those starting points, so that the computational complexity of the problem is much released.
Results: We found 913 stable motif pairs, of which 765 are significant. We evaluated these motif pairs using comprehensive comparison results against random patterns. Wet-experimentally discovered motifs reported in the literature were also used to confirm the effectiveness of our method.
Contact haiquan{at}i2r.a-star.edu.sg
Supplementary information http://sdmc.i2r.a-star.edu.sg/BindingMotifPairs
| INTRODUCTION |
|---|
Proteinprotein interactions play an important role in many biological processes such as for intercellular communication, for signal transduction and for the regulation of gene expressions. Binding sites are crucial clues to unraveling proteinprotein interactions. The discovery of binding sites is also useful for the prediction of unknown proteinprotein interactions, for the library design of phage display (Smith, 1985) and for the drug design as targets in proteomics.
The discovery of binding sites can be categorized into two different approaches. One is focused on the single sides of the binding sites, whereas the other emphasizes on the cooperation of both sides. We are interested in the second approach and call it the binding-pair approach. Both experimental and computational methods can deduct binding pairs. Experimental methods include those for analyzing protein complexes (Josephson et al., 2001) phage display (Smith, 1985; Rodi et al., 2001; Sidhu et al., 2003) and mutagenesis (Botstein and Shortle, 1985; Clemmons, 2001). These methods usually lead to relatively high accuracy, but they are time-consuming and cost-expensive. As complementary methods, the computational methods are fast and economical to narrow down the search space.
Current computational methods for discovering binding pairs are mainly concentrated on domaindomain interactions. Sprinzak and Margalit, (2001) first termed domaindomain interactions as correlated sequence-signatures. Deng et al. (2002) applied a maximum-likelihood method to statistically estimate domaindomain interactions. Ng et al. (2003) used an integrative score system to deduct domaindomain interactions. All these methods stand at the domain level and study only predefined domains. Note that domains themselves may not be binding sites. For example, they can be folding determinants instead. In addition, most domains are lengthy segments of residues, where only a part of them are contained in the binding sites. Therefore, how to pinpoint those specific regions that are really involved in binding behavior became our research interest.
We studied a simple type of binding pairs where each side of a binding site consists of a short sequence of continuous residues and where the two sides approach spatially with each other closely. We call these as short sequences motifs and the binding sites binding motif pairs. The notion of binding motif pairs was first proposed in our previous work (Li et al., 2004), where we also presented a combination approach to the efficient discovery of binding motif pairs. The discovery consists of two main steps. In the first step, we used an efficient computational algorithm to discover so-called maximal contact segment pairs from a protein complex structural dataset. In the second step, we used these segment pairs as templates to subgroup an interacting protein sequence dataset and to generate starting motif pairs. Then, we applied an iterative refinement to these starting motif pairs to find interesting motif pairs based on a stopping criterion called emerging significance, which is defined as the occurrence ratio of the binding motif pairs between a positive interaction dataset and a negative dataset. Note that the definition of emerging significance is based on a negative dataset. However, there are currently no large experimentally validated negative interaction dataset. Hence, the effectiveness of this method would be greatly influenced by the putative negative interaction dataset with a lot of false negatives. To avoid this weakness, in this paper, we introduce some new ideas to discover binding motif pairs from protein interaction datasets.
More specifically, we require our binding motif pairs to be stable. The notion of stable motif pairs is rooted in the fact that many biological phenomenon exist in stable status, and this stable status might be evolved from their past unstable status. Therefore, from a certain starting point to the stable point, mathematically it is a chain of changing but converging patterns. Second, we require our binding motif pairs to be significant. We propose statistical measurements to evaluate the significance not only for single motifs but more importantly for their co-occurrence as pairs. By significance, we mean that their observations or supports should be much higher than their random expectations. Combining these two ideas, our binding motif pairs are mathematically stable and statistically significant.
The discovery of all stable and significant motif pairs is a challenging problem as the number of candidates is huge. In this paper, we narrow down our search space by looking only for a special subset of stable and significant motif pairs. With the help of the same algorithm as the one proposed in our previous work (Li et al., 2004), the starting points of this subset of stable motif pairs are also derived from a protein complex dataset that is known to contain the most biologically reliable data about protein binding sites. In this manner, the binding motif pairs aimed to discover possess high confidence because of the biological support from the complex data. This is strongly confirmed by our comprehensive comparison experiments with random patterns and by wet-experimentally discovered binding motifs reported in the literature. Figure 1 relates stable motif pairs and significant motif pairs, and locates where the motif pairs that we are most interested in.
|
In the next section, we describe two datasets and define basic notations. Then we explain stable motif pairs, significant motif pairs and starting motif pairs in a formal manner with the help of the three sections in this paper. Finally, we report our discoveries and evaluations.
| DATA AND BASIC NOTATIONS |
|---|
We use two interaction datasets in this paper: a sequence dataset of interacting protein pairs collected by von Mering et al. (2002) and a protein complex dataset derived from the Protein Data Bank PDB (http://www.rcsb.org/pdb/). The sequence dataset consists of 78 390 non-redundant interactions, containing almost all the latest interacting protein pairs in yeast genome produced by various experimental and high-confident computational methods. The protein complex dataset was generated from the PDB on June 9, 2003, containing 1533 entries that have at least two chains, by using online search tools in the PDB-REPRDB (http://mbs.cbrc.jp/pdbreprdb-cgi//reprdb_query.pl). In this complex dataset, the maximum pairwise sequence identity between any two complexes is 30% and each complex has a structure resolution of 2.0 or higher. In this study, the complex dataset is first used to generate starting points for stable motif pairs, then the interacting sequence dataset is used to transform those starting motif pairs so as to output a set of stable and significant binding motif pairs.
The following are basic notations that are frequently used in this paper.
|
More formally, a protein P is denoted by a 1 a 2 ... a l , where a i
and l > 0. A motif M is denoted by
1
2...
k , where
i
and k > 0. For example, M = {E, K, N}{P}
{D, E}. (Traditionally, it is also written as M = [EKN]Px[DE].) A protein P contains a motif M, denoted M
P, if there exists a continuous segment in P of length k, denoted a
a (
+1) ... a (
+k1), such that a j
j ,
j
(
+ k 1). PrtnDB is a set of m proteins and is denoted {P i ,i = 1, ..., m}. The sequence dataset
of n interacting protein pairs is denoted using
, where
and
have interactions.
| STABLE MOTIF PAIRS |
|---|
In this section, we introduce the new concept of stable motif pairs. This notion is in light of evolution principles: a stable motif pair is evolved from its neighboring motif pairs and it should maintain such a status for a long time. We emulate such an evolution using a function f defined in accordance with a widely accepted concept called consensus discovery. By this function, only strong residue signals are conserved in heritage but weak ones are filtered out in the transformation of motif pairs.
Given a motif pair, our algorithm for the consensus discovery is to find a consensus pattern from the cluster of this motif pair.
DEFINITION 1.
[Cluster of a motif pair] Let MPr be a motif pair and
be a sequence dataset of interacting protein pairs. The cluster of MPr in
, denoted Cluster(MPr,
), is a subset of
such that for every PPr in this subset, PPr contains MPr. That is, Cluster(MPr,
) = {PPri
| MPr
PPri }, or denoted Cluster(MPr) simply when
is understood.
where, a protein pair PPr = P 1, P 2} contains MPr = M L ,M R if (ML
P 1
MR
P 2)
(ML
P 2 ^ MR
P 1). This is also denoted by MPr
PPr.
To find the consensus pattern MPr' from the cluster of a given motif pair MPr = M L , {M R }, we use the following method.
- Split the cluster vertically into two semi-clusters: Cluster L and Cluster R
-
,
.
-
- Align all the occurrences in Cluster L or in Cluster R according to the motif M L or M R , respectively.
- Find a consensus motif M' L or M' R respectively from the two alignments by extracting all those residues in each column of an alignment whose occurrence rate is larger than a threshold (20% in this paper).
- Combine M' L and M' R into a motif pair MPr' = {M' L ,M' R }, then it is the transformed motif pair of MPr.
Table 1 gives an example, showing the cluster of a motif pair {AGGG[IY], [FV]G[EK][AE][ENS][IL]A} in a sequence dataset
used in von Mering et al. (2002). Observe that this cluster consists of seven interacting protein pairs, which is indeed a non-empty subset of
. Columns 2 and 3 of Table 1 list the alignment for the two semi-clusters of the motif pair {AGGG[IY], [FV]G[EK][AE][ENS][IL]A}. The consensus motifs (M' L or M' R }) are derived separately from these two semi-clusters. The consensus pattern (the motif pair {M' L , M' R }) is listed in the second last row of the table.
|
Alternatively, we can use other algorithms such as EMOTIF (Nevill-Manning et al., 1998) to discover consensus patterns.
From the consensus discovery, we can see that MPr' is a transformation of MPr. We use function f to describe this transformation process. Therefore, finding the consensus pattern MPr' from the cluster of a given motif pair MPr can be denoted by f(MPr) = MPr'.
DEFINITION 2.
[Stable motif pairs] A motif pair MPr is stable if
![]() |
Mathematically, this definition follows the Brouwers Fixed Point Theorem (Mohamed and William, 2001). That is, a thing X, after a transformation f, is still the same thing X, denoted f(X) = X. Some basic biological phenomenon can be interpreted as fixed points. For example, the DNA of a cell can be split into two cells with the same DNA after self-replicating where X is the DNA and f is the laws of Physics and Chemistry applied to the DNA. As another example, some C 2 H 2 zinc-finger genes can be translated into the same type of protein after frameshifts (Meng et al., 2004). Here, X is the protein type and the f is the frameshifting.
Starting from any motif pair MPr, it is possible to find a stable motif pair. Suppose f(MPr) = MPr (1). Then we apply f to MPr (1) and get MPr (2). Iteratively, it is possible to get f(MPr (i)) = MPr (i). If MPr (i) is non-empty, then it is called a stable motif pair. The whole process is called refinement. Table 2 shows an example of such refinement from a starting motif pair. We can prove that the refinement from any motif pair converges to either a stable motif pair or an empty pattern.
|
| SIGNIFICANT MOTIF PAIRS AND THEIR EFFICIENT COMPUTATION |
|---|
As discussed below, not all stable motif pairs are statistically significant. Therefore, we introduce significant motif pairs in this section to capture more information for binding motif pairs. A significant motif pair requires that the two motifs in the pair must be significant as well. The significance is statistically evaluated against randomness. We begin with definitions for the absolute support and statistical score of single motifs and their efficient computation. Then, we explain significant motif pairs and give efficient methods to compute their significance indices.
DEFINITION 3.
[Support for a motif] The absolute support of a motif M in PrtnDB is the number of proteins in PrtnDB that contain M, denoted by
(M, PrtnDB) = |{P i
PrtnDB|M
P i |, or simply denoted by
(M).
The Z-score measurement is widely used to evaluate the significance of single motifs (Atteson, 1998). The Z-score of a motif M is defined as
![]() |
where exp(M, PrtnDB) is the expectation support for M in PrtnDB,
(M, PrtnDB) is the SD for the random occurrence (support) of M in PrtnDB. With Z-scores, we can distinguish significant motifs from random ones. If the occurrence of a motif is far away from its random expectation, this motif is considered to be statistically significant.
With the help of the software package provided by Nicodeme et al. (2002), the expectation and deviation for a motif M =
1
2 ...
k with respect to PrtnDB can be calculated as follows:
![]() |
where m is the number of proteins in PrtnDB(see Appendix A for details). Nicodeme et al. (2002) also showed that for most motifs,
![]() |
From formulae (2) and (3), we can see that after one pass of pre-computation for the number of residues in PrtnDB and the number of proteins m, the expectation and SD of any motif can be calculated in linear time with respect to the number of positions in the motif, i.e. in O(k) time.
Next, we introduce the concept of significant motif pairs. Let us first define the support and contributive support of motif pairs with respect to a sequence dataset
of interacting protein pairs.
DEFINITION 4.
(Support a motif pair) The absolute support of a motif pair MPr = {ML,MR } in
is defined as the number of interacting protein pairs in
that contain MPr, denoted by
(MPr,
) = |{PPr i
| MPr
PPr i }| = |Cluster(MPr,
)|.
Since not all motif pairs contained in an interacting protein pair can play a role in the interaction, we define contributive support of motif pairs to reflect the true contributors for the interaction.
DEFINITION 5.
(Contributive support for a motif pair) The contributive support of a motif pair MPr in
is the number of protein pairs in
whose interaction is partially contributed by MPr, denoted by
c (MPr,
) = |{PPr i
| MPr
PPr i , MPr contributes PPr i |, or simply denoted by
c (MPr).
Contributive support is only a theoretical concept when structure data for the protein complexes are unavailable. Later on, we will show how to estimate contributive support values based on a sequence dataset
of interacting protein pairs and a set of motif pairs.
Similar to Z-scores (Atteson, 1998), which are used to measure the significance of single motifs with regard to PrtnDB, we define P-scores to measure the significance of motif pairs. Given an MPr = {M L ,M R } and a protein interacting sequence dataset
,
![]() |
where exp(MPr,
) is the expectation support of random co-occurrences of MPr in
.
Based on the Z-scores of single motifs and P-scores of motif pairs, now we define significant motif pairs:
DEFINITION 6.
[Significant motif pairs] A motif pair MPr = {ML,MR } is significant in a protein interacting sequence dataset
and the corresponding protein set PrtnDB if zs (ML , PrtnDB)
L , zs (MR ,PrtnDB)
R and ps (MPr,
)
B , where
L
0,
R
0,
B
1 are pre-set thresholds.
This definition emphasizes that the observations should be far away from the expectation values.
Computationally calculating P-scores is not straightforward because the accurate contributive support is almost impossible to be obtained without wet-experimental examination. Therefore, we present an approximate solution. First, assume ML and MR are independent, the expectation can be calculated as follows:
![]() |
where m is the number of unique proteins in
. Therefore, the P-score can be re-written as
![]() |
Assume that an interaction contains only one binding motif pair, then the contribution of a motif pair to a protein pair is influenced by other motif pairs. Given a sufficiently large set of motif pairs SMPr , we can estimate the contributive support using the following
![]() |
It can be seen that for a motif pair, the supports of its two contained motifs are fixed values in a given protein set PrtnDB. Therefore, when handling a large motif pair set SMPr , formulae (6) and (7) will consist of a large group of equations with two types of variables: the P-scores and the contributive supports of the motif pairs.
To solve this group of equations, we explore the use of iterative programming. First, we set an identical initial value for the P-score of every motif pair. Then, we use the current P-scores to calculate the contributive support for all motif pairs by formula (7). We can thereafter get new P-scores using formula (6) for each motif pair and start a new round of calculation, until the changes of most variables are less than a threshold.
| DETERMINE STARTING POINTS TO DERIVE STABLE MOTIF PAIRS |
|---|
Since the number of possible starting motif pairs is huge, it is a computational difficult problem to find all stable motif pairs from a large dataset of interacting protein pairs. In this section, we present a heuristic method to find a subset of the stable motif pairs with the biological guidance from a protein complex dataset. Our motivation is that the protein complex data are the most reliable data about protein interactions, and its three-dimensional (3D) coordinate information is an easy platform to find binding sites. Hereby, we first compute binding sites from the complex dataset, and then use them to produce starting motif pairs to search for stable and significant motif pairs from the dataset of interacting protein sequence pairs. In this manner, we can get high confidence to the discovered stable and significant motif pairs since they are stemmed from the biologically reliable protein complex data.
A core step to determine starting motif pairs is to discover the so-called maximal contact segment pairs (Li et al., 2004) from a protein complex dataset. Let us explain a bit more about this concept. Two segments from different proteins are a contact segment pair if any residue in one segment can find at least one contact residue in the opposite segment, where the contact of two residues means that at least one of their atom pairs has an Euclidean distance less than a threshold. A contact segment pair is a maximal contact segment pair if no any other contact segment pair in the same protein pair contains both segments of this contact segment pair, capturing contact segment pairs as lengthy as possible. These definitions and the search algorithms can be found in our previous work (Li et al., 2004). To be self-contained for this paper, we also outline these in Appendix B. As an example (see more in Figure S1 and Figure S2 of the Supplementary information), the segment pair ([a 16,a 20],[d 41,d 47]) with sequence (AGSSY, VGRANMA) between chain A and chain D of the complex pdb1mbm is a maximal contact segment pair.
However, directly using maximal contact segment pairs as starting motif pairs is not a smart choice. Because these segment pairs are highly specific in corresponding species, they may not occur in yeast interacting protein dataset
. Therefore, we need to generalize these contact segment pairs. We achieve this goal by using the principle proposed by Azarya-Sprinzak et al. (1997). The principle says that even some residues in some positions are changed to other residues, their structures are still unchanged. Since the structures maintain the same, the binding behavior is highly likely to maintain as well. Basically, we use local alignment and consensus discovery to implement this generalization and to get satisfactory starting motif pairs.
Given a maximal contact segment pair SPr and a protein interaction dataset
, the generalization of SPr is as follows:
- Find a subset of
, denoted ApproxCluster(SPr) = {PPri
| Local_Alignment(SPr,PPri )
}, where
is an empirical threshold,
- Discover the consensus pattern MPr from ApproxCluster(SPr).
Thus, MPr is a generalized pattern for SPr. Then, we use MPr as a starting point to discover a stable motif pair. For the maximal segment pair (AGSSY, VGRANMA) mentioned above, we found 34 interactions for its ApproxCluster. From this cluster, we induced a consensus motif pair, {AG[DGS][GS][IVY], [FV]G[EK][AE][DENS][IL]A}, which was then used as the starting point to derive a stable motif pair {AGGG[IY],[FV]G[EK]A[ES]IA}.
Summary of the whole flow to discover a subset of stable and significant motif pairs
The whole flow of our method is summarized as follows:
Input: A sequence dataset
of interacting protein pairs, a complex dataset
Output: A set of stable and significant motif pairs SMPr
forall complex CPL in
do
forall protein pair Pa and Pb in CPL do
find the set of maximal contact segment pairs SSPr ;
end for
end for
forall contact segment pair SPr in SSPr do
generalize SPr to produce a starting motif pair MPr
end for
for all starting motif pair MPr do
refine MPr to either a stable motif pair MPr' or an emptyset.
end for
for all stable motif pair MPr' do
filter those stable motif pairs MPr' if MPr' is not significant
end for
| IMPLEMENTATIONS AND RESULTS |
|---|
In the computation of contact residues in a complex, we set the distance threshold as 5Å, that is, any residue/atom pair that have a distance <5Å is regarded to be contacted. In the computation of maximal contact segment pairs, we required that every contact segment should contain at least four residues. In the generalization from maximal contact segment pairs to starting motif pairs, we set different
thresholds for local alignment based on the segment lengths:
was set strictly for short segments but loosely for long segments. Actual
values used in this study can be referred to Figure S3 in the Supplementary information. After obtaining starting motif pairs from the complex dataset, we conducted the refinement process to find stable motif pairs from the sequence dataset of interacting protein pairs. For a motif pair MPr, to discover f(MPr)the consensus patternand subsequently f(... f(f(MPr))) until a stable state, we computed a latter cluster based on its previous cluster instead of he whole dataset. The efficiency was therefore greatly improved. This is correct because the refinement leads to more and more specific motif pairs.
After obtaining a set of stable motif pairs from the starting motif pairs and the refinement, we filtered the insignificant ones. The thresholds for the significance indices were set as follows:
L = 0,
R = 0 and
B = 1. The computation of the supports and Z-scores are straightforward according to our algorithm. However, the computation of the P-scores is an iterative process. The initial P-score for every motif pair was set as 1.0 in this work. We observe that the P-score trends of most motif pairs (>90%) in the iterative process are convergent, either monotonically increasing or monotonically decreasing. Given a set of motif pairs SMPr , the overall score difference between the j-th and the (j 1)-th iteration is calculated by an index
(j).
![]() |
If
< 0.01, we stop the iterative process. For most sets of motif pairs, the process could stop within four iterations.
Results overview
In total, we discovered 765 stable and significant motif pairs from the sequence dataset of interacting protein pairs using 1403 maximal contact segment pairs identified from the protein complex dataset. Table 3 provides these results and other related results such as the support information.
|
The P-score values of the 765 stable and significant motif pairs differ very much from one another. Figure 2 shows the distribution of these P-scores (under log2 scale). It can be seen that our algorithm can discover motif pairs with both high and low P-scores (larger than a threshold).
|
Besides P-scores, another important information is the support. The distribution of the support values (under log2 scale) of the 765 stable and significant motif pairs is depicted in Figure 3. It can be seen that our algorithm preferred to discovering motif pairs with relatively low supports. This is an advantage to our algorithm as the support of many real binding motif pairs is quite possible to be low in an incomplete dataset. The distribution of the estimated contributive support values for our discovered motif pairs exhibits almost the same shape as the one in Figure 3.
|
To evaluate the lengths of our discovered motif pairs, we used information content (Tompa, 1999) as the index. Assume each residue has equal distribution, the information content of a motif M =
1
2 ...
k can be computed by:
![]() |
For a motif pair MPr = {M L ,M R }, we define
![]() |
Therefore, the information content mostly reflects the length of a motif. The distribution of the information contents of the 765 motif pairs is shown in Figure 4. It can be seen that most of the motif pairs have an information content between 10 and 20, except for very few cases. Therefore, these motif pairs roughly have residues between 10 and 20.
|
Effectiveness comparison with random patterns
To demonstrate that our discovered stable and significant motif pairs are credible and also to illustrate that our choice of the starting motif pairs makes benefits to the discovery, we conduct a comprehensive computational comparison between our patterns and random patterns. These experiments include (1) the comparison between our 913 stable motif pairs and 10 random sets each consisting of 913 random motif pairs; and (2) the comparison between our 1222 starting motif pairs and 10 random sets each consisting of 1222 random starting motif pairs.
A random motif pair is generated by substituting every residue in our pattern with a random residue. Therefore, the random pattern has the same length as ours. The distribution of the randomly generated residues follows the same distribution of all the residues in the contact sites of our complex dataset. [In fact, it has no significant difference between this distribution and that in the whole yeast genome (Fariselli et al., 2002).]
First, we compare our 913 stable motif pairs with the 10 sets of random motif pairs of equal size to see how much percentage of them are significant. We observed that
- About two-thirds of the random motif pairs have a zero-support in the interaction dataset
, namely
(MPr random ,
) = 0. However, for every MPr of our 913 stable motif pairs,
(MPr,
)
= 0. Figure 5 shows the percentage of random patterns having non-zero support for the 10 rounds of random experiments.
- Only about one-ninth of the random motif pairs are significant. However,
84% of our 913 stable motif pairs are significant. Complete results are shown in Figure 6.
- The total support of our stable and significant motif pairs is much larger than that of significant random motif pairs, which is shown in Figure S4 in the Supplementary information.
|
|
These results indicate that our discovered stable motif pairs are much more statistically significant than random patterns. Therefore, they are most likely to be potential binding motif pairs.
Second, we substitute our 1222 starting motif pairs with random starting motif pairs to see how much percentage of stable motif pairs can be discovered, and how much percentage of stable and significant can be discovered. Such substitution is repeated 10 times. We observed that
- Our starting motif pairs can lead to 75% (913) of stable points, but those random starting points in each round lead to <33% of stable motif pairs. Complete results are shown in Figure 7.
- Our starting motif pairs can lead to
63% of stable and significant motif pairs, but <18% of those random starting points can lead to stable and significant motif pairs. Figure 8 shows complete results.
|
|
From these comparison, we can conjecture that the generalization from maximal contact segment pairs to our starting motif pairs is a useful method because it contributes much more number of stable and significant motif pairs than the random method does.
From these various random experiments, we can see that the stable and significant motif pairs that we discovered are far way from the random expectation, which benefits from the choice of starting points. Therefore, it is reasonable that they provide much information to find real binding motif pairs. This is also confirmed by our literature searching results reported in the next subsection.
Literature validation
To demonstrate the biological significance of our discovered patterns, ideally, they should be validated by wet-experimental methods. Unfortunately, there are few well-known wet-experimental methods that can determine the two sides of the binding sites simultaneously. Current available technique such as phage display (Smith, 1985) can determine only one side of the binding sites and produce proteinmotif binding pairs or proteinpeptide binding pairs. On the other hand, there is still limited data about the binding sites, mostly spanning across various individual literature, without an integrative and comprehensive database available, which makes our validation even harder.
Nevertheless, we still find some evidences to show the biological significance of our discovered patterns. First, we check the coincidence of the individual motifs in our motif pairs with the reported binding motifs determined by various wet-experimental methods. For example, using key words binding motif OR site AND mutagenesis, we extracted 202 binding motifs from the abstracts of NCBI PUBMED; 89 of them have at least three positions compatible to ours and 40% overall similarity. Of these 89 binding motif pairs, 42 motif pairs are highly similar with our discovered motifs, having at least four positions compatible and 50% overall similarity. We show in Table S1 the top five matches in comparison with mutagenesis, and in Table S2 the top 5 matches with another method named phage display in the Supplementary information.
Second, we check our discovered motif pairs with proteinmotif binding pairs determined by phage display. First, we identify the individual motifs in our population of discovered motif pairs that match closely with a binding motif/peptide in the literature. Then, for each of such matched motifs, we verify whether the motif on the other side of the corresponding motif pairs can be found in the protein known to bind the particular motif/peptide. An example is shown in Table 4. Tumbarello et al., (2002) studied the binding sites of protein paxillin and its binding proteins. The binding site of paxillin is in the form of LDxLLxxL. Our method discovered similar motifs as shown in the first column of Table 4. The other side of the corresponding motif pairs are shown in the second column of the table, which have been found to exist in the binding proteins reported in the literature (Tumbarello et al. 2002). The fully matched binding proteins or roughly matched motifs are shown in the last column of the table. More examples are detailed in Tables S3 and S4 in the Supplementary information.
|
Finally, we provide complete details for one of the 765 stable and significant motif pairs to see how it is discovered, where is its origin and what is its biological significance. This stable motif pair is
![]() |
denoted by MPr example = {M L , M R }, where M L = L[DN]LL and M 2 = [EK][LV]GDG.
Its origin is located at the so-called pdb3daa protein complex. Specifically, the motif M L = L[DN]LL is evolved from the the segment LNLL in the chain A of the pdb3daa complex. These four amino acids are indexed from 147 to 150 residues in the chain A, denoted by [a 147,a 150] with sequence LNLL.
The motif M R = [EK][LV]GDG is rooted at the segment YQFGDG in the chain B of the pdb3daa complex. These six amino acids are indexed from 24 to 29 residues in the chain B, denoted by [b 24,b 29] with sequence YQFGDG.
The segment pair, ([a 147,a 150], [b 24,b 29]) with sequence (LNLL,YQFGDG) between chain A and chain B, is a maximal contact segment pair.
This maximal contact segment pair (LNLL,YQFGDG) is then generalized to the following starting motif pair MPr start
![]() |
for the function f.
Interestingly, we found that f(MPr start) = MPr start = MPr example. That is, this starting motif pair MPr start itself is a stable motif pair.
We found that this stable motif pair MPr example is statistically significant after examining its support level and P-score against random motif pairs. The support of motif L[DN]LL is 265 in PrtnDB, the support of motif [EK][LV]GDG is 13 with respect to the same protein set PrtnDB. The support of MPr example as a pair is 58 in the protein interaction sequence dataset
. Then, we generated 1000 random motif pairs according to MPr example, where each random motif pair is generated by substituting every residue in MPr example with a random residue. So, the random motif pairs have the same length as MPr example. The distribution of the randomly generated residues follows the same distribution of all the residues in the whole yeast genome. For these 1000 random motif pairs, the average support of the random motifs corresponding to L[DN]LL is 32.91, the average support of the random motifs corresponding to [EK][LV]GDG is 4.41. The average support for those 1000 motif pairs is 1.83 in the protein interaction sequence dataset
. The P-score of MPr example as a pair is 6.15 with respect to protein interaction sequence dataset
, while the average P-score for these 1000 random motif pairs is 2.63 with respect to the same
. From these statistical numbers of MPr example and its equal-length 1000 random motif pairs, we can see that MPr example has occurrence much more than its random expectation either in single motifs or in pairs. Hence, the stable motif pair MPr example is not a random result indeed.
We also found many biological significance of the motif pair MPr example. In biology, Doray and Kornfeld, (2001) found a protein motif M DK = LLDLL, a functional variant of the LLNLD motif within the beta 1 subunit of AP-1, was biologically confirmed to bind to the terminal domain of the clathrin heavy chain. From the sequence of this terminal domain, we find that there exists a segment ELGD near the end part of this domain. Comparing these biological results and our computational results, we can see that
- M DK = LLDLL is similar to the left motif L[DN]LL of our motif pair MPr example.
- The segment ELGD matches well with our right motif [EK][LV]GDG of MPr example. The precise position of the segment ELGD is from positions 462 to 465 at the end of the globular terminal domain (from 1 to 479) of clathrin heavy chain 1 of human.
- Besides, our left motif L[DN]LL is similar to LLDLL and LLNLD both of which share the same functions.
| DISCUSSIONS AND SUMMARY |
|---|
In this paper, we model binding sites using a mathematical concept, stable patterns. Our random experiments have shown that stable motif pairs are more likely to be significant than random motif pairs. It is interesting to examine the theoretical aspect of this concept in future.
In this paper, we use P-scores to evaluate the significance of motif pairs. In fact, more complicated score schemes can be tried. For example, Ng et al. (2003) used a similar score with a slight difference only in the calculation of expectations. Deng et al. (2002) used maximum-likelihood to estimate the scores. Note that these two approaches are quite expensive in computation. Moreover, Ngs formula also has the divergence problem. Hence, both of them are not suitable to search significant motif pairs in a huge pattern space. Therefore, how to combine the strength of these score schemes is still a future research effort of us.
We validate our discovered motif pairs with those determined by experimental methods from the literature, both for individual motifs and motif pairs. We show some examples for the coincidence between them. Nevertheless, there still a lot of efforts to be made in future. We intend to collect a comprehensive database about experimentally determined binding motif pairs through text-mining methods in addition to manual check. Then, we could perform a systematic validation for our discovered patterns.
Finally, we summarize the main results achieved in this paper. We used motif pairs to model the binding sites between proteins with two intuitions: (1) the motif pairs should satisfy the stability and (2) the motif pairs should be statistically significant, for both single motifs and their co-occurrence as pairs. We presented efficient algorithms to identify meaningful starting motif pairs, and to find a convergence route for stable motif pairs, as well as to compute the significance of motif pairs. As the search for all possible stable and significant motif pairs from a sequence dataset of interacting protein pairs is a challenging problem, in this paper we turned to look for a special subset of them. The discovery of this subset of stable and significant motif pairs is guided by binding sites identified from a biologically reliable dataset of protein complexes. For this, we extract maximal contact segment pairs from the complex dataset, then generalize them to become our crucial patternsstarting motif pairs that lead to stable motif pairs by a refinement process.
Our comprehensive comparison results have shown that our discovered motif pairs are much more statistically significant than random motif pairs, a result from the choice of starting motif pairs. Some of our discovered motif pairs are also highly matched with real binding motifs reported in the literature.
| APPENDIX A |
|---|
For a motif M =
1
2 ...
k, the expectation in overlapping model is as follows:
![]() |
| APPENDIX B |
|---|
DEFINITION 7.
[Contact segment pairs] Given two proteins Pa = (a1 ,...,ai ,...,au ) and Pb = (b 1,..., bj ,...,bv ), where ai and bj are corresponding residue ids on its protein, a segment pair ([a i 1 ,a i 2 ],[b j 1 ,b j 2 ]) is a contact segment pair if
ai
[a i 1 , a i 2 ],
bj
[b j 1 , b j 2 ] such that contact(ai, bj ), and
bj
[b j 1 , b j 2 ],
ai
[a i 1 , a i 2 ] such that contact(ai, bj ). Residue ai and j is contacted if one of their atom pairs having Euclidean distance less than a threshold.
The following proposition is useful for the efficient discovery of all maximal contact segment pairs from a complex dataset.
PROPOSITION 1.
[Containing property] A segment pair ([a i 1 , a i 2 ], [b j 1 , b j 2 ]) in protein Pa and Pb is a contact segment pair iff the coverage of any of the two segments contains the other segment, i.e. Contact([a i 1 , a i 2 ], [b j 1 , b j 2 ])
(Co
([a i 1 , a i 2 ])
[b j 1 , b j 2 ])
(Co
([b j 1 , b j 2 ])
[a i 1 , a i 2 ]).
Cov is short for coverage, Cov(ai ) = {bj | contact(a i,bj ), bj
Pb }, and Cov([a i 1 ,a i 2 ]) =
a i
[a i 1 ,a i 2 ] Cov(ai ); Cov(bj ) and Cov([b j 1 , b j 2 ]) are similarly defined.
We use a top-down divide-and-conquer strategy to make use of this proposition for the discovery of all the maximal contact segment pairs. First, we start with the two entire segments, then we check whether the containing property exists between these two segments. If yes, stop. Otherwise, we split the coverage of one segment into several discontinued subsegments and form several new segment pairs. This process goes recursively until all the segment pairs fulfill the containing property. In this manner, we also guarantee that we target only on maximal contact segment pairs. The detailed proof of the proposition and the formal description of the algorithm can be found in our previous work (Li et al., 2004).
| Acknowledgments |
|---|
We are grateful to Pierre Nicodeme, Frederic Chyzak and Bruno Salvy for providing the source code and library of the powerful software package Algolib to let us compute the statistical scores of motif pairs. We are thankful to Soon-Heng Tan for spending his valuable time to help us to validate the patterns in comparison with the literature results.
Received on May 17, 2004; revised on June 8, 2004; accepted on September 2, 2004
| REFERENCES |
|---|
Atteson, K. (1998) Calculating the exact probability of language-like patterns in biomolecular sequences. Proceedings of the sixth International Conference on Intelligent Systems for Molecular Biology (ISMB), , Canada , pp. 1724.
Azarya-Sprinzak, E., Naor, D., Wolfson, H.J., Nussinov, R. (1997) Interchanges of spatially neighbouring residues in structurally conserved environments. Protein Eng., 10, 11091122
Botstein, D. and Shortle, D. (1985) Strategies and applications of in vitro mutagenesis. Science, 229, 11931201
Clemmons, D.R. (2001) Use of mutagenesis to probe IGF-binding protein structure/function relationships. Endocr. Rev., 22, 800817
Deng, M., Mehta, S., Sun, F., Chen, T. (2002) Inferring domain-domain interactions from proteinprotein interactions. Genome Res., 12, 15401548
Doray, B. and Kornfeld, S. (2001) Gamma subunit of the AP-1 adaptor complex binds clathrin: implications for cooperative binding in coated vesicle assembly. Mol. Biol. Cell, 12, 19251935
Fariselli, P., Pazos, F., Valencia, A., Casadio, R. (2002) Prediction of proteinprotein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem., 269, 13561361[Web of Science][Medline].
Josephson, K., Logsdon, N.J., Walter, M.R. (2001) Crystal structure of the IL-10/IL-10R1 complex reveals a shared receptor binding site. Immunity, 15, 3546[CrossRef][Web of Science][Medline].
Li, H., Li, J., Tan, S.H., Ng, S.K. (2004) Discovery of binding motif pairs from protein complex structural data and protein interaction sequence data. Proceedings of the Ninth Pacific Symposium on Biocomputing (PSB), , Hawaii , pp. 312323.
Meng, S.W., Zhang, Z., Li, J. (2004) Twelve C2H2 zinc finger genes on human chromosome 19 can be each translated into the same type of protein after frameshifts. Bioinformatics, 20, 14
Mohamed, A.K. and William, A.K. (2001) An Introduction to Metric Spaces and Fixed Point Theory. , Sons John Wiley &.
Nevill-Manning, C.G., Wu, T.D., Brutlag, D.L. (1998) Highly specific protein sequence motifs for genome analysis. Proc. Natl Acad. Sci., USA, 95, 58655871
Ng, S.K., Zhang, Z., Tan, S.H. (2003) Integrative approach for computationally inferring proteindomain interactions. Bioinformatics, 19, 923929
Nicodeme, P., Salvy, B., Flajolet, P. (2002) Motif statistics. Theoret. Comput. Sci., 287, 593618[CrossRef].
Rodi, D.J., Agoston, G.E., Manon, R., Lapcevich, R., Green, S.J., Makowski, L. (2001) Identification of small molecule binding sites within proteins using phage display technology. Comb. Chem. High Throughput Screen., 4, 553572[Web of Science][Medline].
Sidhu, S.S., Fairbrother, W.J., Deshayes, K. (2003) Exploring proteinprotein interactions with phage display. Chembiochem., 4, 1425[CrossRef][Web of Science][Medline].
Smith, G. (1985) Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface. Science, 228, 13151317
Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of proteinprotein interaction. J. Mol. Biol., 311, 681692[CrossRef][Web of Science][Medline].
Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB). Tompa, M. (1999) An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. , Germany 262271.
Tumbarello, D.A., Brown, M.C., Turner, C.E. (2002) The paxillin LD motifs. FEBS Lett., 513, 114118[CrossRef][Web of Science][Medline].
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., Bork, P. (2002) Comparative assessment of large-scale data sets of proteinprotein interactions. Nature, 417, 399403[Medline].
This article has been cited by other articles:
![]() |
J. Guo, X. Wu, D.-Y. Zhang, and K. Lin Genome-wide inference of protein interaction sites: lessons from the yeast high-quality negative protein-protein interaction dataset Nucleic Acids Res., April 1, 2008; 36(6): 2002 - 2011. [Abstract] [Full Text] [PDF] |
||||
![]() |
A.D.J. van Dijk, C.J.F. ter Braak, R.G. Immink, G.C. Angenent, and R.C.H.J. van Ham Predicting and understanding transcription factor interactions based on sequence level determinants of combinatorial control Bioinformatics, January 1, 2008; 24(1): 26 - 33. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Li, J. Li, and L. Wong Discovering motif pairs at interaction sites from protein sequences on a proteome-wide scale Bioinformatics, April 15, 2006; 22(8): 989 - 996. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||























