Bioinformatics Advance Access originally published online on October 31, 2007
Bioinformatics 2007 23(24):3297-3303; doi:10.1093/bioinformatics/btm524
A computational strategy for the prediction of functional linear peptide motifs in proteins
Abteilung für Bioinformatik, Institut für Biochemie, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Short linear peptide motifs mediate protein–protein interactions and cell compartment targeting, and they represent the sites of post-translational modification. The identification of functional motifs by conventional sequence searches, however, is hampered by the short length of the motifs, which results in a large number of hits of which only a small portion is functional.
Results: We have developed a procedure for the identification of functional motifs, which scores pattern conservation in homologous sequences by taking explicitly into account the sequence similarity to the query sequence. To improve this method further, sequence filters have been optimized to mask those sequence regions containing few or no linear motifs. The performance of this approach was verified by measuring its ability to identify 576 experimentally validated motifs among a total of 15 563 instances in a set of 415 protein sequences. Compared to a random selection procedure, the joint application of sequence filters and the novel scoring scheme resulted in a 9-fold enrichment of validated functional motifs on the first rank. In addition, only half as many hits need to be investigated to recover 75% of the functional instances in our dataset. Therefore, this motif-scoring approach should help to guide experiments because it allows focusing on those short linear peptide motifs that have a high probability of being functional.
Contact: h.sticht{at}biochem.uni-erlangen.de
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Protein interactions are crucial for all cellular processes. One common way in which two proteins interact is the formation of specific contacts between globular domains present in the binding partners. Alternatively, globular domains in one protein may also recognize short stretches of approximately 3–10 residues in their binding partner. These regions often show a particular sequence pattern, or Short Linear Motif (SLiM), which contains the key residues involved in function or binding. These key residues may be connected by variable residues (denoted as x), which ensure the proper spacing of the interacting amino acids. Examples of SLiMs include the classical P-x-x-P motif for binding to SH3 domains and the N-P-x-Y motif for the interaction with PTB domains. In addition to mediating protein–protein interactions, SLiMs also represent the target sites of post-translational modifications and mediate cell compartment targeting.
Today, approximately 200–300 motifs are known and are cataloged by several resources including the eukaryotic linear motif (ELM) database (Puntervoll et al., 2003), SCANSITE (Obenauer et al., 2003), PROSITE (Hulo et al., 2006) and Minimotif-Miner (Balla et al., 2006). SLiM-mediated binding was estimated to account for 15–40% of the interactions in the human proteome (Neduva and Russell, 2006b), suggesting that there are a significant number of novel motifs that still need to be discovered. Recently, tools like DILIMOT (Neduva and Russell, 2006a) and SLiMDisc (Davey et al., 2006) have been designed for motif discovery in proteins and corresponding approaches have also been developed for the identification of regulatory elements in DNA sequence (GuhaThakurta, 2006).
A key problem of predicting protein interactions based on linear recognition motifs is the short length of the motifs, which results in a large number of instances of which only a small portion is functional. This makes it difficult for the experimentalist to decide which motifs to select for further experimental characterization. Existing motif search tools attempt to reduce the number of false-positive hits by different strategies: The ELM server (Puntervoll et al., 2003) relies on context filters that mask those parts of the sequence space in which few or no SLiMs are expected to occur. Masked regions include globular domains, in which no motifs are expected since they would be buried and therefore not accessible for interaction. Other filters take into account that some motifs are only functional within particular cellular compartments. In addition, the localization of disordered protein stretches is considered by ELM, since these regions frequently exhibit an increased density of SLiMs.
Scoring schemes, like those available in Minimotif-Miner (Balla et al., 2006), QuasiMotiFinder (Gutman et al., 2005) and Scansite (Obenauer et al., 2003), pursue alternative strategies to assess the functional relevance of a hit: QuasiMotiFinder uses a conservation filter for scoring PROSITE motifs in order to reduce the number of false-positive hits. Minimotif-Miner provides either a frequency score, which measures the relative occurrence of SLiMs in the protein query with respect to the entire proteome, or an evolutionary conservation score, which measures the conservation of a SLiM among orthologs. The scoring scheme implemented in SCANSITE relies on position-specific scoring matrices (PSSMs), which are used instead of regular expressions for motif representation. PSSMs allow a scoring of predicted motif sites, but they can only be generated for a small subset of well-characterized motifs, for which a sufficiently large number of functional instances is known from experiment.
Both sequence filtering and scoring proved to be useful in enhanced motif detection in the past, but have to the best of our knowledge not yet been combined. The complementary nature of both approaches, however, suggests that their combination will improve detection of functional motifs among a large excess of false-positive instances. Therefore, we have developed a novel consensus strategy that relies on a combination of sequence filters and information from homologous sequences for scoring (homology scoring).
To allow an improved identification of those regions that contain no SLiMs and can therefore be excluded from the subsequent analysis, we first optimized two different classes of sequence filters. For the scoring of the remaining motifs, we have developed an approach for an automatic inclusion of sequence information from homologs, which explicitly takes into account the sequence similarity to the query sequence, and which does not require a laborious dissection of orthologs and paralogs.
Application of this combined approach to a large dataset of more than 15 000 motif instances significantly facilitated the identification of functional instances. This should be particularly helpful for reducing subsequent experimental work by focusing on SLiMs that have a high probability of being functional.
| 2 METHODS |
|---|
2.1 Dataset of interaction motifs
A set of 61 different types of SLiMs (Supplementary Table S1) with a total of 675 annotated instances was obtained from the ELM databank (Puntervoll et al., 2003) (Release September 2006). These 675 sites, at which the SLiMs are known to be functional, are located in 487 different proteins. The respective protein sequences were retrieved from the UniProt database (Wu et al., 2006) and subjected to a 75% homology filtering using the algorithm by Holm and Sander (Holm and Sander, 1998). The resulting 415 sequences containing 576 annotated motifs were stored in a database for further analysis.
Since the aim of our study was the efficient detection of these functional motifs among all motif instances, a pattern search using the 61 motif types in our dataset was performed in all 415 sequences. This search results in 14 987 new instances in addition to the 576 functional instances known. Each protein contains on average 1.4 annotated motifs and the portion of annotated motifs in the dataset is 3.7%. Thus, our dataset contains a significant excess of unannotated motifs that allow testing strategies for the improved identification of the functional motifs among a large excess of false-positive instances.
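The pattern search described above can be sketched with regular expressions. This is a minimal illustration and not the authors' implementation: the two patterns are the P-x-x-P and N-P-x-Y examples from the Introduction written as regexes, while the function name `find_instances`, the motif labels and the toy sequence are assumptions made for this example. A lookahead is used so that overlapping instances are all reported.

```python
import re

# Illustrative patterns only: the two motifs named in the Introduction,
# written as regular expressions ('.' plays the role of the variable residue x).
MOTIFS = {
    "PxxP": "P..P",
    "NPxY": "NP.Y",
}

def find_instances(sequence, motifs=MOTIFS):
    """Return (motif name, 0-based start, matched peptide) for every instance,
    including overlapping ones."""
    hits = []
    for name, pattern in motifs.items():
        # A zero-width lookahead with a capture group reports overlapping matches.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, m.start(), m.group(1)))
    return sorted(hits, key=lambda h: (h[1], h[0]))

# Toy sequence (hypothetical, not taken from the dataset).
hits = find_instances("MAPPLPPRPNPTYSA")
```

Even in this 15-residue toy sequence the short patterns produce four instances, which illustrates why unfiltered pattern searches yield many more hits than annotated sites.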
2.2 Sequence analysis
The membrane topology of the proteins in the dataset was predicted using the programs TMHMM2 (Krogh et al., 2001) and PHOBIUS (Käll et al., 2004) with standard settings. For the identification of protein domains, the 9318 Pfam (Sonnhammer et al., 1997) HMMs (Release 22) were stored locally and searched with the tool pfam_scan.pl using the default hmmpfam E-value threshold.
In order to assess the pattern conservation in homologous sequences, homologs were retrieved from a UniRef100 databank (version May 2006, containing 3.5 million sequences) by a BLAST (Altschul et al., 1990) search in which the maximal number of hits included was restricted to 250. Pairwise alignments between the query sequence containing the candidate motif(s) of interest and its homologs were generated using a BLOSUM50 matrix (Gap-open = –12, Gap-extend = –2) and all homologs were finally sorted for further analysis according to their LALIGN (Huang and Miller, 1991) scores. A SLiM was considered conserved in a homolog if the same type of motif was present at the position equivalent to that in the query sequence. The scoring of the conserved patterns in the homologs was done according to the procedure described below.
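One way to read the conservation criterion above is sketched below for a pairwise alignment given as two gapped strings of equal length. The function names, the rule of removing gaps from the homolog stretch before matching, and the identity definition (identical columns divided by columns in which both sequences have a residue) are assumptions made for illustration; the text does not specify these details.

```python
import re

def percent_identity(aln_query, aln_homolog):
    """Identity in percent over columns where both sequences have a residue
    (one of several common definitions; an assumption here)."""
    pairs = [(a, b) for a, b in zip(aln_query, aln_homolog) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def motif_conserved(aln_query, aln_homolog, pattern, start, length):
    """True if the homolog matches `pattern` in the alignment columns that
    hold query residues start .. start+length-1 (0-based, ungapped numbering)."""
    # Map ungapped query positions to alignment columns.
    columns = [col for col, aa in enumerate(aln_query) if aa != "-"]
    if start + length > len(columns):
        raise ValueError("motif lies outside the aligned query")
    first, last = columns[start], columns[start + length - 1]
    # Residues of the homolog in the equivalent stretch, gaps removed.
    peptide = aln_homolog[first:last + 1].replace("-", "")
    return re.fullmatch(pattern, peptide) is not None

# Toy alignment (hypothetical sequences); the query carries P-x-x-P at position 2.
query_aln = "MKPLAPGS-T"
homolog_1 = "MRPVSPG-AT"   # equivalent stretch "PVSP" still matches P..P
homolog_2 = "MRALSPG-AT"   # equivalent stretch "ALSP" no longer matches

conserved_1 = motif_conserved(query_aln, homolog_1, "P..P", 2, 4)
conserved_2 = motif_conserved(query_aln, homolog_2, "P..P", 2, 4)
identity_1 = percent_identity(query_aln, homolog_1)
```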
2.3 Motif scoring in homologous sequences
The basic idea of the following strategy is to identify those patterns that are more highly conserved between two homologs than expected from the overall sequence identity of the two sequences. To this end, we first calculated the average probability that a pattern is present or absent in a homolog depending on its overall sequence identity to the query sequence. For each position of a motif, one can envision three different situations [see (a)–(c) below].
- (a) The amino acid present in the query sequence is conserved in the homolog. The average probability of observing this depends on the sequence identity between the query sequence and its homolog. For a sequence identity of id%, the average probability for conservation ($p_c$) of a pattern position can be estimated by:

$$p_c = \frac{id}{100} \qquad (1)$$
- (b) The amino acid present in the query sequence is substituted by another amino acid and the pattern describing the SLiM is retained ($p_{s,r}$).
The overall probability of observing a substitution of an amino acid is given by (1 – id/100). There are 19 possible substitution types, and the subset of tolerated amino acids depends on the pattern itself. This is described by the motif variability v. For example, a valine matching the pattern position -[VIL]- tolerates substitution by leucine or isoleucine, resulting in a motif variability of 2 (v = 2).
$$p_{s,r} = \left(1 - \frac{id}{100}\right)\frac{v}{19} \qquad (2)$$
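- (c) The amino acid present in the query sequence is substituted by another amino acid and the pattern describing the SLiM is destroyed ($p_{s,d}$).
$$p_{s,d} = \left(1 - \frac{id}{100}\right)\frac{19 - v}{19} \qquad (3)$$
For a pattern of m positions with variabilities $v_i$, the probability p that the pattern is retained in the homolog is the product of the per-position probabilities of conservation or tolerated substitution:
$$p = \prod_{i=1}^{m}\left[\frac{id}{100} + \left(1 - \frac{id}{100}\right)\frac{v_i}{19}\right] \qquad (4)$$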
These probabilities can be used to assess the functional relevance of patterns based on the following consideration: If the pattern [VIL]-[WFY]-[KR] is conserved between two sequences sharing 90% identity, this is only a weak indication of functional relevance, since such a conservation is also likely to occur by chance (p = 0.75) at this level of sequence identity. In contrast, at 50% sequence identity, the probability of the motif being retained by chance is quite low (p = 0.16) and therefore, this motif is highly likely to be conserved for functional reasons.
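The two probabilities quoted above can be reproduced with a short calculation. The sketch below assumes that the chance of retaining a single pattern position is id/100 + (1 – id/100)·v/19 and that positions are independent, so the per-position values are multiplied; this reading is inferred from the definitions in the text, and the function names are hypothetical.

```python
def position_retention(identity, allowed):
    """Chance that one pattern position still matches in a homolog.

    identity: overall sequence identity in percent.
    allowed:  string of residues tolerated at this position, e.g. "VIL".
    Uniform substitution probabilities (1/19 each) are assumed.
    """
    v = len(set(allowed)) - 1          # motif variability: tolerated substitutions
    p_conserved = identity / 100.0
    p_substituted_retained = (1.0 - identity / 100.0) * v / 19.0
    return p_conserved + p_substituted_retained

def pattern_retention(identity, positions):
    """Chance p that the whole pattern is retained, assuming independent positions."""
    p = 1.0
    for allowed in positions:
        p *= position_retention(identity, allowed)
    return p

pattern = ["VIL", "WFY", "KR"]         # [VIL]-[WFY]-[KR]
p90 = pattern_retention(90, pattern)
p50 = pattern_retention(50, pattern)
```

Rounded to two decimals, `p90` and `p50` give 0.75 and 0.16, in agreement with the values stated in the text. Under this formulation a fully variable position (all 20 residues allowed, v = 19) contributes a factor of 1 and does not affect p.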
Analogous conclusions can also be made, if a pattern is absent in a homologous sequence: While the absence is rather expected at low levels of sequence similarity, the absence in a close homolog strongly argues against a functional relevance of the respective motif.
These considerations are taken into account by the following simple scoring scheme, which was applied to each pairwise comparison between the query sequence and its homologs:
S = 1 – p, if the motif is conserved
S = 0 – p, if the motif is not conserved.
In addition to the approach outlined above, which assumes uniform frequencies of 1/20 for each amino acid, we also tested a modified approach that explicitly takes into account the individual frequencies for each amino acid in Equation (4).
The individual scores for each comparison were used to calculate an average conservation score ACSn according to the following equation
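$$ACS_n = \frac{1}{n}\sum_{i=1}^{n} S_i \qquad (5)$$
where $S_i$ is the score obtained for the i-th homolog in the similarity-sorted list and n is the number of homologs considered.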
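The scoring rule and the averaging over similarity-sorted homologs can be sketched as follows. The sketch assumes that ACSn is the arithmetic mean of the first n pairwise scores and takes the maximum of that curve as the maximum conservation score (MCS) used for ranking in Section 3.2; the function names and the toy input are hypothetical.

```python
def pairwise_score(conserved, p):
    """S = 1 - p if the motif is conserved in the homolog, else 0 - p."""
    return (1.0 if conserved else 0.0) - p

def acs_curve(observations):
    """Running average of S over the first n homologs, for n = 1..N.

    observations: list of (conserved, p) tuples, already sorted from the most
    to the least similar homolog; p is the chance probability of retention.
    """
    curve, total = [], 0.0
    for n, (conserved, p) in enumerate(observations, start=1):
        total += pairwise_score(conserved, p)
        curve.append(total / n)
    return curve

def mcs(observations):
    """Maximum of the ACS curve."""
    return max(acs_curve(observations))

# Toy input: three increasingly remote homologs that retain the motif,
# followed by one that has lost it.
toy = [(True, 0.75), (True, 0.5), (True, 0.25), (False, 0.25)]
curve = acs_curve(toy)
best = mcs(toy)
```

In this toy case the curve rises while informative homologs are added and drops once a homolog lacking the motif is included, so the maximum is reached at n = 3.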
| 3 RESULTS AND DISCUSSION |
|---|
3.1 Sequence filters for reducing the number of motif instances
Masking sequence regions that are expected to contain few or no SLiMs represents one common approach to reduce the number of false-positive hits. In our study, we assessed the performance of two different types of sequence filters that predict membrane topology and globularity of proteins (Fig. 1).
The vast majority of the SLiMs known to date mediate intracellular recognition processes. Therefore, transmembrane regions, extracellular loops or signal peptide sequences can be masked before the motif search. One important point, however, is the accuracy with which the respective regions are predicted, since an inaccurate prediction may result in a loss of relevant motifs. In this context, we tested the accuracy of the programs TMHMM2 (Krogh et al., 2001) and PHOBIUS (Käll et al., 2004), which predict transmembrane regions (TM), extracellular loops (EX) and signal peptides (SP; PHOBIUS only). If these programs worked perfectly, none of the annotated motifs in the test set, which all mediate intracellular recognition processes, would be masked.
As shown in Table 1, a good performance is obtained for the SP- and TM-filters that remove <2% of the annotated motifs. The correct prediction of extracellular regions is generally less accurate and depending on the method used, the portion of motifs removed is 6.1% or 3.0%. These errors result from a wrong classification of the inside/outside topology of transmembrane proteins, which can occur even if the transmembrane helices are predicted at the correct sites. To improve the accuracy of the filters, the output of both algorithms was combined to a consensus prediction in which only those TM and EX regions consistently predicted by both algorithms are excluded (see EX1+2 and TM1+2 in Table 1). In particular for the extracellular regions, this procedure leads to a significant improvement and only 1.4% of the annotated SLiMs are erroneously removed. The combined application of all topology filters masks 2.8% of the functional and 13.2% of all instances, demonstrating the usefulness of this procedure in reducing the number of false-positive hits.
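The consensus rule described above, in which a region is masked only if both predictors agree, amounts to a per-residue intersection of the two predictions. The following sketch assumes 1-based inclusive region coordinates and treats a motif as masked if any of its residues is masked; both conventions, the function names and the toy coordinates are assumptions, since the text does not state them.

```python
def to_mask(length, regions):
    """Per-residue boolean mask from 1-based inclusive (start, end) regions."""
    mask = [False] * length
    for start, end in regions:
        for i in range(max(start, 1) - 1, min(end, length)):
            mask[i] = True
    return mask

def consensus_mask(length, regions_a, regions_b):
    """Mask only residues that both predictors place in the masked class."""
    a = to_mask(length, regions_a)
    b = to_mask(length, regions_b)
    return [x and y for x, y in zip(a, b)]

def is_masked(mask, start, end):
    """A motif (1-based inclusive) counts as masked if any residue is masked.
    This overlap rule is an assumption; the text does not specify it."""
    return any(mask[start - 1:end])

# Toy example: two hypothetical extracellular predictions for a 30-residue protein.
mask = consensus_mask(30, [(5, 20)], [(10, 25)])
motif_kept = not is_masked(mask, 3, 8)      # overlaps only one prediction
motif_removed = is_masked(mask, 18, 22)     # overlaps the consensus region
```

Requiring agreement shrinks the masked region to residues 10–20 in this toy case, which is the mechanism by which the consensus filter trades some masking coverage for fewer erroneously removed motifs.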
In the next step of our analysis, we tested strategies to remove globular regions from the search space. This idea is based on the consideration that functional SLiMs are expected to be located in solvent accessible, non-globular regions of proteins rather than being buried in the interior of globular protein domains. Globular protein domains are for example cataloged in the Pfam databank (Sonnhammer et al., 1997). However, a small portion of the Pfam databank also covers conserved sequence stretches that do not correspond to globular domains. This point is reflected in the relatively large number of functional instances masked (14.4%; Table 1), when all Pfam HMMs are used as a sequence filter.
In order to obtain a better discrimination between true and false-positive hits, we created a Pfam subclass (termed Pfam-d) that includes exclusively those Pfam entries that are annotated as domain in the Pfam databank. When Pfam-d is used instead of Pfam, the number of validated SLiMs lost by filtering is reduced from 14.4% to 5.6% (Table 1).
We tested whether this portion of lost annotated SLiMs can be further reduced by considering only those Pfam domains for which at least one representative with known 3D structure is available. This subset of entries (termed Pfam3d), however, results in only a marginal improvement and the portion of lost instances is still 5.2% (Table 1). These numbers suggest that the respective motifs are located in non-globular stretches at the N- or C-terminus of globular domains, or within loops that exhibit the pre-defined topology required for interaction. Due to these properties of the respective sequence stretches, an efficient exclusion by Pfam3d is not possible. Therefore, Pfam-d was used as sequence filter for all subsequent analyses.
A combination of Pfam-d with the topology filter masks approximately 34% of all motif instances while only 8% of the validated SLiMs are lost (Table 1, last line). Note that the performance of both filters is not strictly additive because extracellular Pfam domains are masked by both filters. Unfortunately, due to the large overall number of 15 563 instances in the test set, there are still 10 043 instances (roughly 24 instances per protein) left after filtering. This indicates that a considerable risk of obtaining false-positive hits remains, which makes a scoring of the hits highly desirable. In the present study, we therefore tested a scoring scheme that includes information about motif conservation in homologous sequences.
3.2 Sequence information from homologs for motif scoring
The conservation of SLiMs in homologous sequences was measured by calculating an average conservation score (ACS) as a function of the number of homologs considered. The individual contribution of each homolog to the score is determined by its sequence similarity to the query sequence (see Methods section). From these curves (Fig. 2), the maximal value of the ACS (MCS; maximum conservation score) was used to discriminate between functional and non-functional motifs within a query sequence. This procedure was applied to the complete set of 576 experimentally validated motifs among a total of 15 563 instances in a set of 415 protein sequences. This corresponds to an average number of 37.5 motifs per protein of which 1.4 are known to be functional. A plot showing the motif distribution per sequence is available as Supplementary Figure S2.
The ACS scores for four representative proteins are shown in Figure 2A–D and the plots for all 415 proteins analyzed are available as Supplementary Material (Fig. S1). Figure 2A shows the plot for the analysis of FEN1_HUMAN, which contains a total of 23 candidate motifs, one of which is annotated as a binding site for the Proliferating Cell Nuclear Antigen {LIG_PCNA}. The ACS as a function of the number of homologs considered was calculated individually for each of these 23 motifs. This example shows the ideal situation in which the validated motif (green line) exhibits the highest MCS (approximately 0.7) and can therefore be readily identified. This situation, in which a functional motif obtains the highest MCS, is observed for 21.4% of the sequences in our dataset, indicating that for those proteins inclusion of homology information alone is sufficient to identify a functional instance based on its MCS without requiring any additional filtering. For the remaining proteins in the test set, there are motifs that give higher MCSs than the annotated SLiMs (Fig. 2B and C). A significant portion of these high-scoring instances is located within Pfam domains (Fig. 2; blue lines) or extracellular regions (Fig. 2; gray lines), and can therefore be masked by sequence filtering. Figure 2B shows an example in which all high-scoring motifs are located within Pfam domains and therefore the annotated instance obtains the highest MCS after filtering. The combination of homology scoring and sequence filtering increases the number of proteins in the dataset in which a validated SLiM obtains the highest MCS from 21.4% to 33.0%. Thus, in one-third of all sequences, a functional motif can unambiguously be identified because it exhibits the highest MCS in the respective protein sequence.
In order to allow an efficient discovery of the remaining functional SLiMs that do not obtain the highest MCS (e.g. the LIG_TRAF2_1 motif in Fig. 2C), we propose a simple scheme that ranks all candidate motifs within a protein sequence according to their MCS, thus creating a priority list for further experimental studies. The recovery of validated motifs as a function of the number of ranks included is shown in Figure 3 for four different scenarios. The gray curves (squares and diamonds) represent situations in which the motifs were ranked according to a random procedure (with and without filtering, respectively) and no information on motif conservation was used. In such a situation, the chance to recover a functional instance depends on the overall number of instances per protein and is therefore affected by sequence filters. For the first ranks, sequence filtering improves the recovery of functional instances, since the filters efficiently reduce the overall number of instances by 34% (Table 1). Sequence masking, however, also removes a small portion (approximately 8%) of the validated SLiMs, and therefore the total recovery does not exceed 92% (Fig. 3; squares).
The black curves in Figure 3 correspond to a situation, in which the candidate motifs are ranked according to their MCS, either with (circles) or without (triangles) prior sequence masking. The scoring improves the overall performance, and the majority of annotated instances is now found on the first ranks. An explicit consideration of the amino acid frequencies in Equation 4 did not lead to an improved performance (Supplementary Fig. S3) and was therefore not considered in the final benchmarking (Fig. 3).
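A recovery curve of the kind shown in Figure 3 can be computed along the following lines. This is a sketch under stated assumptions rather than the benchmarking code used in the study: motifs are ranked per protein by descending MCS, masked motifs are removed before ranking, recovery at rank k is the fraction of all validated motifs found at rank k or better, and ties keep their input order. The dictionary keys and the toy data are hypothetical.

```python
from collections import defaultdict

def recovery_by_rank(instances, max_rank):
    """Fraction of validated motifs found within the top k ranks, k = 1..max_rank.

    instances: list of dicts with keys 'protein', 'mcs', 'validated', 'masked'.
    Masked motifs are dropped before ranking and can never be recovered.
    """
    total_validated = sum(1 for inst in instances if inst["validated"])
    by_protein = defaultdict(list)
    for inst in instances:
        if not inst["masked"]:
            by_protein[inst["protein"]].append(inst)

    found_at_rank = [0] * (max_rank + 1)
    for motifs in by_protein.values():
        ranked = sorted(motifs, key=lambda m: m["mcs"], reverse=True)
        for rank, motif in enumerate(ranked, start=1):
            if motif["validated"] and rank <= max_rank:
                found_at_rank[rank] += 1

    recovered, curve = 0, []
    for k in range(1, max_rank + 1):
        recovered += found_at_rank[k]
        curve.append(recovered / total_validated)
    return curve

# Toy data (hypothetical): three proteins, three validated motifs, one of them masked.
toy = [
    {"protein": "A", "mcs": 0.7, "validated": True,  "masked": False},
    {"protein": "A", "mcs": 0.4, "validated": False, "masked": False},
    {"protein": "B", "mcs": 0.9, "validated": False, "masked": True},
    {"protein": "B", "mcs": 0.5, "validated": False, "masked": False},
    {"protein": "B", "mcs": 0.3, "validated": True,  "masked": False},
    {"protein": "C", "mcs": 0.6, "validated": True,  "masked": True},
    {"protein": "C", "mcs": 0.2, "validated": False, "masked": False},
]
curve = recovery_by_rank(toy, 3)
```

The toy curve levels off below 1 because one validated motif is masked, which mirrors the ceiling on total recovery that filtering imposes in Figure 3.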
Regardless of the scoring strategy used, there are a few annotated SLiMs that are still difficult to identify. This observation can at least partially be explained by motif duplication, which is found in several proteins in our dataset. A typical example is BCA1_HUMAN (Fig. 2D), which contains two Src SH2-domain binding motifs {LIG_SH2_SRC}. In its homologs one copy is highly conserved and even represents the highest-scoring motif in this protein, while the second copy is only poorly conserved. This suggests that one copy is sufficient for the interaction and that there is little evolutionary pressure to retain the second copy of this motif.
Apart from this small subgroup of poorly conserved instances, homology scoring significantly improves the recovery of validated SLiMs and the largest effect is observed for the first rank (Fig. 3). While a random selection procedure recovers only 3.7% of the functional instances on the first rank, sequence filtering, homology scoring, and the combination of both methods increase the portion of validated instances to 7.9%, 21.4% and 33.0%, respectively, in our dataset. In addition, the number of ranks that must be inspected to recover 75% of the validated SLiMs is only half as large as for a random selection procedure (Fig. 3). These results indicate a beneficial effect of sequence filtering, homology scoring and combinations thereof for motif identification, but the exact numbers will depend on the dataset and benchmarking protocol used. Some points that should be considered in this context are listed below:
- In the present work the ELM databank was used, because it contains not only the pattern itself but also verified functional motif instances, which are required for benchmarking. A further advantage of this dataset is the manual curation of ELM entries (Puntervoll et al., 2003), which should reduce the number of wrong functional annotations. Until now, however, validated instances are only available for a subset of the known SLiMs, thus limiting the size of the dataset for benchmarking motif identification approaches. In addition, there is some redundancy in the ELM dataset as evident from the fact that several validated motif instances are found in closely related proteins. In order to improve the significance of our analysis, we have removed near-neighbor redundancy at a level of 75% sequence identity from our dataset using a clustering procedure (see Methods section).
- For our dataset of verified functional motifs, there exists no complementary dataset of verified non-functional motifs thus hampering a comprehensive benchmarking (e.g. calculation of the specificity of motif identification). We have therefore limited our benchmarking to the calculation of the recovery of 576 experimentally validated SLiMs from a large background dataset of 14 987 functionally uncharacterized motif instances.
- The choice of the sequence database used for the search for homologous sequences will also affect the resulting scores. In particular, poor quality sequences from gene predictions or alignment problems with remote homologs will lower benchmark scores. We tried to reduce these problems by performing only pairwise sequence alignments with the query sequence instead of one large multiple sequence alignment.
- The performance of the method will also depend on the type of motif investigated. This is particularly important for those few types of SLiMs that frequently fall into domains (e.g. phosphorylation sites in exposed loops of globular domains). Such instances will be lost by sequence filtering and therefore the use of homology scoring alone should be the best choice for their identification.
3.3 Properties and application of the scoring scheme
The data in Figure 3 show that information about pattern conservation in homologous sequences facilitates the identification of functional SLiMs. Key features of the underlying ranking and scoring scheme include (1) the explicit consideration of the sequence similarity between the query sequence and its homologs, and (2) the calculation of a maximum conservation score that is used for motif ranking.
The first criterion is intended to ensure a proper weighting of the presence or absence of a motif in the light of the evolutionary distance between two sequences. This means that the absence of a motif in a closely related sequence strongly argues against the functional relevance of this site, while conservation is not necessarily expected in remote homologs. Previous approaches, which use evolutionary conservation for scoring known SLiMs (Balla et al., 2006) or novel types of motifs (Neduva and Russell, 2006a; Neduva et al., 2005), mainly used information from close homologs (orthologs) that are likely to share similar functional properties. An automatic dissection of orthologs and paralogs, however, is not always straightforward and requires evolutionary knowledge that is only available for a subset of protein families. The approach presented here does not require an explicit dissection of orthologs and paralogs, but rather relies on the observation that orthologs frequently exhibit higher sequence conservation than paralogs (Johnston et al., 2007). Sorting the homologs according to their similarity to the query sequence ensures that the close homologs, which have a higher probability of sharing the functional site, are included first in the ACS calculation.
The curves representing the ACSs of the functional instances in Figure 2A–D (green lines) share a common feature: an initial increase of the ACS up to a maximum (termed MCS in this study), followed by a decrease when more remote homologs are included. Using the MCS as a measure of motif conservation has the advantage that this score is not affected by the total number of remote homologs included that no longer contain the motif; homologs can therefore be retrieved automatically from the database without further selection steps. The magnitude of the MCS itself, however, is affected by the total number of close homologs sharing the motif. This does not allow the definition of one uniform threshold that is valid for all protein families. Motif ranking based on the MCS should therefore be used for ranking motif instances within the same protein family rather than for ranking instances between different non-homologous proteins in a genome-wide search. The latter type of application was therefore not the major goal of our method. Instead, our strategy for motif ranking was developed to guide those experimental studies that aim at a more efficient discovery of functional motifs within a given protein.
An analysis of the proteins in our dataset shows that 89.6% of them contain more than 10 motif instances. For those proteins, comprehensive experimental studies that investigate all candidate motifs by site-directed mutagenesis are very labor-intensive, and a ranking procedure that increases the portion of functional instances on the first ranks will help to reduce the work required for motif discovery because it allows first focusing on those SLiMs that have a higher probability to be functional.
Motif ranking can also be applied to proteins, for which the type of interaction is already known from experiment, but the exact binding site still needs to be determined. This situation can be illustrated for the Interleukin-4 receptor (IL4RA_HUMAN), which signals via the Jak/Stat pathway, but due to the high degeneracy of Stat binding motifs, a total of seven putative binding sites (four for Stat5, and three for Stat6) are detected by a conventional motif search. The MCS scoring approach identifies the functional motif as most likely candidate among these motifs (Supplementary Fig. S1B) and should therefore facilitate subsequent experiments.
In summary, we propose a strategy for the prediction of functional motifs, which scores pattern conservation in homologous sequences by taking explicitly into account the sequence similarity to the query sequence and can be used in conjunction with sequence filters masking those sequence regions containing few or no linear motifs. This motif-scoring approach should help to guide experiments because it allows focusing on those motifs that have a higher probability of being functional.
| ACKNOWLEDGEMENTS |
|---|
We thank Heike Meiselbach and Klemens Pichler for helpful comments on the manuscript. H.D. was funded by the research training grant GRK 1071 from the Deutsche Forschungsgemeinschaft (DFG).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on August 10, 2007; revised on October 8, 2007; accepted on October 12, 2007
| REFERENCES |
|---|
Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.
Balla S, et al. Minimotif Miner: a tool for investigating protein function. Nat. Methods (2006) 3:175–177.
Davey NE, et al. SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Res (2006) 34:3546–3554.
GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res (2006) 34:3585–3598.
Gutman R, et al. QuasiMotiFinder: protein annotation by searching for evolutionarily conserved motif-like patterns. Nucleic Acids Res (2005) 33:W255–W261.
Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics (1998) 14:423–429.
Huang X, Miller W. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math (1991) 12:373–381.
Hulo N, et al. The PROSITE database. Nucleic Acids Res (2006) 34:D227–D230.
Johnston CR, et al. Evaluation of whether accelerated protein evolution in chordates has occurred before, after, or simultaneously with gene duplication. Mol. Biol. Evol (2007) 24:315–323.
Käll L, et al. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol (2004) 338:1027–1036.
Krogh A, et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol (2001) 305:567–580.
Neduva V, Russell RB. DILIMOT: discovery of linear motifs in proteins. Nucleic Acids Res (2006a) 34:W350–W355.
Neduva V, Russell RB. Peptides mediating interaction networks: new leads at last. Curr. Opin. Biotechnol (2006b) 17:465–471.
Neduva V, et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol (2005) 3:e405.
Obenauer JC, et al. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res (2003) 31:3635–3641.
Puntervoll P, et al. ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res (2003) 31:3625–3630.
Sonnhammer EL, et al. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins (1997) 28:405–420.
Wu CH, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res (2006) 34:D187–D191.