Bioinformatics Advance Access originally published online on November 15, 2007
Bioinformatics 2007 23(24):3276-3279; doi:10.1093/bioinformatics/btm513
Low folding propensity and high translation efficiency distinguish in vivo substrates of GroEL from other Escherichia coli proteins
1Department of Structural Biology, Weizmann Institute, Rehovot 76100 and 2The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 52900, Israel
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Theoretical considerations have indicated that the amount of chaperonin GroEL in Escherichia coli cells is sufficient to fold only 2–5% of newly synthesized proteins under normal physiological conditions, thereby suggesting that only a subset of E.coli proteins fold in vivo in a GroEL-dependent manner. Recently, members of this subset were identified in two independent studies that resulted in two partially overlapping lists of GroEL-interacting proteins. The objective of the work described here was to identify sequence-based features of GroEL-interacting proteins that distinguish them from other E.coli proteins and that may account for their dependence on the chaperonin system.
Results: Our analysis shows that GroEL-interacting proteins have, on average, low folding propensities and high translation efficiencies. These two properties in combination can increase the risk of aggregation of these proteins and, thus, cause their folding to be chaperonin-dependent. Strikingly, we find that these properties are absent in proteins homologous to the E.coli GroEL-interacting proteins in Ureaplasma urealyticum, an organism that lacks a chaperonin system, thereby confirming our conclusions.
Contact: amnon.horovitz{at}weizmann.ac.il
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
The Escherichia coli chaperonin system facilitates protein folding in vivo and in vitro in an ATP-dependent manner (Horovitz and Willison, 2005; Horwich et al., 2007). It comprises GroEL and its helper-protein GroES, which are both essential proteins (Fayet et al., 1989). GroEL is able to bind and assist the folding of a wide range of proteins in vitro. For example, it was shown that 40% of the soluble proteins in E.coli can interact with GroEL in vitro (Viitanen et al., 1992) and that even randomly generated artificial polypeptides are GroEL binders (Aoki et al., 2000). Theoretical considerations have indicated, however, that the amount of GroEL in the cell is sufficient to fold only 2–5% of newly synthesized proteins under normal physiological conditions (Lorimer, 1996), thereby suggesting that only a subset of E.coli proteins fold in vivo in a GroE-dependent manner. An initial attempt to identify this set of proteins resulted in a list of about 300 proteins that are comprised preferentially of β domains (Houry et al., 1999). More recently, two additional sets of GroEL-interacting proteins (GIP) were identified for which there is less experimental uncertainty regarding whether they represent the true in vivo substrates (Chapman et al., 2006; Kerner et al., 2005). Overall, it appears that 10% of the cytosolic proteins interact with GroEL under normal growth conditions (Ewalt et al., 1997; Houry et al., 1999). The existence of a subset(s) of E.coli proteins that require the GroE system for folding raises the following two questions. The first concerns the mechanism by which GroEL recognizes its substrates. In other words, what are the features that confer binding to GroEL that distinguish members of the set of GIP from other E.coli proteins that are non-binders? Given that the mobile loop of GroES and protein substrates share the same binding site in GroEL, it was suggested that one possible answer to this question is that members of the set of GIP contain sequence motifs similar to that of the GroES mobile loop (Chaudhuri and Gupta, 2005; Stan et al., 2005, 2006). The second question concerns identifying the properties of members of the set of GIP that render them GroE-dependent. In this article, we focus on this second question. It should be mentioned, however, that the two questions are not entirely unrelated. For example, slow folding may cause a protein to be prone to aggregation and, thus, GroE-dependent and, at the same time, facilitate binding to GroEL.
The work described here involves DNA and protein sequence analysis of the set of GIP identified by Kerner et al. (2005). This set was partitioned into three classes: (i) class I substrates that are assisted by GroE but do not require it; (ii) class II substrates that require both GroEL and GroES at 37°C but do not require GroES at 25°C and (iii) class III substrates that require the GroE-system for folding in a stringent manner (Kerner et al., 2005). This partitioning enabled us to compare GroEL binders and non-binders at a higher resolution by analyzing differences also between the three classes. Our results show that GroEL binders have, on average, relatively low folding propensities but higher translation efficiencies as compared with all other E.coli proteins (i.e. non-binders). These results were confirmed by showing that these features are also present in the set of GIP identified by Chapman et al. (2006) but absent in the proteins homologous to the set of GIP in Ureaplasma urealyticum, an organism that lacks a chaperonin system.
| 2 METHODS |
|---|
Differences between sequence-based features of the set of GIP and all other E.coli proteins were evaluated by comparing the mean value for the set of GIP to the distribution of the mean values of 1000 sets of randomly selected E.coli proteins (drawn from a set of 3444 proteins that excludes GIP and their homologues), each with the same number of proteins as the set of GIP. Other comparisons, such as between the separate classes of GIP or the GIP homologues in U.urealyticum and all other E.coli proteins, were carried out in a similar way. A P-value of < 0.001 was assigned in cases where the mean value of the examined set (e.g. that of GIP) was more extreme than (or equal to) the mean values of all the 1000 sets of randomly selected proteins. It is important to note that, for all the features examined, the values for individual members of the set of GIP were not found to be significantly separable in space from the values for members of the random sets and, therefore, only the mean values for each feature are compared. In addition, the Wilcoxon rank sum test was used to evaluate whether the sample (GIP, U.urealyticum homologues or the different classes) and the rest of the E.coli proteome have equal medians.
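The resampling comparison described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the fixed random seed and the choice to judge "more extreme" relative to the centre of the random means are assumptions made for illustration, and the Wilcoxon rank sum test is not reproduced here.

```python
import random
from statistics import mean


def count_more_extreme(test_values, background_values, n_sets=1000, seed=0):
    """Compare the mean of test_values with the means of random sets.

    Draws n_sets random subsets of background_values (without replacement),
    each the same size as test_values, and counts how many subset means are
    strictly more extreme than the observed mean. The side on which
    "extreme" is judged is taken from the centre of the random means
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = mean(test_values)
    size = len(test_values)
    random_means = [
        mean(rng.sample(background_values, size)) for _ in range(n_sets)
    ]
    centre = mean(random_means)
    if observed >= centre:
        n_more_extreme = sum(m > observed for m in random_means)
    else:
        n_more_extreme = sum(m < observed for m in random_means)
    return observed, n_more_extreme


def format_p(n_more_extreme, n_sets=1000):
    """Report '< 1/n_sets' when no random set is more extreme."""
    if n_more_extreme == 0:
        return f"< {1 / n_sets:g}"
    return f"{n_more_extreme / n_sets:g}"
```

With 1000 random sets, a feature for which no random set is more extreme than the examined set is reported as `< 0.001`, which matches the convention stated above.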
The full genome and proteome sequences of the K12 strain of E.coli (corresponding to accession number NC_000913.1) were downloaded from the Refseq database (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12/) (Riley et al., 2006). The full genome and proteome sequences of U.urealyticum (Glass et al., 2000) were downloaded from the PEDANT database (http://pedant.gsf.de/) (Frishman and Mewes, 1997). The sequences of GIP reported by Kerner et al. (2005) were also downloaded from the PEDANT database (http://pedant.gsf.de/links.jsp). This data set contains 252 proteins that are divided into classes I–III comprising 38, 126 and 84 proteins, respectively, and 4 proteins whose class was not determined. The protein P00810 (that belongs to class I) was excluded from our analysis as it is a β-lactamase that is not encoded by the E.coli K12 genome. The GIP set of Chapman et al. (2006) was built using the accession-number list provided in their supporting information. This data set contains 317 proteins of which 136 appear also in the set of Kerner et al. (2005). The other 181 proteins found only by Chapman et al. (2006) constitute a set that is used to confirm our findings for the set of Kerner et al. (2005). In the analysis, we use only 180 sequences (instead of 181) since protein YP_026165.1 was not found in the proteome of the K12 strain of E.coli (accession number NC_000913.1). Identification of proteins in the E.coli and U.urealyticum proteomes that are homologous to proteins in the GIP set of Kerner et al. (2005) was carried out using BLAST (Altschul et al., 1990). All proteins found to have an E-value smaller than 0.1 were assigned as GIP homologues.
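The E-value cut-off used to assign homologues can be sketched as follows. The sketch assumes that the BLAST hits are available as tab-separated lines in the common 12-column tabular layout (query, subject, ..., E-value, bit score); the text does not state which output format was used, and the function names and identifiers below are hypothetical.

```python
def parse_hits(lines):
    """Yield (query_id, subject_id, evalue) from 12-column tabular BLAST lines.

    Assumes the E-value is in the 11th column; blank lines and lines
    starting with '#' are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield fields[0], fields[1], float(fields[10])


def gip_homologues(lines, max_evalue=0.1):
    """Return the subject proteins with at least one hit below max_evalue."""
    return {
        subject
        for _query, subject, evalue in parse_hits(lines)
        if evalue < max_evalue
    }
```

Proteins returned by `gip_homologues` would then be removed from the pool from which the random sets are drawn.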
The tRNA adaptation index (tAI) was calculated using the codonM script and tAI.R program provided by dos Reis et al. (2004). The tAI is a measure of tRNA usage by the coding sequences. Each codon is assigned a weight based on the gene copy numbers of all the tRNA that recognize it. This measure relies on the fact that the tRNA gene copy number has a high correlation with the tRNA abundance within the cell and with the codon preferences in the genomes (Ikemura, 1981a, b). The tAI values are in the range of 0–1 where high values correspond to high translation efficiency. The tRNA copy numbers were taken from the genomic tRNA database (Lowe and Eddy, 1997) (http://lowelab.ucsc.edu/GtRNAdb/).
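The final step of the tAI calculation, the geometric mean of the per-codon weights over a coding sequence, can be sketched in Python as follows. This is not the codonM script or the tAI.R program: the derivation of the weights from tRNA gene copy numbers is not reproduced, the weights are assumed to be already scaled to the range 0–1, and the function name and the handling of codons without a weight are assumptions of this sketch.

```python
import math


def tai(cds, weights):
    """Geometric mean of per-codon weights over a coding sequence.

    cds     -- nucleotide string read in frame from its first base
    weights -- dict mapping codons (e.g. 'GCT') to a weight in (0, 1]
    Codons with no entry in weights (for example stop codons) are skipped,
    as is an incomplete trailing codon.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no codon in the sequence has a weight")
    return math.exp(sum(logs) / len(logs))
```

Because the geometric mean is used, a few codons with very low weights lower the index of a gene more than they would under an arithmetic mean.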
The FoldIndex (Prilusky et al., 2005) measure was calculated for each protein using the server (http://bip.weizmann.ac.il/fldbin/findex). This measure is based on a linear equation proposed by Uversky et al. (2000) that separates folded proteins from disordered ones according to the absolute value of their net charge and average hydrophobicity. The values of this measure are in the range of –1 to 1 where larger values correspond to a higher probability that the protein is able to fold. Here, we use FoldIndex as a measure of folding propensity.
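For orientation, a whole-sequence version of this measure can be sketched in Python. The sketch assumes the published form of the FoldIndex equation, 2.785⟨H⟩ − |⟨R⟩| − 1.151, with ⟨H⟩ the mean Kyte-Doolittle hydrophobicity rescaled to 0–1 and ⟨R⟩ the mean net charge counting Lys and Arg as +1 and Asp and Glu as −1. These coefficients and conventions should be checked against Prilusky et al. (2005) and Uversky et al. (2000); the server's sliding-window analysis is not reproduced and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def fold_index(sequence):
    """Whole-sequence FoldIndex-style score (positive suggests folded)."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    # Rescale hydropathy from [-4.5, 4.5] to [0, 1] before averaging.
    mean_hydrophobicity = sum(
        (KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in residues
    ) / n
    positive = sum(aa in "KR" for aa in residues)
    negative = sum(aa in "DE" for aa in residues)
    mean_net_charge = (positive - negative) / n
    return 2.785 * mean_hydrophobicity - abs(mean_net_charge) - 1.151
```

A hydrophobic, uncharged sequence scores high and a highly charged, hydrophilic one scores low, which is the direction of the folding-propensity scale used here.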
| 3 RESULTS AND DISCUSSION |
|---|
The means of the folding propensity (Prilusky et al., 2005), translation efficiency (dos Reis et al., 2004) and GC content of the set of GIP (Kerner et al., 2005) were found to be significantly different from the respective means of sets of randomly selected E.coli proteins (Fig. 1 and Table 1). Similar results were obtained when classes II and III of the GIP were compared individually with the random sets. In the case of class I, however, only the mean value of the translation efficiency was found to be significantly different from the respective means of the random sets (Table 1). Importantly, similarly significant results were obtained for all three features when the full set of GIP of Chapman et al. (2006), or the subset of GIP found by Chapman et al. (2006) that is not included in the set of Kerner et al. (2005), were compared to the random sets (Table 1). All the differences were found to be significant also according to the Wilcoxon rank sum test except for the GC content of the subset of GIP found by Chapman et al. (2006) but not by Kerner et al. (2005), in which case we were not able to reject the hypothesis that its median value is equal to that of the rest of the E.coli proteome. It should be noted that other properties we examined, such as contact order (Plaxco et al., 1998) and α-helix and β-strand secondary structure content, were found not to distinguish between GIP and other E.coli proteins (data not shown).
The mean folding propensity (as measured by the FoldIndex) of the different sets of GIP was found to be lower than those of the random sets (Table 1). This difference is, as expected, less significant in the case of class I proteins that are assisted by GroEL in folding but do not require it. In contrast, the mean tAI values of all the GIP sets were found to be significantly higher than the means of the random sets. Taken together, our results indicate that in vivo substrates of GroEL are characterized by low folding propensities and high translation efficiencies. These observations can be rationalized by assuming that proteins that fold slowly but are synthesized rapidly are more aggregation prone and, as a result, are chaperonin-dependent.
A negative control for the above findings is provided by analysis of the proteins that are homologous to GIP in U.urealyticum, an organism that lacks the GroE system. The full U.urealyticum genome encodes 613 proteins of which 21, 37 and 28 are homologous to members of classes I, II and III, respectively, in Kerner et al. (2005) (here only the best hits were considered to be homologous). The non-GIP set that was used for generating the random sets contains all U.urealyticum proteins except for those that are homologous (best hit or otherwise) to the set of GIP in E.coli (i.e. 502 proteins). It may be seen in Table 1, Figure 2 and the Supplementary Material that the set of U.urealyticum proteins homologous to GIP in E.coli has a mean FoldIndex value that is similar to the means of the random sets. Moreover, the mean tAI value of this set is significantly smaller than the mean values of the random sets instead of being larger as found in the case of the GIP set of E.coli. These findings support our conclusion that GIP differ, on average, from other E.coli proteins in their folding propensity and translation efficiency as these differences are absent when a similar comparison is made for an organism that lacks the GroE system. The set of GIP also differs from that of other E.coli proteins in its mean GC content but a similar difference is found in U.urealyticum and this feature is, therefore, not considered further.
It may be seen in Figure 2 that the GIP set of E.coli has distinctive properties as a group and is clearly separable from the random sets (Fig. 2a) whereas the GIP homologues in U.urealyticum are indistinguishable from the other proteins in this organism (Fig. 2b). A possible explanation for this observation is that two independent evolutionary processes that compensate for the absence of a chaperonin system occurred in U.urealyticum. These processes led to (i) increased folding propensities (as reflected in higher FoldIndex values) and (ii) reduced translation rates (as reflected in lower tAI values) that may cause folding to occur in a more efficient co-translational manner. The weak (r ~ 0.3) negative correlation in Figure 2b between the translation efficiency and FoldIndex value is consistent with the above explanation as it may reflect the subset of U.urealyticum proteins with a potential problem in folding that is circumvented by having both a high FoldIndex value and a low translation efficiency. In addition, it has been shown that there is a significant and positive correlation between the tAI values and the protein abundance levels in yeast (Man and Pilpel, 2007) and between the tAI values and the mRNA expression levels in E.coli (dos Reis et al., 2003). These correlations, if they also exist in U.urealyticum, may indicate that the significant reduction in the tAI values of the GIP homologues in this organism reflects a tendency to reduce the production of proteins that require a chaperonin system. This line of reasoning is also supported by the observation that a very high anti-correlation was found for humans between mRNA expression levels and aggregation rates (Tartaglia et al., 2007). In conclusion, we have identified sequence-based characteristics related to folding propensity and translation efficiency that significantly differentiate GIP in E.coli from the rest of the E.coli proteome. These characteristics, although reflected only in the mean values of the sets and not in each protein separately, highlight the relationship between the folding properties of a protein and its chaperonin dependence. Further analysis is required in order to reveal the mechanism by which GroEL recognizes these substrates.
| ACKNOWLEDGEMENTS |
|---|
This work was supported by grants to A.H. from the Israel Science Foundation (67/05) and the Kimmelman Center for Macromolecular Assembly.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Thomas Lengauer
Received on July 10, 2007; revised on August 28, 2007; accepted on October 8, 2007
| REFERENCES |
|---|
Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.
Aoki K, et al. GroEL binds artificial proteins with random sequences. J. Biol. Chem (2000) 275:13755–13758.
Chapman E, et al. Global aggregation of newly translated proteins in an Escherichia coli strain deficient of the chaperonin GroEL. Proc. Natl Acad. Sci. USA (2006) 103:15800–15805.
Chaudhuri TK, Gupta P. Factors governing the substrate recognition by GroEL chaperone: a sequence correlation approach. Cell Stress Chaperones (2005) 10:24–36.
dos Reis M, et al. Unexpected correlations between gene expression and codon usage bias from microarray data for the whole Escherichia coli K-12 genome. Nucleic Acids Res (2003) 31:6976–6985.
dos Reis M, et al. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res (2004) 32:5036–5044.
Ewalt KL, et al. In vivo observation of polypeptide flux through the bacterial chaperonin system. Cell (1997) 90:491–500.
Fayet O, et al. The groES and groEL heat shock gene products of Escherichia coli are essential for bacterial growth at all temperatures. J. Bacteriol (1989) 171:1379–1385.
Frishman D, Mewes HW. PEDANTic genome analysis. Trends Genet (1997) 13:415–416.
Glass JI, et al. The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature (2000) 407:757–762.
Horovitz A, Willison KR. Allosteric regulation of chaperonins. Curr. Opin. Struct. Biol (2005) 15:646–651.
Horwich AL, et al. Two families of chaperonin: physiology and mechanism. Annu. Rev. Cell Dev. Biol (2007) 23:115–145.
Houry WA, et al. Identification of in vivo substrates of the chaperonin GroEL. Nature (1999) 402:147–154.
Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol (1981a) 146:1–21.
Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol (1981b) 151:389–409.
Kerner MJ, et al. Proteome-wide analysis of chaperonin-dependent protein folding in Escherichia coli. Cell (2005) 122:209–220.
Lorimer GH. A quantitative assessment of the role of the chaperonin proteins in protein folding in vivo. FASEB J (1996) 10:5–9.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res (1997) 25:955–964.
Man O, Pilpel Y. Differential translation efficiency of orthologous genes is involved in phenotypic divergence of yeast species. Nat. Genet (2007) 39:415–421.
Plaxco KW, et al. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol (1998) 277:985–994.
Prilusky J, et al. FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics (2005) 21:3435–3438.
Riley M, et al. Escherichia coli K-12: a cooperatively developed annotation snapshot-2005. Nucleic Acids Res (2006) 34:1–9.
Stan G, et al. Identifying natural substrates for chaperonins using a sequence-based approach. Protein Sci (2005) 14:193–201.
Stan G, et al. Residues in substrate proteins that interact with GroEL in the capture process are buried in the native state. Proc. Natl Acad. Sci. USA (2006) 103:4433–4438.
Tartaglia GG, et al. Life on the edge: a link between gene expression levels and aggregation rates of human proteins. Trends Biochem. Sci (2007) 32:204–206.
Uversky VN, et al. Why are natively unfolded proteins unstructured under physiologic conditions? Proteins (2000) 41:415–427.
Viitanen PV, et al. Purified chaperonin 60 (groEL) interacts with the nonnative states of a multitude of Escherichia coli proteins. Protein Sci (1992) 1:363–369.