Bioinformatics Advance Access originally published online on January 18, 2005
Bioinformatics 2005 21(9):1891-1900; doi:10.1093/bioinformatics/bti266
Prediction of unfolded segments in a protein sequence based on amino acid composition
Yeast Structural Genomics, IBBMC, Bat 430, Université Paris-Sud 91405 Orsay Cedex, France
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Partially and wholly unstructured proteins have now been identified in all kingdoms of life, more commonly in eukaryotic organisms. This intrinsic disorder is related to certain critical functions. Apart from their fundamental interest, unstructured regions in proteins may prevent crystallization. Therefore, the prediction of disordered regions is important for the understanding of protein function, and may also help to devise genetic constructs.
Results: In this paper we present a computational tool for the detection of unstructured regions in proteins based on two properties of unfolded fragments: (1) disordered regions have a biased composition and (2) they usually contain either small or no hydrophobic clusters. In order to quantify these two facts we first calculate the amino acid distributions in structured and unstructured regions. Using this distribution, we calculate for a given sequence fragment the probability to be part of either a structured or an unstructured region. For each amino acid, the distance to the nearest hydrophobic cluster is also computed. Using these three values along a protein sequence allows us to predict unstructured regions, with very simple rules. This method requires only the primary sequence, and no multiple alignment, which makes it an adequate method for orphan proteins.
Availability: http://genomics.eu.org/
Contact: Anne.Poupon{at}ibbmc.u-psud.fr
| 1 INTRODUCTION |
|---|
Unstructured regions are gaining increased attention since it now clearly appears that many functionally important protein regions are disordered (Dunker et al., 2000, 2002). These disordered segments contain information concerning protein function and folding pathways (Plaxco and Gross, 2001; Verkhivker et al., 2003). Structured proteins constituted by globular units perform their functional role through binding pockets, active sites and interaction surfaces. In contrast, disordered segments often contain short peptide motifs (protein ligands e.g. SH3 ligands, DNA recognition or affinity/specificity modulation and targeting signals, post-translational modification sites), which are the support for protein function. Recently, largely disordered proteins have been termed intrinsically disordered proteins (IDPs, also named IUPs for intrinsically unstructured proteins). These proteins are either completely disordered or contain large disordered regions in their native state. More than 100 such proteins are registered, of which Tau, Prions, Bcl-2, p53, 4E-BP1 and eIF1A (Tompa, 2002; Uversky and Fink, 2002) are a few important examples. A recent survey classified their functions into four categories: molecular recognition, molecular assembly/disassembly, protein modification and entropic chains (Dunker et al., 2002). These disordered regions lack a defined three-dimensional (3D) structure in their native states, and frequently undergo disorder-to-order transitions upon binding to their partners, e.g. the CREB-CBP complex (Radhakrishnan et al., 1997), or upon changes in environmental or cellular conditions (Dunker et al., 2001, 2002; Uversky and Fink, 2002). Their unstructured conformation would, in principle, allow these proteins to have more partners and modification sites (Tompa, 2002; Liu et al., 2002) and to ensure large intermolecular interfaces even in small proteins (Gunasekaran et al., 2003).
Moreover these disordered segments could play a regulating role through several relatively low-affinity linear interaction sites (Evans and Owen, 2002). Clearly, they play a key role in diseases mediated by protein misfolding and aggregation (Bates, 2003; Kaplan et al., 2003; Schweers et al., 1994).
Protein disorder can be directly studied by NMR or circular dichroism, or indirectly detected by a variety of experimental methods including stretches of missing electron density in X-ray crystallography maps, Raman spectra, hydrodynamic measurements or even limited, time-resolved proteolysis (Dunker et al., 2001; Smyth et al., 2001). Each one of these methods detects different aspects of disorder resulting in different operational definitions of protein disorder.
During the target selection process in structural genomics, intrinsic protein disorder must be considered since it often leads to difficulties in protein expression, purification and crystallization. Thus, it can be rewarding to scan the sequence for potentially flexible segments so that they can be removed in the genetic constructs used for expression. This is particularly true for large proteins that are usually composed of relatively small, stable structural domains. Here we present a computational tool that helps discern ordered globular domains from disordered regions (called linkers).
Most available computational methods for detecting linkers rely on multiple sequence alignments [MaxHom (Przybylski and Rost, 2002)] with proteins of known 3D structure, protein domains [PRODOM (Servant et al., 2002), Pfam (Bateman et al., 2004), SMART (Letunic et al., 2004) or CDD (Marchler-Bauer and Bryant, 2004)], functional motifs [PROSITE (Hulo et al., 2004)] and signal peptide cleavage sites [SignalP (Dyrløv Bendtsen et al., 2004)]. Among the many methods for sequence searching, the BLAST (McGinnis and Madden, 2004) and the HMMER (Wistrand and Sonnhammer, 2004) suites of programs are by far the most frequently used. These methods cannot be used to detect domains within proteins that have no detectable homologs. Several methods that predict ordered/disordered regions from the sequence alone have been developed. One of the first prediction tools divided sequences into regions of low and high complexity (Wootton, 1994). Low-complexity regions are compositionally biased regions, which often form separate domains within multidomain proteins. They are rarely defined in protein 3D structures (Saqi and Sternberg, 1997) and often coincide with linker sequences. Programs to define low-complexity regions, such as SEG (Wootton, 1994) and CAST (Promponas et al., 2000), are often used for this purpose. The regions detected by these programs consist of long stretches of repeated residues, particularly proline, glutamine, serine or threonine. Although many such regions are structurally disordered, the correlation is far from perfect, as regions of low sequence complexity are not always disordered (and vice versa; Romero et al., 2001). Moreover, disordered regions are not necessarily repetitive.
An alternative approach for the study of disordered/unstructured regions is based on the concept of hot loops, i.e. coils with high temperature factors [DisEMBL (Linding et al., 2003a)], and some prediction tools have been developed using complexity as one of the descriptors (Romero et al., 1997). Some prediction tools use specific experimental NMR or X-ray crystallographic data. These tools proved to be very powerful for the prediction of secondary structure [NORSp (Liu and Rost, 2003), PONDR (Li et al., 1999)], solvent accessibility, as well as location and topology of transmembrane helices [PHDsuite (Rost and Liu, 2003), TMPred (Hofmann and Stoffel, 1993)], or the prediction of globularity [GLOBE (Rost and Liu, 2003), GlobPlot (Linding et al., 2003b) or foldIndex (Uversky et al., 2000)].
More recently, methods using neural networks (Obradovic et al., 2003) or position specific score matrices (Jones and Ward, 2003; Ward et al., 2004) have been proposed, and were evaluated for the first time in the CASP5 experiment (Melamud and Moult, 2003). These methods give far better results than the preceding ones, but still rely on very limited data on unfolded segments. Finally, a study recently published (Weathers et al., 2004) shows that a reduced amino acid alphabet is a good predictor of the disorder in protein sequences, even while using only four amino acid classes.
Hydrophobic cluster analysis (HCA) is a low-identity sequence alignment method (Callebaut et al., 1997) that makes use of a two-dimensional (2D) helical representation of protein sequences; it clearly illustrates that hydrophobic clusters are statistically centered on secondary structures. This procedure allows an overall appreciation of many structural features at a glance, including secondary structure segment and linkers. The HCA method requires a lot of visual inspection and experience making it non-automatic and time consuming. Moreover, although the ability of HCA to detect linkers was empirically noticed, it was never proven statistically (Callebaut et al., 1999). Nevertheless, manual use of HCA did lead to the ideas used in our present automatic method.
In this paper, we present an automatic easy-to-use procedure allowing the prediction of highly flexible unstructured linker regions from a primary sequence. The reasons why HCA is able to detect unstructured regions are: (1) the unstructured regions contain either no or very small hydrophobic clusters, and (2) the frequencies of occurrence of the 20 amino acids are different in the structured and the unstructured regions. We statistically quantified these two properties, and derived a prediction method.
We constructed a reference set of structured fragments using the Protein Data Bank (PDB), and also a reference set of mostly unstructured fragments by combining PDB with Swiss-Prot. This allowed us to select unstructured regions inside domains (which will be called inner fragments), and sequence fragments at the periphery of these domains. The frequencies of each amino acid in the unstructured and the structured regions have been calculated using the two reference sets. From this we calculate, for a window centered on amino acid i, two probabilities PSi and PLi, which are the probabilities of occurrence of this window in structured and linker regions, respectively. For each amino acid, i, we compute its distance to the nearest hydrophobic cluster, and the ratio R = PLi/PSi. Use of these two values along the protein sequence together with very simple rules is shown to successfully assign short stretches of amino acids to structured or linker regions.
| 2 SYSTEMS AND METHODS |
|---|
2.1 Constitution of the reference sets
We create a linker set (L) by aligning the PDB protein sequences (version of January 2002, only residues which have coordinates are considered as a part of the PDB sequence) with their corresponding Swiss-Prot protein sequence (Fig. 1) and extracting the non-aligned fragments automatically. Here Linker refers to unstructured region, meaning that no 3D coordinates are available for these residues. The resulting segments are classified by their position relative to the PDB sequence as N-terminal, C-terminal or inner fragments. From a total of 31 496 chains constituting the entire PDB (19 699 files), 24 686 chains (corresponding to 13 155 pdb files) have been aligned against 4258 Swiss-Prot sequences.
We extracted from L a subset named U10, which consists of the last ten residues of N-terminal fragments (excluding methionine 1) and the first ten residues of C-terminal fragments. We introduced this limitation to avoid, as much as possible, having structured fragments in the set. Indeed, sequence segments immediately neighboring a structured domain are very often linkers, because they separate structured domains. However, these linkers might be very short, and if the flanking sequence is long, it might contain another structured domain. The U10 set contains 4707 fragments, of which 1919 are C-terminal and 2538 are N-terminal, for a total number of 35 438 residues.
The reference structured set (S) is the ensemble of PDB sequences. Amino acid frequencies in structured and linker regions were computed using the two sets S and U10. The inner set was only used for evaluation.
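The trimming step that produces U10 can be illustrated with a minimal Python sketch. The function name and the input format (plain one-letter sequences, with each N-terminal fragment assumed to start at residue 1 of the Swiss-Prot sequence) are assumptions made for illustration; this is not the original C implementation.

```python
def u10_fragments(n_terminal_fragments, c_terminal_fragments, size=10):
    """Trim flanking fragments to the residues adjacent to the structured domain.

    Assumption: each N-terminal fragment starts at residue 1 of the full
    sequence, so a leading 'M' is the initiator methionine.
    """
    trimmed = []
    for fragment in n_terminal_fragments:
        # Exclude methionine 1, then keep the last `size` residues,
        # i.e. those closest to the structured domain.
        if fragment.startswith("M"):
            fragment = fragment[1:]
        if fragment:
            trimmed.append(("N", fragment[-size:]))
    for fragment in c_terminal_fragments:
        # Keep the first `size` residues, i.e. those closest to the domain.
        if fragment:
            trimmed.append(("C", fragment[:size]))
    return trimmed


if __name__ == "__main__":
    print(u10_fragments(["MSTNPKPQRKTKRNT"], ["GSGSGSGSGSGSAAA"]))
```

Fragments shorter than ten residues are kept whole in this sketch; whether the original procedure did the same is not stated.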
2.2 Ratio and probabilities of occurrence
To compute the probability of occurrence of a given sequence in structured and linker regions, we consider the occurrence of each amino acid as an independent event. This is only an approximation, since the succession of amino acids is not random. However, this is sufficient for our purpose, and there is no simple model available to establish a probabilistic relationship between the type of one residue and that of its neighbors.
With this hypothesis, the probabilities of occurrence PL and PS of a given sequence in linker and structured regions, respectively, are calculated using a multinomial law:

$$P_L = \frac{N!}{\prod_{a} n_a!}\,\prod_{a}\left(f_a^{L}\right)^{n_a}, \qquad P_S = \frac{N!}{\prod_{a} n_a!}\,\prod_{a}\left(f_a^{S}\right)^{n_a}$$

where N is the number of residues in the sequence, $n_a$ is the number of occurrences of amino acid a in the sequence, and $f_a^{L}$ and $f_a^{S}$ are the frequencies of amino acid a in the linker and structured reference sets. For example, $(f_V^{L})^{n_V}$ and $(f_V^{S})^{n_V}$ are the probabilities of occurrence of $n_V$ valines in a linker sequence and in a structured sequence, respectively. In order to evaluate, for each sequence, whether it is more likely to be structured or unstructured, we took the ratio of these two probabilities, R = PL/PS. Theoretically, this ratio should be >1 when the sequence is unfolded, because in this case the probability of appearance of this particular sequence in a linker region, PL, is higher than its probability of appearance in a structured region, PS. Conversely, if the sequence is structured, the ratio R should be <1. To evaluate these probabilities and the ratio along biological sequences, we use a fixed-size sequence window of 21 residues centered on the residue concerned: the probabilities of appearance of that particular window in structured or linker regions, and the ratio R of these probabilities, are assigned to the central amino acid of the window. Different widths for the sequence window have been tried (11, 15, 21 and 25 residues). The best results were obtained with 21 residues.
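Because the multinomial coefficient depends only on the residue counts in the window, it is identical in PL and PS and cancels in the ratio, so R reduces to a product of per-residue frequency ratios. The following Python sketch computes R along a sequence on that basis. The pseudocount, the truncation of windows at the sequence ends and the treatment of non-standard residues are assumptions not specified in the text, and the frequency tables in the usage line are toy values, not the published ones.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequencies(sequences, pseudocount=1):
    """Relative amino acid frequencies in a reference set.

    The pseudocount (an assumption) avoids zero frequencies in small sets.
    """
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for sequence in sequences:
        for aa in sequence:
            if aa in counts:
                counts[aa] += 1
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def window_ratios(sequence, f_linker, f_struct, window=21):
    """Ratio R = PL/PS of the window centered on each residue.

    Windows are truncated at the sequence ends (an assumption); residues
    outside the 20 standard amino acids contribute a neutral factor of 1.
    """
    half = window // 2
    log_terms = [
        math.log(f_linker[aa] / f_struct[aa]) if aa in f_linker else 0.0
        for aa in sequence
    ]
    ratios = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half), min(len(sequence), i + half + 1)
        # Summing logs avoids underflow when multiplying many small numbers.
        ratios.append(math.exp(sum(log_terms[lo:hi])))
    return ratios


if __name__ == "__main__":
    f_l = frequencies(["SSPP"])  # toy linker set, for illustration only
    f_s = frequencies(["VVII"])  # toy structured set, for illustration only
    print(window_ratios("SSSVVV", f_l, f_s, window=3))
```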
2.3 Cluster distance
Sequences were coded into ternary code (Callebaut et al., 1997): 1 for hydrophobic residues (VILFMYW), 2 for proline and 0 for other amino acids. A hydrophobic cluster begins and ends with either 0000 or a proline (2); the cluster itself is constituted by a string of 1s and 0s, but no 2s, with a maximum of three consecutive 0s. For example, the sequence AGEKTISVVLQLEKEEQ, which contains no proline, corresponds to the code 00000101110100000. The pattern 1011101 is framed by at least four amino acids other than the seven strong hydrophobic ones, hence the identification of 1011101 as a hydrophobic cluster corresponding to the sequence ISVVLQL. We discarded clusters shorter than 2 amino acids. For the amino acid in position i, we define the cluster distance as the distance to the closest cluster; the cluster distance is set to 0.5 when i is inside a cluster.
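The coding and cluster definition above can be sketched in Python with a regular expression: a cluster is a run that starts and ends with 1 and never contains more than three consecutive 0s or any 2. Measuring the distance to the nearest cluster edge, and returning infinity when a sequence has no cluster, are assumptions; the text does not specify either point.

```python
import re

HYDROPHOBIC = set("VILFMYW")
# A cluster starts with 1 and is extended by "up to three 0s followed by a 1".
CLUSTER_PATTERN = re.compile(r"1(?:0{0,3}1)*")


def ternary_code(sequence):
    """1 for hydrophobic residues (VILFMYW), 2 for proline, 0 otherwise."""
    return "".join(
        "1" if aa in HYDROPHOBIC else "2" if aa == "P" else "0"
        for aa in sequence
    )


def hydrophobic_clusters(sequence, min_length=2):
    """Half-open (start, end) index pairs of clusters of at least min_length."""
    code = ternary_code(sequence)
    return [
        (m.start(), m.end())
        for m in CLUSTER_PATTERN.finditer(code)
        if m.end() - m.start() >= min_length
    ]


def cluster_distances(sequence, min_length=2):
    """Distance of each residue to the nearest cluster (0.5 inside a cluster)."""
    clusters = hydrophobic_clusters(sequence, min_length)
    distances = []
    for i in range(len(sequence)):
        best = float("inf")  # assumption: no cluster at all gives infinity
        for start, end in clusters:
            if start <= i < end:
                best = 0.5
                break
            gap = start - i if i < start else i - (end - 1)
            best = min(best, gap)
        distances.append(best)
    return distances


if __name__ == "__main__":
    seq = "AGEKTISVVLQLEKEEQ"
    print(ternary_code(seq))
    print([seq[s:e] for s, e in hydrophobic_clusters(seq)])
    print(cluster_distances(seq))
```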
The software is accessible at the address http://genomics.eu.org. The interface is written in PHP, the software is written in C.
| 3 IMPLEMENTATION |
|---|
3.1 Amino acid frequencies in structured and linker sets
One of the major limits of many disorder prediction methods is the absence of an extensive database of linker fragments. For example, the support vector machine on which DISOPRED2 (Melamud and Moult, 2003) is based was trained using 750 protein chains containing linkers. The total number of disordered residues in their set was 4590, whereas we have collected 35 438. Hence we first created an extensive set of linkers, U10.
Amino acid frequencies computed on the U10 set and on the PDB set are significantly different except for three amino acids (Fig. 2): arginine (R), cysteine (C) and glycine (G) (a χ² test with a confidence of 0.95 has been applied) (Fig. 3). Strong hydrophobic amino acids V, I, L, F and moderately hydrophobic amino acids M, Y, W are more frequently present in structured regions (the PDB set) than in linker regions (the U10 set). The differences are particularly marked for V, I, F, Y and W. All the other amino acids, except histidine (H), asparagine (N) and aspartic acid (D), are more common in the U10 set, especially alanine (A), serine (S) and proline (P). These frequencies clearly confirm our first hypothesis that the distributions of amino acids in structured and linker regions are different.
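One way such a comparison could be carried out is a 2×2 chi-square test per amino acid (this residue versus all others, in the structured set versus the linker set). The text does not describe the exact test design, so the Python sketch below is an assumption, and the counts in the usage line are invented for illustration. The critical value 3.841 corresponds to one degree of freedom at the 0.95 confidence level.

```python
CHI2_CRITICAL_1DF_95 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a, b: counts of the amino acid and of all other residues in one set;
    c, d: the same two counts in the other set.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator


def differs_significantly(count_in_s, total_s, count_in_u, total_u):
    """True if the residue frequency differs between the two sets at 0.95."""
    statistic = chi2_2x2(
        count_in_s, total_s - count_in_s, count_in_u, total_u - count_in_u
    )
    return statistic > CHI2_CRITICAL_1DF_95


if __name__ == "__main__":
    # Invented counts: 30 of 100 residues in one set, 10 of 100 in the other.
    print(chi2_2x2(30, 70, 10, 90), differs_significantly(30, 100, 10, 100))
```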
3.2 Probabilities of appearance in structured and linker regions
Using our approach, we can compute for any sequence the probability PL or PS to occur in a linker or structured region. For benchmarking, we computed these probabilities and ratios, R = PL/PS, on proteins with known 3D structures. This should give us a threshold for the ratio, above which we do not detect any structured sequence in the PDB. We gathered all the sequences that gave a probability ratio >30, 20, 10 or 5, for at least 10 consecutive residues (Fig. 3a). The results in Table 1 show that defining as linker segments longer than 10 residues with a ratio >10 gives very few false positives. This led us to the first rule: if 10 or more consecutive residues have probability ratios >10, they are predicted as unstructured. A ratio 10 threshold leads also to many false negatives in that many linkers are not identified. Lowering this threshold to a value of 5 gives far fewer false negatives, but increases the number of false positives (segments predicted as linker whereas they are structured).
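Scanning for stretches of at least 10 consecutive residues above a given ratio threshold can be sketched in Python as follows. The function name and the toy ratio profile are illustrative only; they are not taken from the original software.

```python
def runs_above(values, threshold, min_length=10):
    """Half-open (start, end) index pairs of runs where value > threshold.

    Only runs of at least `min_length` consecutive residues are reported.
    """
    runs = []
    start = None
    for i, value in enumerate(values):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(values) - start >= min_length:
        runs.append((start, len(values)))
    return runs


if __name__ == "__main__":
    # Toy profile: one 10-residue peak and one 9-residue peak at ratio 12.
    ratios = [1] * 5 + [12] * 10 + [1] * 5 + [12] * 9
    for threshold in (30, 20, 10, 5):
        print(threshold, runs_above(ratios, threshold))
```

Comparing the segments returned for the different thresholds on chains of known structure is what allows a cutoff to be chosen.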
3.3 Using hydrophobic clusters
Sequence analysis of both true and false positives appearing when lowering the ratio from 10 to 5, shows that the linker segments have fewer hydrophobic clusters than the structured ones. We therefore computed for each amino acid along the sequence, its distance along the sequence to the nearest hydrophobic cluster, which we call cluster distance. The linker regions are expected to have higher cluster distance values than the structured ones. We then computed, for each amino acid along the sequence, the product P = (probability) x (cluster distance). Analysis of P-values suggested a second rule: segments (longer than 10 residues) that have probability ratios between 5 and 10, and a product >10 are predicted as linker (Fig. 3b). Using this second rule led to 116 false positives (116 structured segments predicted as linker), on 31 496 protein chains tested. Reducing the product cutoff from 10 to 5 leads to 49 more false positives (Table 1).
To discover the origin of these remaining false positives, we classified them into nine categories (Table 1, Fig. 4). Categories A-H represent segments that are expected to be unstable under conditions different from the ones used for their structural study. For example, category C contains proteins for which the segment predicted as linker interacts with a ligand (nucleic acid, protein or small molecule). Most of these segments are unstructured in the absence of that ligand.
An example of a segment predicted as linker but which can be structured under certain conditions is given in Figure 5. The structure of the nucleosome core particle has been solved in the presence of DNA [PDB 1KX5 (Davey et al., 2002)] and in its absence [PDB 2HIO (Arents et al., 1991)]. This particle forms an octamer, and a linker segment is predicted for the first 30 residues of each subunit. This N-terminal segment is structured in 1KX5 and interacts with DNA, whereas it is unstructured in the absence of DNA.
The last category in Table 1 contains the real false positives: segments predicted as unstructured that are indeed structured, probably in any condition, but it has to be noted that this category, using the two defined rules, contains only six protein chains.
3.4 Estimation of the results obtained with rules 1 and 2 on the unfolded set
Predicting linker fragments on the PDB using the two defined rules led to very few false positives. This means that residues predicted as unstructured are rarely structured. We also have to evaluate the level of false negatives: how many of the linker fragments are predicted as linker.
All the fragments of the linker set (L) have been submitted for the detection. We made three different analyses, by predicting fragments as unfolded using:
- first rule only, with ratio 20 threshold,
- first rule only, with ratio 10 threshold and
- first rule with ratio 10 threshold and then second rule.
The results obtained for linkers longer than 30 residues are very satisfying: 59.9% of the N-terminal fragments, 70.5% of the C-terminal fragments and 61.1% of the inner fragments are correctly predicted as linker. It is difficult to evaluate what the optimum values should be, since they are indeed not all unstructured.
The results obtained for inner segments are particularly important, since these fragments were not included in the U10 set used for the calculation of the probabilities. Using the first two rules allows us, for example, to detect >60% of the long inner linker fragments. However, even if the amino acid composition of these fragments is very close to that of structured regions, they do not contain hydrophobic clusters. This led us to the third rule: a segment longer than 10 residues having a product P > 30 is predicted as unstructured (Fig. 3c).
When combining the three rules, 94.4% of the linker segments of length ≥30, 90% of the segments between 20 and 30 residues, 79.4% of the segments between 10 and 20 and even 53.2% of the segments shorter than 10 residues are predicted as unstructured. It has to be noted in the last case that at least 10 residues are predicted as unstructured, whereas fewer than 10 are really unstructured.
3.5 Exploring the CASP5 benchmark
During the CASP5 experiment, for the first time, predictions of disorder were made (Melamud and Moult, 2003). Thus, we tried our prediction method on the CASP5 targets and compared it with the results obtained by the other methods (Table 3). It has to be noted that the version of the PDB used for the statistical analysis does not contain the CASP5 targets.
The different methods were evaluated in CASP5 using two values: sensitivity and the fraction of false positives. A sensitivity of 1 means that there are no false negatives; a fraction of false positives of 0 means that there are no false positives. It can be seen in Table 3 that our method has both better sensitivity and better specificity (a lower fraction of false positives). For the comparative modeling targets, we even have optimal values: a sensitivity of 1 and a fraction of false positives of 0. This means that all the unstructured residues are detected and that all the residues predicted as unstructured are unstructured.
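For reference, the two values can be computed per residue as in the Python sketch below. It uses the usual definitions, sensitivity = TP/(TP + FN) and fraction of false positives = FP/(FP + TN); whether the CASP5 assessment used exactly these per-residue definitions is an assumption.

```python
def sensitivity_and_false_positive_fraction(predicted, observed):
    """Per-residue evaluation of a disorder prediction.

    predicted, observed: equal-length sequences of booleans,
    True meaning 'unstructured'.
    """
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if p and o:
            tp += 1
        elif p and not o:
            fp += 1
        elif o:
            fn += 1
        else:
            tn += 1
    # NaN is returned when a quantity is undefined (no positives/negatives).
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    fp_fraction = fp / (fp + tn) if fp + tn else float("nan")
    return sensitivity, fp_fraction


if __name__ == "__main__":
    observed = [True, True, False, False]
    print(sensitivity_and_false_positive_fraction(observed, observed))
```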
| 4 DISCUSSION |
|---|
The automated procedure presented here allowed us to constitute an extensive set, mostly of linker fragments. The frequencies of amino acid occurrences in structured and linker regions were deduced from this set and from the PDB. The results show significant differences for most amino acids. The analysis of the two distributions is in agreement with the observations of the many works on protein solubility and folding. Indeed, hydrophobic amino acids VILFMWY, known to constitute the inner core of regular secondary structures, are significantly more frequent in structured sequences. The amino acids with high propensities for loops (A, T, S and P), and those very exposed (Q, E and K) are found more frequently in linker than in structured regions. More surprising is the fact that the polar amino acids N and D have higher frequencies for structured fragments.
These frequencies are then used to compute, for a sequence window, the probabilities of occurrence in linker and structured segments. The ratio, R, between these two probabilities is then computed, and assigned to the amino acid central to the considered sequence window. For each amino acid, we also compute its distance to the closest hydrophobic cluster, the so-called cluster distance and the product P. From the analysis of the linker and structured sequences, we propose three prediction rules. A segment longer than 10 residues is predicted as linker if one of the following three conditions is fulfilled (Fig. 3):
- R is higher than 10,
- R is between 5 and 10, and P is higher than 10 or
- P is higher than 30.
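The three conditions can be combined in a short Python sketch. Two points are assumptions made for illustration: the product P is taken as R multiplied by the cluster distance, and residues satisfying any of the three conditions are pooled before segments are extracted (the text does not say whether a segment must satisfy a single rule throughout). The minimum segment length is left as a parameter.

```python
def is_linker_residue(ratio, cluster_distance):
    """Apply the three conditions to one residue.

    Assumption: P = R x cluster distance.
    """
    product = ratio * cluster_distance
    return (
        ratio > 10
        or (5 < ratio <= 10 and product > 10)
        or product > 30
    )


def predict_linker_segments(ratios, distances, min_length=10):
    """Half-open (start, end) index pairs of predicted linker segments."""
    flags = [is_linker_residue(r, d) for r, d in zip(ratios, distances)]
    segments = []
    start = None
    # A trailing False acts as a sentinel that closes a final open segment.
    for i, flag in enumerate(flags + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    return segments


if __name__ == "__main__":
    ratios = [12] * 10 + [1] * 5 + [2] * 12
    distances = [0.5] * 15 + [20] * 12
    print(predict_linker_segments(ratios, distances))
```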
4.1 Finding new rules
By exploring the false negatives (linker regions predicted as structured), it is often possible to see in the graphs of the ratio a distinguishable signal above the background, which does not fulfill any of our three rules. One aspect that should probably be taken into account is the general appearance of the graph of the ratio. The three examples shown in Figure 6 illustrate this fact. The first example consists of residues 155-715 of protein P04585 (HIV-1 reverse transcriptase), for which many structures can be found in the PDB. Most of these structures have many unstructured loops, which are shown on the x-axis by violet rectangles. Indeed, the P04585 protein, which has many loosely structured stretches distributed all along its sequence, presents a very fuzzy curve. The second example is the protein P05230 (human heparin-binding growth factor 1), which possesses one unstructured loop. On the probability ratio graph, one can see very clearly the peak corresponding to the unstructured loop, but it cannot be detected by any of our three rules. The third example, which is given for comparison, is a wholly structured protein (P01922: the human hemoglobin HBA2, PDB 1A00), for which no peak is observed above the background. These three graphs are typical of those we obtain for these three categories of proteins. Even if the curves obtained for the first two proteins appear different, it is difficult to build a rule that could differentiate between them. Predicting on such proteins will probably require use of other methods in conjunction with ours.
A final interesting example we want to consider is the CASP5 target T0145. This protein is completely unstructured (Melamud and Moult, 2003). Our method predicts for this protein two long unstructured stretches (Fig. 7) in the regions 64-96 and 108-200. As these two regions are separated by only 11 residues, they probably constitute one single linker segment. The regions predicted as being structured are short compared with the size of the protein (1-63 and 200-216), and can probably not fold. Thus the protein can be predicted as fully unstructured. This could be a fourth rule for our method when regions predicted as structured are short compared with adjacent unstructured ones: in a part of a protein where large unstructured and short structured regions are predicted, the whole part is most likely totally unstructured. Further studies are required in order to find the ratio above which this kind of event can occur.
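The merging step suggested by this example can be sketched in Python. The function and the `max_gap` parameter are hypothetical: the text establishes no threshold, and the value 11 is used below only because it is the gap observed for T0145. Coordinates are 1-based and inclusive, as in the text.

```python
def merge_close_segments(segments, max_gap):
    """Merge predicted linker segments separated by at most `max_gap` residues.

    segments: (start, end) pairs, 1-based inclusive residue numbers.
    max_gap: hypothetical cutoff; no value is established in the text.
    """
    merged = []
    for start, end in sorted(segments):
        # Number of residues strictly between the previous segment and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # T0145: regions 64-96 and 108-200 are separated by residues 97-107.
    print(merge_close_segments([(64, 96), (108, 200)], max_gap=11))
```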
Our tool was primarily based on the analysis of how the HCA method was able to predict linkers. HCA is a very powerful method, but its use is limited because it requires expertise from the user. Because of its efficiency, the method presented here demonstrates that a lot of information can be extracted from HCA provided it is quantified and automated. Our linker prediction method, which works from any primary sequence, has the advantage of being fully automatic and able to predict the presence of potential linker regions that could lower protein expression, solubility or crystallization.
Our near-term perspectives are to attempt quantification of the observed differences between the graphs of globally poorly structured proteins and those having only one short unstable fragment (Fig. 6). Finally, as has been shown for structure prediction, for example, using many different methods always yields better results than using just one (Moult et al., 2003). Thus, we plan to build a metaserver integrating other linker prediction methods.
| Acknowledgments |
|---|
We thank Herman van Tilbeurgh and Michael Levitt for helpful discussions and for critical reading of the manuscript. This work was funded by the Association Française contre les Myopathies (AFM) and the Association pour la Recherche sur le Cancer (ARC).
Received on September 23, 2004; revised on December 20, 2004; accepted on January 6, 2005
| REFERENCES |
|---|
Arents, G., Burlingame, R.W., Wang, B.C., Love, W.E., Moudrianakis, E.N. (1991) The nucleosomal core histone octamer at 3.1 Å resolution: a tripartite protein assembly and a left-handed superhelix. Proc. Natl Acad. Sci. USA, 88, 1014810152
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138D141
Bates, G. (2003) Huntington aggregation and toxicity in Huntington's disease. Lancet, 361, 16421644 Review[CrossRef][Web of Science][Medline].
Callebaut, I., Labesse, G., Durand, P., Poupon, A., Canard, L., Chomilier, J., Henrissat, B., Mornon, J.P. (1997) Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell Mol. Life Sci., 53, 621645 Review[CrossRef][Web of Science][Medline].
Callebaut, I., Courvalin, J.C., Mornon, J.P. (1999) The BAH (bromo-adjacent homology) domain: a link between DNA methylation, replication and transcriptional regulation. FEBS Lett., 446, 189193[CrossRef][Web of Science][Medline].
Davey, C.A., Sargent, D.F., Luger, K., Maeder, A.W., Richmond, T.J. (2002) Solvent mediated interactions in the structure of the nucleosome core particle at 1.9 Å resolution. J. Mol. Biol., 319, 10971113[CrossRef][Web of Science][Medline].
Dunker, A.K., Obradovic, Z., Romero, P., Garner, E.C., Brown, C.J. (2000) Intrinsic protein disorder in complete genomes. Genome Inform. Ser. Workshop Genome Inform., 11, 161171[Medline].
Dunker, A.K., Lawson, J.D., Brown, C.J., Williams, R.M., Romero, P., Oh, J.S., Oldfield, C.J., Campen, A.M., Ratliff, C.M., Hipps, K.W., et al. (2001) Intrinsically disordered protein. J. Mol. Graph. Model., 19, 2659 Review[CrossRef][Web of Science][Medline].
Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., Obradovic, Z. (2002) Intrinsic disorder and protein function. Biochemistry, 41, 65736582[CrossRef][Medline].
Dyrløv Bendtsen, J., Nielsen, H., Von Heijne, G., Brunak, S. (2004) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol, 340, 783795[CrossRef][Web of Science][Medline].
Evans, P.R. and Owen, D.J. (2002) Endocytosis and vesicle trafficking. Curr. Opin. Struct. Biol., 12, 814821 Review[CrossRef][Web of Science][Medline].
Gunasekaran, K., Tsai, C.J., Kumar, S., Zanuy, D., Nussinov, R. (2003) Extended disordered proteins: targeting function with less scaffold. Trends Biochem. Sci., 28, 8185[CrossRef][Web of Science][Medline].
Hofmann, K. and Stoffel, W. (1993) TMbasea database of membrane spanning proteins segments. 374, 166.
Hulo, N., Sigrist, C.J.A., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., Bairoch, A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, D134-D137.
Jones, D.T. and Ward, J.J. (2003) Prediction of disordered regions in proteins from position specific score matrices. Proteins, 53, 573-578.
Kaplan, B., Ratner, V., Haas, E. (2003) Alpha-synuclein: its biological function and role in neurodegenerative diseases. J. Mol. Neurosci., 20, 83-92.
Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P., Bork, P. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142-D144.
Li, X., Romero, P., Rani, M., Dunker, A.K., Obradovic, Z. (1999) Predicting protein disorder for N-, C-, and internal regions. Genome Inform., 10, 30-40.
Linding, R., Jensen, L.J., Diella, F., Bork, P., Gibson, T.J., Russell, R.B. (2003a) Protein disorder prediction: implications for structural proteomics. Structure, 11, 1453-1459.
Linding, R., Russell, R.B., Neduva, V., Gibson, T.J. (2003b) GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res., 31, 3701-3708.
Liu, J. and Rost, B. (2003) NORSp: predictions of long regions without regular secondary structure. Nucleic Acids Res., 31, 3833-3835.
Liu, J., Tan, H., Rost, B. (2002) Loopy proteins appear conserved in evolution. J. Mol. Biol., 322, 53-64.
Marchler-Bauer, A. and Bryant, S.H. (2004) CD-Search: protein domain annotations on the fly. Nucleic Acids Res., 32, W327-W331.
McGinnis, S. and Madden, T.L. (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res., 32, W20-W25.
Melamud, E. and Moult, J. (2003) Evaluation of disorder prediction in CASP5. Proteins, 53, 561-565.
Moult, J., Fidelis, K., Zemla, A., Hubbard, T. (2003) Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins, 53, 334-339.
Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Brown, C.J., Dunker, A.K. (2003) Predicting intrinsic disorder from amino acid sequence. Proteins, 53, 566-572.
Plaxco, K.W. and Gross, M. (2001) Unfolded, yes, but random? Never! Nat. Struct. Biol., 8, 659-660.
Promponas, V.J., Enright, A.J., Tsoka, S., Kreil, D.P., Leroy, C., Hamodrakas, S., Sander, C., Ouzounis, C.A. (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics, 16, 915-922.
Przybylski, D. and Rost, B. (2002) Alignments grow, secondary structure prediction improves. Proteins, 46, 197-205.
Radhakrishnan, I., Perez-Alvarado, G.C., Parker, D., Dyson, H.J., Montminy, M.R., Wright, P.E. (1997) Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell, 91, 741-752.
Romero, P., Obradovic, A., Kissinger, K., Villafranca, J.E., Dunker, A.K. (1997) Identifying disordered regions in proteins from amino acid sequence. Int. Proc. Neural Netw., 1, 90-95.
Romero, P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., Dunker, A.K. (2001) Sequence complexity of disordered proteins. Proteins, 42, 38-48.
Rost, B. and Liu, J. (2003) The PredictProtein Server. Nucleic Acids Res., 31, 3300-3304.
Saqi, M.A. and Sternberg, M.J. (1997) Identification of sequence motifs from a set of proteins with related function. Protein Eng., 7, 165-171.
Schweers, O., Schonbrunn-Hanebeck, E., Marx, A., Mandelkow, E. (1994) Structural studies of tau protein and Alzheimer paired helical filaments show no evidence for beta-structure. J. Biol. Chem., 269, 24290-24297.
Servant, F., Bru, C., Carrère, S., Courcelle, E., Gouzy, J., Peyruc, D., Kahn, D. (2002) ProDom: automated clustering of homologous domains. Brief. Bioinformatics, 3, 246-251.
Smyth, E., Syme, C.D., Blanch, E.W., Hecht, L., Vasak, M., Barron, L.D. (2001) Solution structure of native proteins with irregular folds from Raman optical activity. Biopolymers, 58, 138-151.
Tompa, P. (2002) Intrinsically unstructured proteins. Trends Biochem. Sci., 27, 527-533.
Uversky, V.N. and Fink, A.L. (2002) The chicken-egg scenario of protein folding revisited. FEBS Lett., 515, 79-83.
Uversky, V.N., Gillespie, J.R., Fink, A.L. (2000) Why are natively unfolded proteins unstructured under physiologic conditions? Proteins, 41, 415-427.
Verkhivker, G.M., Bouzida, D., Gehlhaar, D.K., Rejto, P.A., Freer, S.T., Rose, P.W. (2003) Simulating disorder-order transitions in molecular recognition of unstructured proteins: where folding meets binding. Proc. Natl Acad. Sci. USA, 100, 5148-5153.
Ward, J.J., Sodhi, J.S., McGuffin, L.J., Buxton, B.F., Jones, D.T. (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol., 337, 625-645.
Weathers, E.A., Paulaitis, M.E., Woolf, T.B., Hoh, J.H. (2004) Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein. FEBS Lett., 576, 348-352.
Wistrand, M. and Sonnhammer, E.L. (2004) Improving profile HMM discrimination by adapting transition probabilities. J. Mol. Biol., 338, 847-854.
Wootton, J.C. (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput. Chem., 18, 269-285.