Bioinformatics Advance Access originally published online on June 1, 2007
Bioinformatics 2007 23(16):2046-2053; doi:10.1093/bioinformatics/btm302
POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions
1PharmaDesign, Inc., Tokyo 104-0032, 2Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064 and 3Department of Biotechnology and Life Science, Graduate School of Engineering, Tokyo University of Agriculture and Technology, Koganei 184-8588, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Recent experimental and theoretical studies have revealed several proteins containing sequence segments that are unfolded under physiological conditions. These segments are called disordered regions. They are actively investigated because of their possible involvement in various biological processes, such as cell signaling, transcriptional and translational regulation. Additionally, disordered regions can represent a major obstacle to high-throughput proteome analysis and often need to be removed from experimental targets. The accurate prediction of long disordered regions is thus expected to provide annotations that are useful for a wide range of applications.
Results: We developed Prediction Of Order and Disorder by machine LEarning (POODLE-L; L stands for long), a support vector machine (SVM)-based method for predicting long disordered regions using 10 kinds of simple physico-chemical properties of amino acids. POODLE-L assembles the output of 10 two-level SVM predictors into a final prediction of disordered regions. For predicting long disordered regions, POODLE-L exhibited a Matthews correlation coefficient of 0.658, which was the highest when compared with eight well-established, publicly available disordered region predictors.
Availability: POODLE-L is freely available at http://mbs.cbrc.jp/poodle/poodle-l.html
Contact: hirose-shuichi{at}aist.go.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Over the last two decades, many proteins lacking well-defined 3D structures under physiological conditions have been identified using various experimental techniques, such as nuclear magnetic resonance, circular dichroism and X-ray crystallography (Dunker et al., 2001). Those proteins include over 100 proteins or domains that are unfolded over their entire sequences (Uversky et al., 2000), and are called intrinsically disordered proteins or intrinsically unstructured proteins (IUPs) (Dunker et al., 2001; Tompa, 2002). In addition, even more numerous folded proteins containing unfolded regions, called disordered regions, have been identified (Romero et al. 1997a). From a structural viewpoint, disordered regions are sequence segments that are not observed on the electron density map in X-ray crystallography (Dunker and Obradovic, 2001). Moreover, theoretical analysis suggests that both IUPs and disordered regions are quite common and are found in proteins from all kinds of organisms, especially in eukaryotic genomes (Dunker et al., 2000; Oldfield et al., 2005a; Ward et al., 2004).
From a functional standpoint, though unstructured, disordered regions play fundamental roles in biological activities (Dunker and Obradovic 2001; Dunker et al., 2002a, b; Dyson and Wright 2005; Radivojac et al., 2007; Tompa, 2002; Uversky, 2002, 2003) and are associated with diseases (Fink 2005; Iakoucheva et al., 2002). A four-class classification of their functions has recently been proposed: entropic chain, protein modification, molecular assembly/disassembly and molecular recognition (Dunker et al., 2002a). A unique feature when carrying out such functions is their ability to bind multiple partners with high specificity (Dunker et al., 2002a). This unique property is assumed to originate from both the multiple conformations that they can adopt and their binding surfaces, which are typically larger than those of a folded globular protein. Such molecular recognition peculiarities render disordered regions particularly suitable for functions related to signal transduction, cell-regulation and transcription (Dunker et al., 2005; Uversky et al., 2005). Disordered regions have therefore attracted great attention because they might yield a new structure-function paradigm.
Clear sequence differences between ordered and disordered regions have been demonstrated (Romero, 1997b; Uversky et al., 2000; Wootton, 1994), thereby stimulating numerous attempts to predict disordered regions solely from amino acid sequences. A potential area of application of disordered region prediction is high-throughput functional/structural proteomics. In such projects, putative disordered regions and IUPs, which often hamper protein analysis, are predicted and removed from experimental targets, as reported in the Center for Eukaryotic Structural Genomics (CESG) target selection (Oldfield et al., 2005b). Hence, improvements in prediction performance might also have a direct and practical impact on the efficiency of proteomics projects. Additionally, a novel drug discovery strategy, termed disorder-based rational drug design, may represent another area of application for disordered region prediction (Cheng et al., 2006).
Methods for predicting disordered regions are classifiable into two groups: the first uses mainly the physico-chemical properties of the amino acids; the second uses evolutionary information. PONDR (Romero et al., 2001), GlobPlot2 (Linding et al., 2003a), DisEMBL (Linding et al., 2003b), IUPred (Dosztanyi et al., 2005), PreLINK (Coeytaux and Poupon, 2005) and FoldUnfold (Galzitskaya et al., 2006a) are included in the first group. For example, PONDR uses amino acid composition and sequence complexity; GlobPlot2 uses a disorder propensity index; IUPred uses pairwise energy based on a quadratic form in the amino acid composition of the protein; PreLINK uses a relationship between amino acid distribution and a putative hydrophobic cluster; and FoldUnfold uses the ability to form a sufficient number of contacts in the globular state. Prediction methods included in the second group mainly use profiles generated from PSI-BLAST [e.g. DISOPRED2 (Ward et al., 2004), DISPro (Cheng et al., 2005) and DisPSSMP (Su et al., 2006)] or multiple alignment [e.g. RONN (Yang et al., 2005)].
In spite of those efforts, the critical assessment of techniques for protein structure prediction (CASP) experiment, which has had a category for the prediction of disordered regions since 2002, suggested that substantial room for improvement remains (Obradovic et al., 2005). One promising line of research for improving the prediction of disordered regions suggests that the sequence characteristics of the disordered regions might depend on their lengths (Peng et al., 2006).
In this study, we investigated whether length dependence is useful for improving the prediction efficiency of disordered regions. We developed Prediction Of Order and Disorder by machine LEarning (POODLE-L; L stands for long), which is an SVM (Support Vector Machine) predictor specifically intended to detect long disordered regions. The prediction performance was improved by extracting the amino acid features specific to long disordered regions. When compared to eight publicly available and well-established disordered region predictors using an independent evaluation dataset of 116 sequences, POODLE-L demonstrated the highest prediction performance.
| 2 METHODS |
|---|
2.1 Training and evaluation dataset
Our training dataset (TDS) was constructed as follows. Ordered regions were collected from proteins in the Protein Data Bank (PDB) (Berman et al., 2000) that include no disordered regions or only short disordered regions (shorter than 30aa). Representative sequences with pairwise sequence identity of <30% were collected using PDB-REPRDB (Noguchi and Akiyama, 2003). From that set, we selected protein structures that are monomeric single domains, as defined by SCOP (Murzin et al., 1995), which have resolutions better than 2.0 Å, and which are determined with CNS (Brunger et al., 1998), Shelxl (Sheldrick, 1997), Refmac (Murshudov et al., 1997) or X-PLOR (Brunger, 1992). Eventually, the ordered regions of TDS included 292 protein sequences (55 784 residues), and contained 93 short disordered regions (2% of the total residue number).
Disordered regions of TDS consisted of long disordered regions and IUPs. They were collected from Uversky's dataset (Uversky et al., 2000) and the DisProt ver. 2.2 dataset (Vucetic et al., 2005). Disordered regions shorter than 40aa, as well as redundant sequences with sequence identities higher than 90%, were removed using BLASTClust (Altschul et al., 1990). Consequently, 199 disordered regions (35 428 residues) were collected.
Two prediction performance assessment datasets were prepared. A first dataset, called ADS-1, was used for assessing the prediction performance of the individual predictors. The ADS-1 was constructed according to the same protocol as that used for constructing TDS, except that the required resolution for ordered regions was 2.5 Å rather than 2.0 Å, and all sequences contained in TDS were removed. We used an updated version of DisProt (ver. 3.0) for disordered regions. Disordered regions shorter than 30aa were excluded from the dataset described above. The evaluation dataset comprised 53 ordered regions (11 431 residues) derived from PDB and 63 disordered regions (8700 residues) derived from DisProt.
A second prediction performance assessment dataset (ADS-2) was used for selecting the descriptor combinations and was constructed from the PDB according to a protocol similar to that used for constructing TDS. Disordered regions in ADS-2 were defined as a string of 30 or more consecutive residues missing their Cα atomic coordinates. The ADS-2 consisted of 15 sequences with no disordered region and 11 sequences containing one or more long (≥30aa) disordered regions, representing 6688 ordered residues and 564 disordered residues. TDS, ADS-1 and ADS-2 do not overlap with each other, and all sequences are available at http://mbs.cbrc.jp/poodle/poodle-l-datasets.html.
2.2 Two-level SVM prediction
POODLE-L assembles the results of 10 disordered region predictors into a final prediction. Each predictor consists of a two-level SVM prediction, which uses amino acid sequences as input data.
The first-level SVM predicts the probability of a 40-residue sequence segment to be disordered. The window size was chosen by comparing, for each descriptor, the classification performances of a 30-residue window with those of a 40-residue window. The sequence was expressed as a 10-D vector, with each component corresponding to a physico-chemical characteristic described in Section 2.3 and used as learning data of the first-level SVM (Fig. 1 Step 1). The disorder probabilities of all 40-residue segments in the protein sequence were calculated using the first-level SVM (Fig. 1 Step 2) and were expressed as real numbers from 0.0 to 1.0 (Fig. 1 Step 3).
The second-level SVM uses the output of the first-level SVM and computes the disorder probability of a single residue. To prepare the second-level SVM learning data for a residue, the disorder probabilities calculated with the first-level SVM for all windows that include the residue were sorted into 10 classes using an increment of 0.1 (0.0–0.1, 0.1–0.2, 0.2–0.3, etc.) (Fig. 1 Step 4). The number of members in each class was normalized to one using the maximum number found within the protein sequence. Consequently, each amino acid residue was expressed as a 10-D vector to be used in the second-level SVM. Note that internal residues, which are located more than 40 residues distant from the protein N and C termini, are included in 40 windows, whereas termini residues, which are located within 39 residues of the protein N and C termini, are included in a smaller number of windows (Fig. 1 Step 3). Thus, the 10-D vectors of termini residues are calculated using a smaller number of first-level SVM outputs than the internal residue 10-D vectors, and the second-level SVMs for internal and termini residues were therefore trained separately. The second-level SVM output was expressed as a real number from 0 to 1, which is called the disorder probability (P). Residues with a disorder probability higher than 0.5 were predicted to be in the disordered state (Fig. 1 Step 5).
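The construction of the second-level input can be illustrated with a minimal sketch. It assumes that `window_probs[s]` holds a first-level disorder probability for the window starting at residue `s`, that exactly `seq_len - window + 1` windows exist, and that the normalization divides by the largest class count found anywhere in the protein. The function name and the bin-boundary handling are illustrative choices, not details taken from the paper.

```python
def second_level_features(window_probs, seq_len, window=40, n_bins=10):
    """Per-residue histogram of first-level outputs over covering windows.

    window_probs[s] is a (hypothetical) first-level disorder probability
    for the window that starts at residue s and spans `window` residues.
    """
    if seq_len == 0:
        return []
    counts = [[0] * n_bins for _ in range(seq_len)]
    for start, p in enumerate(window_probs):
        # Probabilities of exactly 1.0 are placed in the top class.
        b = min(int(p * n_bins), n_bins - 1)
        for r in range(start, start + window):
            counts[r][b] += 1
    # Assumed normalization: divide by the largest count in the protein.
    peak = max(max(row) for row in counts) or 1
    return [[c / peak for c in row] for row in counts]
```

In this sketch, residues near the termini are covered by fewer windows, so their vectors carry smaller entries than those of internal residues, which mirrors the reason given above for training separate second-level SVMs.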
2.3 Physico-chemical properties used as descriptors
Ten descriptors were defined as follows and were used in the first-level SVMs:
The mean hydrophobicity was defined as the average of the modified Kyte and Doolittle hydrophobicity index (Kyte and Doolittle, 1982) of the residues in the window, and normalized to a scale of 0 to 1.
The hydrophobic cluster value was defined as the length of a hydrophobic cluster divided by 25. A hydrophobic cluster was defined by encoding sequences with a ternary code, i.e. 1 for hydrophobic residues (VILFMYW), 2 for a proline and 0 for the other residues. A hydrophobic cluster was required to consist of a string of 1s and 0s containing at most three consecutive 0s and no 2s. It is therefore bounded on each side by either 0000 or a 2 (Coeytaux and Poupon, 2005).
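As an illustration of the ternary encoding, the following sketch splits a sequence into candidate hydrophobic clusters. It assumes that a cluster starts and ends with a hydrophobic residue, so leading and trailing 0s are trimmed, and it returns one value per cluster; how a single value is assigned to a 40-residue window is not specified here, so that step is left out. All function names are hypothetical.

```python
import re

HYDROPHOBIC = set("VILFMYW")

def ternary_code(seq):
    """Encode hydrophobic residues as 1, proline as 2, all others as 0."""
    return "".join(
        "1" if aa in HYDROPHOBIC else "2" if aa == "P" else "0" for aa in seq
    )

def hydrophobic_clusters(seq):
    """Split the code at prolines and at runs of four or more 0s."""
    clusters = []
    for piece in re.split(r"2|0{4,}", ternary_code(seq)):
        core = piece.strip("0")  # assumption: clusters begin and end with a 1
        if core:
            clusters.append(core)
    return clusters

def cluster_values(seq):
    """Cluster length divided by 25, one value per cluster."""
    return [len(c) / 25 for c in hydrophobic_clusters(seq)]
```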
The mean net charge was defined as the absolute value of the average net charge (defined as +1 for K and R, –1 for D and E and 0 for other residues) of all residues in the window (Uversky et al., 2000).
The charge cluster value was defined as the average charge (defined as +1 for K and R, –1 for D and E and 0 for all other residues) of 12 consecutive amino acids.
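The two charge descriptors can be sketched as follows. The charge assignments come from the definitions above; the function names are illustrative, and the sketch returns one charge cluster value per 12-residue stretch because the rule for reducing these values to a single number per window is not stated.

```python
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # every other residue counts as 0

def mean_net_charge(window):
    """Absolute value of the average net charge over the window."""
    return abs(sum(CHARGE.get(aa, 0) for aa in window)) / len(window)

def charge_cluster_values(seq, span=12):
    """Average charge of every stretch of `span` consecutive residues."""
    charges = [CHARGE.get(aa, 0) for aa in seq]
    return [
        sum(charges[i:i + span]) / span
        for i in range(len(charges) - span + 1)
    ]
```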
The sequence complexity was measured using Shannon's entropy, which is given as:
$$H = -\sum_{i} f_i \log_2 f_i$$
where N represents the total number of amino acids in the window, and $f_i$ is the frequency of residue i among them (Shenkin et al., 1991).
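A minimal sketch of the sequence complexity descriptor follows. It assumes the standard Shannon form, in which the frequency of each residue type is its count divided by N and the logarithm is taken in base 2; the base and the function name are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in Counter(window).values())
```

Under these assumptions a window made of a single residue type scores 0, and a window in which all 20 amino acids are equally frequent scores log2(20), about 4.32.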
The amino acid composition was encoded with two descriptors. They were defined as the correlation coefficients between the amino acid frequencies in the query sequence and the respective frequencies in the ordered and disordered regions. The correlation coefficient was calculated as:
$$r = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^2}\,\sqrt{\sum_{i}(y_i-\bar{y})^2}}$$
where $x_i$ and $y_i$ are the frequencies of amino acid i in the two compositions being compared, and $\bar{x}$ and $\bar{y}$ respectively denote the sample means of x and y.
The secondary structures were described using two descriptors. In the first descriptor, a window containing a residue in an alpha helix core region was given a score of 1, and 0 otherwise. A second descriptor was derived similarly, but with beta core residues. The core regions were calculated according to a protocol similar to the Chou–Fasman secondary structure prediction method (Chou and Fasman, 1978). In short, a residue was assigned to an alpha core region when four or more consecutive residues with alpha helix propensity higher than 1.15 were found with no flanking residues with alpha helix propensity of less than 0.8. Beta sheet core regions were predicted when three or more consecutive residues with beta sheet propensity of greater than 1.20 were found with no flanking residues with beta sheet propensity smaller than 0.8.
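For the two amino acid composition descriptors described earlier in this section, a minimal sketch is given here. It assumes the usual Pearson correlation coefficient between 20-component frequency vectors; the reference frequency vectors for ordered and disordered regions would have to be computed from the training data and are not reproduced, so the function names and their arguments are hypothetical.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Relative frequency of each of the 20 amino acids in a sequence."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # undefined if either vector is constant

def composition_descriptors(window, ordered_freq, disordered_freq):
    """Correlation of the window composition with each reference profile."""
    freq = composition(window)
    return pearson(freq, ordered_freq), pearson(freq, disordered_freq)
```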
The average number of contacts was defined as the average of the expected number of contacts in globular state of all residues in the window (Garbuzynskiy et al., 2004).
2.4 Training method
The SVMs used in this study were built with the libSVM package (Chang and Lin, 2001), using an RBF kernel for the classifiers. The first-level and second-level SVMs were trained using their respective training datasets of 10-D vectors, as described in Sections 2.2 and 2.3. Both TDSs were reduced to 1/10th of their original sizes as follows. Each TDS was subdivided into clusters, whose number was 1/200th of all data, using a k-means nearest neighbor method. We then randomly selected 20 sequences from each cluster, which yielded the 1/10 reduced dataset. The training parameters (cost and gamma parameter) used in the SVMs were optimized using a 5-fold cross-validation with the reduced 1/10th dataset.
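The arithmetic of the reduction step can be illustrated with a short sketch. It assumes that cluster labels have already been obtained from some clustering routine (the clustering itself is not shown) and simply draws up to 20 members at random from each cluster; with n/200 clusters of roughly 200 members each, this keeps about n/10 items. The function name and the fixed seed are illustrative.

```python
import random
from collections import defaultdict

def reduce_by_cluster(cluster_labels, per_cluster=20, seed=0):
    """Return sorted indices of items kept after per-cluster sampling."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)
    kept = []
    for label in sorted(members):
        pool = members[label]
        # Clusters smaller than `per_cluster` are kept whole.
        kept.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return sorted(kept)
```

Sampling within clusters rather than from the whole dataset keeps sparsely populated regions of the descriptor space represented in the reduced set.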
2.5 Assessment of predictions
The prediction results were assessed on a residue basis: the state of each residue in the protein sequence was predicted and compared with the experimental state. The prediction results were classified into four categories. NTP is the number of true positives, defined as disordered residues correctly predicted as disordered. Similarly, NFP, NTN and NFN respectively denote the number of false positives, defined as ordered residues incorrectly predicted as disordered; the number of true negatives, defined as correctly predicted ordered residues; and the number of false negatives, defined as disordered residues incorrectly predicted as ordered.
The first assessment criterion is the receiver operating characteristic (ROC) curve. The ROC curve is obtained by plotting the false positive rate (RFP(P) = NFP(P) / (NTN + NFP)) against the true positive rate (RTP(P) = NTP(P) / (NTP + NFN)). The RFP against RTP was plotted while the disorder probability increased from 0 to 1.0 with a 0.01 increment. The larger the area under the ROC curve (SAUC), the more robust an algorithm is. An area of 1.00 means a perfect predictor, and an area of 0.50 corresponds to a random guess.
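The threshold sweep behind the ROC curve can be sketched as below. The sketch assumes that a residue is called disordered when its probability is at or above the threshold, adds the (0, 0) corner explicitly, and integrates with the trapezoidal rule; these choices and the function names are assumptions made for illustration. Labels are 1 for disordered and 0 for ordered residues.

```python
def roc_points(probs, labels, step=0.01):
    """(false positive rate, true positive rate) pairs over a threshold sweep."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = {(0.0, 0.0)}
    for k in range(int(round(1 / step)) + 1):
        t = k * step
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= t)
        points.add((fp / n_neg, tp / n_pos))
    return sorted(points)

def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```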
Next, the sensitivity (Ssens) and specificity (Sspec), which respectively indicate the fractions of correctly identified disordered residues and ordered residues, were used to evaluate prediction performances. Ssens and Sspec were defined as follows:
$$S_{sens} = \frac{N_{TP}}{N_{TP}+N_{FN}}, \qquad S_{spec} = \frac{N_{TN}}{N_{TN}+N_{FP}}$$
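Another commonly used criterion is the Matthews correlation coefficient (SMCC), which is given as:
$$S_{MCC} = \frac{N_{TP}N_{TN}-N_{FP}N_{FN}}{\sqrt{(N_{TP}+N_{FP})(N_{TP}+N_{FN})(N_{TN}+N_{FP})(N_{TN}+N_{FN})}}$$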
For a prediction with unequal class frequencies, SMCC favors the correct prediction of small classes. In our case, SMCC will favor the correct prediction of disordered regions over that of ordered regions.
In CASP6, a new criterion, Sproduct, was defined to complement SMCC and emphasize the detection of disordered regions (Jin and Dunbrack, 2005). There are considerably fewer disordered residues than ordered residues. Moreover, disordered regions tend to be underpredicted. Sproduct is defined as:
$$S_{product} = S_{sens} \times S_{spec}$$
Sproduct ranges from 0 to 1, where 1 indicates a perfect prediction. Sproduct rises much faster than SMCC when the number of correctly predicted disordered residues rises.
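The residue-level criteria can be computed from the four counts with a few lines. The sketch assumes the standard definitions of sensitivity, specificity and the Matthews correlation coefficient, and takes Sproduct to be the product of sensitivity and specificity; the function name and the returned dictionary keys are illustrative.

```python
from math import sqrt

def assess(n_tp, n_fp, n_tn, n_fn):
    """Sensitivity, specificity, MCC and their product from residue counts."""
    sens = n_tp / (n_tp + n_fn)
    spec = n_tn / (n_tn + n_fp)
    denom = sqrt(
        (n_tp + n_fp) * (n_tp + n_fn) * (n_tn + n_fp) * (n_tn + n_fn)
    )
    # By convention the coefficient is set to 0 when the denominator vanishes.
    mcc = (n_tp * n_tn - n_fp * n_fn) / denom if denom else 0.0
    return {"sens": sens, "spec": spec, "mcc": mcc, "product": sens * spec}
```

As a plausibility check of these definitions, the short-region values reported in Section 3.4 (Ssens 0.353, Sspec 0.881) multiply to about 0.311, which matches the reported Sproduct.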
| 3 RESULTS |
|---|
3.1 Basic two-level disorder predictor
We constructed a basic two-level SVM predictor using all 10 descriptors, and compared its prediction performances with those of an ordinary single-level SVM prediction method, which is identical to the first-level SVM of our basic predictor (Fig. 1). The average prediction performances were estimated by training and evaluating both predictors 20 times using randomly selected data from TDS (Table 1). The Ssens of the two-level prediction was 1.7% higher than that of a usual single SVM prediction. In addition, the SMCC and Sproduct were also slightly higher. Furthermore, the two-level SVM prediction algorithm enabled the prediction of disordered regions for residues in terminal regions (see Methods section for further details).
3.2 Combination of descriptors for the first-level SVM
We optimized the first-level SVM by investigating which descriptor combination would yield a better prediction result. However, because of the numerous combinations (1022 patterns), we classified the 10 descriptors into six groups, which reduced the number of combinations to 63 patterns (62 patterns and the basic predictor). The six groups were the hydrophobicity descriptors (mean hydrophobicity, hydrophobic cluster value), the charge descriptors (mean net charge, charge cluster value), the sequence complexity descriptor (sequence complexity), the amino acid composition descriptors (amino acid composition), the secondary structure descriptors (secondary structure) and the average number of contacts descriptor (average number of contacts). We evaluated the prediction performances of all 63 predictors based on their sensitivity and specificity using ADS-2 (Fig. 2). According to this criterion, 9 among the 62 predictors exhibited better performance than the basic predictor; the performance of predictor34 was the highest (Supplementary Table 1).
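The pattern counts quoted above follow from simple subset counting, which the sketch below reproduces: the non-empty proper subsets of 10 descriptors number 2^10 - 2 = 1022, and the non-empty subsets of six groups number 2^6 - 1 = 63, of which the full set is the basic predictor. The group labels are shortened forms of the names in the text and the helper name is hypothetical.

```python
from itertools import combinations

GROUPS = [
    "hydrophobicity",
    "charge",
    "sequence complexity",
    "amino acid composition",
    "secondary structure",
    "number of contacts",
]

def nonempty_subsets(items):
    """All non-empty combinations of the given items."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]

all_patterns = nonempty_subsets(GROUPS)                       # 63 patterns
partial_patterns = [c for c in all_patterns if len(c) < len(GROUPS)]  # 62
```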
3.3 Building POODLE-L
To improve the prediction performance, we assembled the results of the basic predictor and those of the nine predictors that outperformed it into a consensus prediction system, POODLE-L. For each predictor, we averaged the disorder probabilities over 7-, 23- and 39-residue windows and attributed each average to the central residue (Fig. 3 Step 1). For each window size and each residue, the two largest and the two smallest averaged probabilities were omitted (Fig. 3 Step 2). The final disorder probability of an amino acid was calculated as the average value over the remaining 18 probabilities (Fig. 3 Step 3). Examples of disordered region prediction for two proteins in ADS-1 are shown in Figure 4.
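The consensus step can be sketched as a trimmed mean over smoothed predictor outputs. With 10 predictors and three window sizes there are 30 smoothed values per residue; dropping the two largest and two smallest for each window size leaves 6 x 3 = 18 values to average. The sketch assumes that windows are truncated at the termini, which is not specified in the text, and it needs more than four predictors; all names are illustrative.

```python
def smooth(probs, window):
    """Average over a centered window, truncated at the sequence termini."""
    half = window // 2
    out = []
    for i in range(len(probs)):
        segment = probs[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out

def consensus(predictor_probs, windows=(7, 23, 39), trim=2):
    """Trimmed mean of smoothed per-residue probabilities across predictors."""
    n_res = len(predictor_probs[0])
    smoothed = {w: [smooth(p, w) for p in predictor_probs] for w in windows}
    final = []
    for i in range(n_res):
        kept = []
        for w in windows:
            values = sorted(s[i] for s in smoothed[w])
            kept.extend(values[trim:len(values) - trim])
        final.append(sum(kept) / len(kept))
    return final
```

Trimming the extremes before averaging limits the influence of a single predictor that disagrees strongly with the others at a given residue.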
3.4 Comparison with other methods
We compared the prediction performance of POODLE-L to that of eight publicly available disordered region predictors: DISOPRED2, VSL2 (Peng et al., 2006), VL3H (Obradovic et al., 2003), DisEMBL, RONN, IUPred, FoldIndex (Prilusky et al., 2005) and FoldUnfold using ADS-1. According to the ROC curve, POODLE-L performance was the highest among all predictors, especially in the very low false positive rate region. Furthermore, POODLE-L's prediction performances were also the highest according to SMCC and Sproduct, which provide a simultaneous evaluation of Ssens and Sspec, and according to SAUC (Table 2). When considered individually, POODLE-L's Ssens and Sspec, though relatively high, were not the highest. However, it is important to assess Ssens and Sspec simultaneously because a high Sspec is usually counterbalanced by a low Ssens.
The ability of POODLE-L to predict short disordered regions was low, as anticipated. The assessment dataset for estimating this prediction performance consisted of 15 sequences with short disordered regions (between 5aa and 20aa) from X-ray crystallographic data, in which 3500 residues were classed as ordered and 204 as short disordered residues. The Ssens, Sspec, SMCC and Sproduct of POODLE-L were 0.353, 0.881, 0.158 and 0.311, respectively, and its performance on several criteria was lower than that of the other predictors (Supplementary Table 2).
| 4 DISCUSSION |
|---|
POODLE-L is a disordered region predictor that is especially tuned for predicting long disordered regions. A major feature of POODLE-L is the assembly of multiple individual disordered region predictors into a final integrated predictor. The outputs of 10 two-level SVM predictors are merged using a consensus algorithm. One predictor, the basic predictor, uses all 10 descriptors in the first-level prediction SVM, whereas the other predictors use the combinations of descriptors that yielded better performance. The effectiveness of a consensus prediction algorithm was reported previously for secondary structure prediction (Cuff et al., 1998; Nishikawa and Noguchi, 1991), but to our knowledge this is its first application to the prediction of disordered regions. The prediction performance of POODLE-L was, indeed, 2.6–8.0% higher than that of the individual predictors according to Sproduct (Table 2 and Supplementary Table 1).
Descriptors that are most useful in POODLE-L for identifying disordered regions can be evaluated from their inclusion in the predictors that produce the best results (Table 3). We find that hydrophobicity descriptors appear in all but one predictor, whereas the secondary structure descriptors are not used in any predictor. That difference does not necessarily mean that disorder and secondary structure propensities are unrelated, but it might indicate that the encoding of the secondary structure descriptors can be improved. For example, we might use secondary structure prediction methods such as PSIPRED (McGuffin et al., 2000), which are probably more accurate than our Chou–Fasman-derived prediction scheme. As for the four other descriptor groups, one or several of them appear in all predictors, but they appear in various combinations. This interchangeability suggests that the descriptors encode redundant information (Table 3). Overall, though the descriptors defined in this study might not characterize disordered regions exhaustively, hydrophobicity seems likely to be a good indicator for long disordered regions.
POODLE-L exhibited the best prediction performance for long disordered regions when compared with eight publicly available disordered region predictors (Table 2 and Fig. 5), but it was poor at predicting short regions (Supplementary Table 2). Among other reasons, we infer that the high performance of POODLE-L in predicting long disordered regions originates from the large window, which captures information from amino acids that are located distant from each other. This conjecture concurs with our preliminary calculation, indicating that a 40-residue window is more effective than a 30-residue window. Shorter windows are typically used for predicting disordered regions; for example, IUPred and RONN respectively use 21-residue and 19-residue windows.
The performance of POODLE-L for predicting disordered regions shorter than 30 residues was poor, as anticipated, because it was trained exclusively for predicting long disordered regions. Therefore, the ability of POODLE-L to predict, exclusively and with high reliability, long disordered regions suggests that the difference in the physico-chemical properties of long and short disordered regions (Radivojac et al., 2004) can be used to improve the prediction of disordered regions, as suggested by Peng et al. (2006). In our case, short disordered regions are identified by the method introduced by Shimizu et al. (2005).
| 5 CONCLUSION |
|---|
We described POODLE-L, which detects long disordered regions with high accuracy. POODLE-L achieved the best prediction performance among several previously reported prediction methods, according to several criteria. The accurate prediction of long disordered regions appears to stem from POODLE-L's capability to efficiently recognize the sequence differences among long disordered regions, short disordered regions and ordered regions. Nevertheless, we believe that the prediction of long disordered regions might be further improved by exploring better descriptors.
| ACKNOWLEDGEMENTS |
|---|
We thank our colleagues in PharmaDesign, Inc. and the protein function team in the Computational Biology Research Center (CBRC) for helpful advice and discussion. This work was funded by PharmaDesign, Inc. and the National Institute of Advanced Industrial Science and Technology (AIST).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Dmitrij Frishman
Received on January 31, 2007; revised on May 29, 2007; accepted on May 30, 2007
| REFERENCES |
|---|
Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.
Berman HM, et al. The Protein Data Bank. Nucleic Acids Res (2000) 28:235–242.
Brunger AT. X-PLOR, Ver. 3.1, A System for X-ray Crystallography and NMR (1992) New Haven, CT: Yale University Press.
Brunger AT, et al. Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D Biol. Crystallogr (1998) 54:905–921.
Chang CC, Lin CJ. Training nu-support vector classifiers: theory and algorithms. Neural Comput (2001) 13:2119–2147.
Cheng J, et al. Accurate prediction of protein disordered regions by mining protein structure data. Data Mining Knowl. Discov (2005) 11:213–222.
Cheng Y, et al. Rational drug design via intrinsically disordered protein. Trends Biotechnol (2006) 24:435–442.
Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol (1978) 47:45–148.
Coeytaux K, Poupon A. Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics (2005) 21:1891–1900.
Cuff JA, et al. JPred: a consensus secondary structure prediction server. Bioinformatics (1998) 14:892–893.
Dunker AK, Obradovic Z. The protein trinity-linking function and disorder. Nat. Biotechnol (2001) 19:805–806.
Dunker AK, et al. Intrinsic protein disorder in complete genomes. Genome Inform. Ser. Workshop Genome Inform (2000) 11:161–171.
Dunker AK, et al. Intrinsically disordered proteins. J. Mol. Graph. Model (2001) 19:26–59.
Dunker AK, et al. Intrinsic disorder and protein function. Biochemistry (2002a) 41:6573–6582.
Dunker AK, et al. Identification and functions of usefully disordered proteins. Adv. Protein Chem (2002b) 62:25–49.
Dunker AK, et al. Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J (2005) 272:5129–5148.
Dosztanyi Z, et al. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics (2005) 21:3433–3434.
Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol (2005) 6:197–208.
Fink AL. Natively unfolded proteins. Curr. Opin. Struct. Biol (2005) 15:35–41.
Galzitskaya OV, et al. FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics (2006a) 22:2948–2949.
Galzitskaya OV, et al. Prediction of amyloidogenic and disordered regions in protein chains. PLoS Comput. Biol (2006b) 2:1639–1648.
Garbuzynskiy SO, et al. To be folded or to be unfolded? Protein Sci (2004) 13:2871–2877.
Iakoucheva LM, et al. Intrinsic disorder in cell-signaling and cancer-associated proteins. J. Mol. Biol (2002) 232:573–584.
Jin Y, Dunbrack RL Jr. Assessment of disorder predictions in CASP6. Proteins (2005) 61:167–175.
Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol (1982) 157:105–132.
Linding R, et al. GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res (2003a) 31:3701–3708.
Linding R, et al. Protein disorder prediction: implications for structural proteomics. Structure (2003b) 11:1453–1459.
McGuffin LJ, et al. The PSIPRED protein structure prediction server. Bioinformatics (2000) 16:404–405.
Murshudov GN, et al. Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr. D Biol. Crystallogr (1997) 53:240–255.
Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol (1995) 247:536–540.
Nishikawa K, Noguchi T. Predicting protein secondary structure based on amino acid sequence. Meth. Enzymol (1991) 202:31–44.
Noguchi T, Akiyama Y. PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB) in 2003. Nucleic Acids Res (2003) 31:492–493.
Obradovic Z, et al. Predicting intrinsic disorder from amino acid sequence. Proteins (2003) 53:566–572.
Obradovic Z, et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins (2005) 61:176–182.
Oldfield CJ, et al. Comparing and combining predictors of mostly disordered proteins. Biochemistry (2005a) 44:1989–2000.
Oldfield CJ, et al. Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins (2005b) 59:444–453.
Peng K, et al. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics (2006) 7:208.
Prilusky J, et al. FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics (2005) 21:3435–3438.
Radivojac P, et al. Protein flexibility and intrinsic disorder. Protein Sci (2004) 13:71–80.
Radivojac P, et al. Intrinsic disorder and functional proteomics. Biophys. J (2007) 92:1439–1456.
Romero P, et al. Sequence data analysis for long disordered regions prediction in the Calcineurin family. Genome Inform. Ser. Workshop Genome Inform (1997a) 8:110–124.
Romero P, et al. Identifying disordered regions in proteins from amino acid sequence. Int. Proc. Neur. Net (1997b) 1:90–95.
Romero P, et al. Sequence complexity of disordered protein. Proteins (2001) 42:38–48.
Sheldrick GM. SHELX97, programs for crystal structure analysis (Release 97-2). (1997) Germany: University of Gottingen.
Shenkin PS, et al. Information-theoretical entropy as a measure of sequence variability. Proteins (1991) 11:297–313.
Shimizu K, et al. Feature selection based on physicochemical properties of redefined N-term and C-term regions for predicting disorder. (2005) Proceedings of the Institute of Electrical and Electronics Engineers CIBCB. 262–267.
Su CT, et al. Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics (2006) 7:319.
Tompa P. Intrinsically unstructured proteins. Trends Biochem. Sci (2002) 27:527–533.
Uversky VN, et al. Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins (2000) 15:415–427.
Uversky VN. Natively unfolded proteins: a point where biology waits for physics. Protein Sci (2002) 11:739–756.
Uversky VN. Protein folding revisited. A polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go? Cell Mol. Life Sci (2003) 60:1852–1871.
Uversky VN, et al. Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J. Mol. Recognit (2005) 18:343–384.
Vucetic S, et al. DisProt: a database of protein disorder. Bioinformatics (2005) 21:137–140.
Ward JJ, et al. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol (2004) 337:635–645.
Wootton JC. Sequence with unusual amino acid composition. Curr. Opin. Struct. Biol (1994) 4:413–421.
Yang ZR, et al. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics (2005) 21:3369–3376.





