

Bioinformatics Advance Access originally published online on June 1, 2007
Bioinformatics 2007 23(16):2046-2053; doi:10.1093/bioinformatics/btm302

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions

Shuichi Hirose 1,3,*, Kana Shimizu 2, Satoru Kanai 1, Yutaka Kuroda 3 and Tamotsu Noguchi 2

1PharmaDesign, Inc., Tokyo 104-0032, 2Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064 and 3Department of Biotechnology and Life Science, Graduate School of Engineering, Tokyo University of Agriculture and Technology, Koganei 184-8588, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Recent experimental and theoretical studies have revealed several proteins containing sequence segments that are unfolded under physiological conditions. These segments are called disordered regions. They are actively investigated because of their possible involvement in various biological processes, such as cell signaling, transcriptional and translational regulation. Additionally, disordered regions can represent a major obstacle to high-throughput proteome analysis and often need to be removed from experimental targets. The accurate prediction of long disordered regions is thus expected to provide annotations that are useful for a wide range of applications.

Results: We developed Prediction Of Order and Disorder by machine LEarning (POODLE-L; L stands for long), a Support Vector Machine (SVM)-based method for predicting long disordered regions using 10 kinds of simple physico-chemical properties of amino acids. POODLE-L assembles the output of 10 two-level SVM predictors into a final prediction of disordered regions. The performance of POODLE-L for predicting long disordered regions, which exhibited a Matthews correlation coefficient of 0.658, was the highest when compared with eight well-established publicly available disordered region predictors.

Availability: POODLE-L is freely available at http://mbs.cbrc.jp/poodle/poodle-l.html

Contact: hirose-shuichi{at}aist.go.jp

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Over the last two decades, many proteins lacking well-defined 3D structures under physiological conditions have been identified using various experimental techniques, such as nuclear magnetic resonance, circular dichroism and X-ray crystallography (Dunker et al., 2001). Those proteins include over 100 proteins or domains that are unfolded over their entire sequences (Uversky et al., 2000), and are called ‘intrinsically disordered proteins’ or ‘intrinsically unstructured proteins (IUPs)’ (Dunker et al., 2001; Tompa, 2002). In addition, even more numerous folded proteins containing unfolded regions, called ‘disordered regions’, have been identified (Romero et al. 1997a). From a structural viewpoint, disordered regions are sequence segments that are not observed on the electron density map in X-ray crystallography (Dunker and Obradovic, 2001). Moreover, theoretical analysis suggests that both IUPs and disordered regions are quite common and are found in proteins from all kinds of organisms, especially in eukaryotic genomes (Dunker et al., 2000; Oldfield et al., 2005a; Ward et al., 2004).

From a functional standpoint, though unstructured, disordered regions play fundamental roles in biological activities (Dunker and Obradovic 2001; Dunker et al., 2002a, b; Dyson and Wright 2005; Radivojac et al., 2007; Tompa, 2002; Uversky, 2002, 2003) and are associated with diseases (Fink 2005; Iakoucheva et al., 2002). A four-class classification of their functions has recently been proposed: entropic chain, protein modification, molecular assembly/disassembly and molecular recognition (Dunker et al., 2002a). A unique feature when carrying out such functions is their ability to bind multiple partners with high specificity (Dunker et al., 2002a). This unique property is assumed to originate from both the multiple conformations that they can adopt and their binding surfaces, which are typically larger than those of a folded globular protein. Such molecular recognition peculiarities render disordered regions particularly suitable for functions related to signal transduction, cell-regulation and transcription (Dunker et al., 2005; Uversky et al., 2005). Disordered regions have therefore attracted great attention because they might yield a new structure-function paradigm.

Clear sequence differences between ordered and disordered regions have been demonstrated (Romero, 1997b; Uversky et al., 2000; Wootton, 1994), thereby stimulating numerous attempts to predict disordered regions solely from amino acid sequences. A potential area of application of disordered region prediction is high-throughput functional/structural proteomics. In such projects, putative disordered regions and IUPs, which often hamper protein analysis, are predicted and removed from experimental targets, as reported in the Center for Eukaryotic Structural Genomics (CESG) target selection (Oldfield et al., 2005b). Hence, improvements in prediction performance might also have a direct and practical impact on the efficiency of proteomics projects. Additionally, a novel drug discovery strategy, termed disorder-based rational drug design, may represent another area of application for disordered region prediction (Cheng et al., 2006).

Methods for predicting disordered regions are classifiable into two groups: the first uses mainly the physico-chemical properties of the amino acids; the second uses evolutionary information. PONDR (Romero et al., 2001), GlobPlot2 (Linding et al., 2003a), DisEMBL (Linding et al., 2003b), IUPred (Dosztanyi et al., 2005), PreLINK (Coeytaux and Poupon, 2005) and FoldUnfold (Galzitskaya et al., 2006a) are included in the first group. For example, PONDR uses amino acid composition and sequence complexity; GlobPlot2 uses a disorder propensity index; IUPred uses pairwise energy based on a quadratic form in the amino acid composition of the protein; PreLINK uses a relationship between amino acid distribution and a putative hydrophobic cluster; and FoldUnfold uses the ability to form a sufficient number of contacts in the globular state. Prediction methods included in the second group mainly use profiles generated from PSI-BLAST [e.g. DISOPRED2 (Ward et al., 2004), DISPro (Cheng et al., 2005) and DisPSSMP (Su et al., 2006)] or multiple alignment [e.g. RONN (Yang et al., 2005)].

In spite of those efforts, the critical assessment of techniques for protein structure prediction (CASP) experiment, which has had a category for the prediction of disordered regions since 2002, suggested that substantial room for improvement remains (Obradovic et al., 2005). One promising line of research for improving disordered region prediction suggests that the sequence characteristics of disordered regions might depend on their lengths (Peng et al., 2006).

In this study, we investigated whether length dependence is useful for improving the prediction efficiency of disordered regions. We developed Prediction Of Order and Disorder by machine LEarning (POODLE-L; L stands for long), which is an SVM (Support Vector Machine) predictor specifically intended to detect long disordered regions. The prediction performance was improved by extracting the amino acid features specific to long disordered regions. When compared to eight publicly available and well-established disordered region predictors using an independent evaluation dataset of 116 sequences, POODLE-L demonstrated the highest prediction performance.


    2 METHODS
 
2.1 Training and evaluation dataset
Our training dataset (TDS) was constructed as follows. Ordered regions were collected from proteins in the Protein Data Bank (PDB) (Berman et al., 2000) that include no disordered regions or only ‘short disordered regions’ (shorter than 30aa). Representative sequences with pairwise sequence identity of <30% were collected using PDB-REPRDB (Noguchi and Akiyama, 2003). From that set, we selected protein structures that are monomeric single domains, as defined by SCOP (Murzin et al., 1995), which have resolutions better than 2.0 Å, and which are determined with CNS (Brunger et al., 1998), Shelxl (Sheldrick, 1997), Refmac (Murshudov et al., 1997) or X-PLOR (Brunger, 1992). Eventually, the ordered regions of TDS included 292 protein sequences (55 784 residues), and contained 93 ‘short disordered regions’ (2% of the total residue number).

Disordered regions of TDS consisted of long disordered regions and IUPs. They were collected from Uversky's dataset (Uversky et al., 2000) and the DisProt ver. 2.2 dataset (Vucetic et al., 2005). Disordered regions shorter than 40aa, as well as redundant sequences with sequence identities higher than 90%, were removed using BLASTClust (Altschul et al., 1990). Consequently, 199 disordered regions (35 428 residues) were collected.

Two prediction performance assessment datasets were prepared. The first dataset, called ADS-1, was used for assessing the prediction performance of the individual predictors. ADS-1 was constructed according to the same protocol as that used for constructing TDS, except that the required resolution for ordered regions was 2.5 Å rather than 2.0 Å and that all sequences contained in TDS were removed. We used an updated version of DisProt (ver. 3.0) for disordered regions. Disordered regions shorter than 30aa were excluded from this dataset. The resulting evaluation dataset comprised 53 ordered regions (11 431 residues) derived from PDB and 63 disordered regions (8700 residues) derived from DisProt.

A second prediction performance assessment dataset (ADS-2) was used for selecting the descriptor combinations and was constructed from the PDB according to a protocol similar to that used for constructing TDS. Disordered regions in ADS-2 were defined as a string of 30 or more consecutive residues missing their Cα atomic coordinates. ADS-2 consisted of 15 sequences with no disordered region and 11 sequences containing one or more long (≥30aa) disordered regions, representing 6688 ordered residues and 564 disordered residues. TDS, ADS-1 and ADS-2 do not overlap with each other, and all sequences are available at http://mbs.cbrc.jp/poodle/poodle-l-datasets.html.

2.2 Two-level SVM prediction
POODLE-L assembles the results of 10 disordered region predictors into a final prediction. Each predictor consists of a two-level SVM prediction, which uses amino acid sequences as input data.

The first-level SVM predicts the probability of a 40-residue sequence segment to be disordered. The window size was chosen by comparing, for each descriptor, the classification performances of a 30-residue window with those of a 40-residue window. The sequence was expressed as a 10-D vector, with each component corresponding to a physico-chemical characteristic described in Section 2.3 and used as learning data of the first-level SVM (Fig. 1 Step 1). The disorder probabilities of all 40-residue segments in the protein sequence were calculated using the first-level SVM (Fig. 1 Step 2) and were expressed as real numbers from 0.0 to 1.0 (Fig. 1 Step 3).


Fig. 1. Schematic representation of a two-level SVM disordered region predictor. The first- and second-level SVMs are shown respectively on the left and right panels. In the first-level SVM, a 40-residue window is moved from the N-terminus to the C-terminus of the protein sequence (Step 1). The matrix at the center of the left panel shows the descriptor values to which the sequence windows are converted, together with the window names. For example, seq11 denotes the 40-residue sequence window that starts at residue 11 (Step 1). The first-level SVM is executed using the matrix generated in Step 1 (Step 2). The values in squares represent the disorder probability of the 40-residue sequence segment calculated with the first-level SVM (Step 3). Here, all windows that include a proline residue are enclosed in a broken-line frame. In the right panel, the matrix in the square shows the values to which the disorder probability distribution is standardized (Step 4). The second-level SVM is executed using the matrix generated in Step 4 (Step 5).

 
The second-level SVM uses the output of the first-level SVM and computes the disorder probability of a single residue. The second-level SVM learning data of each residue were prepared by subdivision into 10 classes using an increment of 0.1 (0.0–0.1, 0.1–0.2, 0.2–0.3, etc.) the disorder probabilities, as calculated with the first-level SVM, of all the windows that include the examined residue (Fig. 1 Step 4). The number of members in each class was normalized to one using the maximum number found within the protein sequence. Consequently, each amino acid residue was expressed as a 10-D vector to be used in the second-level SVM. Note that ‘internal residues’, which are located more than 40 residues distant from the protein N and C termini, are included in 40 windows, whereas ‘termini residues’, which are residues located within 39 residues of the protein N and C termini, are included in a smaller number of windows (Fig. 1 Step 3). Thus, the 10-D vectors of termini residues are calculated using a smaller number of first-level SVM outputs than the internal residue 10-D vectors, and the second level SVMs for internal and termini residues were also trained separately. The second-level SVM output was expressed as a real number from 0 to 1, which is called the disorder probability (P). Residues with disorder probability higher than 0.5 were predicted to be in the disordered state (Fig. 1 Step 5).
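The construction of the second-level input can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the assignment of boundary values (for example 0.1) to the upper class is an assumption, and "normalized to one using the maximum number found within the protein sequence" is read here as dividing every class count by the largest count observed anywhere in the protein.

```python
from typing import List

WINDOW = 40   # first-level window length stated in the text
N_BINS = 10   # probability classes of width 0.1

def windows_covering(pos: int, seq_len: int, window: int = WINDOW) -> range:
    """0-based start indices of all windows that contain residue `pos`."""
    first = max(0, pos - window + 1)
    last = min(pos, seq_len - window)
    return range(first, last + 1)

def bin_counts(pos: int, window_probs: List[float], seq_len: int,
               window: int = WINDOW) -> List[int]:
    """Histogram (10 classes) of first-level probabilities covering `pos`."""
    counts = [0] * N_BINS
    for start in windows_covering(pos, seq_len, window):
        p = window_probs[start]
        # Assumption: a value on a class boundary goes to the upper class,
        # and 1.0 is folded into the last class.
        counts[min(int(p * N_BINS), N_BINS - 1)] += 1
    return counts

def second_level_vectors(window_probs: List[float], seq_len: int,
                         window: int = WINDOW) -> List[List[float]]:
    """One 10-D vector per residue, scaled by the largest count in the protein."""
    raw = [bin_counts(pos, window_probs, seq_len, window)
           for pos in range(seq_len)]
    largest = max(max(counts) for counts in raw) or 1
    return [[c / largest for c in counts] for counts in raw]

# Toy example with a 3-residue window on a 6-residue sequence (4 windows).
vectors = second_level_vectors([0.05, 0.95, 0.95, 0.15], seq_len=6, window=3)
```

With the 40-residue window, an internal residue is covered by 40 windows and a residue at the very terminus by a single window, which is why the text trains separate second-level SVMs for the two cases.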

2.3 Physico-chemical properties used as descriptors
Ten descriptors were defined as follows and were used in the first-level SVMs:

The mean hydrophobicity was defined as the average of the modified Kyte and Doolittle hydrophobicity index (Kyte and Doolittle, 1982) of the residues in the window, normalized to a scale of 0 to 1.

The hydrophobic cluster value was defined as the length of a hydrophobic cluster divided by 25. A hydrophobic cluster was defined by encoding sequences with a ternary code, i.e. 1 for hydrophobic residues (VILFMYW), 2 for a proline and 0 for the other residues. A hydrophobic cluster is a string of ‘1’s and ‘0’s that contains at most three consecutive ‘0’s and no ‘2’s. It is therefore delimited on each side by either ‘0000’ or a ‘2’ (Coeytaux and Poupon, 2005).
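The ternary encoding and cluster delimitation can be illustrated with a short sketch. Two points are assumptions not stated in the text: clusters are trimmed so that they start and end on a hydrophobic residue, and the window value is taken from the longest cluster in the window. The function names are hypothetical.

```python
import re
from typing import List

HYDROPHOBIC = set("VILFMYW")

def encode(seq: str) -> str:
    """Ternary code: 1 = hydrophobic, 2 = proline, 0 = any other residue."""
    return "".join(
        "1" if aa in HYDROPHOBIC else "2" if aa == "P" else "0"
        for aa in seq
    )

def hydrophobic_clusters(seq: str) -> List[str]:
    """Split the code at prolines or runs of four or more '0's."""
    pieces = re.split(r"2|0{4,}", encode(seq))
    # Assumption: leading and trailing '0's are not part of a cluster.
    trimmed = [piece.strip("0") for piece in pieces]
    return [cluster for cluster in trimmed if cluster]

def hydrophobic_cluster_value(seq: str) -> float:
    """Length of the longest cluster divided by 25 (longest is an assumption)."""
    longest = max((len(c) for c in hydrophobic_clusters(seq)), default=0)
    return longest / 25

example = "VAALGGGGGIPLW"
clusters = hydrophobic_clusters(example)
value = hydrophobic_cluster_value(example)
```

For the toy sequence, the run of five glycines and the proline both act as cluster breakers, so three separate clusters are found.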

The mean net charge was defined as the absolute value of the average net charge (defined as +1 for K and R, –1 for D and E and 0 for other residues) of all residues in the window (Uversky et al., 2000).

The charge cluster value was defined as the average charge (defined as +1 for K and R, –1 for D and E and 0 for all other residues) of 12 consecutive amino acids.
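The two charge descriptors can be sketched as below. The text does not say how the 12-residue averages are reduced to a single value per 40-residue window, so this sketch only returns the profile of 12-residue averages; the function names are hypothetical.

```python
from typing import List

# Charges as defined in the text: +1 for K and R, -1 for D and E, 0 otherwise.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}

def mean_net_charge(window: str) -> float:
    """Absolute value of the average net charge over the window."""
    return abs(sum(CHARGE.get(aa, 0) for aa in window)) / len(window)

def charge_cluster_profile(window: str, span: int = 12) -> List[float]:
    """Average charge of every run of `span` consecutive residues."""
    charges = [CHARGE.get(aa, 0) for aa in window]
    return [
        sum(charges[i:i + span]) / span
        for i in range(len(charges) - span + 1)
    ]

balanced = mean_net_charge("KKDEAA")           # opposite charges cancel
profile = charge_cluster_profile("K" * 12 + "A")
```

The mean net charge ignores the sign and cancels opposite charges, whereas the 12-residue profile keeps the sign and is sensitive to local clustering of like charges.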

The sequence complexity was measured using Shannon's entropy, which is given as:


Formula

where N represents the total number of amino acids in the window, and fi is the frequency of residue i (Shenkin et al., 1991).
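Because the equation itself is not reproduced here, the following sketch uses the textbook form of Shannon's entropy over relative residue frequencies, with a base-2 logarithm. Both the base and the exact normalization used by the authors are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    # Relative frequency of each residue type present in the window.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low = shannon_entropy("AAAA")    # a single residue type
high = shannon_entropy("ACAC")   # two residue types in equal proportion
```

A window made of a single residue type has zero entropy, and the value grows as the composition becomes more diverse.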

The amino acid composition was encoded with two descriptors. They were defined as the correlation coefficients of amino acid frequencies between the amino acid frequency in the query sequence and its respective frequencies in the ordered and disordered regions. The amino acid frequency was calculated as:


Formula

where ni is the occurrence count of amino acid i in the sequence j, which has a length nj. In addition, F(st)ij was calculated using SWISS-PROT release 47.0. The amino acid frequencies, which are named P(ordered)i and P(disordered)i, were calculated respectively from the ordered and disordered regions in the TDS. Similarly, P(query)i was calculated for the query sequence. Next, Pearson's correlation coefficient between P(query)i and P(ordered)i (or P(disordered)i) was calculated as:


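\[ r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2}\,\sqrt{\sum_{i} (y_i - \bar{y})^2}} \]

where xi and yi respectively indicate the value of P(query)i and P(ordered)i (or P(disordered)i), and \bar{x} and \bar{y} respectively denote the sample means of x and y.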
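The correlation step can be sketched as follows. The sketch uses plain residue frequencies as input, because the normalization against SWISS-PROT described above is not reproduced here; `composition` is therefore a hypothetical stand-in for the authors' frequency measure, and the reference profile in the example is made up for illustration.

```python
import math
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq: str) -> List[float]:
    """Plain relative frequency of each of the 20 amino acids (assumption)."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient between two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Illustrative only: correlate a query window with a toy reference sequence.
query = composition("MKKSEEPSSDDKKAAEE")
reference = composition("SSEEKKPPDDQQAAGG")
r = pearson(query, reference)
```

In the setting described above, one such coefficient would be computed against the ordered-region profile and one against the disordered-region profile, giving the two composition descriptors.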

The secondary structures were described using two descriptors. In the first descriptor, a window containing a residue in an alpha helix core region was given a score of 1, and 0 otherwise. A second descriptor was derived similarly, but with beta core residues. The core regions were calculated according to a protocol similar to the Chou–Fasman secondary structure prediction method (Chou and Fasman, 1978). In short, a residue was defined as belonging to an alpha core region when four or more consecutive residues with alpha helix propensity higher than 1.15 were found with no flanking residues with alpha helix propensity of less than 0.8. Beta sheet core regions were predicted when three or more consecutive residues with beta sheet propensity of greater than 1.20 were found with no flanking residues with beta sheet propensity smaller than 0.8.

The average number of contacts was defined as the average of the expected number of contacts in globular state of all residues in the window (Garbuzynskiy et al., 2004).

2.4 Training method
The SVM implementation used in this study is the libSVM package (Chang and Lin, 2001) with an RBF kernel for the classifiers. The first-level and second-level SVMs were trained using their respective training datasets of 10-D vectors, as described in Section 2.3 (Physico-chemical properties used as descriptors). Both TDSs were reduced to 1/10th of their original sizes as follows. The TDS was subdivided into clusters, whose number was 1/200th of all data, using a k-means nearest neighbor method. We then randomly selected 20 sequences from each cluster, which yielded the 1/10 reduced dataset. The training parameters (cost and gamma) used in the SVM were optimized using a 5-fold cross-validation with the reduced 1/10th dataset.

2.5 Assessment of predictions
The prediction results were assessed on a residue basis: the state of each residue in the protein sequence was predicted and was compared with the experimental state. The prediction results were classified into four categories. NTP is the number of true positives, defined as disordered residues correctly predicted as disordered. Similarly, NFP, NTN and NFN denote, respectively, the number of false positives (ordered residues incorrectly predicted as disordered), the number of true negatives (correctly predicted ordered residues) and the number of false negatives (disordered residues incorrectly predicted as ordered).

The first assessment criterion is the receiver operating characteristic (ROC) curve. The ROC curve is obtained by plotting the false positive rate (RFP(P) = NFP(P) / (NTN + NFP)) against the true positive rate (RTP(P) = NTP(P) / (NTP + NFN)). The RFP against RTP was plotted while the disorder probability increased from 0 to 1.0 with a 0.01 increment. The larger the area under the ROC curve (SAUC), the more robust an algorithm is. An area of 1.00 means a perfect predictor, and an area of 0.50 corresponds to a random guess.
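The threshold sweep described above can be sketched as follows. The trapezoidal integration and the explicit (0, 0) and (1, 1) end points are assumptions of this sketch, as are the function names; the text specifies only the 0.01 increment and the two rates.

```python
from typing import List, Tuple

def roc_points(probs: List[float], labels: List[int],
               step: float = 0.01) -> List[Tuple[float, float]]:
    """(false positive rate, true positive rate) for thresholds 0, 0.01, ..., 1.

    labels: 1 = disordered, 0 = ordered; both classes must be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for k in range(round(1 / step) + 1):
        threshold = k * step
        tp = sum(1 for p, l in zip(probs, labels) if l == 1 and p > threshold)
        fp = sum(1 for p, l in zip(probs, labels) if l == 0 and p > threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area, with the curve anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

perfect = area_under_curve(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

A predictor that ranks every disordered residue above every ordered residue reaches an area of 1.00, in line with the interpretation given in the text.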

Next, the sensitivity (Ssens) and specificity (Sspec), which respectively indicate the fraction of correctly identified disordered residues and ordered residues, were used to evaluate prediction performances. They are defined as follows:

\[ S_{sens} = \frac{N_{TP}}{N_{TP} + N_{FN}}, \qquad S_{spec} = \frac{N_{TN}}{N_{TN} + N_{FP}} \]

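Another commonly used criterion is the Matthews correlation coefficient (SMCC), which is given as:

\[ S_{MCC} = \frac{N_{TP} N_{TN} - N_{FP} N_{FN}}{\sqrt{(N_{TP} + N_{FP})(N_{TP} + N_{FN})(N_{TN} + N_{FP})(N_{TN} + N_{FN})}} \]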

For a prediction with unequal class frequencies, SMCC favors the correct prediction of small classes. In our case, SMCC will favor the correct prediction of disordered regions over that of ordered regions.

In CASP6, a new criterion, Sproduct, was defined to complement SMCC and emphasize the detection of disordered regions (Jin and Dunbrack, 2005). There are considerably fewer disordered residues than ordered residues. Moreover, disordered regions tend to be underpredicted. Sproduct is defined as:

\[ S_{product} = S_{sens} \times S_{spec} \]

Sproduct ranges from 0 to 1, where 1 indicates a perfect prediction. Sproduct rises much faster than SMCC when the number of correctly predicted disordered residues rises.
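The residue-level criteria can be computed with a few lines of code. This is a sketch with hypothetical function names; the product form of Sproduct follows the CASP6 definition and is consistent with the values reported in Section 3.4 (0.353 × 0.881 ≈ 0.311). The toy labels in the example are not data from the study.

```python
import math
from typing import Dict, List, Tuple

def confusion_counts(predicted: List[int],
                     observed: List[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) over residues; 1 = disordered, 0 = ordered."""
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p == 1 and o == 1)
    fp = sum(1 for p, o in pairs if p == 1 and o == 0)
    tn = sum(1 for p, o in pairs if p == 0 and o == 0)
    fn = sum(1 for p, o in pairs if p == 0 and o == 1)
    return tp, fp, tn, fn

def scores(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, Matthews correlation and their product."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sens": sens, "spec": spec, "mcc": mcc, "product": sens * spec}

# Toy example: six residues, three of them observed as disordered.
counts = confusion_counts([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1])
result = scores(*counts)
```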


    3 RESULTS
 
3.1 Basic two-level disorder predictor
We constructed a basic two-level SVM predictor using all 10 descriptors, and compared its prediction performances with those of an ordinary single-level SVM prediction method, which is identical to the first-level SVM of our basic predictor (Fig. 1). The average prediction performances were estimated by training and evaluating both predictors 20 times using randomly selected data from TDS (Table 1). The Ssens of the two-level prediction was 1.7% higher than that of a usual single SVM prediction. In addition, the SMCC and Sproduct were also slightly higher. Furthermore, the two-level SVM prediction algorithm enabled the prediction of disordered regions for residues in terminal regions (see Methods section for further details).

3.2 Combination of descriptors for the first-level SVM
We optimized the first-level SVM by investigating which descriptor combination would yield a better prediction result. However, because of the numerous combinations (1022 patterns), we classified the 10 descriptors into six groups, which reduced the number of combinations to 63 patterns (62 patterns and the basic predictor). The six groups were the hydrophobicity descriptors (mean hydrophobicity, hydrophobic cluster value), the charge descriptors (mean net charge, charge cluster value), the sequence complexity descriptor (sequence complexity), the amino acid composition descriptors (amino acid composition), the secondary structure descriptors (secondary structure) and the average number of contacts descriptor (average number of contacts). We evaluated the prediction performances of all 63 predictors based on their sensitivity and specificity using ADS-2 (Fig. 2). According to this criterion, 9 among the 62 predictors exhibited better performance than the basic predictor; the performance of predictor34 was the highest (Supplementary Table 1).
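The pattern counts quoted above can be checked by enumerating subsets. The labels given to the individual composition and secondary structure descriptors are hypothetical, and reading "1022 patterns" as all non-empty descriptor subsets other than the full set (the basic predictor) is an interpretation of this sketch.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

# The six descriptor groups listed in the text (member labels are illustrative).
GROUPS = {
    "hydrophobicity": ["mean hydrophobicity", "hydrophobic cluster value"],
    "charge": ["mean net charge", "charge cluster value"],
    "sequence complexity": ["sequence complexity"],
    "amino acid composition": ["composition vs ordered",
                               "composition vs disordered"],
    "secondary structure": ["alpha core", "beta core"],
    "contacts": ["average number of contacts"],
}

def nonempty_subsets(items: Iterable[str]) -> List[Tuple[str, ...]]:
    """All non-empty combinations of the given items."""
    pool = list(items)
    return [combo
            for size in range(1, len(pool) + 1)
            for combo in combinations(pool, size)]

descriptors = [name for members in GROUPS.values() for name in members]
descriptor_subsets = nonempty_subsets(descriptors)  # includes the full set
group_subsets = nonempty_subsets(GROUPS)            # includes all six groups

n_descriptor_patterns = len(descriptor_subsets) - 1  # excluding the basic predictor
n_group_patterns = len(group_subsets)                # 62 patterns plus the basic predictor
```

Grouping reduces the search from 2^10 − 2 = 1022 alternatives to 2^6 − 1 = 63 predictors, each of which can be trained and evaluated on ADS-2.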


Fig. 2. Influence of the descriptor on the predictor's sensitivity (Ssens) and specificity (Sspec). The basic predictor is shown by a square. The other predictors are shown with crosses. A performance borderline, where the sum of the sensitivity and specificity is equal to that of the basic predictor, is shown with a dotted line. Predictors above the dotted line performed better than the basic predictor and are identified by their identity numbers. The prediction Ssens and Sspec were averaged over 20 training and evaluation iterations performed using randomly selected learning data.

 

Table 1. Prediction performance of one-level and two-level SVM prediction

 
3.3 Building POODLE-L
To improve the prediction performance, we assembled the results of the basic predictor and those of the nine predictors that outperformed it into a consensus prediction system, POODLE-L. We calculated the average probability of each predictor using 7-, 23- and 39-residue windows and attributed the average probabilities to the central residue (Fig. 3 Step 1). For each window length and each residue, the two largest and the two smallest averaged probabilities among the 10 predictors were omitted (Fig. 3 Step 2). The final disorder probability of an amino acid was calculated as the average of the remaining 18 probabilities, i.e. six per window length (Fig. 3 Step 3). Examples of disordered region prediction for two proteins in ADS-1 are shown in Figure 4.
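The assembling step can be sketched as a trimmed mean over smoothed predictor outputs. The handling of residues near the termini (windows truncated at the sequence ends) is an assumption, as are the function names; the window lengths and the trimming follow the description above.

```python
from typing import List, Sequence

WINDOWS = (7, 23, 39)  # smoothing window lengths given in the text

def smooth(probs: Sequence[float], width: int) -> List[float]:
    """Centered moving average; windows are truncated at the termini (assumption)."""
    half = width // 2
    smoothed = []
    for i in range(len(probs)):
        segment = probs[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

def consensus(predictor_probs: Sequence[Sequence[float]],
              windows: Sequence[int] = WINDOWS,
              trim: int = 2) -> List[float]:
    """Trimmed mean across predictors and window lengths for every residue."""
    n_res = len(predictor_probs[0])
    kept: List[List[float]] = [[] for _ in range(n_res)]
    for width in windows:
        smoothed = [smooth(p, width) for p in predictor_probs]
        for i in range(n_res):
            values = sorted(s[i] for s in smoothed)
            # Drop the `trim` smallest and `trim` largest values per window length.
            kept[i].extend(values[trim:len(values) - trim])
    return [sum(values) / len(values) for values in kept]

# Toy input: 10 predictors with constant outputs 0.0, 0.1, ..., 0.9.
toy = [[k / 10] * 50 for k in range(10)]
final = consensus(toy)
```

With 10 predictors and three window lengths, six values survive the trimming per window length, so each residue is scored from 18 values. Residues whose final probability exceeds 0.5 would then be called disordered, as in Figure 4.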


Fig. 3. Assembling scheme of the predictors (POODLE-L). pred_15 to pred_53 represent individual predictors shown in Figure 2, and pred_basic represents the basic predictor, which uses all 10 descriptors in the first-level SVM. Each of the 10 predictors predicts the disorder probabilities from the query sequence (Step 1). The length shows the window length, and the numbers in the dotted-line frame are the probabilities calculated with each individual predictor for the n-th amino acid residue, which are used to calculate POODLE-L's prediction of disordered regions (Step 2). The mean value is computed using the numbers in the dotted-line frame for all residues in a protein (Step 3).

 

Fig. 4. Two examples of POODLE-L disorder prediction. The residue number is shown on the horizontal axis. The disorder probability of each residue is represented on the vertical axis. Residues with disorder probability higher than a threshold value of 0.5 are predicted as disordered. A horizontal line at a probability of 0.5 shows the threshold. The gray regions show the disordered regions as defined in DisProt. (A) Suppressor of cytokine signaling 3 (DisProt code: DP00446), (B) Subtilisin E (precursor) (DisProt code: DP00394).

 
3.4 Comparison with other methods
We compared the prediction performance of POODLE-L with that of eight publicly available disordered region predictors using ADS-1: DISOPRED2, VSL2 (Peng et al., 2006), VL3H (Obradovic et al., 2003), DisEMBL, RONN, IUPred, FoldIndex (Prilusky et al., 2005) and FoldUnfold. According to the ROC curve, POODLE-L's performance was the highest among all predictors, especially in the very low false positive rate region. Furthermore, POODLE-L's prediction performance was also the highest according to SAUC and according to SMCC and Sproduct, which provide a simultaneous evaluation of Ssens and Sspec (Table 2). When considered individually, POODLE-L's Ssens and Sspec, though relatively high, were not the highest. However, it is important to assess Ssens and Sspec simultaneously because a high Sspec is usually counterbalanced by a low Ssens.


Table 2. Prediction performances of POODLE-L and of eight publicly available disordered region predictors

 
The ability of POODLE-L to predict short disordered regions was low, as anticipated. The assessment dataset for estimating this performance consisted of 15 sequences with short disordered regions (between 5aa and 20aa) from X-ray crystallographic data, in which 3500 residues were classed as ordered and 204 as short disordered residues. The Ssens, Sspec, SMCC and Sproduct of POODLE-L were 0.353, 0.881, 0.158 and 0.311, respectively, and its performance on several criteria was lower than that of the other predictors (Supplementary Table 2).


    4 DISCUSSION
 
POODLE-L is a disordered region predictor that is especially tuned for predicting long disordered regions. A major feature of POODLE-L is the assembly of multiple individual disordered region predictors into a final integrated predictor. The outputs of 10 two-level SVM predictors are merged using a consensus algorithm. One predictor, the basic predictor, uses all 10 descriptors in the first-level prediction SVM, whereas the other predictors use the combinations of descriptors that yielded better performance than the basic predictor. The effectiveness of a consensus prediction algorithm was reported previously for secondary structure prediction (Cuff et al., 1998; Nishikawa and Noguchi, 1991), but to our knowledge this is its first application to the prediction of disordered regions. The prediction performance of POODLE-L was, indeed, 2.6–8.0% higher than that of the individual predictors according to Sproduct (Table 2 and Supplementary Table 1).

Descriptors that are most useful in POODLE-L for identifying disordered regions can be evaluated from their inclusion in the predictors that produce the best results (Table 3). We find that hydrophobicity descriptors appear in all but one predictor, whereas the secondary structure descriptors are not used in any predictor. That difference does not necessarily mean that disorder and secondary structure propensities are unrelated, but it might indicate that the encoding of the secondary structure descriptors can be improved. For example, we might use secondary structure prediction methods such as PSIPRED (McGuffin et al., 2000), which are probably more accurate than our Chou–Fasman-derived prediction scheme. As for the four other descriptor groups, one or several of them appear in all predictors, but they appear in various combinations. This interchangeability suggests that the descriptors encode redundant information (Table 3). Overall, though the descriptors defined in this study might not characterize disordered regions exhaustively, hydrophobicity seems likely to be a good indicator for long disordered regions.


Table 3. List of descriptors used in the first-level SVM that exhibited prediction performances higher than the basic predictor

 
POODLE-L exhibited the best prediction performance for long disordered regions when compared with eight publicly available disordered region predictors (Table 2 and Fig. 5), but it was poor at predicting short regions (Supplementary Table 2). Among other reasons, we infer that the high performance of POODLE-L in predicting long disordered regions originates from the large window, which captures information from amino acids that are located distant from each other. This conjecture concurs with our preliminary calculation, indicating that a 40-residue window is more effective than a 30-residue window. Shorter windows are typically used for predicting disordered regions; for example, IUPred and RONN use 21-residue and 19-residue windows, respectively.


Fig. 5. Receiver operating characteristic (ROC) curves of POODLE-L and eight publicly available predictors. The bold line shows POODLE-L; the other lines show six publicly available predictors (VL3H, VSL2, DISOPRED2, IUPred, RONN and DisEMBL). ROC curves could not be drawn for FoldIndex and FoldUnfold because their predictions do not provide the disorder probability of each residue. The prediction parameters and other options used for the six publicly available predictors are the same as those described in Table 2. The vertical and horizontal axes represent the FP and TP rates, as calculated in Section 2.5.

 
The performance of POODLE-L in predicting disordered regions shorter than 30 residues was poor, but this was anticipated because it was trained exclusively to predict long disordered regions. The ability of POODLE-L to predict long disordered regions exclusively and with high reliability therefore suggests that the differences in the physico-chemical properties of long and short disordered regions (Radivojac et al., 2004) can be used to improve disordered region prediction, as suggested by Peng et al. (2006). In our case, short disordered regions are identified by the method introduced by Shimizu et al. (2005).


    5 CONCLUSION
 
We described POODLE-L, which detects long disordered regions with high accuracy. POODLE-L achieved the best prediction performance among several previously reported prediction methods, according to several criteria. This performance appears to stem from POODLE-L's capability to efficiently recognize the sequence differences among long disordered regions, short disordered regions and ordered regions. Nevertheless, we believe that the prediction of long disordered regions might be further improved by exploring better descriptors.


    ACKNOWLEDGEMENTS
 
We thank our colleagues at PharmaDesign, Inc. and the protein function team at the Computational Biology Research Center (CBRC) for helpful advice and discussion. This work was funded by PharmaDesign, Inc. and the National Institute of Advanced Industrial Science and Technology (AIST).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Dmitrij Frishman

Received on January 31, 2007; revised on May 29, 2007; accepted on May 30, 2007

    REFERENCES
 

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.

    Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. (2000) 28:235–242.

    Brunger AT. X-PLOR, Ver. 3.1, A System for X-ray Crystallography and NMR. (1992) New Haven, CT: Yale University Press.

    Brunger AT, et al. Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D Biol. Crystallogr. (1998) 54:905–921.

    Chang CC, Lin CJ. Training nu-support vector classifiers: theory and algorithms. Neural Comput. (2001) 13:2119–2147.

    Cheng J, et al. Accurate prediction of protein disordered regions by mining protein structure data. Data Mining Knowl. Discov. (2005) 11:213–222.

    Cheng Y, et al. Rational drug design via intrinsically disordered protein. Trends Biotechnol. (2006) 24:435–442.

    Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. (1978) 47:45–148.

    Coeytaux K, Poupon A. Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics (2005) 21:1891–1900.

    Cuff JA, et al. JPred: a consensus secondary structure prediction server. Bioinformatics (1998) 14:892–893.

    Dunker AK, Obradovic Z. The protein trinity-linking function and disorder. Nat. Biotechnol. (2001) 19:805–806.

    Dunker AK, et al. Intrinsic protein disorder in complete genomes. Genome Inform. Ser. Workshop Genome Inform. (2000) 11:161–171.

    Dunker AK, et al. Intrinsically disordered proteins. J. Mol. Graph. Model. (2001) 19:26–59.

    Dunker AK, et al. Intrinsic disorder and protein function. Biochemistry (2002a) 41:6573–6582.

    Dunker AK, et al. Identification and functions of usefully disordered proteins. Adv. Protein Chem. (2002b) 62:25–49.

    Dunker AK, et al. Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J. (2005) 272:5129–5148.

    Dosztanyi Z, et al. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics (2005) 21:3433–3434.

    Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. (2005) 6:197–208.

    Fink AL. Natively unfolded proteins. Curr. Opin. Struct. Biol. (2005) 15:35–41.

    Galzitskaya OV, et al. FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics (2006a) 22:2948–2949.

    Galzitskaya OV, et al. Prediction of amyloidogenic and disordered regions in protein chains. PLoS Comput. Biol. (2006b) 2:1639–1648.

    Garbuzynskiy SO, et al. To be folded or to be unfolded? Protein Sci. (2004) 13:2871–2877.

    Iakoucheva LM, et al. Intrinsic disorder in cell-signaling and cancer-associated proteins. J. Mol. Biol. (2002) 323:573–584.

    Jin Y, Dunbrack RL Jr. Assessment of disorder predictions in CASP6. Proteins (2005) 61:167–175.

    Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. (1982) 157:105–132.

    Linding R, et al. GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res. (2003a) 31:3701–3708.

    Linding R, et al. Protein disorder prediction: implications for structural proteomics. Structure (2003b) 11:1453–1459.

    McGuffin LJ, et al. The PSIPRED protein structure prediction server. Bioinformatics (2000) 16:404–405.

    Murshudov GN, et al. Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr. D Biol. Crystallogr. (1997) 53:240–255.

    Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.

    Nishikawa K, Noguchi T. Predicting protein secondary structure based on amino acid sequence. Meth. Enzymol. (1991) 202:31–44.

    Noguchi T, Akiyama Y. PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB) in 2003. Nucleic Acids Res. (2003) 31:492–493.

    Obradovic Z, et al. Predicting intrinsic disorder from amino acid sequence. Proteins (2003) 53:566–572.

    Obradovic Z, et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins (2005) 61:176–182.

    Oldfield CJ, et al. Comparing and combining predictors of mostly disordered proteins. Biochemistry (2005a) 44:1989–2000.

    Oldfield CJ, et al. Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins (2005b) 59:444–453.

    Peng K, et al. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics (2006) 7:208.

    Prilusky J, et al. FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics (2005) 21:3435–3438.

    Radivojac P, et al. Protein flexibility and intrinsic disorder. Protein Sci. (2004) 13:71–80.

    Radivojac P, et al. Intrinsic disorder and functional proteomics. Biophys. J. (2007) 92:1439–1456.

    Romero P, et al. Sequence data analysis for long disordered regions prediction in the Calcineurin family. Genome Inform. Ser. Workshop Genome Inform. (1997a) 8:110–124.

    Romero P, et al. Identifying disordered regions in proteins from amino acid sequence. Int. Proc. Neur. Net. (1997b) 1:90–95.

    Romero P, et al. Sequence complexity of disordered protein. Proteins (2001) 42:38–48.

    Sheldrick GM. SHELX97, programs for crystal structure analysis (Release 97-2). (1997) Germany: University of Gottingen.

    Shenkin PS, et al. Information-theoretical entropy as a measure of sequence variability. Proteins (1991) 11:297–313.

    Shimizu K, et al. Feature selection based on physicochemical properties of redefined N-term and C-term regions for predicting disorder. (2005) Proceedings of the Institute of Electrical and Electronics Engineers CIBCB. 262–267.

    Su CT, et al. Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics (2006) 7:319.

    Tompa P. Intrinsically unstructured proteins. Trends Biochem. Sci. (2002) 27:527–533.

    Uversky VN, et al. Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins (2000) 41:415–427.

    Uversky VN. Natively unfolded proteins: a point where biology waits for physics. Protein Sci. (2002) 11:739–756.

    Uversky VN. Protein folding revisited. A polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go? Cell Mol. Life Sci. (2003) 60:1852–1871.

    Uversky VN, et al. Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J. Mol. Recognit. (2005) 18:343–384.

    Vucetic S, et al. DisProt: a database of protein disorder. Bioinformatics (2005) 21:137–140.

    Ward JJ, et al. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. (2004) 337:635–645.

    Wootton JC. Sequence with ‘unusual’ amino acid composition. Curr. Opin. Struct. Biol. (1994) 4:413–421.

    Yang ZR, et al. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics (2005) 21:3369–3376.



