Bioinformatics Advance Access originally published online on June 28, 2007
Bioinformatics 2007 23(17):2337-2338; doi:10.1093/bioinformatics/btm330
POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix
¹Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064 and ²PharmaDesign, Inc., 2-19-8 Hatchobori, Chuo-ku, Tokyo 104-0032, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Protein disorder is characterized by the lack of a stable 3D structure, and is considered to be involved in a number of important protein functions such as regulatory and signalling events. We developed a web application, POODLE-S, which predicts disordered regions from amino acid sequences by using physicochemical features and a reduced amino acid set of a position-specific scoring matrix.
Availability: POODLE-S is available from http://mbs.cbrc.jp/poodle/poodle-s.html and can be used by both academic and commercial users.
Contact: poodle{at}cbrc.jp
| 1 INTRODUCTION |
|---|
Protein disorder is a widespread phenomenon in which the polypeptide chain lacks a stable 3D structure and has a high degree of flexibility. This phenomenon is considered to provide essential biological functions because a dynamic conformation allows proteins to interact with multiple targets (Dunker et al., 2002). As the primary structure of disordered regions differs from that of folded regions (Garner et al., 1998), the development of prediction methods based on amino acid sequence analysis has been encouraged (Jones and Ward, 2003; Li et al., 1999; Linding et al., 2003; Obradovic et al., 2003). We focused our attention on two facts. First, amino acid composition has different propensities in the N-term, C-term and internal regions (Shimizu et al., 2005). Second, general physicochemical properties, rather than specific amino acids, are the key factors that contribute to protein disorder (Weathers et al., 2004). We therefore investigated whether and how different physicochemical properties are required to characterize disorder in different regions (Shimizu et al., 2005). Our application, POODLE-S, defines a suitable length and position for the N-term and C-term regions for predicting disorder, and provides region-specific predictions by selecting the physicochemical features that are discriminative for each region.
| 2 OUTLINE OF METHODS |
|---|
We used a χ² test to define seven regions on the basis of position from the N-terminus, so that the data within each region had a similar amino acid composition. The POODLE-S application consists of seven predictors, which use support vector machines (see footnote 1). One predictor is prepared for each region, and each selects its own features as follows.
- The predictor selects the physicochemical features that are specifically discriminative for its region from 10 different physicochemical properties (hydrophilic, hydrophobic, charged, positive, negative, aromatic, aliphatic, tiny, small and polar). Amino acids that have none of the selected physicochemical properties are also selected as features.
- A position-specific scoring matrix (PSSM) of the target sequence is obtained via PSI-BLAST. The PSSM is divided into sliding windows of size m. Each window is a matrix $E_{i,j}$ ($i = 1, \ldots, m$; $j = 1, \ldots, 20$), where j indexes the 20 amino acids.
- Each feature is calculated as $F_{i,c} = \sum_{j \in c} E_{i,j}$ ($i = 1, \ldots, m$), where $j \in c$ means that amino acid j has the characteristic c.
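The feature calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the POODLE-S implementation: the membership of each property group in `PROPERTY_GROUPS` is hypothetical (the text does not list which amino acids belong to which property), the column order in `AMINO_ACIDS` is assumed to follow the usual PSI-BLAST ASCII matrix layout, and the function names are invented for this example.

```python
# Assumed PSSM column order (the usual PSI-BLAST ASCII matrix layout).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Hypothetical property groups: illustrative membership only,
# not the grouping used by POODLE-S.
PROPERTY_GROUPS = {
    "positive": set("KRH"),
    "negative": set("DE"),
    "aromatic": set("FWYH"),
}


def sliding_windows(pssm, m):
    """Yield every window of m consecutive PSSM rows (one row per residue)."""
    for start in range(len(pssm) - m + 1):
        yield pssm[start:start + m]


def window_features(window, groups=PROPERTY_GROUPS, alphabet=AMINO_ACIDS):
    """Return one dict per window row i, with F[i][c] = sum of E[i][j] over j in c."""
    features = []
    for row in window:
        features.append({
            name: sum(score for aa, score in zip(alphabet, row) if aa in members)
            for name, members in groups.items()
        })
    return features
```

Each window of m rows is thus reduced from m × 20 scores to m × (number of selected properties) values, which is the "reduced amino acid set" idea in the text. Calling `window_features` on each window produced by `sliding_windows(pssm, m)` gives the per-window feature vectors that a classifier could consume.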
| 3 PERFORMANCE |
|---|
We used the dataset (see footnote 2) of the latest Critical Assessment of Techniques for Protein Structure Prediction (CASP7, http://predictioncenter.org/casp7/Casp7.html) to assess how well POODLE-S performs. POODLE-S was trained on high-resolution, single-chain X-ray crystal structure data (Shimizu et al., 2005) and the DisProt database (Vucetic et al., 2005). All the data were obtained before the CASP7 prediction season, so at the time it was trained POODLE-S contained no information about CASP7 target sequences. We used sensitivity [tp/(tp + fn)], specificity [tn/(tn + fp)], selectivity [tp/(tp + fp)] and Matthews' correlation coefficient (MCC) for assessment, where tp, tn, fp and fn are the numbers of true positives, true negatives, false positives and false negatives. MCC balances sensitivity and specificity, and is calculated as follows.
$$\mathrm{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$
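As a worked illustration of the four scores, the following Python sketch computes them from residue-level counts, using the standard definition of MCC. The function name and the zero-denominator convention (returning 0.0) are assumptions made for this example, and the counts in the usage note are invented, not CASP7 results.

```python
import math


def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def disorder_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, selectivity and MCC from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),   # tp / (tp + fn)
        "specificity": _ratio(tn, tn + fp),   # tn / (tn + fp)
        "selectivity": _ratio(tp, tp + fp),   # tp / (tp + fp)
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

For example, with the made-up counts tp = 30, tn = 50, fp = 10 and fn = 10, sensitivity and selectivity are both 0.75, specificity is 50/60, and MCC is 1400/2400.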
Table 1 shows the results of the assessment of POODLE-S on the four scores in comparison with three other groups that participated successfully in CASP7. The predictions of DISOPRED (Ward et al., 2004), ISTZORAN (Li et al., 1999; Obradovic et al., 2003) and fais were downloaded from the CASP7 website. DISOPRED is a fully automatic server group, whereas both ISTZORAN and fais registered as human expert groups, which can use any combination of computational and human methods. The data indicate that the accuracy (MCC) of our method is comparable with that of the other three top groups. On average, POODLE-S has a lower sensitivity (SEN), which is, however, compensated by a higher specificity (SPC) and selectivity (SEL). We additionally compared the predictions of the different groups for the seven regions defined by our method (Table 2). The results indicate that POODLE-S performs better on regions NR2 and NR3. Region-specific feature selection therefore appears to be an effective way of predicting protein disorder.
| 4 THE POODLE-S SERVER |
|---|
The web server takes a single amino acid sequence as input. Users are also required to enter an accessible e-mail address, to which the result of the prediction is sent. POODLE-S provides both text output and graphical output. The text output style is based on the CASP format. Data in this format are inserted between the MODEL and the END records. Each line consists of a residue code, a two-state prediction code and a confidence score. The symbols for the two-state order/disorder prediction are O for order and D for disorder. The last column indicates the probability of the residue being in a disordered region; this value is between 0.0 and 1.0. The graphical output, in the form of an interactive line graph, is available from a URL that is included in the e-mail and is accessible for 2 weeks after submission. By pointing the cursor at the line graph, the user can display the position from the N-terminus, the amino acid code and the probability score (Fig. 1).
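The text output described above is straightforward to post-process. The Python sketch below reads the lines between the MODEL and END records and groups consecutive D residues into segments. It is an illustration only: the function names are hypothetical, and it assumes whitespace-separated columns exactly as described (residue code, O/D state, probability); any other lines in a real result file are simply ignored.

```python
def parse_disorder_prediction(text):
    """Return (residue, state, probability) tuples found between MODEL and END."""
    records = []
    in_model = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "MODEL":
            in_model = True
        elif fields[0] == "END":
            break
        elif in_model and len(fields) == 3:
            residue, state, score = fields
            if state not in ("O", "D"):
                raise ValueError(f"unexpected state code: {state!r}")
            records.append((residue, state, float(score)))
    return records


def disordered_segments(records):
    """Return 1-based (start, end) positions of runs of consecutive D residues."""
    segments = []
    start = None
    for position, (_, state, _) in enumerate(records, start=1):
        if state == "D" and start is None:
            start = position
        elif state != "D" and start is not None:
            segments.append((start, position - 1))
            start = None
    if start is not None:
        segments.append((start, len(records)))
    return segments
```

Positions are counted from the N-terminus starting at 1, matching the way the graphical output reports them.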
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Yoichi Muraoka from Waseda University and Satoru Kanai from PharmaDesign, Inc. for helpful discussions. We also thank an anonymous reviewer for his/her helpful comments, which improved the manuscript.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
1 We use the support vector machine package LIBSVM (Chang and Lin, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm).
2 CASP7 provided 100 valid targets during the prediction season. We evaluated the results using the 89 targets whose structures are available from the Protein Data Bank.
Received on April 4, 2007; revised on May 2, 2007; accepted on June 16, 2007
| REFERENCES |
|---|
Chang CC, Lin CJ. LIBSVM: a library for support vector machines. (2001).
Dunker AK, et al. Intrinsic disorder and protein function. Biochemistry (2002) 41:6573–6582.
Garner E, et al. Predicting disordered regions from amino acid sequence: common themes despite differing structural characterization. Genome Inform Ser Workshop Genome Inform (1998) 9:201–213.
Jones DT, Ward JJ. Prediction of disordered regions in proteins from position specific score matrices. Proteins (2003) 53(Suppl. 6):573–578.
Li X, et al. Predicting protein disorder for N-, C-, and internal regions. Genome Inform Ser Workshop Genome Inform (1999) 10:30–40.
Linding R, et al. Protein disorder prediction: implications for structural proteomics. Structure (2003) 11:1453–1459.
Obradovic Z, et al. Predicting intrinsic disorder from amino acid sequence. Proteins (2003) 53(Suppl. 6):566–572.
Shimizu K, et al. Feature selection based on physicochemical properties of redefined N-term region and C-term regions for predicting disorder. In: Proceedings of 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2005) 262–267.
Vucetic S, et al. DisProt: a database of protein disorder. Bioinformatics (2005) 21:137–140.
Ward JJ, et al. The DISOPRED server for the prediction of protein disorder. Bioinformatics (2004) 20:2138–2139.
Weathers EA, et al. Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein. FEBS Lett (2004) 576:348–352.