Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1719-1720; doi:10.1093/bioinformatics/bti203
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Porter: a new, accurate server for protein secondary structure prediction

Gianluca Pollastri 1,* and Aoife McLysaght 2

1 Computer Science Department, University College Dublin, Belfield, Dublin 4, Ireland
2 Genetics Department, Trinity College Dublin, Dublin 2, Ireland

*To whom correspondence should be addressed.


    Abstract

Summary: Porter is a new system for protein secondary structure prediction in three classes. Porter relies on bidirectional recurrent neural networks with shortcut connections, accurate coding of input profiles obtained from multiple sequence alignments, second-stage filtering by recurrent neural networks, incorporation of long-range information and large-scale ensembles of predictors. Porter's accuracy, tested by rigorous 5-fold cross-validation on a large set of proteins, exceeds 79%, which is significantly above a copy of the state-of-the-art SSpro server and better than any system published to date.

Availability: Porter is available as a public web server at http://distill.ucd.ie/porter/

Contact: gianluca.pollastri{at}ucd.ie

Protein secondary structure (SS) prediction is an important step in the prediction of protein structure and function. Accurate SS information has been shown to improve the sensitivity of threading methods (e.g. Jones, 1999b) and is at the core of most ab initio methods (e.g. see Bradley et al., 2003) for the prediction of protein structure. Virtually all modern methods for protein SS prediction are based on machine learning techniques (Jones, 1999a; Pollastri et al., 2002) and exploit evolutionary information in the form of profiles extracted from alignments of multiple homologous sequences (MSAs). The progress of these methods over the last 10 years has been slow but steady, and is due to numerous factors: the ever-increasing size of training sets; more sensitive methods for the detection of homologues, such as PSI-BLAST (Altschul et al., 1997); the use of ensembles of multiple predictors trained independently, sometimes tens of them (Petersen et al., 2000); and more sophisticated machine learning techniques (e.g. Pollastri et al., 2002).

We have developed Porter, a new server for protein SS prediction. Porter is based on two layers of Bidirectional Recurrent Neural Networks (BRNN) and is an evolution of SSpro (Pollastri et al., 2002), one of the most accurate public servers to date (Rost and Eyrich, 2001; Lesk et al., 2001). The novel elements of Porter are accurate coding of input profiles obtained from MSA, second stage filtering by recurrent neural networks, incorporation of long-range information, large-scale ensembles of predictors and larger training sets.

Datasets. Porter is trained on the December 2003 25% pdb_select list. After processing by DSSP (Kabsch and Sander, 1983), the set contains 2171 proteins and 344 653 amino acids. We map the eight DSSP classes to three as follows: H, G, I -> Helix; E, B -> Strand; S, T, . -> Coil. This assignment is known to be ‘hard’ and has been adopted at CASP (Lesk et al., 2001); more lenient assignments generally lead to higher performance. Profiles obtained from MSA have been shown to improve SS prediction performance significantly (starting from Rost and Sander, 1993). In Porter, we use MSA extracted from the NR database as available on March 3, 2004, containing over 1.4 million sequences. Redundancy in the database was first reduced at a 98% threshold, leaving 1.05 million sequences. The alignments were generated by three runs of PSI-BLAST (Altschul et al., 1997).
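The class reduction described above can be sketched in a few lines of Python. This is only an illustration of the stated mapping, not Porter's code: the names DSSP_TO_THREE and reduce_dssp are hypothetical, and treating a blank or "-" in DSSP output as the "." state is an assumption.

```python
# Eight DSSP states reduced to three classes, following the mapping in the text.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helix
    "E": "E", "B": "E",             # strand
    "S": "C", "T": "C", ".": "C",   # coil
}


def reduce_dssp(states):
    """Map a string of DSSP states to a three-class string over H, E, C."""
    reduced = []
    for state in states:
        # Assumption: a blank or "-" denotes the unassigned state written "." in the text.
        if state in (" ", "-"):
            state = "."
        # Any other unknown symbol raises KeyError rather than being guessed.
        reduced.append(DSSP_TO_THREE[state])
    return "".join(reduced)
```

For example, reduce_dssp("HGIEBST.") returns "HHHEECCC".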

Input coding. In Porter, the input at each residue is coded as a letter out of an alphabet of 25. Besides the 20 standard amino acids, B, U, X, Z and . (gap) are considered. The input presented to the networks is the frequency of each of the 24 non-gap symbols, plus the total frequency of gaps, in each column of the alignment. This input coding scheme is richer than the 20-letter scheme adopted in SSpro (Pollastri et al., 2002).
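A minimal sketch of this 25-number coding for one alignment is given below. Several details are assumptions rather than statements from the text: the ordering of the symbols, the use of unweighted sequence counts, normalization by the total number of aligned sequences (gapped ones included), and accepting "-" as a gap in addition to ".". The function names are hypothetical.

```python
# 20 standard amino acids followed by B, U, X, Z (the ordering is an assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYBUXZ"
GAP_SYMBOLS = {".", "-"}  # assumption: "-" is accepted as a gap as well


def column_profile(column):
    """Return 25 numbers: 24 non-gap symbol frequencies, then the gap frequency."""
    if not column:
        raise ValueError("empty alignment column")
    counts = {symbol: 0 for symbol in ALPHABET}
    gaps = 0
    for symbol in column:
        if symbol in GAP_SYMBOLS:
            gaps += 1
        elif symbol in counts:
            counts[symbol] += 1
        else:
            raise ValueError("unexpected symbol: %r" % symbol)
    total = len(column)  # assumption: gapped sequences count in the denominator
    return [counts[symbol] / total for symbol in ALPHABET] + [gaps / total]


def alignment_profile(rows):
    """Code every column of an alignment given as equal-length strings."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("alignment rows must have equal length")
    return [column_profile([row[j] for row in rows]) for j in range(length)]
```

Under these assumptions the 25 values in each column sum to 1, so the gap frequency directly reduces the mass available to the 24 residue symbols.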

Output filtering, incorporation of long-range information. We adopt a filtering network as, for example, in Rost and Sander (1993), but we augment the input to this network with the predictions of the first-stage network averaged over multiple contiguous windows, i.e. if σ_j = (α_j, β_j, γ_j) are the outputs in position j of the first-stage network, corresponding to the estimated probabilities of helix, strand and coil given the inputs, the input to the second-stage network in position j is the array I_j:

where k_f = j + f(2ω + 1), 2ω + 1 is the size of the window over which first-stage predictions are averaged and 2p + 1 is the number of windows considered. In Porter ω = 7 and p = 7, i.e. predictions at 225 contiguous residues are considered by the filtering network.
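The window construction can be illustrated with a short Python sketch under stated assumptions. It builds, for position j, the averages of the first-stage outputs over the 2p + 1 windows of width 2ω + 1 centred at k_f = j + f(2ω + 1), for f = -p, ..., p. Three choices are assumptions and not taken from the text: windows are normalized by their full width, positions falling outside the sequence contribute zero, and the unaveraged output at j is not passed through separately. The function name second_stage_input is hypothetical.

```python
def second_stage_input(sigma, j, omega=7, p=7):
    """Window-averaged first-stage outputs around position j.

    sigma is a list of (helix, strand, coil) probability triples, one per
    residue. The result is a flat list of 3 * (2p + 1) averaged values.
    """
    width = 2 * omega + 1
    n = len(sigma)
    features = []
    for f in range(-p, p + 1):
        k_f = j + f * width  # centre of the f-th window
        sums = [0.0, 0.0, 0.0]
        for h in range(k_f - omega, k_f + omega + 1):
            # Assumption: positions beyond either end of the chain add nothing.
            if 0 <= h < n:
                for c in range(3):
                    sums[c] += sigma[h][c]
        # Assumption: every window is divided by its full width.
        features.extend(s / width for s in sums)
    return features
```

With ω = 7 and p = 7 the windows tile (2p + 1)(2ω + 1) = 15 × 15 = 225 contiguous residues, which matches the figure quoted in the text, and the sketch yields 45 values per position.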

Large-scale ensembles. Five two-stage BRNN models are trained independently and ensemble averaged to build Porter. Differences among models are introduced by two factors: stochastic elements in the training protocol, such as different initial weights of the networks and different shuffling of the examples; different architecture and size of the models. In particular, we resorted to BRNN architectures with shortcuts (Baldi et al., 1999). In these, connections along the forward and backward hidden chains span more than one-residue intervals, creating shorter paths between inputs and outputs. Averaging the five models' outputs leads to improvements in the range of 1–1.5% over single models. In Petersen et al. (2000), a slight improvement in the prediction accuracy was obtained by ‘brute ensembling’ of several tens of different models trained independently. Here, we adopted a less expensive technique: a copy of each of the five models is saved at regular intervals during training. The training protocol (similar to that described by Pollastri et al., 2002) guarantees that differences during training are non-trivial. In Porter we build an ensemble of 45 such copies (9 for each of the 5 models).
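Ensemble averaging of this kind can be sketched as follows. The sketch assumes a plain unweighted mean of the per-residue class probabilities over all saved model copies, followed by picking the most probable class; the text does not say whether any weighting is applied, and the function names are hypothetical.

```python
def ensemble_average(predictions):
    """Average per-residue class probabilities over several models.

    predictions[m][j] is the (helix, strand, coil) triple that model m
    outputs for residue j; all models must cover the same residues.
    """
    n_models = len(predictions)
    length = len(predictions[0])
    if any(len(model) != length for model in predictions):
        raise ValueError("all models must predict the same number of residues")
    averaged = []
    for j in range(length):
        # Assumption: unweighted mean over models.
        averaged.append(tuple(
            sum(model[j][c] for model in predictions) / n_models
            for c in range(3)
        ))
    return averaged


def predict_classes(averaged, labels="HEC"):
    """Pick the most probable class at each residue."""
    return "".join(
        labels[max(range(3), key=lambda c: probs[c])] for probs in averaged
    )
```

In the setting described above, predictions would hold 45 entries (9 saved copies of each of the 5 models), but the same code applies to any number of models.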

Results and conclusions. We measured the performances of each incremental improvement separately, by a 5-fold cross-validation procedure. The percentages of correctly classified residues (Q3), helices and strands (Qα, Qβ), and Matthews' correlation coefficients for helices and strands (Cα, Cβ) by all systems are shown in Table 1. Q3 differences >0.07% are statistically significant. An exact copy of SSpro, retrained on the new sets, obtains Q3 = 78.33%. An ensemble of five models with shortcuts achieves Q3 = 78.48%. When 25 input symbols are adopted, an improvement at the margin of statistical significance is observed (Q3 = 78.54%). The most sizeable gain (+0.35%) is obtained when two-layer BRNNs with long-range filtering are adopted. Large-scale ensembles lead to a further improvement, comparable to that reported by Petersen et al. (2000). The overall performance of Porter is Q3 = 79.01% (SOV = 75.0%). Tested on the more lenient class assignment by Petersen et al. (2000), Porter surpasses 81% correct classification. Performance indices for the single classes indicate that most of Porter's gains come from more accurate prediction of strands.
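For readers unfamiliar with these indices, the sketch below computes them from a predicted and an observed three-class string. It uses the usual definitions, which are assumed rather than quoted from the text: Q3 is the percentage of residues whose class is predicted correctly, the per-class Q is the percentage of observed residues of that class that are predicted correctly, and the Matthews coefficient is computed per class from the one-versus-rest confusion counts. The function names are hypothetical.

```python
import math


def q3(pred, obs):
    """Percentage of residues whose class is predicted correctly."""
    if len(pred) != len(obs) or not obs:
        raise ValueError("strings must be non-empty and of equal length")
    return 100.0 * sum(p == o for p, o in zip(pred, obs)) / len(obs)


def q_class(pred, obs, cls):
    """Percentage of observed residues of class cls predicted as cls."""
    observed = sum(o == cls for o in obs)
    if observed == 0:
        raise ValueError("class %r does not occur in obs" % cls)
    correct = sum(p == cls and o == cls for p, o in zip(pred, obs))
    return 100.0 * correct / observed


def matthews(pred, obs, cls):
    """Matthews correlation coefficient for class cls versus the rest."""
    tp = sum(p == cls and o == cls for p, o in zip(pred, obs))
    tn = sum(p != cls and o != cls for p, o in zip(pred, obs))
    fp = sum(p == cls and o != cls for p, o in zip(pred, obs))
    fn = sum(p != cls and o == cls for p, o in zip(pred, obs))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

As a toy check, with obs = "HHHHEEEECC" and pred = "HHHCEEECCC", q3 gives 80.0 and q_class for "H" gives 75.0.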


Table 1. Overall Q3%, Qα%, Qβ%, Cα% and Cβ% for SSpro 2.0 in Pollastri et al. (2002), SSpro retrained (SSproR) and incremental improvements leading to Porter

 
We also tested Porter on the EVA (Rost and Eyrich, 2001) common2 set, as available in November 2004, containing 134 proteins. To ensure a fair comparison, we retrained Porter from scratch, after having excluded from its training set all sequences with >25% similarity to any sequence in common2. On this set, Porter achieves SOV = 72.0% and Q3 = 76.8%, better by at least 1.2 and 1.9%, respectively, than all the other servers evaluated.


    Acknowledgments
 
The work of G.P. is supported by an SFI BRG 2004 and a UCD President's Award 2004.

Received on October 9, 2004; revised on November 29, 2004; accepted on December 2, 2004

    REFERENCES

    Altschul, S., Madden, T., Schaffer, A. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Baldi, P., Brunak, S., Frasconi, P., Soda, G., Pollastri, G. (1999) Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15, 937–946.

    Bradley, P., Chivian, D., Meiler, J., Misura, K., Rohl, C., Schief, W., Wedemeyer, W., Schueler-Furman, O., Murphy, P., Schonbrun, J., Strauss, C., Baker, D. (2003) Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins, 53, 457–468.

    Jones, D. (1999a) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.

    Jones, D. (1999b) GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol., 287, 797–815.

    Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.

    Lesk, A., Lo Conte, L., Hubbard, T. (2001) Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, function and genetics. Proteins, Suppl. 5, 98–118.

    Petersen, T., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J., Brunak, S., Gippert, G., Lund, O. (2000) Prediction of protein secondary structure at 80% accuracy. Proteins, 41, 17–20.

    Pollastri, G., Przybylski, D., Rost, B., Baldi, P. (2002) Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47, 228–235.

    Rost, B. and Eyrich, V. (2001) EVA: large-scale analysis of secondary structure prediction. Proteins, Suppl. 5, 192–199.

    Rost, B. and Sander, C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584–599.

