Bioinformatics Advance Access originally published online on November 29, 2005
Bioinformatics 2006 22(3):356-358; doi:10.1093/bioinformatics/bti797
Paircoil2: improved prediction of coiled coils from sequence

1Mathematics Department, Computer Science and Artificial Intelligence Laboratory Cambridge, MA 02139, USA
2Department of Biology, Massachusetts Institute of Technology Cambridge, MA 02139, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: We introduce Paircoil2, a new version of the Paircoil program, which uses pairwise residue probabilities to detect coiled-coil motifs in protein sequence data. Paircoil2 achieves 98% sensitivity and 97% specificity on known coiled coils in leave-family-out cross-validation. It also shows superior performance compared with published methods in tests on proteins of known structure.
Availability: Paircoil2 is freely available as a web application and for download at http://paircoil2.csail.mit.edu
Contact: keating{at}mit.edu; bab{at}mit.edu
Supplementary information: Available at Bioinformatics online and at the Paircoil website.
| INTRODUCTION |
|---|
The alpha-helical coiled coil is a simple structural motif found at high frequency in proteins of all organisms. Many coiled coils mediate oligomerization or protein-protein interaction, and the motif is important to the structure and function of several classes of fibrous structural proteins, motor proteins, transcription factors and membrane fusion proteins. Prediction of coiled coils in proteins can be used to identify putative oligomerization domains, to postulate functional mechanisms and to map sequence onto structure at a high level of detail. Moreover, such predictions are necessary as a first step in understanding coiled-coil interactions (Newman and Keating, 2003; Fong et al., 2004). Thus, efficient and highly accurate methods for predicting coiled coils are important for annotating the data that result from genome sequencing projects.
Sequence-based methods for predicting coiled coils, such as COILS (Lupas et al., 1991), Paircoil (Berger et al., 1995), MultiCoil (Wolf et al., 1997) and MARCOIL (Delorenzi and Speed, 2002), have been quite successful. Since publication of the Paircoil program, the number of known coiled-coil sequences has increased dramatically. We have used these data to develop Paircoil2, an improved version of Paircoil, and find that it performs well in leave-family-out cross-validation and outperforms other common methods.
| Paircoil2 RETRAINING AND TESTING |
|---|
An initial coiled-coil database was constructed from sequences known to contain coiled coils, using information from structure and the literature. The coiled-coil regions were defined and annotated with the appropriate heptad register according to the following sources. The myosins, tropomyosins and paramyosins were annotated as described in Berger et al. (1995). Intermediate filament coiled coils were based on Strelkov et al. (2003). Viral coat proteins, laminins, fibrinogens, heat shock factors and flagellins were based on Wolf et al. (1997) and Singh et al. (1999), and the bZIPs were annotated according to Newman and Keating (2003), Vincentz et al. (2001) and Fassler et al. (2002). Dynein heavy chains were annotated according to Gee et al. (1997) and by inspection, and kinesin heavy chains after Thormählen et al. (1998), Morii et al. (1997) and inspection. In addition, a number of coiled-coil sequences that did not fit into these categories were detected and annotated by SOCKET (Walshaw and Woolfson, 2001) from the 2002 version of the Protein Data Bank (PDB) (Berman et al., 2000).
The coiled-coil database was generated from this initial database by adding homologous sequences from the NCBI NR database. PSI-BLAST (Altschul et al., 1997) was run for four iterations, using an E-value cutoff of 10^-10. The BLAST sequence alignments were used to define the coiled-coil regions and assign heptad registers, which were verified with Paircoil. For cases where the Paircoil-derived and alignment-derived registers disagreed, assignments were made manually. The database was filtered to 90% sequence identity with CD-HIT (Li et al., 2002). Seven residues were removed from each side of a skip in the heptad register or a gap in the alignment to avoid introducing non-coiled-coil residues into the database. To further eliminate possible non-coiled-coil residues, seven residues were removed on each side of proline residues, and from the beginning and end of each coiled-coil region in all sequences. All regions with at least 28 contiguous coiled-coil residues were included in the coiled-coil database of 1371 protein chains, containing 95 517 coiled-coil residues. Coiled-coil residue pair frequencies were calculated from this database as in Berger et al. (1995), and background frequencies were derived from the NCBI NR90 database of March 2005.
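The trimming rules above can be illustrated with a minimal Python sketch. It covers only the proline and region-end rules plus the 28-residue length filter; handling of register skips and alignment gaps is omitted. The function name `trim_region`, its parameters, and the choice to drop the proline itself along with its flanks are assumptions made for illustration, not details taken from the Paircoil2 source.

```python
def trim_region(region: str, flank: int = 7, min_len: int = 28) -> list[str]:
    """Return the contiguous segments of `region` that survive trimming.

    Hypothetical helper: `region` is one annotated coiled-coil region given
    as a string of one-letter amino acid codes.
    """
    n = len(region)
    keep = [True] * n

    # Drop `flank` residues at the beginning and end of the region.
    for i in range(min(flank, n)):
        keep[i] = False
        keep[n - 1 - i] = False

    # Drop each proline and `flank` residues on either side of it
    # (removing the proline itself is an assumption).
    for i, aa in enumerate(region):
        if aa == "P":
            for j in range(max(0, i - flank), min(n, i + flank + 1)):
                keep[j] = False

    # Collect surviving runs that have at least `min_len` residues.
    segments = []
    start = None
    for i in range(n + 1):
        if i < n and keep[i]:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                segments.append(region[start:i])
            start = None
    return segments
```

For example, a 42-residue region without prolines yields a single 28-residue segment, while a 41-residue region yields nothing because only 27 residues remain after the ends are trimmed.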
The database of non-coiled coils, PDB-minus, was derived from the PDB from February 28, 2005, filtered to 40% sequence identity using CD-HIT. Proteins in the coiled-coil database or detected by SOCKET were removed. PDB-minus consists of 6397 sequences comprising 1 486 055 residues. A list of the protein sequences making up these datasets is available as Supplementary Data.
The Paircoil2 algorithm is the same as that of Paircoil, with the incorporation of the new data. It was extended to allow window sizes of both 21 and 28 residues. The algorithm runs in linear time relative to the length of the input sequence. Confidence is reported as a P-score, which is a measure of the percentage of non-coiled-coil residues in PDB-minus that score better than a given Paircoil2 raw score. We find that the score distribution of PDB-minus is closely approximated by a Gaussian, and as such the P-score is calculated as the area under this curve to the right of the raw score.
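The P-score definition amounts to the upper-tail area of a fitted Gaussian. The following Python sketch shows that calculation under stated assumptions: the function names `fit_background` and `p_score` are hypothetical, and the mean and standard deviation would have to be estimated from PDB-minus raw scores, which are not given in the text.

```python
import math
import statistics


def fit_background(raw_scores):
    """Estimate Gaussian parameters (mean, standard deviation) from
    background raw scores. Hypothetical helper; population stdev is an
    assumption."""
    return statistics.fmean(raw_scores), statistics.pstdev(raw_scores)


def p_score(raw_score: float, mu: float, sigma: float) -> float:
    """Area under N(mu, sigma^2) to the right of `raw_score`."""
    z = (raw_score - mu) / (sigma * math.sqrt(2.0))
    # The upper-tail probability of a Gaussian equals 0.5 * erfc(z).
    return 0.5 * math.erfc(z)
```

A raw score equal to the background mean gives a P-score of 0.5, and higher raw scores give smaller P-scores, consistent with small P-scores indicating confident coiled-coil predictions.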
Paircoil2 performs extremely well in leave-family-out cross-validation on the coiled-coil database. For each cross-validation, sequences in one coiled-coil family were placed in the test set, along with half of the sequences in PDB-minus, selected randomly. The remaining coiled coils were used to train Paircoil2. The test set was then scored. Table 1 reports the sensitivity and specificity (footnote 1) at the P-score where the two values are closest for each family, using a window length of 28. Although P-scores and specificity both reflect false-positive rates, P-scores are defined using all of PDB-minus and specificity is evaluated during testing. Performance on the cross-validation test is very similar using the 21-length window (Supplementary Data).
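Selecting the P-score at which sensitivity and specificity are closest can be sketched as follows. This is an illustrative Python example, not the authors' evaluation code: the function names, the per-residue input format, and the convention that a residue is predicted coiled coil when its P-score is at or below the cutoff are assumptions.

```python
def sens_spec(p_scores, labels, cutoff):
    """Residue-level sensitivity (TP/P) and specificity (TN/N).

    `p_scores` holds one P-score per residue; `labels` holds True for
    annotated coiled-coil residues. A residue is called positive when its
    P-score is <= cutoff (assumed convention).
    """
    tp = sum(1 for p, y in zip(p_scores, labels) if y and p <= cutoff)
    tn = sum(1 for p, y in zip(p_scores, labels) if not y and p > cutoff)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    return tp / pos, tn / neg


def balanced_cutoff(p_scores, labels, cutoffs):
    """Return the candidate cutoff at which sensitivity and specificity
    are closest to each other."""
    def gap(cutoff):
        sens, spec = sens_spec(p_scores, labels, cutoff)
        return abs(sens - spec)

    return min(cutoffs, key=gap)
```

Both functions assume the test set contains at least one positive and one negative residue; otherwise the ratios are undefined.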
| COMPARISON WITH OTHER SEQUENCE-BASED COILED-COIL PREDICTION PROGRAMS |
|---|
We compared the performance of the programs Paircoil2, Paircoil, COILS, MultiCoil and MARCOIL on individual coiled-coil families in the coiled-coil dataset used for training. Previous comparative studies have suggested that MARCOIL and Paircoil both show superior performance to COILS (Berger et al., 1995; Delorenzi and Speed, 2002). All programs were run with the recommended default settings: COILS version 2.2 with a window size of 28 and the MTIDK table, using the recommended method to filter highly hydrophobic sequences, and MARCOIL with the MTK table as MARCOIL-H. We note that all the families in the Paircoil2 database are also included in the MARCOIL database, with the exception of the heat-shock factors, flagellins and fibrinogens. However, there are many families included in the new training set that were not included in training of the older programs Paircoil, MultiCoil or COILS. We were unable to perform leave-family-out cross-validation for the programs other than Paircoil2. Therefore, we compare the final versions of all programs in Table 1.
Averaged over all families, Paircoil2 in cross-validation, MARCOIL and MultiCoil perform very similarly on this test. The residue-level sensitivity and specificity that we report for MARCOIL are higher than previously cited, probably owing to differences in the annotated coiled-coil regions used for testing, and in the relative sizes of the datasets used by this work and by Delorenzi and Speed (2002). Paircoil2 performs better than all other programs overall. There are significant differences between families. Notably, all programs perform very well on the canonical coiled coils in the myosin, tropomyosin, intermediate-filament and kinesin families, on which all of the programs were trained. The SNAREs, flagellins and viral envelope proteins are particularly hard to recognize without training on these families. This is easy to understand, as coiled coils in these families have some unusual properties: SNAREs are parallel tetramers, which are rare in the training set, and the coiled-coil regions in both the flagellins and the viral envelope proteins do not form isolated coiled coils, but rather are part of higher order helical assemblies (Yonekura et al., 2003; Chan et al., 1997).
We also compared the performance of the different programs on a structure-based test set. NEW-PDB was derived from the October 2004 release of the PDB select90 (Hobohm and Sander, 1994). Proteins without quaternary structure listed on the EBI quaternary structure server (Henrick and Thornton, 1998) were removed and SOCKET was used to detect coiled-coil regions. Two coiled-coil datasets were generated (Supplementary Data): NEW-PDB21 comprises SOCKET hits at least 21 residues in length (216 sequences, 6288 residues) and NEW-PDB28 comprises hits of at least 28 residues (85 sequences, 3261 residues). Sequences with >90% sequence identity to a sequence in the Paircoil2 coiled-coil training set were excluded from NEW-PDB.
Paircoil, MultiCoil, MARCOIL and COILS were run using score thresholds ranging from 0.1 to 0.999. Paircoil2 was run with cutoffs from 0.001 (most stringent) to 0.10. Coiled-coil predictions that overlapped at least seven residues with the SOCKET annotation were counted as correct for the overlapping residues, and predictions overlapping less were considered incorrect. The results of this test using NEW-PDB28 are shown in Figure 1. At all false-positive rates (footnote 2) examined, Paircoil2 outperforms all other methods, with sensitivity exceeding 90% at a low false-positive rate. On these data, a P-score cutoff of 0.03 corresponds to sensitivity 0.730 and specificity 0.998. There is virtually no difference in the performance of Paircoil2 between using a 28-residue or 21-residue window. The lower sensitivity in this test compared with cross-validation is likely due to the SOCKET set containing a wide variety of short, non-canonical coiled-coil proteins, which are more difficult to predict.
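The seven-residue overlap rule used for scoring against SOCKET annotations can be sketched in Python as below. The function name `score_prediction`, the half-open interval representation, and the treatment of predicted residues outside the overlap as false positives are assumptions for illustration; the text specifies only the overlap threshold itself.

```python
def score_prediction(pred, annotations, min_overlap=7):
    """Return (true_positive, false_positive) residue counts for one
    predicted coiled-coil segment.

    `pred` and each entry of `annotations` are half-open (start, end)
    residue index ranges. Annotated segments are assumed not to overlap
    one another.
    """
    p_start, p_end = pred
    tp = 0
    for a_start, a_end in annotations:
        overlap = min(p_end, a_end) - max(p_start, a_start)
        # Only overlaps of at least `min_overlap` residues count as
        # correct, and only for the overlapping residues.
        if overlap >= min_overlap:
            tp += overlap
    # Remaining predicted residues are treated as false positives
    # (an assumption; shorter overlaps make the whole prediction incorrect).
    return tp, (p_end - p_start) - tp
```

A prediction covering residues 10 to 39 against an annotation covering 20 to 59 shares 20 residues and is credited for those; a prediction sharing only 5 residues with the annotation receives no credit.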
We conclude that the size and diversity of the training set are critical for achieving good performance at recognizing coiled coils generally. Methods trained primarily on long, dimeric coiled coils, such as Paircoil and COILS, recognize similar types of structures but do not pick up the wide variety of sequences observed to adopt the coiled-coil fold in the PDB. Paircoil2, which has a large and diverse training set, is the tool of choice for this task.
| IMPLEMENTATION |
|---|
The Paircoil2 program is written in C and runs on Linux and Mac OS X. The web server is written in Perl. Documentation is available on the web server.
| Acknowledgments |
|---|
We thank Peter S. Kim for being an inspiration for this work and K. Gutwin for contributions to PDB-minus. A.K. acknowledges NSF CAREER award MCB-0347203, NIH award GM67681 and the CSBi high-performance computing technology platform. B.B. acknowledges NSF ITR award 6897150.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Present address: Center for Computational and Systems Biology, Institute of Biophysics, Chinese Academy of Science, Beijing 100101, China
Associate Editor: Alfonso Valencia
1 Sensitivity is defined as TP/P, where TP is the number of correctly predicted positive residues and P the total number of positive residues. Specificity is defined as TN/N, where TN is the number of correctly predicted negative residues and N the total number of negative residues.
2 The false-positive rate is defined as 1 - specificity.
Received on June 26, 2005; revised on November 7, 2005; accepted on November 20, 2005
| REFERENCES |
|---|
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Berger, B., et al. (1995) Predicting coiled coils by use of pairwise residue correlations. Proc. Natl Acad. Sci. USA, 92, 8259-8263.
Berman, H.M., et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242.
Chan, D.C., et al. (1997) Core structure of gp41 from the HIV envelope glycoprotein. Cell, 89, 263-273.
Delorenzi, M. and Speed, T. (2002) An HMM model for coiled-coil domains and a comparison with PSSM-based predictions. Bioinformatics, 18, 617-625.
Fassler, J.H., et al. (2002) B-ZIP proteins encoded by the Drosophila genome: evaluation of potential dimerization partners. Genome Res., 12, 1190-1200.
Fong, J.H., et al. (2004) Predicting specificity in bZIP coiled-coil protein interactions. Genome Biol., 5, R11.
Gee, M.A., et al. (1997) An extended microtubule-binding structure within the dynein motor domain. Nature, 390, 636-639.
Henrick, K. and Thornton, J.M. (1998) PQS: a protein quaternary structure file server. Trends Biochem. Sci., 23, 358-361.
Hobohm, U. and Sander, C. (1994) Enlarged representative set of protein structures. Protein Sci., 3, 522-524.
Li, W., et al. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18, 77-82.
Lupas, A., et al. (1991) Predicting coiled coils from protein sequences. Science, 252, 1162-1164.
Morii, H., et al. (1997) Identification of kinesin neck region as a stable alpha-helical coiled coil and its thermodynamic characterization. Biochemistry, 36, 1933-1942.
Newman, J.R. and Keating, A.E. (2003) Comprehensive identification of human bZIP interactions with coiled-coil arrays. Science, 300, 2097-2101.
Singh, M., et al. (1999) LearnCoil-VMF: computational evidence for coiled-coil-like motifs in many viral membrane-fusion proteins. J. Mol. Biol., 290, 1031-1041.
Strelkov, S.V., et al. (2003) Molecular architecture of intermediate filaments. BioEssays, 25, 243-251.
Thormählen, M., et al. (1998) The coiled-coil helix in the neck of kinesin. J. Struct. Biol., 122, 30-41.
Vincentz, M., et al. (2001) Phylogenetic relationships between Arabidopsis and sugarcane bZIP transcriptional regulatory factors. Genet. Mol. Biol., 24, 55-60.
Walshaw, J. and Woolfson, D.N. (2001) Socket: a program for identifying and analysing coiled-coil motifs within protein structures. J. Mol. Biol., 307, 1427-1450.
Wolf, E., et al. (1997) MultiCoil: a program for predicting two- and three-stranded coiled coils. Protein Sci., 6, 1179-1189.
Yonekura, K., et al. (2003) Complete atomic model of the bacterial flagellar filament by electron cryomicroscopy. Nature, 424, 643-650.






