Bioinformatics Advance Access originally published online on November 29, 2005
Bioinformatics 2006 22(3):356-358; doi:10.1093/bioinformatics/bti797
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Paircoil2: improved prediction of coiled coils from sequence

A. V. McDonnell (1), T. Jiang (2,†), A. E. Keating (2,*) and B. Berger (1,*)

(1) Mathematics Department, Computer Science and Artificial Intelligence Laboratory Cambridge, MA 02139, USA
(2) Department of Biology, Massachusetts Institute of Technology Cambridge, MA 02139, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: We introduce Paircoil2, a new version of the Paircoil program, which uses pairwise residue probabilities to detect coiled–coil motifs in protein sequence data. Paircoil2 achieves 98% sensitivity and 97% specificity on known coiled coils in leave-family-out cross-validation. It also shows superior performance compared with published methods in tests on proteins of known structure.

Availability: Paircoil2 is freely available as a web application and for download at http://paircoil2.csail.mit.edu

Contact: keating@mit.edu; bab@mit.edu

Supplementary information: Available at Bioinformatics online and at the Paircoil website.


    INTRODUCTION
 
The alpha-helical coiled coil is a simple structural motif found at high frequency in proteins of all organisms. Many coiled coils mediate oligomerization or protein–protein interaction, and the motif is important to the structure and function of several classes of fibrous structural proteins, motor proteins, transcription factors and membrane fusion proteins. Prediction of coiled coils in proteins can be used to identify putative oligomerization domains, to postulate functional mechanisms and to map sequence onto structure at a high level of detail. Moreover, such predictions are necessary as a first step in understanding coiled–coil interactions (Newman and Keating, 2003; Fong et al., 2004). Thus, efficient and highly accurate methods for predicting coiled coils are important for annotating the data that result from genome sequencing projects.

Sequence-based methods for predicting coiled coils, such as COILS (Lupas et al., 1991), Paircoil (Berger et al., 1995), MultiCoil (Wolf et al., 1997) and MARCOIL (Delorenzi and Speed, 2002), have been quite successful. Since publication of the Paircoil program, the number of known coiled–coil sequences has increased dramatically. We have used these data to develop Paircoil2, an improved version of Paircoil, and find that it performs well in leave-family-out cross-validation and outperforms other common methods.


    Paircoil2 RETRAINING AND TESTING
 
An initial coiled–coil database was constructed from sequences known to contain coiled coils, using information from structure and the literature. The coiled–coil regions were defined and annotated with the appropriate heptad register according to the following sources. The myosins, tropomyosins and paramyosins were annotated as described in Berger et al. (1995). Intermediate filament coiled coils were based on Strelkov et al. (2003). Viral coat proteins, laminins, fibrinogens, heat shock factors and flagellins were based on Wolf et al. (1997) and Singh et al. (1999) and the bZIPs according to Newman and Keating (2003), Vincentz et al. (2001) and Fassler et al. (2002). Dynein heavy chains were annotated according to Gee et al. (1997) and by inspection, and kinesin heavy chains after Thormählen et al. (1998), Morii et al. (1997) and inspection. In addition, a number of coiled–coil sequences that did not fit into these categories were detected and annotated by SOCKET (Walshaw and Woolfson, 2001) from the 2002 version of the Protein Data Bank (PDB) (Berman et al., 2000).

The coiled–coil database was generated from this initial database by adding homologous sequences from the NCBI NR database. PSI-BLAST (Altschul et al., 1997) was run for four iterations, using an E-value cutoff of 10^-10. The BLAST sequence alignments were used to define the coiled–coil regions and assign heptad registers, which were verified with Paircoil. For cases where the Paircoil-derived and alignment-derived register disagreed, assignments were made manually. The database was filtered to 90% sequence identity with CD-HIT (Li et al., 2002). Seven residues were removed from each side of a skip in the heptad register or a gap in the alignment to avoid introducing non-coiled–coil residues into the database. To further eliminate possible non-coiled–coil residues, seven residues were removed on each side of proline residues, and from the beginning and end of each coiled–coil region in all sequences. All regions with at least 28 contiguous coiled–coil residues were included in the coiled–coil database of 1371 protein chains, containing 95 517 coiled–coil residues. Coiled–coil residue pair frequencies were calculated from this database as in Berger et al. (1995), and background frequencies were derived from the NCBI NR90 database of March 2005.
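The trimming and length filter described above can be illustrated with a short sketch. This is not the authors' code: the function name `trim_regions`, the half-open `(start, end)` representation of a region, and the treatment of prolines, register skips and alignment gaps as a single list of residue indices are all assumptions made for illustration. The sketch also assumes that the breakpoint residue itself is removed along with the seven residues on each side.

```python
def trim_regions(start, end, breakpoints, margin=7, min_len=28):
    """Return the sub-intervals of [start, end) that survive trimming.

    start, end   -- half-open residue interval of one annotated region
    breakpoints  -- residue indices of prolines, register skips or gaps
                    (hypothetical representation, not from the paper)
    """
    # Remove `margin` residues from the beginning and end of the region.
    lo, hi = start + margin, end - margin
    kept = []
    cursor = lo
    for b in sorted(breakpoints):
        # Remove the breakpoint residue plus `margin` residues on each side.
        cut_lo, cut_hi = b - margin, b + margin + 1
        if cut_hi <= cursor or cut_lo >= hi:
            continue  # this cut does not touch the remaining stretch
        if cut_lo > cursor:
            kept.append((cursor, cut_lo))
        cursor = max(cursor, cut_hi)
    if cursor < hi:
        kept.append((cursor, hi))
    # Keep only stretches of at least `min_len` contiguous residues.
    return [(a, b) for a, b in kept if b - a >= min_len]
```

For example, under these assumptions a 100-residue region with one proline at index 50 yields two surviving stretches, while a 60-residue region with a proline at index 30 yields none, because neither flank reaches 28 residues after trimming.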

The database of non-coiled coils, PDB-minus, was derived from the PDB from February 28, 2005, filtered to 40% sequence identity using CD-HIT. Proteins in the coiled–coil database or detected by SOCKET were removed. PDB-minus consists of 6397 sequences that comprise 1 486 055 residues. A list of the protein sequences making up these datasets is available as Supplementary Data.

The Paircoil2 algorithm is the same as that of Paircoil, with the incorporation of the new data. It was extended to allow window sizes of both 21 and 28 residues. The algorithm runs in linear time relative to the length of the input sequence. Confidence is reported as a P-score, which is a measure of the percentage of non-coiled–coil residues in PDB-minus that score better than a given Paircoil2 raw score. We find that the score distribution of PDB-minus is closely approximated by a Gaussian, and as such the P-score is calculated to be the area below this curve and to the right of the raw score.
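As a minimal sketch of the P-score calculation described above, the area under a Gaussian to the right of a raw score is its upper-tail probability, which can be computed with the complementary error function. The function name `p_score` and the parameters `mean` and `std` (the fitted Gaussian parameters for PDB-minus, whose values are not given here) are assumptions for illustration.

```python
import math

def p_score(raw_score, mean, std):
    """Upper-tail area of a Gaussian N(mean, std^2) to the right of raw_score.

    `mean` and `std` stand in for the parameters fitted to the PDB-minus
    score distribution; they are placeholders, not published values.
    """
    z = (raw_score - mean) / std
    # P(X > raw_score) = 0.5 * erfc(z / sqrt(2)) for a normal variable X.
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A higher raw score gives a smaller P-score, so stringent cutoffs correspond to small P-score values.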

Paircoil2 performs extremely well in leave-family-out cross-validation on the coiled–coil database. For each cross-validation, sequences in one coiled–coil family were placed in the test set, along with half of the sequences in PDB-minus, selected randomly. The remaining coiled coils were used to train Paircoil2. The test set was then scored. Table 1 reports the sensitivity and specificity (see footnote 1) at the P-score where the two values are closest for each family, using a window length of 28. Although P-scores and specificity both reflect false-positive rates, P-scores are defined using all of PDB-minus and specificity is evaluated during testing. Performance on the cross-validation test is very similar using the 21-length window (Supplementary Data).
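The choice of operating point, the P-score at which sensitivity and specificity are closest, can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used for Table 1: the function names, the per-residue lists `p_scores` and `labels`, and the explicit grid of candidate `cutoffs` are hypothetical, and the test set is assumed to contain both coiled–coil and non-coiled–coil residues.

```python
def sens_spec(p_scores, labels, cutoff):
    """Sensitivity (TP/P) and specificity (TN/N) at a P-score cutoff.

    A residue is predicted to be coiled coil when its P-score <= cutoff.
    `labels` holds True for annotated coiled-coil residues.
    """
    pairs = list(zip(p_scores, labels))
    tp = sum(1 for p, y in pairs if y and p <= cutoff)
    tn = sum(1 for p, y in pairs if not y and p > cutoff)
    pos = sum(1 for _, y in pairs if y)
    neg = len(pairs) - pos
    return tp / pos, tn / neg

def balanced_cutoff(p_scores, labels, cutoffs):
    """Return the candidate cutoff where sensitivity and specificity are closest."""
    def gap(cutoff):
        sens, spec = sens_spec(p_scores, labels, cutoff)
        return abs(sens - spec)
    return min(cutoffs, key=gap)
```

In leave-family-out testing, these functions would be applied once per held-out family, with the scores coming from a model trained on the remaining families.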


Table 1 By-family sensitivity and specificity for various coiled–coil predictors

 

    COMPARISON WITH OTHER SEQUENCE-BASED COILED–COIL PREDICTION PROGRAMS
 
We compared the performance of the programs Paircoil2, Paircoil, COILS, MultiCoil and MARCOIL on individual coiled–coil families in the coiled–coil dataset used for training. Previous comparative studies have suggested that MARCOIL and Paircoil both show superior performance to COILS (Berger et al., 1995; Delorenzi and Speed, 2002). All programs were run with the recommended default settings: COILS version 2.2 with a window size of 28 and the MTIDK table, using the recommended method to filter highly hydrophobic sequences, and MARCOIL with the MTK table as MARCOIL-H. We note that all the families in the Paircoil2 database are also included in the MARCOIL database, with the exception of the heat-shock factors, flagellins and fibrinogens. However, there are many families included in the new training set that were not included in training of the older programs Paircoil, MultiCoil or COILS. We were unable to perform leave-family-out cross-validation for the programs other than Paircoil2. Therefore, we compare the final versions of all programs in Table 1.

Averaged over all families, Paircoil2 in cross-validation, MARCOIL and MultiCoil perform very similarly on this test. The residue-level sensitivity and specificity that we report for MARCOIL are higher than previously cited, probably owing to differences in the annotated coiled–coil regions used for testing, and in the relative sizes of the datasets used by this work and by Delorenzi and Speed (2002). Paircoil2 performs better than all other programs overall. There are significant differences between families. Notably, all programs perform very well on the canonical coiled coils in the myosin, tropomyosin, intermediate-filament and kinesin families, on which all of the programs were trained. The SNAREs, flagellins and viral envelope proteins are particularly hard to recognize without training on these families. This is easy to understand, as coiled coils in these families have some unusual properties: SNAREs are parallel tetramers, which are rare in the training set, and the coiled–coil regions in both the flagellins and the viral envelope proteins do not form isolated coiled coils, but rather are part of higher order helical assemblies (Yonekura et al., 2003; Chan et al., 1997).

We also compared the performance of the different programs on a structure-based test set. NEW-PDB was derived from the October 2004 release of the PDB select90 (Hobohm and Sander, 1994). Proteins without quaternary structure listed on the EBI quaternary structure server (Henrick and Thornton, 1998) were removed and SOCKET was used to detect coiled–coil regions. Two coiled–coil datasets were generated (Supplementary Data): NEW-PDB21 comprises SOCKET hits at least 21 residues in length (216 sequences, 6288 residues) and NEW-PDB28 comprises hits of at least 28 residues (85 sequences, 3261 residues). Sequences with >90% sequence identity to a sequence in the Paircoil2 coiled–coil training set were excluded from NEW-PDB.

Paircoil, MultiCoil, MARCOIL and COILS were run using score thresholds ranging from 0.1 to 0.999. Paircoil2 was run with cutoffs from 0.001 (most stringent) to 0.10. Coiled–coil predictions that overlapped at least seven residues with the SOCKET annotation were counted as correct for the overlapping residues, and predictions overlapping less were considered incorrect. The results of this test using NEW-PDB28 are shown in Figure 1. At all false-positive rates (see footnote 2) examined, Paircoil2 outperforms all other methods, with sensitivity exceeding 90% at a low false-positive rate. On these data, a P-score cutoff of 0.03 corresponds to sensitivity 0.730 and specificity 0.998. There is virtually no difference in the performance of Paircoil2 between using a 28-residue or 21-residue window. The lower sensitivity in this test compared with cross-validation is likely due to the SOCKET set containing a wide variety of short, non-canonical coiled–coil proteins, which are more difficult to predict.
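One way to read the seven-residue overlap rule is sketched below. The function name `score_prediction` and the half-open `(start, end)` interval representation are assumptions, as is the decision to count the non-overlapping part of a credited prediction as incorrect; the text above does not say how those residues were treated. Each list of intervals is assumed to be internally non-overlapping.

```python
def score_prediction(predicted, annotated, min_overlap=7):
    """Return (correct, incorrect) residue counts for predicted segments.

    predicted, annotated -- lists of half-open (start, end) intervals.
    A predicted segment earns credit only for residues shared with an
    annotated segment, and only if at least `min_overlap` are shared.
    """
    correct = incorrect = 0
    for p_start, p_end in predicted:
        credited = 0
        for a_start, a_end in annotated:
            overlap = min(p_end, a_end) - max(p_start, a_start)
            if overlap >= min_overlap:
                credited += overlap
        correct += credited
        # Assumption: predicted residues without credit count as incorrect.
        incorrect += (p_end - p_start) - credited
    return correct, incorrect
```

Under this reading, a prediction sharing only five residues with an annotated coiled coil contributes nothing to the correct count, while one sharing exactly seven is credited with those seven residues.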


Fig. 1 Sensitivity versus false positive rate for various coiled–coil predictors on the NEW-PDB28 dataset. Sensitivity is defined as the number of correctly predicted positive residues divided by all positive residues, and the false-positive rate is the number of false-positive predictions over all negative residues in PDB-minus. Some curves are shorter than others because the programs have different sensitivities at the lowest possible score cutoffs.

 
We conclude that the size and diversity of the training set are critical for achieving good performance at recognizing coiled coils generally. Methods trained primarily on long, dimeric coiled coils, such as Paircoil and COILS, recognize similar types of structures but do not pick up the wide variety of sequences observed to adopt the coiled–coil fold in the PDB. Paircoil2, which has a large and diverse training set, is the tool of choice for this task.


    IMPLEMENTATION
 
The Paircoil2 program is written in C and runs on Linux and Mac OS X. The web server is written in Perl. Documentation is available on the web server.


    Acknowledgments
 
We thank Peter S. Kim for being an inspiration for this work and K. Gutwin for contributions to PDB-minus. A.K. acknowledges NSF CAREER award MCB-0347203, NIH award GM67681 and the CSBi high-performance computing technology platform. B.B. acknowledges NSF ITR award 6897150.

Conflict of Interest: none declared.


    FOOTNOTES
 
†Present address: Center for Computational and Systems Biology, Institute of Biophysics, Chinese Academy of Science, Beijing 100101, China

Associate Editor: Alfonso Valencia

1. Sensitivity is defined as TP/P, where TP is the number of correctly predicted positive residues and P the total number of positive residues. Specificity is defined as TN/N, where TN is the number of correctly predicted negative residues and N the total number of negative residues.

2. The false-positive rate is defined as 1 – specificity.

Received on June 26, 2005; revised on November 7, 2005; accepted on November 20, 2005

    REFERENCES
 

    Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Berger, B., et al. (1995) Predicting coiled coils by use of pairwise residue correlations. Proc. Natl Acad. Sci. USA, 92, 8259–8263.

    Berman, H.M., et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

    Chan, D.C., et al. (1997) Core structure of gp41 from the HIV envelope glycoprotein. Cell, 89, 263–273.

    Delorenzi, M. and Speed, T. (2002) An HMM model for coiled–coil domains and a comparison with PSSM-based predictions. Bioinformatics, 18, 617–625.

    Fassler, J.H., et al. (2002) B-ZIP proteins encoded by the Drosophila genome: evaluation of potential dimerization partners. Genome Res., 12, 1190–1200.

    Fong, J.H., et al. (2004) Predicting specificity in bZIP coiled–coil protein interactions. Genome Biol., 5, R11.

    Gee, M.A., et al. (1997) An extended microtubule-binding structure within the dynein motor domain. Nature, 390, 636–639.

    Henrick, K. and Thornton, J.M. (1998) PQS: a protein quaternary structure file server. Trends Biochem. Sci., 23, 358–361.

    Hobohm, U. and Sander, C. (1994) Enlarged representative set of protein structures. Protein Sci., 3, 522–524.

    Li, W., et al. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18, 77–82.

    Lupas, A., et al. (1991) Predicting coiled coils from protein sequences. Science, 252, 1162–1164.

    Morii, H., et al. (1997) Identification of kinesin neck region as a stable alpha-helical coiled coil and its thermodynamic characterization. Biochemistry, 36, 1933–1942.

    Newman, J.R. and Keating, A.E. (2003) Comprehensive identification of human bZIP interactions with coiled–coil arrays. Science, 300, 2097–2101.

    Singh, M., et al. (1999) LearnCoil-VMF: computational evidence for coiled–coil-like motifs in many viral membrane-fusion proteins. J. Mol. Biol., 290, 1031–1041.

    Strelkov, S.V., et al. (2003) Molecular architecture of intermediate filaments. BioEssays, 25, 243–251.

    Thormählen, M., et al. (1998) The coiled–coil helix in the neck of kinesin. J. Struct. Biol., 122, 30–41.

    Vincentz, M., et al. (2001) Phylogenetic relationships between Arabidopsis and sugarcane bZIP transcriptional regulatory factors. Genet. Mol. Biol., 24, 55–60.

    Walshaw, J. and Woolfson, D.N. (2001) Socket: a program for identifying and analysing coiled–coil motifs within protein structures. J. Mol. Biol., 307, 1427–1450.

    Wolf, E., et al. (1997) MultiCoil: a program for predicting two- and three-stranded coiled coils. Protein Sci., 6, 1179–1189.

    Yonekura, K., et al. (2003) Complete atomic model of the bacterial flagellar filament by electron cryomicroscopy. Nature, 424, 643–650.



