Bioinformatics Advance Access originally published online on March 3, 2005
Bioinformatics 2005 21(10):2539-2540; doi:10.1093/bioinformatics/bti360
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

CoC: a database of universally conserved residues in protein folds

Jason E. Donald 1, Isaac A. Hubner 1, Veronica M. Rotemberg 1, Eugene I. Shakhnovich 1 and Leonid A. Mirny 2,*

1 Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA
2 Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, 16–343, Cambridge, MA 02139, USA

*To whom correspondence should be addressed.


    Abstract

Summary: The conservatism of conservatism (CoC) database presents statistically analyzed information about the conservation of residue positions in folds across protein families.

Availability: On the web at http://kulibin.mit.edu/coc/

Contact: leonid{at}mit.edu

Supplementary information: The website details the method and contains an FAQ and documentation, http://kulibin.mit.edu/coc/


    INTRODUCTION
The conservatism of conservatism (CoC) database presents the conservation of residue positions in folds across protein families. Residues with high CoC are universally conserved in every family of homologous proteins that adopt a particular fold. Such residues can differ among non-homologous proteins (analogs) that exhibit the same fold. We calculate and present the statistical significance of such conservation and highlight residues that are more conserved than expected given the residue's solvent accessibility. Such high CoC residues have been shown to be crucial for the kinetics and/or thermodynamics of protein folding, to be involved in folding nucleation, and to occur often in positions of functional importance such as ‘super-sites’ (Mirny and Shakhnovich, 1999; Mirny and Shakhnovich, 2001a, b).

The database contains 3081 proteins, which cover all known protein structures in the Protein Data Bank (PDB) (Berman et al., 2000) through representation by all members of the Families of Structurally Similar Proteins (FSSP) database (Holm and Sander, 1996). Convenient access is provided by a search function that accepts queries in the form of a PDB ID, Swiss-Prot name or FASTA amino acid sequence. A search produces a list of exact and close matches, each of which links to the information page of the corresponding protein. Results present residues of high CoC, their sequence conservation calculated using the HSSP alignment (Sander and Schneider, 1991), and the corresponding p-values. A user can obtain this information in text format, view graphs of the Z-score and p-value, and render an interactive PDB image that highlights residues meeting user-specified CoC and p-value cutoffs. A color-coded multiple alignment with marked CoC positions is also available to aid in visualizing the results. A compressed archive of all data files may be downloaded for further analysis.

The CoC database may be used to identify amino acids that are important for protein function and folding. It presents data sufficient for a stand-alone study and can suggest key functional and/or structural residues to be studied through experiment. In addition, CoC complements the analysis of experiment and simulation, and encourages understanding of protein structure in an evolutionary context.


    THE SERVER
Queries and results
From the search page, queries are made by PDB ID, Swiss-Prot name or FASTA amino acid sequence. A compressed directory containing the data for all proteins may also be downloaded. Results are listed as ‘exact matches’, which include all chains of a given PDB entry that match an FSSP file exactly, and as ‘related structures’, which include PDB entries identified through structural alignment. Clicking on any PDB ID in the search results takes the user to the results page of the selected protein. The results page consists of three sections, as described below.
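As an illustration of the three accepted query forms, the sketch below guesses which kind of query a string represents. The patterns and the function name `classify_query` are assumptions made for this example; the server's actual parsing rules are not described here.

```python
import re

# Hypothetical patterns, for illustration only.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")                   # e.g. 3CHY
SWISSPROT_NAME = re.compile(r"^[A-Z0-9]{1,5}_[A-Z0-9]{1,5}$")   # e.g. CHEY_ECOLI
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def classify_query(query: str) -> str:
    """Return 'pdb_id', 'swissprot_name', 'fasta' or 'unknown'."""
    text = query.strip()
    if text.startswith(">"):
        # A FASTA record begins with a '>' header line.
        return "fasta"
    if PDB_ID.match(text):
        return "pdb_id"
    if SWISSPROT_NAME.match(text.upper()):
        return "swissprot_name"
    # Treat a bare string of one-letter amino acid codes as a sequence.
    residues = "".join(text.split()).upper()
    if residues and set(residues) <= AMINO_ACIDS:
        return "fasta"
    return "unknown"
```

A four-character string starting with a digit is treated as a PDB ID before any other test, so a very short raw sequence could never be mistaken for one, since amino acid codes contain no digits.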

The first section gives the protein name, including domain and chain, along with a Raster3D (Merritt and Bacon, 1997) image of the protein structure. The number of sequences and structures used in the calculation is listed along with links to the tabulated data and a structural alignment of multiple representative sequences. The table of raw data lists the PDB file residue number, residue type, solvent accessibility, sequence entropy [S(l)], Z-score and p-value for both the 6-class and the 20-type amino acid alphabets. The multiple alignment link leads to a separate page where sequences of representative proteins from each analogous family are structurally aligned with the query sequence. FASTA query sequences are also aligned by BLAST and displayed with the structural alignments. Residues are colored by residue type (hydrophobic, polar, etc.). The background of individual positions is shaded by the sequence entropy obtained in the individual families. This representation makes the concept of CoC clear: positions that are conserved in each family (they appear as dark vertical stripes in the alignment) correspond to CoC residues, which are marked at the top of the alignment. Each sequence is labeled with its corresponding PDB ID, which links to the main page of the protein being studied. This offers an alternative presentation of the data and an intuitive way to picture both the conservation of a position across homologous structures and the identity of conserved positions.
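To make the sequence entropy column concrete, the following minimal sketch computes a Shannon entropy for one alignment column, using either the 20 amino acid types or six residue classes, and averages it over families. The use of the Shannon form with natural logarithms, the class membership in `SIX_CLASSES` and the function names are assumptions for illustration; how the database turns these quantities into CoC, Z-scores and p-values is not reproduced here.

```python
import math
from collections import Counter

# Assumed six-class grouping, for illustration only; the text names the
# classes but does not list their members.
SIX_CLASSES = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "basic": "KR",
    "acidic": "DE",
    "special": "GP",
}
RESIDUE_TO_CLASS = {
    aa: name for name, members in SIX_CLASSES.items() for aa in members
}


def column_entropy(column, use_classes=False):
    """Shannon entropy of one alignment column; gaps and unknowns are ignored."""
    symbols = [aa for aa in column.upper() if aa in RESIDUE_TO_CLASS]
    if use_classes:
        symbols = [RESIDUE_TO_CLASS[aa] for aa in symbols]
    if not symbols:
        return None  # nothing aligned at this position
    total = len(symbols)
    counts = Counter(symbols)
    return sum((n / total) * math.log(total / n) for n in counts.values())


def mean_family_entropy(columns_by_family, use_classes=False):
    """Average the per-family entropies of one structurally aligned position."""
    values = [column_entropy(col, use_classes) for col in columns_by_family]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

Under this sketch, a position holding only alanine and valine has non-zero entropy with 20 types but zero entropy with six classes, because both residues fall into the same assumed class. A position that is conserved within every family has a low mean family entropy even if the conserved residue differs between families.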

The middle section of the results page contains tables displaying the number of residues identified at various CoC/p-value cutoff pairs, for data calculated using both 6 classes of amino acids (hydrophobic, polar, acidic, basic, aromatic and special) and 20 amino acid types. When a cutoff pair is selected, the user is brought to a separate page with an interactive cartoon image of the protein, created using Jmol (Murray-Rust et al., 2004), with the selected CoC residues in spacefill representation. This page also links to the structural alignment display, where selected CoC residues are labelled by a star at the top of the alignment.

The third section of the results page presents plots of CoC and the corresponding p-value. Gaps in the plots correspond to significant gaps (>50% unaligned) in the structural alignments. Because sufficient statistics are lacking, meaningful CoC values cannot be calculated for these positions. Nevertheless, we show the conservation of these positions in the sequence alignment.
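The selection of residues by a CoC/p-value cutoff pair, and the exclusion of heavily gapped positions, can be sketched as follows. The `Position` record, its field names and the direction of the comparisons (higher CoC and lower p-value are taken as more significant) are assumptions for this example; only the 50% gap threshold comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

GAP_THRESHOLD = 0.5  # from the text: gaps are positions more than 50% unaligned


@dataclass
class Position:
    """Hypothetical record for one residue position."""
    residue_number: int
    residue_type: str
    unaligned_fraction: float
    coc: Optional[float] = None      # None where no value could be calculated
    p_value: Optional[float] = None


def is_gap(position: Position) -> bool:
    """True if too few structures align here for meaningful statistics."""
    return position.unaligned_fraction > GAP_THRESHOLD


def select_residues(positions: List[Position],
                    coc_cutoff: float,
                    p_cutoff: float) -> List[Position]:
    """Keep positions with CoC at or above, and p-value at or below, the cutoffs."""
    selected = []
    for pos in positions:
        if is_gap(pos) or pos.coc is None or pos.p_value is None:
            continue  # shown as a gap in the plots, never selected
        if pos.coc >= coc_cutoff and pos.p_value <= p_cutoff:
            selected.append(pos)
    return selected
```

Each cell in the tables of the middle section would then correspond to the length of the list returned for one cutoff pair.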

Implication for protein function, stability and kinetics
There has been detailed discussion of interpreting CoC in the context of function, stability and kinetics (1). Also, in a previous study of conservation in the protein folding nucleus (3), it was shown that residues in the nucleus are significantly more conserved than the rest of the protein. These and other implications have also been reviewed in the larger context of protein folding theory (Mirny and Shakhnovich, 2001b). One finds that when a position is highly conserved and the conservation can be explained by solvent accessibility, the CoC is most probably attributable to the thermodynamic importance of the position. When a highly conserved position corresponds to a disulfide bond, the reason is also probably thermodynamic. When most proteins of a given fold have an active or binding site in the same location on the structure, such a site is called a ‘super-site’ (e.g. Rossmann fold 3CHY and TIM barrel 2EBN). Residues of the super-site are conserved in proteins of the fold and, therefore, exhibit high CoC. When high CoC cannot be explained by any of the above rationales, the cause of conservation is most probably kinetic. These high CoC residues are responsible for fast folding to the native structure and correspond to the ‘folding nucleus’ (Fersht, 2000). Assuming that topology determines the folding mechanism (and nucleus) of a protein, one expects the nuclear residues to exhibit high CoC. The hypothesis of kinetic importance may also be cross-referenced with protein engineering experimental data where available (Lopez-Hernandez and Serrano, 1996).
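The reasoning in this paragraph can be written as a small rule-based sketch. The record type, field names and return labels are hypothetical, and the order in which the rules are tested is a choice made for this example, since the text does not rank them. The result is a probable explanation for follow-up, not a definitive assignment.

```python
from dataclasses import dataclass


@dataclass
class ConservedPosition:
    """Hypothetical annotations for one high-CoC position."""
    explained_by_solvent_accessibility: bool
    in_disulfide_bond: bool
    in_super_site: bool


def likely_cause(position: ConservedPosition) -> str:
    """Return the most probable reason for high CoC under the rules above."""
    if position.explained_by_solvent_accessibility:
        return "thermodynamic"
    if position.in_disulfide_bond:
        return "thermodynamic"
    if position.in_super_site:
        return "functional"
    # No other rationale applies: a candidate folding-nucleus residue.
    return "kinetic"
```

A position labelled "kinetic" by this procedure is the kind of candidate that could then be checked against protein engineering data where such data exist.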


    CONCLUSIONS
The CoC database not only describes the evolutionary conservation of positions in protein folds, but also serves as a source of predictions as to which residues may play an important role in protein folding, stability and function. These data cover all known protein folds and may be used both as testable predictions for experimental or bioinformatics studies and as an aid in the interpretation of experimental results. CoC is distinct from the simple analysis of conservation available elsewhere. It accounts for the probability that a position is conserved owing to solvent accessibility and measures the conservation of a position regardless of its identity across structural homologs. We believe that CoC brings a unique and useful perspective to the analysis of evolutionary data in proteins.


    Acknowledgments
 
We would like to thank Grigory Kolesov for assistance in the development of the multiple alignment display and Jmol/HTML integration, and Ivan Adzhubey for invaluable computer support. J.E.D. is supported by an NSF graduate research fellowship and I.A.H. is supported by an HHMI predoctoral fellowship. L.M. is an Alfred P. Sloan research fellow.

Received on December 12, 2004; revised on February 15, 2005; accepted on February 25, 2005

    REFERENCES

    Berman, H.M., et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

    Fersht, A.R. (2000) Transition-state structure as a unifying basis in protein-folding mechanisms: contact order, chain topology, stability, and the extended nucleus mechanism. Proc. Natl Acad. Sci. USA, 97, 1525–1529.

    Holm, L. and Sander, C. (1996) Mapping the protein universe. Science, 273, 595–603.

    Lopez-Hernandez, E. and Serrano, L. (1996) Structure of the transition state for folding of the 129 aa protein CheY resembles that of a smaller protein, CI-2. Fold. Des., 1, 43–55.

    Merritt, E.A. and Bacon, D.J. (1997) Raster3D: photorealistic molecular graphics. Meth. Enzymol., 277, 505–524.

    Mirny, L.A. and Shakhnovich, E.I. (1999) Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol., 291, 177–196.

    Mirny, L. and Shakhnovich, E. (2001a) Protein folding theory: from lattice to all-atom models. Annu. Rev. Biophys. Biomol. Struct., 30, 361–396.

    Mirny, L. and Shakhnovich, E. (2001b) Evolutionary conservation of the folding nucleus. J. Mol. Biol., 308, 123–129.

    Murray-Rust, P., et al. (2004) Chemical markup, XML, and the World Wide Web 5. Applications of chemical metadata in RSS aggregators. J. Chem. Inf. Comput. Sci., 44, 462–469.

    Sander, C. and Schneider, R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.

