Skip Navigation


Bioinformatics Advance Access originally published online on December 1, 2007
Bioinformatics 2008 24(2):290-292; doi:10.1093/bioinformatics/btm584
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/2/290    most recent
btm584v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Google Scholar
Right arrow Articles by Yip, K. Y.
Right arrow Articles by Gerstein, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yip, K. Y.
Right arrow Articles by Gerstein, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

An integrated system for studying residue coevolution in proteins

Kevin Y. Yip 1,{dagger}, Prianka Patel 2,{dagger}, Philip M. Kim 2, Donald M. Engelman 2, Drew McDermott 1 and Mark Gerstein 1,2,3,*

1Department of Computer Science, Yale University, 51 Prospect Street, New Haven, CT 06511, USA, 2Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA and 3Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 

Residue coevolution has recently emerged as an important concept, especially in the context of protein structures. While a multitude of different functions for quantifying it have been proposed, not much is known about their relative strengths and weaknesses. Also, subtle algorithmic details have discouraged implementing and comparing them. We addressed this issue by developing an integrated online system that enables comparative analyses with a comprehensive set of commonly used scoring functions, including Statistical Coupling Analysis (SCA), Explicit Likelihood of Subset Variation (ELSC), mutual information and correlation-based methods. A set of data preprocessing options are provided for improving the sensitivity and specificity of coevolution signal detection, including sequence weighting, residue grouping and the filtering of sequences, sites and site pairs. A total of more than 100 scoring variations are available. The system also provides facilities for studying the relationship between coevolution scores and inter-residue distances from a crystal structure if provided, which may help in understanding protein structures.

Availability: The system is available at http://coevolution.gersteinlab.org. The source code and JavaDoc API can also be downloaded from the web site.

Contact: mark.gerstein{at}yale.edu

Supplementary information: Additional materials can be found at http://coevolution.gersteinlab.org/coevolution/supp.jsp


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 
Coevolution (covariation/correlated mutation) is the change of a biological object triggered by the change of a related object. For example, the coding genes of some interacting proteins are preserved or eliminated together in new species (Pellegrini et al., 1999), or have similar phylogenetic trees (Goh et al., 2000). At the amino acid level, some residues under physical or functional constraints exhibit correlated mutations (Gloor et al., 2005; Socolich et al., 2005; Suel et al., 2003). Coevolving residues in a protein are detected in a two-step process: (1) the multiple sequence alignment (MSA) of the protein and its homologs is constructed or obtained; (2) a coevolution score is calculated for each pair of sites in the MSA. There are two main difficulties in this process. First, a large number of scoring functions have been proposed in the literature (Halperin et al., 2006). It can be difficult to choose from them, as they exhibit subtle yet significant differences, and it is likely that different applications would require different functions. Second, coevolution analyses could be confounded by uneven sequence representations, insufficient evolutionary divergence and the presence of gaps in the MSA. A successful coevolution study has to take all these details into account.

To address this need, we have developed an integrated system that provides a simple interface for preprocessing data, computing coevolution scores and analyzing the results. It offers a great variety of scoring variations (over 100) for studying different types of proteins and testing different hypotheses. The workflow of the system is shown in Figure 1. More details on the scoring functions, preprocessing options and result analysis are provided below.


Figure 1
View larger version (39K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The workflow of the system (a larger version can be found at the supplementary web site).

 

    2 SCORING FUNCTIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 
2.1 Correlation-based functions
For a pair of sites i and j in an MSA, the correlation score (Gobel et al., 1994; Halperin et al., 2006) is computed as follows:


Formula

where sikl is the score for substituting the i-th residue of sequence k by that of sequence l, Formula and {sigma}i are the mean and SD of substitution scores at site i, N is the number of sequences in the MSA and wkl is the weight for the sequence pair k, l. If the two sites are coevolving in that radical substitutions at the first site are accompanied by radical substitutions at the second site, the correlation will be high. Our system provides the classical McLachlan matrix (McLachlan, 1971) that scores substitutions based on the physiochemical properties of the residues, as well as matrices based on residue volume, pI and hydropathy index, for studying the properties individually. Two variations are provided for each of them: the ‘absolute value version’ considers only the magnitude, while the ‘raw version’ also considers the direction of change, for detecting compensatory mutations. The correlation can be computed from raw values (Pearson correlation) or from value ranks (Spearman correlation, Pazos et al., 1997). Several schemes are provided for the weights wkl, preventing false coevolution signals due to uneven sequence representation or site conservation.

2.2 Perturbation-based functions
The idea of perturbation-based functions is to perform a ‘perturbation’ at a first site, and observe its effect on a second site. The Statistical Coupling Analysis (SCA) method (Lockless and Ranganathan, 1999) defines a statistical energy term for a site, and computes the energy change at a second site when the first site is perturbed by retaining only the sequences with a certain residue.1 The Explicit Likelihood of Subset Variation (ELSC) method (Dekker et al., 2004) is based on the same idea, but has the energy computations replaced by probabilities according to hypergeometric distributions. The mutual information (MI) method (Gloor et al., 2005) can be viewed as a generalized perturbation method that considers the subsetting of all twenty kinds of residues, and combines them by a weighted average according to their frequencies. To deal with finite sample size effects and phylogenetic influence, the normalization options in Martin et al. (2005) are also provided.

2.3 Independence tests
The chi-square test (c.f. the OMES method, Larson et al., 2000) and the quartets method (Galitsky, 2003) both identify site pairs that are unlikely to be independent. The former computes the P-value under the null hypothesis of independent sites. The latter counts the number of quartets in the 2D histogram of residue frequencies that deviate considerably from the expectation.


    3 PREPROCESSING OPTIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 
To improve the sensitivity and specificity of the functions, options are provided for preprocessing sequences, sites and site pairs.

3.1 Sequence filtering and weighting
Sequences that contain too many gapped positions or are too similar to others in the MSA (that might cause sites to appear coevolving) can be removed by specifying the gap and similarity thresholds respectively. A minimum number of sequences can also be specified to avoid small sample size effects.

A sequence weighting scheme based on the topology of the phylogenetic tree (Gerstein et al., 1994) and one based on Markov random walk are provided. Both schemes down-weigh sequences that are very similar to others in the MSA.

3.2 Site filtering
After sequence filtering, sites that contain too many gaps or are too conserved can be discarded. The former is likely non-informative, while the latter may artificially inflate some coevolution scores.

3.3 Site pair filtering
Sites that are close in the primary sequence may produce trivial coevolution signals that hide other more unexpected coevolution events. Such site pairs can be filtered by specifying the minimum sequence separation. It has also been observed that insertions/deletions of multiple residues may create artificial coevolution signals (Patel et al., unpublished data). An option is provided for filtering site pairs that participate in the same gaps in too many sequences.

3.4 Other options
Grouping similar residues into a smaller alphabet may increase the sensitivity (Pollock et al., 1999). Our system provides two residue groupings proposed in the literature (Elcock and McCammon, 2001; Guharoy and Chakrabarti, 2005). It has also been observed that gaps might give important coevolution signals (Patel et al., unpublished data). An option is provided for treating gaps as noise or as the 21st residue when computing coevolution scores.


    4 SCORES ANALYSIS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 
In some proteins coevolving residues tend to be close to each other in the 3D structure (Dekker et al., 2004; Gloor et al., 2005). This suggests that the instability created by the mutation of a residue may be (partially) compensated for by a corresponding mutation of a close residue. Coevolution signals may thus convey some information about the protein structure. For instance it is interesting to study how well the coevolution scores predict the residue contact map (Halperin et al., 2006). Our system provides functions for plotting and analyzing the coevolution scores against inter-residue distances and standard machine-learning techniques (e.g. ROC curve) for evaluating the effectiveness of the various coevolution functions in predicting interacting residues. A shuffling scheme for evaluating the significance of the scores is also provided in the program package for running locally.


    5 EXAMPLE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 
We provide a worked example of our system in operation on the web site that illustrates coevolution in the transmembrane protein bacteriorhodopsin due to physically constrained residues not adjacent in the primary sequence. The example can be easily loaded by clicking the corresponding link on the main page. Running the example will compute the coevolution scores between site pairs separated by at least 3 residues. The scatterplot for coevolution scores against inter-residue distances generated using a known PDB structure (Fig. 1) shows that residue pairs receiving high scores do tend to be closer in the crystal structure.

Due to the intensive computation involved in the score calculations, currently only one scoring function is allowed to be used each time. Anyone interested in performing large-scale comparisons can download the Java programs from the web site and run locally on most platforms (Windows, Macintosh, Linux, UNIX, etc.). Detailed installation instructions are provided on the web site.


    6 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 
Although the scatterplot in Figure 1, and other studies in the literature, have suggested some relationships between coevolution and physical constraints, to what extent could coevolution scores help understand physical structures remains unclear. We hope the current application could serve as a neutral tool for further exploration in this area.

The current system focuses on functions that do not assume any mutation models. Other functions, such as the likelihood method in Pollock et al. (1999) and the Bayesian mutational mapping method (Dimmic et al., 2005) may be added in a later version.

Coevolution signals have been used in recent studies to predict sequence regions involved in protein–protein interactions with different levels of success (Halperin et al., 2006; Pazos and Valencia, 2002). We plan on extending the system to include inter-protein residue coevolution in the next phase of development.


    ACKNOWLEDGMENT
 
We would like to thank Rama Ranganathan and William Russ for helpful discussions and the anonymous reviewers for their valuable comments.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First authors. Back

Associate Editor: Olga Troyanskaya

1 Our implementation provides an asymmetric SCA score matrix, as well as extra summarizing statistics. Details can be found at the supplementary web site. Back

Received on July 5, 2007; revised on October 4, 2007; accepted on November 20, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SCORING FUNCTIONS
 3 PREPROCESSING OPTIONS
 4 SCORES ANALYSIS
 5 EXAMPLE
 6 DISCUSSION
 REFERENCES
 

    Dekker JP, et al. A perturbation-based method for calculating explicit likelihood of evolutionary co-variance in multiple sequence alignments. Bioinformatics (2004) 20:1565–1572.[Abstract/Free Full Text]

    Dimmic MW, et al. Detecting coevolving amino acid sites using Bayesian mutational mapping. Bioinformatics (2005) 21(Suppl. 1):i126–i135.[Abstract]

    Elcock AH, McCammon JA. Identification of protein oligomerization states by analysis of interface conservation. Proc. Natl Acad. Sci. USA (2001) 98:2990–2994.[Abstract/Free Full Text]

    Galitsky B. Revealing the set of mutually correlated positions for the protein families of immunoglobulin fold. In Silico Biol. (2003) 3:0022.

    Gerstein M, et al. Volume changes in protein evolution. J. Mol. Biol. (1994) 236:1067–1078.[CrossRef][Web of Science][Medline]

    Gloor GB, et al. Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry (2005) 44:7156–7165.[CrossRef][Web of Science][Medline]

    Gobel U, et al. Correlated mutations and residue contacts in proteins. Proteins: Struct. Funct.Genet. (1994) 18:309–317.[CrossRef]

    Goh CS, et al. Co-evolution of proteins with their interaction partners. J. Mol. Biol. (2000) 299:283–293.[CrossRef][Web of Science][Medline]

    Guharoy M, Chakrabarti P. Conservation and relative importance of residues across protein-protein interfaces. Proc. Natl. Acad. Sci. USA (2005) 102:15447–15452.[Abstract/Free Full Text]

    Halperin I, et al. Correlated mutations: advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin families. Proteins Struct. Funct. Bioinfo. (2006) 63:832–845.[CrossRef]

    Larson SM, et al. Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions. J. Mol. Biol. (2000) 303:433–446.[CrossRef][Web of Science][Medline]

    Lockless SW, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science (1999) 286:295–299.[Abstract/Free Full Text]

    McLachlan AD. Tests for comparing related amino-acid sequences cytochrome c and cytochrome c551. J. Mol. Biol. (1971) 61:409–424.[CrossRef][Web of Science][Medline]

    Martin LC, et al. Using information theory to search for co-evolving residues in proteins. Bioinformatics (2005) 21:4116–4124.[Abstract/Free Full Text]

    Pazos F, et al. Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. (1997) 271:511–523.[CrossRef][Web of Science][Medline]

    Pazos F, Valencia A. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins: Struct. Funct. Genet. (2002) 47:219–227.[CrossRef][Web of Science][Medline]

    Pellegrini M, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA (1999) 96:4285–4288.[Abstract/Free Full Text]

    Pollock DD, et al. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. (1999) 287:187–198.[CrossRef][Web of Science][Medline]

    Socolich M, et al. Evolutionary information for specifying a protein fold. Nature (2005) 437:512–518.[CrossRef][Medline]

    Suel GM, et al. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat. Struct. Biol. (2003) 10:59–69.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/2/290    most recent
btm584v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Google Scholar
Right arrow Articles by Yip, K. Y.
Right arrow Articles by Gerstein, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Yip, K. Y.
Right arrow Articles by Gerstein, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?