Bioinformatics Advance Access originally published online on February 26, 2008
Bioinformatics 2008 24(7):908-915; doi:10.1093/bioinformatics/btn057
Tracing evolutionary pressure
1Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research, Leiden University, Einsteinweg 55, 2333CC, Leiden and 2CMBI, Radboud University Nijmegen, Toernooiveld 1, 6525 ED, Nijmegen, The Netherlands
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Recent advances in sequencing techniques have yielded enormous amounts of protein sequence data from various species. This large dataset allows sequence comparison between paralogous and orthologous proteins to identify motifs or functional positions that account for the differences of functional subgroups (specificity positions). Algorithms such as SDPpred and the two-entropies analysis (TEA) have been developed to detect such specificity positions from a multiple sequence alignment (MSA) grouped into classes according to certain biological functions. Other algorithms such as TreeDet compute a classification and then predict specificity positions associated with it. However, there are still many unresolved questions: Was the optimal subdivision of a protein family achieved? Do the definitions at different levels of the phylogenetic tree affect the prediction of specificity positions? Can the whole phylogenetic tree be used instead of only one level in it to predict specificity positions?
Results: Here we present a novel method, TEA-O (Two-entropies analysis—Objective), to trace the evolutionary pressure from the root to the branches of the phylogenetic tree. At each level of the tree, a TEA plot is produced to capture the signal of the evolutionary pressure. A consensus TEA-O plot is composed from the whole series of plots to provide a condensed representation. Positions related to functions that evolved early (conserved) or later (specificity) are close to the lower-left or upper-left corner of the TEA-O plot, respectively. This novel approach allows an unbiased, user-independent, analysis of residue relevance in a protein family.
We compared our TEA-O method with various algorithms using both synthetic and real protein sequences. The results show that our method is robust, sensitive to subtle differences in evolutionary pressure during evolution and comprehensive because all positions in the MSA are presented in the consensus plot.
Availability: All computer programs and datasets used in this work are available at http://nava.liacs.nl/kye/TEA-O/ for academic use
Contact: k.ye{at}lacdr.leidenuniv.nl
| 1 INTRODUCTION |
|---|
|
|
|---|
A crucial aspect in protein sequence analysis is the identification of functional sites such as ligand-binding sites, active sites, protein–protein interaction sites, signal sequences and post-translational modification sites. Traditionally, the residues that are conserved in all members of a protein family are assembled as motifs and correlated to the main function of that protein family. In this way the ATP-binding motif (Walker et al., 1982), the zinc-finger motif (Klug and Rhodes, 1987) and the leucine-zipper motif (Landschulz et al., 1988) were discovered early on. Currently, these functional sites have been collected in databases such as PROSITE (Hulo et al., 2006), Pfam (Bateman et al., 2004), BLOCKS (Henikoff et al., 2000) and PRINTS (Attwood et al., 2003). Recent advances in sequencing techniques have yielded enormous amounts of sequence data from very many species. This prompted a further development of techniques to identify the residues that account for functional differences among subfamilies (del Sol Mesa et al., 2003; Gloor et al., 2005; Kalinina et al., 2004; Kuipers et al., 1997; Mirny and Gelfand, 2002; Oliveira et al., 2002; Pirovano et al., 2006; Ye et al., 2006). In this study, we call positions that are conserved in all members of a protein family conserved positions and those that are conserved within subfamilies but divergent among them specificity positions.
The evolutionary trace (ET) method recursively searches for the most conserved positions along the phylogenetic tree, from the root to each individual branch. It ranks globally conserved positions high. Positions that show diversity even within a group of highly homologous proteins are placed at the end of the ranking. Recently, the authors have applied ET to zinc binding domains (Lichtarge et al., 1997), steroid receptors (Raviscioni et al., 2006) and G protein-coupled receptors (Madabushi et al., 2004).
Mirny and Gelfand (2002) used mutual information to identify positions conserved within orthologs but different among paralogs. The orthologs and paralogs were defined through elaborate sequence comparison. Kalinina et al. incorporated some features of the method proposed by Mirny and Gelfand, and presented their joint method as an SDPpred web server (http://bioinf.fbb.msu.ru/SDPpred) (Kalinina et al., 2004; Mirny and Gelfand, 2002).
While some methods such as SDPpred require a definition of subfamilies in the MSA as input in order to predict specificity positions (Gu, 2006; Kalinina et al., 2004; Mirny and Gelfand, 2002; Pirovano et al., 2006; Ye et al., 2006), many others compute a classification and then search for specificity positions that are associated with the detected classification (del Sol Mesa et al., 2003; Donald and Shakhnovich, 2005a, b; Hannenhalli and Russell, 2000). For instance, in the paper by del Sol Mesa et al. (del Sol Mesa et al., 2003), TreeDet, which contains three algorithms (http://www.pdg.cnb.uam.es/Servers/treedet/), The Level Entropy Method (S-method), Mutational Behavior Method (MB-method) and SequenceSpace Automatization Method (SS-method), was developed to reveal Tree-determinant residues. The S-method uses relative entropy to search for an optimal splitting of the input MSA and then considers positions conserved within classes but different among classes as the tree-determinants. The SS-method applies principal component analysis of the multiple sequence alignment to compute an optimal number of clusters and then searches for the positions conserved in, but different among, clusters. The MB-method looks for positions in the MSA whose mutational behavior resembles the phylogeny of that MSA. Different as those methods are in searching for globally conserved or specificity positions, they all aim to derive one scoring function to rank positions. The combination of two scoring functions, however, may yield a greater resolution. Oliveira et al. (2003) demonstrated that a scatter plot of variability versus Shannon entropy provided better residue classification than with either variability or Shannon entropy alone.
Building on this principle in our previous study, we showed that the combination of overall Shannon entropy and the average value of Shannon entropy within subfamilies instead of overall variability lead to a better classification of specificity positions (e.g. the ligand binding site of G protein-coupled receptors) than an entropy-variability plot (Ye et al., 2006). Globally conserved positions, specificity positions and others are scattered in the lower left, upper left and upper right corners of the plot, respectively. Powerful as our two-entropies analysis (TEA) appeared to be, it also demanded a subjective definition of subfamilies. The combination of the necessity to have a classification and a need for the use of as many sequences as possible makes the TEA method expensive in terms of human intervention. One solution may be to detect subfamilies for the user as TreeDet does. Since each method has its own mechanism to score for optimal, they may classify proteins differently or define subfamilies at different levels of the phylogenetic tree. Hence, a difference in subdivision will be brought to the subsequent prediction of specificity positions as demonstrated in this study.
If we consider a subdivision as a single level of the phylogenetic tree, can we then use the entire phylogenetic tree to capture the signal of evolutionary pressure preserved at all levels of the tree? In this study, we developed the so-called TEA-O (Two-Entropies Analysis—Objective) method to automatically trace the evolutionary pressure along the entire phylogenetic tree to differentiate functional positions including conserved and specificity ones. We compare our approach with the TEA method described in our previous study, and with Evolutionary Trace, SDPpred and TreeDet on two-protein families, LacI bacterial transcription factors and G protein-coupled receptors.
| 2 MATERIALS AND METHODS |
|---|
|
|
|---|
2.1 TEA-O
In our previous study, we proposed the TEA to differentiate between functional positions in a given MSA grouped into multiple subfamilies. Given a definition of subfamilies, we measured the conservation in terms of information entropy for each position at both the entire superfamily and subfamily level. Then we used these two entropy values to produce a plot. In this study, we propose a radically different approach to trace the evolutionary pressure from the root to the branches of a phylogenetic tree. To distinguish our new approach from the old one, TEA, we call it the TEA-O method.
Figure 1 depicts the work flow of TEA-O. The input for the TEA-O method is a single MSA without a definition of subfamilies. We used p-distance and UPGMA algorithm implemented in MEGA3.1 (Kumar et al., 2004) to construct a phylogenetic tree. As shown in Figure 1, the vertical bar indicates a certain evolutionary distance (often expressed as a percentage sequence similarity) that could be used as a cutoff for calling a set of sequences a subfamily. For example, in the fifth way to define subfamilies, five subfamilies would result: (1), (2), (3,4), (5,6), (7,8). Intuitively, one would like to combine sequence (1) and (2) in one subfamily, but it is equally defendable to move the vertical bar a short distance to the right to split the subfamily (7, 8). Clearly the definition of subfamilies using the method illustrated in Figure 1 is arbitrary and large differences in the locations of residues in the whole series of two-entropies plots are observed if different subfamily definitions are used. In this study, we show that the arbitrary nature of the subfamily definition can be overcome by repeating the two-entropies plot generation process for each of the N possible subfamily definitions (N being the number of sequences in the MSA), and using the average of all resulting two-entropies plots.
|
In addition, we modified the equation of Shannon entropy in order to consider gaps in the alignment as well as the potential unbalance in the sizes of subfamilies. A seemingly easy way to treat gaps in the calculation of information entropy is to consider a gap as the 21st residue as is done in the ET method. However, this will lead to the obvious error that the more gaps in one position the more conserved the position is. For example, given an alignment with 100 protein sequences, ET defines a position with 1Y and 99 - as more conserved, thus more important, than another one with 98Y and 2 -. Thus we treated each gap as a different residue in calculating entropy values. For example, the first gap in the position is the 21st residue; the second the 22nd and so on. In this way, positions with a high percentage of gaps have high entropy values.
Since the number of proteins in one group determines the maximum information entropy value, we also adjusted the entropy value according to the size of each group while calculating entropy values as suggested by Valdar (2002).
For a given definition of subfamilies, the average entropy value among all subfamilies at position i is given by
|
|
Finally in the consensus plot, the x coordinate of each position is the average of all TEA plots while the y coordinate remains the same.
The weighting factor scales the entropy to range [0,1] so that scoring functions may be easily derived to rank conserved and specificity positions from the consensus TEA-O plot. The x-coordinates of all positions in the TEA-O plot are then standardized to [0,1]. Since globally conserved positions are close to the lower-left corner of the plot, the scoring function for them is defined as the distance to (0,0).
|
|
On the other hand, the specificity positions are close to the upper-left corner of the plot so that the scoring function for them is defined as the distance to (0,1).
|
|
Ex and Ey are the coordinates of a given position in the TEA-O plot.
2.2 Datasets for validation
2.2.1 Synthetic dataset
In order to demonstrate TEA-O's features and compare it with other methods, we first simulated an MSA according to our simple protein evolution model described below. Please be informed that this model does not aim to include all aspect of protein evolution but to shed light only on how different definitions of subfamilies affect the prediction of specificity positions. For example, we did not introduce any substitution matrices to the model since all methods in this study use either variability- or entropy-related scores.
We simulated four generations of protein evolution by modifying the evolutionary pressure between generations. As a start we randomly generated a protein with 200 residues. Then an evolutionary pressure (EP, float number from 0 to 1) was randomly assigned to each of the 200 residues. If a position is assigned an EP close to 1, the probability of this residue mutating to another one will be small, and this position will be conserved during evolution; if the EP is close to 0, the chance of mutation will be high, and this position will be divergent. We marked those positions of the protein with an evolutionary pressure bigger than 0.85 as the globally conserved ones and called them EP1. Subsequently we multiplied the ancestor protein five times to get the second generation of proteins and mutated each position in every protein according to the assigned EP as follows. A random number between 0 and 1 was generated, where after it was compared with its EP. If the generated number was larger than EP, one of the 20 amino-acid subtypes was randomly chosen to replace the current residue. In this case, we obtained five proteins in the second generation.
Proteins obtain new functional sites during evolution. If certain advantages are associated with these new functional sites, the evolutionary pressures will be tightened on the positions involved. Hence we modified the EP for some positions after we obtained each generation in order to simulate such gain of function during evolution. We randomly chose 20 positions (10% of the sequence length) with EP smaller than 0.5 and randomly modified their EP to a value between 0.85 and 1.0. We called these 20 positions EP2. After that we multiplied each member of the second generation five times and mutated them according to the updated evolutionary pressure. This yielded the third generation with 25 proteins.
We repeated such modification of EP, multiplication and mutation to get the fourth generation of proteins as well as a list of EP3 positions. In this way we obtained 1*5*5*5 = 125 proteins as the fourth generation of proteins.
2.2.2 Real protein sequences
LacI bacterial transcription factors: The LacI family is one of the largest families of bacterial transcription factors. Dr Mirny (MIT, USA) kindly provided us with the MSA of this family and the definition of subfamilies. The multiple sequence alignment of 55 LacI proteins was classified into 15 subfamilies (Mirny and Gelfand, 2002). Extensive experimental (Glasfeld et al., 1999; Lehming et al., 1990; Sartorius et al., 1991; Suckow et al., 1996) and structure information (Bell and Lewis, 2001; Schumacher et al., 1997) allowed us to evaluate our prediction. Suckow et al. replaced position 2–329 of Lac repressor with 12 or 13 of the 20 naturally occurring amino acids. These 4000 well-defined mutants yielded a functional classification for each position. The non-conserved residues in group XI (IPTG contacts) were defined as the specificity positions. The specificity positions based on experimental evidence were L73, N125, P127, A145, S191, S193, W220, Q248, T276 and F293. We also defined specificity positions based on structural information. The non-conserved residues within 4 Å distance to the ligand, ONPF (ortho-nitrophenyl-beta-D-fucopyranoside), in its crystal structure (PDB: 1JWL
[PDB]
), were defined as specificity positions; these were L73, S191, V192 and W220.
G protein-coupled receptors (GPCRs): GPCRs are integral cell membrane proteins involved in signal transduction. Such a mediatory role makes them important drug targets. We extracted the MSA of class A GPCRs from the latest version of the GPCRDB (June 2006 release (10.0), http://www.gpcr.org/7tm/) (Horn et al., 2003). This yielded an MSA of 2065 protein sequences with an average sequence identity of 25.8%. The MSA was classified into 77 subfamilies according to the recognition of endogenous ligands since SDPpred and TEA require such information in predicting specificity positions.
The specificity positions based on structural information were defined as residues within 4 Å distance of the endogenous ligand retinal but not being a member of the well-known conserved aromatic cluster in helix 6 (Javitch et al., 1998). They are E113, A117, T118, G121, E122, M207, A292 and K296 of bovine rhodopsin. We also defined specificity positions based on experimental evidence reviewed by Shi and Javitch (Shi and Javitch, 2002). Positions were considered to be part of the specificity recognition site if they were located in a TM region and implicated in ligand binding in aminergic receptors. The specificity positions based on experimental evidence are T94, T97, E113, G114, A 117, T118, G121, C167, L172, F203, V204, M207, F208, H211, A272, A292, F293 and K296 according to their numbering in bovine rhodopsin.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Specificity positions in the synthetic dataset
Our new TEA-O method does not require the user to define subfamilies within a protein family while many methods including TEA in our previous study demand such a definition. Thus, we first examined how different definitions of subfamilies affect the identification of specificity positions using the synthetic dataset. When we divided the MSA at the third generation level, we obtained 25 subfamilies, each containing five proteins. As shown in Figure 2A, when we used such a definition, our previous method, TEA identified EP1 (conserved), EP2 and EP3 (specificity). The separation of EP3 from other positions is much better than of EP2. However, if we defined subfamilies at the second generation level (five subfamilies, each of them contains 25 proteins), separation of EP2 remained the same but EP3 mixed with other less determining positions (Fig. 2B).
|
When we used the new TEA-O method to analyze the same MSA without a definition of subfamilies, we obtained a series of TEA plots when we traced the evolutionary pressure from the root to the leaves of the phylogenetic tree. The positions under different evolutionary pressures (EP1, EP2, EP3 and others) migrate differently as shown in the animation in the supporting material. When we averaged the entire series of plots, we obtained a TEA-O plot (Fig. 2C), which is similar to the plot generated by the TEA using the MSA grouped at the third generation (Fig. 2A). Thus the new TEA-O method maintains the same separation ability as the previous TEA method even when there is less information in the input. From Figure 2A and B we learned that definitions of subfamilies at different levels of the phylogenetic tree dramatically change the performance of the TEA method. We next examined SDPpred using the same MSA with the subfamily definition at the third or the second generation. We mapped the predictions of SDPpred on our TEA-O plot (Fig. 2C) as shown in Figure 2D and E. When the definitions of subfamilies are at the second or the third generation, SPDpred identified part of EP2 or EP3, respectively. This clearly demonstrates that grouping of protein sequences has a profound impact on the prediction of specificity positions for algorithms that require such a definition of subfamilies.
To compare the prediction of TEA-O on the synthetic dataset with TreeDet which detects subfamilies by itself, we directly mapped the prediction of the latter method on the TEA-O plot. TreeDet contains three different algorithms: the MB-method, the SS and the S-method. The S-method failed to identify any specificity positions. As shown in Figure 2F, the MB-method and SS-method behaved very differently since the MB-method failed to identify any of EP3 but found more EP2 than the SS-method. Both the MB and SS-methods mislabeled a few globally conserved positions (EP1) as specificity positions.
3.2 Prediction of specificity positions of real protein sequences
The performance of TEA-O was compared with the SDPpred, TEA, TreeDet and ET methods in predicting specificity positions of both LacI bacterial transcription factors and G protein-coupled receptors using either experimental or structural information (Figure 4 in Supplementary Materials). TEA-O performs comparably to or even better than SDPpred and TEA and, importantly, requires no classification information from the user. The poor performance of TreeDet may be due to misclassification. If we take the manual classification of 15 subfamilies in the LacI protein family by Mirny and Gelfand for comparison, TreeDet detected only six of these and some proteins from the same subfamily were identified in the wrong subfamilies (see supporting material for more information on the differences between manual and automatic classifications in the LacI family).
We could not apply TreeDet on the GPCRs dataset because the number of proteins in this family (2065) is too large for TreeDet (maximally 200). ET did not identify specificity positions of GPCRs and LacI with either of the two implementations of this method. The first implementation of ET, ET report maker, accepts a single protein but not an MSA as input. The second implementation, ET viewer, does accept an MSA but it ranks positions according to their importance with respect to conservation. In the list, the conserved positions are ranked high while it is not stated which part of the list is related to the specificity positions. Madabushi et al. used ET to compute the most conserved residues in all GPCRs as well as in rhodopsins (a particular subfamily of GPCRs). Then they defined the rhodopsin-specific positions by subtracting conserved residues in all GPCRs from those conserved in rhodopsin (Madabushi et al., 2004). As shown in Figure 4 (Supplementary Materials) the performance of ET on the GPCRs dataset is comparable to SDPpred, TEA and TEA-O, when based on structure information but performs the worst when based on experimental evidence (see Materials and Methods for definitions).
3.3 Robustness of TEA-O
The TEA-O method in this study neither requires a classification from the user nor detects one. It collects signals about evolutionary pressure within and among subfamilies with the guide of a phylogenetic tree. To examine the influence of different phylogenetic trees on the results of the TEA-O method, we used three distance measures and three phylogeny reconstruction algorithms to produce nine phylogenetic trees. The three algorithms, Neighbor-Joining (NJ), Minimum Evolution (ME) and UPGMA, and the three distance measures, No of difference (Num), p-distance (P) and Poisson correction (PC), are implemented in MEGA3.1 (Kumar et al., 2004).
Because the y-coordinates (total entropy) of the positions in the consensus plots are tree independent and are always the same, we only need to examine the difference between each two sets of x-coordinates (the consensus average entropy among all subfamilies) calculated from different trees. We used the Pearson correlation coefficient to measure the similarity. As shown in Supplementary Tables I and II, the x-coordinate sets based on different trees are highly correlated, which indicates that the resulting consensus plots are very similar. Thus, we believe that although our TEA-O method needs a phylogenetic tree to capture evolutionary pressure, it does not matter which tree is used as long as homologous proteins are closer to each other than divergent ones.
| 4 DISCUSSION |
|---|
|
|
|---|
4.1 Comparison with other methods
It is generally believed that the conserved positions in an MSA are functionally important. ET is one of the first methods that systematically explored the correlation between residue conservation and functional importance. More importantly, the authors of the method proposed that specificity positions are conserved within subfamilies but divergent among them. However, both implementations of ET mainly focus on conserved positions and no automatic mechanism has been provided to identify specificity positions. For example, one may use ET viewer to find conserved positions in the entire MSA as well as in one particular subfamily. Then the user has to manually subtract the positions conserved in the entire MSA from those conserved in the subfamily. This yields a group of residues conserved in the subfamily only, but without ranking.
We believe that specificity positions in the same protein family should be largely overlaid while each subfamily may slightly modify its specificity site to accommodate the difference of the modulator. The way ET viewer is implemented to identify specificity positions in one particular subfamily overestimates the difference among the specificity sites of various subfamilies. Thus in practice, ET ignores the conservation signal from the peer subfamilies to eliminate false positives while tracing the specificity positions in one subfamily. Many conserved positions identified by ET in one subfamily may not be functionally important but just the consequence of a small population of proteins or a short evolution history since the subfamily members began to evolve.
Although TreeDet does not require a definition of subfamilies by the user, it tries to find an optimal definition of subfamilies early in the calculation. As we demonstrated in Figure 2, any prediction based on only one level of the phylogenetic tree will introduce bias to positions associated with that level of the tree. In addition, misclassification will jeopardize the prediction (as shown in Figs 4A and B in Supplementary Material).
The new TEA-O is better than TEA (Fig. 4, Supplementary Materials) mainly because its prediction is not biased to a particular classification. When we produce a classification either manually or computationally, we are speculating the optimal cutoff to recognize certain proteins with similar biological functions. In the case of GPCRs, we defined subfamilies based on the recognition of endogenous ligands according to the web-publication (http://www.iuphar-db.org/GPCR/) of the International Union of Basic and Clinical Pharmacology (IUPHAR). Even in this international standard about GPCRs classification, the definition of subfamilies is quite arbitrary. For example, dopamine receptors and adrenergic receptors are considered as two subfamilies by IUPHAR. The adrenergic receptors are further classified as alpha-adrenergic and beta-adrenergic receptors. The compound noradrenaline prefers alpha-adrenergic receptors, whereas adrenaline prefers beta-adrenergic receptors. As shown in Figure 3, the only difference between the two compounds is a methyl group on the terminal amino group. In addition, the compound dopamine lacks the beta-hydroxy group of noradrenaline and receptors that recognize dopamine are defined as one family, dopamine receptors. If we look at these three molecules, it is indeed arbitrary to group alpha- and beta-adrenergic receptors as one family and dopamine receptor as another. One might even define all of them as one family if we consider the differences among these three compounds negligible. As we demonstrated in Figure 2A–2D, any classification will introduce a bias to the positions associated strongly with it. The classification of GPCRs we used for the TEA method is just one of the possible classifications. It has been also shown in the LacI dataset that when classification is sufficiently optimal, TEA-O performs as well as TEA. This indicates that by averaging the entire series of TEA pots, the specificity signal in TEA-O plot is as strong as in the TEA plot when optimal classification is given.
|
4.2 Further development of TEA-O
In this study, we used information entropy to measure conservation for a given position in the alignment. Information entropy as a measure is certainly more informative than variability since it considers not only the number of amino-acid types but also their relative frequencies. However, information entropy treats amino acids as 20 independent symbols and ignores the fact that amino acids can be quite similar. For two positions in an alignment, e.g. one with 50% D and 50% E and the other with 50% D and 50% W, information entropy as a measure yields the same score, although the former combination is more conserved as both D and E are negatively charged. Ideally, a conservation measure should also account for the mutation type.
We introduced a weighing factor scaling the entropy values in between 0 and 1. Such a weighing factor is only related to the number of sequences in a given alignment, and does not account for sequence similarity between various alignments. Suppose we have two subfamilies, A and B, for which the sequences in A are very similar to each other while those in B are very different. If the same substitution pattern is observed in both A and B, a conservation measure should yield different scores in the two subfamilies for such a position. Today, most specificity prediction algorithms still use relatively simple conservation measures such as variability and information entropy. In a future study we aim to use a more sophisticated conservation measure such as Rate4Site (Landau et al., 2005; Pupko et al., 2002) instead of the information entropy we used in the present study.
| 5 CONCLUSION |
|---|
|
|
|---|
In this study we present a novel method coined TEA-O to relate residues to their potential functions from an MSA. Compared to TEA, TEA-O requires only an MSA to predict both conserved and specificity positions. This novel approach allows an unbiased, user-independent, analysis of residue relevance in a protein family.
TEA-O traces the evolutionary pressure from the root to the branches of the phylogenetic tree. At each level of the tree, a two-entropies plot is produced to capture the signal of the evolutionary pressure. The whole series of plots comprehensively reproduce the evolutionary pressure on each position in every period of evolutionary history. A consensus plot is composed from the whole series of plots to provide a condensed representation. Positions related to the functions that evolved early (conserved) are close to the lower left corner of the consensus plot while those related to functions evolved later (specificity) are located close to the upper left corner of the plot. Scoring functions for ranking conserved and specificity positions are provided.
We compared our TEA-O method with various algorithms using both synthetic and real protein sequences. The results show that our method is robust, sensitive to subtle differences in evolutionary pressure during evolution and comprehensive since all positions in the MSA are presented in the consensus plot.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Dr Leonid A. Mirny for making the LacI dataset available for this study. We thank Margot W. Beukers and Qilan Li for critical review and comments. This work was financially supported by an NWO-Bioinformatics Breakthrough Grant (050-71-041), The Netherlands.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Dmitrij Frishman
Received on September 20, 2007; revised on February 6, 2008; accepted on February 9, 2008
| REFERENCES |
|---|
|
|
|---|
Attwood TK, et al. PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res (2003) 31:400–402.
Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.
Bell CE, Lewis M. Crystallographic analysis of Lac repressor bound to natural operator O1. J. Mol. Biol (2001) 312:921–926.[CrossRef][Web of Science][Medline]
del Sol Mesa A, et al. Automatic methods for predicting functionally important residues. J. Mol. Biol (2003) 326:1289–1302.[CrossRef][Web of Science][Medline]
Donald JE, Shakhnovich EI. Determining functional specificity from protein sequences. Bioinformatics (2005a) 21:2629–2635.
Donald JE, Shakhnovich EI. Predicting specificity-determining residues in two large eukaryotic transcription factor families. Nucleic Acids Res (2005b) 33:4455–4465.
Glasfeld A, et al. The role of lysine 55 in determining the specificity of the purine repressor for its operators through minor groove interactions. J. Mol. Biol (1999) 291:347–361.[CrossRef][Web of Science][Medline]
Gloor GB, et al. Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry (2005) 44:7156–7165.[CrossRef][Web of Science][Medline]
Gu X. A simple statistical method for estimating type-II (cluster-specific) functional divergence of protein sequences. Mol. Biol. Evol (2006) 23:1937–1945.
Hannenhalli SS, Russell RB. Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol (2000) 303:61–76.[CrossRef][Web of Science][Medline]
Henikoff JG, et al. Increased coverage of protein families with the blocks database servers. Nucleic Acids Res (2000) 28:228–230.
Horn F, et al. GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res (2003) 31:294–297.
Hulo N, et al. The PROSITE database. Nucleic Acids Res (2006) 34:D227–D230.
Javitch JA, et al. A cluster of aromatic residues in the sixth membrane-spanning segment of the dopamine D2 receptor is accessible in the binding-site crevice. Biochemistry (1998) 37:998–1006.[CrossRef][Web of Science][Medline]
Kalinina OV, et al. SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Res (2004) 32:W424–W428.
Klug A, Rhodes D. Zinc fingers: a novel protein fold for nucleic acid recognition. Cold Spring Harb. Symp. Quant. Biol (1987) 52:473–482.
Kuipers W, et al. Identification of class-determining residues in G protein-coupled receptors by sequence analysis. Receptors Channels (1997) 5:159–174.[Web of Science][Medline]
Kumar S, et al. MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Brief Bioinform (2004) 5:150–163.
Landau M, et al. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res (2005) 33:W299–W302.
Landschulz WH, et al. The leucine zipper: a hypothetical structure common to a new class of DNA binding proteins. Science (1988) 240:1759–1764.
Lehming N, et al. Mutant lac repressors with new specificities hint at rules for protein—DNA recognition. EMBO J (1990) 9:615–621.[Web of Science][Medline]
Lichtarge O, et al. Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol (1997) 274:325–337.[CrossRef][Web of Science][Medline]
Madabushi S, et al. Evolutionary trace of G protein-coupled receptors reveals clusters of residues that determine global and class-specific functions. J. Biol. Chem (2004) 279:8126–8132.
Mirny LA, Gelfand MS. Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J. Mol. Biol (2002) 321:7–20.[CrossRef][Web of Science][Medline]
Oliveira L, et al. Correlated mutation analyses on very large sequence families. Chembiochem (2002) 3:1010–1017.[CrossRef][Web of Science][Medline]
Oliveira L, et al. Identification of functionally conserved residues with the use of entropy-variability plots. Proteins (2003) 52:544–552.[CrossRef][Web of Science][Medline]
Pirovano W, et al. Sequence comparison by sequence harmony identifies subtype-specific functional sites. Nucleic Acids Res (2006) 34:6540–6548.
Pupko T, et al. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics (2002) 18(Suppl. 1):S71–S77.[Abstract]
Raviscioni M, et al. Evolutionary identification of a subtype specific functional site in the ligand binding domain of steroid receptors. Proteins (2006) 64:1046–1057.[CrossRef][Web of Science][Medline]
Sartorius J, et al. The roles of residues 5 and 9 of the recognition helix of Lac repressor in lac operator binding. J. Mol. Biol (1991) 218:313–321.[CrossRef][Web of Science][Medline]
Schumacher MA, et al. The X-ray structure of the PurR-guanine-purF operator complex reveals the contributions of complementary electrostatic surfaces and a water-mediated hydrogen bond to corepressor specificity and binding affinity. J. Biol. Chem (1997) 272:22648–22653.
Shi L, Javitch JA. The binding site of aminergic G protein-coupled receptors: the transmembrane segments and second extracellular loop. Annu. Rev. Pharmacol. Toxicol (2002) 42:437–467.[CrossRef][Web of Science][Medline]
Suckow J, et al. Genetic studies of the Lac repressor. XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J. Mol. Biol (1996) 261:509–523.[CrossRef][Web of Science][Medline]
Valdar WS. Scoring residue conservation. Proteins (2002) 48:227–241.[CrossRef][Web of Science][Medline]
Walker JE, et al. Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold. EMBO J (1982) 1:945–951.[Web of Science][Medline]
Ye K, et al. A two-entropies analysis to identify functional positions in the transmembrane region of class A G protein-coupled receptors. Proteins (2006) 63:1018–1030.[CrossRef][Web of Science][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


