Bioinformatics Advance Access originally published online on June 9, 2005
Bioinformatics 2005 21(16):3409-3415; doi:10.1093/bioinformatics/bti532
Refined phylogenetic profiles method for predicting protein-protein interactions


1School of Life Sciences & Technology, Shanghai Jiaotong University Shanghai 200240, China
2Department of Biology, Hunan Normal University Changsha 410081, China
3Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences Shanghai 200031, China
4The Chinese National Center for Biotechnology Development Beijing 100081, China
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: The increasing availability of complete genome sequences provides an excellent opportunity for the further development of tools for functional studies in proteomics. Several experimental approaches and in silico algorithms have been developed to cluster proteins into networks of biological significance that may provide new biological insights, especially into the functions of many uncharacterized proteins. Among these methods, the phylogenetic profiles method has been widely used to predict protein-protein interactions. It involves the selection of reference organisms and the identification of homologous proteins. Up to now, no published report has systematically studied the effects of reference genome selection and homologous protein identification on the accuracy of this method.
Results: In this study, we optimized the phylogenetic profiles method by integrating phylogenetic relationships among reference organisms and sequence homology information to improve prediction accuracy. Our results revealed that the selection of the reference organism set and the criteria for homology identification are two critical factors for the prediction accuracy of this method. Our refined phylogenetic profiles method shows better performance and potentially provides more reliable functional linkages than previous methods.
Availability: The software (C, Perl) is available from the corresponding author.
Contact: yxli@sibs.ac.cn; tlshi@sibs.ac.cn; zhaoaimin@cncbd.org.cn
Supplementary information: There are three supplementary materials online, including related materials and results.
| INTRODUCTION |
|---|
Many cellular processes, such as metabolic and signal transduction pathways, involve protein-protein interactions. Therefore, it is important to identify these interactions to fully understand the molecular mechanisms of the living cell (Auerbach et al., 2002; Eisenberg et al., 2000). The increasing availability of complete genomic sequences makes it possible to apply in silico or experimentally based reverse proteomics approaches to the detection of protein-protein interactions on a proteome scale (Walhout and Vidal, 2001). These approaches include the yeast two-hybrid assay (Fields and Song, 1989), the gene neighbor method (Overbeek et al., 1999), the gene fusion method (Enright et al., 1999; Marcotte et al., 1999a) and the phylogenetic profiles method (Date and Marcotte, 2003; Gaasterland and Ragan, 1998; Marcotte et al., 1999b; Pellegrini et al., 1999). The resulting protein-protein interactions may provide a new basis for biological discoveries, especially for understanding the functions of many uncharacterized proteins (Chen and Xu, 2003).
The phylogenetic profiles method (Gaasterland and Ragan, 1998; Pellegrini et al., 1999), an in silico method, is based on the assumption that there is strong selective pressure on proteins that functionally interact with each other, so that they are inherited together during speciation events. Thus, proteins in a target organism with the same or similar phylogenetic profiles [constructed by detecting homologous proteins as being present or absent in reference organisms with a predetermined threshold BLASTP (Altschul et al., 1997) E-value] can be hypothesized to interact with each other physically or functionally. Therefore, selection of reference organisms and determination of the threshold BLASTP E-value are the two critical steps of this method. As more and more completely sequenced genomes become available, it is natural to ask whether the addition of new genomes would improve the accuracy of the phylogenetic profiles method (Zheng et al., 2002) and whether changing the threshold E-value would affect the method's accuracy. However, to our knowledge, no published report has systematically studied the effects of the reference genome selection and the E-value threshold on the accuracy of this method.
We therefore investigated the phylogenetic profiles method by considering the selection of reference organisms and the choice of a suitable E-value threshold simultaneously. The results indicated that the reference organism selection and the E-value threshold greatly affect the performance of the method. Using our refined method, we predicted protein interaction datasets and unknown protein functions for six microorganisms. Moreover, a comparison of the protein-protein interactions of Escherichia coli K12 predicted by Date and Marcotte (2003) (DM method) with our predicted protein-protein interactions demonstrated that our refined phylogenetic profiles method performs better and predicts more reliable protein interactions than the DM method. Therefore, it is essential to consider the selection of reference organisms and the E-value threshold when applying the phylogenetic profiles method to the prediction of protein-protein interactions.
| MATERIALS AND METHODS |
|---|
Different combinations of reference organisms and E-value thresholds
The protein sequences of 163 organisms were downloaded from the National Center for Biotechnology Information (NCBI) ftp site (ftp.ncbi.nih.gov/genomes). Caulobacter crescentus (Ccr), E.coli K12 (Eco), E.coli O157:H7 (Ecs), Pseudomonas aeruginosa (Pae), Staphylococcus aureus subsp. aureus N315 (Sau) and Vibrio cholerae (Vch) were regarded as target organisms and the remainder as reference organisms. The classification names of the 163 organisms were downloaded from the NCBI Taxonomy site and used to reconstruct an evolutionary tree (see Supplementary Material 1, Figure 1 online).
Each clade, one basic element of the evolutionary tree, is usually monophyletic, that is, all members in one clade share a common ancestor, meaning that each organism corresponds to different clades in different hierarchies within the evolutionary tree. We determined the subsets of selected genomes that corresponded to the clade hierarchy and selected an organism far away from the other members of the same clade. When a clade with several organisms had no subclades, we randomly selected an organism from it. When a clade had subclades, we first selected the subclade with the fewest sub-subclades and then selected an organism as described above (detailed in Fig. 1). Therefore, the rationale for selecting different sets of organisms is, for a given clade, to take the organism that is evolutionarily the furthest apart from the rest of the organisms in that clade. Consequently, the selected organism can be regarded as a form of outlier compared with the others in that clade. Based on this rationale, we selected nine sets of reference organisms, named 18, 35, 55, 65, 86, 106, 128, 145 and 162, respectively.
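The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software (which is in C and Perl): the nested-dictionary tree representation, the function names and the toy taxonomy are all hypothetical, and "far away from the other members" is approximated purely by descending into the subclade with the fewest sub-subclades.

```python
import random


def count_subclades(clade):
    """Number of direct subclades; a plain list of organisms has none."""
    return len(clade) if isinstance(clade, dict) else 0


def pick_representative(clade, rng):
    """Pick one organism from a clade following the rule described in the text."""
    if isinstance(clade, list):
        # No subclades: choose an organism at random.
        return rng.choice(clade)
    # Subclades present: descend into the one with the fewest sub-subclades.
    name = min(clade, key=lambda key: count_subclades(clade[key]))
    return pick_representative(clade[name], rng)


# Hypothetical toy taxonomy, not the tree used in the study.
toy_clade = {
    "Gammaproteobacteria": {
        "Enterobacteriales": ["Eco", "Ecs"],
        "Vibrionales": ["Vch"],
    },
    "Alphaproteobacteria": ["Ccr"],
}

rng = random.Random(0)
representative = pick_representative(toy_clade, rng)
```

In this toy tree the branch without subclades is preferred, so the single organism it contains is returned. Ties between subclades are broken by dictionary order here, which is an arbitrary choice of the sketch.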
At the same time, seven E-value thresholds were applied to determine whether a homologous protein was present or absent, using BLASTP: 1 × 10^-1 (abbreviated as E01), 1 × 10^-2 (E02), 1 × 10^-3 (E03), 1 × 10^-4 (E04), 1 × 10^-5 (E05), 1 × 10^-7 (E07) and 1 × 10^-10 (E10). Thus, 63 combinations of different reference organisms and various E-value thresholds were formed (e.g. 18E01, 35E01, 145E01, 145E04, 162E01 and 162E10).
Phylogenetic profiles method and the threshold of mutual information
The protein sequences of a target organism (e.g. E.coli K12) were compared with those from reference organisms using BLASTP. For each protein i of the target organism, the BLAST E-value of the top scoring sequence alignment between protein i and all the proteins of each reference organism j was assigned to Eij. Phylogenetic profiles were constructed as follows: for each protein i, a vector was generated with elements Pij, where Pij = -1/log Eij when Eij is lower than the predetermined E-value threshold, and Pij = 1 when Eij is greater than or equal to the predetermined E-value threshold. For the DM method, Pij = 1 when the E-value is greater than or equal to 1 × 10^-1. Construction of the shuffled phylogenetic profiles and the comparison of the actual and shuffled phylogenetic profiles were performed using the DM method (Date and Marcotte, 2003). The threshold of mutual information (TMI) of each combination was determined from differences between the distributions of the actual and shuffled phylogenetic profiles (Fig. 2 and Supplementary Material 1, Figure 2). Linkages between two proteins whose mutual information value was higher than the TMI were regarded as putative functional linkages. Linkages between homologous proteins whose BLAST E-value was lower than 1 × 10^-4 were removed.
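As a quick check of the arithmetic, the 63 combination labels can be enumerated as follows. The sketch assumes that each label such as E04 denotes a threshold of ten to the corresponding negative power; the variable names are illustrative only.

```python
# Nine reference organism sets, named by their size as in the text.
reference_sets = [18, 35, 55, 65, 86, 106, 128, 145, 162]

# Threshold labels mapped to E-values (assumed to be negative powers of ten).
thresholds = {
    "E01": 1e-1,
    "E02": 1e-2,
    "E03": 1e-3,
    "E04": 1e-4,
    "E05": 1e-5,
    "E07": 1e-7,
    "E10": 1e-10,
}

# Combination label -> (number of reference organisms, E-value threshold).
combinations = {
    f"{size}{label}": (size, e_value)
    for size in reference_sets
    for label, e_value in thresholds.items()
}
```

Nine sets times seven thresholds gives the 63 combinations, with labels such as 18E01 and 145E04 matching the examples in the text.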
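The profile construction and the mutual information score can be sketched as below. Several details are assumptions rather than statements from the text: the profile value is taken as -1/log10(E) (base 10, with the sign chosen so that values fall between 0 and 1), an E-value of exactly 0 is mapped to 0, and profile values are discretized into bins of width 0.1 before the entropies are computed. The function names are hypothetical.

```python
import math
from collections import Counter


def profile_value(e_value, threshold):
    """Profile element: -1/log10(E) below the threshold, otherwise 1."""
    if e_value >= threshold:
        return 1.0  # no homolog detected at this threshold
    if e_value <= 0.0:
        return 0.0  # assumption: BLAST reports 0 for extremely strong hits
    return min(1.0, -1.0 / math.log10(e_value))


def to_bin(value, n_bins=10):
    """Discretize a profile value in [0, 1]; the value 1.0 gets its own bin."""
    return min(int(value * n_bins), n_bins)


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of hashable labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mutual_information(profile_a, profile_b):
    """MI(A, B) = H(A) + H(B) - H(A, B) over binned profile values."""
    a = [to_bin(v) for v in profile_a]
    b = [to_bin(v) for v in profile_b]
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


# Toy E-values of one protein against four hypothetical reference organisms.
e_values = [1e-10, 1e-2, 1e-10, 5.0]
profile = [profile_value(e, threshold=1e-4) for e in e_values]
score = mutual_information(profile, profile)
```

Protein pairs whose score exceeds the TMI of the chosen combination would then be kept as putative functional linkages. How the TMI itself is read off the actual and shuffled distributions is not specified in enough detail here to sketch.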
Gold-standard positives and negatives
To evaluate the combination that had the greatest accuracy, reference datasets that serve as gold standards of positives (i.e. proteins that do interact) and negatives (i.e. proteins that do not interact) were needed. The DIP (Salwinski et al., 2004) E.coli dataset served as our positive control. We had no direct information about the proteins that did not interact. Fortunately, indirect information could be obtained from functional protein data since proteins with different functions tend not to interact (Schwikowski et al., 2000; von Mering et al., 2002). We applied the first level of KEGG orthology (KO) (Kanehisa et al., 2004), which includes five broad functional categories for each organism, and deleted those proteins belonging to more than two categories. Then protein pairs from different functional categories were compiled to form negative controls. These positive and negative datasets were compared and only one pair was found to be the same. In order to ensure the reliability of the negative and positive control datasets, we deleted these two proteins from the KO functional categories.
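A minimal sketch of how such a negative set could be compiled is shown below. The category labels, protein names and function name are hypothetical, and treating a pair as negative only when the two category sets are disjoint is an assumption made for the sketch.

```python
from itertools import combinations


def negative_pairs(categories, max_categories=2):
    """Pairs of proteins whose top-level functional categories do not overlap.

    categories: dict mapping protein -> set of category labels.
    Proteins with more than max_categories labels are discarded first.
    """
    kept = {
        protein: labels
        for protein, labels in categories.items()
        if 0 < len(labels) <= max_categories
    }
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(kept), 2)
        if kept[a].isdisjoint(kept[b])
    }


# Hypothetical annotations; K1..K3 stand in for top-level KO categories.
toy_categories = {
    "protA": {"K1"},
    "protB": {"K1"},
    "protC": {"K2"},
    "protD": {"K1", "K2", "K3"},  # too many categories, discarded
}
negatives = negative_pairs(toy_categories)

# Pairs also present in the positive set point to proteins to be removed.
toy_positives = {frozenset(("protA", "protC"))}
overlap = negatives & toy_positives
```

Storing each pair as a `frozenset` makes the pair order irrelevant, which simplifies the comparison with the positive set.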
The comparative index (R-value) was used to measure the accuracy of each combination, and was calculated as follows:
[Equation defining the R-value; the equation image is not available in this version.]
Prediction of genome-wide functional linkages and unknown proteins of six organisms
In addition to E.coli K12, genome-wide functional linkages of C.crescentus, E.coli O157:H7, P.aeruginosa, S.aureus subsp. aureus N315 and V.cholerae were also calculated using the combination of the 145 set of reference organisms and an E-value of 1 × 10^-4 as the threshold for BLASTP. We predicted the function of uncharacterized proteins for six microorganisms using the guilt by association method (Oliver, 2000; Schwikowski et al., 2000).
Comparison of predicted protein interaction data and published data
In order to measure the performance and reliability of our refined method relative to previous methods, we compared the number of interacting proteins, the number of predicted unknown proteins and the functional similarity index of protein-protein interaction data for six microorganisms. In addition, we conducted an in-depth comparison by calculating a series of sensitivity and specificity values using the predicted protein-protein interaction data of E.coli K12, based on biological pathway information (KEGG) (Kanehisa et al., 2004), protein complexes (EcoCyc) (Keseler et al., 2005) and experimental protein-protein interactions (DIP) (Salwinski et al., 2004). To find out whether the method for selecting reference organism sets is practical, we compared the performance based on genomes selected as described above with that based on randomly selected genome sets.
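The guilt by association idea can be illustrated with a simple majority vote over annotated partners. The voting rule, the function name and the toy data are assumptions of this sketch; the text does not specify how partner annotations were combined.

```python
from collections import Counter


def predict_function(protein, linkages, annotations):
    """Return the most common category among annotated partners, or None."""
    partners = {
        b if a == protein else a
        for a, b in linkages
        if protein in (a, b)
    }
    votes = Counter(annotations[p] for p in partners if p in annotations)
    if not votes:
        return None  # no annotated partner, so no prediction
    return votes.most_common(1)[0][0]


# Hypothetical linkages and annotations.
toy_linkages = [("unkX", "protA"), ("unkX", "protB"), ("protC", "unkX"), ("protA", "protB")]
toy_annotations = {"protA": "cell envelope", "protB": "cell envelope", "protC": "transport"}
predicted = predict_function("unkX", toy_linkages, toy_annotations)
```

A real application would likely need a rule for ties and a minimum number of supporting partners; both are left out here.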
We first used the number of interacting proteins as an indicator for genome coverage and the functional similarity index as an indicator for accuracy. The method of functional similarity has been used previously to evaluate functional linkages and accuracy of the predicted data (Strong et al., 2003). Here we determined the functional categories of six microorganisms, which were downloaded from the Clusters of Orthologous Groups of proteins (COG) database. The functional similarity index of a protein interaction dataset was calculated as the maximum true-positive fraction divided by the maximum false-positive fraction, where the maximum false-positive fraction was calculated as the fraction of pairwise links that do not belong to the same COG functional category, and the maximum true-positive fraction was calculated as the fraction of pairwise links that do belong to the same COG functional category.
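Because both fractions share the same denominator, the index reduces to the ratio of same-category links to different-category links. The sketch below assumes one COG category per protein and skips links with an unannotated protein; both simplifications, like the names used, are assumptions.

```python
def functional_similarity_index(links, cog):
    """Ratio of same-category links to different-category links.

    links: iterable of (protein_a, protein_b) pairs.
    cog: dict mapping protein -> COG functional category.
    """
    annotated = [(a, b) for a, b in links if a in cog and b in cog]
    if not annotated:
        raise ValueError("no link has both proteins annotated")
    same = sum(1 for a, b in annotated if cog[a] == cog[b])
    different = len(annotated) - same
    if different == 0:
        return float("inf")  # every annotated link is within one category
    # (same / n) / (different / n) simplifies to same / different.
    return same / different


# Hypothetical links and single-letter COG categories.
toy_cog = {"p1": "C", "p2": "C", "p3": "C", "p4": "M"}
toy_links = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p4"), ("p4", "p9")]
fsi = functional_similarity_index(toy_links, toy_cog)
```

Here three links join proteins of the same category and one joins different categories, while the link to the unannotated protein is ignored.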
Sensitivity was calculated based on the pathway data from the KEGG database, protein complex data from the EcoCyc database and protein interaction data from the DIP database. Specificity was calculated using the negative control data given above. For the KEGG database, proteins on the same biological KEGG pathway were presumed to be functionally linked (Strong et al., 2003; von Mering et al., 2005), while those on different maps were not. Similarly, for the EcoCyc database, proteins that appear in the same complex are presumed to interact, while those in different complexes are not. For the DIP database, sensitivity was calculated as described above.
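The two measures can be sketched with their standard definitions, which are assumed here because the text does not spell out the formulas: sensitivity is the fraction of gold-standard positive pairs that are predicted, and specificity is the fraction of gold-standard negative pairs that are not predicted. Pairs are stored as `frozenset` objects so that order does not matter.

```python
def sensitivity(predicted, positives):
    """Fraction of gold-standard positive pairs recovered by the prediction."""
    return len(predicted & positives) / len(positives)


def specificity(predicted, negatives):
    """Fraction of gold-standard negative pairs absent from the prediction."""
    return len(negatives - predicted) / len(negatives)


def pair(a, b):
    """Order-independent protein pair."""
    return frozenset((a, b))


# Hypothetical predicted linkages and gold standards.
toy_predicted = {pair("p1", "p2"), pair("p2", "p3"), pair("p1", "p5")}
toy_positives = {pair("p1", "p2"), pair("p2", "p3"), pair("p3", "p4"), pair("p4", "p6")}
toy_negatives = {pair("p1", "p5"), pair("p2", "p5"), pair("p3", "p5"), pair("p4", "p5")}

sens = sensitivity(toy_predicted, toy_positives)
spec = specificity(toy_predicted, toy_negatives)
```

For the KEGG and EcoCyc evaluations, the positive set would be built from proteins sharing a pathway or a complex, respectively.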
| RESULTS |
|---|
Comparison of different combinations of reference organisms and E-value thresholds
Nine sets of reference genomes from 163 organisms were selected according to their phylogenetic relationships (see Supplementary Material 1, Figure 1). For each set of reference genomes, seven different E-value thresholds were applied, forming 63 total combinations. Figure 2 shows the distributions of mutual information values of actual over shuffled profiles for each combination of E01, E04 and E10 (for others see Supplementary Material 1, Figure 2). Variance analysis showed that the distributions of actual and shuffled mutual information scores of 18E01, 35E01, 55E01, 18E02 and 18E03 were not significantly different (P > 0.05), while the remaining combinations were significantly different (P < 0.05). These results demonstrate that the number of reference organisms and the E-value thresholds have an effect on the application of the method.
The TMIs were extracted by comparing the distributions of mutual information values between actual and shuffled profiles. Protein pairs whose mutual information scores were higher than the TMIs were considered functionally linked. As a result, 63 potential protein interaction datasets for E.coli were generated.
The comparative index (R-value) was used to measure the accuracy of each interaction dataset of the 63 combinations: the higher the R-value, the better the predicted result. Figure 3 illustrates the R-value for each combination and reveals a wide variation in accuracy for the 63 combinations. The R-values of the nine combinations with an E-value threshold of 1 × 10^-1 were the lowest among all the sets and were not significantly different (P < 0.95), and the highest R-value did not occur among the nine combinations with an E-value threshold of 1 × 10^-10. However, the R-values for E-value thresholds of 1 × 10^-4 and 1 × 10^-5 are generally higher than those of the other E-value thresholds. In particular, 145E04 and 145E05 had the highest R-values. These results showed that the performance of the phylogenetic profiles method was affected by the E-value thresholds and was not improved simply by decreasing these thresholds.
Because the 145 set had the highest R-values, Wilcoxon tests were performed between the R-values of 145 and the other sets. The results showed that there were significant differences between the R-values of the 145 set and those of the 18, 35, 55 or 65 sets (P > 0.95, H = 1) and there were no significant differences between the R-values of the 145 set and those of the 86, 106, 128 or 162 sets (P < 0.95, H = 0). At the same time, there was a jump between the 18-65 sets and the 86-162 sets. The results implied that within a certain number of reference organisms (<86 here), the performance of the method improves significantly as the number of genomes increases. However, beyond a certain number (>86), the improvement becomes minuscule and reaches saturation at the 145 set. Here, the 86 set seemed to represent the point of inflexion.
Based on the results above, we next applied the 145E04 combination to predict the protein-protein interactions for six microorganisms.
Datasets comparison
We first compared the coverage and accuracy of the DM data, the 55 set (Sun_55) and the 145 set (Sun_145) (Table 1), based on predicted protein-protein interactions for six microorganisms. These sets were selected because the number of reference organisms in the 55 set is nearly equal to that of the DM data (57) and the 145 set is our preferred subset. The comparison of Sun_55 and DM_57 shows that the coverage (the number of interacting proteins) of Sun_55 is lower than that of the DM data, while its accuracy (functional similarity index, FSI) is higher. The comparison of Sun_145 and DM_57 shows that both the coverage and the accuracy of Sun_145 are higher than those of the DM data.
Figure 4a, b and c show the comparison of the range of sensitivity and specificity values for the 18, 35, 55, 65, 86, 106, 128 and 145 datasets, the corresponding data of randomly selected genomes, and DM_57, based on the KEGG database, the EcoCyc database and the DIP database, respectively. Although the three databases differ in the biological information provided, the results are unexpectedly similar. First, the sensitivity and specificity of the protein interaction data derived from selected organisms are higher than those from randomly selected organisms, which suggests that the protocol for selecting reference organisms is practical. Second, the sensitivity and specificity of the protein interaction data of the 55 set are higher than those of the DM_57 dataset, which indicates that the selection of reference organisms and the criteria for homology identification significantly improved the prediction accuracy of this method. Third, the sensitivity and specificity of the protein interaction data of the 145 set are higher than those of the DM_57 dataset and the other sets. Therefore, the refined phylogenetic profiles method provides better performance than the original DM method.
Genome-wide protein interaction data and protein function prediction for six organisms
The dataset for E.coli using the 145E04 combination predicted 45 451 functional linkages involving 2481 proteins. We also predicted protein interaction data for five other microorganisms using the refined phylogenetic profiles method (see Supplementary Material 2 online). Additionally, many functional pathways, for example, the citrate cycle (TCA cycle), fatty acid biosynthesis (path 1), ABC transporters, two-component system biosynthesis, flagellar assembly and chemotaxis in E.coli were demonstrated (see Supplementary Material 1, Table 1 online).
We predicted protein function for six microorganisms (see Supplementary Material 3 online) using the guilt by association method (Oliver, 2000; Schwikowski et al., 2000). For example, in E.coli, YabB (PID 1788865) was predicted to belong to the category of cell envelope biogenesis (outer membrane COG category) and was validated by EcoCyc annotation (EG11084). YabB belongs to an operon involved in formation of the cell envelope and in the cell division. YjgP (PID 1790712), a putative transmembrane protein possibly involved in transport in the EcoCyc database (EG12535), was also predicted to belong to the same functional category of cell wall/membrane biogenesis in COG as YabB. YadR (PID 1786351), whose paralogous IscA and SufA genes are involved in central intermediary metabolism, as described in EcoCyc (EG12332), is predicted to belong to the category of coenzyme metabolism in COG. YafB (PID 1786400) was predicted to belong to the category of energy production and conversion, consistent with the description in the EcoCyc database (EG11648) that describes the 2,5-diketo-D-gluconate reductase B protein as catalyzing the following reaction: 2,5-diketo-D-gluconate + NADPH = 2-keto-L-gluconate + NADP. The results above show that protein function prediction by the refined method is reliable and could aid in future experimental design.
| DISCUSSION |
|---|
The phylogenetic profiles method has been widely used to predict functional protein linkages (Enright et al., 1999; Pellegrini et al., 1999; Strong et al., 2003; Wu et al., 2003) and protein subcellular localization (Marcotte et al., 2000), to annotate genomes (Enault et al., 2003; Zheng et al., 2002) and to discover novel pathways (Date and Marcotte, 2003). In this study, we showed that careful determination of reference organisms and E-value thresholds significantly improved the performance of the phylogenetic profiles method and proposed a practical protocol for the selection of reference organisms when applying the phylogenetic profiles method.
When the method was first proposed and exploited by Pellegrini et al. (1999), only 16 fully sequenced organisms were used to construct phylogenetic profiles. Later, studies by other groups used all the available genomes without considering the impact of organism selection on the method's predictive power (Date and Marcotte, 2003; Enault et al., 2003; Marcotte et al., 1999b; Marcotte et al., 2000; Pellegrini et al., 1999; Strong et al., 2003; Wu et al., 2003), because of the limited number of complete genome sequences at that time. As more and more complete genomic sequences become available, it becomes possible to pursue the question of whether the addition of new genomes would improve the accuracy and coverage of the method (Zheng et al., 2002). Zheng et al. demonstrated that more genomes (68 versus 30) would generate more putative functional associations and also proposed that there was a possible upper limit of accuracy for the phylogenetic profiles method. The authors did not, however, provide a strategy for selecting organisms or reveal the number and combination of organisms that would generate the highest accuracy. It is therefore necessary to develop a proper strategy for sampling organisms from different taxa as more and more completely sequenced genomes become available.
Here, we exploited the phylogenetic relationships of 162 currently available genomes to select reference organisms. Our results indicated that increasing the size of the reference genome pool within a certain range does improve the accuracy of the phylogenetic profiles method, while beyond this range, the improvement becomes rather gradual. This is probably because small sets of reference genomes (such as 18 or 35) do not include enough co-evolutionary information, which results in lower accuracy and lower coverage. As the number of reference organisms increases, more co-evolutionary information is used and the performance improves until the co-evolutionary information provided by a certain number of reference organisms (86 here) covers most of the co-evolutionary information available from all reference organisms (162 here). Therefore, adding more genomes beyond the optimal number of reference organisms will not improve the performance as much as expected, and could even decrease it a little (as with the 162 set in our results), as too many genomes might mix more noise into the phylogenetic information.
Therefore, when applying the phylogenetic profiles method to predict protein-protein interactions, it is essential to consider the selection of reference organisms, and choosing a good strategy to select them is important. Here we exploited the organisms' phylogenetic relationships to select reference organisms, because all members of a clade should evolve from a common ancestor and the one far apart from the rest is close to their ancestor. Therefore, for a given clade, we selected the organism that is evolutionarily the farthest apart from the rest of the organisms in that clade, essentially selecting an outlier of that clade. Our results showed that exploiting the phylogenetic relationships of organisms is an effective strategy to select reference genomes. The predictive power of the method could be expected to improve further if the organisms' evolutionary distance were also taken into account.
Here, we only investigated the 162 organisms available when we began this study. However, as more and more completely sequenced organisms become available, it might be neither necessary nor suitable to use the whole set of 162 organisms. Based on our results, we give a practical protocol for selecting the set of reference genomes. First, go to http://www.ncbi.nlm.nih.gov/genomes/Complete.html, click 'Eubacteria' in the sentence 'See Archaea and Eubacteria genome projects sorted by taxonomic groups' and download the phylip tree file. Second, use the software TreeView to open the phylip tree file. Third, at the appropriate level of the phylogenetic tree, select from each clade the organism that is evolutionarily the farthest apart from the rest of the organisms in the same clade (i.e. an outlier). Finally, add all the Archaea and Eukaryote organisms with complete genomes, which comprise far fewer species than Eubacteria, and the reference organism set is ready for further calculation.
When applying the phylogenetic profiles method, the program BLASTP is used to compare the protein sequences and to calculate E-values between the proteins in the target and reference organisms. No systematic efforts have been made to optimize the E-value threshold that determines the presence or absence of homologous proteins. Most authors used different values without giving any explanation, and used a binary value (present, 1 and absent, 0) to record the presence or absence of homologous proteins. Although Date and Marcotte (2003) claimed that the phylogenetic profiles method they used requires no minimum threshold of similarity to be specified, they applied the most permissive E-value threshold (1 × 10^-1) and then used E-values lower than the threshold to capture different degrees of sequence divergence. Such a permissive threshold might result in information without any biological significance being included in the phylogenetic profiles.
Hence, it is necessary to investigate whether varying E-value thresholds would affect both the accuracy and coverage of the method and to determine the proper E-value threshold. Our systematic investigation shows that the E-value threshold has a significant effect on the application of the method, and that an E-value cutoff of 1 × 10^-4 or 1 × 10^-5 achieves optimal accuracy and coverage.
In order to evaluate this method, we used E.coli protein interaction data from DIP as a positive control because this database records experimentally determined protein-protein interactions (Salwinski et al., 2004). In addition, we used the first level of KEGG orthology to compile the negative controls because the KEGG functional categories at this level are clear-cut and well defined (Kanehisa et al., 2004). It is reasonable to assume, therefore, that interactions between proteins from different categories are not likely to occur. Although the reference datasets are not necessarily complete and may be biased to a certain degree, they can be used for evaluation purposes. Although some researchers have used a standardized keyword annotation of the Swiss-Prot database to evaluate the quality of predicted functional linkages (Marcotte et al., 1999b; Strong et al., 2003), we found it difficult to use and not appropriate for this study. The functional similarity method used here applied the same principle as the keyword recovery scheme, as both approaches were based on functional annotations.
In addition to protein-protein interaction data from the DIP database, we used E.coli protein pathway information from the KEGG database and E.coli protein complexes from the EcoCyc database when comparing the performance of our refined method with that of the DM method. The KEGG database includes known biological pathway information, and the EcoCyc protein complexes database includes experimentally determined protein complexes. Although different databases have their own biases in biological information, the results based on the three databases are similar in this study, indicating that the functional linkages predicted by the phylogenetic profiles method included not only physical interactions such as protein complexes, but also genetic interactions such as proteins in related signal transduction pathways.
The results presented above demonstrated that the performance of the phylogenetic profiles method has been improved by our modifications. Functional linkages could reveal functional roles for hundreds of previously uncharacterized proteins. As an important complementary tool for homology analysis, this method, in combination with other non-homology methods (Enright et al., 1999; Overbeek et al., 1999), is expected to be valuable, not only to identify interacting protein pairs, but also to infer protein function.
| Acknowledgments |
|---|
The authors would like to thank Jiancheng Lin for helpful discussions, Youyu He, Shaoyou Yang and Wei Huang for the help with programming. The authors also thank Dr Qi Sun from the Cornell Theory Center of Cornell University and three anonymous reviewers for their invaluable comments and suggestions. The 863 Hi-Tech Program grants 2001AA231011, 2002AA231051 and 2003AA231011, the State Key Program of Basic Research of China grants 2001CB510209, 2002CB713807 and 2003CB715901, and National Natural Science Foundation of China grant 90408010 supported this project.
Conflict of Interest: none declared.
| Footnotes |
|---|
The authors wish it to be known that, in their opinion, the first author and the fourth author should be regarded as joint First Authors.
Received on October 14, 2004; revised on May 13, 2005; accepted on June 7, 2005
| REFERENCES |
|---|
Altschul, S., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Auerbach, D., et al. (2002) The post-genomic era of interactive proteomics: facts and perspectives. Proteomics, 2, 611-623.
Chen, Y. and Xu, D. (2003) Computational analyses of high-throughput protein-protein interaction data. Curr. Protein Pept. Sci., 4, 159-181.
Date, S.V. and Marcotte, E.M. (2003) Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotechnol., 21, 1055-1062.
Eisenberg, D., et al. (2000) Protein function in the post-genomic era. Nature, 405, 823-826.
Enault, F., et al. (2003) Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics, 19, Suppl. 1, i105-i107.
Enright, A., et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86-90.
Fields, S. and Song, O. (1989) A novel genetic system to detect protein-protein interactions. Nature, 340, 245-246.
Gaasterland, T. and Ragan, M.A. (1998) Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics, 3, 199-217.
Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277-D280.
Keseler, I.M., et al. (2005) EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res., 33, D334-D337.
Marcotte, E.M., et al. (1999a) Detecting protein function and protein-protein interactions from genome sequences. Science, 285, 751-753.
Marcotte, E.M., et al. (1999b) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83-86.
Marcotte, E.M., et al. (2000) Localizing proteins in the cell from their phylogenetic profiles. Proc. Natl Acad. Sci. USA, 97, 12115-12120.
Oliver, S. (2000) Guilt-by-association goes global. Nature, 403, 601-603.
Overbeek, R., et al. (1999) The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA, 96, 2896-2901.
Pellegrini, M., et al. (1999) Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285-4288.
Salwinski, L., et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32, D449-D451.
Schwikowski, B., et al. (2000) A network of protein-protein interactions in yeast. Nat. Biotechnol., 18, 1257-1261.
Strong, M., et al. (2003) Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol., 4, R59.
von Mering, C., et al. (2002) Comparative assessment of large-scale datasets of protein-protein interactions. Nature, 417, 399-403.
von Mering, C., et al. (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res., 33, D433-D437.
Walhout, A.J. and Vidal, M. (2001) Protein interaction maps for model organisms. Nat. Rev. Mol. Cell Biol., 2, 55-62.
Wu, J., et al. (2003) Identification of functional links between genes using phylogenetic profiles. Bioinformatics, 19, 1524-1530.
Zheng, Y., et al. (2002) Genomic functional annotation using co-evolution profiles of gene clusters. Genome Biol., 3, RESEARCH0060.








