Bioinformatics Advance Access originally published online on June 9, 2005
Bioinformatics 2005 21(16):3409-3415; doi:10.1093/bioinformatics/bti532

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Refined phylogenetic profiles method for predicting protein–protein interactions

Jingchun Sun 1,†, Jinlin Xu 1, Zhen Liu 2, Qi Liu 3,†, Aimin Zhao 4,*, Tieliu Shi 3,* and Yixue Li 3,*

1School of Life Sciences & Technology, Shanghai Jiaotong University Shanghai 200240, China
2Department of Biology, Hunan Normal University Changsha 410081, China
3Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences Shanghai 200031, China
4The Chinese National Center for Biotechnology Development Beijing 100081, China

*To whom correspondence should be addressed.


    Abstract

Motivation: The increasing availability of complete genome sequences provides an excellent opportunity for the further development of tools for functional studies in proteomics. Several experimental approaches and in silico algorithms have been developed to cluster proteins into networks of biological significance that may provide new biological insights, especially into the functions of many uncharacterized proteins. Among these methods, the phylogenetic profiles method has been widely used to predict protein–protein interactions. It involves the selection of reference organisms and the identification of homologous proteins. Up to now, no published report has systematically studied the effects of reference genome selection and homologous protein identification on the accuracy of this method.

Results: In this study, we optimized the phylogenetic profiles method by integrating phylogenetic relationships among reference organisms and sequence homology information to improve prediction accuracy. Our results revealed that the selection of the reference organism set and the criteria for homology identification are two critical factors for the prediction accuracy of this method. Our refined phylogenetic profiles method shows better performance and potentially provides more reliable functional linkages than previous methods.

Availability: The software (C, Perl) is available from the corresponding author.

Contact: yxli{at}sibs.ac.cn; tlshi{at}sibs.ac.cn; zhaoaimin{at}cncbd.org.cn

Supplementary information: Three supplementary materials are available online, including related materials and results.


    INTRODUCTION
Many cellular processes, such as metabolic and signal transduction pathways, involve protein–protein interactions. Therefore, it is important to identify these interactions to fully understand the molecular mechanisms of the living cell (Auerbach et al., 2002; Eisenberg et al., 2000). The increasing availability of complete genomic sequences makes it possible to apply in silico or experimentally based reverse proteomics approaches to the detection of protein–protein interactions on a proteome scale (Walhout and Vidal, 2001). The in silico or experimentally based reverse proteomics approaches include the yeast two-hybrid assay (Fields and Song, 1989), the gene neighbor method (Overbeek et al., 1999), the gene fusion method (Enright et al., 1999; Marcotte et al., 1999a) and the phylogenetic profiles method (Date and Marcotte, 2003; Gaasterland and Ragan, 1998; Marcotte et al., 1999b; Pellegrini et al., 1999). Such resultant protein–protein interactions may provide a new basis for biological discoveries, especially for understanding the functions of many uncharacterized proteins (Chen and Xu, 2003).

The phylogenetic profiles method (Gaasterland and Ragan, 1998; Pellegrini et al., 1999)—an in silico method—is based on the assumption that there is strong selective pressure on proteins that functionally interact with each other so that they are inherited together during speciation events. Thus, proteins in a target organism with the same or similar phylogenetic profiles [constructed by detecting homologous proteins as being present or absent in reference organisms with a predetermined threshold BLASTP (Altschul et al., 1997) E-value], can be hypothesized to interact with each other physically or functionally. Therefore, selection of reference organisms and determination of the threshold BLASTP E-value are the two critical steps of this method. As more and more completely sequenced genomes become available, it is natural to ask whether the addition of new genomes would improve the accuracy of the phylogenetic profiles method (Zheng et al., 2002) and whether changing the threshold E-value would affect the method's accuracy. However, to our knowledge, no published report has systematically studied the effects of the reference genome selection and the E-value threshold on the accuracy of this method.

We therefore investigated the phylogenetic profiles method by simultaneously varying the selection of reference organisms and the E-value threshold. The results indicated that both the reference organism selection and the E-value threshold greatly affect the performance of the method. Using our refined method, we predicted protein interaction datasets and the functions of unknown proteins for six microorganisms. Moreover, a comparison of the protein–protein interactions of Escherichia coli K12 predicted by Date and Marcotte (2003) (the DM method) with our own predictions demonstrated that our refined phylogenetic profiles method performs better and predicts more reliable protein interactions than the DM method. Therefore, it is essential to consider the selection of reference organisms and the E-value threshold when applying the phylogenetic profiles method to the prediction of protein–protein interactions.


    MATERIALS AND METHODS
Different combinations of reference organisms and E-value thresholds
The protein sequences of 163 organisms were downloaded from the National Center for Biotechnology Information (NCBI) ftp site (ftp.ncbi.nih.gov/genomes). Caulobacter crescentus (Ccr), E.coli K12 (Eco), E.coli O157:H7 (Ecs), Pseudomonas aeruginosa (Pae), Staphylococcus aureus subsp. aureus N315 (Sau) and Vibrio cholerae (Vch) were regarded as target organisms and the remainder as reference organisms. The classification names of the 163 organisms were downloaded from the NCBI Taxonomy site and used to reconstruct an evolutionary tree (see Supplementary Material 1—Figure 1 online).

Each clade, the basic element of the evolutionary tree, is usually monophyletic, that is, all members of a clade share a common ancestor, so each organism belongs to different clades at different levels of the hierarchy within the evolutionary tree. We determined the subsets of selected genomes that corresponded to the clade hierarchy and selected an organism far away from the other members of the same clade. When a clade with several organisms had no subclades, we randomly selected an organism from it. When a clade had subclades, we first selected the subclade with the fewest sub-subclades and then selected an organism as described above (detailed in Fig. 1). The rationale for selecting the different sets of organisms is therefore, for a given clade, to take the organism that is evolutionarily the furthest from the rest of the organisms in that clade. Consequently, the selected organism can be regarded as a form of outlier compared with the others in that clade. Based on this rationale, we selected nine sets of reference organisms, named 18, 35, 55, 65, 86, 106, 128, 145 and 162, respectively.



Fig. 1 Schematic description of the evolutionary tree and genome subset selection. The down arrow symbol indicates the evolutionary hierarchy. C1–C8 represent the clades, 1G–5G represent the subsets and 1–8 represent the organisms. Clade 1 (corresponding to organism 1) and clade 2 are at the first level; clade 3 and clade 4 (containing organisms 6, 7 and 8) are at the same level under clade 2; clade 5 and clade 6 (corresponding to organism 5) are at the same level under clade 3. We selected organism 1 and randomly selected organism 6 from organisms 6, 7 and 8 to form subset 1 (abbreviated as 1G), selected organisms 1, 5 and 7 to form subset 2 (2G), and so on.
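The selection rule illustrated in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' C/Perl implementation: clades are modelled as nested lists, every clade is assumed to hold either organisms or subclades (never both), the internal structure of clade 5 is invented for the example, and the evolutionary-distance "outlier" criterion is not modelled (a random member is taken instead, as the text describes for clades without subclades).

```python
import random

# A clade is a list whose items are either organisms (ints) or subclades
# (nested lists). This representation is an illustrative assumption.

def subclades(clade):
    """Return the children of a clade that are themselves clades."""
    return [child for child in clade if isinstance(child, list)]

def pick_organism(clade, rng):
    """Pick one representative organism from a clade."""
    subs = subclades(clade)
    if not subs:
        # No subclades: choose one member at random.
        return rng.choice(clade)
    # Otherwise descend into the subclade with the fewest sub-subclades.
    simplest = min(subs, key=lambda sub: len(subclades(sub)))
    return pick_organism(simplest, rng)

def clades_at_level(clade, level):
    """Collect the clades `level` steps below `clade`.

    Clades that have no subclades are kept as they are."""
    subs = subclades(clade)
    if level == 0 or not subs:
        return [clade]
    found = []
    for sub in subs:
        found.extend(clades_at_level(sub, level - 1))
    return found

def select_subset(root, level, rng):
    """Select one organism from every clade found at the given level."""
    return [pick_organism(c, rng) for c in clades_at_level(root, level)]

# Tree loosely following Fig. 1.
C5 = [[2], [3, 4]]   # hypothetical internal structure of clade 5
C6 = [5]
C3 = [C5, C6]
C4 = [6, 7, 8]
C2 = [C3, C4]
C1 = [1]
ROOT = [C1, C2]

rng = random.Random(0)                    # seeded only for repeatability
subset_1g = select_subset(ROOT, 1, rng)   # organism 1 plus one of 6, 7, 8
subset_2g = select_subset(ROOT, 2, rng)   # organisms 1 and 5 plus one of 6, 7, 8
print(subset_1g, subset_2g)
```

Going one level deeper in the hierarchy yields a larger subset, which is how successively larger reference sets could be obtained from the same tree.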

 
At the same time, seven E-value thresholds were applied to determine whether a homologous protein was present or absent, using BLASTP: 1 × 10^-1 (abbreviated as E01), 1 × 10^-2 (E02), 1 × 10^-3 (E03), 1 × 10^-4 (E04), 1 × 10^-5 (E05), 1 × 10^-7 (E07) and 1 × 10^-10 (E10). Thus, 63 combinations of different reference organism sets and E-value thresholds were formed (e.g. 18E01, 35E01, 145E01, 145E04, 162E01 and 162E10).
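As a small illustration of how the 63 combinations arise, the nine reference sets can be crossed with the seven thresholds. The variable names below are hypothetical; only the set sizes, threshold values and label format come from the text.

```python
from itertools import product

# Sizes of the nine reference organism sets named in the text.
REFERENCE_SETS = [18, 35, 55, 65, 86, 106, 128, 145, 162]

# Label suffix -> BLASTP E-value threshold.
E_VALUE_THRESHOLDS = {
    "E01": 1e-1, "E02": 1e-2, "E03": 1e-3, "E04": 1e-4,
    "E05": 1e-5, "E07": 1e-7, "E10": 1e-10,
}

# Combination label (e.g. "145E04") -> (set size, threshold).
combinations = {
    f"{size}{suffix}": (size, threshold)
    for size, (suffix, threshold) in product(REFERENCE_SETS, E_VALUE_THRESHOLDS.items())
}

print(len(combinations))       # 63
print(combinations["145E04"])  # (145, 0.0001)
```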

Phylogenetic profiles method and the threshold of mutual information
The protein sequences of a target organism (e.g. E.coli K12) were compared with those from reference organisms using BLASTP. For each protein i of the target organism, the BLAST E-value of the top-scoring sequence alignment between protein i and all the proteins of each reference organism j was assigned to E_ij. Phylogenetic profiles were constructed as follows: for each protein i, a vector was generated with elements P_ij, where P_ij = -1/log E_ij when E_ij is lower than the predetermined E-value threshold, and P_ij = 1 when E_ij is greater than or equal to the predetermined E-value threshold. For the DM method, P_ij = 1 when E_ij is greater than or equal to 1 × 10^-1. Construction of the shuffled phylogenetic profiles and the comparison of the actual and shuffled phylogenetic profiles were performed as in the DM method (Date and Marcotte, 2003). The threshold of mutual information (TMI) of each combination was derived from the differences between the distributions of the actual and shuffled phylogenetic profiles (Fig. 2 and Supplementary Material 1, Figure 2). Linkages between two proteins whose mutual information value was higher than the TMI were regarded as putative functional linkages. Linkages between homologous proteins whose BLAST E-value was lower than 1 × 10^-4 were removed.
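The profile construction and the mutual information score can be sketched as follows. Several details are assumptions made only for illustration: the logarithm is taken as base 10, an E-value of exactly 0 is mapped to a profile entry of 0, and profile entries are discretized into bins of width 0.1 before mutual information is computed as H(A) + H(B) - H(A, B). The function names are hypothetical and the sketch does not reproduce the authors' software.

```python
import math
from collections import Counter

def profile_value(e_value, threshold=1e-4):
    """Profile entry for one target protein and one reference organism."""
    if e_value >= threshold:
        return 1.0   # no homolog detected at this threshold
    if e_value <= 0.0:
        return 0.0   # assumed limit of -1/log10(E) as E approaches 0
    return -1.0 / math.log10(e_value)

def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(profile_a, profile_b, bins=10):
    """Mutual information between two profiles after binning (assumed width 0.1)."""
    a = [min(int(v * bins), bins - 1) for v in profile_a]
    b = [min(int(v * bins), bins - 1) for v in profile_b]
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Toy best-hit E-values of two proteins against four reference organisms.
e_values_a = [1e-10, 0.5, 1e-10, 0.5]
e_values_b = [1e-10, 0.5, 1e-10, 0.5]
profile_a = [profile_value(e) for e in e_values_a]
profile_b = [profile_value(e) for e in e_values_b]
print(profile_a)
print(mutual_information(profile_a, profile_b))
```

Two proteins with identical, non-constant profiles obtain the maximum score available to them, whereas a protein whose profile is constant across all reference organisms carries no information and scores 0 with any partner.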



Fig. 2 Distribution of mutual information scores of all actual and shuffled protein pairs for E.coli E04 (E-value threshold of 1 × 10^-4). The label ‘18’ denotes the distribution of actual mutual information scores obtained with 18 reference organisms, and ‘s18’ denotes the distribution of shuffled mutual information scores obtained with the same reference organisms. The same convention applies to 35, s35, 65, s65, 86, s86, 106, s106, 128, s128, 145, s145, 162 and s162. Solid lines represent the distributions of actual mutual information scores and dashed lines represent the distributions of shuffled mutual information scores. For comparison, the distributions of mutual information scores of all actual and shuffled protein pairs for E.coli E01 and E10 (E-value thresholds of 1 × 10^-1 and 1 × 10^-10) are presented in the inset; they show that the difference between the actual and shuffled distributions increases from E01 through E10 and from the 18 set through the 162 set.

 
Gold-standard positives and negatives
To evaluate the combination that had the greatest accuracy, reference datasets that serve as gold standards of positives (i.e. proteins that do interact) and negatives (i.e. proteins that do not interact) were needed. The DIP (Salwinski et al., 2004) E.coli dataset served as our positive control. We had no direct information about the proteins that did not interact. Fortunately, indirect information could be obtained from functional protein data since proteins with different functions tend not to interact (Schwikowski et al., 2000; von Mering et al., 2002). We applied the first level of KEGG orthology (KO) (Kanehisa et al., 2004), which includes five broad functional categories for each organism, and deleted those proteins belonging to more than two categories. Then protein pairs from different functional categories were compiled to form negative controls. These positive and negative datasets were compared and only one pair was found to be the same. In order to ensure the reliability of the negative and positive control datasets, we deleted these two proteins from the KO functional categories.

The comparative index (R-value) was used to measure the accuracy of each combination, and was calculated as follows:

R = [(TP/P)^2 + (TN/N)^2]^{1/2}

where TP (true positive) is the number of pairs in the positive control whose mutual information (MI) is higher than the TMI, P is the number of positive controls, TN (true negative) is the number of pairs in the negative control whose MI is lower than the TMI and N is the number of negative controls.
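A minimal sketch of the R-value computation, using the equation given in the Fig. 3 caption, R = [(TP/P)^2 + (TN/N)^2]^{1/2}. The function name and the toy scores are hypothetical.

```python
import math

def r_value(mi_positive, mi_negative, tmi):
    """Comparative index R = sqrt((TP/P)^2 + (TN/N)^2).

    mi_positive: MI scores of the gold-standard positive pairs
    mi_negative: MI scores of the gold-standard negative pairs
    tmi:         threshold of mutual information
    """
    tp = sum(1 for mi in mi_positive if mi > tmi)   # positives above the TMI
    tn = sum(1 for mi in mi_negative if mi < tmi)   # negatives below the TMI
    return math.hypot(tp / len(mi_positive), tn / len(mi_negative))

# Toy MI scores: 3 of 4 positives exceed the TMI and 3 of 4 negatives fall below it.
example_r = r_value([0.9, 0.8, 0.1, 0.7], [0.1, 0.2, 0.6, 0.05], tmi=0.5)
print(example_r)
```

Because TP/P and TN/N each lie between 0 and 1, R ranges from 0 to the square root of 2, the latter being reached only when every positive pair is recovered and every negative pair is rejected.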

Prediction of genome-wide functional linkages and unknown proteins of six organisms
In addition to E.coli K12, genome-wide functional linkages of C.crescentus, E.coli O157:H7, P.aeruginosa, S.aureus subsp. aureus N315 and V.cholerae were also calculated using the combination of the 145 set of reference organisms and an E-value of 1 × 10^-4 as the threshold for BLASTP. We predicted the function of uncharacterized proteins for the six microorganisms using the ‘guilt by association’ method (Oliver, 2000; Schwikowski et al., 2000).
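The text does not spell out how ‘guilt by association’ was applied, so the following is only one plausible reading: an uncharacterized protein receives the functional category most common among its annotated linkage partners. The majority-vote rule, the function name and the protein identifiers are all assumptions made for illustration.

```python
from collections import Counter

def predict_function(protein, linkages, known_function):
    """Assign the most frequent category among annotated linkage partners.

    linkages:       iterable of (protein_a, protein_b) pairs
    known_function: dict mapping annotated proteins to a functional category
    Returns None when no linked partner is annotated.
    """
    partners = set()
    for a, b in linkages:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    # Sorting keeps the vote order deterministic when counts are tied.
    votes = Counter(known_function[p] for p in sorted(partners) if p in known_function)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical linkages and annotations.
linkages = [("protX", "protA"), ("protB", "protX"), ("protX", "protC"), ("protA", "protB")]
known = {"protA": "cell envelope", "protB": "cell envelope", "protC": "transport"}
print(predict_function("protX", linkages, known))   # cell envelope
```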

Comparison of predicted protein interaction data and published data
To measure the performance and reliability of our refined method relative to previous methods, we compared the number of interacting proteins, the number of predicted unknown proteins and the functional similarity index of the protein–protein interaction data for six microorganisms. In addition, we conducted an in-depth comparison by calculating a series of sensitivity and specificity values from the predicted protein–protein interaction data of E.coli K12, based on biological pathway information (KEGG) (Kanehisa et al., 2004), protein complexes (EcoCyc) (Keseler et al., 2005) and experimental protein–protein interactions (DIP) (Salwinski et al., 2004). To determine whether our method of selecting reference organism sets is practical, we compared the performance obtained with genomes selected as described above with that obtained with randomly selected genome sets.

We first used the number of interacting proteins as an indicator for genome coverage and the functional similarity index as an indicator for accuracy. The method of functional similarity has been used previously to evaluate functional linkages and accuracy of the predicted data (Strong et al., 2003). Here we determined the functional categories of six microorganisms, which were downloaded from the Clusters of Orthologous Groups of proteins (COG) database. The functional similarity index of a protein interaction dataset was calculated as the maximum true-positive fraction divided by the maximum false-positive fraction, where the maximum false-positive fraction was calculated as the fraction of pairwise links that do not belong to the same COG functional category, and the maximum true-positive fraction was calculated as the fraction of pairwise links that do belong to the same COG functional category.
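The functional similarity index described above reduces to the ratio of same-category links to different-category links. The sketch below assumes one COG category per protein and skips links that involve an unannotated protein; both simplifications, and all names, are assumptions rather than details taken from the paper.

```python
def functional_similarity_index(linkages, cog_category):
    """Fraction of links within one COG category divided by the fraction across categories."""
    same = different = 0
    for a, b in linkages:
        if a not in cog_category or b not in cog_category:
            continue   # assumption: unannotated proteins are ignored
        if cog_category[a] == cog_category[b]:
            same += 1
        else:
            different += 1
    total = same + different
    if total == 0:
        raise ValueError("no annotated links to evaluate")
    if different == 0:
        return float("inf")   # every annotated link is within one category
    return (same / total) / (different / total)

# Hypothetical annotations: three links within a category, one across categories.
cog = {"p1": "C", "p2": "C", "p3": "C", "p4": "M"}
links = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p4")]
print(functional_similarity_index(links, cog))   # 3.0
```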

Sensitivity was calculated based on the pathway data from the KEGG database, protein complex data from the EcoCyc database and protein interaction data from the DIP database. Specificity was calculated using the negative control data given above. For the KEGG database, proteins on the same biological KEGG pathway were presumed to be functionally linked (Strong et al., 2003; von Mering et al., 2005), while those on different maps were not. Similarly, for the EcoCyc database, proteins that appear in the same complex are presumed to interact, while those in different complexes are not. For the DIP database, sensitivity was calculated as described above.


    RESULTS
Comparison of different combinations of reference organisms and E-value thresholds
Nine sets of reference genomes from 163 organisms were selected according to their phylogenetic relationships (see Supplementary Material 1—Figure 1). For each set of reference genomes, seven different E-value thresholds were applied, forming 63 total combinations. Figure 2 shows the distributions of mutual information values of actual over shuffled profiles for each combination of E01, E04 and E10 (for others see Supplementary Material 1—Figure 2). Variance analysis showed that the distributions of actual and shuffled mutual information scores of 18E01, 35E01, 55E01, 18E02 and 18E03 were not significantly different (P > 0.05), while the remaining combinations were significantly different (P < 0.05). These results demonstrate that the number of reference organisms and the E-value thresholds have an effect on the application of the method.

The TMIs were extracted by comparing the distributions of mutual information values between actual and shuffled profiles. Protein pairs whose mutual information scores were higher than the TMIs were considered functionally linked. As a result, 63 potential protein interaction datasets for E.coli were generated.
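Given a TMI, the filtering step can be sketched as below, including the removal of linkages between homologous proteins (BLAST E-value lower than 1 × 10^-4) described in Materials and Methods. How the TMI itself is read off the actual and shuffled distributions is not modelled here; the data layout and names are hypothetical.

```python
def putative_linkages(mi_scores, tmi, homolog_e_values, homology_cutoff=1e-4):
    """Keep pairs scoring above the TMI, excluding pairs of homologous proteins.

    mi_scores:        dict mapping (protein_a, protein_b) to an MI score
    homolog_e_values: dict mapping the same pair keys to the BLAST E-value
                      between the two proteins, when a hit exists
    """
    kept = []
    for pair, mi in mi_scores.items():
        if mi <= tmi:
            continue   # not above the threshold of mutual information
        e_value = homolog_e_values.get(pair)
        if e_value is not None and e_value < homology_cutoff:
            continue   # the two proteins are homologous to each other
        kept.append(pair)
    return sorted(kept)

# Hypothetical scores for three protein pairs.
scores = {("p1", "p2"): 0.9, ("p1", "p3"): 0.2, ("p2", "p4"): 0.8}
homologs = {("p2", "p4"): 1e-30}
print(putative_linkages(scores, tmi=0.5, homolog_e_values=homologs))   # [('p1', 'p2')]
```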

The comparative index (R-value) was used to measure the accuracy of each interaction dataset of the 63 combinations: the higher the R-value, the better the predicted result. Figure 3 illustrates the R-value for each combination and reveals a wide variation in accuracy among the 63 combinations. The R-values of the nine combinations with an E-value threshold of 1 × 10^-1 were the lowest among all the sets and were not significantly different from one another (P < 0.95), and the highest R-value was not found among the nine combinations with an E-value threshold of 1 × 10^-10. However, the R-values for E-value thresholds of 1 × 10^-4 and 1 × 10^-5 were generally higher than those for the other E-value thresholds. In particular, the 145E04 and 145E05 combinations had the highest R-values. These results showed that the performance of the phylogenetic profiles method was affected by the E-value threshold and was not improved simply by lowering the threshold further.



Fig. 3 Prediction powers of different combination sets for E.coli. The comparative index (R-value) was calculated using the equation R = [(TP/P)^2 + (TN/N)^2]^{1/2}, as described in the text.

 
Because the 145 set had the highest R-values, Wilcoxon tests were performed between the R-values of 145 and the other sets. The results showed that there were significant differences between the R-values of the 145 set and those of the 18, 35, 55 or 65 sets (P > 0.95, H = 1) and there were no significant differences between the R-values of the 145 set and those of the 86, 106, 128 or 162 sets (P < 0.95, H = 0). At the same time, there was a jump between the 18–65 sets and the 86–162 sets. The results implied that within a certain number of reference organisms (<86 here), the performance of the method has been improved significantly with the increasing number of genomes. However, beyond a certain number (>86), the improvement becomes minuscule and reaches saturation at the 145 set. Here, the 86 set seemed to represent the point of inflexion.

Based on the results above, we next applied the 145E04 combination to predict protein–protein interactions for six microorganisms.

Datasets comparison
We first compared the coverage and accuracy of the DM data (DM_57), the 55 set (Sun_55) and the 145 set (Sun_145) (Table 1), based on predicted protein–protein interactions for six microorganisms. These sets were selected because the number of reference organisms in the 55 set is nearly equal to that of the DM data (57) and the 145 set is our preferred subset. The comparison of Sun_55 and DM_57 shows that the coverage (the number of interacting proteins) of Sun_55 is lower than that of the DM data, while its accuracy (functional similarity index, FSI) is higher. The comparison of Sun_145 and DM_57 shows that both the coverage and the accuracy of Sun_145 are higher than those of the DM data.


Table 1 Comparison of our and DM protein–protein interaction data of six microorganisms

 
Figure 4a–c show the comparison of the range of sensitivity and specificity values for the 18, 35, 55, 65, 86, 106, 128 and 145 datasets, the corresponding data for randomly selected genomes, and DM_57, based on the KEGG database, the EcoCyc database and the DIP database, respectively. Although the three databases differ in the biological information provided, the results are unexpectedly similar. First, the sensitivity and specificity of the protein interaction data derived from selected organisms are higher than those derived from randomly selected organisms, which suggests that the protocol for selecting reference organisms is practical. Second, the sensitivity and specificity of the protein interaction data of the 55 set are higher than those of the DM_57 dataset, which indicates that the selection of reference organisms and the criteria for homology identification significantly improved the prediction accuracy of this method. Third, the sensitivity and specificity of the protein interaction data of the 145 set are higher than those of the DM_57 dataset and the other sets. Therefore, the refined phylogenetic profiles method provides better performance than the original DM method.



Fig. 4 (a–c) The range of sensitivity and specificity values of protein–protein interactions of E.coli with different reference organisms. DM_57 denotes data from the DM results; R18 was based on a set of 18 randomly selected genomes, R35 on 35 randomly selected genomes, and so on. The others (for example, 18 and 35) were based on reference organisms selected using phylogenetic relationships.

 
Genome-wide protein interaction data and protein function prediction for six organisms
The dataset for E.coli using the 145E04 combination predicted 45 451 functional linkages involving 2481 proteins. We also predicted protein interaction data for five other microorganisms using the refined phylogenetic profiles method (see Supplementary Material 2 online). Additionally, many functional pathways, for example, the citrate cycle (TCA cycle), fatty acid biosynthesis (path 1), ABC transporters, two-component system biosynthesis, flagellar assembly and chemotaxis in E.coli were demonstrated (see Supplementary Material 1—Table 1 online).

We predicted protein function for six microorganisms (see Supplementary Material 3 online) using the ‘guilt by association’ method (Oliver, 2000; Schwikowski et al., 2000). For example, in E.coli, YabB (PID 1788865) was predicted to belong to the category of cell envelope biogenesis (outer membrane COG category) and was validated by EcoCyc annotation (EG11084). YabB belongs to an operon involved in formation of the cell envelope and in the cell division. YjgP (PID 1790712), a putative transmembrane protein possibly involved in transport in the EcoCyc database (EG12535), was also predicted to belong to the same functional category of cell wall/membrane biogenesis in COG as YabB. YadR (PID 1786351), whose paralogous IscA and SufA genes are involved in central intermediary metabolism, as described in EcoCyc (EG12332), is predicted to belong to the category of coenzyme metabolism in COG. YafB (PID 1786400) was predicted to belong to the category of energy production and conversion, consistent with the description in the EcoCyc database (EG11648) that describes the 2,5-diketo-D-gluconate reductase B protein as catalyzing the following reaction: 2,5-diketo-D-gluconate + NADPH = 2-keto-L-gluconate + NADP. The results above show that protein function prediction by the refined method is reliable and could aid in future experimental design.


    DISCUSSION
The phylogenetic profiles method has been widely used to predict functional protein linkages (Enright et al., 1999; Pellegrini et al., 1999; Strong et al., 2003; Wu et al., 2003) and protein subcellular localization (Marcotte et al., 2000), to annotate genomes (Enault et al., 2003; Zheng et al., 2002) and to discover novel pathways (Date and Marcotte, 2003). In this study, we showed that careful determination of reference organisms and E-value thresholds significantly improved the performance of the phylogenetic profiles method and proposed a practical protocol for the selection of reference organisms when applying the phylogenetic profiles method.

When the method was first proposed and exploited by Pellegrini et al. (1999), only 16 fully sequenced organisms were used to construct phylogenetic profiles. Later studies by other groups used all the available genomes without considering the impact of organism selection on the method's predictive power (Date and Marcotte, 2003; Enault et al., 2003; Marcotte et al., 1999b; Marcotte et al., 2000; Pellegrini et al., 1999; Strong et al., 2003; Wu et al., 2003), because of the limited number of complete genome sequences at that time. As more and more complete genomic sequences become available, it becomes possible to pursue the question of whether the addition of new genomes would improve the accuracy and coverage of the method (Zheng et al., 2002). Zheng et al. demonstrated that more genomes (68 versus 30) would generate more putative functional associations and also proposed that there was a possible upper limit of accuracy for the phylogenetic profiles method. The authors did not, however, provide a strategy for selecting organisms or reveal the number and combination of organisms that would generate the highest accuracy. It is therefore necessary to develop a proper strategy for sampling organisms from different taxa as more and more completely sequenced genomes become available.

Here, we exploited the phylogenetic relationships of 162 currently available genomes to select reference organisms. Our results indicated that increasing the size of the reference genome pool within a certain range does improve the accuracy of the phylogenetic profiles method, while beyond this range the improvement becomes rather gradual. This is probably because a small number of reference genomes (such as 18 or 35) does not include enough co-evolutionary information, which results in lower accuracy and lower coverage. As the number of reference organisms increases, more co-evolutionary information is used and the performance improves until the co-evolutionary information provided by a certain number of reference organisms (86 here) covers most of the co-evolutionary information available from all reference organisms (162 here). Therefore, adding more genomes beyond the optimal number of reference organisms will not improve the performance as much as expected, and could even decrease it a little (as with the 162 set in our results), as too many genomes might introduce more noise into the phylogenetic information.

Therefore, when applying the phylogenetic profiles method to predict protein–protein interactions, it is essential to consider the selection of reference organisms, and choosing a good strategy for selecting them is important. Here we exploited the organisms' phylogenetic relationships to select reference organisms, because all members of a clade should have evolved from a common ancestor and the one far apart from the rest is close to that ancestor. Therefore, for a given clade, we selected the organism that is evolutionarily the farthest apart from the rest of the organisms in that clade, essentially selecting an outlier of that clade. Our results showed that exploiting the phylogenetic relationships of organisms is an effective strategy for selecting reference genomes. The predictive power of the method could be expected to improve further if the organisms' evolutionary distances were also taken into account.

Here, we only investigated the 162 organisms available when we began this study. However, with more and more completely sequenced organisms available, it might be neither necessary nor suitable to use the whole set of 162 organisms. Based on our results, we give a practical protocol for selecting the set of reference genomes. First, go to http://www.ncbi.nlm.nih.gov/genomes/Complete.html, click Eubacteria in the sentence ‘See Archaea and Eubacteria genome projects sorted by taxonomic groups’ and download the ‘phylip tree’ file. Second, use the software TreeView to open the ‘phylip tree’ file. Third, at the appropriate level of the phylogenetic tree, select from each clade the organism that is evolutionarily the farthest apart from the rest of the organisms in the same clade (i.e. an outlier). Finally, add all the Archaea and Eukaryotes with complete genomes, which have far fewer species than Eubacteria, and the reference organism set is ready for further calculation.

When applying the phylogenetic profiles method, the program BLASTP is used to compare the protein sequences and to calculate E-values between the proteins in the target and reference organisms. No systematic effort has been made to optimize the E-value threshold that determines the presence or absence of homologous proteins. Most authors used different values without giving any explanation, and used a binary value (present, 1 and absent, 0) to record the presence or absence of homologous proteins. Although Date and Marcotte (2003) claimed that the phylogenetic profiles method they used requires no minimum threshold of similarity to be specified, they applied the most permissive E-value threshold (10^-1) and then used E-values lower than the threshold to capture different degrees of sequence divergence. Such a permissive threshold might result in information without any biological significance being included in the phylogenetic profiles. Hence, it is necessary to investigate whether varying the E-value threshold affects both the accuracy and the coverage of the method, and to determine the proper E-value threshold. Our systematic investigation shows that the E-value threshold has a significant effect on the application of the method, and that an E-value cutoff of 1 × 10^-4 or 1 × 10^-5 achieves optimal accuracy and coverage.

In order to evaluate this method, we used E.coli protein interaction data from DIP as a positive control because this database records experimentally determined protein–protein interactions (Salwinski et al., 2004). In addition, we used the first level of KEGG orthology to compile the negative controls because the KEGG functional categories at this level are clear-cut and well defined (Kanehisa et al., 2004). It is reasonable to assume, therefore, that interactions between proteins from different categories are not likely to occur. Although the reference datasets are not necessarily complete and may be biased to a certain degree, they can be used for evaluation purposes. Although some researchers have used a standardized keyword annotation of the Swiss-Prot database to evaluate the quality of predicted functional linkages (Marcotte et al., 1999b; Strong et al., 2003), we found it difficult to use and not appropriate for this study. The functional similarity method used here applied the same principle as the keyword recovery scheme, as both approaches were based on functional annotations.

In addition to protein–protein interaction data from the DIP database, we used E.coli protein pathway information from the KEGG database and E.coli protein complexes from the EcoCyc database when comparing the performance of our refined method with that of the DM method. The KEGG database includes known biological pathway information, and the EcoCyc protein complexes database includes experimentally determined protein complexes. Although each database has its own biases in biological information, the results based on the three databases are similar in this study, indicating that the functional linkages predicted by the phylogenetic profiles method included not only physical interactions, such as protein complexes, but also genetic interactions, such as those between proteins in related signal transduction pathways.

The results presented above demonstrated that the performance of the phylogenetic profiles method has been improved by our modifications. Functional linkages could reveal functional roles for hundreds of previously uncharacterized proteins. As an important complementary tool for homology analysis, this method, in combination with other non-homology methods (Enright et al., 1999; Overbeek et al., 1999), is expected to be valuable, not only to identify interacting protein pairs, but also to infer protein function.


    Acknowledgments
 
The authors would like to thank Jiancheng Lin for helpful discussions, Youyu He, Shaoyou Yang and Wei Huang for the help with programming. The authors also thank Dr Qi Sun from the Cornell Theory Center of Cornell University and three anonymous reviewers for their invaluable comments and suggestions. The 863 Hi-Tech Program grants 2001AA231011, 2002AA231051 and 2003AA231011, the State Key Program of Basic Research of China grants 2001CB510209, 2002CB713807 and 2003CB715901, and National Natural Science Foundation of China grant 90408010 supported this project.

Conflict of Interest: none declared.


    Footnotes
 
{dagger}The authors wish it to be known that, in their opinion, the first author and the fourth author should be regarded as joint First Authors.

Received on October 14, 2004; revised on May 13, 2005; accepted on June 7, 2005

    REFERENCES

    Altschul, S., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Auerbach, D., et al. (2002) The post-genomic era of interactive proteomics: facts and perspectives. Proteomics, 2, 611–623.

    Chen, Y. and Xu, D. (2003) Computational analyses of high-throughput protein–protein interaction data. Curr. Protein Pept. Sci., 4, 159–181.

    Date, S.V. and Marcotte, E.M. (2003) Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotechnol., 21, 1055–1062.

    Eisenberg, D., et al. (2000) Protein function in the post-genomic era. Nature, 405, 823–826.

    Enault, F., et al. (2003) Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics, 19, Suppl. 1, i105–i107.

    Enright, A., et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90.

    Fields, S. and Song, O. (1989) A novel genetic system to detect protein–protein interactions. Nature, 340, 245–246.

    Gaasterland, T. and Ragan, M.A. (1998) Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics, 3, 199–217.

    Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.

    Keseler, I.M., et al. (2005) EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res., 33, D334–D337.

    Marcotte, E.M., et al. (1999a) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.

    Marcotte, E.M., et al. (1999b) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83–86.

    Marcotte, E.M., et al. (2000) Localizing proteins in the cell from their phylogenetic profiles. Proc. Natl Acad. Sci. USA, 97, 12115–12120.

    Oliver, S. (2000) Guilt-by-association goes global. Nature, 403, 601–603.

    Overbeek, R., et al. (1999) The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA, 96, 2896–2901.

    Pellegrini, M., et al. (1999) Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.

    Salwinski, L., et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32, D449–D451.

    Schwikowski, B., et al. (2000) A network of protein–protein interactions in yeast. Nat. Biotechnol., 18, 1257–1261.

    Strong, M., et al. (2003) Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol., 4, R59.

    von Mering, C., et al. (2002) Comparative assessment of large-scale datasets of protein–protein interactions. Nature, 417, 399–403.

    von Mering, C., et al. (2005) STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res., 33, D433–D437.

    Walhout, A.J. and Vidal, M. (2001) Protein interaction maps for model organisms. Nat. Rev. Mol. Cell Biol., 2, 55–62.

    Wu, J., et al. (2003) Identification of functional links between genes using phylogenetic profiles. Bioinformatics, 19, 1524–1530.

    Zheng, Y., et al. (2002) Genomic functional annotation using co-evolution profiles of gene clusters. Genome Biol., 3, RESEARCH0060.





