Bioinformatics Advance Access originally published online on April 1, 2008
Bioinformatics 2008 24(10):1257-1263; doi:10.1093/bioinformatics/btn106
Assigning functional linkages to proteins using phylogenetic profiles and continuous phenotypes
1Institute for Informatics, Ludwig-Maximilians-Universität München, Amalienstr. 17, 80333 Munich and 2Department of Membrane Biochemistry, Max-Planck Institute for Biochemistry, 82152 Martinsried, Germany
*To whom correspondence should be addressed.
ABSTRACT
Motivation: A class of non-homology-based methods for protein function prediction relies on the assumption that genes linked to a phenotypic trait are preferentially conserved among organisms that share the trait. These methods typically compare pairs of binary strings, where one string encodes the phylogenetic distribution of a trait and the other of a protein. In this work, we extended the approach to automatically deal with continuous phenotypes.
Results: Rather than use a priori rules, which can be very subjective, to construct binary profiles from continuous phenotypes, we propose to systematically explore thresholds which can meaningfully separate the phenotype values. We illustrate our method by analyzing optimal growth temperatures, and demonstrate its usefulness by automatically retrieving genes which have been associated with thermophilic growth. We also apply the general approach, for the first time, to optimal growth pH, and make novel predictions. Finally, we show that our method can also be applied to other properties which may not be classically considered as phenotypes. Specifically, we studied correlations between genome size and the distribution of genes.
Contact: orlandgonzalez{at}gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Comparative genomic approaches for gene function prediction are becoming increasingly important due to the growing number of fully sequenced genomes. The trend is likely to continue as the cost of sequencing continues to drop and the pace at which genomes of different organisms are sequenced accelerates. Most of these computational approaches are based on matching proteins of interest to other proteins with known function (Bork et al., 1998). These methods, traditionally and predominantly based on sequence homology (Altschul et al., 1990, 1997), have been widely successful, as evidenced by their extensive use. However, it has also become clear that homology-based methods have to be complemented. The stringent requirement of finding a match with known function for a given query protein cannot always be satisfied. Moreover, it is well known that two proteins with the same function do not necessarily have similar sequences.
Methods which exploit gene context have emerged as successful complements to homology-based predictions (Huynen and Snel, 2000; Huynen et al., 2000). These approaches use contextual information such as gene fusion, the conservation of local gene neighborhoods and the co-occurrence of genes across genomes. For example, proteins encoded by genes with homologs which are fused in another organism tend to be functionally related (Enright et al., 1999; Marcotte et al., 1999; Snel et al., 2000). Likewise, genes which are frequently encountered as neighbors across genomes, detected by the conservation of either gene order (Dandekar et al., 1998) or genes in a run (Overbeek et al., 1999), also tend to be functionally related. While these methods typically do not predict specific functions by themselves, they are used to predict higher level functions, such as the participation of a protein in a particular structural complex or metabolic pathway.
Methods which utilize the third type of contextual information, the co-occurrence of genes across genomes, are premised on the hypothesis that functionally linked proteins are likely to evolve in a correlated fashion (Pellegrini et al., 1999). For example, two proteins in the same pathway tend to be either both present or both absent in a genome. Accordingly, genes encoding correlated proteins should have homologs in the same subset of organisms. The basic method, often referred to as phylogenetic profiling, involves the creation of gene phylogenetic profiles (GPPs), which are binary strings which indicate the presence or absence of genes in the different genomes, and the subsequent search for matching pairs of these profiles. Since it was observed that the requirement for finding perfect matches between GPPs can be too restrictive, the approach was later extended to rank pairs of GPPs using a more general scoring function (Wu et al., 2003). Further developments of the method include the use of mutual information for scoring (Date and Marcotte, 2003), guidelines for selecting appropriate reference organisms for maximum information (Sun et al., 2007), methods which account for phylogeny using reconstructed phylogenetic trees (Barker et al., 2007; David et al., 2008; Pazos and Valencia, 2001; Vert, 2002; Zhou et al., 2006) or heuristics (Cokus et al., 2007), extensions for discovering three-way interactions (Bowers et al., 2004) and methods specifically adapted for assigning genes to orphan activities in metabolic networks (Chen and Vitkup, 2006; Osterman and Overbeek, 2003).
Closely related to the class of phylogenetic profiling methods described above are methods which directly associate phenotypic traits with GPPs. Rather than comparing pairs of GPPs, these methods compare GPPs directly with phenotype phylogenetic profiles (PPPs) (Levesque et al., 2003). PPPs are very similar to GPPs, except that the ones and zeros represent the presence or absence of phenotypic traits rather than genes. The hypothesis is that genes associated with a specific phenotypic trait tend to have phylogenetic distributions similar to that of the phenotype. The approach has been applied to motility (Levesque et al., 2003), the presence of pili, thermophily, respiratory tract tropism, Gram-negativity, aerobic respiration, endospore formation and pathogenicity (Jim et al., 2004; Slonim et al., 2006). In this study, we extend the approach to directly and automatically deal with continuous (quantitative) phenotypes. We eliminate the need for the a priori classification rules which are currently used to construct PPPs from continuous values. We demonstrate the feasibility of systematically exploring meaningful threshold values for discretizing and separating the data, and describe a method from information theory for selecting the best one. We illustrate our method by uncovering relationships between GPPs, reconstructed from the Clusters of Orthologous Groups (COG) database, and the optimal growth parameters (pH and temperature) of the organisms represented. While the general approach has already been applied to thermophily (Jim et al., 2004), in this work we do not use any outside information to classify an organism as mesophilic, thermophilic or hyperthermophilic a priori. To the best of our knowledge, this work also represents the first application of the general approach to study adaptations to extreme pH.
Finally, we demonstrate that our method can also be used to analyze relationships between gene phylogenies and other quantitative properties of organisms that may not be classically considered as phenotypes, such as genome size and GC content.
2 METHODS
2.1 Phylogenetic profiles
Let N = 66 be the number of lineages (genomes). A GPP X is a binary string of length N, where the value at position i, designated X[i], is 1 if the corresponding gene is present in the genome associated with position i, and 0 otherwise. Similarly, a PPP Y is also a binary string of length N. However, since the phenotypes in this study are continuous, we use a threshold τ to discretize the data. Specifically,

$$Y[i] = \begin{cases} 1 & \text{if } y_i \geq \tau \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where y_i is the phenotype value of lineage i. Setting Y[i] = 1 for values at or above τ, and not the other way around, is only a convention. It does not affect the measure used for scoring GPP–PPP pairs.
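The discretization step can be illustrated with a minimal sketch. The function name `phenotype_profile`, the variable `tau` (standing for the threshold) and the temperature values are all hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from typing import List, Sequence


def phenotype_profile(values: Sequence[float], tau: float) -> List[int]:
    """Return a binary PPP: 1 where the phenotype value is >= tau, else 0."""
    return [1 if y >= tau else 0 for y in values]


# Hypothetical optimal growth temperatures (deg C) for six lineages.
temps = [37.0, 30.0, 85.0, 95.0, 60.0, 37.0]
ppp = phenotype_profile(temps, tau=70.0)
print(ppp)  # [0, 0, 1, 1, 0, 0]
```

Flipping the convention (1 for values below the threshold) would simply swap the ones and zeros of the profile.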
2.2 Scoring GPP–PPP pairs
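The empirical information entropy H(Y) of a PPP Y can be defined as

$$H(Y) = -\sum_{v \in \{0,1\}} p(Y = v) \log_2 p(Y = v) \qquad (2)$$

where p(Y = v) is the fraction of the N positions of Y with value v, and 0 log 0 is taken to be 0.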
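As a small sketch of how the empirical entropy of a binary profile might be computed, assuming base-2 logarithms so that the result is in bits (the function name `entropy` is illustrative):

```python
import math
from typing import Sequence


def entropy(profile: Sequence[int]) -> float:
    """Empirical Shannon entropy, in bits, of a binary (0/1) profile."""
    n = len(profile)
    if n == 0:
        return 0.0
    ones = sum(profile)
    h = 0.0
    for count in (ones, n - ones):
        if count > 0:  # an empty class contributes nothing (0 * log 0 = 0)
            h += (count / n) * math.log2(n / count)
    return h


print(entropy([1, 0, 1, 0]))  # 1.0
print(entropy([1, 1, 1, 1]))  # 0.0
```

The entropy is largest when ones and zeros are equally frequent and vanishes when the profile is constant.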
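The information gain (IG) about a PPP Y obtained from observations of a GPP X can be defined as

$$IG(Y \mid X) = H(Y) - H(Y \mid X) = H(Y) - \sum_{u \in \{0,1\}} p(X = u)\, H(Y \mid X = u) \qquad (3)$$

where H(Y | X = u) is the entropy of Y restricted to the lineages in which X[i] = u. The IG reaches its maximum, H(Y), when all genomes with X have phenotype ≥ τ and all genomes without X have phenotype < τ, or vice versa. Certainly, if H(Y) = 0, then the IG is 0 regardless of X.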
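The information gain can be sketched in a few lines, here using the standard definition as the entropy of the PPP minus its conditional entropy given the GPP, with base-2 logarithms. The function names are illustrative and the toy profiles are invented.

```python
import math
from typing import Sequence


def entropy(profile: Sequence[int]) -> float:
    """Empirical Shannon entropy, in bits, of a binary profile."""
    n = len(profile)
    if n == 0:
        return 0.0
    ones = sum(profile)
    return sum((c / n) * math.log2(n / c) for c in (ones, n - ones) if c > 0)


def information_gain(gpp: Sequence[int], ppp: Sequence[int]) -> float:
    """IG about the PPP obtained from observing the GPP: H(Y) - H(Y | X)."""
    if len(gpp) != len(ppp):
        raise ValueError("profiles must have the same length")
    n = len(ppp)
    h_cond = 0.0
    for u in (0, 1):
        # Phenotype bits of the lineages where the gene state equals u.
        subset = [y for x, y in zip(gpp, ppp) if x == u]
        if subset:
            h_cond += (len(subset) / n) * entropy(subset)
    return entropy(ppp) - h_cond


print(information_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0 (perfect positive)
print(information_gain([1, 1, 0, 0], [0, 0, 1, 1]))  # 1.0 (perfect negative)
print(information_gain([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0 (no association)
```

Because the measure is symmetric with respect to swapping ones and zeros, positively and negatively correlated profiles receive the same score.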
The IG as a function of the separation size, imposed by τ, and the accuracy is shown in Figure 1. Note that even if the co-occurrence of ones and zeros in X and Y were perfect (i.e. X[i] = Y[i] for all i, implying accuracy = 100%), the GPP–PPP pair will not necessarily have a high IG score. For example, if the gene associated with X only occurs in one particular lineage j (i.e. X[i] = 1 iff i = j), and the phenotype is also equal to or above the threshold τ only for lineage j (i.e. Y[i] = 1 iff i = j), then the IG will still be small because it is capped by a low H(Y). This is a desirable property for a scoring function. Since there is only one genome wherein X and Y co-occur, the possibility that they co-occur by chance alone is very high. In general, the maximum IG value is achieved when the correlation between the values of X and Y is perfect (either positive or negative correlation), and X and Y occur in about half of the lineages. The IG as defined above is similar to the mutual information measure used in other studies (Date and Marcotte, 2003; Slonim et al., 2006). We discuss alternatives to using IG in the Supplementary Material under the heading Alternatives to Information Gain, particularly the applicability of the two-sample t-test.
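As a worked illustration of the single-lineage case, assuming base-2 logarithms and N = 66, the entropy that caps the IG is

$$H(Y) = -\tfrac{1}{66}\log_2\tfrac{1}{66} - \tfrac{65}{66}\log_2\tfrac{65}{66} \approx 0.092 + 0.022 \approx 0.11 \text{ bits},$$

whereas a phenotype profile that splits the lineages evenly (33 versus 33) gives

$$H(Y) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}.$$

A perfectly matching GPP therefore scores roughly nine times lower in the first case than in the second.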
2.3 Systematic exploration of threshold values
It is clear from Equation (3) that IG, which is used to score GPP–PPP pairs, is dependent on the threshold value τ (Equation 1) used to discretize the phenotype values. In this work, we systematically explored thresholds by using representative values taken from ranges which are significantly different from each other, that is, ranges defined such that moving from one to the other can potentially affect the IG as defined in Equation (3). Briefly, for a phenotype with values {y1, y2, ... , yi ,... , yN} where yi is the value of the phenotype for lineage i, we created the ordered sequence
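One way to enumerate thresholds that can actually change the score is sketched below. Only thresholds falling in different gaps between consecutive distinct phenotype values produce different binary profiles, so one representative per gap suffices. Taking the midpoint of each gap as the representative is an assumption of this sketch, and the function names and temperature values are hypothetical.

```python
from typing import List, Sequence


def candidate_thresholds(values: Sequence[float]) -> List[float]:
    """One representative threshold per gap between distinct sorted values."""
    distinct = sorted(set(values))
    # Assumption: the midpoint of each gap stands in for the whole range.
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def phenotype_profile(values: Sequence[float], tau: float) -> List[int]:
    """Binary PPP: 1 where the value is >= tau, else 0."""
    return [1 if y >= tau else 0 for y in values]


temps = [37.0, 30.0, 85.0, 95.0, 60.0, 37.0]
thresholds = candidate_thresholds(temps)
print(thresholds)  # [33.5, 48.5, 72.5, 90.0]

# Two thresholds inside the same gap yield the same profile, hence the same IG.
same = phenotype_profile(temps, 61.0) == phenotype_profile(temps, 84.0)
print(same)  # True
```

For N lineages this gives at most N - 1 thresholds to evaluate per phenotype.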
2.4 Data sources
The COG database represents an attempt at a phylogenetic classification of the proteins encoded in genomes (Tatusov et al., 1997, 2001). Essentially, COG groups proteins which are thought to be orthologous to each other, i.e. connected through vertical evolutionary descent. Such a property often carries notions of gene equivalence, making the COG database a very rich and convenient source for encoding GPPs. We used the 2003 release of the database for unicellular clusters. Optimal growth temperature data were retrieved from the Prokaryotic Growth Temperature database (PGTdb) (Huang et al., 2004). Optimal growth pH values were collected from various literature sources and from the German Collection of Microorganisms and Cell Cultures (http://www.dsmz.de/). The dataset is provided as Supplementary Table 4.
3 RESULTS AND DISCUSSION
3.1 Phenotype: optimal growth temperature
Microorganisms exist in nature under an enormous range of physical conditions. With respect to temperature, they have been found even in extreme environments, from <0°C up to 115°C (Brock, 1985, 1994; Morita and Moyer, 2004). Generally, organisms with the ability to grow near 0°C are referred to as psychrophiles, those with an optimum near 37°C as mesophiles, those with an optimum between about 45°C and 70°C as thermophiles, and those with an optimum of 80°C or higher as hyperthermophiles. Under such a scheme, 10 of the lineages in the dataset we used, 8 archaea and 2 bacteria, are considered hyperthermophilic; 4 lineages, 3 archaea and a Gram-positive bacterium, are considered thermophilic; and the rest are mesophilic. As mentioned earlier, correlations between GPPs and thermophily have been studied before (Jim et al., 2004; Klinger et al., 2003). However, in these previous works it was indicated a priori which genomes were marked as thermophilic, i.e. which elements of the PPP are set to 1. Such a step carries complications. For example, it must be decided whether the PPP should be created such that only hyperthermophiles are set to 1, or whether the more moderate thermophiles should be included. In addition, note that even the boundary between thermophily and hyperthermophily is not clear-cut. In contrast, our method deals with the continuous optimal growth temperature values directly and automatically. Moreover, we also investigate GPPs which are negatively correlated with thermophily. Negative correlations were not considered in previous studies.
In Figure 2, we summarize the results of systematically exploring thresholds. The top-left graph (Fig. 2A, open triangle) shows the average IG scores over all GPPs as a function of the threshold τ. For comparison, the graph also shows average IG scores after randomly shuffling all GPPs (open circle). Evidently, there is a correlation between optimal growth temperature and the phylogenetic distribution of at least some of the genes. The top-right graph (Fig. 2B) is similar to the top-left graph (Fig. 2A) except that maximum IG scores are indicated rather than averages. Each of the graphs at the bottom (Fig. 2C and D) is similar to the corresponding graph above, except that the data are shown as a function of the separation sizes induced by the different thresholds. Specifically, the value on the x-axis indicates the percentage of the lineages with an optimal growth temperature below the threshold. The right skew of the curve corresponding to the actual GPPs (open triangle) in the bottom-left graph (Fig. 2C) is important. From the definition of IG (Equation 3), one would typically expect the average IGs, given unbiased data, to be roughly symmetric with respect to the separation size of the phenotype profile. This can readily be seen in the curve drawn with circles (open circle) in the same graph, which corresponds to results when the GPPs were shuffled. The right skew of the actual GPPs indicates that it would be more meaningful to analyze one particular end of the phenotype spectrum. From the conventions used in this study, the right skew corresponds to the upper range of the temperature spectrum. This is probably due to the underrepresentation of organisms which thrive at low temperatures in the current dataset (genomes included in the COG database). The results for the shuffled profiles are currently only shown for comparison with the actual GPPs. Nevertheless, they could also be used to derive a measure of significance for the actual results.
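How shuffled profiles might be turned into a measure of significance can be sketched with a simple permutation test. This is one common approach and is offered only as an illustration; the function `permutation_p_value`, its parameters and the toy profiles are hypothetical.

```python
import math
import random
from typing import Sequence


def entropy(profile: Sequence[int]) -> float:
    """Empirical Shannon entropy, in bits, of a binary profile."""
    n = len(profile)
    if n == 0:
        return 0.0
    ones = sum(profile)
    return sum((c / n) * math.log2(n / c) for c in (ones, n - ones) if c > 0)


def information_gain(gpp: Sequence[int], ppp: Sequence[int]) -> float:
    """H(Y) - H(Y | X) for a GPP X and a PPP Y of equal length."""
    n = len(ppp)
    h_cond = 0.0
    for u in (0, 1):
        subset = [y for x, y in zip(gpp, ppp) if x == u]
        if subset:
            h_cond += (len(subset) / n) * entropy(subset)
    return entropy(ppp) - h_cond


def permutation_p_value(gpp: Sequence[int], ppp: Sequence[int],
                        n_perm: int = 1000, seed: int = 0) -> float:
    """Fraction of shuffled GPPs scoring at least as high as the real one."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    observed = information_gain(gpp, ppp)
    shuffled = list(gpp)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the number of ones, breaks the pairing
        if information_gain(shuffled, ppp) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


gpp = [1] * 5 + [0] * 5
ppp = [1] * 5 + [0] * 5
p = permutation_p_value(gpp, ppp, n_perm=200)
print(p)
```

Shuffling preserves how many genomes carry the gene, so the null distribution reflects chance co-occurrence for a gene of the same prevalence.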
The top 20 COGs correlated with optimal growth temperature are listed in Table 1, each with its highest IG score and the corresponding threshold (τ) value found. Figure 3 shows the IG scores of the top three of these at the different threshold values. For each COG, the τ at which it has the highest IG is taken to be the COG's best threshold value. The figure also shows the average IG over all COGs for reference. The top ranked COG, COG1618, is present in 14 of the 66 lineages, and is described as a predicted nucleotide kinase. With respect to the phenotype, COG1618 is present in the 13 lineages with the highest optimal growth temperatures, encompassing all hyperthermophiles and three of the four thermophiles as classified using the scheme described above. Even though COG1618 is present in one clearly mesophilic genome (Methanosarcina acetivorans), it is nevertheless likely that COG1618 is linked with thermophily given its relatively high specificity to organisms with a high optimal growth temperature. In fact, the gene in COG1618 from Aquifex aeolicus has been characterized as an NTPase with optimal activity at 70°C (Klinger et al., 2003). While it is possible that the COG is only indirectly linked to thermophily, for example, if its member proteins are simply thermophilic versions of proteins which are present in other organisms, it is nonetheless interesting that it is for this specific function, and not for most of the other functions shared with other organisms, that a non-homologous thermophilic gene has evolved. The genome which is not represented in COG1618, but is classified as thermophilic in the PGTdb, is that of Bacillus halodurans. In the study by Klinger et al. (2003), B.halodurans was not considered as thermophilic, once more illustrating the subjective nature of the boundaries between classifications of organisms with respect to optimal growth temperature. Again, we note that we used no such a priori classification scheme, as the threshold τ is chosen automatically.
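Selecting the best threshold for a single gene profile can be sketched as a scan over candidate thresholds that keeps the one with the highest IG. The sketch assumes gap midpoints as candidates and base-2 logarithms; the function `best_threshold` and the toy data are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def entropy(profile: Sequence[int]) -> float:
    """Empirical Shannon entropy, in bits, of a binary profile."""
    n = len(profile)
    if n == 0:
        return 0.0
    ones = sum(profile)
    return sum((c / n) * math.log2(n / c) for c in (ones, n - ones) if c > 0)


def information_gain(gpp: Sequence[int], ppp: Sequence[int]) -> float:
    """H(Y) - H(Y | X) for a GPP X and a PPP Y of equal length."""
    n = len(ppp)
    h_cond = 0.0
    for u in (0, 1):
        subset = [y for x, y in zip(gpp, ppp) if x == u]
        if subset:
            h_cond += (len(subset) / n) * entropy(subset)
    return entropy(ppp) - h_cond


def candidate_thresholds(values: Sequence[float]) -> List[float]:
    """Assumed candidates: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def best_threshold(gpp: Sequence[int],
                   values: Sequence[float]) -> Tuple[Optional[float], float]:
    """Return (threshold, IG) maximizing IG; (None, 0.0) if no candidates."""
    best_tau, best_ig = None, 0.0
    for tau in candidate_thresholds(values):
        ppp = [1 if y >= tau else 0 for y in values]
        ig = information_gain(gpp, ppp)
        if best_tau is None or ig > best_ig:
            best_tau, best_ig = tau, ig
    return best_tau, best_ig


# Hypothetical temperatures and a gene found only in the two hottest lineages.
temps = [30.0, 37.0, 37.0, 60.0, 85.0, 95.0]
gene = [0, 0, 0, 0, 1, 1]
tau, ig = best_threshold(gene, temps)
print(tau, round(ig, 3))  # 72.5 0.918
```

Ranking all gene profiles by their best IG would then give a table of top-scoring candidates, each with its own threshold.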
The second highest ranked COG, COG1110, is reverse gyrase, which is present in all of the hyperthermophilic genomes (> 90°C). Reverse gyrase induces positive supercoiling in DNA, which can improve DNA stability at high temperatures (Confalonieri et al., 1993; Forterre et al., 1985, 2000). Although it was demonstrated recently that reverse gyrase is not a prerequisite for hyperthermophilic life (Atomi et al., 2004), it was also observed in the same study that disruption of the gene causes growth retardation which becomes more pronounced at higher temperatures.
The third highest ranked COG, COG1980, is described as archaeal fructose 1,6-bisphosphatase. Most of the thermophilic and hyperthermophilic lineages in the dataset are archaea, so the archaeal qualifier in the description may suggest that the high score of the COG has nothing to do with the phenotype in question. Closer inspection reveals that this is not necessarily the case. In fact, the archaeal qualifier is somewhat misleading, as the COG is not represented in all archaea and actually includes A.aeolicus, which is a bacterial species. Moreover, it is striking that the COG is specific to thermophilic genomes, and that it is absent from only one of the lineages with an optimal growth temperature above 53°C (Thermotoga maritima). We note that fructose 1,6-bisphosphatase is a key enzyme for synthesizing sugars via gluconeogenesis, which are incorporated into outer structures such as cell walls and surface layers. Evidently, these structures have important roles in protecting cells from unfavorable elements in their environments.
Other high ranking COGs with functions which have previously been linked to thermophily include: (1) COG1144, described as Pyruvate:ferredoxin oxidoreductase and related 2-oxoacid:ferredoxin oxidoreductases, delta subunit. Unusual ferredoxin oxidoreductases involved in anaerobic respiration with high specificity for hyperthermophiles have been described (Kelly and Adams, 1994). (2) COG1688 and other COGs have been predicted, through analysis of conserved gene neighborhoods, to be parts of a DNA repair system which is highly specific to thermophilic archaea and bacteria (Makarova et al., 2002). However, only COG1688 was included in Table 1 as the others are less specific or less conserved in the thermophilic lineages. (3) COG3635 (Predicted phosphoglycerate mutase, AP superfamily) has been linked to extremophilic life due to its presence in most thermophiles and the polyextremophile Deinococcus radiodurans (Reichard and Kaufmann, 2003).
Finally, examples of GPPs in Table 1 which are negatively correlated with optimal growth temperature include COG0443 and COG0484, both described as molecular chaperones. Molecular chaperones are proteins involved in the folding (unfolding) and assembly (disassembly) of other macromolecular structures. It is striking that although both COGs are represented in all mesophilic genomes, each of them is present in only one out of the nine lineages with an optimal growth temperature above 82.5°C. Hyperthermophiles seem to tend to lose or not acquire genes from these COGs. Note that both COGs are represented in the archaea as well as the bacteria, just not the hyperthermophilic ones.
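Since IG gives the same score to positive and negative correlations, a separate step is needed to tell them apart. A minimal sketch, assuming the direction is read from the difference in gene prevalence on the two sides of the threshold (the function `association_direction` and the toy profiles are hypothetical):

```python
from typing import Sequence


def association_direction(gpp: Sequence[int], ppp: Sequence[int]) -> int:
    """Return +1 if the gene is more frequent where the PPP is 1,
    -1 if it is more frequent where the PPP is 0, and 0 otherwise."""
    above = [x for x, y in zip(gpp, ppp) if y == 1]
    below = [x for x, y in zip(gpp, ppp) if y == 0]
    if not above or not below:
        return 0  # one side is empty, so no direction can be assigned
    diff = sum(above) / len(above) - sum(below) / len(below)
    return (diff > 0) - (diff < 0)


# A gene present in the low-temperature lineages only (negative correlation).
print(association_direction([1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]))  # -1
# A gene present in the high-temperature lineages only (positive correlation).
print(association_direction([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 1]))  # 1
```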
As an alternative to IG, we also tested using versions of the Two-Sample T-Test (TSTT) for correlating GPPs and PPPs. While the results were comparable, particularly for top ranked COGs, some significant differences were apparent. We believe that any future application will benefit from using multiple methods, at least to allow comparison. In addition, we also tested our method on larger, albeit automatically generated, datasets by constructing GPPs for three reference organisms using genome-to-genome BLAST searches. The results obtained are also comparable. Details on TSTT and the application to the larger dataset are provided as Supplementary sections Alternatives to Information Gain and COG and Available Genomes, respectively.
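A t-test based comparison can work directly on the continuous phenotype values, split by presence or absence of the gene, so no threshold is required. The sketch below uses Welch's unequal-variance statistic, which is an assumption; the versions actually tested are described only in the Supplementary Material. The function `welch_t` and the data are hypothetical.

```python
import math
import statistics
from typing import Sequence


def welch_t(values: Sequence[float], gpp: Sequence[int]) -> float:
    """Welch's t statistic comparing phenotype values of genomes with the
    gene (gpp == 1) against genomes without it (gpp == 0)."""
    present = [y for y, x in zip(values, gpp) if x == 1]
    absent = [y for y, x in zip(values, gpp) if x == 0]
    if len(present) < 2 or len(absent) < 2:
        raise ValueError("each group needs at least two genomes")
    # Squared standard error of the difference between the two group means.
    se2 = (statistics.variance(present) / len(present)
           + statistics.variance(absent) / len(absent))
    if se2 == 0.0:
        raise ValueError("both groups have zero variance")
    return (statistics.mean(present) - statistics.mean(absent)) / math.sqrt(se2)


# Hypothetical temperatures and a gene found only in the two hottest lineages.
temps = [30.0, 37.0, 37.0, 60.0, 85.0, 95.0]
gene = [0, 0, 0, 0, 1, 1]
t = welch_t(temps, gene)
print(round(t, 2))
```

A positive statistic indicates that genomes carrying the gene tend to have higher phenotype values, and a negative one the opposite.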
3.2 Phenotype: optimal growth pH
Similar to temperature, the pH of the natural environments in which microorganisms are found also varies widely, from about 0.5 in the most acidic soils to about 10.5 in the most alkaline lakes (Krulwich and Guffanti, 1989). In fact, an archaeon capable of growth at near zero pH has been reported (Edwards et al., 2000). Considering that pH is a logarithmic measure of hydrogen ion concentration, the spectrum is quite vast indeed. We analyzed associations between the GPPs and the continuous growth pH values. Figure 4, supplied as Supplementary information, shows the average (left) and maximum (right) IG scores over all GPPs as a function of the threshold τ. Results using the actual profiles are indicated in triangles (open triangle), and results after randomly shuffling the profiles in circles (open circle). Clearly, some GPPs have correlations with the phenotype above what may be expected by chance. However, it is also evident that the differences between the IG values of the actual profiles and those of the shuffled ones are not as large as previously obtained with optimal temperature. The likely reason is that there are far fewer extremophiles with respect to pH than with respect to temperature in the current COG dataset. The 10 top ranked COGs are listed in Table 2.
The top ranked COG, COG4934, is a predicted protease. Proteases are enzymes that conduct proteolysis, which is the hydrolysis of peptide bonds that link amino acids together in polypeptide chains. Proteolysis is the initial step in protein catabolism. COG4934 is represented by multiple genes from each of the three lineages Thermoplasma volcanium, Thermoplasma acidophilum and Sulfolobus solfataricus, which are considered acidophilic, and by only one gene from Clostridium acetobutylicum. Even though C.acetobutylicum is not considered acidophilic, it is an anaerobic organism that can quickly acidify its environment to pH below 5.0 during growth, due to the formation of organic acids (Bowles and Ellefson, 1985; Gottwald and Gottschalk, 1985). In fact, pH has been implicated in the transition of the organism from an acidogenic to a solventogenic state. Therefore, in light of the COG's relatively high specificity to organisms with low optimum pH, it is likely that the COG contains acidity-linked genes.
Other COGs which have high specificity for genomes of low pH organisms include COG3888, COG4344, COG4946 and COG5592. Each of these is represented by genes from all three lineages with optimal pH below 3.0 and in only one other genome: COG3888 in Archaeoglobus fulgidus; COG4344 in Pyrococcus horikoshii; COG4946 in Pyrobaculum aerophilum; and COG5592 in Nostoc sp. PCC 7120.
3.3 Non-phenotype attributes
In addition to phenotypic traits, our method can also be applied to other attributes which may not be classically considered as phenotypes. Here, we explored the associations between the genome complexity of the microorganisms, measured as the number of genes in each genome, and the GPPs. The question we ask is whether some genes are preferentially conserved or lost, based on the complexity of a microorganism's lifestyle, which we assume is correlated with its gene count. The genome sizes in the dataset ranged from 520 genes in Mycoplasma genitalium, to 7329 genes in Mesorhizobium loti. The summary of IG scores is provided as Supplementary information (Fig. 5). The top scoring COGs are listed in Table 3. Again, some COGs showed significantly higher scores than their randomized counterparts, comparable to the results obtained with optimal growth temperature.
The top scoring COG, COG0384, is a predicted epimerase which tends to occur in complex genomes. Epimerases are enzymes that catalyze the inversion of stereochemistry in biological molecules. While the general trend in nature is to utilize only one enantiomer of a given molecule in essential biochemical pathways (e.g. L-amino acids and D-sugars), organisms, in many instances, can benefit from the ability to use molecules with unusual stereochemistry, either as biosynthetic building blocks or as metabolic precursors (Tanner, 2002). The presence of a gene from COG0384 is, therefore, likely to be advantageous for a microorganism with a more complex lifestyle, for example, if it needs a high degree of nutritional self-sufficiency.
COG0452, described as phosphopantothenoylcysteine synthetase (EC 6.3.2.5), is the second top ranked COG. Its biochemical function is related to the synthesis of Coenzyme A, an acyl group carrier, notable for its role in the synthesis and oxidation of fatty acids, and the oxidation of pyruvate through the citric-acid cycle. Clearly, the function of COG0452 is critical to organisms with little access to an exogenous supply of the coenzyme. Taking this into consideration, it is interesting that the COG is underrepresented in genomes with low gene counts. Simple organisms seem to tend to lose or not develop the associated biosynthetic pathway, probably preferring to rely on external sources instead.
4 CONCLUSION
Methods based on gene-context analysis have emerged as powerful complements to traditional homology-based protein function prediction. Unlike the traditional approach, these methods do not require that a homolog with known function be available for a given query protein. A particular class of these methods works by comparing pairs of binary strings, one string corresponding to a gene and the other to a phenotypic trait, based on the assumption that genes necessary for a phenotypic trait are preferentially conserved among organisms which share the trait. In this study, we extended this class of approaches to automatically deal with continuous phenotypes.
Rather than using a priori rules, which can be very subjective, to construct binary profiles from continuous phenotypes, we propose that thresholds which can meaningfully separate the values be systematically explored instead. We illustrated our approach by finding associations between COGs and optimal growth temperature, and demonstrated its validity by automatically retrieving genes which have been associated with thermophily, such as a nucleotide kinase and DNA repair components. Results were highly indicative that optimal growth temperature is indeed an important factor in genetic phylogeny. We also applied the approach of matching GPP–PPP pairs to optimal growth pH for the first time, and made novel predictions.
Finally, we demonstrated that our method can also be applied to organism attributes which may not be classically considered as phenotypes. In particular, we analyzed microorganismal lifestyle complexity, which we assumed to be related to genome size, and asked whether genes are preferentially conserved or lost based on this. From the results, it is evident that, at least for some genes, absence or presence is indeed related to genome size. For example, phosphopantothenoylcysteine synthetase (EC 6.3.2.5) is clearly underrepresented in low-gene-count lineages. The enzyme catalyzes a key step in the synthesis of the important cofactor Coenzyme A, which is notable for its role in fatty-acid metabolism. It would seem that simple organisms tend to lose or not develop the associated pathway, presumably because they rely on exogenous sources instead. Clearly, application of the method to non-phenotype data can also yield interesting biological insights.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Martin Bishop
Received on December 20, 2007; revised on March 20, 2008; accepted on March 20, 2008
REFERENCES
Altschul S, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.
Altschul S, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Atomi H, et al. Reverse gyrase is not a prerequisite for hyperthermophilic life. J. Bacteriol (2004) 186:4829–4833.
Barker D, et al. Constrained models of evolution lead to improved prediction of functional linkage from correlated gain and loss of genes. Bioinformatics (2007) 23:14–20.
Bork P, et al. Predicting function: from genes to genomes and back. J. Mol. Biol (1998) 283:707–725.
Bowers P, et al. Use of logic relationships to decipher protein network organization. Science (2004) 306:2246–2249.
Bowles L, Ellefson W. Effects of butanol on Clostridium acetobutylicum. Appl. Environ. Microbiol (1985) 50:1165–1170.
Brock TD. Life at high temperatures. Science (1985) 230:132–138.
Brock TD. Life at high temperatures (1994) (last accessed on October 16, 2007). Available at http://www.bact.wisc.edu/bact303/b1.
Chen L, Vitkup D. Predicting genes for orphan metabolic activities using phylogenetic profiles. Genome Biol (2006) 7:R17.
Cokus S, et al. An improved method for identifying functionally linked proteins using phylogenetic profiles. BMC Bioinformatics (2007) 8(Suppl. 4):S7.
Confalonieri F, et al. Reverse gyrase: a helicase-like domain and a type I topoisomerase in the same polypeptide. Proc. Natl Acad. Sci. USA (1993) 90:4753–4757.
Dandekar T, et al. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci (1998) 23:324–328.
Date S, Marcotte E. Discovery of uncharacterized cellular systems by genomewide analysis of functional linkages. Nat. Biotechnol (2003) 21:1055–1062.
David J, et al. High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proc. Natl Acad. Sci. USA (2008) 105:934–939.
Edwards K, et al. An archaeal iron-oxidizing extreme acidophile important in acid mine drainage. Science (2000) 287:1796–1799.
Enright A, et al. Protein interaction maps for complete genomes based on gene fusion events. Nature (1999) 402:86–90.
Forterre P, et al. High positive supercoiling in vitro catalyzed by an ATP and polyethylene glycol-stimulated topoisomerase from Sulfolobus acidocaldarius. EMBO J (1985) 4:2123–2128.
Forterre P, et al. Reverse gyrase from hyperthermophiles: probable transfer of a thermoadaptation trait from archaea to bacteria. Trends Genet (2000) 16:152–154.
Gottwald M, Gottschalk G. The internal pH of Clostridium acetobutylicum and its effect on the shift from acid to solvent formation. Arch. Microbiol (1985) 143:42–46.
Holte R. Very simple classification rules perform well on most commonly used datasets. Mach. Learn (1993) 11:63–91.
Huang S, et al. PGTdb: a database providing growth temperatures of prokaryotes. Bioinformatics (2004) 20:276–278.
Huynen M, Snel B. Gene and context: integrative approaches to genome analysis. In: Bork P, ed. Analysis of Amino Acid Sequences (2000) San Diego, CA: Adv. Prot. Chem. Academic Press. 345–379.
Huynen M, et al. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res (2000) 10:1204–1210.
Jim K, et al. A cross-genomic approach for systematic mapping of phenotypic traits to genes. Genome Res (2004) 14:109–115.
Kelly R, Adams M. Metabolism in hyperthermophilic microorganisms. Antonie van Leeuwenhoek (1994) 66:247–270.
Klinger C, et al. Thermophile-specific proteins: the gene product of aq 1292 from Aquifex aeolicus is an NTPase. BMC Biochemistry (2003) 4:12.
Krulwich T, Guffanti A. Alkalophilic bacteria. Annu. Rev. Microbiol (1989) 43:435–463.
Levesque M, et al. Trait-to-gene: a computational method for predicting the function of uncharacterized genes. Curr. Biol (2003) 13:129–133.
Makarova K, et al. A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res (2002) 30:482–496.
Marcotte E, et al. Detecting protein function and protein-protein interactions from genome sequences. Science (1999) 285:751–753.
Morita R, Moyer C. Psychrophiles, origin of. In: Levin AS, ed. Encyclopedia of Biodiversity (2004) New York: Elsevier. 2000, pp. 917–924, ISBN 9780122268656, doi:10.1016/B0-12-226865-2/00362-X.
Osterman A, Overbeek R. Missing genes in metabolic pathways: a comparative genomics approach. Curr. Opin. Chem. Biol (2003) 7:238–251.
Overbeek R, et al. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA (1999) 96:2896–2901.
Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng (2001) 14:609–614.
Pellegrini M, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA (1999) 96:4285–4288.
Reichard K, Kaufmann M. EPPS: mining the COG database by an extended phylogenetic patterns search. Bioinformatics (2003) 19:784–785.
Shannon C. A mathematical theory of communication. Bell Syst. Tech. J (1948) 27:379–423, 623–656.
Slonim N, et al. Ab initio genotype–phenotype association reveals intrinsic modularity in genetic networks. Mol. Syst. Biol (2006) doi:10.1038/msb4100047.
Snel B, et al. Genome evolution: gene fusion versus gene fission. Trends Genet (2000) 16:9–11.
Sun J, et al. Phylogenetic profiles for the prediction of protein-protein interactions: how to select reference organisms? Biochem. Biophys. Res. Commun (2007) 353:985–991.
Tanner M. Understanding nature's strategies for enzyme-catalyzed racemization and epimerization. Acc. Chem. Res (2002) 35:237–246.
Tatusov R, et al. A genomic perspective on protein families. Science (1997) 278:631–637.
Tatusov R, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res (2001) 29:22–28.
Vert J. A tree kernel to analyse phylogenetic profiles. Bioinformatics (2002) 18:S276–S284.
Wu J, et al. Identification of functional links between genes using phylogenetic profiles. Bioinformatics (2003) 19:1524–1530.
Zhou Y, et al. Inferring functional linkages between proteins from evolutionary scenarios. J. Mol. Biol (2006) 359:1150–1159.


