Bioinformatics Advance Access originally published online on September 3, 2004
Bioinformatics 2005 21(3):293-306; doi:10.1093/bioinformatics/bti015
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics


The Institute for Genomic Research 9712 Medical Center Drive, Rockville, MD 20850, USA
*To whom correspondence should be addressed.
Abstract
Motivation: The presence or absence of metabolic pathways and structures provides a context that makes protein annotation far more reliable. Compiling such information across microbial genomes improves the functional classification of proteins and provides a valuable resource for comparative genomics.
Results: We have created a Genome Properties system to present key aspects of prokaryotic biology using standardized computational methods and controlled vocabularies. Properties reflect gene content, phenotype, phylogeny and computational analyses. The results of searches using hidden Markov models allow many properties to be deduced automatically, especially for families of proteins (equivalogs) conserved in function since their last common ancestor. Additional properties are derived from curation, published reports and other forms of evidence. The Genome Properties system was applied to 156 complete prokaryotic genomes, and is easily mined to find differences between species, correlations between metabolic features and families of uncharacterized proteins, or relationships among properties.
Availability: Genome Properties can be found at http://www.tigr.org/Genome_Properties
Contact: selengut{at}tigr.org
Supplementary information: http://www.tigr.org/tigr-scripts/CMR2/genome_properties_references.spl
INTRODUCTION
Assigning names to all predicted proteins in a complete genome, even when carried out with exquisite accuracy, provides only an initial layer of understanding of the activities in a cell. For example, competence genes (comA-G) participate in cellular competency, where competency refers to a cell's ability to take up extracellular DNA and incorporate it into the bacterium's own chromosome. However, many com genes are also found in organisms such as Escherichia coli that do not exhibit natural competency. It has been suggested that the com operon is part of a mechanism for using extracellular DNA as a nutrient source, which is evident in E.coli and in many other species (Finkel and Kolter, 2001). The point is that the cellular role of com genes cannot be determined in isolation; genome annotation is made more complete when individual genes are placed in the context of metabolic pathways, coordinated cellular activities or cellular structures. This secondary layer of biological description provides a more complete and contextually rich picture of biological processes.
We call objects in this secondary layer genome properties. A genome property is a single assertion (a numerical value, a truth state such as 'Yes' or 'No', or a controlled vocabulary term such as 'facultative anaerobic') for some attribute, applied to a single completely sequenced genome. The attribute may be a metabolic capability such as tryptophan biosynthesis from chorismate, a physical feature such as outer membrane, a taxonomic classification or a calculated value such as GC content (Table 1).
In this paper, we describe a computational analysis system for the investigation of genome properties for completely sequenced prokaryotes. The goals of the project are 5-fold. First, to create a repository of property assertions for each species, whether taken from the scientific literature or produced during genome annotation. Second, to increase the accuracy and richness of genome annotation. Third, to provide concise species summaries with controlled vocabularies suitable for comparative analyses. Fourth, to create a research tool for hypothesis generation by any of several techniques, including phylogenetic profiling (Pellegrini et al., 1999). Fifth, to create a compendium of knowledge concerning microbial genome properties by linking the underlying data with brief scholarly descriptions, primary and secondary literature references and relevant websites.
Assertions in Genome Properties can be made either manually, through the intervention of a human curator, or automatically, using the results of precomputed analyses, such as hidden Markov models (HMMs) from the Pfam (Bateman et al., 2004) and TIGRFAMs (Haft et al., 2003) databases. Many TIGRFAM HMMs are built to segregate larger protein families into smaller subfamilies, termed equivalogs (Haft et al., 2001), in which all members share the same specific function. Combinations of HMMs can be used as evidence that a genome contains the complete set of enzymes from a pathway or subunits from a protein complex. Utilizing HMM evidence and other sequence analyses allows many properties to be set automatically by rules encoded within the Genome Properties system; these rules may be applied even to unannotated genomes, as soon as HMM search results are available.
Although it is not restricted to the generation of metabolic pathways, Genome Properties joins a number of other projects that combine metabolic reconstruction and comparative genomics. Other examples of metabolic reconstruction methods include EcoCyc, which provides a metabolic map of E.coli based on exhaustive literature searching and expert curation (Karp et al., 2002a). EcoCyc's pathway definitions can be used to find corresponding pathways, based on matching protein annotations, in other species (Karp et al., 2002c). Its companion database, MetaCyc (Karp et al., 2002b), confines itself to experimentally verified instances of pathways in other species but expands the set of model pathways considerably. Given annotation of sufficient quality, a fairly robust metabolic reconstruction can be performed. The KEGG database (Kanehisa et al., 2004) presents detailed representations of sets of interconnected pathways, containing multiple species with different subsets of each pathway. The KEGG system uses bi-directional best hits across genomes to identify probable orthologs and functionally equivalent proteins, with manual curation, which has led to fairly extensive prediction of the presence or absence of pathways and annotation of the indicated enzyme components. WIT (Overbeek et al., 2000), which is no longer publicly available, used a stringent protein clustering scheme related to the bi-directional best hit heuristic in COGs (Tatusov et al., 2003), combined with human curation, to detect the presence of pathways and their variants across a wide collection of genomes. ERGO (Overbeek et al., 2003) is a commercial successor to WIT. All these efforts improve on the picture of metabolism that might be understood from the annotated gene list alone.
Other websites provide analyses such as TransportDB, which contains data on transporters (Ren et al., 2004); SACSO, which contains comparative analysis of completely sequenced organisms including base composition, amino acid composition, ancestral duplication and ancestral conservation (Tekaia et al., 2002); and Comparative Genometrics, which contains nucleotide composition data (Roten et al., 2002).
Genome Properties currently contains >17 000 property assertions (Table 2). The interface created for Genome Properties allows facile searching of these data and makes it possible to compare multiple properties over multiple genomes. The inclusion of taxonomic properties allows searches and comparisons to be easily restricted to certain phylogenetic clades. Genes mapped to specific properties are linked directly to the Comprehensive Microbial Resource (CMR) (Peterson et al., 2001), which provides an additional set of analyses of whole genomes and their encoded proteins for comparison, summarization and investigation.
The computational system for Genome Properties will be released in the near future as open source.
SYSTEMS AND METHODS
Genomic source data
The input to the Genome Properties system is data from the CMR (Peterson et al., 2001). The CMR uses a relational database called the Omniome to store all data associated with the completed prokaryotic genomes. The Omniome provides primary sequence data and annotations that mirror those contained in GenBank. It also provides a second set of protein predictions carried out using GLIMMER (Delcher et al., 1999) with automated annotation. For genomes sequenced at The Institute for Genomic Research (TIGR), only the primary annotation is used. This set of protein predictions is made initially using GLIMMER but is then curated extensively by human annotators who adjust start sites, remove spurious gene calls and add missing genes using homology evidence. A gene is considered present in a genome if it appears in either the primary or the secondary gene lists. The CMR is updated continuously with new genomes as they become available.
Data model
Storage for the Genome Properties system is implemented using a relational database. Entities represented in separate tables in the database include definitions of the properties, relationships between properties, links to external sources of information related to the properties and the assertions made for those properties. For assertions that can be made by rules, Genome Properties tables represent the components (metabolic steps, structural subunits or other hallmarks) that signify the property and the evidence for identifying those components. Forms of evidence used in the current release include HMM scores, manual assignments of proteins to HMM families or to specific enzymatic functions, tRNA predictions by tRNAscan-SE (Lowe and Eddy, 1997), DNA features and coding region attributes such as selenocysteine codons (Stadtman, 1987) and programmed frameshifts (Farabaugh, 2000).
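The entities listed above could be laid out roughly as in the following sketch, which uses Python's built-in sqlite3 module. The paper does not publish its schema, so every table and column name here is hypothetical, and the inserted row is a placeholder rather than real data.

```python
import sqlite3

# Illustrative schema only: every table and column name below is hypothetical.
SCHEMA = """
CREATE TABLE property (
    property_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    type        TEXT NOT NULL      -- e.g. 'pathway', 'system', 'phenotypic'
);
CREATE TABLE property_relationship (
    parent_id INTEGER NOT NULL REFERENCES property(property_id),
    child_id  INTEGER NOT NULL REFERENCES property(property_id)
);
CREATE TABLE external_link (
    property_id INTEGER NOT NULL REFERENCES property(property_id),
    source      TEXT NOT NULL,     -- e.g. 'PubMed', 'KEGG'
    identifier  TEXT NOT NULL
);
CREATE TABLE component (
    component_id INTEGER PRIMARY KEY,
    property_id  INTEGER NOT NULL REFERENCES property(property_id),
    name         TEXT NOT NULL,
    required     INTEGER NOT NULL DEFAULT 1   -- 0 marks an optional component
);
CREATE TABLE evidence (
    component_id INTEGER NOT NULL REFERENCES component(component_id),
    method       TEXT NOT NULL,    -- e.g. 'HMM', 'tRNA prediction', 'manual'
    accession    TEXT
);
CREATE TABLE assertion (
    property_id INTEGER NOT NULL REFERENCES property(property_id),
    genome      TEXT NOT NULL,
    value       TEXT NOT NULL,     -- number, truth state or vocabulary term
    PRIMARY KEY (property_id, genome)
);
"""


def create_database(path=":memory:"):
    """Create an empty database holding the illustrative schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
# Placeholder rows, not taken from the paper.
conn.execute("INSERT INTO property VALUES (1, 'example pathway', 'pathway')")
conn.execute("INSERT INTO assertion VALUES (1, 'example genome', 'Yes')")
row = conn.execute(
    "SELECT p.name, a.genome, a.value "
    "FROM assertion AS a JOIN property AS p USING (property_id)"
).fetchone()
print(row)
```

Keeping components and evidence in separate tables mirrors the description above, in which a component may be identified by more than one form of evidence.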
Property types
Properties are divided conceptually into a number of types.
Taxonomic properties reflect seven hierarchical levels of phylogenetic classification (superkingdom, phylum, class, order, family, genus and species) derived from the NCBI Taxonomy database (Wheeler et al., 2004). Also included in this type is the property TaxID, which stores the unique id that the NCBI has associated with each sequenced strain and allows us to maintain consistency between Genome Properties and NCBI.
Phenotypic properties reflect directly observable data (e.g. oxygen requirement, human pathogen or optimal growth temperature), and are not necessarily derivable in any simple way from the content of the genome. Phenotypic assertions are set manually. The concepts associated with phenotypic assignments vary widely and are not easily subjected to a unified classification scheme; we have used a controlled vocabulary for phenotypes whenever possible. Unlike properties whose states or values are set by automatic processes, not every genome may have a curated state for every phenotypic property.
Calculated properties are produced by computational analysis on the DNA sequence or the set of predicted proteins, and may include numerical or string values depending on the nature of the metric (e.g. GC content: 50%).
Pathway and system properties represent groups of genes that work together in some way. A pathway is composed of proteins that are able to perform a set of consecutive enzymatic steps to generate one or more products from some set of starting materials (e.g. glutathione biosynthesis). A system is analogous but broader, describing elements that work together but are not necessarily in a metabolic pathway (e.g. PTS transport system and nucleotide excision repair). In some cases these assertions are made manually, in others they are made as the result of rules (see below). Rules are used to automatically detect the components based on specified evidence stored in database tables.
Category properties organize other genome properties in a hierarchy (e.g. Sulfur metabolism and Biological niche).
Summary properties serve an organizational role in the same manner as category properties, but also hold a value for each genome that represents a summary, consensus or average of those properties that are below it in the hierarchy. These values are assigned either by the Genome Properties rules interpreter (see below) or by the separate algorithms written for each summary property (e.g. IPP biosynthesis summarizes IPP biosynthesis via mevalonate and IPP biosynthesis via deoxyxylulose).
The current set of Genome Properties as evaluated for Haemophilus influenzae KW20 Rd (Fleischmann et al., 1995) is shown in Table 1, which contains examples of the property types described above.
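As an illustration of a calculated property, GC content can be derived directly from the DNA sequence. The sketch below is a minimal example and not the system's own code; the function name is hypothetical, and ignoring ambiguous bases such as N is an assumption made here.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the A, C, G and T bases.

    Ambiguous bases (e.g. N) are ignored, which is an assumption of this
    sketch rather than a documented behaviour of Genome Properties.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# Made-up sequence: 4 of the 8 unambiguous bases are G or C.
print(f"GC content: {gc_content('ATGCGCNNTA'):.1f}%")  # GC content: 50.0%
```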
Evidence: the input data to Genome Properties
More than 30 types of sequence analysis data that have been generated from a computational pipeline are stored in the Omniome, including results of topological and structural prediction programs and various homology-based search results. Evidence for non-protein features such as tRNA molecules, programmed (i.e. genuine) frameshifts or repeat regions also serves as input to the Genome Properties system. The primary source of information for Genome Properties is the search results of TIGRFAM (Haft et al., 2003) and Pfam (Bateman et al., 2004) HMMs. The utility of the TIGRFAMs database for identifying protein functions has been described previously (Haft et al., 2003). Each of its models is labeled according to the functional diversity of the family of proteins it describes. Out of over 2000 models in TIGRFAMs, more than half are of type equivalog, meaning that all proteins found by the model are presumed to share the same primary function with each other and with their last common ancestor. The definition of equivalog differs from the term ortholog (Fitch, 1970) in two ways. First, it shows conserved function explicitly; functionally distinct offshoots of the protein family are excluded. Second, it allows the inclusion of laterally transferred genes, in contrast to the formal requirement of orthology that sequences are derived only by speciation. We have found that some Pfam models describe families all of whose members are functionally identical. These Pfam models may be used in the same way as TIGRFAMs equivalog models in the building of rules. Any score above the trusted cutoff of an HMM assigns the target sequence to that protein family and shows the family to be represented in the corresponding genome. Other protein classification systems such as PROSITE (Hulo et al., 2004) could also serve as sources of evidence but have not been used to date.
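The trusted-cutoff criterion described above can be sketched as a simple filter over HMM search hits. The function name, accessions, protein identifiers and scores below are all hypothetical. Treating a score exactly equal to the cutoff as passing is an assumption of this sketch; the text says only that scores above the cutoff qualify.

```python
from collections import defaultdict


def families_in_genome(hits, trusted_cutoffs):
    """Map each HMM accession to the proteins that pass its trusted cutoff.

    hits: iterable of (protein_id, hmm_accession, score) tuples.
    trusted_cutoffs: dict of hmm_accession -> trusted cutoff score.
    """
    members = defaultdict(list)
    for protein, hmm, score in hits:
        cutoff = trusted_cutoffs.get(hmm)
        # Assumption: a score equal to the cutoff is counted as passing.
        if cutoff is not None and score >= cutoff:
            members[hmm].append(protein)
    return dict(members)


# Made-up hits for two hypothetical models.
hits = [
    ("prot1", "HMM_A", 310.5),
    ("prot2", "HMM_A", 42.0),   # below the cutoff, so not assigned
    ("prot3", "HMM_B", 95.0),
]
cutoffs = {"HMM_A": 150.0, "HMM_B": 80.0}
found = families_in_genome(hits, cutoffs)
print(found)
```

A family counts as represented in the genome whenever its accession appears as a key in the returned dictionary.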
Definition and implementation of rules for system and pathway properties
We derive our system and pathway descriptions from compilations of biochemical pathways (Michal, 1999; Kanehisa et al., 2004), descriptions in entries from protein annotation databases (Boeckmann et al., 2003; McGarvey et al., 2000) and directly from the scientific literature. System and pathway properties are represented in our database as a list of components (enzymes, RNA or DNA elements, etc.) and, for each of these components, a list of the types of evidence (HMMs, EC numbers, etc.) that may be used to identify it.
Although we strive to adhere to definitions that are consistent with the norms accepted by the scientific community, our primary goal is to identify properties that result in unambiguous assertions for the greatest number of genomes possible. Consequently, subsets of canonical pathways/systems may be defined as separate Genome Properties when a significant number of species contain those subsets in isolation.
For some cases, pathways and systems are described in our data model as alternative subsets of components with logical OR relationships. This is because, for example, certain organisms utilize acetylated intermediates while others utilize succinylated intermediates, and still others act without esterification. Different enzymes (and even different numbers of enzymes) catalyze these different steps. We also encounter situations where a component of a system or pathway may not be detected in a genome of an organism in which that property is present. In some of these cases, a component may play a non-essential role in the process and occur in a subset of lineages. In other cases, a particular step may be known to be essential but the identity of the component that fulfills that role may not yet have been discovered in all or a subset of species. In such cases, our representation allows for components of a Genome Property to be specified as optional.
Together, the lists of components, evidence types, required/optional flags, Boolean relationships and threshold values represent rules, which are used to evaluate individual properties and make assertions about their presence or absence in a particular genome. A stand-alone program evaluates these rules in the following five steps: (1) the evidence describing the property is read, (2) the genome is scanned for features that contain these types of evidence, (3) the features are recorded in the database, (4) the list of identified features and the required components are evaluated in accordance with the logical structure of the rule and (5) the proper assertion is written to the Genome Properties table.
System and pathway property assertions are stored as a controlled vocabulary in the Genome Property table as follows. 'Yes' indicates that all components have been identified, or that the existence of the property has been experimentally determined or asserted by expert curation. 'some evidence' indicates that a number of components have been identified, but one or more are not in evidence, so that the presence of the property is possible. 'not supported' indicates that some components are found but they are insufficient in number to argue that the property is present. The boundary between 'some evidence' and 'not supported' is determined separately for each property by assigning a threshold value. 'none found' indicates that no components were identified, and 'No' indicates that either a curator has evaluated a 'none found' or a 'not supported' assertion and concurred, or that the absence of a functional version of the system or pathway was verified experimentally.
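The rule structure and controlled vocabulary described above can be sketched as follows. This is a simplified illustration, not the stand-alone program itself. The class and function names are hypothetical, the threshold is interpreted here as the minimum number of required components that must be found for 'some evidence', and OR relationships between alternative subsets of components are not modeled. The 'No' state is omitted because it is set by curation or experiment rather than by rules.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    evidence: set          # accessions of acceptable evidence; any one suffices
    required: bool = True  # False marks an optional component


@dataclass
class Rule:
    property_name: str
    components: list
    threshold: int         # assumed: minimum required components for 'some evidence'


def evaluate(rule, genome_evidence):
    """Return the assertion for one property in one genome.

    genome_evidence: set of evidence accessions detected in the genome.
    Assumes the rule has at least one required component.
    """
    required = [c for c in rule.components if c.required]
    found = [c for c in required if c.evidence & genome_evidence]
    any_found = any(c.evidence & genome_evidence for c in rule.components)
    if required and len(found) == len(required):
        return "Yes"
    if not any_found:
        return "none found"
    if len(found) >= rule.threshold:
        return "some evidence"
    return "not supported"


# Hypothetical rule: three required steps and one optional step.
rule = Rule(
    property_name="example pathway",
    components=[
        Component("step 1", {"HMM_1"}),
        Component("step 2", {"HMM_2", "HMM_2b"}),  # two alternative models
        Component("step 3", {"HMM_3"}),
        Component("step 4", {"HMM_4"}, required=False),
    ],
    threshold=2,
)

for detected in ({"HMM_1", "HMM_2b", "HMM_3"}, {"HMM_1", "HMM_2"}, {"HMM_1"}, set()):
    print(sorted(detected), "->", evaluate(rule, detected))
```

Because optional components never block a 'Yes', a step that cannot yet be detected in many species can be marked optional without changing the assertion for genomes where every other step is found.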
Genome Properties curation
For many genomes, assertions are assigned unambiguously, reflecting that all components or no components are found. For other genomes, the distribution of evidence is examined in a manual process. Typically, curation adjusts the two assertions 'not supported' and 'some evidence'. Promotions of the states 'none found' and 'not supported' to 'No' are performed only when some other line of reasoning exists, such as experimental evidence, disrupted genes or alternative pathways. Genes that are expected to appear in otherwise complete pathway reconstructions or systems may be undetectable due to highly divergent sequences, missed annotation of the correct open reading frame, or truncated, interrupted or inactivated genes. In such cases we manually evaluate homology criteria, molecular phylogeny, metabolic context and gene clustering. Often, this will result in the identification of the missing gene. We may also alter system and pathway rules in response to manual review, typically when spurious hits occur for HMMs that perform imperfectly, in which case the existing HMMs are improved. It may be that a protein family for some component appears only in certain lineages, in which case new models are created to provide alternative evidence for that component across some species range.
Genome Properties interface
Our interface consists of a series of web pages within the CMR. These include a home page (http://www.tigr.org/Genome_Properties), a search page, property definition pages and data display pages for comparison among genomes and properties as well as detailed displays for individual properties in single genomes. The search page allows a user to extract slices of the Genome Properties data based on one or more selected genomes, properties and/or property states including the states of taxonomic properties (Fig. 1A). Property definition pages include summaries, lists of components, the methods used to identify them, links to relevant literature through PubMed and external databases such as EcoCyc or KEGG (Fig. 1B). Comparison display pages show property assertions displayed as a table across the genomes selected by the user. Data displays for individual properties as evaluated for particular genomes include lists of all the genes mapped to each of the components of the property, the evidence by which each gene was identified as a component and the regional genomic context of each gene (Fig. 1C).
RESULTS AND DISCUSSION
TIGRFAM accuracy and overall genome coverage
Much of our metabolic reconstruction system is based on searches of TIGRFAM HMMs against prokaryotic proteins. We consider searches using TIGRFAMs to be highly accurate because, in our experience, annotators who manually evaluate TIGRFAM search results during bacterial genome annotation rarely detect false positives. In a controlled experiment to evaluate TIGRFAM accuracy, 50 equivalog HMMs were tested against the current release of Swiss-Prot (Boeckmann et al., 2003). The Swiss-Prot database has been subjected to years of manual curation by trained experts and is used by academic institutions throughout the world. For the purpose of this comparison, it was treated as a dataset containing an independently derived standard of truth. TIGRFAM HMM scores above the curated trusted cutoff were collected. A total of 1016 proteins were identified, and the assertions of function by TIGRFAM and by the Swiss-Prot annotation were compared by manual evaluation. Only two assignments of function conflicted, a difference traceable to a TIGRFAM annotation that used an outdated publication.
The overall coverage of our dataset was estimated as follows: the results of searching TIGRFAM HMMs against the genes in 144 completed bacterial genomes indicate that 13–28% of all genes in a typical bacterial genome receive an automatic functional assignment from a TIGRFAM equivalog HMM. This corresponds to roughly half of all the genes in a typical genome that can receive a specific functional assignment. These data are displayed for a representative selection of genomes in Figure 2.
Genome Property assertions
We have defined 172 properties (Table 1) resulting in over 17 500 property assertions after their application to 145 completed prokaryotic genomes (Table 2). Nearly 4000 property assertions result from direct computation such as DNA GC content and count of predicted proteins. Over 2000 property assertions have been made manually using information derived from external sources for properties such as optimal pH, chemotaxis and phylum. We encourage the submission of further literature-supported assertions of phenotypic data through our website where such data are currently absent.
The remaining assertions are for pathway properties such as tryptophan biosynthesis from chorismate, and system properties such as Tat (Sec-independent) protein export, which utilize autonomous rules that evaluate HMM search results and other stored evidence (see Systems and methods section). Properties like these may weigh evidence from >30 HMMs, but most rules built so far weigh evidence for between three and eight components. Rules may assign the state 'Yes' when all required components are identified, 'none found' in the absence of any components, 'some evidence' when no more than a specified number of components is missing and 'not supported' when fewer than this number are identified.
Properties assigned to the extreme states 'Yes', 'not supported', 'none found' and 'No' outnumber those in the intermediate state 'some evidence' by more than 10:1. 'Yes' assignments for properties map genes to specific biological processes in a way that context-independent functional identification of the protein alone may not (see below). Currently, over 40 000 entries from the CMR are linked as evidence for entries in Genome Properties. Over 34 000 of these contribute to 'Yes' states for their respective properties.
Table 3 shows the components of the rule for the selenocysteine incorporation property and the application of that rule to the genome of H. influenzae KW20 Rd (Fleischmann et al., 1995). This example illustrates the diversity of both protein function and genomic evidence that may contribute to a property. Three types of genomic features are required: genes encoding two enzymes and a translation factor, the selenocysteine tRNA and an example of a protein that incorporates selenocysteine at a UGA codon. Evidence is provided by HMM search results, tRNA detection (Lowe and Eddy, 1997) and annotation of a selenoprotein translation exception. A single protein (HI0200) fills two requirements of the rule: a selenoprotein example and the enzyme selenophosphate synthase.
Validation of evidence-based Genome Properties assertions
Potentially, the accuracy of the Genome Properties system may be very high in that it uses HMM-based rules that were developed by expert curation. We evaluated the accuracy of this system by several approaches. First, we applied the system to the bacterium Corynebacterium glutamicum ATCC 13032. This organism was chosen because it was sequenced recently (Kalinowski et al., 2003) and therefore does not appear in many of the seed alignments of TIGRFAM HMMs. The bacterium is an industrially important source of lysine and glutamate and has been characterized extensively in the experimental literature. Application of Genome Properties to the C.glutamicum genome sequence resulted in 36 'Yes' assertions, 29 of which were supported in published reports on this organism's metabolism (see Supplementary Table S1). No literature reference could be identified that contradicted one of our assertions. Literature sources were also used to identify organisms like Corynebacterium that grow in the absence of supplemented amino acids, proteins or peptides. These organisms are presumed to have functional pathways for the biosynthesis of all amino acids.
Similarly, organisms that lack amino acid biosynthesis pathways should be restricted to environments rich in amino acids and peptides (e.g. obligate intracellular pathogens such as Chlamydia trachomatis). Literature-based support of amino acid biosynthesis Genome Properties for 77 genera is summarized in Supplementary Table S2. Of those pathways and organisms listed in Table S2, 615 positive assertions are expected. Genome Properties asserts 'Yes' in 583 of these cases and 'some evidence' in an additional 26 cases (609 of 615, or 99% overall success). The remaining 6 cases involve proline biosynthesis, mainly in Archaea, where an alternative pathway has been proposed but not yet characterized (Graupner and White, 2001).
Occasionally, literature scans will identify cases where Genome Properties has asserted the presence of a property, yet specific laboratory tests have failed to observe the associated phenotype. For instance, Lactococcus lactis appears to have complete pathways for the biosynthesis of several amino acids that are nonetheless required to be present in the media for cell growth. In this case, the sequenced L.lactis is an industrial strain used for the production of cheese and may have recently (over the course of laboratory isolation) stopped expressing these enzymes (Bolotin et al., 2001). This type of false positive assertion, when identified, is flagged by changing the state of the property from 'Yes' to 'Cryptic'. 'Cryptic' states alert users to conflicts between genomic content (about which Genome Properties makes assertions) and expressed phenotypes, which may depend on the nature of the experimental system and factors outside the scope of the Genome Property.
In certain cases, an essential metabolic function can be fulfilled by two or more independent pathways or systems. Genome Properties can be self-validated in these cases by observing that at least one of these properties should be present (complementarity). For example, proton-gradient energized ATPases are essential for cellular life and come in two types, the F1-F0 ATPase and the V-type ATPase. Every genome should have at least one of these systems, and in fact, Genome Properties finds this to be true in all cases, with only five examples of genomes containing both systems. These and similar results for IPP and lysine biosynthesis are presented in Table 4.
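The complementarity check described above amounts to counting, for each genome, how many of the alternative properties are asserted 'Yes'. The sketch below shows one way to do this. The function name and genome labels are hypothetical and the assertions are made-up placeholders, not values from Table 4.

```python
def complementarity_report(assertions, alternatives):
    """Group genomes by how many alternative properties are asserted 'Yes'.

    assertions: dict of genome -> dict of property name -> state.
    alternatives: property names of which at least one is expected.
    """
    report = {"none": [], "one": [], "several": []}
    for genome, states in assertions.items():
        present = [p for p in alternatives if states.get(p) == "Yes"]
        if not present:
            key = "none"
        elif len(present) == 1:
            key = "one"
        else:
            key = "several"
        report[key].append(genome)
    return report


# Placeholder assertions for three hypothetical genomes.
assertions = {
    "genome_A": {"F1-F0 ATPase": "Yes", "V-type ATPase": "none found"},
    "genome_B": {"F1-F0 ATPase": "none found", "V-type ATPase": "Yes"},
    "genome_C": {"F1-F0 ATPase": "Yes", "V-type ATPase": "Yes"},
}
report = complementarity_report(assertions, ["F1-F0 ATPase", "V-type ATPase"])
print(report)
```

Any genome listed under "none" would point to a missed component, an incomplete rule or an undiscovered alternative, and would be a natural target for curation.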
Genome Properties was benchmarked against KEGG, which employs an independent methodology for identifying the components of pathways and systems. KEGG does not explicitly assert the presence or absence of a complete pathway or system. However, where all the steps in a pathway were present in the KEGG database for an organism, we inferred that KEGG was asserting the presence of that pathway. Table 5 indicates that four pathways shared by KEGG and Genome Properties were in agreement for 95% of the organisms tested. The only differences occurred when KEGG did not identify a component found by Genome Properties. In each of these 17 cases, the assignment made by Genome Properties was supported by multiple lines of evidence such as HMM scores, multiple sequence alignments and co-localization with functionally related genes. In a number of cases, it appeared that KEGG's system did not identify a complete pathway because of a missed gene call in the original annotation. Such genes are often identified during Genome Properties curation by searches of genes against the genomic DNA rather than relying on the primary set of predicted proteins. In no instance did KEGG assert the presence of a component that was not also identified by Genome Properties.
Comparative Genometrics with Genome Properties
An example of Genome Properties used as a comparative tool for chorismate-associated biosynthetic pathways across many species is shown in Table 6. These pathways tend to be conserved for members of any given genus but exceptions to such phylogenetic patterns often prove interesting. Staphylococcus aureus has the Tat (Sec-independent) protein export system while Staphylococcus epidermidis lacks it. Examination showed that the Tat translocases in S.aureus are encoded adjacent to their lone target, suggesting lateral gene transfer of a cassette composed of a Tat translocase together with its target, a gene containing an N-terminal Tat signal sequence.
Missing components of genome properties
The property histidine biosynthesis from PRPP consists of 10 enzymatic steps. Currently, all 10 enzymes have been identified in 43 published genomes. In an additional 56 genomes only the ninth step, histidinol-phosphate phosphatase (HisB), is not found. In the Genome Property rule for this pathway, the ninth step is treated as an optional element (although the activity is surely required for the pathway) due to our current inability to detect it in many species. The lack of universal detection of this step does not change the overall quality of the assertion that the pathway is complete. Most probably, this step is carried out by enzymes from a number of non-orthologous gene families (Koonin et al., 1996), only two of which have been characterized and modeled by HMMs. The list of organisms carrying out histidine biosynthesis but lacking an identified hisB gene may be a useful starting point for investigations aimed at identifying novel hisB gene families.
One method of identifying such non-orthologous families (Osterman and Overbeek, 2003) involves looking for candidate genes that are nearby along the chromosome (Overbeek et al., 1999). In the case of the histidine biosynthesis property, a gene annotated as Inositol monophosphatase-like protein (due to its membership in a Pfam family, PF00459) is adjacent to the gene encoding the identified Step 8 of the pathway, histidinol phosphate aminotransferase (hisC), in Synechocystis species PCC6803 (loci NTL01SS01282 and NTL01SS01283, respectively). The Genome Properties interface allows one to view such information easily. The branch of the PF00459 family containing this gene includes genes from 18 other published bacterial genomes (including Actinobacteria, Alphaproteobacteria, Pirellula sp. strain 1 and Pseudomonas putida), all of which contain every step of the histidine biosynthesis pathway except HisB. Although no other published genome shows gene clustering of this phosphatase with histidine biosynthesis genes, it is observed in two unpublished genomes being finished at TIGR, Myxococcus xanthus DK 1622 and Fibrobacter succinogenes S85 (data not shown). It seems that this family is a strong candidate for the HisB enzyme in these genomes and warrants experimental characterization. This family of putative HisB enzymes has been modeled by a TIGRFAMs HMM (TIGR02067).
Mapping of process information onto protein annotations
The Gene Ontology (GO) (Harris et al., 2004) database has proven to be a versatile system for categorizing information pertaining to the functions, physical localizations and processes of genes. TIGR annotators have recently begun adding GO terms to protein annotations where possible. From this experience, we have learned that the assignment of GO function terms is analogous to the process of gene name annotation and is relatively straightforward. The association of GO process terms, however, is more labor intensive during gene-by-gene annotation, because a protein may serve different processes in different species. To address this issue, Genome Properties maps GO terms to each of the components of rules-based properties. When a property state is set to 'Yes', the genes corresponding to components of the system are assigned GO process terms automatically. In the case of metabolic pathways that terminate in the production of branch-point metabolites, the presence of downstream pathways with 'Yes' states results in the transitive application of GO process references to the components of the upstream pathways. For instance, most of the genomes listed in Table 1 contain all the components of the chorismate biosynthesis pathway, but chorismate is further used for a variety of different purposes depending on the genomic context. Genes of the chorismate pathway will receive GO-IDs corresponding to those processes active in that particular organism.
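The transitive assignment of process terms can be sketched as below. The data layout (a dictionary with `genes`, `go` and `downstream` keys), the function name `go_terms_for_genes` and the GO identifiers are all assumptions made for illustration; the identifiers are deliberately written as placeholders rather than real GO accessions, and the gene names are only examples.

```python
def go_terms_for_genes(properties, states):
    """Map each gene to the GO process terms implied by 'Yes' properties.

    properties: name -> {"genes": [...], "go": term, "downstream": [names]}
    states: name -> property state for one genome.
    Genes of an upstream pathway also receive the terms of every
    downstream pathway that is itself in state 'Yes'.
    """
    assigned = {}
    for name, prop in properties.items():
        if states.get(name) != "Yes":
            continue
        terms = {prop["go"]}
        stack = list(prop.get("downstream", []))
        seen = set()
        while stack:
            nxt = stack.pop()
            if nxt in seen or states.get(nxt) != "Yes":
                continue
            seen.add(nxt)
            terms.add(properties[nxt]["go"])
            stack.extend(properties[nxt].get("downstream", []))
        for gene in prop["genes"]:
            assigned.setdefault(gene, set()).update(terms)
    return assigned


# Toy example: placeholder GO terms, not real accessions.
properties = {
    "chorismate": {"genes": ["aroA", "aroB"], "go": "GO:placeholder_chorismate",
                   "downstream": ["tryptophan", "folate"]},
    "tryptophan": {"genes": ["trpA"], "go": "GO:placeholder_tryptophan",
                   "downstream": []},
    "folate": {"genes": ["folP"], "go": "GO:placeholder_folate",
               "downstream": []},
}
states = {"chorismate": "Yes", "tryptophan": "Yes", "folate": "none found"}
result = go_terms_for_genes(properties, states)
print(sorted(result["aroA"]))
```

In this toy genome the chorismate genes receive the chorismate and tryptophan terms but not the folate term, because the folate property is not in state 'Yes'.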
Phylogenetic profiling with genome properties
Genome Properties assertions can be converted into phylogenetic profiles (Pellegrini et al., 1999) simply by encoding each 'Yes' assertion as 1 and each 'No', 'none found' or 'not supported' assertion as 0. The ambiguous assertion 'some evidence' may be treated as missing data and ignored, or may be lumped with 'Yes' under the hypothesis that missing components are likely present but not recognized. The resulting pattern of 1s and 0s for the presence or absence of a protein (as originally formulated) or of a genome property across many genomes can carry a significant amount of information, enough to suggest functional relationships between pairs of proteins or between proteins and properties.
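The encoding just described is simple enough to state directly in code. The sketch below follows the mapping given in the text; the function name `encode_profile`, the flag `lump_some_evidence` and the use of `None` for missing data are illustrative choices rather than part of the Genome Properties system.

```python
def encode_profile(assertions, lump_some_evidence=False):
    """Convert per-genome property assertions into a phylogenetic profile.

    'Yes' -> 1; 'No', 'none found', 'not supported' -> 0.
    'some evidence' -> None (missing data) by default, or 1 when
    lump_some_evidence is True.
    """
    mapping = {"Yes": 1, "No": 0, "none found": 0, "not supported": 0}
    profile = []
    for state in assertions:
        if state == "some evidence":
            profile.append(1 if lump_some_evidence else None)
        else:
            profile.append(mapping[state])
    return profile


# Assertions for one property across five hypothetical genomes.
states = ["Yes", "none found", "some evidence", "No", "not supported"]
print(encode_profile(states))                           # [1, 0, None, 0, 0]
print(encode_profile(states, lump_some_evidence=True))  # [1, 0, 1, 0, 0]
```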
In principle, analysis in terms of genome properties should make phylogenetic profiling more robust because the signal represented by a property is the aggregate of all of its components and therefore provides a less noisy profile. The phylogenetic profile of an individual component, say an enzyme carrying out a particular step in a pathway, may be noisy due to several issues. For instance, the component may be involved in more than one process, each of which has a separate and distinct phylogenetic profile. A component may exist as two or more functionally equivalent but non-orthologous families (Koonin et al., 1996), and the profiles of these families individually will represent only a subset of the whole phylogenetic range of the underlying biological process.
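The point about non-orthologous families can be made concrete with a toy example. Assuming each family's presence or absence is already encoded as a list of 1s and 0s over the same ordered set of genomes, the profile of the underlying step is the element-wise union of the family profiles. The function name `union_profile` and the two profiles are invented for illustration.

```python
def union_profile(family_profiles):
    """Element-wise OR across profiles of functionally equivalent families.

    family_profiles: list of equal-length lists of 0/1 values, one list
    per family, all indexed by the same ordered set of genomes.
    """
    return [int(any(column)) for column in zip(*family_profiles)]


# Two hypothetical families that carry out the same step in different genomes.
family_a = [1, 0, 1, 0]
family_b = [0, 1, 0, 0]
print(union_profile([family_a, family_b]))  # [1, 1, 1, 0]
```

Neither family alone matches the distribution of the step, whereas their union does, which is the sense in which a profile built at the level of the property is less noisy than that of any single component.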
Relationships between genome properties
Many pairs of genome properties are strongly correlated. For example, both histidine and tryptophan must be synthesized if they cannot be imported, and environments typically allow import of both or neither. Among genomes analyzed to date, only six genera (of 73 total) contain species that break this rule. Plant and animal pathogens typically exploit rich environments that make de novo biosynthesis of both histidine and tryptophan unnecessary. Another example of correlation involves genome size: larger genomes are generally expected to have more genes in most functional categories (van Nimwegen, 2003), with the number of genes scaling differently with genome size according to the category. Secondary and redundant capabilities, including biosynthetic, catabolic and transport systems, would be expected to be present more often in larger genomes, and in general this correlation is observed in the Genome Properties dataset.
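One way to quantify such a relationship between two properties is the phi coefficient computed from their 0/1 profiles. This statistic is chosen here purely for illustration; the text does not state which measure, if any, was used. The function name `phi_coefficient` and the profiles are hypothetical, and the profiles do not reproduce the real histidine and tryptophan data.

```python
from math import sqrt


def phi_coefficient(a, b):
    """Phi coefficient between two equal-length presence/absence profiles.

    Returns a value between -1 and 1, or 0.0 when either profile is
    constant and the coefficient is undefined.
    """
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Toy profiles over four genomes (invented values).
property_x = [1, 1, 0, 0]
property_y = [1, 1, 0, 0]
property_z = [1, 0, 1, 0]
print(phi_coefficient(property_x, property_y))  # 1.0
print(phi_coefficient(property_x, property_z))  # 0.0
```

Profiles containing missing values, such as those produced when 'some evidence' is treated as missing data, would need those positions dropped from both lists before the calculation.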
We find a strong positive correlation of many genome properties with DNA GC content. Some differences in GC content follow major phylogenetic divisions, such as between the Actinobacteria (high-GC) and the Firmicutes (low-GC). However, large differences in GC content also occur within the various lineages, such as within the Gammaproteobacteria, the Actinobacteria, the Euryarchaeota or the Spirochaetes. In several lineages, and for all prokaryotic genomes taken together, the smallest genomes tend to be AT-rich and the largest genomes GC-rich. Figure 3 shows the relationship among DNA size (megabases), DNA GC content and histidine biosynthesis from PRPP. None of the genera that are above the median for both size and GC content lacks the ability to make histidine, whereas a majority of species that are below both of these levels are unable to do so. Species that synthesize histidine and species that do not are both phylogenetically diverse. These trends seem consistent with the relationships among GC-to-AT transition bias for point mutations, low-GC content, gene loss and small genome size (Andersson and Andersson, 1999) found in a study of the Rickettsia. Biosynthetic pathways for various essential amino acids (including histidine) and enzyme cofactors are nevertheless found in three Buchnera aphidicola genomes, despite their extremely low GC content and small genome size. In this case, it appears that these aphid endosymbionts retain these abilities to benefit their insect hosts (Clark et al., 1998) while having shed much of the rest of the genetic capacity of their free-living bacterial ancestors.
| CONCLUSION |
|---|
The Genome Properties system contains a rich and varied collection of biological characterizations for completely sequenced prokaryotic genomes. We present a paradigm in which standard methods of sequence analysis, including but not limited to TIGRFAMs and Pfam HMM scoring, produce evidence that is stored in relational database tables. Rules weigh the evidence automatically and detect pathways and other features accordingly. Curation includes manual finishing of property assignments where rules cannot capture all the particulars for a species. The curation process generates feedback that leads to the improvement of the protein identification models on which the rules are based, as well as improvements in annotation accuracy, completeness and information content. The inclusion of metrics such as GC content, together with phylogenetic and other non-metabolic properties, expands the value of the data for biological studies of individual prokaryotes as well as for comparative genomics.
We encourage members of the scientific community to contact us through our website (http://www.tigr.org/Genome_Properties) to suggest new Genome Properties that may be of particular interest to their research or to add manually curated data to existing properties.
| Acknowledgments |
|---|
We would like to thank Tanja Davidsen for her support in integrating Genome Properties into the Comprehensive Microbial Resource. This work was supported in part by NSF grant DBI-0110270 and DOE grant DE-FG02-01ER63203.
| Footnotes |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on May 5, 2004; revised on July 9, 2004; accepted on August 5, 2004
| REFERENCES |
|---|
Andersson, J.O. and Andersson, S.G. (1999) Genome degradation is an ongoing process in Rickettsia. Mol. Biol. Evol., 16, 1178-1191.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138-D141.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., Schneider, M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365-370.
Bolotin, A., Wincker, P., Mauger, S., Jaillon, O., Malarme, K., Weissenbach, J., Ehrlich, S.D., Sorokin, A. (2001) The complete genome sequence of the lactic acid bacterium Lactococcus lactis ssp. lactis IL1403. Genome Res., 11, 731-753.
Clark, M.A., Baumann, L., Baumann, P. (1998) Buchnera aphidicola (Aphid endosymbiont) contains genes encoding enzymes of histidine biosynthesis. Curr. Microbiol., 37, 356-358.
Delcher, A.L., Harmon, D., Kasif, S., White, O., Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636-4641.
Farabaugh, P.J. (2000) Translational frameshifting: implications for the mechanism of translational frame maintenance. Prog. Nucleic Acid Res. Mol. Biol., 64, 131-170.
Finkel, S.E. and Kolter, R. (2001) DNA as a nutrient: novel role for bacterial competence gene homologs. J. Bacteriol., 183, 6288-6293.
Fitch, W.M. (1970) Distinguishing homologous from analogous proteins. Syst. Zool., 19, 99-113.
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512.
Fleischmann, R.D., Alland, D., Eisen, J.A., Carpenter, L., White, O., Peterson, J., DeBoy, R., Dodson, R., Gwinn, M., Haft, D., et al. (2002) Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J. Bacteriol., 184, 5479-5490.
Graupner, M. and White, R.H. (2001) Methanococcus jannaschii generates L-proline by cyclization of L-ornithine. J. Bacteriol., 183, 5203-5205.
Haft, D.H., Loftus, B.J., Richardson, D.L., Yang, F., Eisen, J.A., Paulsen, I.T., White, O. (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res., 29, 41-43.
Haft, D.H., Selengut, J.D., White, O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res., 31, 371-373.
Harris, M.A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258-D261.
Hulo, N., Sigrist, C.J., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., Bairoch, A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, D134-D137.
Kalinowski, J., Bathe, B., Bartels, D., Bischoff, N., Bott, M., Burkovski, A., Dusch, N., Eggeling, L., Eikmanns, B.J., Gaigalat, L. (2003) The complete Corynebacterium glutamicum ATCC 13032 genome sequence and its impact on the production of L-aspartate-derived amino acids and vitamins. J. Biotechnol., 104, 5-25.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277-D280.
Karp, P.D., Paley, S., Romero, P. (2002a) The Pathway Tools software. Bioinformatics, 18 (Suppl. 1), S225-S232.
Karp, P.D., Riley, M., Paley, S.M., Pellegrini-Toole, A. (2002b) The MetaCyc Database. Nucleic Acids Res., 30, 59-61.
Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., Gama-Castro, S. (2002c) The EcoCyc Database. Nucleic Acids Res., 30, 56-58.
Koonin, E.V., Mushegian, A.R., Bork, P. (1996) Non-orthologous gene displacement. Trends Genet., 12, 334-336.
Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955-964.
(Ed.) (1999) Biochemical Pathways: An Atlas of Biochemistry and Molecular Biology. John Wiley and Sons, NY.
McGarvey, P.B., Huang, H., Barker, W.C., Orcutt, B.C., Garavelli, J.S., Srinivasarao, G.Y., Yeh, L.S., Xiao, C., Wu, C.H. (2000) PIR: a new resource for bioinformatics. Bioinformatics, 16, 290-291.
Osterman, A. and Overbeek, R. (2003) Missing genes in metabolic pathways: a comparative genomics approach. Curr. Opin. Chem. Biol., 7, 238-251.
Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D., Maltsev, N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA, 96, 2896-2901.
Overbeek, R., Larsen, N., Pusch, G.D., D'Souza, M., Selkov, E., Jr, Kyrpides, N., Fonstein, M., Maltsev, N., Selkov, E. (2000) WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res., 28, 123-125.
Overbeek, R., Larsen, N., Walunas, T., D'Souza, M., Pusch, G., Selkov, E., Jr, Liolios, K., Joukov, V., Kaznadzey, D., Anderson, I. (2003) The ERGO genome analysis and discovery system. Nucleic Acids Res., 31, 164-171.
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 96, 4285-4288.
Peterson, J.D., Umayam, L.A., Dickinson, T., Hickey, E.K., White, O. (2001) The Comprehensive Microbial Resource. Nucleic Acids Res., 29, 123-125.
Ren, Q., Kang, K.H., Paulsen, I.T. (2004) TransportDB: a relational database of cellular membrane transport systems. Nucleic Acids Res., 32, D284-D288.
Roten, C.A., Gamba, P., Barblan, J.L., Karamata, D. (2002) Comparative Genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Res., 30, 142-144.
Stadtman, T.C. (1987) Specific occurrence of selenium in enzymes and amino acid tRNAs. FASEB J., 1, 375-379.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
Tekaia, F., Yeramian, E., Dujon, B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51-60.
van Nimwegen, E. (2003) Scaling laws in the functional content of genomes. Trends Genet., 19, 479-484.
Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res., 32, D35-D40.