Bioinformatics Advance Access originally published online on October 10, 2005
Bioinformatics 2005 21(23):4209-4215; doi:10.1093/bioinformatics/bti711
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Syntons, metabolons and interactons: an exact graph-theoretical approach for exploring neighbourhood between genomic and functional data
1INRIA Rhône-Alpes, HELIX Group 655 avenue de l'Europe, 38334 Montbonnot Cedex, France
2Swiss Institute of Bioinformatics, Swiss-Prot Group 1 rue Michel Servet, CH-1211 Geneva, Switzerland
3UMR 8030, Genoscope, Centre National de Séquençage 2 rue Gaston Crémieux, CP570691057 Evry cedex, France
4Atelier de BioInformatique, Université Paris VI 12 rue Cuvier, 75005 Paris, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Modern comparative genomics does not restrict to sequence but involves the comparison of metabolic pathways or proteinprotein interactions as well. Central in this approach is the concept of neighbourhood between entities (genes, proteins, chemical compounds). Therefore there is a growing need for new methods aiming at merging the connectivity information from different biological sources in order to infer functional coupling.
Results: We present a generic approach to merge the information from two or more graphs representing biological data. The method is based on two concepts. The first one, the correspondence multigraph, precisely defines how correspondence is performed between the primary data-graphs. The second one, the common connected components, defines which property of the multigraph is searched for. Although this problem has already been informally stated in the past few years, we give here a formal and general statement together with an exact algorithm to solve it.
Availability: The algorithm presented in this paper has been implemented in C. Source code is freely available for download at: http://www.inrialpes.fr/helix/people/viari/cccpart
Contact: Alain.Viari{at}inrialpes.fr
| INTRODUCTION |
|---|
|
|
|---|
A large variety of biological data obtained from genome-wide experiments can be represented by graphs (Alm and Arkin, 2003). Best-known examples are proteinprotein interaction graphs where vertices are proteins and edges represent physical interactions between them. Metabolic networks are another example of more sophisticated graphs, called bipartite graphs, with two types of vertices representing chemical compounds and reactions and the edges indicate which compound is the substrate or product of a particular reaction. Finally, even the layout of the genes on a chromosome can be described as a particular graph, called an interval graph, where the vertices, representing the genes, are connected if the genes are adjacent (or, more generally, if the linear distance between two genes is less than a given threshold). Central to this graph representation is the notion of adjacency or connectivity (Galperin and Koonin, 2000). For instance a group of connected vertices in a proteinprotein interaction graph may be interpreted as a molecular complex, a group of connected reactions may be interpreted as a biochemical pathway and a group of co-oriented adjacent genes may represent an operon. The following step is then to integrate these different biological graphs in order to answer more complex biological questions, for instance to identify which contiguous genes do encode for enzymes catalysing contiguous reactions in the metabolic network or do encode for interacting polypeptides. Although the importance of graph representation to perform comparative analysis has been recognized for long (Ogata et al., 2000; Snel et al., 2002; Yanai and DeLisi, 2002; Overbeek et al., 1999), we still lack a unified framework for this kind of comparisons. The purpose of this paper is to provide such a unified approach and to illustrate it on several instances of biological graphs.
| METHODS |
|---|
|
|
|---|
For the sake of simplicity, all the definitions will be given hereafter for two graphs but the generalization to the case of n > 2 graphs is immediate.
The correspondence multigraph
Let us consider two graphs G1 and G2, hereafter referred to as the primary graphs, representing some biological data. Each graph Gi is described by a set of vertices Vi and edges Ei.
![]() |
Let us note P = V1 x V2, the Cartesian product, of V1 and V2. That is, the set of all couples of the form (v1, v2) where v1
V1 and v2
V2.
Let us consider a particular relation R stating a correspondence (not necessarily one-to-one) between the elements of V1 and V2. For instance a gene is R-related to a reaction if the EC number of the gene's product is the same as the EC number of the reaction.
Finally let us note VR the restriction of P to the couples where the two elements are related by R.
![]() |
For instance, we can restrict all the possible couples of genes to the couples of orthologous genes or all the possible couples of genes and reactions to those where the gene product catalyses the reaction.
The correspondence multigraph G is defined by the set of vertices VR and two sets of edges E'1 and E'2:
![]() |
![]() |
![]() |
An example of correspondence multigraph, involving the comparison of the gene organization with metabolic pathways, is given in Figure 1.
|
The common connected components
A connected component (CC) of a graph G = (V, E) is a maximal subset of vertices such that every vertex is reachable from each other vertex in the component. In a similar way, a common connected component (CCC) (Gai et al., 2003) of a multigraph G = (VR, E'1, E'2) is a maximal subset of vertices that are both connected in E'1 and E'2.
This definition is illustrated in Figure 1 where the multigraph has three CCCs ({1,2,3}, {4}, {5}). For the biological interpretation, it is important to remember that a CCC is a set of vertices such that every vertex is reachable from each other vertex through every type of edges. In other words, a CCC is composed of components that are connected in every primary graphpossibly by a different network of edges.
| ALGORITHM |
|---|
|
|
|---|
Computation of the CCCs
Figure 1 illustrates an important point for the practical computation of the CCCs: a CCC is not simply the intersection of separate CCs for each type of edges. In this example, the CCs of (VR, E'1) is {1,2,3,4,5} (plain edges) and the connected components of (VR, E'2) are {1,2,3,4} and {5} (dashed edges), therefore the intersection is {1,2,3,4} and {5} but {1,2,3,4} is not a CCC (since 4 is not reachable through plain edges). Now, going one step further, we can reiterate the intersection process on the set {1,2,3,4}. It then splits into two sets {1,2,3} and {4} that are the CCCs of the multigraph. More generally, the intersection of CC provides a partition of vertices with coarser grain than CCC. The idea of an exact CCC computation algorithm is therefore to iteratively refine this partition. The algorithm is initialized with a partition P0 composed of a single class with all vertices. Then, at each iteration step k, one computes the intersection of CCs within each class of the partition Pk, possibly splitting this class into subclasses. This eventually gives rise to a new partition Pk+1. The algorithm stops when the number of classes does not change (i.e. when Pk = Pk+1). The pseudo-code in Figure 2 gives a sketch of a recursive version of this algorithm. The worst-case time complexity of this algorithm is O(n (e·n + m)) where n = |VR| total number of nodes in the multigraph, e = number of primary graphs (e = 2 in the Figure 1 example) and
number of edges in the multigraph. The worst-case complexity corresponds actually to the case where the final partition is composed of n classes (each class being therefore a singleton) and when each of these classes is extracted at each iteration thus requiring n steps to perform the calculation. In practice, the number of steps that are necessary to get the final partition is much lower than n (for all our experiments with random and real data this number rarely exceeded 10 steps, even for very large multigraphs).
|
Considering insertions/deletions in primary graphs
An edge between two vertices in the correspondence multigraph implies, by construction, that the corresponding elements are connected in the corresponding primary graphs. For many practical applications, this requirement is too stringent. For instance, with genomic graphs, the requirement is that the genes are strictly adjacent on each of the chromosomes whereas we may want to allow some gene insertions/deletions. A straightforward way to do this is to introduce new edges in the primary graphs. More precisely, if we define the distance between two vertices as the minimum number of edges in a path connecting them, one should add an edge between all pairs of vertices lying at a distance

+ 1.
is therefore an insertion parameter, the case
= 0 corresponds to the original primary graph with no additional edge.
Related works
In 2000, Ogata et al. presented a graph comparison algorithm to detect functionally related enzyme clusters. Although not stated explicitly in the paper, this approach actually aims at finding CCCs between two graphs representing, respectively, the genomic and metabolic data. The correspondence multigraph is implicitly described as a list of correspondence between the vertices of these two primary graphs. Therefore, this approach is very similar in its spirit to the one presented here. An important difference lies in the fact that the proposed algorithm is a heuristic whereas our algorithm provides exact CCCs. Indeed, it can be shown that the solution of this heuristic approach is actually a subset of the exact solution. In other terms, the heuristic may miss solutions. An example of this will be shown later. In a very similar work, Zheng et al. (2002) proposed a graph-based method to infer bacterial operons. Again, the idea is to look for clusters of contiguous genes, coding for catalysers of connected reactions in the metabolic graph. These clusters are found by a breadth-first search on the metabolic graph, pruned by the distance in the genomic graph. As in the previous work, the proposed algorithm is a heuristic. As far as we know, Kelley et al. (2003) were the first to propose an explicit definition of a multigraph, in the context of protein interaction networks. Each node of the multigraph consists in a couple of proteins (one for each compared protein interaction network) with sufficient sequence similarity (using BlastP). As in our approach, some additional edges are added to the primary graphs (
= 1) to account for gaps. At this stage, their approach diverges from ours. Instead of working on the multigraph directly, the authors merge the edges to yield a single graph (called a network alignment) and then look for paths and densely connected subgraphs in this graph. This approach has been further extended to the case of three primary graphs by Sharan et al. (2005). In 2003, Gai et al. addressed the problem of identifying CCCs of two or more graphs from the computational point of view. They proposed a non-trivial algorithm combining maximal clique decomposition together with an Hopcroft-like partitioning approach. This algorithm achieves an O(n log n + m log2 n) time complexity. Furthermore, in the case of interval graphs, the complexity reduces toO((n + m) log n) (Habib et al., 2004). We acknowledge that this theoretical complexity is better than the one achieved by our simpler algorithm. However, as we shall see in the Results section, the computation of the CCCs is not the time-limiting step. It is usually done within few seconds whereas the construction of the multigraph may take several minutes up to one hour (e.g. when sequence similarities need to be established).
| RESULTS |
|---|
|
|
|---|
We will now illustrate the unified framework with three typical biological applications: (1) the identification of neighbouring genes with similar organization in several genomes (syntons), (2) the identification of neighbouring genes coding for enzymes catalysing connected metabolic reactions (metabolons) and (3) the identification of neighbouring genes coding for interacting proteins (interactons). For each application, we shall give a formal definition of the problem in terms of multigraph definition (by specifying V1, ..., Vn; E1, ..., En and the restriction VR) and we shall give examples of CCCs obtained with actual data.
Syntons
This problem can be informally stated as finding sets of contiguous genes (syntons) with conserved local organization across n (n
2) bacterial genomes. The input is therefore the gene layout on two or more chromosomes together with a gene orthology relationship. We can reformulate this problem as finding the CCCs of the following correspondence multigraph:
![]() | (1) |
![]() | (2) |
is the rank of gene
on chromosome i (taking into account boundaries for circular chromosomes). One should note that definition (2) does not explicitly require the conservation of the genes orientation although this condition can be easily added if needed.
Finally, the restriction condition writes
![]() | (3) |
![]() | (4) |
![]() | (5) |
Definition (4) requires that all genes in an n-tuple should be orthologous, i.e. (g1, ..., gn) is a clique of the orthologous relation. Definition (5) is less restrictive and only requires that one gene is orthologous to all the other, i.e. (gi, ..., gn) is a star of the orthologous relation. Other definitions are possible [e.g. (g1, ..., gn) is a CC of the orthologous relation] depending upon each particular problem. In the example that will be given later on, we choose definition (4).
Again, the orthologous relation should be made more precise. Some readily available classifications [such as COG (Tatusov et al., 1997)] can be used here:
![]() |
When no such classification is available one may resort to sequence similarity:
![]() |
We illustrate the approach with n > 2 genomes by comparing the chromosomal organization of five enterobacteria (Escherichia coli, Shigella flexeneri, Salmonella typhimurium, Yersinia pestis and Photorhabdus luminescens). As a sequence similarity measure (similar), we compared each gene product of one genome against the others using BlastP (Altschul et al., 1990) and retained couples of genes that can be aligned on at least 80% of the length of the shortest sequence with an identity of at least 40%. It should be pointed out that, since this similar relation is not a one-to-one correspondence, the same gene can be part of several syntons. With this definition, the multigraph has 10 039 vertices and the number of edges ranges from 559 938 (
= 0) to 1 566 968 (
= 10). This yields from 516 (
= 0) to 451 (
= 10) syntons of size
2 (the number of syntons decreases but the mean size of the syntons increases with
in this study). As an example, Figure 3a displays one of the two largest syntons (30 genes) found with
= 5. The time-limiting step is clearly the computation of sequence similarities (90 min) as the multigraph construction took <2 min and the CCCs computation took 10s (PowerBook G4 1.5 GHz).
|
Metabolons
This problem can be informally stated as finding sets of contiguous genes (metabolons) encoding for enzymes that catalyse connected metabolic reactions (and, therefore, that may be part of the same pathway). The input is therefore the gene layout on a chromosome, a metabolic graph and a correspondence between genes and chemical reactions. We can reformulate this problem as finding the CCCs of the following correspondence multigraph:
![]() | (6) |
![]() | (7) |
![]() | (8) (2) |
![]() | (9) |
2) can be introduced in order to relax the constraint of strict reaction connectivity. According to this, two reactions will be connected if their distance in the primary metabolic graph is 
2.
Finally, the restriction condition writes
![]() |
To illustrate these definitions, we considered the complete genome of E.coli (4289 genes) to build the genomic graph (V1). The metabolic graph (V2) was built from the KEGG/LIGAND database (Kanehisa et al., 2004), excluding compounds involved in >20 reactions (ubiquitous compounds). The resulting correspondence multigraph has 2275 vertices and the number of edges ranges from 215 948 (
1 = 0 and
2 = 0) to 1 645 510 (
1 = 5 and
2 = 5). This yields from 106 (
1 = 0 and
2 = 0) to 156 (
1 = 5 and
2 = 5) metabolons of size
2 (the size of a metabolon is defined as the number of different genes in the CCC). As an example, Figure 3b displays one of the largest metabolon (8 genes) found with
1 = 1 and
2 = 0. The total computation time does not exceed 1 min and computation of the CCCs took <5 s.
In order to validate the approach and to examine the impact of the parameters, we performed two different experiments. We first compared the number of metabolons (of size
2) obtained with
1
[0,5] and
2
[0,3] for E.coli and 100 shuffled E.coli genomes (i.e. gene ranks are shuffled). As already suggested by Ogata et al. (2000), this empirical approach allows to estimate a Z-score and a P-value based on the number of observed metabolons. The results are given in Table 1. As expected, the number of metabolons increases with the delta parameters, both on observed and shuffled data. Although, the observed values are highly significant in all cases (Z-scores from 69 to 7), one can observe that the significance tends to decrease when increasing the delta parameters. A second, more biologically meaningful, way of evaluating the metabolons is to compare them with already known clusters of genes like operons or regulons. For this purpose, we compared them with the whole set of operons in RegulonDB (Salgado et al., 2004). We do not expect a complete match between operons and metabolons because of three main reasons: (1) we do not enforce the gene coorientation whereas in RegulonDB an operon is composed of cooriented genes; (2) we expect a metabolon to match an operon but the converse is false since some operons may not be related to intermediate metabolism (e.g. ABC transporters or ribosomal proteins) and (3) the annotation of enzymes is not comprehensive (in E.coli only
25% of genes are associated to complete EC numbers). For these reasons, we defined two measures of coverage: (1) the fraction of metabolons that are covered by (at least) an operon. We say that a metabolon is covered if it shares at least half of its genes with the operon. (2) the fraction of operons that are covered by (at least) a metabolon. For the reason given above, in this latter case, we restrict to the operons containing at least two enzymes. The results are given in Table 1 (two last columns). The percentage of covered metabolons is rather high (up to 75%) and decreases with the delta parameters. This is due to the fact that the size of the metabolons increases with delta, making them more difficult to be covered by an operon. On the other hand, for the same reason, the percentage of covered operon increases. Of course, these results may vary with the organization of the species under study and the quality of available data.
|
Interactons
This problem can be informally stated as finding sets of contiguous genes (interactons) encoding for proteins that are known to interact with each other (such as components of a molecular complex). The input is therefore the gene layout on a chromosome and the interaction graph between their products. We can reformulate this problem as finding the CCCs of the following correspondence multigraph:
![]() | (10) |
![]() | (11) |
![]() | (12) (2) |
![]() | (13) |
Finally, the restriction condition writes
![]() |
To illustrate these definitions, we considered the complete genome of E.coli (4289 genes) to build the genomic graph (V1). The proteinprotein interaction (PPI) graph (V2) was build from the DIP database (Salwinski et al., 2004) and was restricted to proteins from E.coli. In this case, the multigraph is very small (4289 nodes and 4551 edges for
1 = 0 and
2 = 0), the total computation time for the multigraph construction and computation of the CCCs is <2 s. Figure 3c and d show two examples of interactons computed with
1 = 0 and
2 = 0. The first one corresponds to the largest interacton that has been detected (eight genes). It consists of the gene cluster coding for the eight components of the ATP synthase complex. In this case, the protein nodes are strongly linked to each other. For instance AtpH interacts with six other proteins of the interacton. The second example corresponds to an interacton of four genes, associated to the maltose ABC transporter. We have selected this example to emphasize a case of loosely linked nodes. In this case there is no two adjacent genes encode for two interacting proteins (in other terms, no nodes in the subgraph of the multigraph are connected both with genomic and PPI vertices). This is a typical example of a solution missed by Ogata et al. heuristic. Indeed this greedy heuristic requires at least two nodes being connected in all primary graphs in order to begin the clustering process. So, in this case, the CCC will be completely missed. More generally Ogata et al. heuristic may lead to smaller solutions than the exact algorithm.
| CONCLUSION |
|---|
|
|
|---|
We have introduced a general framework to represent correspondences between genomic or functional data represented by graphs together with an exact procedure to extract clusters of neighbouring entities (such as genes, proteins, reactions) from this representation. The advantage of this approach is that it allows a fairly general formulation of the problem that can be further adapted to different actual cases. This approach was illustrated in three particular cases. The first example mixes genomic (G) graphs only and could therefore be denoted as the Gn (syntons between n
2 genomes) problem. The second example mixes genomic and metabolic information and could be denoted as a G1M problem. The extension to the GnM problem is straightforward and could be used to identify clusters of neighbouring genes in several organisms and associated with connected reactions. Finally, the third example mixes genomic and proteinprotein interaction data and could therefore be described as the G1I1 problem (we used a subscript to the I graph to indicate that the interactions are species specific). Again, the extension to the GnI1 and GnIn problems could be envisaged as well. Another potentially interesting case is the I1M type of problem that related metabolic and interaction data in order to grab, for instance, the channelling (tunnelling) of substrates in enzymatic complexes. Finally, the GIM type of problem allows mixing the three kinds of data together. Other extensions may involve new kinds of primary graphs. This means either new types of primary nodes (e.g. using protein domains instead of complete proteins) or new kind of primary edges [e.g. shared expression patterns or gene fusions (Marcotte et al., 1999)]. Finally, we would like to point out another, more algorithmic extension of the current approach that may become important when dealing with more than two graphs. Considering for instance the Gn>2 problem, the idea is to introduce the notion of quorum that specifies a minimum number of primary graphs (i.e. species) for which the connectivity property holds. This will allow to look for syntons not occurring in all the species but, at least, in a minimum number of them. The precise definition of this extension will be our next task in the future.
| Acknowledgments |
|---|
The authors would like to thank Marie-France Sagot and Eric Coissac for helpful discussions during the course of this work.
Conflict of Interest: none declared.
Received on June 1, 2005; revised on July 22, 2005; accepted on October 6, 2005
| REFERENCES |
|---|
|
|
|---|
Alm, E. and Arkin, A.P. (2003) Biological networks. Curr. Opin. Struct. Biol, . 13, 193202[CrossRef][Web of Science][Medline].
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403410[CrossRef][Web of Science][Medline].
Gai, A.-T., Habib, M., Paul, C., Raffinot, M. (2003) Identifying common connected components of graphs. Report RR-LIRMM-03016, , France LIRMM, Université de Montpellier 2.
Galperin, M.Y. and Koonin, E.V. (2000) Who's your neighbor? New computational approaches for functional genomics. Nat. Biotechnol, . 18, 609613[CrossRef][Web of Science][Medline].
Habib, M., Paul, C., Raffinot, M. (2004) Maximal common connected sets of interval graphs. Proceedings of Combinatorial Pattern Matching: 15th Annual Symposium, CPM 2004, LNCSJuly 57 2004Istanbul, Turkey 3109, , pp. 359372.
Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res, . 32, D277D280
Kelley, B.P., et al. (2003) Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA, 100, 1139411399
Marcotte, E.M., et al. (1999) Detecting protein function and proteinprotein interactions from genome sequences. Science, 285, 751753
Ogata, H., et al. (2000) A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res, . 28, 40214028
Overbeek, R., et al. (1999) The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA, 96, 28962901
Salgado, H., et al. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res, . 32, D303D306
Salwinski, L., et al. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res, . 32, D449D451
Sharan, R., et al. (2005) Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA, 102, 19741979
Snel, B., et al. (2002) The identification of functional modules from the genomic association of genes. Proc. Natl Acad. Sci. USA, 99, 58905895
Tatusov, R.L., et al. (1997) A genomic perspective on protein families. Science, 278, 631637
Yanai, I. and DeLisi, C. (2002) The society of genes: networks of functional links between genes from comparative genomics. Genome Biol, . 3, research0064.
Zheng, Y., et al. (2002) Computational identification of operons in microbial genomes. Genome Res, . 12, 12211230
This article has been cited by other articles:
![]() |
P. Lechat, L. Hummel, S. Rousseau, and I. Moszer GenoList: an integrated environment for comparative analysis of microbial genomes Nucleic Acids Res., January 11, 2008; 36(suppl_1): D469 - D474. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Vallenet, L. Labarre, Z. Rouy, V. Barbe, S. Bocs, S. Cruveiller, A. Lajus, G. Pascal, C. Scarpelli, and C. Medigue MaGe: a microbial genome annotation system supported by synteny results Nucleic Acids Res., January 10, 2006; 34(1): 53 - 65. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


















(2)




