Skip Navigation


Bioinformatics Advance Access originally published online on April 27, 2007
Bioinformatics 2007 23(13):1631-1639; doi:10.1093/bioinformatics/btm156
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/13/1631    most recent
btm156v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Li, Z.
Right arrow Articles by Chen, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, Z.
Right arrow Articles by Chen, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Alignment of molecular networks by integer quadratic programming

Zhenping Li 1,2,*, Shihua Zhang 2,3,*, Yong Wang 2,5, Xiang-Sun Zhang 3,{dagger} and Luonan Chen 4,5,6,7,{dagger}

1Beijing Wuzi University, Beijing 101149, China, 2Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China, 3Graduate University of Chinese Academy of Sciences, Beijing 100049, China, 4Institute of Systems Biology, Shanghai University, Shanghai 200444, China, 5Osaka Sangyo University, Osaka 574-8530, Japan, 6ERATO Aihara Complexity Modelling Project, JST, Tokyo 153-8530, Japan and 7Institite of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan

{dagger}To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: With more and more data on molecular networks (e.g. protein interaction networks, gene regulatory networks and metabolic networks) available, the discovery of conserved patterns or signaling pathways by comparing various kinds of networks among different species or within a species becomes an increasingly important problem. However, most of the conventional approaches either restrict comparative analysis to special structures, such as pathways, or adopt heuristic algorithms due to computational burden.

Results: In this article, to find the conserved substructures, we develop an efficient algorithm for aligning molecular networks based on both molecule similarity and architecture similarity, by using integer quadratic programming (IQP). Such an IQP can be relaxed into the corresponding quadratic programming (QP) which almost always ensures an integer solution, thereby making molecular network alignment tractable without any approximation. The proposed framework is very flexible and can be applied to many kinds of molecular networks including weighted and unweighted, directed and undirected networks with or without loops.

Availability: Matlab code and data are available from http://zhangroup.aporc.org/bioinfo/MNAligner or http://intelligent.eic.osaka-sandai.ac.jp/chenen/software/MNAligner, or upon request from authors.

Contact: zxs{at}amt.ac.cn, chen{at}eic.osaka-sandai.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
One of the major challenges for post-genomic biology is to understand how genes, proteins and small molecules interact to form a functional network (Chen et al., 2004, 2005; Lee et al., 2004; Wang and Chen, 2005). In recent years, with rapid progress of biological science, many high-throughput technologies have been developed for studying interactions of molecules, such as microarray, the two-hybrid assay, co-immunoprecipitation and the chIP-chip approach, which can be used to screen for protein–protein interaction (PPI) (Kelley et al., 2003) or to infer gene regulatory network (Wang et al., 2006). For instance, these technologies have been adopted to derive PPI networks for many model species (Kelley et al., 2003), such as bacteria, yeast, nematode worm and fruit fly.

Molecular networks orchestrate the sophisticated and complex functions of the living cells. Various organisms differ not only because of differences of constituting proteins, but also because of architectures of their molecular networks. Hence, it is essential to address the similarities and differences in the molecular networks by comparative network analysis, which can directly be applied for analyzing signaling pathways, finding conserved regions, discovering new biological functions or understanding the evolution of protein interactions. In addition to PPI networks, research works on many other types of molecular networks are also emerging and increasing rapidly, such as metabolic networks (Pinter et al., 2005), gene regulatory networks (Trusina et al., 2005; Wang et al., 2006) and coexpression networks (Berg and Läsig, 2006) with the advance of biotechnology. Several network alignment algorithms for molecular networks have been discussed in recent studies, which are either mainly based on sequence similarities, such as PathBLAST (Kelley et al., 2004) and Local graph alignment algorithm (Berg and Läsig, 2004), or mainly based on network architecture similarities, such as pairwise local alignment algorithm (Koyutürk et al., 2005) and heuristic graph comparison algorithm (Ogata et al., 2000). Due to computational burden, most of the conventional approaches either restrict comparative analysis to special structures, such as pathways or adopt heuristic (or approximate) algorithms. A more comprehensive survey can be found in a recent review (Sharan and Ideker, 2006) for biological network comparison problems and potential applications.

In this article, we aim to develop an efficient algorithm for aligning general molecular networks based on both the node similarity (e.g. protein or gene sequence similarity, enzyme's identity) and the network architecture similarity, by using integer quadratic programming (IQP) with a log-probability-like criterion (Kelley et al., 2003). Numerical computation and theoretical analysis on the real biological data sets show that such an IQP can be relaxed into the corresponding quadratic programming (QP) which almost always ensures an integer solution. Therefore, a QP algorithm can be adopted to efficiently solve this IQP without any approximation. In terms of computational complexity, the proposed approach makes the computation of molecular network alignment tractable. On the other hand, from the viewpoint of graph theory, the proposed method can identify similar subsets between two graphs, which allow gaps of nodes and edges. The similarity of two biomolecules (or nodes) is defined by their sequence or functional homology, whereas the similarity of two edges is based on their confidence ratios of interactions/regulations between the respective two molecules. We have implemented the proposed algorithm by Lingo programming and Matlab software, respectively which are available upon request from the authors. Both theoretical and numerical results demonstrate that the proposed method is rather effective and general, i.e. it can be applied not only to weighted or unweighted networks, but also to directed or undirected networks. In addition to simple topological substructures such as chains and trees, it can reveal biologically meaningful units or subnetworks with loops or network motifs.


    2 FORMULATION OF NETWORK ALIGNMENT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We first formulate the molecular network alignment problem based on the graph representation. Given a molecular network G(V, E), where each node v in the node set V represents a molecule, e.g. protein or gene or RNA, each edge (u, v) in the edge set E represents an interaction or regulation between nodes Formula and Formula .

Given two molecular networks (directed or undirected) Formula and Formula , where Formula and Formula . The adjacent matrices of G1 and G2 for unweighted networks are, respectively Formula and Formula , where Formula if there is an interaction between proteins i and j, and Formula otherwise and Formula likewise. Besides binary values, notice that Formula and Formula can be straightforward extended to real numbers between 0 and 1 to represent the confidence ratios of the interactions. Several studies have suggested useful methods for evaluating the reliability of protein interactions (Bader et al., 2004; Deng et al., 2003), by which each protein interaction can be assigned a confidence value ranging from 0 to 1. In other words, the network can also be formulated as a weighted and/or directed graph.

For the node's similarities, we define a similarity score to measure the similarities between a pair of proteins or genes based on their sequences or any other information. The similarity score is defined as a function Formula (or Formula ). For any Formula and Formula , Formula measures the similarity between molecules Formula and Formula . Especially, Formula implies that the sequences of Formula and Formula are identical, whereas Formula (or Formula ) indicates no similarity of the sequences between Formula and Formula .

To uncover conserved subnets (or corresponding relationships) in the two aligned networks, it is necessary to determine the detailed matching among nodes between two networks. In our model, the matching between Formula and Formula is represented by a binary variable Formula ,


Formula

Formula is a corresponding relationship between two nodes, and the optimal matching Formula determined by the IQP model represents an ‘optimal’ alignment. Clearly, depending on Formula , we can get a local alignment or local matching between the two networks G1 and G2.

When modeling an aligned subnetwork, we are interested in entities (e.g. nodes and edges) and their different attributes (e.g. node or edge similarity). In a probabilistic model, each of these attributes is treated as a random variable. A model embodying a well-matched substructure follows the joint probability distribution of all the random variables of interest, which has the product form for all elements. Therefore, when assuming the independence of individual probabilities for all elements, the log-probability score is the summation of all individual (log) probabilities for the joint probability distribution.

On the other hand, in this article, the similarity between two molecular networks (G1 and G2) with respect to a given matching matrix X of nodes is defined by the sum score including both node and edge matching scores in the objective function, which is similar to the form of the log-probability score. Then, the alignment of networks can be formulated as the following IQP by maximizing the similarity score Formula between networks G1 and G2 among all feasible combinations X.


Formula

where, the coefficient {lambda} is a scalar parameter between 0 and 1 to control the balance between node and edge scores. For instance, only the node scores are considered in the alignment for {lambda} = 1, while only the edge scores are optimized for {lambda} = 0. Generally, the parameter is Formula , depending on the requirement of alignment. The first constraint implies that one node in G1 can correspond to at most one node in G2, while the second constraint means that each node in G2 can match at most one node in G1. The last constraint is the integer constraint for variable X. Depending on the parameter {lambda}, we can obtain different optimal alignment solutions. It can be shown that the IQP is a combinatorial optimization problem which belongs to NP-hard class. Notice that the log-likelihood score in the probabilistic model is summation of all the likelihood ratios observed over all the aligned nodes and edges under the assumption of independence for individual items. Clearly, the objective function Formula in the deterministic IQP model is the summation over all the aligned nodes and edges, which has the similar form to log-similarity probability although we do not require the independent assumption due to the deterministic model of this article, which is also an advantage of the IQP model.

Actually, such an IQP has a significant characteristic. By simple manipulation, it can be easily shown that the constraints of IQP have a unimodular property, which implies that the IQP can be relaxed into the corresponding QP with an integral solution for general cases. Actually, as proven in Section 1 of Supplementary Materials, with appropriate conditions on the objective function, the QP can be ensured to have an optimal integer solution. Note that Section 1 of Supplementary Materials is not the necessary conditions but the sufficient conditions, which means that the QP may have an integer solution even without satisfying the conditions. Actually, the numerical simulation shows that QP in the test problems always converges to integer solutions although many of them do not satisfy the sufficient conditions. On the other hand, for the case of a non-integer solution, a rounding-up strategy can be adopted approximately to make the solution of the QP integer.

Instead of solving the IQP directly, we can also transform the IQP equivalently into an ILP (integer linear programming) by introducing a family of variables, as derived in Section 2 of Supplementary Materials.


    3 COMPONENTS OF ALIGNMENT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In order for the molecular networks to be aligned in a biologically meaningful manner, preparation of the underlying similarity function S and adjacent matrices (A, B) is crucial. Next, we mainly take two types of molecular networks, e.g. protein interaction networks and metabolic networks (pathways) as examples to demonstrate the proposed method.

3.1 Adjacent matrix
A PPI network is represented as an undirected graph G(V, E), i.e. each node represents a protein and each edge represents an existing interaction between two proteins or an interaction with a confidence value. A metabolic pathway is represented as a directed graph G(V, E) whose nodes correspond to enzymes that catalyze the pathway's reactions, and whose edges connect nodes if the product of one enzyme serves as the substrate of the other.

3.2 Similarity function
We measure the similarity S of proteins based on their sequences, which is assigned to any pair of proteins between two networks. The similarity score between a pair of proteins can be measured by several methods, e.g. based on the similarity of amino acid sequences or the evolutional relation of the protein pairs. One of them is measured by detecting orthologs and in-paralogs using INPARANOID (Remm et al., 2001), which is developed for finding disjoint ortholog clusters in two species. Each orthologs cluster is characterized by two main orthologs, one from each species, and possibly several other in-paralogs from both species. The main orthologs are assigned a confidence value between 0 and 1, while the in-paralogs are assigned confidence scores based on their relative similarity to the main ortholog in their own species. The similarity between two proteins u and v is defined as


Formula

Clearly, this score provides a normalized similarity function that takes values in interval [0,1]. Besides INPARANOID, the similarity score can also be measured by the following formula (Durbin et al., 1998)


Formula

When a common ancestor exists between proteins u and v, the numerator Puv is the probability that u is replaced by v, and the denominator expresses the product of the probabilities of obtaining u and v, respectively, by substitution at random (namely, the probability with which u and v are produced independently). That is, this score expresses the degree to which u and v relate evolutionary in terms of a log-odds ratio. Moreover, the similarity can be measured from the information of the sequence similarity, e.g. by BLAST (Kelley et al., 2003).

For metabolic networks, in order to build a node's similarity function appropriately, we associate each enzyme with its EC (Enzyme Commission) number which consists of four sets of numbers and categories, the type of the catalyzed chemical reaction. For two enzymes, the similarity between them can be defined as the formula as Tohsato et al. (2000) suggested. Here, we adopted a simple rule which has also been used in a similar manner (Pawlowski et al., 2000): Formula , where r means the class number of the lowest common upper class. For example, r([1.2.3.4], [1.2.3.5]) = 3 and r([1.2.3.4], [2.1.3.4]) = 0. In contrast to sequence similarity, enzymes with similar EC classification represent functional homologs for functionality of EC classification.

3.3 Statistical significance of alignments
Based on P-value calculated from t-test, we evaluate the statistical significance of an alignment in a similar manner implemented in the study of Pinter et al. (2005). We compute the P-value of an alignment with an objective score Formula by aligning the smaller network with 100 random networks generated by containing the same set of nodes and the same number of edges as the larger one, and counting the fraction of alignments with higher scores than Formula . We used the program from http://www.cmth.bnl.gov/~maslov/matlab.htm to generate the randomized undirected or directed networks. The program developed by Maslov and Sneppen (2002) generates randomized networks by randomly reshuffling links, while keeping the in- and out-degree of each node constant. The detailed implementation process can be seen in Maslov and Sneppen (2002).

3.4 Materials
In this study, all related materials containing protein interaction networks for yeast and fly, and similarity of proteins between yeast and fly were obtained from a recent study (Bandyopadhyay et al., 2006) (http://www.cellcircuits.org/Bandyopadhyay2006/). In that study, an approach is proposed for identifying functional orthologs based on protein interaction network comparison supplemented by sequence homology. Specifically, a method (Bader et al., 2004) is adopted to estimate confidence values of protein interactions, using a logistic regression model and applying the known Inparanoid algorithm (Remm et al., 2001) to define sequence-similar clusters of proteins from Saccharomyces cerevisiae and Drosophila melanogaster. Then, 14 319 interactions among 4389 proteins in yeast and 20 720 interactions among 7038 proteins in fly form two PPI networks. The 2244 clusters covering 2836 yeast proteins and 3828 fly proteins were generated by Inparanoid algorithm.

On the other hand, all related materials containing metabolic pathways of Escherichia coli and yeast S.cerevisiae were obtained from a recent study (Pinter et al., 2005) (http://www.cs.technion.ac.il/olegro/metapathwayhunter/), where Pinter et al. developed a pathway alignment procedure based on a graph matching algorithm. By applying this method to genome-scale metabolic networks of the bacterium E.coli and the yeast S.cerevisiae, their similarities and differences were investigated. The metabolic pathways are extracted from the EcoCyc (Karp et al., 2004) (http://ecocyc.org/) database and SGD (Christie et al., 2004) (http://www.yeastgenome.org/), respectively.

3.5 Network clustering algorithm
It is time-consuming to apply QP or IQP algorithm directly to align two large networks (over thousands of nodes). In order to overcome the problem, a network clustering method, i.e. the so-called MCODE algorithm (Bader and Hogue, 2003), is used to detect network clusters. Detection of functional modules in protein interaction networks is an important problem, and many studies (Hartwell et al., 1999; Rives and Galiski, 2003) have verified the modularity structure of PPI networks. The MCODE algorithm (http://cbio.mskcc.org/~bader/software/mcode/index.html) utilizes connectivity values in PPI networks to detect protein complexes. This algorithm is based on vertex weighting according to its local neighborhood density, and has a special advantage, i.e. it can detect overlapping clusters and relatively sparse clusters (i.e. subnetworks). We calculate clusters by MCODE algorithm in fly PPI network, and then find the corresponding subnets in yeast PPI network based on the nodes' homologous mapping of interspecies.


    4 SIMULATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have developed an alignment tool called MNAligner (Molecular Network ‘Aligner’) based on the proposed algorithm, and the program is encoded by Lingo and all simulations were performed on a PC. In addition, we also solve the QP by Matlab program based on an efficient interior algorithm (Ye, 1992). We test the proposed method for undirected and directed networks, respectively, which represent various kinds of molecular networks, such as protein networks, gene regulatory networks, coexpression networks or metabolic networks. The elements of adjacent matrices may be symmetric or asymmetric, depending on whether or not it is an undirected and directed network. In the following section, two small examples are first used to test MNAligner, and then we report the results of applying MNAlinger to two types of molecular networks including protein interaction networks and metabolic networks.

4.1 Example 1: aligning undirected networks
An example is taken from the tutorial files provided in the PathBLAST plugin of software Cytoscape 1.1 (http://www.cytoscape.org/plugins1.php) as shown in Figure 1. The adjacent matrices of the two networks and their similarity matrix S can be seen in Section 3 of Supplementary Materials. The result by our method MNAligner is shown in Figure 1. On the other hand, the results of PathBLAST are obtained by software Cytoscape 1.1 with the same data.


Figure 1
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The results of a tutorial network alignment example from PathBLAST plug-in of Cytoscape software with {lambda} = 0.5 by MNAligner. Figure 1 for objective function, t-score Figure 1 for t-test, and Figure 1 for P-value.

 
The network alignment procedure is similar for the two methods. In the first step, a global alignment graph is formed to consider the node similarity and edge similarity together. Then the conserved pathways are identified by considering the local acyclic structures with high score. To further support the benefits of MNAligner by data, we present a detailed comparison with PathBLAST by example 1 in Section 4 of Supplementary Materials both from the global alignment graph and the chosen pathways. The computational results show that our method is consistent with PathBLAST in terms of ability to find the conserved pathways in the two networks but outperform PathBLAST in the ability to find the most conserved pathways. Using our method, we obtained an optimal solution with the objective function 3.57 for {lambda} = 0.5. The protein matching is illustrated in Figure 1, where upper and lower nodes represent the nodes from the two networks, respectively. In this article, a bold line represents that the corresponding two edges are matched, whereas a thin line is an edge in one network without matching with the one in another. t-score for t-test and P-value are also given in each example.

Analyzing this result, we found three conserved pathways with length 3, i.e. Formula , Formula and Formula . By checking with the results of PathBLAST, surprisingly the pathway Formula has the highest probability score, and Formula is ranked in the second place by probability score. These results are consistent with the results obtained by PathBLAST. However, in some case PathBLAST and MNAligner identify the same elements of pathway but organize them differently. For example, the best pathway identified by PathBLAST Formula and the best pathway identified by MNAligner Formula contain the same elements (nodes and edges). PathBLAST inserts a gap between C and A and forms an indirect link as shown in Supplementary Figure S2. The MNAligner keeps the natural sequence order Formula and Formula in the original network. In this way, MNAligner outperforms PathBLAST both in PathBLAST and MNAligner score (refer to Supplementary Figure S2 in Section 4 of Supplementary Materials). It demonstrates that MNAligner can find the optimum pathway while PathBLAST only lists feasible solutions. Another difference of two methods is that some pathways identified by MNAligner do not show in the results of PathBLAST. It is very interesting that the pathway found by MNAligner Formula is not in the 357 list of PathBLAST (every run of the PathBLAST, the number of list pathways is different, we only pick one case to analyze) though this pathway pair has clear homology based on both similarity of their corresponding proteins and interconnections of edges. According to its PathBLAST score, this pathway should be listed in the top 20 pathways (refer to the Supplementary Table S1 in Section 4 of Supplementary Materials) in the PathBLAST results. By checking carefully the global alignment graph formed by PathBLAST in Supplementary Figure S1, we find that Formula is an isolated node which means it cannot be caught by the dynamic programming based search of PathBLAST. As a brief summary, the comparison results support that in methodology MNAligner can find the best matched subnetworks between two molecular networks from the viewpoint of optimization, while PathBLAST is a heuristics-based approach which can list many feasible solutions.

The parameter {lambda} in our algorithm aims to balance the node similarity and the edge similarity in two networks. By adjusting this parameter, we can have the results with stressing on different aspects of network alignment. Similar results were obtained when we emphasize on the edge information by decreasing parameter {lambda}.

4.2 Example 2: aligning directed networks
Our algorithm can also be used for the comparison of directed networks, such as the gene regulatory networks. In this section, we verify the effectiveness of MNAligner for the directed networks by an example which is shown in Figure 2, where the related information including adjacent matrices and their similarity matrix can be found in Section 3 of Supplementary Materials. We obtained the optimal objective function value 9.22 with parameter {lambda} = 0.5 using the MNAligner tool, where the optimal matching is as follows: (u1, v1),(u2, v2),(u3, v3),(u4, v4),(u5, v5),(u6, v6),(u7, v7),Formula ,Formula ,Formula ,Formula . Such a result is actually robust with the parameter perturbations, e.g. we can obtain the same corresponding result in other parameters, such as {lambda} = 0.6 and {lambda} = 0.4. Clearly, a significant advantage of MNAligner is that the new tool can find more complex substructures including loops, such as a loop Formula , as shown in Figure 2.


Figure 2
View larger version (32K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The simulated example of two directed networks with {lambda} = 0.5 by MNAligner. Figure 2, t-score Figure 2 and Figure 2.

 
Note that based on the mathematical programming, our method can find the best matched subnetworks between two molecular networks from the viewpoint of optimization, comparing with heuristics-based approaches. In the above mentioned two examples, we show that MNAligner can be used to either undirected or directed networks with or without loops, in contrast to PathBLAST which is mainly for undirected networks without loops (e.g. pathways). Generally, PathBLAST performs heuristic search to globally align graphs and lists many short paths (e.g. length 3 or 4) based on statistical scores, whereas our method obtains the optimal alignments based on the objective function (i.e. the network similarity for both nodes and edges).

4.3 Molecular networks
In this section, we align protein interaction networks for yeast versus fly, and metabolic pathways for yeast versus bacterium.

By incorporating sequence homology and network structures, network alignment of protein interaction networks is a powerful tool for predicting protein functions, uncovering true functional orthologs, and revealing biologically significant pathways. To demonstrate the ability in discovering conserved subnets in protein interaction networks by MNAligner, we aligned those divided cluster pairs of two PPI networks. Figure 3 shows an example in which several well-matched subnets were found. The nodes in these subnets of yeast correspond to two basic functions:‘DNA processing' and ‘transport routines'. Naturally, we can predict that the corresponding subnets in fly also have these two functions. In addition, MNAligner can also be applied to identify functional orthologs in protein interaction networks just as Bandyopadhyay et al. have suggested (Bandyopadhyay et al., 2006). Note that according to the original paper of Kelly et al. (2003), their method mainly detects paths only with length 3 or 4, whereas MNAligner is a general comparison method for one to one matching without the limit on the size of the aligned networks.


Figure 3
View larger version (44K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. An illustration of well-matched subnets in yeast and fly protein interaction networks with {lambda} = 0.9. Each circle represents a match.

 
For yeast versus bacterium (Pinter et al., 2005), all the possible pairs between 113 E.coli pathways and 151 S.cerevisiae pathways are aligned by the proposed method. The detailed results for all of these alignments are available upon request. About half of those pathways in each species match well with the corresponding ones in another. Such results demonstrate that there are abundant conserved pathways among species although bacterium E.coli and yeast S.cerevisiae come from prokaryotic and eukaryotic, respectively.

Figure 4 shows four aligned pathways in E.coli and S.cerevisiae, which demonstrate the effectiveness of our method in uncovering interesting biological conservative units. The arginine biosynth2 pathway and arginine metabolism pathway, respectively from E.colic and S.cerevisiae showed in Figure 4A form a well-matched linear path except a non-identical match (the red box), where the non-match may hint to interesting evolutionary information. The mismatch in Figure 4A is the inconsistent match between AreE in E.coli (enzyme: acetylornithine deacetylase labeled 3.5.1.1 [EC] 6) and Ecm40 in S.cerevisiae (enzyme: acetylornithine acetyltransferase labeled 2.3.1.35 [EC] ). By checking the two pathways, we found that the two enzymes catalyze different reactions (N-acetyl-L-ornithine Formula O {lrdoublearrow} L-ornithine + acetate versus L-glutamate N-{alpha}-acetylornithine {lrdoublearrow} N-acetyl-L-glutamate + L-ornithine), but produce the same compound L-ornithine. However, in the following processes, the two pathways have the same biochemical reactions. Moreover, we found that these two proteins have no significant similarity by aligning their protein sequences using BLAST. The fact that non-homologous proteins carry out the different intermediate processes but mediate the same subsequent processes, may reflect the some evolutionary phenomena, e.g. non-homologous proteins may evolve into same function, or functions have been maintained although the sequences have been changed.


Figure 4
View larger version (36K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Three matched inter-species pairs with {lambda} = 0.9. The upper part represents the enzyme from E.coli and the lower part represents the enzyme from S.cerevisiae in each Box. Each Box represents a match. The bold line represents a matched edge whereas the red label indicates the inconsistency. The three pathway pairs of E.coli and S.cerevisiae are (A) arginine biosynth2 pathway and arginine metabolism pathway with Figure 4, t-score Figure 4 and Figure 4, (B) homoserine methionine biosynth pathway and methionine biosynth pathway with Figure 4, t-score Figure 4 and Figure 4, (C) phospholipid biosynth1 pathway and phosphatidic acid phospholipid biosynth pathway with Figure 4, t-score Figure 4 and Figure 4, (D) allantoine degradation pathway from E.coli and ureide degradation pathway from S.cerevisiae with Figure 4, t-score Figure 4 and Figure 4.

 
Figure 4B shows another example of homoserine methionine biosynth pathway and methionine biosynth pathway, respectively. These two pathways only have minor difference between two matched pairs including homoserine O-succinyltransferase (labeled 2.3.1.4 [EC] 6) versus homoserine O-trans-acetylase (labeled 2.3.1.3 [EC] 1) and ornithine carbamoyltransferase (labeled 2.1.1.1 [EC] 3) versus ate-homocysteine methyltransferase (labeled 2.1.1.14 [EC] ). Except that a gap can be observed, we can image that an enzyme (4.4.1.8) has been embedded into ethionine biosynth pathway from S.cerevisiae, which implies that there may exist more complex biochemical reactions in E.coli. By checking the two pathways, we found that biochemical reactions are the same in the first three products (compound:homoserine; after enzyme: 1.1.1.3 [EC] ), and then they differ in the following chain reactions where there are four and three different enzymes in E.coli and in S.cerevisiae, respectively, but have same final compound L-methionine. Biologically, the reactions catalyzed by metA, metB, metC and metE in E.coli can be viewed to play the same role as the reactions catalyzed by met2, met17 and met6 in S.cerevisiae. More interestingly, three among all the seven enzymes including met17, metB and metC are sequence homologs, which may imply either gene duplication in E.coli or gene fusion in S.cerevisiae.

Figure 4C shows the third pathway pairs, in which the phosphatidic acid phospholipid biosynth pathway from S.cerevisiae can be considered as a subnet of phospholipid biosynth1 pathway from E.coli. Figure 4D shows the fourth pathway pair, in which although the two pathways (allantoine degradation pathway from E.coli and ureide degradation pathway from S.cerevisiae) have different final compounds, they have very similar initiative subpathways. This difference may be due to the environment or requirements for adaptation. All of examples illustrate strong evolutionary trace between the distant organisms from the viewpoint of network structure, which may not be recognized simply by analyzing sequences or structures of individual molecules.

More comparison results are listed in Section 5 of Supplementary Materials. In addition, the proposed method can also be employed to carry out intra-species alignment which may provide interesting duplication and divergence information within a species. Figure 5. shows two statistically significant examples from all-against-all alignments in E.coli and S.cerevisiae, respectively. Another application of the proposed method as a tool is network query (Pinter et al., 2005), i.e. to uncover interesting subnetworks in a given network that is similar to the query network, which is known to be functional or expected important. For example, there are many biologically significant complexes in yeast but very few in other species such as fly. By applying MNAligner to query a known complex of yeast in the molecular network of another species, we can possibly identify the conservative or similar complex.


Figure 5
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Two matched intra-species pairs with {lambda} = 0.9. (A) isoleucine biosynth1 pathway and NAD dephosphorylation pathway with Figure 5, t-score Figure 5 and Figure 5 from E.coli, (B) threonine biosynth pathway and threonine methionine biosynth pathway with Figure 5, t-score Figure 5 and Figure 5 from S.cerevisiae.

 

    5 DISCUSSION AND CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
5.1 Conclusion
Several approaches for comparing molecular networks have been developed recently, such as PathBLAST applied to protein interaction networks and MetaPathwayHunter applied to metabolic pathways. However, these methods have their advantages or features mainly to specific networks. For example, PathBLAST (Kelley et al., 2003, 2004) evaluates the link similarity along path of connected nodes by adopting a sequence alignment algorithm, while MetaPathwayHunter (Pinter et al., 2005) analyzes metabolic pathways with no cycles (the cyclic pathways are broken down into all possible non-cyclic representations and then are analyzed as well [in private communications]) by a subtree comparison algorithm. In contrast, we proposed an effective algorithm which can compare general networks in a more accurate manner.

Specifically, we developed a novel formulation as well as an algorithm based on QP or LP for network alignment, which aims to find conserved patterns among molecular networks. In contrast to PathBLAST and MetaPathHunter which focus on the search of pathways without a loop, our approach can handle a general network alignment problem without restriction. To find the conserved substructures or evaluate the similarity between two biological networks, we developed an efficient algorithm for aligning molecular networks based on QP, which allow gaps for nodes and edges. Depending on the parameter {lambda} which balances the node and edge matching scores, we may have different results, which stress on different aspects of biological systems. A large {lambda} emphasizes on the node matching score, the aligned substructures generally have fewer edges but with more related nodes, such as homologous proteins or consistent enzymes labels. On the other hand, a small {lambda} emphasizes on the edge matching score, and the aligned substructures generally have more edges and are also larger in size. By selecting all the nodes without gaps and constructing a minimum connected subgraph in each network, we can obtain the two minimum subgraphs, which can be also regarded as conserved patterns. As demonstrated in this article, the proposed method can be applied to various types of networks including undirected/directed, unweighted/weighted, where the detected conserved subnets can be linear paths, tree-like nets and general subnets including loops. Note that the ability of uncovering conserved subnets with loops is very useful because many studies indicate that those subnets, such as feed-forward loop network motifs appear significantly in biological networks and play very important roles in biological systems (Alon, 2006).

5.2 Future work
Further theoretical works on tight sufficient conditions for an integer solution of the QP are necessary although the numerical computations always give integer solutions. Moreover, the relaxed QP is still intractable for large molecular networks with over thousand nodes, e.g. proteins. Although we have suggested a simplified strategy to apply MNAligner even for two large-scale molecular networks, i.e. by adopting decomposition technique to divide the whole PPI networks into small overlapping subnetworks so that MNAligner can be used on these subnetwork pairs for further detecting conserved patterns or network motifs, a sophisticated and accurate approach is desirable. Furthermore, based on the proposed method, it is also an interesting topic to develop a universal network query system, which can be used to query a functional unit of a given species, such as a complex or a well-known pathway in the network of a different species to find conserved (sub)units.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work is partly supported by Important Research Direction Project of CAS ‘Some Important Problems in Bioinformatics’, National Natural Science Foundation of China under Grant No.10631070, 60503004, the National Basic Research Program (973 Program) under Grant No. 2006CB503910 and K.G.Wang Education Foundation Hong Kong. This work is also partly supported by JSPS-NSFC Collaboration Project. The authors are grateful to the associate editor and anonymous referees for comments and helping to improve the earlier version.

Conflict of Interest: none declared.


    FOOTNOTES
 
*The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. Back

Associate Editor: John Quackenbush

Received on December 20, 2006; revised on March 19, 2007; accepted on April 17, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FORMULATION OF NETWORK...
 3 COMPONENTS OF ALIGNMENT
 4 SIMULATION
 5 DISCUSSION AND CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Alon U. An introduction to systems biology: design principles of biological circuits. ( (2006) ) Boca Raton, FLA: Chapman &Hall/CRC, Taylor and Francis Group..

    Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, ( (2003) ) 4, : 2.[CrossRef][Medline].

    Bader JS, et al. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol, ( (2004) ) 22, : 78–85.[CrossRef][ISI][Medline].

    Bandyopadhyay S, et al. Systematic identification of functional orthologs based on protein network comparison. Genome Res, ( (2006) ) 16, : 428–435.[Abstract/Free Full Text].

    Berg J, Läsig M. Local graph algorithm and motif search in biological networks. Proc. Natl Acad. Sci. USA, ( (2004) ) 101, : 14689–14694.[Abstract/Free Full Text].

    Berg J, Läsig M. Cross–species analysis of biological networks by Bayesian alignment. PNAS, ( (2006) ) 103, : 10967–10972.[Abstract/Free Full Text].

    Chen L, et al. Dynamics of gene regulatory networks with cell division cycle. Phys. Rev. E, ( (2004) ) 70, : 011909.[CrossRef].

    Chen L, et al. Noise–induced cooperative behavior in a multi–cell system. Bioinformatics, ( (2005) ) 21, : 2722–2729.[Abstract/Free Full Text].

    Chen L, et al. Inferring protein interactions from experimental data by association probabilistic method. Proteins, ( (2006) ) 62, : 833–837.[CrossRef][ISI][Medline].

    Christie KR, et al. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res, ( (2004) ) 32, : D311–D314.[Abstract/Free Full Text].

    Deng M, et al. Assessment of the reliability of protein–protein interactions and protein function prediction. Pac. Symp. Biocomput, ( (2003) ) 8, : 140–151..

    Durbin R, et al. Biological Sequence Analyses: Probabilitic Methods of Proteins and Nucleic Acids, ( (1998) ) Cambridge: Cambridge University Press..

    Hartwell LH, et al. From molecular to modular cell biology. Nature, ( (1999) ) 402, : C47–C52.[CrossRef][Medline].

    Kelley BP, et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 11394–11399.[Abstract/Free Full Text].

    Kelley PB, et al. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res, ( (2004) ) 32, : 83–88..

    Koyutürk M, et al. Pairwise local alignment of protein interaction network guided by models of evolution. RECOM 2005 LNBI, ( (2005) ) 3500, : 48–65..

    Karp PD, et al. The E.coli EcoCyc Database: no longer just a metabolic pathway database. ASM News, ( (2004) ) 70, : 25–30.[ISI].

    Lee I, et al. A probabilistic functional network of yeast genes. Science, ( (2004) ) 306, : 1555–1558.[Abstract/Free Full Text].

    Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science, ( (2002) ) 296, : 910–913.[Abstract/Free Full Text].

    Ogata H, et al. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res, ( (2000) ) 28, : 4021–4028.[Abstract/Free Full Text].

    Pawlowski K, et al. Sensitive sequence comparison as protein function predictor. Pac. Symp. Biocomput, ( (2000) ) 5, : 42–53..

    Pinter RY, et al. Alignment of metabolic pathways. Bioinformatics, ( (2005) ) 21, : 3401–3408.[Abstract/Free Full Text].

    Remm M, et al. Automatic clustering of orthologs and in–paralogs from pairs species comparisons. J. Mol. Biol, ( (2001) ) 314, : 1041–1052.[CrossRef][ISI][Medline].

    Rives AW, Galitski T. Modular organization of cellular networks. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 1128–1133.[Abstract/Free Full Text].

    Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nat. Biotechnol, ( (2006) ) 24, : 427–433.[CrossRef][ISI][Medline].

    Tohsato Y, et al. A multiple alignment algorithm for metabolic pathway analysis using enzyme hierarchy. ( (2000) ) Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000). 376–383..

    Trusina A, et al. Functional alignment of regulatory networks: a study of temerate phages. PLoS Comput. Biol, ( (2005) ) 1, : e74.[CrossRef][Medline].

    Wang R, Chen L. Synchronizing genetic oscillators by signalling molecules. J. Biol. Rhythms, ( (2005) ) 20, : 257–269.[Abstract].

    Wang Y, et al. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics, ( (2006) ) 22, : 2413–2420.[Abstract/Free Full Text].

    Ye Y. On affine-scaling algorithm for nonconvex quadratic programming. Math. Programm, ( (1992) ) 56, : 285–300.[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
B. S. Srinivasan, N. H. Shah, J. A. Flannick, E. Abeliuk, A. F. Novak, and S. Batzoglou
Current progress in network research: toward reference networks for key model organisms
Brief Bioinform, September 1, 2007; 8(5): 318 - 332.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/13/1631    most recent
btm156v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Li, Z.
Right arrow Articles by Chen, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, Z.
Right arrow Articles by Chen, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this? </