Bioinformatics Advance Access originally published online on October 28, 2004
Bioinformatics 2005 21(7):1154-1163; doi:10.1093/bioinformatics/bti071
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm
1Bioinformatics Group, RIKEN Genomic Sciences Center 1-7-22 Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
2Protein Research Group, RIKEN Genomic Sciences Center 1-7-22 Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
3Structurome Group, RIKEN Harima Institute at Spring-8 1-1-1 Kohto, Mikazuki-cho, Sayo, Hyogo 679-5148, Japan
4Cellular Signaling Laboratory, RIKEN Harima Institute at Spring-8 1-1-1 Kohto, Mikazuki-cho, Sayo, Hyogo 679-5148, Japan
5Tokyo Research Laboratory, IBM Japan 1623-14 Shimo-tsuruma, Yamato, Kanagawa 242-8502, Japan
6Department of Biophysics and Biochemistry, Graduate School of Science, the University of Tokyo 7-3-1 Hongo, Bunkyo, Tokyo 113-0033, Japan
7Department of Biology, Graduate School of Science, Osaka University Toyonaka, Osaka 560-0043, Japan
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: To resolve the high-dimensionality of the genetic network inference problem in the S-system model, a problem decomposition strategy has been proposed. While this strategy certainly shows promise, it cannot provide a model readily applicable to the computational simulation of the genetic network when the given time-series data contain measurement noise. This is a significant limitation of the problem decomposition, given that our analysis and understanding of the genetic network depend on the computational simulation.
Results: We propose a new method for inferring S-system models of large-scale genetic networks. The proposed method is based on the problem decomposition strategy and a cooperative coevolutionary algorithm. As the subproblems divided by the problem decomposition strategy are solved simultaneously using the cooperative coevolutionary algorithm, the proposed method can be used to infer any S-system model ready for computational simulation. To verify the effectiveness of the proposed method, we apply it to two artificial genetic network inference problems. Finally, the proposed method is used to analyze the actual DNA microarray data.
Contact: skimura{at}gsc.riken.jp
Supplementary information: See Bioinformatics Online.
| INTRODUCTION |
|---|
|
|
|---|
Advancement in technologies such as DNA microarrays allows us to measure gene expression patterns on a genomic scale, but to exploit these technologies we must find ways to extract useful information from the massive amount of data (Kwon et al., 2003). Among the possible solutions for extracting information, many researchers have taken an interest in the inference of genetic networks. The inference of genetic networks is a problem in which mutual interactions among genes are estimated using time-series data of gene expression patterns. The inferred model of the genetic network is conceived as an ideal tool to help biologists generate hypotheses and facilitate the design of their experiments. On another level, it may also shed light on the biological functions of genes.
The numerous models proposed to describe biochemical networks have ranged from simple Boolean networks to detailed sets of differential equations of an arbitrary form (Akutsu et al., 2000; Chen et al. 1999; D'haeseleer et al., 2000; Maki et al., 2001; Sakamoto and Iba, 2001; Vance et al., 2002; Weaver et al., 1999). One of the well-studied models among them, the S-system, possesses a rich structure capable of capturing various dynamics, and can be analyzed by several available methods (Savageau, 1976; Voit and Radivoyevitch, 2000). These advantages have led to the successful application of the S-system model to the analysis of biochemical networks (e.g. Shiraishi and Savageau, 1992; Voit and Radivoyevitch, 2000). The model is a set of non-linear differential equations of the form
![]() | (1) |
i and ßi are multiplicative parameters called rate constants, and gi,j and hi,j are exponential parameters called kinetic orders.
The genetic network inference problem based on the S-system model is defined as an estimation problem of the S-system parameters. Several algorithms for the inference of S-system models of genetic networks have been proposed (Kikuchi et al., 2003; Morishita et al., 2003; Tominaga et al., 2000; Ueda et al., 2002). These algorithms estimate the S-system parameters (
i, ßi, gi,j and hi,j) using observed time-series data of gene expression patterns. Because the number of S-system parameters is proportional to the square of the number of network components, the algorithms must simultaneously estimate a large number of S-system parameters if they are to be used to infer large-scale network systems containing many network components. This is why inference algorithms based on the S-system model have only been applied to small-scale networks of less than five genes.
To resolve the high-dimensionality of the genetic network inference problem in the S-system model, a problem decomposition strategy, that divides the original problem into several subproblems, has been proposed (Maki et al., 2002; Kimura et al., 2003). This approach enables us to infer S-system models of large-scale genetic networks. However, when the given time-series data contain measurement noise, the inferred model cannot be used to computationally simulate a genetic network. Given that we depend on computational simulation for our analysis and understanding of the genetic network, this is viewed as an important disadvantage of the problem decomposition approach. To overcome the high-dimensionality of the genetic network inference problem, (Voit and Almeida, 2004) have proposed another approach that transforms the problem into a set of algebraic equations. However, the same disadvantage as the problem decomposition strategy still remains in their approach.
In this paper, we propose a new method to overcome this disadvantage. The proposed method solves the decomposed subproblems simultaneously using a cooperative coevolutionary algorithm (Potter and De Jong, 2000). All of the subproblems in this coevolutionary algorithm interact with each other through time-courses of gene expression levels. With this interaction, the proposed method can be used to infer any S-system model ready for computational simulation. To verify the effectiveness of this method, we apply it to two artificial genetic network inference problems containing 5 and 30 genes, respectively. Finally, the proposed method is used to analyze the actual DNA microarray data.
| GENETIC NETWORK INFERENCE PROBLEM |
|---|
|
|
|---|
Canonical problem definition
In general, the genetic network inference problem is formulated as a function optimization problem to minimize the following sum of the squared relative error (e.g. see Tominaga et al., 2000).
![]() | (2) |
Since 2N(N+1) S-system parameters must be determined in order to solve the set of differential equations (1), this function optimization problem is 2N(N + 1)-dimensional. This problem is too high-dimensional for non-linear function optimizers in cases where we try to infer S-system models of large-scale genetic networks containing many network components (Maki et al., 2001).
Decomposition of the problem
Because of the high-dimensionality, function optimizers have difficulty inferring S-system models of large-scale genetic networks. To resolve the high-dimensionality, Maki et al., 2002 proposed the strategy of dividing the genetic network inference problem into several subproblems. In this strategy, each subproblem corresponds to each gene. The objective function of the subproblem corresponding to the i-th gene is
![]() | (3) |
![]() | (4) |
![]() | (5) |
is an estimated time-course of the j-th gene expression level acquired not by solving a differential equation, but by making a direct estimation from the observed time-series data. We can obtain
using an interpolation technique such as a spline interpolation (Press et al., 1995) or a local linear regression (Cleveland, 1979).
Equation (4) is solvable when 2(N + 1) S-system parameters (i.e.
i, ßi, gi,1, ..., gi,N, hi,1, ..., hi,N) are given. Thus, the problem decomposition strategy divides a 2N(N + 1)-dimensional network inference problem into N subproblems that are 2(N + 1)-dimensional.
Use of a priori knowledge
The genetic network inference problem based on the S-system model may have multiple optima because the degree-of-freedom of the model is high and the observed time-series data are usually polluted by the measurement error. To increase the probability of inferring a correct S-system model, we introduced a priori knowledge of the genetic network into the objective function (Kimura et al., 2003).
Genetic networks are known to be sparsely connected (Thieffry et al., 1998). When an interaction between two genes is clearly absent, the S-system parameter values corresponding to the interaction (i.e. kinetic orders; gi,j and hi,j) are zero. Kikuchi et al., 2003 incorporated this knowledge into the objective function using a penalty term named the pruning term. This turns out to be an imperfect solution, however, since the pruning term pushes all of the kinetic orders down to zero, a condition that may make prevent the model from finding the existing interactions. To avoid this, we incorporated the knowledge into the objective function (3) by using another penalty term, as shown below (Kimura et al., 2003).
![]() | (6) |
|Gi,2|
...
|Gi,N| and |Hi,1|
|Hi,2|
...
|Hi,N|). The variable c is a penalty coefficient and I is a maximum indegree. The maximum indegree determines the maximum number of genes that directly affect the i-th gene. The penalty term is the second term on the right-hand side of Equation (6). This term forces most of the kinetic orders down to zero. In other words, when the penalty term is applied, most of the genes are disconnected from each other. However, when the number of genes that directly affect the i-th gene is smaller than the maximum indegree I, the term does not penalize. Thus, the optimum solutions to the objective functions (3) and (6) are identical when the number of interactions that affect the focused (i-th) gene is lower than the maximum indegree. In this paper, we use Equation (6) as an objective function that should be minimized.
| PROPOSED METHOD |
|---|
|
|
|---|
Concept
The problem decomposition strategy proposed by Maki et al., 2002 enables us to infer large-scale genetic networks. To solve the subproblems decomposed by this strategy, as mentioned above, the estimated time-courses of the gene expression levels,
, must be given. In the problem decomposition strategy,
are estimated directly from the observed time-series data using some interpolation method, and are not updated through the search. If
are correctly estimated, optimum solutions obtained from the problem decomposition approach and the canonical (non-decomposed) approach completely coincide with each other. However, when the given time-series data contain measurement noise, it is often difficult for us to estimate
correctly. When incorrect
are used, the optimum solutions of the decomposed subproblems do not always coincide with that of the non-decomposed problem. This means that the parameters obtained by solving the subproblems do not always provide us with a model [i.e. a set of differential equations (1)] that fits into the observed data. As such, in the problem decomposition approach, the inferred model is not yet suitable for the computational simulation of genetic networks.
In the subproblem corresponding to the i-th gene, the time-course of the i-th gene expression level is calculated by solving the differential equation (4). When optimizing the i-th subproblem, the function optimizer searches for the S-system parameters which make the calculated expression time-course of the i-th gene fits into the observed data. Therefore, the calculated time-courses obtained by solving the subproblems are the most suitable for
. If we can always use the calculated time-courses of the gene expression levels as
, optimizing the subproblems should give the model that fits into the observed data.
In order to use the time-courses of the gene expression levels obtained by solving the subproblems as
, we can use the cooperative coevolutionary approach (Liu et al., 2001; Potter and De Jong, 2000) an extension of the evolutionary algorithm (Holland, 1975). It consists of several subpopulations, each of which contains competing individuals (candidate solutions) for each subproblem. The subpopulations are genetically isolated, i.e. individuals mate only with other members of their subpopulation. These subpopulations interact with each other only when the fitness values (objective values) are calculated. In this paper, the subpopulations interact with each other only through the gene expression time-courses, i.e. when the proposed method solves the i-th subproblem, the calculated expression time-courses of the other genes, which are obtained from the best individuals of the other subproblems at the previous generation, are used as
(Fig. 1).
|
Algorithm
On the basis of the concept described above, we propose a new cooperative coevolutionary algorithm for inferring genetic networks. The following is an algorithm of the proposed method.
- Initialize. Generate N subpopulations, where N is the number of components in the genetic network. As an initial guess, estimate the gene expression time-courses from the observed time-series data. Set Generation = 0.
- Execution of function optimization. Execute one cycle of a function optimization algorithm on each subpopulation. In this paper, we use GLSDC (Kimura and Konagaya, 2003) as the function optimizer.
- Update of estimated gene expression time-courses. Update all of the estimated gene expression time-courses using the best individuals of the subpopulations. The updated gene expression time-courses are used as
in the next generation.
- Stop if halting criteria are satisfied. Otherwise, Generation
Generation +1 and go to Step 2.
Initialize
N subpopulations, each of which corresponds to one subproblem, are generated. Each subpopulation contains np individuals which are randomly created. At the same time, initial estimations of time-courses of gene expression levels,
, are made directly from the observed time-series data. In this paper, the local linear regression (Cleveland, 1979) is used to estimate the time-courses.
Execution of function optimization
Any type of function optimizer can be applied to the decomposed subproblem. In this study, we decided to adopt GLSDC, an evolutionary algorithm successfully applied to the genetic network inference problem as a function optimizer by (Kimura et al. (2003, 2004.
In this step, one cycle (generation) of GLSDC is performed on each subpopulation. When the algorithm calculates the fitness value of each individual in each subpopulation, the differential equation (4) is solved using the estimated time-courses of the gene expression levels,
. An initial gene expression level (an initial state value for the differential equation) is required together with the S-system parameters at this time. In this study, the initial gene expression level of the i-th gene was obtained from its estimated gene expression time-course, i.e. the value of
was used for Xi,cal,0.
Update of estimated gene expression time-courses
Next, we calculate the time-courses of the gene expression levels obtained from the best individuals of the subpopulations, each of which is given as a solution of the differential equation (4). The old gene expression time-courses are then updated to these calculated time-courses. The updated gene expression time-courses are used as
in the next generation.
When we calculate the time-courses of the gene expression levels, the initial levels of the gene expression are required. Since the noise in the actual time-series data corrupts the values of the initial gene expression levels, we should estimate these values together with the S-system parameters. However, the simultaneous estimation of the initial gene expression levels and S-system parameters increases the dimensionality of the function optimization problem, creating a condition inconvenient for function optimizers. To avoid this problem, we use an alternate method for estimation (Kimura et al., 2004).
In this step, the initial levels of the gene expression are adjusted to fit the new calculated gene expression time-courses into the observed time-series data, before the gene expression time-courses are updated. The adjustment of the initial gene expression level of the i-th gene is formulated as a one-dimensional function minimization problem. This is because the initial gene expression level of the i-th gene is a unique variable and all of the S-system parameters are fixed to the values of the best individual. The objective function of this adjustment problem is
![]() | (7) |
(0
1) is a discount parameter. Since the fixed model parameters obtained from the best individual are not always optimal, the calculated gene expression time-course may differ greatly from the actual time-course. When the estimated time-course is incorrect, the algorithm should not fit the time-course, especially the latter half of it, into the observed data. Therefore, in this study, we introduce a discount parameter
.
A golden section search (Press et al., 1995) is used to solve the one-dimensional function minimization problem described above. When multiple sets of time-series data are given as the observed data, the one-dimensional search is applied to all of the sets. After the adjustment, the new calculated gene expression time-courses are substituted for the old ones, and they are used as
in the next generation.
| NUMERICAL EXPERIMENTS |
|---|
|
|
|---|
To show the effectiveness of the proposed method, we applied it to two artificial genetic network inference problems. Then, it was used to analyze the actual DNA microarray data.
Experiment 1: noise-free environment
In this experiment, we confirm that the proposed method has an ability to infer a correct S-system model of the genetic network when a sufficient amount of noise-free data is given.
Experimental setup
As a target genetic network, we used a small-scale S-system model with the parameters listed in Table 1 (Kikuchi et al., 2003). This model consists of five network components (N = 5).
|
If an insufficient amount of time-series data is given as observed gene expression patterns, the high degree-of-freedom of S-system models ensures that many candidate solutions will be found. In this experiment, however, we used a sufficient amount of time-series data to enhance our chances of finding the correct solution. Specifically, we used 15 sets of noise-free time-series data, each covering all five genes. The sets of time-series data were obtained by solving the set of differential equations (1) on the target model. The initial values of these sets were generated randomly (listed in Table 2). In a practical application, these sets of time-series data could be obtained by actual biological experiments under different experimental conditions. A total of 11 sampling points for the time-series data were assigned on each gene in each set. Thus, the observed time-series data for each gene consisted of 15 x 11 = 165 sampling points. In this experiment, 2 x 5 x (5 + 1) = 60 S-system parameters and 15 x 5 = 75 levels of initial gene expression should be estimated.
|
As the proposed method is based on the stochastic search algorithm, we should perform multiple runs by changing the seed for pseudo random number in order to confirm its performance. Therefore, five runs were carried out. In each run, the proposed method produces one candidate solution. Each run was continued until the number of generations reached 75. The search regions of the parameters were [0.0, 20.0] for
i and ßi, and [3.0, 3.0] for gi,j and hi,j. As the observed initial gene expression levels should be close to true ones even when they are polluted by measurement error, the search regions of the initial gene expression levels were set to ±30% of the observed ones (i.e. [0.7 Xi,exp,0, 1.3 Xi,exp,0]). The maximum indegree I was 5, the penalty coefficient c was 1.0, and the discount parameter
was 0.75. In this paper, we used the following GLSDC parameters; the population size np is 3n, where n is the dimension of the search space of each subproblem; the number of children generated by the crossover per selection nc is 10; and the number of applied the converging operations N0 is np. The experiments were executed in parallel on a PC cluster (Pentium III 933 MHz x 8 CPUs).
In order to reduce the computational cost, we applied a structure skeletalizing technique (Tominaga et al., 2000). This technique assigns a value of zero to the kinetic orders (gi,j and hi,j), whose absolute values are less than the given threshold
s. Structure skeletalizing reduces the computational cost because the exponential calculation of Equation (4) can be omitted when the kinetic orders are zero. In this paper, the given threshold
s was 1.0 x 103.
Result
Tables 3 and 4 show the samples of the S-system parameters and the initial gene expression levels, respectively, estimated by the proposed method. As the tables show, our method was unable to estimate the parameter values with perfect precision. Notwithstanding, the values were precise enough to biologically interpret the network. The sum of the squared relative error between the time-courses produced by the inferred model and the given time-series data, i.e. the value of the function (2), averaged about 2.08 x 103 ±0.77 x 103.
|
|
In this experiment, we confirmed the effectiveness of the proposed method by estimating both the initial gene expression levels and the S-system parameters. In practice, however, there is no need to estimate the initial levels of the gene expression when the observed data seem to contain no measurement error. When the initial gene expression levels do not need to be estimated, the estimated parameters will be more precise since the problem contains fewer unknown parameters.
Our method running on the PC cluster (Pentium III 933 MHz x 8 CPUs) required
89.0 min to solve this problem. This is far less computing time than that required by Predictor by Evolutionary Algorithms and Canonical Equations 1 (PEACE1) proposed by Kikuchi et al. (2003). PEACE1 running on a PC cluster (Pentium III 933 MHz x 1040 CPUs) reportedly took more than 10 h to estimate the S-system parameters.
Experiment 2: noisy environment
Next, we test the performance of our method in a real-world setting by conducting the experiment with the sets of noisy time-series data.
Experimental setup
A large-scale S-system model containing 30 genes (N = 30) was used as a target model. The network structure and the S-system parameters of the model are given in Figure 2 and Table 5 respectively (Maki et al., 2001). The observed gene expression data included 20 sets of time-series data, each covering all 30 genes. The sets of time-series data began from randomly generated initial values in [0.0,2.0] and were obtained by solving the set of differential equations (1) on the target model. We added 10% Gaussian noise to the time-series data in order to simulate the measurement noise that often corrupts the observed data obtained from actual measurements of gene expression patterns. A total of 11 sampling points for the time-series data were assigned on each gene in each set. In this experiment, 2 x 30 x (30 + 1) + 30 x 20 = 2460 parameters should be estimated.
|
|
Five runs were carried out. As the performance of the algorithm seems to depend on the given data, different sets of time-series data, generated randomly, were used in each run. The search regions were [0.0, 3.0] for
i and ßi, [3.0, 3.0] for gi,j and hi,j, and [ 0.7 Xi,exp,0, 1.3 Xi,exp,0] for the initial levels of the gene expression. The experiments were executed in parallel on a PC cluster (Pentium III 933 MHz x 32 CPUs). All of the other experimental conditions were the same as those used in the experiment conducted in the noise-free environment described above. To confirm the effectiveness of the coevolutionary approach, we compared the results to those of a non-coevolutionary method that did not consider the interactions between decomposed subproblems (Kimura et al., 2004). This paper refers to this non-coevolutionary method as the problem decomposition approach.
Result
Figure 3 shows the calculated gene expression time-courses obtained from the methods with and without the coevolution. The calculated time-courses obtained by solving the set of Equations (1) and (4), respectively, are shown in the figure. As shown in Figure 3A, when the proposed coevolutionary approach was applied, the time-course obtained by solving the set of equations (1) was almost identical to that obtained by solving Equation (4). On the contrary, the calculated time-courses of the problem decomposition approach differed greatly (Fig. 3B).
|
When inferring S-system models of genetic networks, both approaches use the differential equation (4) to calculate time-courses of gene expression levels. In Equation (4), however, the perturbation in the i-th gene does not affect the expression levels of the other genes. Therefore, Equation (4) is not a suitable model to help biologists generate hypotheses and facilitate the design of their experiments, while it is a useful model for inferring genetic networks. When we analyze the inferred genetic network, the set of equations (1) must be used as the model for computational simulation. From this point of view, the problem decomposition approach does not produce a suitable model for computational simulation, since the model does not always fit into the observed data. As the time-courses obtained from Equation (1) are almost identical to those obtained from Equation (4), the proposed approach provides us with a suitable model. The sum of the squared relative error between the given data and the calculated time-courses of the model inferred by the proposed method was always smaller than that obtained from the problem decomposition approach in this experiment. The sums of squared relative error obtained from the methods with and without the coevolution averaged about 27.72 ± 0.68 and 28.18 ± 0.76, respectively.
Typical results obtained from the methods with and without the coevolution are shown in Figures 4 and 5 respectively. Inferred interactions for the 8th, 16th and 24th subproblems are shown. As the results show, both methods failed to infer some of the interactions present in the target model, and they inferred many erroneous interactions that had absolute parameter values too large to ignore. Weakly interactions were, especially, difficult to be correctly inferred, e.g. both methods often failed to infer interactions corresponding to g1,14, g24,18 and g26,28. In addition, an interaction represented as gi,j was sometimes inferred as that of hi,j. The failure to infer the correct interactions, however, does not seriously hinder our investigation, as the inferred model is intended mainly for use by biologists as a tool for generating hypotheses and facilitating the design of experiments. The necessary interactions that were not correctly inferred should be added, and the erroneous interactions should be removed in either of two ways, i.e. by using more sets of time-series data obtained from additional biological experiments or by using further a priori knowledge about the genetic network.
|
|
The model inferred by the proposed method contained 58.4 ± 2.1 true-positive interactions and 241.6 ± 2.1 false-positive interactions on average. The number of the inferred interactions corresponded to the maximum indegree I. Our method failed to infer an average of 9.6 ± 2.1 interactions that were present in the target model (i.e. the number of false-negative interactions was 9.6 ± 2.1). On the contrary, in the experiment using the problem decomposition approach, the numbers of true-positive false-positive and false-negative interactions averaged 57.6 ± 2.3, 242.4 ± 2.3 and 10.4 ± 2.3, respectively. To solve this problem, the proposed coevolutionary method required
57.8 h on the PC cluster (Pentium III 933 MHz x 32 CPUs). The computational time that the problem decomposition approach required for optimizing each subproblem averaged
57.5 h on a single-CPU personal computer (Pentium III 933 MHz), and the subproblems were optimized simultaneously on the PC cluster.
The experimental results suggest that our coevolutionary approach slightly improves the probability of inferring the correct interactions. In order to confirm the improvement, we performed a number of other experiments using different amount of time-series data. Figure 6 shows the number of false-negative interactions in each. In all of the experiments, the proposed method slightly reduced the number of false-negative interactions, i.e. it enhanced the probability of finding the correct interactions. This may be because the proposed method updates the estimated gene expression time-courses,
. In this study, the algorithm uses
to solve the decomposed subproblems. Therefore,
must be precise if the probability of finding the correct interactions is to be improved. Because the proposed coevolutionary approach updates
, their precision may be improved through searches.
|
In the proposed coevolutionary method, the improvement in a performance of finding correct interactions was slight. However, it should be noted that the same time-series data were given to both methods as the observed ones. As the proposed method extracts correcter information from the data, the inferred model is more reasonable to analyze genetic networks.
Experiment 3: analysis of actual data
The proposed method enables us to infer large-scale genetic networks containing dozens of genes. However, when attempting to analyze actual DNA microarray data, many hundreds or thousands of genes must be handled. This task lies far beyond the powers of the proposed coevolutionary method. One possible strategy to improve its inference capability is to use any clustering technique to identify genes with similar expression patterns and group them together (D'haeseleer et al., 2000; Eisen et al., 1998). By treating groups of similar genes as single-network components, the proposed coevolutionary method is capable of analyzing systems containing many hundreds of genes. In this study, we combined the proposed method with the clustering technique proposed by Kano et al., 2003. The combined method was then used to analyze cDNA microarray data of Thermus thermophilus HB8 strains.
Two sets of cDNA microarray time-series data, i.e. wild-type and UvrA gene disruptant, were observed. Each sets of the data were measured at 14 time points. The clustering technique used in this study grouped 612 putative open reading frames (ORFs) included in the data into 24 clusters. As we treated the disrupted gene, UvrA, as single-network component, the target system consisted of 24 + 1 = 25 network components. The time-series data of each cluster was given by averaging the expression patterns of ORFs included in the cluster. A total of 10 runs were carried out. The maximum indegree I was 3. The search regions were [0.0, 5.0] for
i and ßi, [3.0, 3.0] for gi,j and hi,j, and [ 0.7Xi,exp,0, 1.3 Xi,exp,0] for the initial levels of the gene expression. The experiments were executed in parallel on a PC cluster (Pentium III 933 MHz x 32 CPUs). All of the other experimental conditions were the same as those in previous experiments.
Figure 7 shows the core network structure where the interactions were inferred by the proposed method more than nine times within 10 runs. As the amount of observed data was insufficient, the inferred network model seems to contain many erroneous interactions. However, some reasonable interactions were also inferred. Many ORFs contained in the clusters 6, 7, 10, 15, 16, 19 and 22 are annotated to be concerned with Energy metabolism, and these clusters were relatively located near from each other in the inferred model. The figure shows the clusters 12 and 23 were also located near from the clusters of Energy metabolism. However, a few ORFs contained in the clusters 12 and 23 are annotated to be concerned with Energy metabolism. This fact suggests that some of hypothetical and unknown ORFs included in the clusters 12 and 23 may work for Energy metabolism or related functions.
|
In this experiment, as the amount of the measured time-series data was insufficient, it is hard to extract many suggestions from the inferred network. To obtain more meaningful results, we are now planning additional biological experiments.
| CONCLUSION |
|---|
|
|
|---|
In this paper, we proposed a new method for inferring the S-system models of large-scale genetic networks. The proposed method uses the problem decomposition strategy to divide the genetic network inference problem into several subproblems. The decomposed subproblems are then solved simultaneously using the cooperative coevolutionary algorithm. Because the decomposed subproblems interact with each other through their calculated gene expression time-courses, the inferred model can be used in the computational simulation. This feature is important because the computational simulation provides us with a better understanding of genetic networks. Through numerical experiments, we showed that the proposed method slightly enhanced the probability of finding the correct interactions of a network. Updating the gene expression time-courses also seems to enhance the probability of inferring a correct network structure. Finally, to analyze actual DNA microarray data, we combined the proposed coevolutionary method with the clustering technique.
| Acknowledgments |
|---|
We thank Dr S. Kuhara and Dr K. Tashiro in Kyushu University for supervising cDNA microarray experiments.
Received on June 16, 2004; revised on September 1, 2004; accepted on September 18, 2004
| REFERENCES |
|---|
|
|
|---|
Akutsu, T., Miyano, S., Kuhara, S. (2000) Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 727734
Chen, T., He, H.L., Church, G.M. (1999) Modeling gene expression with differential equations. Proc. Pac. Symp. Biocomput., 4, 2940.
Cleveland, W.S. (1979) Robust locally weight regression and smoothing scatterplots. J. Am. Stat. Assoc., 79, 829836[CrossRef].
Dhaeseleer, P., Liang, S., Somogyi, R. (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16, 707726
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci., USA, 95, 1486314868
Holland, J.H. Adaptation in Natural and Artificial Systems, (1975) , Ann Arbor, MI The University of Michigan Press.
Kano, M., Nishimura, K., Tsutsumi, S., Aburatani, H., Hirota, K., Hirose, M. (2003) Cluster overlap distribution map: visualization for gene expression analysis using immersive projection technology. Presence Teleoper. Virt. Environ., 12, , pp. 96109.
Kikuchi, S., Tominaga, D., Arita, M., Takahashi, K., Tomita, M. (2003) Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics, 19, 643650
Kimura, S. and Konagaya, A. (2003) High dimensional function optimization using a new genetic local search suitable for parallel computers. Proceedings of the 2003 Conference on Systems, Man and Cybernetics , Washington, DC, USA , pp. 335342.
Kimura, S., Hatakeyama, M., Konagaya, A. (2003) Inference of S-system models of genetic networks using a genetic local search. Proceedings of the 2003 Congress on Evolutionary Computation , Canberra, Australia , pp. 631638.
Kimura, S., Hatakeyama, M., Konagaya, A. (2004) Inference of S-system models of genetic networks from noisy time-series data. Chem-Bio Informatics J., 4, 114.
Kwon, A.T., Hoos, H.H., Ng, R. (2003) Inference of transcriptional regulation relationships from gene expression data. Bioinformatics, 19, 905912
Liu, Y., Yao, X., Zhao, Q., Higuchi, T. (2001) Scaling up fast evolutionary programming with cooperative coevolution. Proceedings of the 2001 Congress on Evolutionary Computation , Seoul, Korea , pp. 11011108.
Maki, Y., Tominaga, D., Okamoto, M., Watanabe, S., Eguchi, Y. (2001) Development of a system for the inference of large scale genetic networks. Proc. Pac. Symp. Biocomput., 6, 446458.
Maki, Y., Ueda, T., Okamoto, M., Uematsu, N., Inamura, Y., Eguchi, Y. (2002) Inference of genetic network using the expression profile time course data of mouse P19 cells. Genome Inform., 13, 382383.
Morishita, R., Imade, H., Ono, I., Ono, N., Okamoto, M. (2003) Finding multiple solutions based on an evolutionary algorithm for inference of genetic networks by S-system. Proceedings of the 2003 Congress on Evolutionary Computation , Canberra, Australia , pp. 615622.
Potter, M.A. and De Jong, K.A. (2000) Cooperative coevolution: an architecture for evolving coadapted subcomponents. Evol. Comput., 8, 129[CrossRef][Medline].
Press, W., Teukolsky, S., Vetterling, W., Flannery, B. Numerical Recipes in C, (1995) 2nd edn , Cambridge, UK Cambridge University Press.
Sakamoto, E. and Iba, H. (2001) Inferring a system of differential equations for a gene regulatory network by using genetic programming. Proceedings of the 2001 Congress on Evolutionary Computation , Seoul, Korea , pp. , pp. 720726.
Savageau, M.A. Biochemical Systems Analysis: a Study of Function and Design in Molecular Biology, (1976) , Reading, MA Addison-Wesley.
Shiraishi, F. and Savageau, M.A. (1992) The Tricarboxylic acid cycle in Dictyostelium discoideum. J. Biol. Chem., 267, , pp. 2291222918
Thieffry, D., Huerta, A.M., Pérez-Rueda, E., Collado-Vides, J. (1998) From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. BioEssays, 20, 433440[CrossRef][Web of Science][Medline].
Tominaga, D., Koga, N., Okamoto, M. (2000) Efficient numerical optimization algorithm based on genetic algorithm for inverse problem. Proceedings of the Genetic and Evolutionary Computation Conference , Las Vegas, Nevada, USA , pp. 251258.
Ueda, T., Ono, I., Okamoto, M. (2002) Development of system identification technique based on real-coded genetic algorithm. Genome Inform., 13, 386387.
Vance, W., Arkin, A., Ross, J. (2002) Determination of causal connectivities of species in reaction networks. Proc. Natl Acad. Sci., USA, 99, 58165821
Voit, E.O. Computational Analysis of Biochemical Systems, (2000) , Cambridge, UK Cambridge University Press.
Voit, E.O. and Almeida, J. (2004) Decoupling dynamical systems for pathway identification from metabolic profiles. Bioinformatics, 20, , pp. 16701681
Voit, E.O. and Radivoyevitch, E.O. (2000) Biochemical systems analysis of genome-wide expression data. Bioinformatics, 16, 10231037
Weaver, D.C., Workman, C.T., Stormo, G.D. (1999) Modeling regulatory networks with weight matrices. Proc. Pac. Symp. Biocomput., 4, 112123.
This article has been cited by other articles:
![]() |
W.-P. Lee and W.-S. Tzou Computational methods for discovering gene networks from expression data Brief Bioinform, July 1, 2009; 10(4): 408 - 423. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Kimura, S. Nakayama, and M. Hatakeyama Genetic network inference as a series of discrimination tasks Bioinformatics, April 1, 2009; 25(7): 918 - 925. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Gennemark and D. Wedelin Benchmarks for identification of ordinary differential equations from time series data Bioinformatics, March 15, 2009; 25(6): 780 - 786. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Goel, I-C. Chou, and E. O. Voit System estimation from metabolic time-series data Bioinformatics, November 1, 2008; 24(21): 2505 - 2511. [Abstract] [Full Text] [PDF] |
||||
![]() |
P.-K. Liu and F.-S. Wang Inference of biochemical network models in S-system using multiobjective optimization approach Bioinformatics, April 15, 2008; 24(8): 1085 - 1092. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. R. Gonzalez, C. Kuper, K. Jung, P. C. Naval Jr, and E. Mendoza Parameter estimation using Simulated Annealing for S-system models of biochemical networks Bioinformatics, February 15, 2007; 23(4): 480 - 486. [Abstract] [Full Text] [PDF] |
||||
![]() |
D.-Y. Cho, K.-H. Cho, and B.-T. Zhang Identification of biochemical networks by S-tree based genetic programming Bioinformatics, July 1, 2006; 22(13): 1631 - 1640. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, et al. Machine learning in bioinformatics Brief Bioinform, March 1, 2006; 7(1): 86 - 112. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||















