Skip Navigation


Bioinformatics Advance Access originally published online on July 24, 2006
Bioinformatics 2006 22(19):2413-2420; doi:10.1093/bioinformatics/btl396
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/19/2413    most recent
btl396v2
btl396v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (27)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Y.
Right arrow Articles by Chen, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Y.
Right arrow Articles by Chen, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Inferring gene regulatory networks from multiple microarray datasets

Yong Wang 1,2, Trupti Joshi 3, Xiang-Sun Zhang 2,*, Dong Xu 3,* and Luonan Chen 1,4,*

1 Department of Electrical Engineering and Electronics, Osaka Sangyo University Osaka 574-8530, Japan
2 Academy of Mathematics and Systems Science, CAS Beijing 100080, China
3 Computer Science Department and Christopher S. Bond Life Sciences Center, University of Missouri Columbia, MO 65211, USA
4 Institute of Systems Biology, Shanghai University Shanghai 200444, China

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: Microarray gene expression data has increasingly become the common data source that can provide insights into biological processes at a system-wide level. One of the major problems with microarrays is that a dataset consists of relatively few time points with respect to a large number of genes, which makes the problem of inferring gene regulatory network an ill-posed one. On the other hand, gene expression data generated by different groups worldwide are increasingly accumulated on many species and can be accessed from public databases or individual websites, although each experiment has only a limited number of time-points.

Results: This paper proposes a novel method to combine multiple time-course microarray datasets from different conditions for inferring gene regulatory networks. The proposed method is called GNR (Gene Network Reconstruction tool) which is based on linear programming and a decomposition procedure. The method theoretically ensures the derivation of the most consistent network structure with respect to all of the datasets, thereby not only significantly alleviating the problem of data scarcity but also remarkably improving the prediction reliability. We tested GNR using both simulated data and experimental data in yeast and Arabidopsis. The result demonstrates the effectiveness of GNR in terms of predicting new gene regulatory relationship in yeast and Arabidopsis.

Availability: The software is available from http://zhangorup.aporc.org/bioinfo/grninfer/, http://digbio.missouri.edu/grninfer/ and http://intelligent.eic.osaka-sandai.ac.jp or upon request from the authors.

Contact: chen{at}eic.osaka-sandai.ac.jp, xudong{at}missouri.edu, zxs{at}amt.ac.cn

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Microarray technologies have produced tremendous amounts of gene expression data (van Someren et al., 2001; Hughes et al., 2000). Mining these data to understand gene expression and regulation represents a major challenge for bioinformatics. A major focus on microarray data analysis is the reconstruction of gene regulatory network (GN), which aims to find the underlying network of gene-gene interactions from the measured dataset of gene expression (Hartemink, 2005; Basso et al., 2005; Levine and Davidson, 2005; Akutsu et al., 2000; H (2000)). A wide variety of approaches have been proposed to infer gene regulatory networks from time-course data (Holter et al., 2001; Tegner et al., 2003; Dewey and Galas, 2001), such as discrete models of Boolean networks and Bayesian networks (Husmeier, 2003; Rangel et al., 2004; Beal et al., 2005), and continuous models of neural networks, difference equations (van Someren et al., 2001) and differential equations (Chen and Aihara, 2001, 2002).

Since a typical gene expression dataset consists of relatively few time points (often <20) with respect to a large number of genes (generally in thousands), a major difficulty of GN inference for all methods is scarcity of time-course data or the so-called dimensionality problem (D'haeseleer et al., 2000; Zak et al., 2003; van Someren et al., 2001). In other words, the number of genes far exceeds the number of time points for which data are available, making the problem of determining GN structure an ill-posed one. Current methods generally use a single set of time-course data under a specific experimental condition, and hence often fail in using experimental data to construct GN accurately. On the other hand, gene expression data generated by different groups worldwide are increasingly accumulated on many species and can be accessed from public databases or individual websites, although each experiment has only a limited number of time-points. For example, in the GEO database (http://www.ncbi.nlm.nih.gov/geo/), currently there are 241 microarray datasets for human alone. If such large amounts of data from different experiments are combined and further exploited in an integrative and systematic manner, the scarcity of data can be greatly alleviated and a more accurate reconstruction of GN can be expected. It is worth mentioning that simply arranging multiple time-course datasets into a single time-course dataset is inappropriate for GN inference owing to data normalization issues and lack of temporal relationships among these datasets. Hence, current GN inference methods typically cannot handle multiple sets of data.

In addition to the dimensionality problem of data, another problem for the conventional approaches is that the derived gene networks often have densely connected gene regulatory relationships among nodes, which are not biologically plausible. A biological gene network is expected to be sparse (Gardner and Faith, 2005; Yeung et al., 2002), which should also be reflected in the procedure of the network reconstruction.

This paper proposes a novel method to combine a wide variety of microarray datasets from different experiments (different environmental conditions or perturbations) for inferring GN with the consideration of sparsity of connections. GNR (Gene Network Reconstruction tool), based on LP (linear programming) and a decomposition procedure, is developed by exploiting the general solution form of arbitrary connectivity matrix for GN. The proposed method GNR theoretically ensures the derivation of the most consistent or invariant network structure with respect to all the used datasets, thereby not only significantly alleviating the problem of data scarcity but also remarkably improving the reliability. Specifically, inferring GN is formulated as an optimization problem with an objective function of forced matching and sparsity terms, so that a consistent and sparse structure that is also considered to be biologically plausible can be expected. An efficient algorithm has been developed to solve such a large-scale LP in an iterative manner. Both simulated examples and experimental data are used to demonstrate the effectiveness of GNR, which also leads to predictions of new gene regulation relationships for yeast and Arabidopsis. GNR is implemented in the Fortran programming language and the software is available from http://zhangorup.aporc.org/bioinfo/grninfer/, http://digbio.missouri.edu/grninfer/ and http://intelligent.eic.osaka-sandai.ac.jp or upon request from the authors.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Figure 1 illustrates the schematic of the proposed method. In this section, we first describe a GN as a set of differential equations, and then derive a special solution of the GN based on singular value decomposition (SVD) for a single dataset (time-course data). By constructing the general solution of the GN for each single dataset, we formulate the GN reconstruction problem as an optimization problem which is to find the most consistent network structure with respect to all the used datasets. The optimal solution can be viewed as a special solution for the multiple datasets with the minimal connections or edges. We show that such an optimization problem is equivalent to a linear programming, and an efficient algorithm is developed to solve such an LP based on the decomposition technique.


Figure 1
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Schematic of GNR.

 
2.1 Gene regulatory network
Generally, a genetic network can be expressed by a set of non-linear differential equations with each gene expression level as variables

Formula 1(1)
where x(t) = (x1(t), ... , xn(t))T isin Rn, and f = (f1, ... , fn)T : Rn ↦ Rn. xi(t) is the expression level (mRNA concentrations) of gene i at time instance t. Assume that there are a total of m time points for a given experimental condition from microarray, i.e. t1, ... , tm. fi is a C1 class non-linear function.

Although gene regulations are often non-linear, most of the existing approaches for GN inference use linear or additive models owing to unclear structures of biological systems and scarcity of data (D'Haeseleer et al., 1999; Gustafsson et al., 2005). From the viewpoint of dynamical systems, linear equations can at least capture the main features of the network or the function, in particular around a specific state of the system. The linear form of Equation (1) with appropriate normalization is

Formula 2(2)
where J = (Jij)nxn = {partial}f(x)/{partial}x is an n x n Jacobian matrix or connectivity matrix, and b = (b1, ... , bn)T isin Rn is a vector representing the external stimuli or environment conditions, which is set to zero when there is no external input.

2.2 General solution for a single dataset
To overcome the difficulty because of scarce data, many techniques, such as clustering of genes, SVD, interpolation of data (van Someren et al., 2001) have been developed. We first adopt the SVD technique to derive a particular solution and further the general solution of Equation (2), using a single time-course dataset. By rewriting Equation (2), we have

Formula 3(3)
where X = (x(t1), ... , x(tm)), B = (b(t1), ... , b(tm)) and Formula 3 are all n x m matrices with Formula 3 for i = 1, ... , n; j = 1, ... , m. By adopting SVD, i.e. Formula 3, where U is a unitary m x n matrix of left eigenvectors, E = diag(e1, ... , en) is a diagonal n x n matrix containing the n eigenvalues and VT is the transpose of a unitary n x n matrix of right eigenvectors. Without loss of generality, let all non-zero elements of ek be listed at the end, i.e. e1 = ··· = el = 0 and el+1, ... , en != 0. Then we can have a particular solution with the smallest L2 norm for the connectivity matrix Formula 3 as

Formula 4(4)
where E–1 = diag(1/ei) and 1/ei is set to be zero if ei = 0. Thus, the network family, or the general solution of the connectivity matrix J = (Jij)nxn is

Formula 5(5)
Y = (yij) is an n x n matrix, where yij is zero if ej != 0 and is otherwise an arbitrary scalar coefficient. Solutions of (5) represent all of the possible networks that are consistent with the single microarray dataset, depending on arbitrary Y. Notice that m + 1 points are required in (5) owing to the estimation of Formula 5.

2.3 Special solution with minimal connections for multiple datasets
Assume that there are multiple microarray datasets for one organism, each of which corresponds to its own general solution of (5). Each time-course dataset may be measured under various environments or stimuli by different labs. Specifically, there are N datasets, and we can infer N networks respectively as

Formula 6(6)
where the subscript k = 1, ... , N is the index of the dataset-k. Note that without normalization, Jk for each dataset is actually a normalized matrix even for different experiments with different time intervals due to the form of (4).

Next, we will find the most consistent network structure J = (Jij)nxn for all k = 1, ... , N of (6), with consideration of sparse structure, as illustrated in Figure 1. Mathematically, the problem is formulated as

Formula 7(7)
where Formula 7 is the function of Yk according to (6), and Y = (Y1, ... , YN). The variables are Y and J. The first term is the matching term which forces the matching of J and Jk, whereas the second term is the sparsity term which forces J sparse owing to L1 norm. {lambda} is a positive parameter, which balances the matching and sparsity terms in the objective function. The variables in (7) are Jij and all of non-zero Formula 7. {omega}k is a positive weight coefficient for the dataset-k with Formula 7. Since different datasets may have different qualities (e.g. different technologies, number of repeats in measurements, etc.), a weight coefficient is used to represent the reliability of each dataset. Assume that the number of the repeated experiments for the dataset-k is Nk by using the same type of microarray. Then {omega}k can be set as

Formula 8(8)
The optimization problem for (7) is a mathematical programming problem with positive combination of L1 norm of variables, which can be transformed into a linear programming problem through a well-known procedure and solved by a simple iterative procedure. Owing to L1 norm, generally the optimal solution of (7) has the property with the zeros for Formula 8 and |Jij| as many as possible, which exactly serves our purpose, i.e. consistent and sparse structure.

2.3.1 Decomposition and algorithm
Clearly when J is fixed, the original problem of (7) can be divided into N independent subproblems. We decompose (7) into the following form.

Formula 9(9)
Since (9) is a large-scale linear programming (LP) problem owing to a large number of variables, we adopt an iterative technique to solve (9). Specifically, first we fix J to solve N small-size matching subproblems Y, and then update J based on the results of Y for N subproblems. Such iteration continues until converged.

Therefore, we have the following algorithm for deriving gene network.

  • STEP-0: Initialization. Obtain all of the particular solution Formula 9 by SVD from (4), and {omega}k from (8). Set initial value Jij(0) = 0, Formula 9 and Formula 9, and positive values {lambda}, {varepsilon}. Let iteration index be q and set q = 1.
  • STEP-1: Set Formula 9 and solve Formula 9 at iteration q by LP for each subproblem from (9) with J (q – 1) fixed, i.e. solve Formula 9 of the following subproblem for k = 1, ... , N with J(q – 1) given

    Formula 10(10)
    Note that Formula 10 if j > lk according to (5).

  • STEP-2: Solving Jij(q) at iteration q by LP with all of Formula 10 given, i.e. solve J(q) of the following problem with all of Jk(q) fixed.

    Formula 11(11)
    The detail procedures of solving (10) and (11) are described in Supporting Material.

  • STEP-3: If J is converged, i.e. ||J(q) – J(q – 1)|| < {varepsilon}, then terminate the computation. Otherwise, go to STEP-1 by q -> q + 1.

Although the solution may depend on {lambda}, it is a single parameter which can be tuned in a relatively easy manner or be simply tested for a range of its value. A flowchart of the algorithm is illustrated in Supplementary Material. The non-linear network (e.g. with quadratic form) can also be derived with similar form of (7) in a self-consistent way.

2.3.2 Confidence evaluation
Let the optimal solution of (7) be J* and Y*k. Then, the variances vij and deviation {sigma}ij of each element Jij for J can be easily estimated by

Formula 12(12)

Formula 13(13)

By computing their average, we have

Formula 14(14)
In addition, the proposed approach can be further improved by combining with other methods, such as, the data expanding technique developed by van Someren et al. (2001).


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
In this section, we first report on several numerical tests that we have designed to benchmark GNR using multiple simulated datasets. Then we will describe the GN inference using yeast and Arabidopsis microarray gene expression data. As analysed in Methods, when a single time-course dataset is adopted, GNR is similar to the method of Yeung et al. (2002), which can recover the network connectivity from gene expression measurements in the presence of noise by SVD and regression. For a single time-course dataset, it is easy to show that the smallest number of time points needed is O(log n) to reconstruct the n x n connectivity matrix for an n-gene network (Yeung et al., 2002). When adopting multiple datasets, we can further infer the most consistent network structure with respect to all the datasets in a more accurate and robust manner.

3.1 Simulated data
The first example is a small simulated network with five genes governed by

Formula 14
where xi reflects the expression level of the gene-i and {xi}i(t) represents noise for i = 1, 2, 3, 4, 5. Clearly, the system is a negative gene regulation loop with genes 2, 3, 4, 5 repressing genes 1, 2, 3, 4 respectively, and with gene 1 in turn enhancing gene 5.

To test GNR, we randomly choose the initial condition of the system and take several points of x as a measured time-course dataset. With three different initial conditions, we obtained three different datasets with 4, 4 and 3 time points respectively, and applied GNR to reconstruct the connectivity matrix or the Jacobian matrix J. To measure the discrepancies between the true network and the inferred network with n genes, we adopt the simple criterion in Yeung et al. (2002) as E0 to assess the basic recovering ability:

Formula 15(15)
where eij takes 1 if Formula 15, otherwise 0. {delta} is a prescribed small value for error tolerance related to noise level of the system. Formula 15 and Formula 15 are interaction strength from gene-j to gene-i for the true and inferred networks, respectively.

Furthermore to depict the accuracy or correctness of GNR, we introduce the following two criteria E1 and E2 as

Formula 16(16)

Formula 17(17)
which are L1 norm and L2 norm errors respectively for all of interaction strengths.

The numerical results are depicted in Figures 2 and 3, which show the reconstructed networks without and with noises respectively. As indicated in Figure 2, clearly the more the datasets, the more accurate the inferred network. When using one dataset (Fig. 2b), it contains a wrong relation between x5 and x3. As two datasets are used, the topology of the network becomes correct (Fig. 2c). After using all three datasets, the predicted connectivity matrix, which represents the strengths among gene interactions, is very close to the true one (Fig. 2c). Such results imply that GNR is able to infer the solution of the highly under-determined problem in an accurate manner when a sufficient number of datasets (or experiments) are available even though each dataset has only a few time points and starts from different initial conditions. In GNR, we also introduce a scalar parameter {lambda} to control the sparsity of the inferred network (see Methods for details). When there are multiple solutions (which are typical) due to the under-determined nature, GNR prefers to infer a network with a sparse structure.


Figure 2
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 The simulated example with {lambda} = 0 and without noise. Arrows and arcs denote activation and repression, respectively. (a) The true network. (b) Using one dataset. (c) Using two datasets. (d) Using three datasets.

 


Figure 3
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 The simulated example with noise. Arrows and arcs denote activation and repression respectively. (a) Using one dataset with {lambda} = 0.0. (b) Using two datasets with {lambda} = 0.0. (c) Using three datasets with {lambda} = 0.0. (d) Using three datasets with {lambda} = 0.3.

 
Figure 3 shows the results when noises are added to the dynamics. As indicated in Tu et al. (2002), the distribution function of the noise in microarray is more like a Gaussian distribution. Therefore we set all of noises {xi}i(t), i = 1, 2, 3, 4, 5 obeying normal distribution in the simulated example. With gradual increase of noise level to N(0, 0.005), the network eventually cannot be correctly inferred even using all three datasets (Fig. 3c) due to the effect of noises. For such an under-determined case, GNR can reconstruct the network by an additional constraint of sparsity, i.e. introducing a positive parameter {lambda}, as shown in Figure 3d. With such a constraint, generally there is a better chance to construct a biologically plausible structure (Yeung et al., 2002) but at the expense of accuracy of interaction strengths. We have also tested for a nonlinear gene network by replacing all linear terms into quadratic terms. As demonstrated in Supplementary Material (in particular Supplementary Figure A2 and Table A1), comparing with the linear and noise cases, the link strengths of reconstructed networks have certain errors. Nevertheless, the topology of the network can be correctly inferred using all three datasets (Supplementary Figure A2c). Table 1 shows the accuracies of different error criteria, i.e. E0, E1 and E2 in the two cases without noise and with noise obeying N(0, 0.005) normal distribution, which indicate that adding datasets improves the accuracy of the network reconstruction, e.g. the more the datasets, the smaller are the E0, E1 and E2 values. This table also implies that a solution of GNR is a balance between the topology reconstruction (evaluated mainly by E0) and the accuracy of interaction strength (evaluated mainly by E1 or E2). The trade-off between E0 and E1 (or E2) can be controlled by the parameter {lambda}. The deviation Formula 17 in Table 1 is introduced to evaluate the confidence of the inferred network (see Methods for details). The tendency of Formula 17 also indicates that adding datasets improves the confidence of the network reconstruction.


View this table:
[in this window]
[in a new window]

 
Table 1 Accuracies for different error criteria and confidence evaluation

 
We also consider a large system to calibrate the proposed reverse engineering scheme. The results are listed in the Supplementary Material, which further confirm the effectiveness of GNR.

3.2 Application to experimental data
We applied GNR to experimental data. To ensure high quality of the data, we only used whole genome Affymetrix chips microarray experimental data, instead of any oligo or cDNA array data.

3.2.1 Heat-shock response data for yeast
We first test GNR using a small number of genes. We created an input dataset for 10 transcription factors related to heat-shock response in yeast Saccharomyces cerevisiae. Out of the 10 transcription factors 2 (Hsf1p and Skn7p) are known to be directly involved in heat shock response. Hsf1p and Skn7p each are known to regulate 4 other transcription factors among the 10. This information was obtained from YEASTRACT (http://www.yeastract.com/index.php). For the 10 transcription factors, we used 4 microarray datasets at the Stanford Microarray Database (http://smd.stanford.edu/) (y11, y14, y16:57–60, y16:109–112, with 7, 5, 5, 4 time points, respectively) for gene expression under heat shock conditions. We applied GNR to this dataset. As shown in Figure 4, the prediction succeeded in reconstructing four edges of the network with documented known regulation and 1 edge with documented potential regulation.


Figure 4
View larger version (38K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Regulatory network reconstruction for set of 10 transcription factors for heat shock response microarray data in yeast. Activation is shown in red and repression in blue arrows. The confirmed edges are shown in bold arrows, while the potential edge is shown in yellow-red arrow indicating activation.

 
3.2.2 Cell cycle data for yeast
We tested GNR using the experimental data for cell cycle studies in Saccharomyces cerevisiae obtained from the Stanford Microarray Database (http://smd.stanford.edu/). We generated four datasets with different conditions. Supplementary Table A5 lists the experimental conditions and time points used for analysis. Among all the yeast genes, 140 of them have change of 2-fold up or down in at least 20% of the expression level across all datasets.

Application of GNR to the 140 differentially expressed genes of the four datasets generated consistent subnetworks with 64 links, 431 links, etc. depending on the scalar parameter used to control the sparsity or consistency of the subnetwork. Figure 5 shows a representation of the 64-link GN model. Figure 5a shows YGP1 in the center which is a cell wall-related secretory glycoprotein and induced by nutrient deprivation-associated growth arrest and upon entry into the stationary phase (Destruelle et al., 1994). In the predicted model, YGP1 activates three genes, i.e. DSE2, PIR3 and FET3. Both DSE2 and PIR3 relate to cell wall organization and biogenesis (Doolin et al., 2001; Mrsa and Tanner, 1999), whose activations may follow YGP1 at the entry of the stationary phase. Among the genes that YGP1 suppresses in the model, it is known that HLR1 suppresses the cell wall phenotypes (Alonso-Monge et al., 2001). Suppressing HLR1 by YGP1 is equivalent to enhance cell wall development, which is consistent to the activation of DSE2 and PIR3. TFA2 is TFIIE small subunit, involved in RNA polymerase II transcription initiation (Kornberg, 1998). In addition to the genes in Figure 5, other genes in the network show negative self regulation (data not shown).


Figure 5
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Partial representation (with two connected sub-networks) of the 64-link inferred network in yeast based on cell cycle microarray experimental datasets. The isolated genes without interaction with others are not shown. The red arrows in the figure indicate repression while the blue arrows indicate activation. The circles in the same color indicate the same biological function.

 
3.2.3 Stress response data for Arabidopsis
We also applied our method in studying stress response in Arabidopsis thaliana. We used whole genome Affymetrix chips microarray experimental data for Arabidopsis thaliana from the ATGenExpress database at The Arabidopsis Information Resources (TAIR) (http://www.arabidopsis.org/). We applied nine datasets related to the stress responses, each with six or more time points and each for the root and shoot experiments. Supplementary Table A6 lists the experimental details and the time points used. We used the log ratios of the expression values for a treatment condition against the mock condition. We narrowed down the list of genes to 226 genes for the root experiments and 246 genes for the shoot experiments based on two-fold change (either up or down) in at least 70% of the ratios of a gene. This list represents the most statistically significant genes differentially expressed under stress in root and shoot based on all experiments.

GNR was applied to the 226 genes with the above nine datasets. Using different thresholds, we can predict various networks with different edge density, which are consistent with respect to all datasets. Figure 6 represents a 35-link sub-network in shoots. The network shows that the genes AT1G56600 and ATERF6 (AT4G17490) control the neighboring genes by either activating or suppressing them. We found that our network relates to some knowledge while predicting novel regulations. ATERF6 is a member of the ERF (ethylene response factor) subfamily B-3 of ERF/AP2 transcription factor family (Fujimoto et al., 2000). It is predicted to activate three genes encoding known or putative transcription factors, i.e. AT2G12940, AT3G49760 and AT2G40750. AT2G12940 is similar to transcription factor VSF-1; AT3G49760 is a Bzip transcription factor; AT2G40750 is a member of WRKY transcription factor family. Other genes have functions related to stress response. AT1G52560 is a heat shock protein. AT1G36030 encodes a member of the F-box family, whose members involved in regulating diverse cellular processes including cell cycle transition, transcriptional regulation and signal transduction.


Figure 6
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Inferred network in A.thaliana based on 35 links generated from stress response datasets in shoots. The red arrows represent activation while the blue arrows represent suppression. Putative transcription factors are shown in purple. F-box family proteins are shown in yellow.

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Microarray gene expression data has increasingly become the common data source that can provide insights into biological processes at a system-wide level. As indicated in Soinov (2003), although a large amounts of data are increasingly accumulated, one of the major problems with microarrays is that data often come from different platforms, laboratories, etc. It is often difficult to compare or combine results of experiments done by different research groups for biological inference. In contrast to the conventional methods which require more time points in a single dataset to infer more accurate network owing to the dimensionality problem, the main contribution of this paper is that we developed a methodology to reconstruct GN using multiple datasets from different sources without normalization among the datasets. In other words, we provide a general framework to handle the microarray data by fully exploiting all available microarray data for a given species, so as to alleviate the problem of dimensionality or data scarcity. As a byproduct of the new method, it provides a new way to compare hypotheses generated from different datasets, and also a new way to derive a common substructure not from network alignment but from the raw microarray datasets. In particular, it is very effective to find an invariant structure when multiple datasets with different conditions or perturbations are used.

We have tested our approach on both simulated problems and experimental biological data, which verified the efficiency and effectiveness of the algorithm. Depending on the trade-off parameter {lambda}, we can derive either a global structure with dense connection for a small {lambda} or local substructure with sparse connection for a large {lambda}. Furthermore, the role of parameter {lambda} in the inference algorithm is discussed and tested by comparing the inferred network structures for different {lambda}s. Also we discuss how to specify the proper value and the searching strategy in the parameter space of {lambda} (see the details in Supplementary Material).

There is an important assumption for the proposed method in this paper, i.e. the structure of the regulatory network is stationary, and does not ‘rewire’ under the environmental conditions for those different datasets. This means that the change of environmental conditions alters the level of gene expression instead of the network structure. Another assumption is that high resolution time-course microarray datasets are required so as to accurately infer the network structure because a genetic network is expressed by a set of differential equations with each gene expression level as a variable shown in Equation (1). Here high resolution data mean high quality time-course microarray data which are expected to capture the dynamic behavior of the gene regulatory network.

The linear differential equation model in this paper is used to identify gene regulation between RNA transcripts (Gardner and Faith, 2005). An advantage of such a strategy is that the model can implicitly capture regulatory mechanisms at the protein and metabolite levels that are not physically measured. That is, it is not restricted to describe only transcription factor/DNA interactions. By construction, the inferred model may accurately reflect a physical interaction if the regulator transcripts encode the transcription factors that directly regulate transcription. On the other hand, the implicit description of hidden regulatory factors by this approach may lead to prediction errors. Generally, the mRNA levels measured in a microarray experiment are the results of a variety of complex events including gene transcription and mRNA degradation (Gardner and Faith, 2005). With such events, dynamic Bayesian networks can be used to derive the regulations among biomolecules (Nachman et al., 2004; Rangel et al., 2004; Beal et al., 2005; Li and Zhan, 2006).

To examine causal relation among genes, a major source of errors comes from the noises of the gene expression data intrinsic to microarray technologies (Thattai and van Oudenaarden, 2001). To reduce the defect of unreliable data, GNR is able to assess the quality of microarray datasets by comparing their inferred results, and remove the inconsistent dataset using a clustering method according to their degree of inconsistency. As a result, GNR can alleviate the impact of noises to improve the prediction accuracy. In addition, GNR is also effective for reducing the effects of stochastic fluctuations by introducing the sparsity leverage {lambda} and combining the several datasets together (see the details in the Supplementary Material). Depending on the prior information of the data (e.g. reliability of experiments or number of experiments), we can also allocate different weights for the datasets to maximally utilize the information of reliable gene expression data.

Our method also has some limitations owing to the nature of microarray gene expression data, like other existing methods for GN inference. In particular, although GNR can provide a relationship in which the expression of one gene can lead to an increased (or decreased) expression of another gene, such a relationship does not show the exact mechanism. A predicted regulatory relationship does not always mean genetic regulation by a transcriptional factor. Some regulation can be at the post-transcriptional or post-translational level, which are often not reflected in mRNA expression levels detected by microarrays. Therefore, there is a need for integration with other information sources to derive regulatory networks in an accurate manner. In other cases, the transcriptional factors for direct regulation are not selected for GN construction due to their low expression levels or statistically insignificant changes. Hence, the GN models that we predicted include both direct and indirect regulations (i.e. via hidden variables). Typically one can interpret an edge in a GN model as the net effect if the gene from the source is deleted. For example, if an arrow pointing from gene A to gene B for activation, it is expected that deleting gene A will lead to an increased expression of gene B. Notice that the inferred results by GNR are only valid on the assumption that the dynamics of the system can be captured by the time intervals between the data points. Nevertheless, our predicted regulatory network is testable through a comparison in microarray data between wild type and mutant with specific deletion.

Currently, GNR is aimed to infer the consistent structure from a variety of datasets but for the same species or organism. With the similar mechanism, GNR can be extended to identify the conserved network patterns or motifs (Kelly et al., 2003) from the datasets of either the same species or different species, by adjusting the parameter {lambda}, i.e. a higher {lambda} results in a more consistent or conserved network.


    Acknowledgments
 
The authors are grateful to the associate editor and anonymous referees for comments and helping to improve the earlier version. This work is partly supported by Grant sponsor: project Bioinformatics, Bureau of Basic Science, CAS. The effort of T.J. and D.X. was supported by a USDA grant CSREES 2004–25,604-14,708.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Jonathan Wren

Received on March 5, 2006; revised on June 25, 2006; accepted on July 17, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Akutsu, T., et al. (2000) Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16, 727–734[Abstract/Free Full Text].

    Alonso-Monge, R., et al. (2001) Hyperosmotic stress response and regulation of cell wall integrity in saccharomyces cerevisiae share common functional aspects. Mol. Microbiol, . 41, 717–730[CrossRef][Web of Science][Medline].

    Basso, K., et al. (2005) Reverse engineering of regulatory networks in human B cells. Nat. Genet, . 37, 382–390[CrossRef][Web of Science][Medline].

    Beal, M.J., et al. (2005) A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21, 349–356[Abstract/Free Full Text].

    Chen, L. and Aihara, K. (2001) Stability and bifurcation analysis of differential-difference-algebraic equations. IEEE Trans. Circuits Syst. I, 48, 308–326[CrossRef].

    Chen, L. and Aihara, K. (2002) Stability of genetic regulatory networks with time delay. IEEE Trans. Circuits Syst. I, 49, 602–608[CrossRef].

    Destruelle, M., et al. (1994) Identification and characterization of a novel yeast gene: the ygp1 gene product is a highly glycosylated secreted protein that is synthesized in response to nutrient limitation. Mol. Cell Biol, . 14, 2740–2754[Abstract/Free Full Text].

    Dewey, T.G. and Galas, D.J. (2001) Dynamic models of gene expression and classification. Funct. Integr. Genomics, 1, 269–278[CrossRef][Medline].

    D'haeseleer, P., et al. (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16, 707–726[Abstract/Free Full Text].

    D'Haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R. (1999) Linear modeling of mRNA expression levels during cns development. In Altman, R.B., Dunker, A.K., Hunter, L., Klein, T.E., Lauderdaule, K. (Eds.). Pacific Symposium on Biocomputing Vol. 4, , pp. 41–52.

    Doolin, M.T., et al. (2001) Overlapping and distinct roles of the duplicated yeast transcription factors ace2p and swi5p. Mol. Microbiol, . 40, 422–432[CrossRef][Web of Science][Medline].

    Fujimoto, S.Y., et al. (2000) Arabidopsis ethylene-responsive element binding factors act as transcriptional activators or repressors of gcc box-mediated gene expression. Plant Cell, 12, 393–404[Abstract/Free Full Text].

    Gardner, T.S. and Faith, J. (2005) Reverse-engineering transcription control networks. Phys. Life Rev, . 2, 65–88.

    Gustafsson, M., et al. (2005) Constructing and analyzing a large-scale gene-to-gene regulatory network-lasso-constrained inference and biological validation. IEEE/ACM Trans. Comput. Biol. Bioinform, . 2, 254–261[CrossRef].

    H, D.J. (2002) Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol, . 9, 67–103[CrossRef][Web of Science][Medline].

    Hartemink, A.J. (2005) Reverse engineering gene regulatory networks. Nat. Biotechnol, . 23, 554–555[CrossRef][Web of Science][Medline].

    Holter, N.S., et al. (2001) Dynamic modeling of gene expression data. Proc. Natl Acad. Sci. USA, 98, 1693–1698[Abstract/Free Full Text].

    Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126[CrossRef][Web of Science][Medline].

    Husmeier, D. (2003) Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19, 2271–2282[Abstract/Free Full Text].

    Kelley, B.P., et al. (2003) Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA, 100, 11394–11399[Abstract/Free Full Text].

    Kornberg, R.D. (1998) Mechanism and regulation of yeast rna polymerase ii transcription. Cold Spring Harb. Symp. Quant. Biol, . 63, 229–232[CrossRef][Web of Science][Medline].

    Levine, M. and Davidson, E.H. (2005) Gene regulatory networks for development. Proc. Natl Acad. Sci. USA, 102, 4936–4942[Abstract/Free Full Text].

    Li, H. and Zhan, M. (2006) Systematic intervention of transcription for identifying network response to disease and cellular phenotypes. Bioinformatics, 22, 96–102[Abstract/Free Full Text].

    Mrsa, V. and Tanner, W. (1999) Role of naoh-extractable cell wall proteins ccw5p, ccw6p, ccw7p and ccw8p (members of the pir protein family) in stability of the Saccharomyces cerevisiae cell wall. Yeast, 15, 813–820[CrossRef][Web of Science][Medline].

    Nachman, I., et al. (2004) Inferring quantitative models of regulatory networks from expression data. Bioinformatics, 20, Suppl. 1, i248–i256[Abstract].

    Rangel, C., et al. (2004) Modeling T-cell activation using gene expression profiling and state-space models. Bioinformatics, 20, 1361–1372[Abstract/Free Full Text].

    Soinov, L. (2003) Supervised classification for gene network reconstruction. Biochem. Soc. Trans, . 31, 1497–1502[Web of Science][Medline].

    Tegner, J., et al. (2003) Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc. Natl Acad. Sci. USA, 100, 5944–5949[Abstract/Free Full Text].

    Thattai, M. and van Oudenaarden, A. (2001) Intrinsic noise in gene regulatory networks. Proc. Natl Acad. Sci. USA, 98, 8614–8619[Abstract/Free Full Text].

    Tu, Y., et al. (2002) Statistics quantitative noise analysis for gene expression microarray experiments. Proc. Natl Acad. Sci. USA, 99, 14031–14036[Abstract/Free Full Text].

    van Someren, E.P., et al. (2001) Robust genetic network modeling by adding noisy data. Proceedings of 2001 IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing.

    Yeung, M.K.S., et al. (2002) Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl Acad. Sci. USA, 99, 6163–6168[Abstract/Free Full Text].

    Zak, D.E., et al. (2003) Importance of input perturbations and stochastic gene expression in the reverse engineering of genetic regulatory networks: insights from an identifiability analysis of an in silico network. Genome Res, . 13, 2396–2405[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
Y. Wang, X.-S. Zhang, and Y. Xia
Predicting eukaryotic transcriptional cooperativity by Bayesian network integration of genome-wide data
Nucleic Acids Res., October 1, 2009; 37(18): 5943 - 5958.
[Abstract] [Full Text] [PDF]


Home page
ReproductionHome page
S L Rodriguez-Zas, Y Ko, H A Adams, and B R Southey
Advancing the understanding of the embryo transcriptome co-regulation using meta-, functional, and gene network analysis tools
Reproduction, February 1, 2008; 135(2): 213 - 224.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
R.-S. Wang, Y. Wang, X.-S. Zhang, and L. Chen
Inferring transcriptional regulatory networks from high-throughput data
Bioinformatics, November 15, 2007; 23(22): 3056 - 3064.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Z. Li, S. Zhang, Y. Wang, X.-S. Zhang, and L. Chen
Alignment of molecular networks by integer quadratic programming
Bioinformatics, July 1, 2007; 23(13): 1631 - 1639.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/19/2413    most recent
btl396v2
btl396v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (27)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wang, Y.
Right arrow Articles by Chen, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wang, Y.
Right arrow Articles by Chen, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?