Bioinformatics Advance Access originally published online on December 6, 2005
Bioinformatics 2006 22(4):477-484; doi:10.1093/bioinformatics/bti816
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Least absolute regression network analysis of the murine osteoblast differentiation network

1Department of Mediametics, Delft University of Technology 2600 GA Delft, The Netherlands
2Department of Applied Biology, University of Nijmegen Nijmegen, The Netherlands
3N.V.Organon, Target Discovery Unit Oss, The Netherlands
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: We propose a reverse engineering scheme to discover genetic regulation from genome-wide transcription data that monitors the dynamic transcriptional response after a change in cellular environment. The interaction network is estimated by solving a linear model using simultaneous shrinking of the least absolute weights and the prediction error.
Results: The proposed scheme has been applied to the murine C2C12 cell-line stimulated to undergo osteoblast differentiation. Results show that our method discovers genetic interactions that display significant enrichment of co-citation in literature. More detailed study showed that the inferred network exhibits properties and hypotheses that are consistent with current biological knowledge.
Availability: Software is freely available for academic use as a Matlab package called GENLAB: http://genlab.tudelft.nl/genlab.html
Contact: E.P.vanSomeren{at}tudelft.nl
Supplementary information: Additional data, results and figures can be found at http://genlab.tudelft.nl/larna.html
| 1 INTRODUCTION |
|---|
|
|
|---|
The development of cells is largely determined by the past and current states of transcription of a relevant set of genes. The transcription levels themselves are the result of a complex network of influences exerted by the genes on each other through gene products and signaling pathways. In order to understand any developmental process, it is imperative to unravel the underlying genetic interaction network. This understanding might lead to the development of novel (therapeutic) intervention strategies. Although techniques like genome-wide location analysis (Lee et al., 2002) and promoter analysis (Tavazoie et al., 1999) may reveal some of the direct transcriptional interactions, the outcome of developmental events is still difficult to explain from sequence information and location data alone.
Alternatively, microarray measurements sampled over time present a global view on transcriptional changes during development and allow to elucidate the regulatory network in a complementary manner, i.e. from transcript data to regulatory influences. Current successful methodologies to infer genetic interactions from microarray data, however, have primarily been restricted to the use of perturbation (e.g. knockout) microarray data (Gardner et al., 2003; Segal et al., 2003; Rung et al., 2002; Pe'er et al., 2001; Hartemink et al., 2001) or learn interactions between representatives of gene clusters (Guthke et al., 2005; Wahde and Hertz, 2001).
Here, we present the Least Absolute Regression Network Algorithm (LARNA), an efficient and robust method to infer a sparse genetic regulatory network structure from microarray data sampled over time. In our framework, a generative linear model is learned to predict the expression values of a target gene, based on a linear combination of the past expression values of an automatically selected set of genes, denoted as its regulators. The main advantage of our method is the automatic determination of the network structure from the data. We applied our method on a time-course microarray dataset of the mouse C2C12 cell-line with and without stimulation to undergo osteoblast differentiation. Results show that LARNA discovered genetic interactions that are significantly more co-cited in literature than can be expected by chance. To a large extent LARNA shows higher enrichment of co-citation than hierarchical clustering. This exemplifies the fact that our network inference methodology presents a complementary approach to the more popular clustering methods, by aiming at discovering regulatory behavior instead of functional units.
The final network that was discovered presents a global view on the genes and influences involved in driving mesenchymal stem cells into osteoblastic or myoblastic differentiation. The network also poses hypotheses consistent with biological knowledge of which at least two were confirmed by literature study.
| 2 Materials and methods |
|---|
|
|
|---|
The complete preprocessing procedure consisted of (1) growing the cell cultures, harvesting samples and RNA isolation, (2) hybridization and read-out of microarrays, (3) removing array outliers, (4) array normalization and (5) gene selection and interpolation. This resulted in a single dataset, on which different clustering and regression methods were applied. Each method results in a different network. All network structures were tested for co-citation enrichment by comparing them with a co-citation network obtained from PubMed. A single regression network was chosen based on co-citation enrichment and its structure was visualized (using clustering results).
2.1 Cell culture and RNA isolation
C2C12 murine mesenchymal progenitor cells (obtained from the American Type Culture Collection) were maintained in Dulbecco's modified Eagle's medium (DMEM) supplemented with 10% newborn calf serum (HyClone®), penicillin (100 U/ml) and streptomycin (100 µg/ml) at 37°C in a humidified atmosphere containing 7.5% CO2. For differentiation experiments, cells were seeded at a density of 2.0 x 104 cells/cm2 and grown for 24 h. From this culture, cells were harvested in triplicate and this time point is denoted as time zero. Subsequently, the medium was replaced by DMEM containing 5% NCS. One subculture received no treatment and another subculture was treated with 500 ng/ml BMP2 (R & D Systems) and 12.5 mM ß-glycerophosphate (Sigma) to induce osteoblast differentiation. Cells were subsequently harvested from both subcultures at 1, 2, 6 and 10 days after time zero. As expected, untreated cells developed into myoblasts, whereas the treated cells differentiated into osteoblasts which was confirmed by monitoring cell morphology and marker gene expression as described previously (Vaes et al., 2002; de Jong et al., 2004a).
Total RNA was extracted according to an acid guanidium thiocyanatephenolchloroform method (TriPure® Isolation Reagent, Boehringer Mannheim) and RNA concentrations were determined by measuring the absorbance at 260 nm. Poly(A)+ RNA was prepared from total RNA using the OligotexTM Method (Qiagen). Quantification of poly(A)+ RNA was carried out according to the RibogreenTM RNA quantitation assay (Molecular Probes).
2.2 Array hybridization
A microarray dataset was generated from the aforementioned 11 samples. We employed an experimental design based on a common reference sample as described previously (Churchill, 2002). To this end, Cy5 labeled cDNA was synthesized from mRNA purified from the cell cultures and hybridized to Unigene 1 arrays (Incyte Genomics, Mountain View, CA) as described previously (Yue et al., 2001). Cy3 labeled cDNA derived from mouse brain was used as a control in these hybridizations.
2.3 Data preprocessing
Taking the proper prepreprocessing steps is important as it may heavily influence the outcome. In the supplementary materials, the chosen preprocessing steps and their motivations are described in more detail. Here, we briefly describe the preprocessing steps that were taken.
After read-out, array outlier detection was performed on the arrays. One array (24 h after treatment) was found to be aberrant and after manually being confirmed to be contaminated, it was removed from further analysis.
The common reference approach allowed us to employ the Cy3 intensities as a gene-specific way to compensate for array differences. Gene inductions in time were expressed as a balanced ratio between biological and control intensities after both were scaled to the initial untreated intensities. This results in the normalized expression value of gene g on array a:
![]() | (1) |
denotes the spot intensity in the Cy5 channel of the g-th gene on the a-th array, whereas
denotes the Cy5 intensity of the g-th gene, averaged over the three untreated arrays at time zero. A similar notation is used for Cy3. Using Equation (1), a normalized gene profile is constant zero if the spot intensities measured in the biological sample show the same deviation from its initial intensity level as its corresponding spot in the control sample.
After normalization, a set of significantly expressed genes was selected based on whether or not they showed an absolute gene expression value in any of the remaining arrays that exceeds a preset threshold
= 3.052. This threshold was determined by taking the maximal observed absolute gene expression values of a list of known housekeeping genes.
Having a non-uniform sampling, the data are augmented with a series of synthetic microarray results by using linear interpolation between subsequent arrays. Using a one-day interval,
t = 24 h, the final dataset resulted in 21 measurements, i.e. two series of 10 measurements at days 1, 2, ... , 10 and one measurement at zero. The choice of
t had minor effect on the overall performance (see Supplementary materials).
2.4 Regression networks
We inferred a genetic network by learning a mathematical model to predict future gene expression values based on past gene expression values. One of the simplest models is a linear genetic network model (D'Haeseleer et al., 1999; Gardner et al., 2003). The linear model assumes that the gene expression level of each gene is the result of a weighted sum of all other gene expression levels at the previous time point, or in vector notation:
![]() | (2) |
models the error and Q is the number of microarray transition pairs,
xq, yq
, that are available for learning and are indexed by q. If xq represents a measurement at time t, e.g. xq = x(t), then yq = x(t +
t). The interaction parameter, wij
W, represents the existence (wij
0) or absence (wij = 0) of a controlling action of regulator gene j on target gene i, whether it is activating (wij > 0) or inhibiting (wij < 0), as well as the strength (|wij|) of the relation. Note that all genes are target genes as well as possible regulator genes, i.e. i, j
1 , ... , N. Furthermore, a bias and a treatment node were included in W by adding two artificial inputs to xq that are treated as if they were normal genes, i.e. a bias gene with expression value equal to one for all arrays and a treatment gene with zero expression in untreated and one in treated arrays.
A standard approach to obtain an estimate of the complete set of model parameters,
, from data, is to minimize the squared error between the predicted
and measured yq gene expression levels, i.e.
![]() | (3) |
Restricting the number of regulators follows the biological intuition that only a limited number of genes will directly influence a gene's transcription during the process under consideration (Guelzim et al., 2002). Even if this assumption is violated in some cases, it makes sense to apply Occam's razor and infer only the most prominent influences when insufficient data are available (Hashimoto et al., 2004). Also, to function stably in a stochastic environment, the global regulatory system needs to react robustly against small perturbations in the regulatory influences and signalling pathways that feed it. Ensuring that the inferred models are robust against noise will therefore regularize the inference process and direct it towards more plausible models.
We obtained a combination of data-fit, robustness and limited connectivity by augmenting the standard squared error with a penalty term that sums the absolute values of the weights (van Someren et al., 2003):
![]() | (4) |
provides a trade-off between data-fit versus robustness and limited connectivity. Note that robustness implies that small changes in the input result in small changes in the output, i.e.
yi/
xj = wij are small. A solution to this equation is, generally, found using only a few iterations of the EM-algorithm (Grandvalet, 1998). In statistical learning, this method is called Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1994), because it tends to shrink the weights (robustness) such that only a few weights remain non-zero (limited connectivity). As a result, LARNA automatically selects from among all genes, that set of genes (called regulators) that, in combination, provides the best prediction of that gene's future gene expression levels. The use of LASSO to infer genetic interactions was first introduced by van Someren et al. (2002) and shown to outperform other algorithms on synthetic data. To test how well our method compares with other approaches, we have also implemented three linear models [i.e. sparse QR (SQR), backward search (B) and forward beamsearch (F)] three non-linear approaches [i.e. Discrete Dynamic Bayesian Network (DDBN), Sigmoidal Gradient Ascent (SGA) and Sigmoidal Least Absolute Penalization (SLAP)] and two time-lagged correlation based methods [i.e. Correlation Metric Construction (CMC) and Significant Time-Lagged Correlation (STLC)]. SQR is a standard linear algebra method based on Gaussian Elimination and QR decomposition that returns a (zero-error) matrix that consists completely of zeros except for R columns which contain non-zero values (R equals the rank of the system). B is a method that finds sparse network solutions based on a greedy backward search algorithm that for each target gene keeps removing the regulator with the lowest weight until a given minimal number of regulators, Kmax, is reached or the error becomes non-zero (Weaver et al., 1999). F is a method that finds sparse network solutions based on a forward beam-search, with beam-size K = 5 (van Someren et al., 2001). For each target gene, this method starts by checking all single-regulator networks. Then it checks all two-regulator networks that can be constructed by adding a regulator to each of the best K single-regulator networks. This recursively repeats with more regulators until the prediction error is below a desired level Emax. The DDBN uses greedy hill-climbing with 100 restarts to find the best structure (with maximal connectivity Kmax) of a dynamic Bayesian network (van Berlo et al., 2003). Before learning, data were discretized into three gene expression levels using K-means clustering [as in (Pe'er et al., 2001)]. SGA uses 100 restarts of gradient ascent to learn a linear model augmented with a sigmoidal dose-response curve [similar to (Weaver et al., 1999; Wahde and Hertz, 2001)] and recursively sets the lowest weights to zero until Zmax non-zeros remain. SLAP is similar to SGA with the addition of the sum of absolute weights to the cost function. STLC first computes a matrix of the most significant correlation found between each pair of genes over all possible time-lags and returns different networks by thresholding the correlations in this matrix based on significance. CMC is an implementation of Arkin et al. (1997) which constructs a directed network from a single-linked tree that connects genes that have similar correlation profiles. Such a profile consists of the correlations to all other genes, individually maximized for all possible time-lags. All regression and cluster methods were applied to the same outlier-removed normalized interpolated data. More detailed descriptions of all methods can be found in the Supplementary materials.
2.5 Co-citation network and enrichment
Jenssen et al. (2001) has shown that a co-citation network reflects biologically meaningful relationships. One way to validate the network results is, therefore, to compare them against a co-citation network. To this end, a list of gene abbreviations was constructed by extracting symbol name and synonyms from the gene detail pages of MGI (Blake et al., 2003). For each gene abbreviation, PubMed was searched for publications that mention this term and a list of identifiers (PMID) referring to these publications was extracted. A pair is said to be co-cited when there exists at least one paper that mentions both
![]() | (5) |
For each network result, the number of true positives was determined as the number of gene pairs for which the inferred network predicts an interaction and that are co-cited
![]() | (6) |
![]() | (7) |
| 3 RESULTS |
|---|
|
|
|---|
3.1 Clustering and co-expression networks
C2C12 cells are myoblasts that form myotubes when cultured under low serum conditions. Upon treatment with BMP2, the cells differentiate to an osteoblast phenotype (Vaes et al., 2002). We studied gene expression at several time points during myoblast and osteoblast differentiation. Figure 1 shows an overview of the microarray data. A total of 101 genes (out of the 9596) showed significant changes in expression with respect to zero under our experimental conditions. To give a meaningful description of the data, these genes were grouped by hierarchical clustering into three main clusters. Cluster 1 consists of genes that are downregulated during either osteoblast or myoblast differentiation. This cluster contains many genes related to control of proliferation (e.g. Plf, Mcm5, Hmga1). See Supplementary materials for a complete list of gene symbols, gene names and accession numbers of all genes mentioned in this paper. The second cluster visible in Figure 1 consists of genes induced during osteoblast differentiation and contains typical osteoblast markers such as Akp2, Bglap1 and Col1a1. Cluster 3 consists of genes that are induced during myoblast differentiation but do not change in expression in BMP2-treated cells. Many well-known muscle-related genes (e.g. Myh1, Tncc, Myom1, Casq1, Ckm) reside in this cluster.
|
Clustering is known to place functionally related genes into the same group. The results of the hierarchical clustering could, therefore, be interpreted as if it represents a network of fully connected modules (one module for each cluster) without any connections between modules. Such a co-expression network is illustrated in Figure 2 for a small subset of genes (chosen arbitrarily just for visualization purposes).
|
3.2 Enrichment for co-citation in inferred networks
If the inferred networks express relationships that are biologically meaningful, it is expected that these networks also show enrichment for co-citation compared with random networks. Therefore, we constructed a co-citation matrix for the 101 selected genes. This matrix is represented in Figure 3. In this matrix, genes are ordered and grouped according to the clustering described in Figure 1. We expect that co-expression networks obtained from clustering will tend to show a strong enrichment for co-citation and that the co-citation enrichment scores of clustering may serve well as reference performance. Figure 3 already indicates that large clusters will not show a particulary strong enrichment for co-citation (only the cluster of osteoblast markers is slightly enriched). Instead, the occurrence of several small white triangles close to the diagonal suggests that enrichment is stronger for small clusters. Alternatively, we found that genes in close hierarchical proximity share similar row and column patterns, indicating that co-expressed genes tend to share the same pattern of co-citation with other genes (Chaussabel and Sher, 2002).
|
A systematic analysis of co-citation enrichment for different distance measures, linkage types and cluster sizes using hierarchical clustering shows that most co-expression networks resulting from small cluster sizes (i.e. a large number of clusters) are significantly enriched for co-citation. A cosine distance measure and single linkage combination was found to score best (See Supplementary materials). Therefore, we have used this best-scoring combination as a method of reference to compare co-citation enrichment of co-expression networks with regression networks.
Networks derived from clustering of genes show full connectivity within clusters (Fig. 2) whereas true genetic networks are likely to be sparsely connected. Furthermore, the clustering algorithms do not reveal the causal relationships within the genetic network. Therefore, we devised a method for LARNA that can unravel genetic network structure from microarray data sampled in time. LARNA distinguishes itself from other regression network models as it provides a unique trade-off between data-fit versus robustness and limited connectivity.
Repeated application of all methods on synthetic data generated using a linear model with random structure shows that HC, F, STLC and LARNA perform on average significantly better than random, whereas SQR, B and CMC do not or only marginally (See Supplementary materials).
Because synthetic data may not reflect the actual characteristics of the data, Figure 4 shows the performance of LARNA applied to the C2C12 differentiation data and compared with other network models. The specificity of the networks with random connections (Random) indicates that the prior probability of randomly selecting a pair of co-cited genes is
6% on average with roughly 2% SD. The results show that standard linear approaches that search for limited connectivity (B, F and SQR) perform significantly better than random if the inferred networks are sufficiently connected (SE > 0.04). Moreover, in the current setting, recursively adding possible regulators (F) seems a more fruitful search strategy than recursive elimination (B). The probabilistic and non-linear model (DDBN) is able to outperform randomly connected network structures, but is very sensitive to its parameter. The need to discretize the data prior to learning in combination with the severely limited number of samples may have contributed to DDBN's bad performance. The non-linear SGA does outperform the randomly connected structures, but not its linear equivalent (B). More strikingly, SLAP is not able to return networks that are significantly better than random. Finally, STLC outperforms HC at larger sensitivity ranges and performs similarly when returning more sparsely connected networks. In contrast, CMC (located at 1.6% sensitivity) which is also based on time-lagged correlations, but which was not originally intended for genetic network reconstruction, performs only marginally better than random. All of these methods aim to infer networks of limited connectivity, but they do not provide measures against noise on the data. LARNA is able to outperform these methods by incorporating both criteria simultaneously.
|
3.3 LARNA shows highest enrichment in co-citation
The best overall performance of all network inference methods was obtained by LARNA, which outperformed the best clustering method particularly at intermediate levels of sensitivity and to a lesser extent at high levels of sensitivity. The
parameter of LARNA determines the trade-off between fitting the data versus robustness and limited connectivity. When data-fit is less emphasized, the inferred networks become less sensitive to noise and more sparsely connected, resulting in higher specificity (sometimes at the cost of sensitivity). This pronounced effect is exemplified by LARNA's performance in Figure 4 around 6.5% sensitivity. While data-fit is relaxed, the sensitivity remains
6.5%, but the specificity increases from 15% to almost 32%. Here, LARNA removes unnecessary regulators from the network thereby removing a significant number of false positives.
3.4 Analysis of a single inferred network
To further illustrate the consistency of the networks inferred by LARNA with current biological expertise, a single network of interest was selected to be analyzed in more detail. Because, the best setting of
cannot be reliably determined from the already limited amount of data (e.g. by cross-validation), we propose to utilize the literature enrichment scores to select a single network solution. We have selected the LARNA solution that scored best in specificity (32%), while giving sufficiently high sensitivity (6%).
The selected network depicted in Figure 5 is indeed quite sparse, with genes having mostly one or two, but maximally four, regulators. Note that the sparseness constraint effects the number of regulators only, the number of targeted genes can still vary from none (e.g. Ogn) to many (e.g. Dcn). Obviously, the position of the genes within the network is also of importance, i.e. we expect genes with regulating roles (e.g. Tnnc2) to be network initiators (i.e. influencing many but being influenced by few) and known (osteoblast) differentiation markers to be more downstream in the network (e.g. Bglap1).
|
Furthermore, regulatory networks are expected to have a modular architecture (Featherstone and Broadie, 2002) consisting of highly interconnected modules where intermodule influences are exerted through only a few hub genes. Indeed, four modules can be recognized in the selected network, i.e. (1) the muscle development module (M1) i.e. related to the normal differentiation of untreated C2C12 precursor cells to myoblasts, containing Tnnc2 as a main regulator and many muscle-related genes, (2) a module related to proliferation (M2 containing o.a. Mrpplf3, Hmga1 and Igfbp2) regulated primarily by proliferin (Plf), (3) a module with genes related to extracellular matrix formation and processing (M3) with decorin (Dcn) and periostin (Postn) as main regulators and (4) a module of genes related to osteoblast maturation and mineralization (M4 containing o.a. Bglap1 and Ogn). To provide a better visualization of the network structure, we utilized clustering and domain knowledge to organize the genes into these four modules. These modules correspond to the clusters/modules recognized by hierarchical clustering and which were depicted in Figure 1. An exception is formed by the osteoblast differentiation module (cluster 2) that we split into a matrix formation module (M3) and a mineralization module (M4) according to the phases that are recognized in the osteoblast differentiation process (Franceschi, 1999).
Consistent with a modular structure, connections from one module to another are established only through one of the few hub genes (i.e. Tnnc2, Plf, Dcn and Postn). Furthermore, these regulating influences from one module to another are consistently positive or negative, i.e. all influences have the same sign. These influences (large arrows in Fig. 5) are indicative of the global regulatory behavior.
Genes that had exactly the same ingoing and outgoing influences were grouped together into a single node (black boxes in Fig. 5). A node consists of genes with the same pattern of positive and negative (non-zero) weights on its row and column (main diagonal excluded). Figure 6 shows the gene expression profiles of all 101 genes, where genes within the same node are plotted in the same subplot. This assignment of genes into nodes can be seen as an alternative way of clustering, where genes are grouped according to how they influence and how they are influenced by other genes. This property of received and exerted influences may potentially give a better reflection of a gene's function than its expression profile alone. Obviously, both properties are strongly related, i.e. genes with very similar expression profiles will probably receive similar incoming connections. Figure 6 shows that genes in the same node tend to have similar expression profiles, although some expression profiles that seem similar can be found in different nodes.
|
| 4 DISCUSSION |
|---|
|
|
|---|
We have presented a method to infer genetic interactions from time-course microarray data. Given the large number of genes and the low number of microarrays it is critical that the complexity of the model is sufficiently regularized and additional information is incorporated in the inference process. To this end, a simple linear model was chosen as the core of our approach and the model was constrained to be robust against noise and sparsely connected. We have shown how LARNA provides a balance between these properties and data-fit.
We have applied several approaches to a C2C12 microarray data set with 101 selected genes measured at eight different biological states available for learning. The interactions inferred by LARNA were significantly more frequent between genes that are mentioned together in literature than could be expected by chance. Obviously, the literature network is not a very accurate reflection of the true underlying network (which is not known). Co-citation networks contain a substantial amount of false connections (e.g. genes that are co-cited but do not influence each other) and missing connections (i.e. interactions that are not yet known). The sensitivity and specificity values are, therefore, only meaningful in a relative way. Nevertheless, literature enrichment is still useful as a measure to compare different methods, since (1) these errors are unbiased (i.e. they modify the score of any method in the same way) and (2) the number of co-cited genes that do truly interact is sufficiently high to show difference in performance.
The performance of LARNA was found to be superior to all other genetic network models. Search strategies may return significantly enriched structures, but are hampered by the substantial noise in the data. Alternatively, the non-linear models may potentially allow for more realistic model behavior, but they suffer too severely from unreliably parameter identification, due to the limited number of measurements. To a large extent, LARNA showed a higher enrichment of co-citation than could be obtained with the best setting of hierarchical clustering. This makes network inference a valid and interesting method to analyze micorarray data in a way that is complementary to clustering. Whereas clustering obtains groups of co-regulated genes, network inference goes one step further and unravels the causal relationships between them.
More detailed study showed that many inferred interactions and network properties are consistent with current biological knowledge and expertise. In our opinion, the selected network seems to capture the more global interrelationships corresponding to osteoblast differentiation, but some important parts are still missing and many inferred interactions are likely to reflect indirect effects. This is exemplified by the fact that the selected network contains only one known transcription factor (Hmga1). Furthermore, some well-known regulators of the myogenic and osteogenic pathways are missing. With respect to the osteogenic regulator Runx2 this is explained by the fact that the bone-specific transcript is not represented on the array (de Jong et al., 2004), whereas others did not show significant expression. Nevertheless, some missing elements are to be expected of a network that is generated ab initio and from such an extremely low number of arrays.
It seems, however, that the global interrelationships between genes and modules revealed by the network fit extremely well with the current knowledge of differentiation of mesenchymal cells. Differentiation of mesenchymal cells occurs with a concurrent decline in proliferative capacity (Aubin, 1998; Peng et al., 2004). Accordingly, the network shows negative feedback from the muscle and osteoblast differentiation modules to the proliferation module. In addition, it reveals negative feedback from the proliferation module to the osteoblast differentiation module. This is in line with experimental data that show a mutually exclusive relation between proliferation and terminal differentiation of mesenchymal cells (Ying et al., 2003; Marzia et al., 2000).
Within the discovered network, a central role is found for periostin and decorin, both extracellular genes involved in cell-matrix interactions (Horiuchi et al., 1999; Danielson et al., 1997). They initiate the forming of an extra-cellular matrix (Col1a1 and Timp3) and eventually initiate osteoblast maturation genes (Bglap1 and Ogn). Such a proposed stimulation of the differentiation process by matrixcell interactions is in line with published data. Disruption of the decorin gene leads to abnormal collageneous matrix formation (Danielson et al., 1997). Under normal conditions, formation of the extracellular matrix drives the differentiation of osteoblasts. Disruption of the matrix leads to impaired differentiation and an osteopenic phenotype in vivo (Franceschi et al., 1994; Lai et al., 2001; Corsi et al., 2002). The interactions between matrix-associated genes and terminal differentiation markers predicted by our computer-learned network are in good agreement with these experimental data.
In conclusion, we presented LARNA, a regression algorithm to infer sparse genetic networks from microarray data sampled over time. Interacting gene pairs identified by LARNA are enriched for co-cited genes. In this respect, LARNA outperforms standard network inference methods and clustering algorithms. Importantly, two global hypothesies raised by LARNA conform to current biological knowledge, i.e. (1) osteoblast maturation is induced by extra cellular matrix formation and (2) proliferation and differentiation are two mutually exclusive modes of operation. The interactions revealed by LARNA on the current limited set of microarrays are of a global character and do not reflect all details. Preliminary in silico studies support the intuition that samples should be taken more frequent at moments when the dynamics of the gene expression signals increases (data not shown). The enrichment scores were not sensitive to changes in the interpolation interval, but the interval at which measurements are obtained as well as the number of different treatments will be much more important. Future studies with more arrays sampled at smaller intervals are likely to provide a network map with higher resolution. It would also prove interesting to compare the results of the different regression networks with different publicly available co-citation networks and other global co-expression networks.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Current address: Nutrition, Metabolism and Genomics Group, Division of Human Nutrition, Wageningen University, Wageningen, The Netherlands Associate Editor: Alex Bateman
Received on November 1, 2005; revised on November 16, 2005; accepted on December 1, 2005
| REFERENCES |
|---|
|
|
|---|
Arkin, A., et al. (1997) A test case of correlation metric construction of a reaction pathway from measurements. Science, 277, 12751279
Aubin, J. (1998) Advances in the osteoblast lineage. Biochem. Cell Biol, . 76, 899910[CrossRef][Web of Science][Medline].
Blake, J., et al. (2003) Mgd: the mouse genome database. Nucleic Acids Res, . 31, 193195
Chaussabel, D. and Sher, A. (2002) Mining microarray expression data by literature profiling. Genome Biol, . 3, R55.155.16.
Churchill, G. (2002) Fundamentals of experimental design for cDNA microarrays. Nat. Genet, . 32, 490495.
Corsi, A., et al. (2002) Phenotypic effects of biglycan deficiency are linked to collagen fibril abnormalities, are synergized by decorin deficiency, and mimic ehlers-danlos-like changes in bone and other connective tissues. J. Bone Miner. Res, . 17, 11801189[CrossRef][Web of Science][Medline].
Danielson, K., et al. (1997) Targeted disruption of decorin leads to abnormal collagen fibril morphology and skin fragility. J. Cell Biol, . 136, 729743
de Jong, D.S., et al. (2004a) Regulation of notch signaling genes during bmp2-induced differentiation of osteoblast precursor cells. Biochem. Biophys. Res. Commun, . 320, 1001007[CrossRef][Web of Science][Medline].
de Jong, D.S., et al. (2004b) Identification of novel regulators associated with early-phase osteoblast differentiation. J. Bone Miner. Res, . 19, 947958[CrossRef][Web of Science][Medline].
D'Haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R. (1999) Linear modeling of mRNA expression levels during CNS development and injury. Proceedings of the Pacific Symposium on Biocomputing '99 , NJ, USA World Scientific Publishing Co. Vol. 4, , pp. 4152.
Featherstone, D. and Broadie, K. (2002) Wrestling with pleiotropy: genomic and topological analysis of the yeast gene expression network. Bioessays, 24, 267274[CrossRef][Web of Science][Medline].
Franceschi, R. (1999) The developmental control of osteoblast-specific gene expression: role of specific transcription factors and the extracellular matrix environment. Crit. Rev. Oral Biol. Med, . 10, (1) 4057
Franceschi, R. (1994) Effects of ascorbic acid on collagen matrix formation and osteoblast differentiation in murine mc3t3-el cells. J. Bone Miner. Res, . 9, 843854[Web of Science][Medline].
Gardner, T., et al. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, (5629) 102105
Grandvalet, Y. (1998) Least absolute shrinkage is equivalent to quadratic penalization. International Conference on Artificial Neural Networks (ICANN'98) , Sweden Skövde.
Guelzim, N., et al. (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics, 31, , pp. 6063[CrossRef][Web of Science][Medline].
Guthke, R., et al. (2005) Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics, 21, 16261634
Hartemink, A.J., et al. (2001) Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing 2001 , Hawai World Scientific Publishing Co. Vol. 6, , pp. 422433.
Hashimoto, R.F., et al. (2004) Growing genetic regulatory networks from seed genes. Bioinformatics, 20, 12411247
Horiuchi, K., et al. (1999) Identification and characterization of a novel protein, periostin, with restricted expression to periosteum and periodontal ligament and increased expression by transforming growth factor beta. J. Bone and Miner. Res, . 14, 12391249[CrossRef][Web of Science][Medline].
Jenssen, T., et al. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28, 2128[CrossRef][Web of Science][Medline].
Lai, C., et al. (2001) Erk is essential for growth, differentiation, integrin expression, and cell function in human osteoblastic cells. J. Bio. Chem, . 276, 1444314450
Lee, T.I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799804
Liao, J.C., et al. (2003) Network component analysis: Reconstruction of regulatory signals in biological systems. PNAS, 100, 1552215527
Marzia, M., et al. (2000) Decreased c-Src expression enhances osteoblast differentiation and bone formation. J. Cell Biol, . 151, 311320
Pe'er, D., et al. (2001) Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, S215S224[Abstract].
Peng, Y. (2004) Inhibitor of DNA binding/differentiation helixloophelix proteins mediate bone morphogenetic protein-induced osteoblast differentiation of mesenchymal stem cells. J. Biol. Chem, . 279, 3294132949
Rung, J., et al. (2002) building and analysing genome-wide gene disruption networks. Bioinformatics, 18, Suppl. 2, S202S210[Abstract].
Segal, E., et al. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet, . 34, 166176[CrossRef][Web of Science][Medline].
Tavazoie, S., et al. (1999) Systematic determination of genetic network architecture. Nat. Genet, . 22, 281285[CrossRef][Web of Science][Medline].
Tibshirani, R. (1996) Regression selection and shrinkage via the lasso. J. Royal Statist. Soc. B, . 58, 267288.
Vaes, B., et al. (2002) Comprehensive microarray analysis of bone morphogenetic protein 2-induced osteoblast differentiation resulting in the identification of novel markers for bone development. J. Bone Miner. Res, . 17, 21062118[CrossRef][Medline].
van Berlo, R., et al. (2003) Studying the conditions for learning dynamic Bayesian networks to discover genetic regulatory networks. Simulation, 79, 689702[Abstract].
van Someren, E., Wessels, L., Reinders, M., Backer, E. (2001) Searching for limited connectivity in genetic network models. Proceedings of the Second International Conference on Systems BiologyPasadena, California , pp. 222230.
van Someren, E., Wessels, L., Reinders, M., Backer, E. (2002) Regularization and noise injection for improving genetic network models. Computational and Statistical Approaches to Genomics, , NJ, USA World Scientific Publishing Co., pp. 211226.
van Someren, E., et al. (2003) Multi-criterion optimization for genetic network modeling. Signal Process, . 83, 763775[CrossRef].
Wahde, M. and Hertz, J. (2001) Modeling genetic regulatory dynamics in neural development. J. Comput. Biol, . 8, 429442[CrossRef][Web of Science][Medline].
Weaver, D., Workman, C., Stormo, G. (1999) Modeling regulatory networks with weight matrices. Proceedings of the Pacific Symposium on Biocomputing '99Hawai , NJ, USA World Scientific Publishing Co. Vol. 4, , pp. 112123.
Ying, Q., et al. (2003) Bmp induction of Id proteins suppresses differentiation and sustains embryonic stem cell self-renewal in collaboration with STAT3. Cell, 115, 281292[CrossRef][Web of Science][Medline].
Yue, H., et al. (2001) An evaluation of the performance of cDNA microarrays for detecting changes in global mRNA expression. Nucleic Acids Res, . 29, E411.
This article has been cited by other articles:
![]() |
F. Ortega, K. Sameith, N. Turan, R. Compton, V. Trevino, M. Vannucci, and F. Falciani Models and computational strategies linking physiological response to molecular networks from large-scale data Phil Trans R Soc A, September 13, 2008; 366(1878): 3067 - 3089. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. Hirose, R. Yoshida, S. Imoto, R. Yamaguchi, T. Higuchi, D. S. Charnock-Jones, C. Print, and S. Miyano Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models Bioinformatics, April 1, 2008; 24(7): 932 - 942. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. J. Reiss, M. T. Facciotti, and N. S. Baliga Model-based deconvolution of genome-wide DNA binding Bioinformatics, February 1, 2008; 24(3): 396 - 403. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












is positioned at the target. Similarly, a directed negative interaction (inhibition) is indicated by a
marker at the target. Self-regulations are not shown. The genes were grouped into four modules (large gray boxes) based on the clustering and known osteoblast differentiation phases (see text). Large arrows indicate interaction between modules. Numbers indicate TP per link.

