Skip Navigation


Bioinformatics Advance Access originally published online on December 6, 2005
Bioinformatics 2006 22(4):477-484; doi:10.1093/bioinformatics/bti816
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/4/477    most recent
bti816v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by van Someren, E. P.
Right arrow Articles by Reinders, M. J. T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by van Someren, E. P.
Right arrow Articles by Reinders, M. J. T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Least absolute regression network analysis of the murine osteoblast differentiation network

E. P. van Someren 1,*, B. L. T. Vaes 2, W. T. Steegenga 2,{dagger}, A. M. Sijbers 3, K. J. Dechering 3 and M. J. T. Reinders 1

1Department of Mediametics, Delft University of Technology 2600 GA Delft, The Netherlands
2Department of Applied Biology, University of Nijmegen Nijmegen, The Netherlands
3N.V.Organon, Target Discovery Unit Oss, The Netherlands

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 Materials and methods
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: We propose a reverse engineering scheme to discover genetic regulation from genome-wide transcription data that monitors the dynamic transcriptional response after a change in cellular environment. The interaction network is estimated by solving a linear model using simultaneous shrinking of the least absolute weights and the prediction error.

Results: The proposed scheme has been applied to the murine C2C12 cell-line stimulated to undergo osteoblast differentiation. Results show that our method discovers genetic interactions that display significant enrichment of co-citation in literature. More detailed study showed that the inferred network exhibits properties and hypotheses that are consistent with current biological knowledge.

Availability: Software is freely available for academic use as a Matlab package called GENLAB: http://genlab.tudelft.nl/genlab.html

Contact: E.P.vanSomeren{at}tudelft.nl

Supplementary information: Additional data, results and figures can be found at http://genlab.tudelft.nl/larna.html


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 Materials and methods
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
The development of cells is largely determined by the past and current states of transcription of a relevant set of genes. The transcription levels themselves are the result of a complex network of influences exerted by the genes on each other through gene products and signaling pathways. In order to understand any developmental process, it is imperative to unravel the underlying genetic interaction network. This understanding might lead to the development of novel (therapeutic) intervention strategies. Although techniques like genome-wide location analysis (Lee et al., 2002) and promoter analysis (Tavazoie et al., 1999) may reveal some of the direct transcriptional interactions, the outcome of developmental events is still difficult to explain from sequence information and location data alone.

Alternatively, microarray measurements sampled over time present a global view on transcriptional changes during development and allow to elucidate the regulatory network in a complementary manner, i.e. from transcript data to regulatory influences. Current successful methodologies to infer genetic interactions from microarray data, however, have primarily been restricted to the use of perturbation (e.g. knockout) microarray data (Gardner et al., 2003; Segal et al., 2003; Rung et al., 2002; Pe'er et al., 2001; Hartemink et al., 2001) or learn interactions between representatives of gene clusters (Guthke et al., 2005; Wahde and Hertz, 2001).

Here, we present the Least Absolute Regression Network Algorithm (LARNA), an efficient and robust method to infer a sparse genetic regulatory network structure from microarray data sampled over time. In our framework, a generative linear model is learned to predict the expression values of a target gene, based on a linear combination of the past expression values of an automatically selected set of genes, denoted as its regulators. The main advantage of our method is the automatic determination of the network structure from the data. We applied our method on a time-course microarray dataset of the mouse C2C12 cell-line with and without stimulation to undergo osteoblast differentiation. Results show that LARNA discovered genetic interactions that are significantly more co-cited in literature than can be expected by chance. To a large extent LARNA shows higher enrichment of co-citation than hierarchical clustering. This exemplifies the fact that our network inference methodology presents a complementary approach to the more popular clustering methods, by aiming at discovering regulatory behavior instead of functional units.

The final network that was discovered presents a global view on the genes and influences involved in driving mesenchymal stem cells into osteoblastic or myoblastic differentiation. The network also poses hypotheses consistent with biological knowledge of which at least two were confirmed by literature study.


    2 Materials and methods
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 Materials and methods
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
The complete preprocessing procedure consisted of (1) growing the cell cultures, harvesting samples and RNA isolation, (2) hybridization and read-out of microarrays, (3) removing array outliers, (4) array normalization and (5) gene selection and interpolation. This resulted in a single dataset, on which different clustering and regression methods were applied. Each method results in a different network. All network structures were tested for co-citation enrichment by comparing them with a co-citation network obtained from PubMed. A single regression network was chosen based on co-citation enrichment and its structure was visualized (using clustering results).

2.1 Cell culture and RNA isolation
C2C12 murine mesenchymal progenitor cells (obtained from the American Type Culture Collection) were maintained in Dulbecco's modified Eagle's medium (DMEM) supplemented with 10% newborn calf serum (HyClone®), penicillin (100 U/ml) and streptomycin (100 µg/ml) at 37°C in a humidified atmosphere containing 7.5% CO2. For differentiation experiments, cells were seeded at a density of 2.0 x 104 cells/cm2 and grown for 24 h. From this culture, cells were harvested in triplicate and this time point is denoted as time zero. Subsequently, the medium was replaced by DMEM containing 5% NCS. One subculture received no treatment and another subculture was treated with 500 ng/ml BMP2 (R & D Systems) and 12.5 mM ß-glycerophosphate (Sigma) to induce osteoblast differentiation. Cells were subsequently harvested from both subcultures at 1, 2, 6 and 10 days after time zero. As expected, untreated cells developed into myoblasts, whereas the treated cells differentiated into osteoblasts which was confirmed by monitoring cell morphology and marker gene expression as described previously (Vaes et al., 2002; de Jong et al., 2004a).

Total RNA was extracted according to an acid guanidium thiocyanate–phenol–chloroform method (TriPure® Isolation Reagent, Boehringer Mannheim) and RNA concentrations were determined by measuring the absorbance at 260 nm. Poly(A)+ RNA was prepared from total RNA using the OligotexTM Method (Qiagen). Quantification of poly(A)+ RNA was carried out according to the RibogreenTM RNA quantitation assay (Molecular Probes).

2.2 Array hybridization
A microarray dataset was generated from the aforementioned 11 samples. We employed an experimental design based on a common reference sample as described previously (Churchill, 2002). To this end, Cy5 labeled cDNA was synthesized from mRNA purified from the cell cultures and hybridized to Unigene 1 arrays (Incyte Genomics, Mountain View, CA) as described previously (Yue et al., 2001). Cy3 labeled cDNA derived from mouse brain was used as a control in these hybridizations.

2.3 Data preprocessing
Taking the proper prepreprocessing steps is important as it may heavily influence the outcome. In the supplementary materials, the chosen preprocessing steps and their motivations are described in more detail. Here, we briefly describe the preprocessing steps that were taken.

After read-out, array outlier detection was performed on the arrays. One array (24 h after treatment) was found to be aberrant and after manually being confirmed to be contaminated, it was removed from further analysis.

The common reference approach allowed us to employ the Cy3 intensities as a gene-specific way to compensate for array differences. Gene inductions in time were expressed as a balanced ratio between biological and control intensities after both were scaled to the initial untreated intensities. This results in the normalized expression value of gene g on array a:

Formula 1(1)
Here, Formula 1 denotes the spot intensity in the Cy5 channel of the g-th gene on the a-th array, whereas Formula 1 denotes the Cy5 intensity of the g-th gene, averaged over the three untreated arrays at time zero. A similar notation is used for Cy3. Using Equation (1), a normalized gene profile is constant zero if the spot intensities measured in the biological sample show the same deviation from its initial intensity level as its corresponding spot in the control sample.

After normalization, a set of significantly expressed genes was selected based on whether or not they showed an absolute gene expression value in any of the remaining arrays that exceeds a preset threshold {alpha} = 3.052. This threshold was determined by taking the maximal observed absolute gene expression values of a list of known housekeeping genes.

Having a non-uniform sampling, the data are augmented with a series of synthetic microarray results by using linear interpolation between subsequent arrays. Using a one-day interval, {Delta}t = 24 h, the final dataset resulted in 21 ‘measurements’, i.e. two series of 10 ‘measurements’ at days 1, 2, ... , 10 and one measurement at zero. The choice of {Delta}t had minor effect on the overall performance (see Supplementary materials).

2.4 Regression networks
We inferred a genetic network by learning a mathematical model to predict future gene expression values based on past gene expression values. One of the simplest models is a linear genetic network model (D'Haeseleer et al., 1999; Gardner et al., 2003). The linear model assumes that the gene expression level of each gene is the result of a weighted sum of all other gene expression levels at the previous time point, or in vector notation:

Formula 2(2)
Here {varepsilon} models the error and Q is the number of microarray transition pairs, <xq, yq>, that are available for learning and are indexed by q. If xq represents a measurement at time t, e.g. xq = x(t), then yq = x(t + {Delta}t). The interaction parameter, wij isin W, represents the existence (wij != 0) or absence (wij = 0) of a controlling action of regulator gene j on target gene i, whether it is activating (wij > 0) or inhibiting (wij < 0), as well as the strength (|wij|) of the relation. Note that all genes are target genes as well as possible regulator genes, i.e. i, j isin 1 , ... , N. Furthermore, a bias and a treatment node were included in W by adding two artificial inputs to xq that are treated as if they were normal genes, i.e. a ‘bias gene’ with expression value equal to one for all arrays and a ‘treatment gene’ with zero expression in untreated and one in treated arrays.

A standard approach to obtain an estimate of the complete set of model parameters, Formula 2, from data, is to minimize the squared error between the predicted Formula 2 and measured yq gene expression levels, i.e.

Formula 3(3)
However, standard linear algebra techniques will not be sufficient, because the number of microarray pairs is far less than the number of selected genes, resulting in a highly under-determined problem. Due to the resulting problems of colinearity and overfitting, it is impossible to unambiguously determine the best parameter set for the complete set of regulators and it becomes essential to constrain the solution space appropriately (Liao et al., 2003).

Restricting the number of regulators follows the biological intuition that only a limited number of genes will directly influence a gene's transcription during the process under consideration (Guelzim et al., 2002). Even if this assumption is violated in some cases, it makes sense to apply Occam's razor and infer only the most prominent influences when insufficient data are available (Hashimoto et al., 2004). Also, to function stably in a stochastic environment, the global regulatory system needs to react robustly against small perturbations in the regulatory influences and signalling pathways that feed it. Ensuring that the inferred models are robust against noise will therefore regularize the inference process and direct it towards more plausible models.

We obtained a combination of data-fit, robustness and limited connectivity by augmenting the standard squared error with a penalty term that sums the absolute values of the weights (van Someren et al., 2003):

Formula 4(4)
Here, {lambda} provides a trade-off between data-fit versus robustness and limited connectivity. Note that robustness implies that small changes in the input result in small changes in the output, i.e. {partial}yi/{partial}xj = wij are small. A solution to this equation is, generally, found using only a few iterations of the EM-algorithm (Grandvalet, 1998). In statistical learning, this method is called Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1994), because it tends to shrink the weights (robustness) such that only a few weights remain non-zero (limited connectivity). As a result, LARNA automatically selects from among all genes, that set of genes (called regulators) that, in combination, provides the best prediction of that gene's future gene expression levels. The use of LASSO to infer genetic interactions was first introduced by van Someren et al. (2002) and shown to outperform other algorithms on synthetic data.

To test how well our method compares with other approaches, we have also implemented three linear models [i.e. sparse QR (SQR), backward search (B) and forward beamsearch (F)] three non-linear approaches [i.e. Discrete Dynamic Bayesian Network (DDBN), Sigmoidal Gradient Ascent (SGA) and Sigmoidal Least Absolute Penalization (SLAP)] and two time-lagged correlation based methods [i.e. Correlation Metric Construction (CMC) and Significant Time-Lagged Correlation (STLC)]. SQR is a standard linear algebra method based on Gaussian Elimination and QR decomposition that returns a (zero-error) matrix that consists completely of zeros except for R columns which contain non-zero values (R equals the rank of the system). B is a method that finds sparse network solutions based on a greedy backward search algorithm that for each target gene keeps removing the regulator with the lowest weight until a given minimal number of regulators, Kmax, is reached or the error becomes non-zero (Weaver et al., 1999). F is a method that finds sparse network solutions based on a forward beam-search, with beam-size K = 5 (van Someren et al., 2001). For each target gene, this method starts by checking all single-regulator networks. Then it checks all two-regulator networks that can be constructed by adding a regulator to each of the best K single-regulator networks. This recursively repeats with more regulators until the prediction error is below a desired level Emax. The DDBN uses greedy hill-climbing with 100 restarts to find the best structure (with maximal connectivity Kmax) of a dynamic Bayesian network (van Berlo et al., 2003). Before learning, data were discretized into three gene expression levels using K-means clustering [as in (Pe'er et al., 2001)]. SGA uses 100 restarts of gradient ascent to learn a linear model augmented with a sigmoidal dose-response curve [similar to (Weaver et al., 1999; Wahde and Hertz, 2001)] and recursively sets the lowest weights to zero until Zmax non-zeros remain. SLAP is similar to SGA with the addition of the sum of absolute weights to the cost function. STLC first computes a matrix of the most significant correlation found between each pair of genes over all possible time-lags and returns different networks by thresholding the correlations in this matrix based on significance. CMC is an implementation of Arkin et al. (1997) which constructs a directed network from a single-linked tree that connects genes that have similar correlation profiles. Such a profile consists of the correlations to all other genes, individually maximized for all possible time-lags. All regression and cluster methods were applied to the same outlier-removed normalized interpolated data. More detailed descriptions of all methods can be found in the Supplementary materials.

2.5 Co-citation network and enrichment
Jenssen et al. (2001) has shown that a co-citation network reflects biologically meaningful relationships. One way to validate the network results is, therefore, to compare them against a co-citation network. To this end, a list of gene abbreviations was constructed by extracting symbol name and synonyms from the gene detail pages of MGI (Blake et al., 2003). For each gene abbreviation, PubMed was searched for publications that mention this term and a list of identifiers (PMID) referring to these publications was extracted. A pair is said to be co-cited when there exists at least one paper that mentions both

Formula 5(5)
Here, k(i) is the list of gene abbreviations of gene i and ps is the list of PubMed IDs corresponding to term s.

For each network result, the number of true positives was determined as the number of gene pairs for which the inferred network predicts an interaction and that are co-cited

Formula 6(6)
Because TP depends on the connectivity of the co-citation matrix as well as on the connectivity of the inferred matrix, two complementary performance measures were computed. Specificity (SP) indicates how often a proposed interaction concerns genes that are related and is defined as TP divided by the number of predicted interactions. Sensitivity (SE) indicates how much of the ‘ground truth’ is discovered and is defined as TP divided by the number of pairs that are co-cited.

Formula 7(7)
Here, FP stands for false positives and FN for false negatives.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 Materials and methods
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
3.1 Clustering and co-expression networks
C2C12 cells are myoblasts that form myotubes when cultured under low serum conditions. Upon treatment with BMP2, the cells differentiate to an osteoblast phenotype (Vaes et al., 2002). We studied gene expression at several time points during myoblast and osteoblast differentiation. Figure 1 shows an overview of the microarray data. A total of 101 genes (out of the 9596) showed significant changes in expression with respect to zero under our experimental conditions. To give a meaningful description of the data, these genes were grouped by hierarchical clustering into three main clusters. Cluster 1 consists of genes that are downregulated during either osteoblast or myoblast differentiation. This cluster contains many genes related to control of proliferation (e.g. Plf, Mcm5, Hmga1). See Supplementary materials for a complete list of gene symbols, gene names and accession numbers of all genes mentioned in this paper. The second cluster visible in Figure 1 consists of genes induced during osteoblast differentiation and contains typical osteoblast markers such as Akp2, Bglap1 and Col1a1. Cluster 3 consists of genes that are induced during myoblast differentiation but do not change in expression in BMP2-treated cells. Many well-known muscle-related genes (e.g. Myh1, Tncc, Myom1, Casq1, Ckm) reside in this cluster.


Figure 1
View larger version (24K):
[in this window]
[in a new window]
 
Fig. 1 Gene expression values from 11 microarrays (columns) of the 101 genes that were significantly differentially expressed (rows) visualized as a heatmap. Black values represent positive expression, white values negative expression, whereas gray values correspond the absence of expression (see color bar at the right). Arrays consist of an untreated triplicate (ZE), an untreated subseries (UN) and a treated subseries (BB). Both subseries were sampled at 1 (24), 2 (48), 6 (144) and 10 days (240) after ZE. Array BB24 was found to be contaminated and removed from further analysis. The bar to the right of the heatmap indicates the clustering result using hierarchical clustering with the cosine distance and complete linkage at a selection of three clusters. Before clustering and network inference, the triplicate at ZE was averaged and linear interpolation was used to obtain a virtual dataset that is (uniformly) sampled every day, e.g. expression values at UN72, UN96 and UN120 were obtained by linear interpolation between UN48 and UN144.

 
Clustering is known to place functionally related genes into the same group. The results of the hierarchical clustering could, therefore, be interpreted as if it represents a network of fully connected modules (one module for each cluster) without any connections between modules. Such a co-expression network is illustrated in Figure 2 for a small subset of genes (chosen arbitrarily just for visualization purposes).


Figure 2
View larger version (29K):
[in this window]
[in a new window]
 
Fig. 2 Illustration of co-expression network between a small arbitrarily chosen subset of genes based on the same clustering as was used in Figure 1. There are three clusters, i.e. one cluster of four genes, one cluster of six genes and one cluster of one gene. The corresponding co-expression network contains links between all genes in the same cluster, but no links between genes in different clusters. The position of the genes is based on a Sammon projection which places genes closer together when their expression profiles are more similar.

 
3.2 Enrichment for co-citation in inferred networks
If the inferred networks express relationships that are biologically meaningful, it is expected that these networks also show enrichment for co-citation compared with random networks. Therefore, we constructed a co-citation matrix for the 101 selected genes. This matrix is represented in Figure 3. In this matrix, genes are ordered and grouped according to the clustering described in Figure 1. We expect that co-expression networks obtained from clustering will tend to show a strong enrichment for co-citation and that the co-citation enrichment scores of clustering may serve well as reference performance. Figure 3 already indicates that large clusters will not show a particulary strong enrichment for co-citation (only the cluster of osteoblast markers is slightly enriched). Instead, the occurrence of several small white triangles close to the diagonal suggests that enrichment is stronger for small clusters. Alternatively, we found that genes in close hierarchical proximity share similar row and column patterns, indicating that co-expressed genes tend to share the same pattern of co-citation with other genes (Chaussabel and Sher, 2002).


Figure 3
View larger version (24K):
[in this window]
[in a new window]
 
Fig. 3 Co-citation matrix extracted from PubMed between the 101 selected genes. Because the matrix is symmetric, only the lower triangular part is visualized. White patches indicate pairs of genes that are co-cited at least once, black patches indicate no co-citation. The main diagonal indicates whether single genes are cited at least once (white) or not at all (black). The upper triangle visualizes the same clustering result as was used in Figure 1. Pairs of genes that are in the same cluster are indicated by a common color, white patches indicate pairs of genes that are in different clusters. Note that a clustering in N clusters that is strongly enriched in co-citation would show up as N triangular structures below the diagonal that contain more white patches than elsewhere.

 
A systematic analysis of co-citation enrichment for different distance measures, linkage types and cluster sizes using hierarchical clustering shows that most co-expression networks resulting from small cluster sizes (i.e. a large number of clusters) are significantly enriched for co-citation. A cosine distance measure and single linkage combination was found to score best (See Supplementary materials). Therefore, we have used this ‘best-scoring’ combination as a method of reference to compare co-citation enrichment of co-expression networks with regression networks.

Networks derived from clustering of genes show full connectivity within clusters (Fig. 2) whereas true genetic networks are likely to be sparsely connected. Furthermore, the clustering algorithms do not reveal the causal relationships within the genetic network. Therefore, we devised a method for LARNA that can unravel genetic network structure from microarray data sampled in time. LARNA distinguishes itself from other regression network models as it provides a unique trade-off between data-fit versus robustness and limited connectivity.

Repeated application of all methods on synthetic data generated using a linear model with random structure shows that HC, F, STLC and LARNA perform on average significantly better than random, whereas SQR, B and CMC do not or only marginally (See Supplementary materials).

Because synthetic data may not reflect the actual characteristics of the data, Figure 4 shows the performance of LARNA applied to the C2C12 differentiation data and compared with other network models. The specificity of the networks with random connections (Random) indicates that the prior probability of randomly selecting a pair of co-cited genes is ~6% on average with roughly 2% SD. The results show that standard linear approaches that search for limited connectivity (B, F and SQR) perform significantly better than random if the inferred networks are sufficiently connected (SE > 0.04). Moreover, in the current setting, recursively adding possible regulators (F) seems a more fruitful search strategy than recursive elimination (B). The probabilistic and non-linear model (DDBN) is able to outperform randomly connected network structures, but is very sensitive to its parameter. The need to discretize the data prior to learning in combination with the severely limited number of samples may have contributed to DDBN's bad performance. The non-linear SGA does outperform the randomly connected structures, but not its linear equivalent (B). More strikingly, SLAP is not able to return networks that are significantly better than random. Finally, STLC outperforms HC at larger sensitivity ranges and performs similarly when returning more sparsely connected networks. In contrast, CMC (located at 1.6% sensitivity) which is also based on time-lagged correlations, but which was not originally intended for genetic network reconstruction, performs only marginally better than random. All of these methods aim to infer networks of limited connectivity, but they do not provide measures against noise on the data. LARNA is able to outperform these methods by incorporating both criteria simultaneously.


Figure 4
View larger version (28K):
[in this window]
[in a new window]
 
Fig. 4 The specificity (SP) versus sensitivity (SE) for the network methods with respect to the co-citation network from Figure 3. Dotted lines depict performance of random networks plus and minus 1 SD. Lines with markers depict the performance of network inference methods for different parameter settings (the parameter is depicted in the legend after the method's name). Parameter-less methods are depicted with a single marker. The upper-right corner specifies optimal performance. For each method, one solution is chosen manually as a possible ‘best’ solution and indicated by an arrow with the corresponding parameter setting.

 
3.3 LARNA shows highest enrichment in co-citation
The best overall performance of all network inference methods was obtained by LARNA, which outperformed the best clustering method particularly at intermediate levels of sensitivity and to a lesser extent at high levels of sensitivity. The {lambda} parameter of LARNA determines the trade-off between fitting the data versus robustness and limited connectivity. When data-fit is less emphasized, the inferred networks become less sensitive to noise and more sparsely connected, resulting in higher specificity (sometimes at the cost of sensitivity). This pronounced effect is exemplified by LARNA's performance in Figure 4 around 6.5% sensitivity. While data-fit is relaxed, the sensitivity remains ~6.5%, but the specificity increases from 15% to almost 32%. Here, LARNA removes unnecessary regulators from the network thereby removing a significant number of false positives.

3.4 Analysis of a single inferred network
To further illustrate the consistency of the networks inferred by LARNA with current biological expertise, a single network of interest was selected to be analyzed in more detail. Because, the best setting of {lambda} cannot be reliably determined from the already limited amount of data (e.g. by cross-validation), we propose to utilize the literature enrichment scores to select a single network solution. We have selected the LARNA solution that scored best in specificity (32%), while giving sufficiently high sensitivity (6%).

The selected network depicted in Figure 5 is indeed quite sparse, with genes having mostly one or two, but maximally four, regulators. Note that the sparseness constraint effects the number of regulators only, the number of targeted genes can still vary from none (e.g. Ogn) to many (e.g. Dcn). Obviously, the position of the genes within the network is also of importance, i.e. we expect genes with regulating roles (e.g. Tnnc2) to be network initiators (i.e. influencing many but being influenced by few) and known (osteoblast) differentiation markers to be more downstream in the network (e.g. Bglap1).


Figure 5
View larger version (70K):
[in this window]
[in a new window]
 
Fig. 5 Diagram of the network inferred by LARNA. Each node (black box, white edge) represents a set of the genes with the same inferred input and output connections. For illustrative purposes, the genes that were not connected to any other gene were left out of the diagram. Arrows represent an inferred positive interaction (activation) from a regulating gene to its target(s), where the arrow head {uparrow} is positioned at the target. Similarly, a directed negative interaction (inhibition) is indicated by a {top} marker at the target. Self-regulations are not shown. The genes were grouped into four modules (large gray boxes) based on the clustering and known osteoblast differentiation phases (see text). Large arrows indicate interaction between modules. Numbers indicate TP per link.

 
Furthermore, regulatory networks are expected to have a modular architecture (Featherstone and Broadie, 2002) consisting of highly interconnected modules where intermodule influences are exerted through only a few ‘hub’ genes. Indeed, four modules can be recognized in the selected network, i.e. (1) the ‘muscle development’ module (M1) i.e. related to the normal differentiation of untreated C2C12 precursor cells to myoblasts, containing Tnnc2 as a main regulator and many muscle-related genes, (2) a module related to proliferation (M2 containing o.a. Mrpplf3, Hmga1 and Igfbp2) regulated primarily by proliferin (Plf), (3) a module with genes related to extracellular matrix formation and processing (M3) with decorin (Dcn) and periostin (Postn) as main regulators and (4) a module of genes related to osteoblast maturation and mineralization (M4 containing o.a. Bglap1 and Ogn). To provide a better visualization of the network structure, we utilized clustering and domain knowledge to organize the genes into these four modules. These modules correspond to the clusters/modules recognized by hierarchical clustering and which were depicted in Figure 1. An exception is formed by the osteoblast differentiation module (cluster 2) that we split into a matrix formation module (M3) and a mineralization module (M4) according to the phases that are recognized in the osteoblast differentiation process (Franceschi, 1999).

Consistent with a modular structure, connections from one module to another are established only through one of the few ‘hub’ genes (i.e. Tnnc2, Plf, Dcn and Postn). Furthermore, these regulating influences from one module to another are consistently positive or negative, i.e. all influences have the same sign. These influences (large arrows in Fig. 5) are indicative of the global regulatory behavior.

Genes that had exactly the same ingoing and outgoing influences were grouped together into a single node (black boxes in Fig. 5). A node consists of genes with the same pattern of positive and negative (non-zero) weights on its row and column (main diagonal excluded). Figure 6 shows the gene expression profiles of all 101 genes, where genes within the same node are plotted in the same subplot. This assignment of genes into nodes can be seen as an alternative way of clustering, where genes are grouped according to how they influence and how they are influenced by other genes. This property of received and exerted influences may potentially give a better reflection of a gene's function than its expression profile alone. Obviously, both properties are strongly related, i.e. genes with very similar expression profiles will probably receive similar incoming connections. Figure 6 shows that genes in the same node tend to have similar expression profiles, although some expression profiles that seem similar can be found in different nodes.


Figure 6
View larger version (45K):
[in this window]
[in a new window]
 
Fig. 6 Expression profiles of all 101 genes grouped into 18 different subplots. In each subplot, normalized gene expression is depicted on the Y-axis versus time (in days) on the X-axis for treated (solid line) and untreated cells (dotted line with marker). Genes in the same node are plotted in the same subplot. The subplots correspond to the 17 nodes depicted in Figure 5, except for one subplot (row 2, column 3) which corresponds to all unconnected genes.

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 Materials and methods
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
We have presented a method to infer genetic interactions from time-course microarray data. Given the large number of genes and the low number of microarrays it is critical that the complexity of the model is sufficiently regularized and additional information is incorporated in the inference process. To this end, a simple linear model was chosen as the core of our approach and the model was constrained to be robust against noise and sparsely connected. We have shown how LARNA provides a balance between these properties and data-fit.

We have applied several approaches to a C2C12 microarray data set with 101 selected genes measured at eight different biological states available for learning. The interactions inferred by LARNA were significantly more frequent between genes that are mentioned together in literature than could be expected by chance. Obviously, the literature network is not a very accurate reflection of the true underlying network (which is not known). Co-citation networks contain a substantial amount of false connections (e.g. genes that are co-cited but do not influence each other) and missing connections (i.e. interactions that are not yet known). The sensitivity and specificity values are, therefore, only meaningful in a relative way. Nevertheless, literature enrichment is still useful as a measure to compare different methods, since (1) these errors are unbiased (i.e. they modify the score of any method in the same way) and (2) the number of co-cited genes that do truly interact is sufficiently high to show difference in performance.

The performance of LARNA was found to be superior to all other genetic network models. Search strategies may return significantly enriched structures, but are hampered by the substantial noise in the data. Alternatively, the non-linear models may potentially allow for more realistic model behavior, but they suffer too severely from unreliably parameter identification, due to the limited number of measurements. To a large extent, LARNA showed a higher enrichment of co-citation than could be obtained with the best setting of hierarchical clustering. This makes network inference a valid and interesting method to analyze micorarray data in a way that is complementary to clustering. Whereas clustering obtains groups of co-regulated genes, network inference goes one step further and unravels the causal relationships between them.

More detailed study showed that many inferred interactions and network properties are consistent with current biological knowledge and expertise. In our opinion, the selected network seems to capture the more global interrelationships corresponding to osteoblast differentiation, but some important parts are still missing and many inferred interactions are likely to reflect indirect effects. This is exemplified by the fact that the selected network contains only one known transcription factor (Hmga1). Furthermore, some well-known regulators of the myogenic and osteogenic pathways are missing. With respect to the osteogenic regulator Runx2 this is explained by the fact that the bone-specific transcript is not represented on the array (de Jong et al., 2004), whereas others did not show significant expression. Nevertheless, some missing elements are to be expected of a network that is generated ab initio and from such an extremely low number of arrays.

It seems, however, that the global interrelationships between genes and modules revealed by the network fit extremely well with the current knowledge of differentiation of mesenchymal cells. Differentiation of mesenchymal cells occurs with a concurrent decline in proliferative capacity (Aubin, 1998; Peng et al., 2004). Accordingly, the network shows negative feedback from the muscle and osteoblast differentiation modules to the proliferation module. In addition, it reveals negative feedback from the proliferation module to the osteoblast differentiation module. This is in line with experimental data that show a mutually exclusive relation between proliferation and terminal differentiation of mesenchymal cells (Ying et al., 2003; Marzia et al., 2000).

Within the discovered network, a central role is found for periostin and decorin, both extracellular genes involved in cell-matrix interactions (Horiuchi et al., 1999; Danielson et al., 1997). They initiate the forming of an extra-cellular matrix (Col1a1 and Timp3) and eventually initiate osteoblast maturation genes (Bglap1 and Ogn). Such a proposed stimulation of the differentiation process by matrix–cell interactions is in line with published data. Disruption of the decorin gene leads to abnormal collageneous matrix formation (Danielson et al., 1997). Under normal conditions, formation of the extracellular matrix drives the differentiation of osteoblasts. Disruption of the matrix leads to impaired differentiation and an osteopenic phenotype in vivo (Franceschi et al., 1994; Lai et al., 2001; Corsi et al., 2002). The interactions between matrix-associated genes and terminal differentiation markers predicted by our computer-learned network are in good agreement with these experimental data.

In conclusion, we presented LARNA, a regression algorithm to infer sparse genetic networks from microarray data sampled over time. Interacting gene pairs identified by LARNA are enriched for co-cited genes. In this respect, LARNA outperforms standard network inference methods and clustering algorithms. Importantly, two global hypothesies raised by LARNA conform to current biological knowledge, i.e. (1) osteoblast maturation is induced by extra cellular matrix formation and (2) proliferation and differentiation are two mutually exclusive modes of operation. The interactions revealed by LARNA on the current limited set of microarrays are of a global character and do not reflect all details. Preliminary in silico studies support the intuition that samples should be taken more frequent at moments when the dynamics of the gene expression signals increases (data not shown). The enrichment scores were not sensitive to changes in the interpolation interval, but the interval at which measurements are obtained as well as the number of different treatments will be much more important. Future studies with more arrays sampled at smaller intervals are likely to provide a network map with higher resolution. It would also prove interesting to compare the results of the different regression networks with different publicly available co-citation networks and other global co-expression networks.

Conflict of Interest: none declared.


    FOOTNOTES
 
{dagger}Current address: Nutrition, Metabolism and Genomics Group, Division of Human Nutrition, Wageningen University, Wageningen, The Netherlands Back

Associate Editor: Alex Bateman

Received on November 1, 2005; revised on November 16, 2005; accepted on December 1, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 Materials and methods
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Arkin, A., et al. (1997) A test case of correlation metric construction of a reaction pathway from measurements. Science, 277, 1275–1279[Abstract/Free Full Text].

    Aubin, J. (1998) Advances in the osteoblast lineage. Biochem. Cell Biol, . 76, 899–910[CrossRef][Web of Science][Medline].

    Blake, J., et al. (2003) Mgd: the mouse genome database. Nucleic Acids Res, . 31, 193–195[Abstract/Free Full Text].

    Chaussabel, D. and Sher, A. (2002) Mining microarray expression data by literature profiling. Genome Biol, . 3, R55.1–55.16.

    Churchill, G. (2002) Fundamentals of experimental design for cDNA microarrays. Nat. Genet, . 32, 490–495.

    Corsi, A., et al. (2002) Phenotypic effects of biglycan deficiency are linked to collagen fibril abnormalities, are synergized by decorin deficiency, and mimic ehlers-danlos-like changes in bone and other connective tissues. J. Bone Miner. Res, . 17, 1180–1189[CrossRef][Web of Science][Medline].

    Danielson, K., et al. (1997) Targeted disruption of decorin leads to abnormal collagen fibril morphology and skin fragility. J. Cell Biol, . 136, 729–743[Abstract/Free Full Text].

    de Jong, D.S., et al. (2004a) Regulation of notch signaling genes during bmp2-induced differentiation of osteoblast precursor cells. Biochem. Biophys. Res. Commun, . 320, 100–1007[CrossRef][Web of Science][Medline].

    de Jong, D.S., et al. (2004b) Identification of novel regulators associated with early-phase osteoblast differentiation. J. Bone Miner. Res, . 19, 947–958[CrossRef][Web of Science][Medline].

    D'Haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R. (1999) Linear modeling of mRNA expression levels during CNS development and injury. Proceedings of the Pacific Symposium on Biocomputing '99 , NJ, USA World Scientific Publishing Co. Vol. 4, , pp. 41–52.

    Featherstone, D. and Broadie, K. (2002) Wrestling with pleiotropy: genomic and topological analysis of the yeast gene expression network. Bioessays, 24, 267–274[CrossRef][Web of Science][Medline].

    Franceschi, R. (1999) The developmental control of osteoblast-specific gene expression: role of specific transcription factors and the extracellular matrix environment. Crit. Rev. Oral Biol. Med, . 10, (1) 40–57[Abstract/Free Full Text].

    Franceschi, R. (1994) Effects of ascorbic acid on collagen matrix formation and osteoblast differentiation in murine mc3t3-el cells. J. Bone Miner. Res, . 9, 843–854[Web of Science][Medline].

    Gardner, T., et al. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, (5629) 102–105[Abstract/Free Full Text].

    Grandvalet, Y. (1998) Least absolute shrinkage is equivalent to quadratic penalization. International Conference on Artificial Neural Networks (ICANN'98) , Sweden Skövde.

    Guelzim, N., et al. (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics, 31, , pp. 60–63[CrossRef][Web of Science][Medline].

    Guthke, R., et al. (2005) Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics, 21, 1626–1634[Abstract/Free Full Text].

    Hartemink, A.J., et al. (2001) Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing 2001 , Hawai World Scientific Publishing Co. Vol. 6, , pp. 422–433.

    Hashimoto, R.F., et al. (2004) Growing genetic regulatory networks from seed genes. Bioinformatics, 20, 1241–1247[Abstract/Free Full Text].

    Horiuchi, K., et al. (1999) Identification and characterization of a novel protein, periostin, with restricted expression to periosteum and periodontal ligament and increased expression by transforming growth factor beta. J. Bone and Miner. Res, . 14, 1239–1249[CrossRef][Web of Science][Medline].

    Jenssen, T., et al. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28, 21–28[CrossRef][Web of Science][Medline].

    Lai, C., et al. (2001) Erk is essential for growth, differentiation, integrin expression, and cell function in human osteoblastic cells. J. Bio. Chem, . 276, 14443–14450[Abstract/Free Full Text].

    Lee, T.I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804[Abstract/Free Full Text].

    Liao, J.C., et al. (2003) Network component analysis: Reconstruction of regulatory signals in biological systems. PNAS, 100, 15522–15527[Abstract/Free Full Text].

    Marzia, M., et al. (2000) Decreased c-Src expression enhances osteoblast differentiation and bone formation. J. Cell Biol, . 151, 311–320[Abstract/Free Full Text].

    Pe'er, D., et al. (2001) Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, S215–S224[Abstract].

    Peng, Y. (2004) Inhibitor of DNA binding/differentiation helix–loop–helix proteins mediate bone morphogenetic protein-induced osteoblast differentiation of mesenchymal stem cells. J. Biol. Chem, . 279, 32941–32949[Abstract/Free Full Text].

    Rung, J., et al. (2002) building and analysing genome-wide gene disruption networks. Bioinformatics, 18, Suppl. 2, S202–S210[Abstract].

    Segal, E., et al. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet, . 34, 166–176[CrossRef][Web of Science][Medline].

    Tavazoie, S., et al. (1999) Systematic determination of genetic network architecture. Nat. Genet, . 22, 281–285[CrossRef][Web of Science][Medline].

    Tibshirani, R. (1996) Regression selection and shrinkage via the lasso. J. Royal Statist. Soc. B, . 58, 267–288.

    Vaes, B., et al. (2002) Comprehensive microarray analysis of bone morphogenetic protein 2-induced osteoblast differentiation resulting in the identification of novel markers for bone development. J. Bone Miner. Res, . 17, 2106–2118[CrossRef][Medline].

    van Berlo, R., et al. (2003) Studying the conditions for learning dynamic Bayesian networks to discover genetic regulatory networks. Simulation, 79, 689–702[Abstract].

    van Someren, E., Wessels, L., Reinders, M., Backer, E. (2001) Searching for limited connectivity in genetic network models. Proceedings of the Second International Conference on Systems BiologyPasadena, California , pp. 222–230.

    van Someren, E., Wessels, L., Reinders, M., Backer, E. (2002) Regularization and noise injection for improving genetic network models. Computational and Statistical Approaches to Genomics, , NJ, USA World Scientific Publishing Co., pp. 211–226.

    van Someren, E., et al. (2003) Multi-criterion optimization for genetic network modeling. Signal Process, . 83, 763–775[CrossRef].

    Wahde, M. and Hertz, J. (2001) Modeling genetic regulatory dynamics in neural development. J. Comput. Biol, . 8, 429–442[CrossRef][Web of Science][Medline].

    Weaver, D., Workman, C., Stormo, G. (1999) Modeling regulatory networks with weight matrices. Proceedings of the Pacific Symposium on Biocomputing '99Hawai , NJ, USA World Scientific Publishing Co. Vol. 4, , pp. 112–123.

    Ying, Q., et al. (2003) Bmp induction of Id proteins suppresses differentiation and sustains embryonic stem cell self-renewal in collaboration with STAT3. Cell, 115, 281–292[CrossRef][Web of Science][Medline].

    Yue, H., et al. (2001) An evaluation of the performance of cDNA microarrays for detecting changes in global mRNA expression. Nucleic Acids Res, . 29, E41–1.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Phil Trans R Soc AHome page
F. Ortega, K. Sameith, N. Turan, R. Compton, V. Trevino, M. Vannucci, and F. Falciani
Models and computational strategies linking physiological response to molecular networks from large-scale data
Phil Trans R Soc A, September 13, 2008; 366(1878): 3067 - 3089.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
O. Hirose, R. Yoshida, S. Imoto, R. Yamaguchi, T. Higuchi, D. S. Charnock-Jones, C. Print, and S. Miyano
Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models
Bioinformatics, April 1, 2008; 24(7): 932 - 942.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. J. Reiss, M. T. Facciotti, and N. S. Baliga
Model-based deconvolution of genome-wide DNA binding
Bioinformatics, February 1, 2008; 24(3): 396 - 403.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/4/477    most recent
bti816v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by van Someren, E. P.
Right arrow Articles by Reinders, M. J. T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by van Someren, E. P.
Right arrow Articles by Reinders, M. J. T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?