Skip Navigation


Bioinformatics Advance Access originally published online on June 6, 2007
Bioinformatics 2007 23(16):2096-2103; doi:10.1093/bioinformatics/btm309
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/16/2096    most recent
btm309v3
btm309v2
btm309v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Elo, L. L.
Right arrow Articles by Aittokallio, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Elo, L. L.
Right arrow Articles by Aittokallio, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process

Laura L. Elo 1,2,*, Henna Järvenpää 2, Matej Oresic 2,3, Riitta Lahesmaa 2 and Tero Aittokallio 1,2,4

1Department of Mathematics, FI-20014 University of Turku, 2Turku Centre for Biotechnology, PO Box 123, FI-20521 Turku, 3VTT Biotechnology, PO Box 1500, FI-02044 Espoo, Finland and 4Systems Biology Unit, Institut Pasteur, FR-75724 Paris, France

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Coexpression networks have recently emerged as a novel holistic approach to microarray data analysis and interpretation. Choosing an appropriate cutoff threshold, above which a gene–gene interaction is considered as relevant, is a critical task in most network-centric applications, especially when two or more networks are being compared.

Results: We demonstrate that the performance of traditional approaches, which are based on a pre-defined cutoff or significance level, can vary drastically depending on the type of data and application. Therefore, we introduce a systematic procedure for estimating a cutoff threshold of coexpression networks directly from their topological properties. Both synthetic and real datasets show clear benefits of our data-driven approach under various practical circumstances. In particular, the procedure provides a robust estimate of individual degree distributions, even from multiple microarray studies performed with different array platforms or experimental designs, which can be used to discriminate the corresponding phenotypes. Application to human T helper cell differentiation process provides useful insights into the components and interactions controlling this process, many of which would have remained unidentified on the basis of expression change alone. Moreover, several human–mouse orthologs showed conserved topological changes in both systems, suggesting their potential importance in the differentiation process.

Contact: laliel{at}utu.fi

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Network-based representation and analysis of data from high-throughput experimental technologies is increasingly being used to both visualize and identify the components and their interactions involved in a given cellular system. In particular, construction of coexpression networks from gene expression microarray datasets has recently become a popular alternative to the conventional analytic approaches, such as the detection of differential expression using statistical testing or the coexpression analysis using unsupervised clustering. Representing dependencies in the dataset as interaction networks allows the researcher to explore the whole spectrum of pairwise relationships among the genes as opposed to flat lists of genes from statistical tests or distinct groups of genes from clustering tools.

Large-scale gene coexpression networks have been used, e.g. to demonstrate that functionally related genes are frequently coexpressed across multiple datasets and across different organisms (Lee et al., 2004; Stuart et al., 2003), or to estimate the structure of the underlying regulatory relationships under various experimental conditions (Basso et al., 2005; Voy et al., 2006). By constructing separate coexpression networks for different conditions, such as normal and cancerous states, it is possible to identify disease-mediated changes in the network connectivity patterns (Choi et al., 2005; Liu et al., 2006). Moreover, investigating the association between the degree of differential expression and differential connectivity (DC) between the two conditions provides potential target components associated with the disease states (Carter et al., 2004; Reverter et al., 2006).

In the construction of a coexpression network, an important decision the researcher must make concerns the cutoff threshold above which a connection is considered relevant in the particular experiment. Although the results of the network-based discoveries can strongly depend on this choice, there is currently no systematic way to choose a good cutoff level for a given dataset. Previous works have typically relied on data-independent constant thresholds (Reverter et al., 2006; Sanoudou et al., 2003; Zhou et al., 2002), or statistical procedures that control for false positive error rates only (Carter et al., 2004; Voy et al., 2006). However, such criteria fail to take into consideration the false negative rates, which may lead to filtering out biologically significant connections, nor give they any guidance on how to define the cutoff or significance level in a particular research setting.

In this work, we present a systematic procedure for inferring a cutoff threshold of coexpression networks directly from their topological properties. The objective is to automatically select a threshold that preserves as many valid coexpression links as possible, while simultaneously controlling the number of false detections. The procedure is based on comparing the observed clustering coefficient and its randomized counterpart as the number of connections is gradually decreased. The internal validity of the network construction process is investigated in terms of its robustness to reduced experiment size, whereas the external validation of the interactions considers the degree of their functional relatedness as assessed with Gene Ontology (GO) annotations or protein–protein interactions (PPIs). We also show that the coexpression networks constructed with the proposed approach from multiple microarray experiments can reliably discriminate between two polarization states in the human T helper cell differentiation process as well as identify genes and interactions involved in this process.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Gene coexpression networks can be represented as graphs, where each node corresponds to a gene and a pair of nodes is connected with an undirected edge (link) if their pairwise expression similarity is above a particular threshold value. To demonstrate the operation of the procedure, we measured the similarities with the widely used Pearson correlation, denoted by r, although the general approach could in principle be applied to any preferred measure of similarity.

2.1 Traditional approaches for threshold selection
A standard way to construct a coexpression network is to apply a pre-defined cutoff value to either the correlation itself or its statistical significance. A wide range of rather ad hoc constant correlation cutoffs (usually between 0.6 and 0.99) have been applied in various microarray studies (Reverter et al., 2006; Sanoudou et al., 2003; Zhou et al., 2002). We evaluated here the general applicability of the constant thresholds 0.6, 0.8 and 0.99 (denoted by C0.6, C0.8 and C0.99, respectively) across various types of datasets. Statistical significance of correlation was assessed using the result that the statistic Formula has a Student's t-distribution with df = n–2 under the null hypothesis of no correlation (Bernstein and Bernstein, 1999), where n is the number of measurements (arrays) in the gene-wise expression profiles. To validate this assumption, we also estimated the null distribution using permutation. Since the permuted and analytical P-values agreed well, as has been reported also earlier (Lee et al., 2004), we considered here only the analytical P-values. The P-values were corrected for multiple testing by applying the Bonferroni correction for the number of genes (Lee et al., 2004; Voy et al., 2006). The commonly applied adjusted P-values of 0.01, 0.05 or 0.1 were evaluated (denoted by B0.01, B0.05 and B0.1, respectively).

2.2 Clustering coefficient-based threshold selection
The clustering coefficient of a node (gene) in a network is defined as the probability that its first neighbors are connected to each other (Watts and Strogatz, 1998). If Ei denotes the number of edges between the ki (> 1) first neighbors of a gene i, then the clustering coefficient of the gene is


Formula 1

(1)
The coefficient Ci obtains its maximum value 1, if all the neighbors are connected to each other. The clustering coefficient of a network is defined as the average over its gene-wise values


Formula 2

(2)
where K is the number of genes with ki > 1. If K = 0, we define C = 0.

The construction of a coexpression network can be thought of as a process, where links are removed from the initially complete graph by gradually increasing the correlation threshold. We conjectured that the removed links are likely to be noise as long as the difference between the observed clustering coefficient and its randomized counterpart increases monotonically. More specifically, we formulated this as a discrete optimization problem, where the critical cutoff threshold C* was determined by finding the minimum


Formula 3

(3)
over a set of thresholds r0 < r1 < ... < rJ–1 < rJ. Here, C(r) denotes the clustering coefficient of the coexpression network at a correlation threshold r, C0(r) is its randomized counterpart, and in the present work, we defined r0 = 0, rJ = 1 and rj+1 = rj + 0.01. This procedure resembles finding the first local maximum of the observed difference curve C C0 in the continuous case. To reduce the possible effect of numerical inaccuracies on the overall pattern of the differences in the discrete case, a median filtering procedure was applied to the curve CC0 prior to the threshold selection; see, e.g. Allen and Mills (2004).

A random reference network was constructed separately at each threshold value using the so-called configuration model, which preserves the prescribed degree distribution of the original network (Newman, 2003). If we denote by P(k) the probability that a node chosen uniformly at random from the network has degree k and by N the total number of nodes, then the clustering coefficient of the corresponding random network can be defined analytically. Briefly, the probability that two neighbors of degrees u and v of a same node are also neighbors themselves is given by Formula . On the other hand, the probability that a node is connected to a node of degree u can be written as Formula . By independence under the random model, the probability that a node is simultaneously connected to two nodes of degrees u and v becomes then pupv. Finally, the expected value of the clustering coefficient under this random network model can be calculated as


Formula 4

(4)
where Formula , and Formula .

Recently, Gupta et al. (2006) also described a procedure for selecting the correlation threshold based on the clustering coefficient. Their approach relies on the observation that at reasonably high correlation levels, clustering coefficient is expected to increase if highly connected subnetworks emerge. Accordingly, they suggested to select the highest similarity value R*, above which a sharp increase in C occurs. There are several major differences between their approach and ours: (1) Gupta et al. (2006) used the observed C curve only, whereas we account also for the random variation in the dataset using the expected C0 under the random network model; (2) the goal of Gupta et al. (2006) was to find a threshold that produces highly cliquish disconnected subnetworks, while we aim at finding a correlation threshold that would retain as many biologically relevant links as possible without producing an extensive number of false detections; (3) the approach of Gupta et al. (2006) can become unstable and even fail in many real microarray datasets that do not behave as they suggested, while our approach does not have any unrealistic assumptions on the properties of the data but can be applied to any given dataset and (4) unlike the visual inspection of Gupta et al. (2006), our procedure is fully automated. Visual inspection of the clustering coefficient curves is labor-intensive, error-prone and subjective, and becomes practically unfeasible if a large number of networks need to be constructed.

2.3 Datasets
2.3.1 Simulated data
Synthetic data were generated using the stochastic model of Thalamuthu et al. (2006). Briefly, initial templates for c clusters of genes were created. Sample and gene variability was then added from a hierarchical log-normal model, and a number of randomly simulated genes were included so that the total number of genes became 1000. Finally, random errors were generated to the expression signals x such that the noise-added signal became xs = x+{alpha}, where {alpha} ~ N(0, s2) and s is the noise level. A total of 60 datasets were analyzed with n = 25,50 samples, c = 15,20 gene templates and noise levels s = 0, 0.5, 1 (5 datasets of each type).

2.3.2 Real microarray data
The real microarray data consisted of five experiments measuring T helper cell differentiation in human (three experiments) or mouse (two experiments). In these experiments, CD4+ cells were isolated from human cord blood or spleens of Balc/cJ mice. Activated cells were polarized towards Th2 or Th1 subsets using IL-4 or IL-12, respectively, or cultured in ‘neutral conditions’ without cytokines (referred to as Th0). Samples were collected after various times of polarization and hybridized to Affymetrix HG-U133Plus2.0 (H-U133P data), Affymetrix HG-U133A (H-U133A data), Illumina Human-6 BeadChip (H-I6 data) or Affymetrix MG-U74Av2 arrays (M-U74 data). The H-U133P data contained measurements from Th0 and Th2 samples at 0, 0.5, 1, 2, 4, 6, 12, 24, 48 and 72 h, and from Th1 samples at 12, 24, 48 or 72 h. In the other datasets, Th0 and Th2 samples at 0, 2, 6 and 48 h were measured. The human datasets were comprised of 2 or 3 biological repeats of the temporal series, while both mouse datasets contained single series. Details of these experiments are given elsewhere (Chen et al., 2003; Elo et al., 2006; Lund et al., 2003, 2004).

In the coexpression analyses, logarithmic expression signals were used. The Affymetrix signals were derived using the RMA procedure (Irizarry et al., 2003) in the Bioconductor ‘affy’ package (www.bioconductor.org). The Illumina data were normalized with the quantile method of the same software. Data from multiple probesets/probes matching the same GeneID were averaged. To reduce the number of genes in the networks, only differentially expressed genes between Th0 and Th2 samples at one or more time points in the H-U133P data were included (modified t-test; Smyth, 2004). To combine the datasets, matched probeset/probe pairs between the platforms were defined. The pairs between the human Affymetrix arrays were defined using the so-called ‘bestmatch’ tables (www.affymetrix.com) and the pairs between the Affymetrix and Illumina arrays using the sequence alignments from Barnes et al. (2005). The human–mouse orthologs were taken from the tables provided by Affymetrix (www.affymetrix.com).

2.4 Evaluation criteria
2.4.1 Performance in simulated data
In the simulated data, the truth of the coexpression links was known. In these datasets, the inferred networks at various correlation thresholds were evaluated by determining their overall error level, measured by Formula , where FP and FN are the proportions of false positive and false negative links among the total number of possible links. This error function was chosen because under synthetic conditions it is difficult to weight the trade-off between filtering out true connections and retaining irrelevant false artifacts. For each dataset, the best possible correlation threshold was recorded and the observed errors obtained with the different approaches were compared to these minimum error levels. Although in the simulated data such optimal choice could be determined, in real research setting it is typically unknown. To demonstrate that the various performances do not arise merely from the selected evaluation criterion, we also provide some example ROC-curves in Supplementary Figure 1.

2.4.2 Internal validation
In the real datasets, the internal validity of the different threshold procedures was investigated by assessing the reproducibility of the reconstructed networks between biological replicates as the profile length was varied from 4 to 10. This gives indications of the robustness of the network construction, as biologically relevant gene pairs should be present across the replicates. In these analyses, the similarity between two networks was defined as the Pearson correlation between their degree vectors Formula . The replicate-specific Th0 and Th2 profiles from the H-U133P data were used (n = 10, N = 897). Five random subsets of arrays of each profile length were sampled.

2.4.3 Biological relevance of links
Biological relevance of the identified links was evaluated in terms of their functional similarity in the biological process GO (Ashburner et al., 2000). The functional similarity between a pair of genes was defined using the Resnik's semantic similarity measure (Resnik, 1999). If –log p(t) and S(t) are the information content and the parent set of a GO term t, respectively, and A(i) is the set of terms annotating gene i, then the semantic similarity between genes i1 and i2 was defined as


Formula 5

(5)
The average GO similarity of gene pairs at various correlation ranges was calculated. The observed values were then compared to the values obtained by randomly sampling 1000 subsets of the same size to assess their significance. Similarly, publicly available information on human PPI was utilized (www.cytoscape.org/cgi-bin/moin.cgi/Data_Sets) to evaluate the proportion of reported PPIs at various correlation ranges. In these comparisons, the longest available profiles were used, which were obtained from the H-U133P data over all available treatments and time points (n = 23, N = 897). The values across biological replicates within a time point and treatment were averaged before analysis. Out of the analyzed differentially expressed genes (FDR < 0.05), 524 (58.4%) were annotated to at least one GO term.

2.5 Topology-based discrimination
An ultimate goal of the present work was to evaluate the utility of gene coexpression networks in phenotype distinction. When multiple measurements exist on an individual, a separate network for the particular individual can be constructed. We characterized the network of an individual j by its standardized degree vector zj = (zj1, ..., zjN) across the genes. If we denote by µj the average degree of the network, by {sigma}j its SD and by kji the degree of a gene i, then the standardized degree of the gene was defined as zji = (kji–µj)/{sigma}j. The degree vectors were projected into two dimensions using the principal component analysis (PCA), and the ability of the principal components to capture phenotype-specific characteristics was investigated. In the present work, such analysis was carried out using the nine replicated Th0 and eight replicated Th2 time series from the three human datasets H-U133P, H-U133A and H-I6 and a set of N = 675 shared genes that were selected based on differential expression (FDR < 0.1).

2.6 Differential connectivity
To identify key components involved in Th2 differentiation at the gene-level, differences between the inferred Th0 and Th2 networks were investigated. DC of a gene between the two groups was defined as the average difference in its standardized degrees: Formula . Here, nl is the number of networks in the group Gl, l = 1, 2. The degree of evidence for DC was assessed with the Wilcoxon rank sum statistic. Finally, the observed d-values were compared to the corresponding signal log-ratios between the Th0 and Th2 samples and their conservation in the corresponding mouse networks was inspected.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In the present study, we constructed gene coexpression networks from 60 simulated and 208 real datasets with different subsets of samples and/or genes (see Supplementary Table 1). The number of identified links varied greatly between the different threshold selection approaches and was strongly dependent on the sample size, especially with the constant cutoffs. The data-driven threshold C* produced typically considerably more links than the statistical thresholds, the relevance of which will be demonstrated in the following sections. In particular, we show that the performance of the traditional approaches based on a pre-defined cutoff or significance level varied drastically depending on the data and application, while the data-driven threshold was systematically among the best choices.

3.1 Behavior of the clustering coefficient
Figure 1 illustrates the behavior of the clustering coefficient C in two real datasets H-U133P n = 23 or n = 10. In the larger data, the observed C, indeed, showed a sharp increase at R* = 0.91, similarly as was suggested by Gupta et al. (2006). However, in the smaller data, no such sharp increase was observed at any correlation threshold, making the proposed R* inapplicable in the particular data. Importantly, the latter behavior of C was observed far more often than the former throughout the datasets investigated in the present work. This severely impaired the use of R* as a general threshold selection procedure. Unlike R*, the novel data-driven threshold C* could be determined in all the datasets. Moreover, the optimal threshold C* was never obtained at the limits of the investigated correlation space (r = 0 or r = 1).


Figure 1
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Behavior of the clustering coefficient at different correlation thresholds in two real datasets H-U133P n = 10 (black) and n = 23 (gray). In the background graph, the solid lines correspond to the observed clustering coefficients, whereas the dashed lines show the theoretical null references. In the inset, the corresponding difference curves CC0 are shown. In the larger data, the clustering coefficient tended to increase at the high correlation thresholds. However, no such clear behavior could be seen in the smaller data. The gray arrow indicates the threshold R* suggested by Gupta et al. (2006), while the black arrows indicate the data-driven threshold C* in the two datasets. Notably, the threshold R* can be defined only in the larger dataset.

 
3.2 Performance in simulated data
Figure 2 summarizes the overall error levels of the different threshold selection procedures in the simulated data. Markedly, the data-driven threshold C* produced systematically the lowest error levels under the various simulated conditions (paired Wilcoxon test, P < 0.01). The constant cutoff C0.6 performed, in general, similarly as the three statistical thresholds, whereas the higher constant cutoffs produced significantly more errors (P < 0.01). In particular, none of the traditional threshold selection procedures performed well across all the simulated datasets but their performance was strongly dependent on the properties of the data, as exemplified in Supplementary Figure 1. A more detailed analysis of the errors revealed that the number of samples had a considerable impact on the statistical thresholds, while increased level of noise impaired the performance of each threshold value (data not shown).


Figure 2
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. A boxplot representation of the differences between the observed and minimum errors in the networks constructed from simulated datasets. The performance of the data-driven threshold C* was compared to three statistical (B0.01, B0.05, B0.1) and three constant (C0.6, C0.8, C0.99) thresholds under various simulated conditions. For explanation of the procedure names, see Section 2.1. The boxes show the median (thick line) and the interquartile range (IQR) of the observed differences, the whiskers indicate the range of the differences, and the points correspond to extreme observations with values greater than 1.5 times the IQR. The procedures were arranged according to the increasing median error level.

 
3.3 Coexpression networks using short profiles
The results from the internal validation in the real datasets are shown in Figure 3. As expected, the reproducibility across the replicate networks decreased with decreasing sample size. Notably, however, the highest constant cutoff C0.99 showed its best performance with the smallest sample size, highlighting further the challenge of using a constant cutoff without considering the data-specific properties. In particular, short profiles tend to produce relatively high correlations even by chance alone, and hence, a higher correlation threshold is appropriate in this situation. With longer profiles, instead, C0.99 is such a stringent criterion that a considerable number of relevant links is likely to be ignored. At all profile lengths, the data-driven threshold C* was again among the best choices. In each case, the constant threshold C0.8 yielded similar reproducibility values as C* (paired Wilcoxon test, P > 0.05), whereas the constant cutoff C0.6 gave significantly lower reproducibility (P < 0.01) with sample sizes over 5. The statistical thresholds behaved systematically the worst (P < 0.01), showing practically no reproducibility when the number of samples was reduced to 4.


Figure 3
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Internal validity of the different threshold procedures in the real microarray data H-U133P (n = 10). Reproducibility of the procedures across independent biological replicates was studied as the sample size was varied from 4 to 10 (x-axis). The reproducibility was measured using the Pearson correlation between the degree distributions of the networks. The curves present the average values over 30 networks at each sample size. The error bars were omitted for the clarity of illustration. In general, the SDs between different subsets of samples were rather large, indicating that some of the samples can capture the coexpressions better than others. For a fair comparison, the same subsets of samples were used for each threshold procedure.

 
3.4 Biological relevance of coexpression
Among the genes investigated in the present study, gene pairs with correlations over 0.5 showed, on average, higher GO similarities than expected by chance (P < 0.05) (Fig. 4). Moreover, the pairs with correlations over 0.6, were also more likely to interact with each other according to the available human PPI data (Supplementary Fig. 2). Thus, both of these external data sources were consistent with the selected threshold C* = 0.72 = B0.1. The two other statistical thresholds (B0.01 = 0.78, B0.05 = 0.74) and the constant threshold C0.8 were somewhat higher and, consequently, may exclude a number of biologically relevant links. The other clustering coefficient-based threshold R* = 0.91 and the constant threshold C0.99 give such stringent cutoffs that a large number of potentially interesting links is likely to be ignored, as suggested by the observed GO similarities and PPIs.


Figure 4
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. External validity of the different threshold procedures in the real microarray data H-U133P (n = 23). The average semantic similarity in the biological process GO at various ranges of expression correlations is illustrated. The gray points and bars represent the random averages and their SDs, while the observed values are shown in black. The filled black circles indicate the ranges where the observed values were significantly different from random pairs as assessed with permutation testing (P < 0.05). The different thresholds are illustrated by the arrows.

 
3.5 Topology-based discrimination
The PCA visualization of the nine Th0 and eight Th2 networks in human were studied with the aim of distinguishing the two phenotypes known to have different functional roles. The networks constructed with the threshold C* revealed a clear separation between the sample groups across all the experiments (Fig. 5A). In particular, the separation was determined by the second principal component alone, and without using any information on the origin of the samples being compared. With the constant cutoffs, the separation became less evident, although a similar trend was observed. The PCA projection of the best-performing constant cutoff C0.99 is shown in Figure 5B. With the three statistical thresholds, the separation was further obscured and no evidence of general Th0- and Th2-specific characteristics could be seen (Fig. 5C). Instead, the technical differences between the datasets became prominent.


Figure 5
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. PCA projections of the Th0 and Th2 networks constructed from the human H-U133P, H-U133A and H-I6 datasets. A total of nine Th0 networks and eight Th2 networks were considered, consisting of N = 675 shared genes. Each network was represented by its standardized degree vector across the genes, which were then visualized in the space spanned by their first two principal components. The results for the data-driven threshold C*, and the best (C0.99) and worst (B0.01) pre-defined thresholds are shown. With the threshold C*, the PCA visualization revealed a clear separation between the Th0 and Th2 networks, determined by the second principal component only (illustrated with the dashed line).

 
3.6 Differential connectivity and biological implications
A total of 104 genes were identified as differentially connected (Wilcoxon test, P < 0.05) between the Th0 and Th2 networks constructed using the data-driven threshold C* (Fig. 6). Enriched GO terms identified among these genes were well in agreement with the current biological knowledge of Th2 differentiation process (see Supplementary Table 2). In particular, 27% of the genes were annotated to immune response (hypergeometric test, P < 0.001), supported by the central role of Th2 cells in the pathogenesis of allergic and inflammatory diseases (Mowen and Glimcher, 2004).


Figure 6
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Comparison between DC (degree difference) and differential expression (signal log-ratio) across the three human datasets at 48 h. The black points correspond to those 91 genes that showed significant DC between the Th0 and Th2 networks in human. The orange triangles denote those additional 13 genes that were significant not only in human but showed also similar connection changes in the orthologous mouse networks (two biological replicates). Several known players of cytokine signaling and T helper cell differentiation are highlighted, including IL4R, GATA3, BCL6, IFNG, IL12RB2, TXK, GADD45B and CTLA4.

 
Among the differentially connected genes, several known players of cytokine signaling and T helper cell differentiation stood out, including IL4R, GATA3, BCL6, IFNG, IL12RB2, TXK, GADD45B and CTLA4 (Mowen and Glimcher, 2004 and Berenson et al., 2004). Cytokine IL-4 is the key driving force of the Th2 polarization of the CD4+ T cells and it has been shown to induce the transcription of an important transcriptional regulator GATA3. In addition, several other transcription factors were detected that have previously been associated with IL-4 induction in the human and mouse systems, such as BCL6. Interestingly, in contrast to Th2 markers, several genes previously seen to be upregulated during Th1 cell differentiation showed significant change in their connectivity between Th0 and Th2 networks. These genes included the Th1 cell cytokine IFNG and the IL-12 receptor IL12RB2. Also, a transcriptional regulator driving the expression of IFNG gene, TXK (Takeba et al., 2002), and a signaling molecule of Th1 cells, GADD45B (Yang et al., 2001), were identified. The CTLA4 gene, on the other hand, has an important role in the negative regulation of T cell activation and other immune responses.

In addition to the already established differentiation markers, transcripts of several novel polarization-induced genes emerged, many of which would have remained unidentified based on the statistical testing of differential expression only. Especially, the emergence of the transcription factors as being among the most differentially connected genes provides an interesting aspect on further functional studies in this system.

For some of the novel findings from the analysis of DC, previous reports exist on their effects on the murine immune system, providing further clues on their similar roles in the human system. In particular, 13 of the human DC genes showed consistent topological changes also in the corresponding mouse networks analyzed in the present study (Fig. 6). These included some known differentiation markers, such as IL12RB2. Therefore, it will be highly interesting to study the possible role of the novel identifications in the cytokine-induced signaling events in the human Th cell differentiation process.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have demonstrated that the cutoff threshold may have a drastic influence on the construction and interpretation of coexpression networks, and none of the traditional approaches based on pre-defined cutoff or significance levels performed satisfactorily across all the datasets and applications. Therefore, we proposed a systematic procedure based on clustering coefficient to select the threshold directly from the data under analysis. In all the datasets and conditions tested, our procedure was consistently among the best choices, and never significantly worse than the best performing pre-defined threshold. When taking all the different comparison setups into consideration, the data-driven procedure clearly outperformed all the other threshold selection approaches as a general procedure for any given dataset. While the cutoff threshold selection has a large impact when analyzing the topology of individual networks alone (see Supplementary Fig. 3), its importance is emphasized when the topology of two or more coexpression networks is compared. A key finding of the present work was that, provided that the individual cutoff thresholds are appropriately selected, the network-centered approach enables a robust discrimination between the corresponding phenotypes from microarray data, irrespective of the study-specific characteristics such as the array platform used. This was possible even with a single network feature, degree distribution, and using a very coarse-grained visualization tool, PCA, which does not take into account any class information. We believe that our procedure gives a good starting point for more sophisticated network topology-based classification approaches.

While the present work demonstrated the good performance of the procedure with correlation networks, effectively similar to the relevance networks (Butte et al., 2000), the proposed approach can easily be supplemented by other network construction methods, such as graphical Gaussian models or dynamic Bayesian networks. For instance, the initial network can be pruned with more complex measures of connection, including conditional mutual information (Basso et al., 2005) or partial correlation coefficient (Magwene and Kim, 2004), to filter out indirect connections. After discovering interesting subnetworks from the pruned network, e.g. neighborhoods of differentially connected transcription factors, more heavily parametrized models such as Bayesian networks may be applied to provide quantitative simulations on these subsystems. The feasibility of such a top–down approach is supported by the recent comparative studies, demonstrating that relevance networks can surprisingly well predict the regulatory interactions on large scale (Faith et al., 2007), while the use of Bayesian networks may be justified on a smaller scale (Werhli et al., 2006).

To investigate the performance of the data-driven threshold selection with other measures of gene–gene relations, we also repeated the most interesting application of this work, the PCA analysis, using Euclidean distance, which is another widely used measure in the context of microarray data analysis. The PCA analysis suggested that also the Euclidean distance can separate the Th0 and Th2 networks based on connectivity if the thresholds were chosen appropriately, although in this case the third principal component was required (Supplementary Fig. 4). A more comprehensive evaluation of different similarity or distance functions would be needed to test the generalizability of the present results. However, already the Pearson correlation gave good results for our purpose of subsequent experimental testing. At this stage, only positive correlations were considered, instead of absolute correlations, when constructing the networks. This choice was motivated by the previous observation that positive co-regulation of gene expression may be favored in eukaryotes (Lee et al., 2004). It should be noted, however, that the choice of the similarity/distance measure is always dependent on the application at hand, and any prior knowledge of the performance of the various alternatives would be of high practical value.

We have also tested several alternative ways to utilize the clustering coefficient in the threshold selection. In particular, we considered various global optimization problems, including the maximization of the difference CC0, the normalized difference (CC0)/(1– C0), or the standardized difference (CC0)/sC with Formula . According to our results, however, the approach presented in this work produced, in general, the best performance under a wide range of practical setups tested (data not shown). An interesting topic for a future work would be to investigate the effect of various null models on the threshold selection. Although the configuration model performed well in the present application, this type of randomization may not give an optimal null model for a correlation network, as the transitivity of correlation is inevitably reflected in the network topology also in the random case. A nice advantage of the configuration model is that it leads to a convenient analytical formula that greatly simplifies the calculations.

The biological motivation behind developing the systematic procedure for choosing the cutoff threshold for individual networks originated from our ongoing study on elucidating the molecular mechanisms underlying the human T helper cell differentiation process. This is crucial for understanding the pathogenesis of many human immune-mediated and inflammatory diseases, such as asthma, allergy and other atopic diseases. In the absence of large scale information on the regulatory interactions in this particular system, genome-wide transcriptomics data on the early development of the differentiation process is being used to estimate the underlying regulatory networks corresponding to the functionally distinct Th0 and Th2 states. The novel genes predicted to be involved in the Th2 differentiation process are potential targets for developing therapeutics to modulate human Th cell responses. The novel target genes of transcription factors can further be investigated by computational means using sequence analysis of their promoter sequences. These computational predictions can be validated by means of siRNA experiments. Besides these directions, we are currently investigating efficient ways how to incorporate PPI data into the network construction process.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The work was supported by the Academy of Finland (grant 203632) and the Graduate School in Computational Biology, Bioinformatics, and Biometry (ComBi).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: David Rocke

Received on April 5, 2007; revised on May 21, 2007; accepted on June 3, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Allen RL, Mills DW. Signal Analysis: Time, Frequency, Scale, and Structure, ( (2004) ) NJ, USA: Wiley-Interscience..

    Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat.Genet, ( (2000) ) 25, : 25–29.[CrossRef][ISI][Medline].

    Barnes M, et al. Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res, ( (2005) ) 33, : 5914–5923.[Abstract/Free Full Text].

    Basso K, et al. Reverse engineering of regulatory networks in human B cells. Nat. Genet, ( (2005) ) 37, : 382–390.[CrossRef][ISI][Medline].

    Berenson LS, et al. Issues in T-helper 1 development – resolved and unresolved. Immunol. Rev, ( (2004) ) 202, : 157–174.[CrossRef][ISI][Medline].

    Bernstein S, Bernstein R. Schaum's Outline of Theory and Problems of Elements of Statistics II, ( (1999) ) USA: McGraw-Hill..

    Butte AJ, et al. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl Acad. Sci. USA, ( (2000) ) 97, : 12182–12186.[Abstract/Free Full Text].

    Carter SL, et al. Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, ( (2004) ) 20, : 2242–2250.[Abstract/Free Full Text].

    Chen Z, et al. Identification of novel IL-4/Stat6-regulated genes in T lymphocytes. J. Immunol, ( (2003) ) 171, : 3627–3635.[Abstract/Free Full Text].

    Choi JK, et al. Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics, ( (2005) ) 21, : 4348–4355.[Abstract/Free Full Text].

    Elo LL, et al. Improving identification of differentially expressed genes by integrative analysis of Affymetrix and Illumina arrays. OMICS, ( (2006) ) 10, : 369–380.[CrossRef][ISI][Medline].

    Faith JJ, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol, ( (2007) ) 5, : e8.[CrossRef][Medline].

    Gupta A, et al. Elucidation of directionality for co-expressed genes: predicting intra-operon termination sites. Bioinformatics, ( (2006) ) 22, : 209–214.[Abstract/Free Full Text].

    Irizarry RA, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, ( (2003) ) 4, : 249–264.[Abstract].

    Lee HK, et al. Coexpression analysis of human genes across many microarray data sets. Genome Res, ( (2004) ) 14, : 1085–1094.[Abstract/Free Full Text].

    Liu CC, et al. Topology-based cancer classification and related pathway mining using microarray data. Nucleic Acids Res, ( (2006) ) 34, : 4069–4080.[Abstract/Free Full Text].

    Lund R, et al. Identification of novel genes regulated by IL-12, IL-4, or TGF-beta during the early polarization of CD4+ lymphocytes. J. Immunol, ( (2003) ) 171, : 5328–5336.[Abstract/Free Full Text].

    Lund R, et al. Early target genes of IL-12 and STAT4 signaling in Th cells. J. Immunol, ( (2004) ) 172, : 6775–6782.[Abstract/Free Full Text].

    Magwene PM, Kim J. Estimating genomic coexpression networks using first-order conditional independence. Genome Biol, ( (2004) ) 5, : R100.[CrossRef][Medline].

    Mowen KA, Glimcher LH. Signaling pathways in Th2 development. Immunol. Rev, ( (2004) ) 202, : 203–222.[CrossRef][ISI][Medline].

    Newman MEJ. Handbook of Graphs and Networks, —Bornholdt S, Schuster HG, eds. ( (2003) ) Berlin: Wiley-VCH. 35–68..

    Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intel. Res, ( (1999) ) 11, : 95–130..

    Reverter A, et al. Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics, ( (2006) ) 22, : 2396–2404.[Abstract/Free Full Text].

    Sanoudou D, et al. Expression profiling reveals altered satellite cell numbers and glycolytic enzyme transcription in nemaline myopathy muscle. Proc. Natl Acad. Sci. USA, ( (2003) ) 100, : 4666–4671.[Abstract/Free Full Text].

    Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol, ( (2004) ) 3, : 3..

    Stuart JM, et al. A gene-coexpression network for global discovery of conserved genetic modules. Science, ( (2003) ) 302, : 249–255.[Abstract/Free Full Text].

    Takeba Y, et al. Txk, a member of nonreceptor tyrosine kinase of Tec family, acts as a Th1 cell-specific transcription factor and regulates IFN-gamma gene transcription. J. Immunol, ( (2002) ) 168, : 2365–2370.[Abstract/Free Full Text].

    Thalamuthu A, et al. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, ( (2006) ) 22, : 2405–2412.[Abstract/Free Full Text].

    Voy BH, et al. Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput. Biol, ( (2006) ) 2, : e89.[CrossRef][Medline].

    Yang J, et al. IL-18-stimulated GADD45 beta required in cytokine-induced, but not TCR-induced, IFN-gamma production. Nat. Immunol, ( (2001) ) 2, : 157–164.[CrossRef][ISI][Medline].

    Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature, ( (1998) ) 393, : 440–442.[CrossRef][Medline].

    Werhli AV, et al. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics, ( (2006) ) 22, : 2523–2531.[Abstract/Free Full Text].

    Zhou X, et al. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA, ( (2002) ) 99, : 12783–12788.[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary data
Right arrow All Versions of this Article:
23/16/2096    most recent
btm309v3
btm309v2
btm309v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Elo, L. L.
Right arrow Articles by Aittokallio, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Elo, L. L.
Right arrow Articles by Aittokallio, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?