Bioinformatics Advance Access originally published online on October 5, 2007
Bioinformatics 2007 23(21):2903-2909; doi:10.1093/bioinformatics/btm482
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Extracting three-way gene interactions from microarray data
Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Unit 237, Houston, TX 77030-4009, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: It is an important and difficult task to extract gene network information from high-throughput genomic data. A common approach is to cluster genes using pairwise correlation as a distance metric. However, pairwise correlation is clearly too simplistic to describe the complex relationships among real genes since co-expression relationships are often restricted to a specific set of biological conditions/processes. In this study, we described a three-way gene interaction model that captures the dynamic nature of co-expression relationship between a gene pair through the introduction of a controller gene.
Results: We surveyed 0.4 billion possible three-way interactions among 1000 genes in a microarray dataset containing 678 human cancer samples. To test the reproducibility and statistical significance of our results, we randomly split the samples into a training set and a testing set. We found that the gene triplets with the strongest interactions (i.e. with the smallest P-values from appropriate statistical tests) in the training set also had the strongest interactions in the testing set. A distinctive pattern of three-way interaction emerged from these gene triplets: depending on the third gene being expressed or not, the remaining two genes can be either co-expressed or mutually exclusive (i.e. expression of either one of them would repress the other). Such three-way interactions can exist without apparent pairwise correlations. The identified three-way interactions may constitute candidates for further experimentation using techniques such as RNA interference, so that novel gene network or pathways could be identified.
Contact: lzhangli{at}mdanderson.org
Supplementary information: http://odin.mdacc.tmc.edu/~zhangli/ThreeWay
| 1 INTRODUCTION |
|---|
|
|
|---|
High-throughput genomic data are a rich resource for elucidating how genes are interconnected (Alm and Arkin, 2003; Lander, 1999; Quackenbush, 2003; Zhang, 2002). Many computational tools have been developed to assess gene interactions using microarray gene expression profiling data (Alm and Arkin, 2003). A common approach is to cluster genes using pairwise correlation as a distance metric (Eisen et al., 1998; Ji et al., 2005; Tavazoie et al., 1999). However, pairwise correlation is clearly too simplistic to describe the complex relationships among real genes since it is rare to find gene pairs that are constitutively co-expressed. Typically, co-expression relationships are often restricted to a specific set of biological conditions and/or processes (Rao and Arkin, 2001). For example, some co-expression relationships were found to exist only in cancer but not in normal tissue (Choi et al., 2005). A gene pair may be co-expressed only in a specific organ, at a specific developmental stage, after a drug treatment or in a particular disease state. To deal with these complications, new methods have been developed to identify co-expressed gene groups in subsets of biological conditions (Dettling et al., 2005; Shedden and Taylor, 2004) or modules (Getz et al., 2000; Ihmels et al., 2002; Jornsten and Yu 2003; Segal et al., 2003; Wu et al., 2004). Using such methods, it is possible to build a dynamic network of co-expressed groups form a microarray dataset. Each link (edge) in the network represents a significant pairwise correlation between two genes evaluated from expression profiles of a subset of the microarray samples.
In this study, we took an alternative approach to assess co-expression gene network. Instead of assuming that certain biological conditions/modules may affect the co-expression relationship between a gene pair, we assume that there is a third gene (named the controller gene hereinafter) associated with the biological conditions/modules that can affect the co-expression relationship. With this approach, we are looking for three-way gene interactions instead of two-way interactions. The three-way model captures the dynamic nature of co-expression relationship through the introduction of the third gene. In reality, more than one gene may affect the co-expression of a gene pair. However, models that involve more than three genes are much less tractable because the number of combinations of four or more genes is much larger than that of three genes. Thus, the three-way model represents an appealing compromise between realism and tractability. Moreover, because it is believed that the entire human gene network contains mostly nodes with sparse connections and only a small number of hubs that have a large number of connections (Luscombe et al., 2002; Rzhetsky and Gomez, 2001; Thieffry et al., 1998; Wagner, 2002), it is reasonable to expect that the three-way interaction model should be a good approximation to many cases in the real network.
We applied a simple statistical method to screen gene triplets and selected those in which the correlation of two genes can be significantly associated with the expression level of a controller gene. Using a microarray dataset containing 678 samples collected from human cancer tissues and cell lines, we examined about 0.4 billion gene triplets based on 1000 genes. By splitting the data randomly into a training set and a testing set of equal sample sizes, we demonstrated that the most statistically significant triplets identified using the training set are validated in the independent testing set.
| 2 METHODS |
|---|
|
|
|---|
2.1 Microarray data source and pre-processing
The raw microarray data (CEL files) were downloaded from Gene Expression Omnibus database at ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE2109. The data were generated by International Genomic Consortium (IGC) in its Expression Project for Oncology using Affymetrix human genome array HG-U133 plus 2.0. We used data in IGC batches 1, 2, 3, 5 and 6, but excluded batch 4, because we found that the probe signal distributions of the samples in batch 4 showed a marked difference from the samples in other batches. Most of the samples were extracted from cancer tissues and a few from cell lines. A detailed description of sample information can be found in Supplementary Table S1.
We used the quantile normalization method (Bolstad et al., 2003) to normalize probe level data (PM probes only) and used the PDNN model (Zhang et al., 2003) to extract the gene expression values. The gene expression data were quantile normalized again to reduce systematic and technical biases between samples. Then, we randomly split the samples into a training set and a testing set, each with 339 samples. The training set and the testing set were processed separately in subsequent analyses.
2.2 Identification of genes with bimodal expression profiles
A model-based clustering algorithm, MCLUST (Fraley and Raftery, 2002), was used to determine whether the distribution of log-expression values of a gene is a single normal or a mixture of two normal distributions. We used the Bayesian information criteria to select between the two models. When the mixture of two normal distributions was found to be a better fit, we then applied the MCLUST to obtain a threshold value T that splits the samples into two subgroups. To ensure adequate sample size for our subsequent evaluations, only subgroups with at least 60 samples were kept for subsequent analysis.
2.3 Evaluating interactions in a gene triplet
Consider a gene triplet A, B and C. Without loss of generality, suppose gene C is the controller gene. Based on the threshold obtained from MCLUST, we divided the 339 samples in the training set into a low expression group of n1 samples and a high expression group of n2 samples according to the expression levels of gene C. Then, we computed r1, the Pearson correlation coefficient of the log-expression values between gene A and gene B from the n1 samples, and r2 from the n2 samples. Because correlation values can be unstable when the variances of the expression levels of genes A and B are small, we discarded the triplets in which either variance of gene A or that of gene B was <0.1 in either n1 or n2 samples. We used Fisher's z-transformation (Fisher and Belle, 1993) to transform the correlation coefficients to a test statistic z:
|
| (1) |
|
| (2) |
|
| (3) |
| 3 RESULTS |
|---|
|
|
|---|
3.1 Identification of significant triplets
First, we applied a gene filtering process to our microarray dataset to reduce computational cost of our survey and ensure quality of the expression measurements. Using the training set, we filtered out genes that had small variation across samples, poor annotation or redundant probe design. Of the
57 000 probe sets on the array, it is common that multiple probe sets represent a gene. For any two probe sets representing the same gene, we removed the one with less variance. Then, we removed probe sets that correspond to no entries in RefSeq database (Pruitt et al., 2003) or have no gene symbols. Subsequently, we selected the top 1000 probe sets that have the largest variances. Among these probe sets, the minimum variance of log-expression values was 0.37, which was much higher than the median variance of all probe sets on the array, which was 0.033. Consequently, the remaining genes should have large changes in expression levels that cannot be explained by merely random noise. The process resulted in a training set composed of 1000 probe sets (genes) and 339 samples. The testing set was assembled using the same probe sets and the remaining samples. The training set and testing set can be found in Supplementary Tables S2 and S3, respectively. Next, we examined expression profiles of the 1000 genes and found 796 of them have bimodal expression distributions in the training set using a model-based clustering algorithm. For each of the 796 genes, the samples were dichotomized into a low expression group and a high expressed group (see Methods section for details). To assess the statistical significance of each triplet interaction, we applied Fisher's z-transformation (Fisher and Belle, 1993) to convert Pearson correlation coefficients into z values (see Methods section for details). We evaluated all 0.40 billion triplets that can potentially be formed by these 1000 genes with the 796 control genes.
The main results of our survey are summarized in Figure 1. The central peak of the distribution of the 0.40 billion z values resembles a normal distribution (Fig. 1a and b). In principle, the distribution of z values asymptotically approaches the standard normal distribution (mean 0 and variance 1), if the correlations are evaluated from independent random data. However, we found the variance of z values to be 1.88 instead of 1.0. The inflated variance was supposed to be caused by the fact that the z values obtained from our survey are not independent from each other, i.e. the z values cannot be regarded as resulted from null data. Recent research suggests that such inflated variance of test statistics is a common phenomenon due to effects of pairwise correlations between test statistics. This widened distribution of the null test statistics has a substantial impact in large scale testing (Efron, 2007). The heavy tails (Fig. 1b) suggested that the significant interactions were prevalent.
|
Due to the enormous size of the survey, false positives are very difficult to quantify accurately. For example, with 0.4 billion triplets, we had 0.4 billion tests and P-values. To control the false positive rate at a small level, we need to compute the P-values with extremely small precision (e.g. as small as 10–12), which is practically impossible. Our approach to the problem is to use the testing set for validation. We computed the z values using the testing set (zt) for the 10 000 triplets that have the largest |z| values (|z| > 11.2) evaluated from the training set. Figure 1c shows the distribution of these zt values, which are mostly concentrated on the tails of the normal distribution (histogram in red). We also randomly selected 10 000 triplets in the testing set and computed their z values (zrandom, Fig. 1c, histogram in black). The distribution of zt shares little overlap with the distribution of zrandom values, implying that these triplets are indeed statistically significant. To obtain a P-value for each zt value, we treated the distribution of these 10 000 zrandom values as an empirical null for P-value estimation. As shown in Figure 1d, most of the resulting P-values are very small (97.8% of them are < 0.01).
3.2 The L-shaped relationships in top significant triplets
We observed an interesting reoccurring pattern from the top triplets with the highest |z| values. As shown in Figure 2, depending on the expression of the controller gene, there are two distinct types of interactions for the remaining two genes: they are linearly correlated in one type while form an L-shaped pattern in the other. The L-shape reflects a mutually exclusive relationship within the gene pair; only one, but not both, of the genes could be highly expressed in a sample.
|
In Figures 3 and 4, we presented two examples in detail. The first example involves genes CFTR (cystic fibrosis transmembrane conductance regulator, ATP-binding cassette), MYB (v-myb myeloblastosis viral oncogene homolog) and USH1C (Usher syndrome 1C). Figure 3a contains the log expression values of MYB and USH1C in the training set. There is no apparent relationship between the expression values of these two genes (correlation coefficient is 0.09). The distribution of the log-expression values of CFTR appear to be bimodal (Fig. 3b), which is composed of a narrow but high peak at left side and a broad low peak on the right side. By separating samples into two subgroups according to the bimodality of CFTR, the relationship between MYB and USH1C emerges. When the expression level of CFTR is high (>0.98), the correlation between MYB and USH1C is positive (r = 0.7) (Fig. 3d); when the expression level of CFTR is low (
0.98), not only the correlation becomes negative (r = –0.434), but also emerges an L-shaped relationship (Fig. 3c). Previous studies indicate that both MYB and CFTR are regulated by NF-kappa B (Brouillard et al., 2001; Suhasini and Pilz, 1999). However, it remains an intriguing problem to find out what causes the mutually exclusive relationship between USH1C and MYB.
|
|
In the second example (Fig. 4), the genes involved are Kruppel-like factor 5 (KLF5), cadherin 6 type 2 (CDH6) and UDP glucuronosyltransferase 1 family, polypeptide A (UGT1A). When KLF5 is expressed low, CDH6 and UGT1A show an L-shaped relationship (r = –0.313); when KLF5 is expressed high, CDH6 and UGT1A are co-expressed (r = 0.893).
We investigated if the L-shaped relationships shown in Figures 2–4![]()
were caused by expressions of tissue specific genes. To explore this issue, we used Figure 5 to show the tissue compositions of the L-shaped relationships seen in Figures 3 and 4. In Figure 5a, most of the samples in the vertical arm came from kidney, while most of the samples in the horizontal arm came from breast. These observations imply that when the control gene (CFTR) is expressed low, USH1C is kidney specific and MYB is breast specific. In Figure 5b, colon samples dominated the vertical arm, but no tissue type appeared to dominate the horizontal arm. More examples are shown in Supplementary Material (Fig. S5). These examples suggest that in some of the cases the L-shaped relationships involve tissue specific genes while in others the L-shaped relationships are not related to tissue types. It should be noted, however, such tissue specificity is conditional on the control gene. For example, besides expressed in kidney, USH1C is also highly expressed in spinal cord (Su et al., 2002; URL: http://symatlas.gnf/SymAtlas). Thus, the triplet interactions identified in our study cannot be explained simply by tissue specific genes.
|
It is important to note that many of the three-way interactions revealed by these triplets could not have been predicted by merely pairwise correlation. We calculated pairwise correlation for the 9780 triplets that have large |z| values and small P-values. The results showed that 52.8% (5164), 38.37% (3753) and 21.84% (2136) of these triplets have pairwise correlations all < 0.6, 0.5 and 0.4, respectively. Therefore, it seems that our method is particularly useful in identifying potential gene triplets where there is no obvious correlation for any gene pair, but more complex interactions exist.
| 4 DISCUSSION |
|---|
|
|
|---|
Our survey of three-way gene interactions finds significant statistical associations among expression profiles of three genes. A significant gene triplet (A, B and C) is supposed to exhibit the following traits: (1) Gene C's expression profile displays bimodality, which enables us to partition the samples (arrays) into a high expression group and a low expression group; (2) Correlation of expression between genes A and B in the high expression group is significantly different from that in the low expression group.
It will require further experimentation to identify the underlying molecular interactions that drive the observed statistical associations found in the significant triplets. The statistical associations may also be caused by technical factors, such as cross-hybridization, RNA degradation and biases induced in normalization procedures. The technical factors may play a dominant role when the signal-to-noise ratio in the profiling data is low, which is why we have selected only the highly variable genes in our survey. Our survey was designed with the following molecular mechanisms in mind: a transcription factor X, which regulates both gene A and B, is conditional on the expression level of gene C. However, the X is unknown, and its relationship to C is not determined by our method. Gene C may encode X. Alternatively, gene C may regulate X's activity through some intermediate agents. There are also other molecular mechanisms that can explain changes in co-expression relationship between a gene pair. For example, Tomlins et al. (2005) found that there is a recurrent chromosomal aberration in prostate cancer tissues that led to fusion of two genes (TMPRSS2 and ETS transcription factor). In normal tissues, the expression of these two genes are mutually exclusive (i.e. L-shaped). In cancer tissues, because the two genes are fused together, they become strictly co-expressed. There is no known controller gene in this example to affect TMPRSS2 and ETS.
We have no direct experimental evidence to support the interactions of the top triplets from our survey. However, it is possible to test these triplet interactions experimentally. For example, we may manipulate the expression level of a control gene using RNA interference and examine the co-expression pattern of two other genes.
In recent years, there have been several algorithms developed for inferring three-way interactions based on fuzzy logic (Woolf and Wang, 2000), mutual information (Bowers et al., 2004) and liquid association (LA) (Li, 2002). The mutual information method was applied to phylogenetic profile data to identify interacting protein triplets (Bowers et al., 2004). This method has not been tested with gene expression profiling data. In LA method, all genes in a three-way interaction are supposed to have unimodal distributions. The controller gene acts as a continuous modulator rather a qualitative switch as in our method. To ensure unimodal distributions, LA method transforms all gene expression values to their normal quantiles. We evaluated the triplets shown in Figure 2 with LA method and found all of them are highly significant (P-value < 10–4). However, when LA method was used to conduct our large-scale survey, we encountered some difficulties. We found that LA method yielded P-values < 0.01 in 20% cases of randomly assembled triplets from our data, i.e. small P-values occurred frequently. Thus, to distinguish the top significant triplets from
0.5 billion triplets, we had to compute the small P-values with very high precision. Because LA method relies on permutation to compute P-values, it is impractical to compute a large number of extremely small P-values. Besides methodological differences, these studies mentioned above are also different from ours because they used much smaller sample sizes and did not show validation. These studies (Bowers et al., 2004; Li, 2002; Woolf and Wang, 2000) did describe a few examples of triplet interactions with plausible biological mechanisms.
Our method is also closely related to another recently developed method called differential correlation (Shedden and Taylor, 2004), which seeks for significant changes in correlation of expression between a gene pair that can be associated with a dichotomized clinical factor. With our method, the clinical factor is replaced by expression of another gene.
Experimentally or computationally, we note that there is a lack of well-studied gene triplet interactions. However, we believe this is not due to lack of three-way gene interactions in the cells, but due to the fact that most studies focus on characterizing gene relationships involving only two genes at a time. However, as we shown in our study here, three-way interaction cannot be decomposed as sum of three pairwise interactions; strong three-way interactions can be identified when there is no sign of obvious pairwise correlations. We hope this study will attract more experimentalists to examine triplet interactions among genes.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Margaret Newell for editorial assistance and Haitao Zhao for managing and processing the microarray database. L.Z.' s research was partially funded by M. D. Anderson Cancer Center Institutional Research Grant and NIH grants (CA108558-01; CA016672-28). Y.J.' s research was partially supported by the CML P01 grant, CA049639.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on April 28, 2007; revised on September 16, 2007; accepted on September 23, 2007
| REFERENCES |
|---|
|
|
|---|
Alm E, Arkin AP. Biological networks. Curr. Opin. Struct. Biol. (2003) 13:202.
Bolstad BM, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (2003) 19:185–193.
Bowers PM, et al. Use of logic relationships to decipher protein network organization. Science (2004) 306:2246–2249.
Brouillard F, et al. NF-kappa B mediates up-regulation of CFTR gene expression in Calu-3 cells by interleukin-1beta. J. Biol. Chem. (2001) 276:9486–9491.
Choi JK, et al. Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics (2005) 21:4348–4355.
Dettling M, et al. Searching for differentially expressed gene combinations. Genome Biol. (2005) 6:R88.[CrossRef][Medline]
Efron B. Correlation and large-scale simultaneous significance testing. J. Am. Stat. Assoc. (2007) 102:93–103.[CrossRef][Web of Science]
Eisen MB, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998) 95:14863–14868.
Fisher LD, Belle Gv. Biostatistics (1993) New York, NY: John Wiley & Sons Inc.
Fraley C, Raftery A. Model based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. (2002) 97:611–631.[CrossRef][Web of Science]
Getz G, et al. Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci. USA (2000) 97:12079–12084.
Ihmels J, et al. Revealing modular organization in the yeast transcriptional network. Nat. Genet. (2002) 31:370–377.[CrossRef][Web of Science][Medline]
Ji Y, et al. Applications of beta-mixture models in bioinformatics. Bioinformatics (2005) 21:2118–2122.
Jornsten R, Yu B. Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics (2003) 19:1100–1109.
Lander ES. Array of hope. Nat. Genet. (1999) 21:3–4.[CrossRef][Web of Science][Medline]
Li KC. Genome-wide coexpression dynamics: theory and application. Proc. Natl Acad. Sci. USA (2002) 99:16875–16880.
Luscombe NM, et al. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. (2002) 3:0040.
Pruitt KD, et al. NCBI reference sequence project: update and current status. Nucleic Acids Res. (2003) 31:34–37.
Quackenbush J. Genomics. Microarrays—guilt by association. Science (2003) 302:240–241.
Rao CV, Arkin AP. Control motifs for intracellular regulatory networks. Annu. Rev. Biomed. Eng. (2001) 3:391–419.[CrossRef][Web of Science][Medline]
Rzhetsky A, Gomez SM. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics (2001) 17:988–996.
Segal E, et al. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet. (2003) 34:166–176.[CrossRef][Web of Science][Medline]
Shedden K, Taylor J. Differential correlation detects complex associations between gene expression and clinical outcomes in lung adenocarcinomas. In: Methods of Microarray Data Analysis (2004) Boston: Kluwer Academic Publishers.
Su AI, et al. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci. USA (2002) 99:4465–4.
Suhasini M, Pilz RB. Transcriptional elongation of c-myb is regulated by NF-kappaB (p50/RelB). Oncogene (1999) 18:7360–7369.[CrossRef][Web of Science][Medline]
Tavazoie S, et al. Systematic determination of genetic network architecture. Nat. Genet. (1999) 22:281–285.[CrossRef][Web of Science][Medline]
Thieffry D, et al. From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. Bioessays (1998) 20:433–440.[CrossRef][Web of Science][Medline]
Tomlins SA, et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science (2005) 310:644–648.
Wagner A. Estimating coarse gene network structure from large-scale gene perturbation data. Genome Res. (2002) 12:309–315.
Woolf PJ, Wang Y. A fuzzy logic approach to analyzing gene expression data. Physiol. Genomics (2000) 3:9–15.
Wu CJ, et al. Gene expression module discovery using gibbs sampling. Genome Inform. Ser. Workshop Genome Inform. (2004) 15:239–248.[Medline]
Zhang L, et al. A model of molecular interactions on short oligonucleotide microarrays. Nat. Biotechnol. (2003) 21:818–821.[CrossRef][Web of Science][Medline]
Zhang MQ. Extracting functional information from microarrays: a challenge for functional genomics. Proc. Natl Acad. Sci. USA (2002) 99:12509–12511.
This article has been cited by other articles:
![]() |
Y. Choi and C. Kendziorski Statistical methods for gene set co-expression analysis Bioinformatics, November 1, 2009; 25(21): 2780 - 2786. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Kayano, I. Takigawa, M. Shiga, K. Tsuda, and H. Mamitsuka Efficiently finding genome-wide three-way gene interactions from transcript- and genotype-data Bioinformatics, November 1, 2009; 25(21): 2735 - 2743. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Everett, A. Vo, and S. Hannenhalli PTM-Switchboard--a database of posttranslational modifications of transcription factors, the mediating enzymes and target genes Nucleic Acids Res., January 1, 2009; 37(suppl_1): D66 - D71. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






