Skip Navigation


Bioinformatics Advance Access originally published online on September 7, 2009
Bioinformatics 2009 25(21):2735-2743; doi:10.1093/bioinformatics/btp531
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
25/21/2735    most recent
btp531v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Kayano, M.
Right arrow Articles by Mamitsuka, H.
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kayano, M.
Right arrow Articles by Mamitsuka, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Efficiently finding genome-wide three-way gene interactions from transcript- and genotype-data

Mitsunori Kayano 1,2, Ichigaku Takigawa 1,2, Motoki Shiga 1,2, Koji Tsuda 2,3 and Hiroshi Mamitsuka 1,2,*

1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, 2 Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST) and 3 Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan

* To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: We address the issue of finding a three-way gene interaction, i.e. two interacting genes in expression under the genotypes of another gene, given a dataset in which expressions and genotypes are measured at once for each individual. This issue can be a general, switching mechanism in expression of two genes, being controlled by categories of another gene, and finding this type of interaction can be a key to elucidating complex biological systems. The most suitable method for this issue is likelihood ratio test using logistic regressions, which we call interaction test, but a serious problem of this test is computational intractability at a genome-wide level.

Results: We developed a fast method for this issue which improves the speed of interaction test by around 10 times for any size of datasets, keeping highly interacting genes with an accuracy of ~85%. We applied our method to ~3 x 108 three-way combinations generated from a dataset on human brain samples and detected three-way gene interactions with small P-values. To check the reliability of our results, we first conducted permutations by which we can show that the obtained P-values are significantly smaller than those obtained from permuted null examples. We then used GEO (Gene Expression Omnibus) to generate gene expression datasets with binary classes to confirm the detected three-way interactions by using these datasets and interaction tests. The result showed us some datasets with significantly small P-values, strongly supporting the reliability of the detected three-way interactions.

Availability: Software is available from http://www.bic.kyoto-u.ac.jp/pathway/kayano/bioinfo_three-way.html

Contact: kayano{at}kuicr.kyoto-u.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We address the issue of efficiently finding a three-way gene interaction, precisely two interacting genes in expression under the genotypes of a different gene, given a dataset in which both gene expressions and genotypes are measured for each individual. We illustrate our problem setting by using synthetic 2D diagrams in Figure 1, where expression values of two genes are plotted with three classes (genotypes): +, * and {bigtriangleup}. In this figure, panel (a) shows expression values being just randomly distributed; (b) shows expression values being easily categorized into three classes; and (c) shows that classes can be categorized by expressions without using two genes at the same time. We are not interested in (a–c) but in (d), which shows that the correlation in expression between two genes differs for each class. More concretely, two genes are positively correlated for one class, whereas they are negatively correlated for another. This is exactly a switching mechanism in expression between correlation and inverse-correlation of two genes, controlled by another gene. Also this is the three-way gene interaction which we attempt to find in this article. We note that this can be categorized into a general switch in biology. A simple, well-known example is Max, a transcription factor, which plays a role of an activator or a suppressor, depending on whether it binds to Myc (i.e. Myc-Max) or Mad (i.e. Mad-Max) (Ayer and Eisenman, 1993). We emphasize that this type of interaction must be a key to elucidating complex biological systems.


Figure 1
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Synthetic examples: expressions of two genes under the three classes of another gene. (a) randomly distributed, (b,c) easily categorized into three classes and (d) a switching mechanism.

 
A reasonable approach to detect such three-way interactions is the likelihood ratio test for regression (LRTR). Particularly, logistic regression must be suitable the most, because of categorical responses (genotypes) in our setting (McCullagh and Nelder, 1989). The first item of note is that parameter estimation for logistic regression is based on the maximum likelihood, for which a time-consuming iterative gradient descent, Newton–Raphson, is usually used. Secondly, in our case, classes are genotypes, causing a problem of an explosive number of combinations of one SNP (genotypes) and two genes (expressions). For example, for 50 000 SNPs and 1000 genes, we have roughly 5 x 1010 (= 50 000 x 1000 x 1000) combinations, making scanning over all possible combinations intractable. In fact, >24 h are needed to run Newton–Raphson over only 107 combinations in our experiments. Thus, the main focus of this article is to speed up the procedure of finding the three-way interactions. Our strategy for this issue is to prune irrelevant combinations, such as those in which the expression values of two genes are randomly distributed as in Figure 1a, by using a hypothesis test assuming the normality of given examples.

The contribution of this article can be summarized into three folds: (i) We present a problem setting of finding a three-way gene interaction of two numerical variables and one categorical, corresponding to a biological switch in expression. (ii) LRTR and LRT of logistic regression (LRTLR) are the standard approaches for this problem, but these are computationally inefficient, particularly for a huge number of combinations that we can have. We then propose an efficient method for pruning large part of input combinations. (iii) Our experiment with a huge dataset of human brain samples showed that our method run 10 times faster than LRTLR for any data size, keeping the accuracy of detecting three-way interactions at ~85%.


    2 RELATED WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Three-way interactions in expression have not been considered except only a few cases of using simple methods (Li et al., 2004; Zhang et al., 2007). There are two reasons for this: (i) dealing with more than two-way correlations is intractable at a genome-wide level, because of the explosive number of combinations and (ii) three-way interactions along this line can be inferred from two-way co-expression. We emphasize that our three-way interaction is different from them, in terms that correlation or inverse-correlation in expression of two genes is controlled by genotypes of another gene.

Genome-wide association studies (GWA) using genotypes, especially single nucleotide polymorphisms (SNPs), have been highlighted in these few years (McCarthy and Hirschhorn, 2008), whereas cDNA microarrays have been a standard tool for understanding gene/protein behaviors in a cell. Thus, currently a large number of studies use both gene expressions and genotypes, showing the importance of combining these two information sources (Nica and Dermitzakis, 2008). Consequently, we now have a unique dataset, in which both gene expressions and genotypes are measured at once for each individual, and this type of dataset, which we use in this article, is increasing in these few years, which makes our approach very promising (Dixon et al., 2007; Myers et al., 2007; Schadt et al., 2008).

A standard analysis in GWA is conducted between a single SNP (i.e. genotypes at a locus) and a categorical or continuous outcome (phenotype). For this analysis, the two most typical approaches are ANOVA (Analysis of Variance) and LRTR (Balding, 2006). Usually more complex analysis is multiple (usually two) SNPs with a single phenotype where two-way ANOVA or LRTR with two explanatory variables can be considered. This situation is closely related with epistasis, a general concept in modern quantitative genetics (Aylor and Zeng, 2008; Cordell, 2002), meaning the interaction between multiple loci and phenotype (Marchini et al., 2005). Our problem setting looks similar to this but interestingly in the reverse direction. That is, we consider the interaction between two expression phenotypes under categorical genotypes which thus have not been examined in GWA. We note that ANOVA cannot be applied to this issue,1 whereas LRTR can be applied as a standard manner for our setting. Another item of note is that finding three-way interactions in only SNPs exists (Lo et al., 2008), but their problem setting is straightforward and totally different from our setting.


    3 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Notations and preliminaries
Let X be an input matrix, in which each row is an individual and each column is a numerical vector of gene expressions or a categorical vector of SNPs (in genes). Let E be the set of genes for which expressions are measured in X and Q be the set of SNPs in X, indicating that |E|+|Q| is the total number of columns of X. To test the three-way interaction, we choose one combination, i.e. two genes (e1 and e2) and one SNP (q) out of E and Q, respectively, and we write X(e1, e2, q) which has only three columns of X, corresponding to e1, e2 and q [we write X(e, q) when we choose only one gene e out of E and q out of Q]. Hereafter, until Section 3.6, we assume that we already choose one combination.

For gene expressions, let X=(X1,..., XK)'isinRK be a K-dimensional numerical variable, taking value x=(x1,..., xK)'. We note that using two genes in expression does not necessarily mean K=2. For example, for two genes, we can set K=3, where X1, X2 and X3 correspond to one gene, the other gene and the interaction between these two genes, respectively. For genotypes, let C be the number of groups (or classes), and in fact, C=3. We denote three genotypes by G1, G2 and G3, into one of which each individual falls. Let Y be the class variable, taking value y, where Y=(Y1, Y2)'isin{0, 1} x {0, 1}. Here, we note that y takes the following values: y=(1, 0)' if xisinG1, y=(0, 1)' if xisinG2 and y=(0, 0)' if xisinG3. We denote N inputs (individuals) by X=(x1,..., xN)' and Y=(y1,..., yN)'=(y(1), y(2)), which can be classified into N1, N2 and N3 inputs for G1, G2 and G3, respectively. The average expression values can be defined for each class c and all classes: Formula and Formula , respectively, where Formula . IK is the identity matrix of size K, and 1 is an n-dimensional vector in which all elements are 1.

We incorporate some basic statistics: Formula and Formula , where T=B+W. We can further define covariance matrix Sc for class c, Formula and total covariance matrices S and ST, Formula and Formula . We note that W = {sum}FormulaNcSc and Formula .

We show the multivariate normal distribution, having two parameters, µc and {Sigma}c (the mean and the covariance matrix of class c), and the log-density (log-likelihood) function of this distribution can be given as follows:


Formula 1

(1)
From this equation, we can see that Formula , covariance matrix Sc and covariance matrix S can be the maximum likelihood estimators of µc, {Sigma}c and {Sigma} (={Sigma}1=···={Sigma}C), respectively.

We briefly describe likelihood ratio test (LRT), which will be used. We first assume that examples x1, x2 ,..., xn are generated according to parameter vector {theta}. Let H0 : {theta}isin{Omega}0 be a null hypothesis and H1 : {theta}isin{Omega}1 be the alternative hypothesis. The statistic {lambda} for testing H0 against H1 can be defined as {lambda}=LFormula/LFormula, where LFormula and LFormula are the maximum likelihoods under {theta}isin{Omega}0 and {theta}isin{Omega}1, respectively. Usually we can use the log-likelihood ratio (LLR), –2log{lambda}=2({ell}Formula{ell}Formula), where {ell}Formula=logLFormula and {ell}Formula=log LFormula. We note that this statistic follows {chi}Formula distribution as N->{infty}, where qr is the degree of freedom (df) of the {chi}2 distribution.

3.2 Finding three-way interactions: interaction test (Likelihood Ratio Test of Logistic Regression, LRTLR)
A standard and exact approach for our problem is LRTLR (McCullagh and Nelder, 1989), which we simply call interaction test in this article.

3.2.1 Logistic regression
We first denote the probability that x is in G1 by p1(x), and similarly the probability that x is in G2 by p2(x), by which the probability that x is in G3 is p3(x) (=1–p1(x)–p2(x)). We use logistic regression to link these probabilities to K-dimensional input x by using weight parameters (or coefficients) w=(w1', w2')', where w1=(w10, w11,..., w1(K–1))', w2=(w20, w21,..., w2(K–1))' as follows:


Formula 2

(2)
Here, we denote p1(x), p2(x) and p3(x) by p1(x; w), p2(x; w) and p3(x; w) (=1–p1(x; w)–p2(x; w)), respectively, because they can be functions of w. We can then write the likelihood of logistic regression for given N examples and parameters w, as follows:


Formula

where yi=(yi1, yi2)'.

3.2.2 Parameter estimation
We can obtain the maximum likelihood estimator w for w by maximizing the log-likelihood l(w)=log L(w). A standard approach for this purpose is the Newton–Raphson method, which is an iterative gradient descent, having the following updating rule by which we can have w(t+1) at the (t+1)-th iteration, using w(t) of the t-th iteration:


Formula 3

(3)
where Hessian matrix H(w) (={partial}2l/{partial}w{partial}w') and gradient vector U(w) (={partial}l/{partial}w) can be given in the following:


Formula

where X*=diag(X, X) (block diagonal matrix of X), a(w)=(a1(w)', a2(w)')' where aj(w)=y(j)pj(w) and pj(w)=(pj(x1; w),..., pj(xN; w))' (j=1, 2).


Formula

where N x N matrix Rjk(w) (j, k=1, 2) is given by Rjj(w)=diag{pj(w){odot}(pj(w)–1)} and Rjk(w)=diag{pj(w){odot} pk(w)} (j!=k).

Finally, the updating rule of the Newton–Raphson method for logistic regression can be rewritten in the following:


Formula 4

(4)
In practise, we start with some initial values w(0) and update w(t+1) according to Equation (4) until the following equation is satisfied:


Formula 5

(5)
where {delta} is set at a certain value.

3.2.3 Interaction test
We then examine the significance of the interaction in expression between two genes in terms of classes of another gene. Let xi1 and xi2 be expression values of the corresponding two genes for input i. The interaction term is xi1xi2, meaning that our purpose is to find the case that the logistic model is well fitted to the data when this term is added. We then let xi=(1, xi1, xi2, xi1xi2)' and w=(w10, w11, w12, w13, w20, w21, w22, w23)', and the logistic model with the interaction term is given as follows:


Formula

If wc3=0, the model does not have the interaction term, meaning that the null hypothesis and w0 are given as follows:


Formula 6

(6)
Then the test statistic, LLR and its asymptotic distribution can be given:


Formula 7

(7)
where {chi}Formula({alpha}i) is the {chi}2 distribution with the df of two, meaning that interacting genes can be obtained as those which have lower P-values under this distribution than the input significance level {alpha}i. We run interaction test 100 times over four examples in Figure 1, and Table 1 shows the average results over the 100 runs. This table clearly shows that the P-value is very large for Figure 1a–c, while that is zero for Figure 1d, indicating that intraction test can detect our target sample correctly.


View this table:
[in this window]
[in a new window]

 
Table 1. Log-likelihoods and LLR by Newton–Raphson

 
Figure 2 shows a pseudocode of interaction test. We can write interaction test by function Interaction_test(e1, e2, q, {alpha}i), which outputs one if given example (e1, e2, q) has the three-way interaction; otherwise zero. A significant drawback of interaction test is computational inefficiency. In fact, Equation (6) shows K=8, meaning that Newton–Raphson needs to compute an 8 x 8 inverse-matrix at each of its iteration procedure.


Figure 2
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Pseudocode of interaction test.

 
3.3 Key idea for speeding-up interaction finding
A basic idea for accelerating the finding of a three-way interactions is to prune some combinations, to which interaction test does not have to be applied. From Equation (7), we can see that the interacting genes should have a larger LLR. Figure 3 shows a schematic figure, in which we plot the log-likelihood without the interaction term in the left-hand side and with the interaction term in the right-hand side. We note that the range of the log-likelihood can be limited, because the maximum log-likelihood is zero and the minimum log-likelihood can be given by the case of the uniform distribution for pi(x). The LLR in question can be then given by the distance being shown by a dotted line in Figure 3. Thus, two interacting genes should have a long dotted line, meaning that the point in the left-hand side should be lower and that in the right-hand side should be higher. This observation indicates that we can prune the following two cases: (I) a large likelihood can be obtained without the interaction term, and (II) only a small likelihood can be obtained even if we use the interaction term. These (I) and (II) correspond to areas I and II, respectively, in Figure 3. We then attempt to efficiently detect examples in areas I and II by assuming the normality on data distribution.


Figure 3
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. LLR and its components.

 
3.4 Linear discriminant analysis
Area I in Figure 3 contains examples in which expressions can be easily separated into three classes without the interaction term, as shown in Figure 1b and c. Thus, in this case, we can consider a simpler, easily computable estimation method for parameters of the logistic regression model without the interaction term, and if the likelihood for a given combination by that model is high enough, this combination can be pruned. For the simpler estimation method, we use linear discriminant analysis (LDA), which assumes that x follows the normal distribution N(µ, {Sigma}) with the same covariance {Sigma} for all three classes (Hastie et al., 2001). We skip the detail of this method due to space limitations because in our experiment only a small part of all given examples can be pruned by LDA. Interested readers should refer the Supplementary Material. We can write LDA by function LDA(e1, e2, q, {alpha}i) [or LDA(e, q, {alpha}i)], which outputs one if given example (e1, e2, q) [or (e, q)] should be pruned; otherwise zero.

3.5 Randomness test
Area II in Figure 3 contains an example for which the maximum likelihood with the interaction term is very low, implying that expression values are almost randomly distributed in terms of classes, as shown in Figure 1a. To detect the randomness of expression values, if we use a faster hypothesis test for randomness than Newton–Raphson, we can speed up the procedure for finding the three-way interaction. We assume that expression values follow the K-dimensional normal distribution for each class of genotypes, and under this assumption, we present our approach, which combines multivariate ANOVA (MANOVA) and Box's M test (Mardia et al., 1979). We can set K=2 for our test, meaning that the largest matrix size is 2 x 2, making the computation very efficient.

3.5.1 MANOVA
MANOVA considers the following null hypotheses over the means:


Formula

For testing H0 against H1, we use LLR, –2log{lambda} (=2({ell}Formula{ell}Formula)). By replacing {Sigma}c in Equation (1) with {Sigma} and using the maximum likelihood estimators Formula and S for µk and {Sigma}, respectively, we have the following:


Formula 8

(8)
On the other hand, for the log-likelihood under null hypothesis, we can use the maximum likelihood estimators Formula and ST for µk and {Sigma}, respectively, and we have the following:


Formula 9

(9)
Thus, the statistic can be given as follows: Formula . We can further see that q is Formula and r is Formula .

We conducted MANOVA over four samples in Figure 1, and Table 2 shows the resultant average over 100 runs with SDs in parentheses. The P-value of MANOVA for (a) was high (0.53), whereas that for (b) [and (c)] was zero, meaning that MANOVA can discriminate (a) from (b) [and (c)]. However, the P-value of (d) was also high (0.94), meaning that MANOVA could not separate (a) from (d). Thus, we need another hypothesis test, which can distinguish (a) from (d).


View this table:
[in this window]
[in a new window]

 
Table 2. MANOVA, Box's M test and Means-Covariances (MC) test on four examples in Figure 1

 
3.5.2 Box's M test
We then consider the following hypotheses over the covariance:


Formula

Here, {ell}Formula can be given by {ell}Formula of MANOVA [i.e. Equation (8)], and {ell}Formula can be obtained by using maximum likelihood estimators Formula and Sk for µk and {Sigma}k, respectively, in Equation (1).


Formula 10

(10)
Thus, the statistic is –2log{lambda}={sum}Formula Nclog det(SFormulaS). Here, q is Formula and r is Formula .

We run Box's M test over four samples in Figure 1, and Table 2 shows the results. This result shows that the P-value of (a) was high (0.70), whereas that of (d) was zero, meaning that M-test separated (a) from (d). However, this time, this test could not discriminate (a) from (b) [and (c)], since the P-value of (b) [and (c)] was also high. Thus, this result showed that Box's M test can be a complement of MANOVA, implying that we can combine these two tests for detecting random distributions such as Figure 1a.

3.5.3 MC test (MANOVA + M Test)
We finally consider the following hypotheses over both the means and covariances:


Formula

We emphasize that this test suits our purpose the most, although this is an unpopular statistic and not named. We then call this test as MC test. Interestingly, {ell}Formula of this test is given by {ell}Formula of MANOVA, i.e. Equation (9) and {ell}Formula is given by {ell}Formula of M test, i.e. Equation (10). Thus, the statistic of MC test is given as follows:


Formula 11

(11)
since Formula . Here, Formula and Formula , meaning that df is 10 in our case. Figure 4 shows a pseudocode of MC test. We can write MC test by function MC_test(e1, e2, q, {alpha}m), having significance level {alpha}m as an input which removes given combination (e1, e2, q) if its P-value is larger than {alpha}m, meaning that a larger number of combinations can be removed if {alpha}m is smaller. This function outputs one if (e1, e2, q) should be pruned; otherwise zero.


Figure 4
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Pseudocode of MC test.

 
We checked the performance of MC test using synthetic four samples of Figure 1. Table 2 shows that all P-values are zero, except (a) with the P-value of 0.60, indicating that MC test can successfully detect (a) out of the four examples and is expected to work on real data as well.

3.6 Proposed procedure
Figure 5 shows a pseudocode of our entire procedure. We can first check each pair of a gene and a SNP by LDA, and if the log-likelihood is high, this pair is stored to be pruned. We then generate all possible combinations of two genes and a SNP out of given data. For each of these combinations, it is first pruned if it contains the stored gene–SNP pair. Then, LDA and MC test are run in sequence for pruning, and finally interaction test is applied to the remaining. Hereafter, we call our proposed procedure FTGI, standing for Fast finding Three-way Gene Interactions, whereas we call the approach of running Interaction Test Only over all possible combinations as ITO. More details of our proposed method is shown in the Supplementary Material.


Figure 5
View larger version (38K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Pseudocode of our entire procedure: FTGI.

 

    4 EXPERIMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Data
We used the human brain-derived dataset of Myers et al. (2007), which originally has 193 rows (individuals) and 14 078 numerical columns (corresponding to gene expressions) and 366 140 categorical columns (corresponding to SNPs). We first removed the columns containing missing values and the columns which have a genotype to which only less than 10 individuals are assigned. Our purpose is to find three-way gene interactions, and so we further removed SNPs which are neither in coding regions nor in introns, by specifying genes on sequences using the FTP site of NCBI Mapviewer for Homo sapiens. Finally, we obtained 5269 numerical vectors (in expression of genes) and 13 411 categorical vectors (in genotypes of SNPs) for 193 individuals, which we call the Source dataset. Myers et al. (2007) collected the original dataset from human brains, and so we focused on neurodegenerative diseases [including Alzheimer's disease (AD) and Parkinson's disease, etc.] out of five disease pathways in the KEGG disease database (Kanehisa et al., 2008), resulting in 142 genes which we call Neuro. All experiments were run on a machine with Dual-Core AMD Opteron 2222 SE (3.0 GHz) and 18 GB RAM. Throughout Section 4, each P-value is shown by log10(P-value).

4.2 Results and discussion
4.2.1 Speeding-up finding three-way interactions and pruning accuracy
We examined the improvement in time efficiency by FTGI over ITO. Figure 6 shows the real computation time of ITO and FTGI, when we changed the number of combinations randomly chosen from the source dataset. We here focused on Area II of Figure 3 only, since we found that in the Source dataset of Area I had only a small number of examples, which do not affect the efficiency greatly. This figure clearly shows that as {alpha}m decreased, the amount of running time of FTGI became smaller for any size of inputs, by pruning a larger number of them. In particular, at {alpha}m of 0.001, FTGI runs approximately 10 times faster than ITO, resulting in only ~2 h for 107 combinations, being a sizable improvement. This means that for 5 x 1010 (= 50 000 SNPs x 1000 genes x 1000 genes) combinations, FTGI just needs only a couple of days with 100 CPUs, while ITO needs more than a month.


Figure 6
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Computation time improvement by reducing {alpha}m.

 
The {alpha}m controls the number of pruned combinations, and Table 3 shows the pruning rate, i.e. the ratio of pruned combinations to all input combinations, with varying {alpha}m for 107 input combinations. We further checked the pruning accuracy, which can be defined as the overlap between the resultant top K (set at 100) combinations by P-values of ITO and those of FTGI. Table 3 shows that for {alpha}m of 0.05, FTGI can prune around ~70% of input combinations with pruning accuracy of almost 100%. If {alpha}m is reduced to 0.001, ~94% inputs can be pruned, keeping the pruning accuracy of ~85%. This high pruning rate effects the time efficiency of FTGI.


View this table:
[in this window]
[in a new window]

 
Table 3. Pruning rates and pruning accuracies (top 100) at three {alpha}m values of FTGI for 107 combinations

 
We note that all results in this section were averaged over three runs at each corresponding setting.

4.2.2 Detecting three-way interactions
We then generated all combinations from the Source dataset, focusing on the genes in Neuro, meaning that we had totally ~3 x 108 combinations (= 13 411 SNPs x 142 genes x 142 genes). We then run FTGI with {alpha}m of 0.001 over these combinations. Figure 7 shows the gene expressions of the resultant top 10 combinations in terms of P-values. We note that these P-values of interaction test were computed by the procedure in Section 3.2. Each of Figure 7 is a 2D diagram on which expression values of the corresponding two genes are plotted with Contour lines for each genotype. This figure shows that the topographical distribution of different genotypes are clearly crossed in all cases, meaning that in each of all the top 10 combinations, genes are interacting in expression, being controlled by genotypes, as shown in Figure 1d.


Figure 7
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Expressions of two genes under three genotypes of another gene for top 10 (aj) ranked three-way interactions out of 3 x 108 combinations.

 
Table 4 shows the detail (Gene name for one SNP and the name with GeneID, the definition and the pathway for each of two interacting genes in expression) of the 10 three-way interactions in Figure 7, all information in this table being retrieved from KEGG.2 For example, the first interaction of Table 4 shows the switching mechanism of two genes, COX6C and UBA1, being controlled by a SNP in LARP4.


View this table:
[in this window]
[in a new window]

 
Table 4. Details of the top 10 three-way interactions in Figure 7

 
4.2.3 Validating detected interactions with permutations
To confirm the statistical significance of the detected three-way interactions, we conducted permutations by measuring P-values of ‘null data’, generated in the following three manners, and comparing them with those of the interactions we detected.
  • Null data 1: we randomly chose 10 000 combinations out of all combinations using the Source dataset (13 411 SNPs x 5269 genes x 5269 genes) and randomly permuted the genotypes of these combinations 100 times. Totally, we had one million null examples.
  • Null data 2: we randomly chose 10 000 combinations out of all combinations using the Neuro dataset (13 411 SNPs x 142 genes x 142 genes) and randomly permuted the genotypes of these combinations 100 times. Totally, we had one million null examples.
  • Null data 3: we permuted the genotypes of each of the detected top 10 interactions in Figure 7 one million times, resulting in one million null examples for each combination.
We first show the results of permutation tests when we use Null data 1 and 2. Figure 8 shows the distribution of P-values of null examples, being located in the right side, for Null data 1 and 2. In this figure, the distribution of P-values for the top 10 000 interactions detected by FTGI is located in the left side. This figure shows that the red-colored distribution is clearly separated from the black-colored one, meaning that the detected three-way interactions have significantly small P-values. For Null data 3, we show the result, focusing on two cases (the top and the 10th interactions), since the trend of results was kept the same for all 10 interactions in Table 4. Figure 9 shows the distribution of P-values of null examples generated from the top interaction (or the 10th), with the P-value of the top (or the 10th) interaction by an arrow. This figure indicates that the P-value of the top (or the 10th) interaction is clearly distant from the P-value distribution of null examples, implying that P-values of the detected interactions are statistically significant.


Figure 8
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. Distributions (left side) of P-values of the top 10 000 interactions detected by FTGI, with those (right side) of Null data (a) 1 and (b) 2.

 

Figure 9
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. The P-values of the (a) top and (b) 10th interactions (shown by arrows) and the distributions of P-values of the corresponding null examples generated.

 
4.2.4 Validating detected interactions with GEO
To confirm the reliability of the interactions in Table 4, we tried to find, for each gene pair, the switching mechanism in expression which can be controlled by some experimental condition of gene expression. This is because if found, this directly means that the corresponding gene pair can be controlled by another categorical factor, such as genotypes of another gene.

For this purpose, we used GEO (version of June 1, 2009; Barrett et al., 2007), from which we found 2089 GDSs (gene datasets) which are annotated. Out of the 2089 datasets, we selected datasets which satisfy all the following four conditions for each gene pair in Table 4: (i) expression values of the corresponding gene pair are contained; (ii) the total number of experiments is ≥50; (iii) experimental conditions can be divided into two or more classes; and (iv) each class has 10 or more experiments. We then obtained 36 datasets.3 For each gene pair of the top 10 list, we conducted interaction test by using pairwise (binary) classes in each dataset and ranked them according to P-values of interaction test. Table 5 shows a list of datasets, each giving the lowest P-value for each gene pair of the 10 interactions in Table 4. For example, for COX6C and UBA1, the gene pair of the first interaction of Table 4, we found a switching mechanism in GDS2960_1 with the P-value of –3.9532, showing the statistical significance of this mechanism. This directly indicates that there must exist a switching mechanism in expression between these two genes under the alteration of experimental conditions which is specified by the annotation of GDS2960_1. In fact, Table 5 indicates that the switching mechanism happens between patients of Marfan syndrome and controls. This type of explanation is possible for all 10 GDSs in Table 5 by using annotations in this table. As well all P-values shown in Table 5 are small enough,4 supporting the reliability of the three-way interactions in Table 4 which our method detected. Furthermore, Figure 10 shows the real expression values of two genes, being categorized into two classes, for each GDS of Table 5. These orthogonal Contour plots also assist the reliability of three-way interactions that we detected in Table 4.


Figure 10
View larger version (37K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 10. (aj) Expressions of two genes which give the smallest P-value of interaction test in the corresponding GDS of GEO.

 

View this table:
[in this window]
[in a new window]

 
Table 5. Results of interaction test over the datasets from GEO

 
We further briefly checked the genes having SNPs in the first and the third interactions in Table 4: (i) the first interaction in Table 4 has two genes, COX6C and UBA1, which is controlled by a SNP in LARP4, i.e. La ribonucleoprotein domain family member 4. This gene was already known as an important gene in both AD and aging, being already pointed out that LARP4 increases expression with increasing AD progression and normal aging (Miller et al., 2008). As our focus was on 142 genes on neurodegenerative diseases including AD, the known function on LAPR4 is consistent enough with the interaction with COX6C and UBA1, being possibly in the switching mechanism. (ii) The gene with the SNP in the second interactions in Table 4 was a hypothetical one, but the third interaction has two genes, ATP5D and ITCH, being controlled by a SNP in RPS3AP5, which is a pseudogene of RPS3A, i.e. ribosomal protein S3A. This gene is known to be downregulated in the same manner as some genes in oxidative phosphorylation pathway (Welle et al., 2003), which includes ATP5D. Thus, these observations reveal the possibility that the third interaction also may exist as the switching mechanism in expression of two genes, i.e. ATP5D and ITCH.

Overall our extensive analysis has implied that the detected three-way interactions can exist. These results show the potential of our approach to explicate complex biological systems appearing in modern biology and medical sciences.


    5 CONCLUDING REMARKS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have presented a fast method for finding three-way gene interactions from transcript-and genotype-data and showed experimental results obtained by applying this method to ~3 x 108 human brain samples. In our experiments, we confirmed the three-way interactions that we found in various manners. Possible future work would be to apply our approach to various types of transcript- and genotype-data further to uncover three-way gene interactions, i.e. biological switches by genotypes.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
The authors would like to thank anonymous reviewers for their helpful comments and advice.

Funding: Grant-in-Aid for Young Scientists 20700134, 20700269, 21680025 from Ministry of Education, Culture, Sports, Science and Technology (MEXT, in parts); the Functional RNA Project of New Energy and Industrial Technology Development Organization (NEDO, in parts).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

1ANOVA can be applied only to the case with a single continuous response (phenotype) and one or more discrete explanatory variables (genotypes). Back

2The Supplementary Material shows annotations by Reactome (Vastrik et al., 2007) for interacting genes. Back

3In each GDS, if it has more than two classes or replicated experiments, we consider all possible pairwise combinations of them. We then name generated multiple datasets from one GDS (e.g. GDS2960) those like GDS2960_1, GDS2960_2, etc. This results in that the number of datasets we used could be >36. The actual number of datasets for each gene pair is shown in Table 5. Back

4For each gene pair, not only the dataset giving the top P-value but also 10 datasets providing the top 10 P-values are shown in the Supplementary Material. All P-values in the Supplement Material are small, showing the statistical significance of the switching mechanism of each gene pair. Back

Received on March 17, 2009; revised on August 6, 2009; accepted on August 25, 2009

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RELATED WORK
 3 METHODS
 4 EXPERIMENTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Ayer DE, Eisenman RN. A switch from myc:max to mad:max heterocomplexes accompanies monocyte/macrophage differentiation. Genes Dev. (1993) 7:2110–2119.[Abstract/Free Full Text]

    Aylor DL, Zeng Z-B. From classical genetics to quantitative genetics to systems biolog: modern epistasis. PLoS Genet. (2008) 4:e1000029.[CrossRef][Medline]

    Balding DJ. A tutorial on statistical methods for population association studies. Nat. Genet. (2006) 7:781–791.[Web of Science]

    Barrett T, et al. NCBI GEO: mining tens of millions of expression profiles.database and tools update. Nucleic Acids Res. (2007) 35:D760–D765.[Abstract/Free Full Text]

    Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum. Mol. Genet. (2002) 11:2463–2468.[Abstract/Free Full Text]

    Dixon AL, et al. A genome-wide association study of global gene expression. Nat. Genet. (2007) 39:1202–1207.[Medline]

    Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. (2001) New York: Springer.

    Kanehisa M, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. (2008) 36:D480–D484.[Abstract/Free Full Text]

    Li K-C, et al. A system for enhancing genome-wide coexpression dynamics study. Proc. Natl Acad. Sci. USA (2004) 101:15561–15566.[Abstract/Free Full Text]

    Lo S-H, et al. Discovering interactions among brca1 and other candidate genes associated with sporadic breast cancer. Proc. Natl Acad. Sci. USA (2008) 105:12387–12392.[Abstract/Free Full Text]

    Marchini J, et al. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. (2005) 37:413–417.[CrossRef][Web of Science][Medline]

    Mardia KV, et al. Multivariate Analysis. (1979) New York: Academic Press.

    McCarthy MI, Hirschhorn JN. Genome-wide association studies: Past, present and future. Hum. Mol. Genet. (2008) 17(Review Issue 2):R100–R101.[Free Full Text]

    McCullagh P, Nelder J. Generalized Linear Models. (1989) 2. Boca, Raton: Chapman & Hall CRC.

    Miller JA, et al. A systems level analysis of transcriptional changes in Alzheimer's disease and normal aging. J. Neurosci. (2008) 28:1410–1420.[Abstract/Free Full Text]

    Myers AJ, et al. A survey of genetic human cortical gene expression. Nat. Genet. (2007) 39:1494–1499.[CrossRef][Web of Science][Medline]

    Nica AC, Dermitzakis ET. Using gene expression to investigate the genetic basis of complex disorders. Hum. Mol. Genet. (2008) 17(Review Issue 2):R129–R134.[Abstract/Free Full Text]

    Schadt EE, et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol. (2008) 6:e107.[CrossRef][Medline]

    Vastrik I, et al. Reactome: a knowledge base of biologic pathways and processes. Genome Biol. (2007) 8:R39.[CrossRef][Medline]

    Welle S, et al. Gene expression profile of aging in human muscle. Physiol. Genomics (2003) 14:149–159.[Abstract/Free Full Text]

    Zhang J, et al. Extracting three-way gene interactions from microarray data. Bioinformatics (2007) 23:2903–2909.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrowOA All Versions of this Article:
25/21/2735    most recent
btp531v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Kayano, M.
Right arrow Articles by Mamitsuka, H.
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kayano, M.
Right arrow Articles by Mamitsuka, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?