Bioinformatics Advance Access originally published online on December 14, 2007
Bioinformatics 2008 24(3):404-411; doi:10.1093/bioinformatics/btm612
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model
Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building (MMC 303), Minneapolis, MN 55455-0378, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: It is a common task in genomic studies to identify a subset of the genes satisfying certain conditions, such as differentially expressed genes or regulatory target genes of a transcription factor (TF). This can be formulated as a statistical hypothesis testing problem. Most existing approaches treat the genes as having an identical and independent distribution a priori, testing each gene independently or testing some subsets of the genes one by one. On the other hand, it is known that the genes work coordinately as dictated by gene networks. Treating genes equally and independently ignores the important information contained in gene networks, leading to inefficient analysis and reduced power.
Results: We propose incorporating gene network information into statistical analysis of genomic data. Specifically, rather than treating the genes equally and independently a priori in a standard mixture model, we assume that gene-specific prior probabilities are correlated as induced by a gene network: while the genes are allowed to have different prior probabilities, those neighboring ones in the network have similar prior probabilities, reflecting their shared biological functions. We applied the two approaches to a real ChIP-chip dataset (and simulated data) to identify the transcriptional target genes of TF GCN4. The new method was found to be more powerful in discovering the target genes.
Contact: weip{at}biostat.umn.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
High-throughput microarray technologies have been widely used in genomic studies. For example, cDNA and Affymetrix arrays are used to monitor genome-wide gene expression, while ChIP-chip, a hybrid of chromatin immunoprecipitation and microarray technologies, has been used to quantify occupancy of genomic promoter regions by a transcription factor (TF). A common task in analyzing microarray data is to identify a subset of the genes satisfying certain user-specified requirements, for example, the genes with differential expression between two experimental conditions or between tumors and normal tissues, or the genes that are regulatory targets of a TF. The problem can be formulated as statistical hypothesis testing: a null hypothesis, such as equal expression (EE) or non-target of a TF, and an alternative one, usually the opposite of the null hypothesis, are specified for each gene; statistical hypothesis testing is conducted on each gene. Many statistical methods have been proposed for such a purpose, including SAM (Tusher et al., 2001), Empirical Bayes (EB) methods (Efron et al., 2001; Newton et al., 2001) and Normal mixture models (McLachlan et al., 2006; Pan et al., 2002). Most methods, including all the above cited, treat the genes equally and independently a priori. In particular, they ignore biological knowledge about gene functions and biological pathways. There have been increasing efforts recently to incorporate biological knowledge into statistical analysis of microarray data to gain statistical efficiency. For example, Pan (2005, 2006, 2006b) proposed a stratified mixture model to incorporate gene functions as annotated in the Gene Ontology (GO) database (Ashburner et al., 2000); Xiao et al. (2005) developed a Hidden Markov Model to incorporate genomic location information into analysis, accounting for spatial patterns of gene co-expression. Another class of emerging methods is to analyze gene sets, rather than individual genes; the gene sets are formed based on biological pathways or gene functional groups (Subramanian et al., 2005; Tian et al., 2005; Efron et al., 2007; Newton et al., 2007). Nevertheless, each of the gene-set methods treats the genes equally and independently a priori, and the results depend on the specification of gene sets: a too large or too small gene set may lead to reduced statistical power; in fact, each gene set can be regarded as a gene subnetwork. As biological knowledge accumulates rapidly, genome-wide gene networks represented by undirected graphs with genes as nodes and gene–gene interactions as edges have been constructed; see Lee et al. (2004) for a probabilistic approach to constructing functional networks for the yeast genome and Franke et al. (2006) for a human protein–protein interaction network. However, there has been very limited work attempting to incorporate genome-wide gene network information into statistical analysis of microarray data. An exception goes to the recent work by Wei and Li (2007) (see also references therein), who proposed integrating KEGG pathways or other gene networks into analysis of differential gene expression via a Markov Random Field (MRF) model.
Unlike directly modeling the state of each gene via a MRF (Wei and Li, 2007), we propose modeling some latent variables related to the prior probabilities of the genes' being in a state via MRFs, leading to a spatially correlated mixture model. The spatial mixture model was first proposed by Fernandez and Green (2002) in spatial statistics, and applied to analyze comparative genomic hybridization (CGH) data by Broet and Richardson (2006). This is in contrast to standard mixture models. For example, McLachlan et al. (2006) proposed using a standard two-component normal mixture model to identify differentially expressed genes; although promising results have been obtained, one key assumption of the standard normal mixture model is that all genes share the same prior probability of coming from a component of the normal mixture without regard to their biological functions. In the context of CGH data analysis, Broet and Richardson (2006) proposed using a spatially correlated normal mixture model to introduce gene-specific prior probabilities and allow those prior probabilities to be correlated among neighboring genes on a chromosome, and gained more power to identify gene copy number changes. Extending the work of Broet et al. from 1D chromosome locations to 2D gene networks, we propose using the spatially correlated normal mixture model to improve power in identifying differentially expressed genes (for cDNA, Affymetrix or any other expression arrays) or target genes of a TF (for ChIP-chip data). A key difference from Broet et al. is our utilizing existing biological knowledge databases, such as KEGG pathways (Kanehisa and Goto, 2000), or computationally predicted gene networks from integrated analysis (Lee et al., 2004), to construct gene functional neighborhoods and incorporating them into a spatially correlated normal mixture model. The basic rationale underlying the proposed model is that functionally linked genes tend to be co-regulated and co-expressed, which is thus incorporated into analysis.
The rest of this article is organized as follows. We first review the standard normal mixture model, then propose the spatially correlated normal mixture model. For illustration, we apply and compare the two methods using a ChIP-chip dataset to identify the target genes of TF GCN4. A simulated dataset is also used to demonstrate the advantage of the proposed method over the standard mixture model. Finally, we summarize our results and outline some future work.
| 2 METHODS |
|---|
|
|
|---|
2.1 Problem
The goal of analysis is to identify which genes satisfy a certain condition, such as being differentially expressed (DE) or being a TF's transcriptional targets. It can be formulated as a formal or informal hypothesis testing problem: for each gene i, we test for a null hypothesis H0,i against an alternative hypothesis H1,i, usually the opposite of H0,i. For example, H0,i is that gene i is equally expressed (EE) for expression data or that gene i is not a target of the TF for ChIP-chip data, while H1,i is the opposite of H0,i.
We assume that the data have been summarized as measurement Zi for each gene i, i = 1, ... , G; for example, Zi can be a test statistic measuring the relative abundance of mRNA (or TF), or the statistical significance level for rejecting H0,i. Define Ti = I(H0,i is false); that is, Ti = 1 or Ti = 0 corresponds to that H1,i or H0,i holds, respectively.
For our real data, we extracted a p-value for each gene, then transformed it to a Z-score and subsequently model the Z-scores (McLachlan et al., 2006). The transformation is given by zi =
–1 (1 – Pi), where
is the cumulative distribution function of the standard Normal distribution N(0, 1), and Pi is the P-value for gene i. If Pi is properly calculated as a genuine p-value, then the null distribution of zi is exactly the standard normal. In addition, the non-null distribution may model the right-tail of the Z-score distribution. The resulting two-component normal mixture model is
|
|
(·; µ,
2) is the probability density function of N(µ,
2). However, sometimes the null distribution of the z-scores is not standard normal due to approximate p-values (e.g. resulting from possible correlations among the genes, in contrary to the adopted independence assumption). In this situation, we need to estimate the null distribution N(µ0,
2.2 Gene networks
The types of gene networks that can be used here are not restrictive; it can be any network as long as the basic assumption holds: based on a gene network, two neighboring genes (i.e. two genes with an edge between them) are more likely to satisfy H0,i or H1,i at the same time than two non-neighboring genes. As stated before, a gene network can be extracted from existing biological databases, such as KEGG pathways, or any computationally predicted gene network, possibly resulting from integrated analysis of multiple sources of genomic data. In this article, we use the functional linkage network of yeast genes constructed by Lee et al. (2004) as an example. Lee et al. applied a naive Bayes method to assign a score to each possible gene pair by integrating a variety of genomic data, including mRNA co-expressions, gene co-citations, protein–protein interactions, gene fusions and phylogenetic profiles. Two genes with a score high enough are linked, suggesting that it is highly likely that they share some biological function. They obtained a gene network called ConfidentNet with high credibility. Represented by an undirected graph, the ConfidentNet consists of 4681 nodes (genes) and 34 000 edges (gene–gene functional linkages). A summary of the distribution of the number of direct neighbors is: minimum = 1, 25th-percentile = 2, median = 6, 75th-percentile = 13 and maximum = 188. We will use this yeast gene network in our real data example.
2.3 Standard normal mixture model
Suppose that the distribution functions of the data (e.g. Z-scores) for the genes with Ti = 1 and Ti = 0 are f1 and f0, respectively. Assuming that a priori all the genes have an identical and independent distribution (iid), we have a marginal distribution of zi as a standard mixture model:
|
| (1) |
0 is the prior probability that H0,i holds. It is worth noting that the prior probabilities are the same for all genes. The standard mixture model has been widely used in microarray data analysis (e.g. Efron et al., 2001; McLachlan et al., 2006; Newton et al., 2001; Pan et al., 2002).
The null and non-null distributions f0 and f1 can be approximated by finite normal mixtures:
and
, where
(µ,
2) is the density function for a Normal distribution with mean µ and variance
2. For Z-scores, using Kj = 1 often suffices (McLachlan et al., 2006). In our real data example, we found that K0 = 2 and K1 = 1 worked well.
The standard mixture model can be fitted via maximum likelihood with the EM algorithm (McLachlan and Peel, 2000). Once the parameter estimates are obtained, statistical inference is based on the posterior probability that H1,i holds: Pr(Ti = 1|zi) =
1 f1 (zi)/f(zi). Because the spatially correlated mixture model is fitted in a Bayesian framework while it is unclear how to fit it in a frequentist approach, to facilitate comparison, we fitted the standard mixture model in a similar Bayesian framework; see below for prior specifications.
2.4 Spatial normal mixture model
In a spatial normal mixture model, we introduce gene-specific prior probabilities
i, j = Pr(Ti = j) for i = 1, ... , G and j = 0, 1. Hence, the marginal distribution of zi is
|
| (2) |
Based on a gene network, we relate the prior probabilities
i, j to two latent Markov random fields xj = {xi, j; i = 1, ... , G} by a logistic transformation:
|
|
i}, depends only on its direct neighbors. More specifically, we have |
|
i is the set of indices for the neighbors of gene i, and mi is the corresponding number of neighbors. To allow identifiability, we impose
i xij = 0 for j = 0,1. In this model, the parameter
i, j's for those genes that are neighbors in the network. In summary, it is biologically reasonable to assume that the neighboring genes in a network are more likely to share biological functions and thus to participate in the same biological processes. Hence they should have similar prior probabilities of being DE or being targets of a TF at the same time.
2.5 Prior distributions
We largely followed Broet et al.'s prior specifications. For either mixture model, we use vague or non-informative prior distributions: µ0 = 0, µ1
N(0, 106)I(a, 0), a truncated normal distribution between a = mini zi and 0; µ2
N(0, 106)I(0, b) and b = maxi zi. The two truncated normals are constructed to ensure unique labeling of the normal mixture components. In addition, we have
for j = 0, 1, 2. See Section 3.2.3 for discussions on selection of hyperparameters in the Inverse Gamma distribution.
For the standard mixture model, (
0,
1,
2)
Dirichlet(1, 1, 1). For the spatial mixture model,
for j = 0, 1, 2. Note that the precision parameter,
, has Gamma(0.01, 0.01) with mean 1 and variance 100.
For completeness, the details of the model specifications and the WinBUGS codes for both spatial and standard normal mixture models are given in Wei and Pan (2007).
2.6 Inference
Each of the above mixture models can be readily implemented in WinBUGS (Spiegelhalter et al., 2003). The posterior mean of any parameter based on Markov Chain Monte Carlo (MCMC) samples is used as its point estimate. In particular, based on whether the point estimate
is smaller than a threshold c, we determine whether to reject H0. There is a correspondence between c and false discovery rate (FDR), which can be estimated based on
. In this article, we consider varying c, leading to various sensitivities and specificities and thus yielding an ROC plot.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Real data
3.1.1 Data
To evaluate the performance of our proposed method, we applied the two methods—standard normal mixture model, denoted as Independence model, and spatial normal mixture model, denoted as Spatial model - to a ChIP-chip dataset for transcription factor GCN4. A TF is a protein that binds to the promoter regions of its regulatory target genes and participates in the recruitment of RNA polymerase, thus regulating the transcription of its target genes into messenger RNA. ChIP-chip is a hybrid of chromatin-immunoprecipitation (ChIP) and microarray technology that is used to quantify the occupancy of genome-wide promoter regions by a TF. A typical ChIP-chip dataset contains log binding ratios measuring the relative abundances of the TF bound to the genes, and possibly inferred statistical significance levels (p-values) for rejecting the null hypothesis that each of the genes is not bound by the TF. As a TF, GCN4 is involved in response to amino acid starvation in yeast. Lee et al. (2002) did ChIP-chip experiments for GCN4 with three replicates. Log binding ratios and p-values for 6181 yeast genes were provided. Pokholok et al. (2005) constructed a set of 80 genes that are very likely to be the binding targets of GCN4 from multiple sources of data (including another set of more accurate ChIP-chip experiments based on a new generation of microarrays, a gene expression dataset, and DNA motif analyses), as well as a set of 900 genes that are unlikely to be regulated by GCN4. Treating the positive and negative control sets as the true positives and true negatives, we used them to evaluate the performance of the two methods; that is, sensitivity and specificity were calculated based on the two control sets. In addition, for the spatial normal mixture model, we used the yeast gene network constructed by Lee et al. (2004) to specify gene neighborhoods. After merging the ChIP-chip data set and the gene network, we ended up with a 4616-node network with 33 432 edges. We extracted those 4616 genes' binding p-values and obtained their Z-scores for final analysis; correspondingly, there were 66 and 770 genes in the positive and negative control sets, respectively.
For illustration, a subnetwork consisting of only the positive and negative control genes is shown in Figure 1, where dark nodes represent the target genes (positive controls) while blank ones as non-targets (negative controls). Several features are noticeable. First, there is a cluster of positive control genes in the upper-right corner, and there are quite a few clusters of negative control genes. Second, positive control genes can be connected with negative control genes. Third, although there are many singletons (i.e. isolated genes without edges to other genes) in the subnetwork, they are not necessarily singletons in the whole network because they may be connected to other genes outside the control sets. We used the full network.
|
3.1.2 Model fitting
WinBUGS (Spiegelhalter et al., 2004) was used to implement the Bayesian models. Posterior means of the parameters were computed based on 4000 MCMC samples after 6000 burn-ins. First, a standard two-component normal mixture model with an empirical null distribution (i.e. its mean and variance parameters were unknown and needed to be estimated) was fitted; lack-of-fit of the mixture model against the data was observed (results not shown). To achieve better model fit, a third normal component with mean imposed to be negative was added into f0; the fitted mixture model was
|
|
0 and the third one with positive mean as
1. A visual examination revealed improved goodness-of-fit except at the peak of the data distribution (Figure 2). Similarly, a three-component spatial normal mixture model with an empirical null was fitted, yielding |
|
|
|
3.1.3 Statistical power
The ROC curves were constructed for the two methods based on the positive and negative control sets. As shown in Figure 3(a), when the specificity ranged from 0.9–0.4, as desired in practice, the spatial normal mixture model gave a higher sensitivity than that of the standard normal mixture model. Hence, at a high specificity (e.g. above 0.5 as usually desired), by taking use of biological knowledge embedded in a gene network, the spatial normal mixture model had higher statistical power to detect the targets than did the standard mixture model that ignored biological knowledge.
In addition, we compared the positive control genes' ranks by the posterior probabilities from the spatial model and by the original binding p-values. As shown in Figure 5(b) of Wei and Pan (2007), most of the positive control genes were ranked in the top 100 by both methods, while the spatial model boosted a few more genes' ranks from moderate (ranked between 200–400) to relatively high (ranked between 100–200).
3.1.4 Representative gene evaluations
We examined several individual genes to gain more biological insights. First, for ARG8 (YOL140W), a gene in the positive control set, its posterior probability of being a target was estimated to be 0.728 by the spatial model and 0.023 by the independence model. The binding ratio for this gene in Lee et al.'s rich medium ChIP-chip experiment was 1.02. However, Harbison et al. (2004) did more ChIP-chip experiments on GCN4 in amino acid starvation and nutrition deprivation conditions besides rich medium, and the binding ratio for ARG8 was 5.0 with p-value 10–11. Because GCN4 is a transcriptional activator of amino acid biosynthetic genes in response to amino acid starvation, it is expected that genes involved in amino acid biosynthetic process are likely to be binding targets of GCN4. In fact, ARG8 is known to be involved in arginine and ornithine biosynthetic processes (Jauniaux et al., 1978; Pauwel et al., 2003), and is annotated in GO Biological Process: amino acid biosynthetic process (GO ID:0008652). Note that ARG8, located in the upper-right cluster of positive control genes in Figure 1, is a direct neighbor of four other positive control genes but none of the negative control genes. We conjecture that its positive neighbors explain why ARG8's gene specific prior probability of being a target was estimated to be 0.733 by the spatial model, in contrast to 0.058 by the independence model; the high prior probability boosted its posterior probability of being a target. Second, TRP5 (YGL026C) is a gene in neither control set, but it is a direct neighbor of ARG8. Its gene specific prior probability of being a target was estimated to be 0.716 by the spatial model, as compared to 0.058 by the independence model, and the posterior probability was estimated to be 0.723 by the spatial model and to be 0.032 by the independence model. The binding ratios for this gene were 1.15 and 1.21 in Lee et al.'s and Harbison et al.'s experiments, respectively. However, Beyer et al. (2006) computationally predicted TRP5 to be a binding target of GCN4 with a high confidence level by integrating multiple data sources. In addition, TRP5 is known to participate in trypophan biosynthetic process (Elion et al., 1991, Toyn et al., 2000), and also annotated in GO Biological Process: amino acid biosynthetic process (GO ID:0008652). Hence, it is a likely target of GCN4. Finally, we looked at a positive control gene, ICY2 (YPL250C). It has six direct neighbors in the network: two of them are in the negative control set and none of them are in the positive control set. Its prior and posterior probabilities of being a target were estimated to be 0.668 and 0.836, respectively by the spatial model; in contrast, the independence model gave estimates of the prior and posterior probabilities as 0.058 and 0.548, respectively. Two of its direct neighbors are negative control genes, ADY2 (YCR010C) and CRS5 (YOR031W), whose prior probabilities of being a target were estimated to be 0.08 and 0.12, respectively by the spatial model, as compared to 0.058 for both by the independence model; their posterior probabilities of being a target were 0.06, 0.09 respectively by the spatial model and 0.02, 0.02 respectively by the independence model. Therefore, although ICY2 is surrounded by non-target genes in the network, it was still identified as a binding target by the spatial model, while its neighboring negative control genes were not identified as false positives. In summary, by taking use of biological knowledge embedded in a gene network, the spatial normal mixture model had higher statistical power to detect the targets than did the standard mixture model while maintaining a reasonable specificity.
3.2 Simulated data
3.2.1 Simulation set-up.
To further investigate the property of our proposed method, we conducted a simulation study that mimicked real data: we simulated a gene network similar to the one used for the real data, and used data-generating distributions similar to the ones fitted to the real data. We used Wei and Li's discrete local MRF model to generate the true binding (or differential expression) states for simulated data. Suppose for a network of G genes, we have the binding state vector y = (y1, y2, ... , yG), which is modeled by a MRF with parameter
= (
0,
1, β). More specifically, we have
|
|
|
| (3) |
To generate simulated data, for simplicity we first removed seven singletons from the yeast gene network and ended up with a 4609-node and 33 432-edge network. Second, to simulate y, the latent binding states, we initialized the 66 genes in the positive control set to be binding targets and the rest of genes to be non-targets, giving an initial y. Then we iterated the states 20 times based on Equation (3), with
0 = 1,
1 = 1, β = 2. It turned out that the number of binding targets became stable at
170 after 10 iterations, and we chose the states to be the ones right after the 10th iteration, giving 183 binding targets. Hence, the generated true state vector y was only an approximation to a MRF, lending the opportunity to investigate the robustness of the spatial mixture model. Note that our spatial mixture model assumes an exact MRF for the latent variables x related to prior probabilities, not a MRF for the latent binding states y, giving our model another source of model mis-specification. Next, given y, we simulated the Z-scores according to the fitted spatial mixture model from the real data; for simplicity, we only used the null and positive components, i.e.
(0, 0.632) and
(0.75, 1.532).
3.2.2 Simulation results.
We simulated 5 datasets according to the above procedure, and fitted the two-component spatial mixture model and standard mixture model as described in the previous section. To compare their performance, we plotted the ROC curves for these two methods as shown in Figure 3(b). Curves in the same color (or gray level) are for the same simulated dataset. Note that for each pair, the spatial mixture model won for any given specificity ranging from 10%– 95%. The average gain of sensitivity was
10% at specificity 80%, while the average gain could reach 20% at specificity 50%. It was confirmed that again the spatial mixture model gave a much higher sensitivity at a given specificity as compared to the standard independence mixture model.
3.2.3 Sensitivity analysis.
Because of incomplete biological knowledge, it is likely that a gene network contains false positive edges while missing some true ones. To evaluate the impact of misspecified network structures, we perturbed the network generated in the simulation. More specifically, we perturbed the network in three ways. In scenario 1, we randomly removed 5% (1672) edges from the original 33 432-edge network, and it resulted in 46 singletons. We eliminated those singletons by randomly connecting each of them to another gene and ended up with a 31 806-edge network. In scenario 2, we randomly added 1672 edges to the original network, and thus had a 33 432-edge new network. Third, we removed the same set of 1672 edges as in scenario 1 from, then added the same set of 1672 edges as in scenario 2 to the original network; further more, we eliminated 20 singletons by randomly connecting each of them of another gene, ending up with a 33 452-edge network. We applied the three perturbed networks as well as the true network to one of the aforementioned simulated datasets, and constructed the corresponding ROCs as shown in Figure 3(c). We see that removing a small percentage of edges did not seem to affect the results much (first scenario), while adding a small percentage of edges affected the results a bit more (second scenario). When the network contains both false negative and false positive edges as in the third scenario, the results seemed to be affected most substantially. Nevertheless, even in the third scenario the spatial model performed no worse than the independence model. Consequently we conclude that our proposed spatial model is reasonably robust to misspecified networks.
We also investigated how different hyperparameters of the prior distributions might influence the analysis results. In our current hyperparameter set-up, we imposed a moderately informative prior on the variance/precision parameters of the normal mixture components, i.e. the precision parameters had Gamma(0.1, 0.1) prior distribution with mean 1 and variance 10, while other parameters had almost flat priors. We tried an almost non-informative prior distribution, Gamma(0.0001, 0.0001), on the precision parameters as an alternative. This Gamma distribution has mean 1 and variance 10000. Congdon (2001) pointed out that if p(
)
Gamma(0.0001, 0.0001), the prior of
will be approximately p(
)
1/
, which is known as Jeffrey's prior and is a form of reference prior intended to reflect our ignorance about the parameter. We fitted the spatial model with these two hyperparameter set-ups, Gamma(0.1, 0.1) and Gamma(0.0001, 0.0001), to the same simulated dataset as used in the previous paragraph. The ROC curves are shown in Figure 3(d). We see that the two independence model curves are tied together all the way, while the two spatial model curves are coupled first and then separated a bit; either of the two curves from the spatial model is well above those from the independence model. In addition, the posterior distributions of the key parameters from the two spatial models were also compared and they were all very close (Tables 1 & 2 of Wei and Pan 2007).
| 4 DISCUSSION |
|---|
|
|
|---|
In this article, in contrast to the standard mixture model that treats the genes equally and independently a priori, we have proposed a spatially correlated mixture model that allows incorporating gene network information into statistical modeling of complex inter-relationships among the genes. As expected, by borrowing information from a gene network to account for coordinated functioning of the genes, the new method has potential to improve statistical power for new discoveries with high-throughput genomic data. An application to a ChIP-chip dataset and simulated data demonstrated the utility and advantage of the proposed method.
The proposed approach is in line with current efforts in integrating biological knowledge and multiple types of data (Dopazo, 2006): the gene network being used can be extracted directly from existing biological databases, e.g. KEGG pathways, or can be computationally predicted by integrative analysis of multiple types of data, as the gene function network constructed by Lee et al. (2004) and used in our example. In addition, the proposed spatial model covers the prior probability specification in a stratified mixture model for incorporating gene functional annotations (Pan, 2005; 2006; 2006b) as a special case; in the latter, a corresponding gene network may be regarded as the follows: each functional group is a subnetwork consisting of the genes fully connected to each other, while there is no connection between any two functional groups, and the smoothing parameter
Cj in the ICAR model is small enough (e.g.
Cj = 0) to induce a constant prior probability for the genes within the same functional group. Of course, the special case may be too restrictive: for example, first, the genes within the same functional group may play different roles, not necessarily sharing the same prior probability; second, different gene functional groups may interact to each other, possibly through some genes with multiple functions. On the other hand, our idea differs from existing approaches to using gene networks (Wei and Li, 2007, and references therein). For example, although there is a conceptual similarity between ours and that of Wei and Li (2007) in the use of MRF to account for correlations among the genes, we model the prior probabilities of genes' being in a certain state, in contrast to the underlying true states as modeled by Wei and Li, via a MRF. As a consequence, by the general result of the consistency of the posterior probability relative to a specified prior distribution, we expect that our method is more robust to misspecified gene networks than that of Wei and Li, which may be more efficient if the gene network is correctly specified; due to incomplete biological knowledge and prediction errors, it seems unlikely that a gene network can be specified completely correctly. In addition, we use a Normal distribution for each component of the mixture model while Wei and Li adopted the Gamma-Gamma model, although other parametric models in either can be adopted; we use the Gaussian MRF for the latent variables related to prior probabilities, in contrast to the binary MRF for the latent gene binding or DE states. They have implications on the resulting estimation procedures: ours is fully Bayesian while Wei and Li used a pseudo-likelihood method, the iterative conditional mode (ICM) algorithm (Besag, 1986). Further studies to investigate the operating characteristics of our proposal, including comparing its performance with others, and to extend our proposal to more complex settings (e.g. Lewin et al. 2006) will be interesting.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This research was partially supported by NIH grant HL65462 and a UM AHC Faculty Research Development grant. The authors thank Stuart Levine and Rick Young for sharing the binding data. The authors thank the reviewers for helpful and constructive comments.
Conflict of interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Chris Stoeckirt
Received on September 18, 2007; revised on November 19, 2007; accepted on December 7, 2007
| REFERENCES |
|---|
|
|
|---|
Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet (2000) 25:25–29.[CrossRef][Web of Science][Medline]
Beyer A, et al. Integrated assessment and prediction of transcription factor binding. PLoS Comput. Biol (2006) 2:e70.[CrossRef][Medline]
Besag J. On the statistical analysis of dirty pictures. J. Roy. Stat. Soc.: B (1986) 48:259–302.
Besag J, Kooperberg C. On conditional and intrinsic autoregressions. Biometrika (1995) 82:733–746.
Broet P, Richardson S. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics (2006) 22:911–918.
Congdon P. Bayesian Statistical Modelling. (2001) Chichester: Wiley.
Dopazo J. Functional Interpretation of Microarray Experiments. OMICS: J. Int. Biol (2006) 10:398–410.[CrossRef]
Efron B, et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc (2001) 96:1151–1160.[CrossRef][Web of Science]
Efron B, Tibshirani R. On testing the significance of sets of genes. Ann. Appl. Stat (2007) 1:107–129.[CrossRef]
Elion EA, Brill JA, Fink GR. FUS3 represses CLN1 and CLN2 and in concert with KSS1 promotes signal transduction. (1991) 88. Proceedings of National Academy of Science. 9392–9396.
Fernandez C, Green P. Modelling spatially correlated data via mixtures: a Bayesian approach. J. Roy. Stat. Soc.: B (2002) 64:805–826.[CrossRef]
Franke L, van Bakel H. Reconstruction of a Functional Human Gene Network, with an Application for Prioritizing Positional Candidate Genes. Am. J. Hum. Genet (2006) 78:1011–1025.[CrossRef][Web of Science][Medline]
Harbison CT, et al. Transcriptional Regulatory Code of a Eukaryotic Genome. Nature (2004) 431:99–104.[CrossRef][Medline]
Jauniaux JC, et al. Arginine metabolism in Saccharomyces cerevisiae: subcellular localization of the enzymes. J. Bacteriol (1978) 133:1096–1107.
Kanehisa M, Goto S. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res (2000) 28:27–30.
Lee I, et al. Probabilistic Functional Network of Yeast Genes. Science (2004) 306:1555–1558.
Lee TI, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science (2002) 298:799–804.
Lewin A, et al. Bayesian modeling of differential gene expression. Biometrics (2006) 62:1–9.[Web of Science][Medline]
McLachlan GJ, et al. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics (2006) 22:1608–1615.
McLachlan GJ, Peel D. Finite Mixture Models. (2000) New York: Wiley.
Newton MA, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol (2001) 8:37–52.[CrossRef][Web of Science][Medline]
Newton MA, et al. Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann. Appl. Stat (2007) 1:85–106.[CrossRef]
Pan W, et al. Model-Based Cluster Analysis of Microarray Gene Expression Data. Genome Biol (2002) 3. research0009, 1–8.
Pan W. Incorporating Biological Information as a Prior in an Empirical Bayes Approach to Analyzing Microarray Data. Stat. Appl. Genet. Mol. Biol (2005) 4. Article 12.
Pan W. Incorporating gene functional annotations in detecting differential gene expression. Appl. Stat (2006) 55:301–316.
Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics (2006b) 22:795–801.
Pauwels K, et al. The N-acetylglutamate synthase/N-acetylglutamate kinase metabolon of Saccharomyces cerevisiae allows co-ordinated feedback regulation of the first two steps in arginine biosynthesis. Eur. J. Biochem (2003) 270:1014–1024.[Web of Science][Medline]
Pokholok DK, et al. Genome-wide Map of Nucleosome Acetylation and Methylation in Yeast. Cell (2005) 122:517–527.[CrossRef][Web of Science][Medline]
Spiegelhalter D, et al. WinBUGS User Manual, Version 1.4. (2003) Available at http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/manual14.pdf.
Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. (2005) 102. Proceedings of National Academy of Science. 15545–15550.
Tian L, et al. Discovering statistically significant pathways in expression profiling studies. (2005) 102. Proceedings of National Academy of Science. 13544–13549.
Toyn JH, et al. A counterselection for the tryptophan pathway in yeast: 5-fluoroanthranilic acid resistance. Yeast (2000) 16:553–560.[CrossRef][Web of Science][Medline]
Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. (2001) 98. Proceedings of National Academy of Science. 5116–5121.
Wei P, Pan W. Incorporating Gene Networks into Statistical Tests for Genomic Data via a Spatially Correlated Mixture Model. In: Research Report 2007–032. (2007) University of Minnesota, Division of Biostatistics. Available at http://www.biostat.umn.edu./rrs.php.
Wei Z, Li H. A Markov Random Field Model for Network-based Analysis of Genomic Data. Bioinformatics (2007) 23:1537–1544.
Xiao G, et al. Improved detection of differentially expressed genes through incorporation of gene locations. In: Research Report 2005–028. (2005) University of Minnesota, Division of Biostatistics. Available at http://www.biostat.umn.edu/rrs.php.
This article has been cited by other articles:
![]() |
S. R. Ramakrishnan, C. Vogel, T. Kwon, L. O. Penalva, E. M. Marcotte, and D. P. Miranker Mining gene functional networks to improve mass-spectrometry-based protein identification Bioinformatics, November 15, 2009; 25(22): 2955 - 2961. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Li, Z. Wei, and J. Maris A hidden Markov random field model for genome-wide association studies Biostat., October 12, 2009; (2009) kxp043v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Pan Network-based multiple locus linkage analysis of expression traits Bioinformatics, June 1, 2009; 25(11): 1390 - 1396. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. Ulitsky and R. Shamir Identifying functional modules using expression profiles and confidence-scored protein interactions Bioinformatics, May 1, 2009; 25(9): 1158 - 1164. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Noirel, G. Sanguinetti, and P. C. Wright Identifying differentially expressed subnetworks with MMG Bioinformatics, December 1, 2008; 24(23): 2792 - 2793. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Li and H. Li Network-constrained regularization and variable selection for analysis of genomic data Bioinformatics, May 1, 2008; 24(9): 1175 - 1182. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Sanguinetti, J. Noirel, and P. C. Wright MMG: a probabilistic tool to identify submodules of metabolic pathways Bioinformatics, April 15, 2008; 24(8): 1078 - 1084. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




