Skip Navigation


Bioinformatics Advance Access originally published online on May 5, 2007
Bioinformatics 2007 23(12):1537-1544; doi:10.1093/bioinformatics/btm129
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/12/1537    most recent
btm129v2
btm129v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wei, Z.
Right arrow Articles by Li, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wei, Z.
Right arrow Articles by Li, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A Markov random field model for network-based analysis of genomic data

Zhi Wei 1 and Hongzhe Li 1,2,*

1Genomics and Computational Biology Graduate Group and 2Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: A central problem in genomic research is the identification of genes and pathways involved in diseases and other biological processes. The genes identified or the univariate test statistics are often linked to known biological pathways through gene set enrichment analysis in order to identify the pathways involved. However, most of the procedures for identifying differentially expressed (DE) genes do not utilize the known pathway information in the phase of identifying such genes. In this article, we develop a Markov random field (MRF)-based method for identifying genes and subnetworks that are related to diseases. Such a procedure models the dependency of the DE patterns of genes on the networks using a local discrete MRF model.

Results: Simulation studies indicated that the method is quite effective in identifying genes and subnetworks that are related to disease and has higher sensitivity and lower false discovery rates than the commonly used procedures that do not use the pathway structure information. Applications to two breast cancer microarray gene expression datasets identified several subnetworks on several of the KEGG transcriptional pathways that are related to breast cancer recurrence or survival due to breast cancer.

Conclusions: The proposed MRF-based model efficiently utilizes the known pathway structures in identifying the DE genes and the subnetworks that might be related to phenotype. As more biological networks are identified and documented in databases, the proposed method should find more applications in identifying the subnetworks that are related to diseases and other biological processes.

Contact: hongzhe{at}mail.med.upenn.edu or hli{at}cceb.upenn.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Identification of genes and pathways involved in diseases and other biological processes is one of the important problems in genomic research. Microarray technology makes it possible to measure the expression levels of almost all human genes and therefore facilitate the identification of genes and pathways that are related to disease initiation and development. In a typical experiment, several phenotypes are compared, with a certain number of biological replicates for each phenotype. The goal is to identify the differentially expressed (DE) genes among the different phenotype groups.

There are many novel statistical methods that have been developed for identifying the DE genes. A general approach is to conduct a hypothesis test at each gene and then correct for multiple testing. Most of the statistics used are t statistics and differ primarily in the estimation of the variance (Dudoit et al., 2002; Tusher et al., 2001). Other methods include the empirical Bayes methods that can effectively pool data from different genes (Efron et al., 2001; Kendziorski et al., 2003; Lonnstedt and Speed, 2002; Newton et al., 2003). These DE genes identified are often linked to a predefined list of groups of genes such as known pathways in order to identify which groups include more DE genes than expected by chance using a hypergeometric distribution. Alternatively, the gene set enrichment analysis (GSEA) (Subramanian et al., 2005; Tian et al., 2005) can be used. For such a GSEA, one starts with a predefined list of groups of genes and assigns every such group a score that is essentially the average of the univariate test statistics of its member genes. The groups with high scores are more likely to be DE and P-values can be obtained by permutation methods.

One limitation of the commonly used methods of identifying the DE genes or the GSEA is that network structures are not utilized in the analysis. However, the interaction network is a more precise way to represent information than lists of genes or pathways, as it describes which genes are closely connected within a given pathway. Hence, it has the potential to detect more subtle changes of gene expressions, such as local disturbances within known pathways. Rahnenführer et al. (2004) demonstrated that the sensitivity of detecting relevant pathways can be improved by integrating information about pathway topology. In Sivachenko et al. (2005), a network topology extracted from literature was used jointly with microarray data to find significantly affected pathway regulators. Nacu et al. (2006) proposed an interesting permutation-based test for identifying subnetworks from a known network of genes that are related to phenotypes. The method is essentially based on a spatial scan statistic treating genes collected on the networks as neighbors. However, their method does not explicitly utilize the dependency of gene differential expression patterns on the network. Rapaport et al. (2007) proposed to first smooth the gene expression data on the network based on the spectral graph theory and then to use the smoothed data for classification. The method explicitly assumes that the true gene expression levels should be similar for genes that are neighbors on the networks. However, this assumption may be questionable due to both activating and inhibiting effects of gene regulations.

Markov random field (MRF) models have been widely used in image analysis in order to account for the local dependency of the observed pixel intensities (Besag, 1986) and have also been applied for functional prediction of proteins in order to account for the local dependency of protein functions in the protein–protein interaction networks (Deng et al., 2002, 2004; Letovsky and Kasif, 2003). The MRF model has also been applied for discovering molecular pathways from protein interaction and gene expression data (Segal et al., 2003). In this article, we propose to develop a MRF-based method for identifying the DEs and the subnetworks that are related to the phenotypes, where the MRF model is used to capture the dependency of the differential expression patterns for genes on the networks. Our method combines the two-group empirical Bayes method of Kendziorski et al. (2003) and Newton et al. (2003) with a MRF model to model the dependency of the DE patterns. In our model and those of Kendziorski et al. (2003) and Newton et al. (2003), each gene is either DE or equally expressed. Those genes which are Equally Expression (EE) present data according to some background Gamma distribution, and those which are DE present data according to a different distribution. The specific forms of these distributions arise by another layer of mixing over the latent mean expression level for each gene and the latent means follow another Gamma distribution. Such empirical Bayes approaches allow a level of information sharing amongst genes.

The rest of the article is organized as follows. We first introduce the model assumptions and a MRF model. We then provide an iterative conditional mode algorithm (ICM) of Besag (1986) for parameter estimation and for identifying the DE genes. We present simulation studies to demonstrate the methods and to compare the results with other commonly used methods for DE gene identification. Finally, we present results of applying the proposed method to two breast cancer gene expression datasets in order to identify the genes and the subnetworks that are related to breast cancer recurrence or death due to breast cancer. We conclude the article with a brief discussion of the results and methods.


    2 STATISTICAL MODELS AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Notation and assumptions
Given microarray gene expression profiling data under two conditions, we want to determine which genes are differentially expressed. Each gene can have two states, labeled 0 and 1, representing EE and DE, respectively. An arbitrary state assignment of gene set S will be denoted by Formula , where xi is the corresponding state of gene i and is 1 if gene i is differentially expressed (i.e. DE) and 0 otherwise. We write x* for the true but unknown gene state and interpret this as a particular realization of a random vector Formula , where Xi assigns state to gene i. Let yi denote the observed mRNA expression level of gene i and y the corresponding vector, interpreted as a realization of a random vector, Formula , where Yi itself is a vector Formula , composed of the m replicates under one condition and n replicates for the other. We further introduce the notation Formula and Formula .

In order to specify the joint distribution of Y, we make the following assumptions:

Assumption 1.
Given any particular realization x, the random variables Formula are conditionally independent and each Yi has the same unknown conditional density function Formula , dependent only on xi. The conditional density of the observed gene expression y, given, x, is simply,


Formula

Assumption 2.
The true state Formula is a realization of a locally dependent discrete MRF with a specified distribution Formula , which is defined in the next section.

2.2 Gamma–Gamma model for gene expression data
In this section, we first briefly review the Gamma–Gamma model for gene expression data introduced in Newton et al. (2003) and Kendziorski et al. (2003). We define Formula , which characterizes fluctuations in repeated measurements under the same condition from a gene i having latent mean expression level µi, and Formula , which describes fluctuations in these means among genes. We assume that the observation yi is a sample from a gamma distribution having shape parameter {alpha} > 0 and a mean value µi; thus, with scale parameter Formula . The corresponding density function can be written as


Formula

for measurement y > 0. Note that the coefficient of variation in this distribution is Formula , taken to be constant across genes i. Following Newton et al. (2003), we take Formula to be an inverse gamma. More specifically, fixing {alpha}, the quantity Formula has a gamma distribution with shape parameter {alpha}0 and scale parameter v. Let Formula be the parameters used to specify these two distributions. The joint predictive density for the replicates Formula of gene i under the same condition is


Formula

Under this general model, we have for the first condition


Formula

where


Formula

and for the second condition


Formula

where


Formula

Therefore, given the differential expression state xi, we have


Formula

where


Formula

Then based on assumption 1, the conditional density of all p genes can be written as


Formula 1

(1)

2.3 Discrete local MRF model for joint differential expression states
The gene differential expression states Formula are not independent. For example, if a gene is DE, it is more likely that its upstream regulators are also DE and that these regulators in turn affect their downstream-regulated genes. In order to explicitly account for such dependency of DE patterns over genes on the networks, we propose to use the known biological network information compiled in the form of pathways. Examples of such pathways include the KEGG pathway (Kanehisa and Goto, 2002) and BioCyc pathways (http://biocyc.com/). In our model, the network is expressed as an undirected graph with the nodes for genes and edges for connections on the network. Consider the p genes on the network, let Formula be the vector of unobserved differential expression states for the p genes. We propose to model the dependency of Formula using a MRF with parameter Formula . Specifically, we assume


Formula

where Formula is the number of genes at state 0, Formula is the number of genes at state 1 and Formula is the number of edges linking two genes with different states. The {gamma}0 and {gamma}1 are arbitrary parameters and we require that ß > 0, which discourages neighboring genes to have different DE states. By considering any two realizations which differ only at gene i, it follows that the conditional probability of state k occurring for gene i, given the states of all other genes is


Formula 2

(2)
where Formula denotes the number of neighbors of gene i having state Formula , k = 0, 1 (Besag, 1986). Maximum likelihood estimation of {Phi}, however, is computationally intractable. In general, it is the constant of proportionality in Formula which cannot be evaluated. A simple alternative to maximum likelihood estimation for a local MRF is provided by the ‘coding method’ (Besag, 1986), where the estimate Formula is chosen to maximize the conditional likelihood,


Formula 3

(3)
where Formula represents the neighbors of gene i.

In order to account for different numbers of neighbors for different genes on the network, we propose to modify the conditional probability (2) as


Formula 4

(4)
where Formula for k = 0, 1 and di is the number of neighbors for the ith gene. The conditional likelihood function (3) can be modified accordingly by replacing Formula with Formula .

2.4 Parameter estimation using ICM and identification of subnetworks
When inferring the true differential expression state x* for the p genes, the parameter estimation must be carried out simultaneously. We propose the following algorithm based on the ICM algorithm of Besag (1986) to estimate the parameter {theta} in the Gamma–Gamma model for gene expression data and the parameter {Phi} in the MRF model. The algorithm involves the following iterative steps:

  1. Obtain an initial estimate Formula of the true state x*, using a simple two sample t-test.
  2. Estimate {theta} by the value Formula which maximizes the likelihood Formula [see Equation (1)].
  3. Estimate {Phi} by the value Formula which maximizes the conditional likelihood Formula [see Equation (3)] based on current Formula .
  4. Carry out a single cycle of ICM based on the current Formula and Formula , to obtain a new Formula . Specifically, for i = 1 to p, update xi which maximizes


    Formula

    subject to xi = 1 or xi = 0.

  5. Go to step 2 for a fixed number of cycles or until approximate convergence of Formula .

The converged Formula are then taken to be the estimate of the true DE states. These estimates can then be mapped back to the network to identify the subnetworks, which are defined as those connected genes that show DE between the two experimental conditions.


    3 SIMULATION STUDIES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We first present simulation results to demonstrate our proposed methods. To simulate the data, we first obtained the network structure of 33 human regulatory pathways from the KEGG database (December 2006 version). We are only interested in gene–gene regulatory relations and any non-gene–gene interactions, e.g. compound-gene relations, compound–compound relations, were excluded from our analysis. The remaining gene–gene regulatory data are represented as an undirected graph where each node is a gene and two nodes are connected by an edge if there is a regulatory relation between them. Loops (nodes connected to themselves) were eliminated. This results in a graph with 1668 nodes and 8011 edges.

To simulate X, the gene expression states, we initialized the genes in the K pathways to be DE and the rest of genes to be EE, which gives us the initial Formula . Then we performed sampling five times based on Formula , according to Equation (3), with Formula . We chose Formula , so that we have different percentages of genes in DE states. Next, given X, we simulated the gene expression level Y according to GG model [Equation (1)] for 1668 genes in two conditions, having three replicates in each condition. We took model parameters similar to those in Newton et al. (Formula ). Simulations were repeated 100 times to assess the sensitivity, specificity, and false discovery rates (FDRs) of the proposed MRF GG model (MRFGG). We used the conditional probability (2) and the conditional likelihood (3) in our analysis. As a comparison, we also performed analysis on the simulated datasets using the standard two sample t-test which does not consider any prior information at all, and the empirical Bayesian GG model of Kendziorski et al. (2003).

The results over 100 replications are presented in Table 1, where the sensitivity is calculated as the average over the 100 replications of the fraction of DE genes correctly identified by the method; specificity is the average of the EE genes correctly identified; and the FDR is the average of the ratio of the number of false positives to the number of the genes identified as DE. For t-tests, a cut-value of 0.05 was used for declaring a gene to be the DE. We observed that overall specificity is high for all three procedures and the MRFGG model resulted in higher sensitivity than the GG model while the FDRs are similar. As expected, using a P-value of 0.05 can result in substantially higher FDRs. On the other hand, if the FDR controlling procedure of Benjamini and Hochberg (1995) was used, the two sample t-test resulted in very low sensitivity. The gain in sensitivity over the GG model is greater when there are more DE genes. These results demonstrated that by incorporating the network structure information, we can indeed gain sensitivity in identifying the DE genes.


View this table:
[in this window]
[in a new window]

 
Table 1. Comparison of performance for the proposed MRF approach (MRFGG), the Gamma–Gamma model (GG) of Kendiziorski et al. and standard two-sample t-test applied to the simulated data

 

    4 APPLICATION TO REAL DATASETS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We present results from application of the proposed methods to two breast cancer microarray gene expression studies in order to identify the subnetworks that are related to breast cancer metastasis or survival from breast cancer. We used the modified conditional probability (4) and the corresponding conditional likelihood function in the following analyses.

4.1 Application to the breast cancer gene expression dataset of Wang et al.
Wang et al. (2005) reported a large Affymetrix-based gene expression profiling for 286 patients with lymph-node-negative primary breast cancer. These patients were treated between 1980 and 1995 with age at surgery ranging 26–86 years and a median age at surgery of 52 years. No patient received any adjuvant therapy. During the follow-up period, 179 of these patients were relapse-free at 5 years, and 107 of them developed distant metastasis. Gene expression profiling using Affymetrix HG-133A was performed on all these patients, including 17 819 transcripts that were present in two or more samples. We merge the gene expression data with the 33 KEGG regulatory pathways and identified 1533 genes on the U133A array that can be found in the 1668-node KEGG network with 8011 edges. Our goal is to identify which genes and which subnetworks of the KEGG network of 33 pathways are related to breast cancer metastasis.

Two-sample t-tests identified only eight DE genes for FDR of 0.05 using the Benjamini and Hochberg's procedure. As a comparison, our proposed procedure identified 72 DE genes. The parameter estimates were Formula and ß = 1.24 in the MRF model. Figure 1 shows 17 of these genes that are mapped to the KEGG pathways, where the largest connected subnetwork includes six genes on the Cytokine–cytokine receptor interaction pathway. This subnetwork, centered around interleukin 8 receptor Beta (IL8RB), is down-regulated in cancer with relapse. In addition, the chemokine receptor (CCR6) and chemokine (C-C motif) ligand 20 (CCL20) are also down-regulated in cancers with relapse, indicating that a chemokine pathway is down-regulated in cancers with relapse. In addition, CCL20/CCR6 involvement in the neoplastic progression and metastatic spread was reported in several tumor types (Rubie et al., 2006), including breast cancer metastasis (Muller et al., 2001).


Figure 1
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Results from analysis of gene expression dataset of Wang et al. (2005). DE genes identified by the MRFGG method linked to the KEGG pathways, where genes in light grey are over-expressed and those in dark grey are under-expressed in cancer cells with breast cancer metastasis.

 
Another subnetwork includes four genes on the Wnt signaling pathway, including WNT4, WNT11 and secreted frizzled-related protein 1 (SFRP1), which were down-regulated in cancer with metastasis. SFRPs are secreted Wnt antagonists that directly interact with the Wnt ligand to inhibit signaling and members of the SFRP class bind directly to Wnts, thereby altering their ability to bind to the Wnt receptor complex. In particular, the SFRP1 gene is found at chromosome 8p21, a site of frequent loss of heterozygosity in human tumors and is down-regulated in cervical carcinoma, breast carcinoma and ovary and kidney carcinomas (Shulewitz et al., 2006). Wnt5b partially inhibits the canonical Wnt/beta-catenin signaling pathway. These findings agree with current knowledge of the involvement of the Wnt signaling pathway in breast cancer progression (Barker and Clevers, 2006).

We also found that the fibroblast growth factor 3 (FGF-3) and fibroblast growth factor 14 (FGF-14) are up-regulated in breast cancers with metastasis, while the fibroblast growth factor receptor-4 (FGFR4) is down-regulated. These three genes are connected on the MAPK signaling pathway. The four closely related human FGFRs and their more than 20 known ligands control a multitude of cellular processes, including cell growth, differentiation and migration, and it has been shown that the FGF/FGFR system plays a critical role in cancer development due to its angiogenic potential or direct enhancement of tumor growth (Burke et al., 1998).

Finally, recent study by Souaze et al. (2006) supports the contribution of neurotensin receptor (NTSR) in human breast cancer progression and pointed out the utility to develop therapeutic molecules targeting the neurotensin or NT1 receptor signaling cascade. We found that NTS and NTSR in the neuroactive ligand–receptor interaction pathway are linked and are both up-regulated in breast cancer with metastasis.

4.2 Application to the breast cancer gene expression dataset of Miller et al.
Miller et al. (2005) reported a gene expression profiling study of 251 primary breast cancer tissues resected in Uppsala County, Sweden from January 1, 1987 to December 31, 1989, using Affymetrix Chip HG-133A and HG-133B (GEO Accession No. GSE3494 [NCBI GEO] ). The authors identified an expression signature for p53 which can be used for predicting the mutation status, transcriptional effects, and patient survival. Among these patients, 236 of them had follow-up information in terms of time and event of disease-specific survival. Different from the previous dataset, these patients included both lymph-negative and lymph-positive patients.

In single gene analysis using t-tests, we obtained only four genes that are DE for FDR = 0.05 using the Benjamini and Hochberg's FDR procedure. The GG methods of Kendziorski et al. (2003) identified 82 genes and our MRFGG method identified 103 genes. The parameter estimates were Formula and ß = 0.55 in the MRF model. Figure 2 shows several of the connected subnetworks that were identified by the MRFGG method. Similar to the previous example, we found the genes in the WNT pathway (WNT4, SFRP4 and FZD7), genes related to the chemokine pathway (CCR7 and CCL19) and neurotensin and its receptor (NTS and NTSR2) are related to survival from breast cancer. We also found that Caveolin-1, Caveolin-2 (CAV1 and CAV2) are down-regulated in cancers in patients who died of breast cancer. A recent study by Sagara et al. (2004) indicated that a reduced CAV1 mRNA level was significantly associated with increasing tumor size and negative estrogen receptor status. There was also a significant association between low CAV2 mRNA level and negative progesterone receptor status. Sagara et al. (2004) further indicated that CAV1 suppression correlated closely with that of CAV2 in breast cancer, that CAV1 level was inversely correlated with tumor size and that CAV1 and CAV2 levels were correlated with hormonal receptor status. Therefore, CAV1 and CAV2 play an important role in tumor progression in breast cancer patients.


Figure 2
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Results from analysis of gene expression dataset of Miller et al. (2005). DE genes identified by the MRFGG method linked to the KEGG pathways, where genes in light grey are over-expressed and those in dark grey are under-expressed in cancer cells from the patients who died of breast cancer.

 
The largest subnetwork identified includes six genes on the GAP Junction pathway, of which five genes (CDC25B, CDC2, TUBA1, PRKACB and RAP1B) are up-regulated in cancer samples from individuals who died of cancer. Interestingly, the gap junction membrane channel protein alpha 1 (GJA1) is, however, down-regulated. GJA1 has been reported to suppress cell proliferation (Yu et al., 2006) and therefore its down-regulation can lead to more cancer cell proliferation and increase the chance of death from breast cancer. In addition, we also observed over-expression of the tight junction proteins claudin-3 (CLDN3) and claudin-8 (CLDN8). Claudin-3 and claudin-4 are frequently over-expressed in several neoplasias, including ovarian, breast, pancreatic and prostate cancers (Hewitt et al., 2006; Morin, 2005).

We also observed over-expression of genes related to Cyclin B1 (CCNB1) and B2 (CCNB2), together with over-expression of STRATIFIN (SFN) in cancer samples from individuals who died of cancer. These three genes are on the cell cycle pathway. Cyclins are a family of regulatory proteins that play a key role in controlling the cell cycle. Abnormalities of cell cycle regulators, including cyclins and cyclin-dependent kinases, have been reported in various malignant tumors. Zhao et al. (2006) observed significantly greater cyclin B1 expression in invasive cervical cancer than in normal cervical tissue and indicated that aberrant expression of cyclin B1 might play an important role in cervical carcinogenesis. In addition, CCNB1 expression was highly correlated with the labeling index for antigen identified by mAb ki067 (Ki067, associated with increased tumor cell proliferation), which suggests a key role for CCNB1 in the regulation of neuroendocrine tumor cell proliferation (Igarashi et al., 2004; Lahad et al., 2005).

Other genes related to breast cancer survival include genes involving ECM-receptor interaction and cell communication pathway (COL2A1, SDC2, TNN, LAMA2, LAMA3 and SV2B) and genes in the insulin signaling pathway (GYS2, PPP1CAACACB, SREBF1 and PFKP).


    5 CONCLUSION AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have proposed a MFR-based procedure that uses information of interaction networks in identification of DE genes. The proposed method utilizes the structure information of the interaction networks in order to capture the dependency of differential expression patterns for genes on the network. By doing so, we can expect to obtain results that are not found by single–gene analysis. Simulation studies and application to several real microarray gene expression datasets demonstrated that our methods are more sensitive in identifying the DE genes than some of the commonly used methods while maintaining low FDR. Results from analysis of two breast cancer microarray gene expression datasets identified several sub-networks that are related to breast metastasis or death from breast cancer. Some of the subnetworks were reported in the literature.

We make several assumptions for the proposed methods. First, our proposed methods depend on the reliability of the structure of the interaction networks. We used KEGG interaction networks made of 33 regulatory pathways in our analysis of the breast cancer gene expression data. The edges of the networks include both protein–protein and DNA–protein interactions. Our methods treat all interactions equally, regardless of type and direction. However, if different types of interactions can be clearly defined, we can modify our method to allow for different dependency parameters for different interaction types. Important future research is to refine network structures and to extend our MFR models to more complex and refined network structures. Second, we used the Gamma–Gamma model of Kendziorski et al. (2003) and Newton et al. (2003) for modeling the gene expression data. Alternatively, one can assume a log-normal–normal model for the gene expression data. However, as shown in Kendziorski et al. (2003), the Gamma–Gamma model is quite robust to model misspecification. Third, in our model formulation, for each gene, we only consider its immediate neighbors on the network as its neighbors (i.e. first degree neighbors). However, if the differential expression patterns are dependent in neighbors centered at this gene with a radius r, we may want to include as its neighbors all the genes in this ball. This can potentially increase the sensitivity of identifying more DE genes.

In this article, we have focused on the problem of identifying the differentially expressed genes between two experimental conditions. The MRF methods can, however, be extended in several ways. First, it can be easily extended for identifying genes that show DEs among multiple groups using replicated gene expression profiles following the parametric empirical Bayes setup of Kendziorski et al. (2003). Second, the methods can also be extended to deal with other phenotypes such as continuous or censored survival phenotypes by considering whether a gene is related to the phenotype as a latent state and using the MFR for modeling such latent states. Third, the methods can also be extended to microarray time course gene expression data in order to identify subnetworks that change their expression states during a biological time course such as cancer initiation and progression. We are currently working on these extensions. Finally, important future research will include how to represent and assess the uncertainty of the inference of the true DE states x*.

In summary, we have proposed an MRF model for identifying DE genes between two experimental conditions in order to utilize the network structure information. As more and more networks become available, we expect more applications of such methods for identifying genes and pathways that are related to various phenotypes.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This research was supported by NIH grant ES009911 and a grant from the Pennsylvania Department of Health. We thank Mr. Edmund Weisberg, MS at Penn CCEB for editorial assistance and two referees for helpful comments.

Conflict of interest: none declared.


    FOOTNOTES
 
Associate Editor: Olga Troyanskaya

Received on February 1, 2007; revised on March 26, 2007; accepted on March 27, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 STATISTICAL MODELS AND...
 3 SIMULATION STUDIES
 4 APPLICATION TO REAL...
 5 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Barker N, Clevers H. Mining the Wnt pathway for cancer therapeutics. Nat. Rev. Drug Discov (2006) 5:997–1014.[CrossRef][Web of Science][Medline]

    Benjamini Y, Horchberg Y. Controling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (1995) 57:289–300.

    Besag J. On the statistical analysis of dirty pictures. J. R. Stat. Soc. B (1986) 48:259–302.

    Burke D, et al. Fibroblast growth factor receptors: lessons from the genes. Trends Biochem. Sci (1998) 23:59–62.[CrossRef][Web of Science][Medline]

    Deng MH, et al. Integrated probabilistic model for functional prediction of proteins. J. Comput. Biol (2004) 11:463–476.[CrossRef][Web of Science][Medline]

    Deng MH, et al. Prediction of protein function using protein-protein interaction data. (2002) The First IEEE Computer Society Bioinformatics Conference, CSB2002. 117–126.

    Dudoit S, et al. Statistucal methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin (2002) 12:111–139.

    Efron B, et al. Empirical Bayes Analysis of Microarray Experiment. J. Am. Stat. Assoc (2001) 96:1151–1160.[CrossRef][Web of Science]

    Hewitt KJ, et al. The claudin gene family: expression in normal and neoplastic tissues. BMC Cancer (2006) 6:186.[CrossRef][Medline]

    Igarashi T, et al. Divergent cyclin B1 expression and Rb/p16/cyclin D1 pathway aberrations among pulmonary neuroendocrine tumors. Mod. Pathol (2004) 17:1259–1267.[CrossRef][Web of Science]

    Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res (2002) 28:27–30.

    Kendziorski CM, et al. On paramertic empirical Bayes methods for comparing multiple groups using replicated gene expressionm profiles. Stat. Med (2003) 22:3899–3914.[CrossRef][Web of Science][Medline]

    Lahad JP, et al. Stem cell ness: a "magic marker" for cancer. J. Clin. Invest (2005) 115:1463–1467.[CrossRef][Web of Science][Medline]

    Letovsky S, Kasif S. Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics (2003) 19(Suppl. 1):i197–i204.[Abstract]

    Lönnstedt I, Speed T. Replicated microarray data. Stat. Sin (2002) 12:31–46.

    Miller LD, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl Acad. Sci (2005) 102:13550–13555.[Abstract/Free Full Text]

    Morin PJ. Claudin proteins in human cancer: promising new targets for diagnosis and therapy. Cancer Res (2005) 65:9603–9606.[Abstract/Free Full Text]

    Muller A, et al. Involvement of chemokine receptors in breast cancer metastasis. Nature (2001) 410:50–56.[CrossRef][Medline]

    Nacu S, et al. Gene expression network analysis, and applications to immunity. In: Technical report. (2006) Department of Statistics, Stanford Univerisity.

    Newton MA, et al. On differntial variability of expression ratios: improving statistical inference abou gene expression changes from micorarray data. J. Comput. Biol (2001) 8:37–52.[CrossRef][Web of Science][Medline]

    Rahnenführer J, et al. Calculating the statistical significance of changes in pathway activity from gene expression data. Stat. Appl. Genet. Mol. Biol (2004) 3. Article 16.

    Rapaport F, et al. Classification of microarray data using gene networks. BMC Bioinformatics (2007) Feb 1;8:35.[CrossRef][Medline]

    Rubie C, et al. Chemokine receptor CCR6 expression in colorectal liver metastasis. J. Clin. Oncol (2006) 24:5173–5174.[Free Full Text]

    Sagara Y, et al. Clinical significance of Caveolin-1, Caveolin-2 and HER2/neu mRNA expression in human breast cancer. Br. J. Cancer (2004) 91:959–965.[Web of Science][Medline]

    Segal E, et al. Discovering Molecular Pathways from Protein Interaction and Gene Expression Data. Bioinformatics (2003) 19(Suppl. 1):264–272.[CrossRef]

    Shulewitz M, et al. Repressor roles for TCF-4 and Sfrp1 in Wnt signaling in breast cancer. Oncogene (2006) 25:4361–9.[CrossRef][Web of Science][Medline]

    Sivachenko A, et al. Identifying Local Gene Expression Patterns in Biomolecular Networks. (2005) Proceedings of 2005 IEEE Computational Systems Bioinformatics Conference: Stanford, California.

    Souaze F, et al. Expression of neurotensin and NT1 receptor in human breast cancer: a potential role in tumor progression. Cancer Res (2006) 66:6243–9.[Abstract/Free Full Text]

    Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA (2005) 102:15545–50.[Abstract/Free Full Text]

    Tian L, et al. Discovering statistically significant pathways in expression profiling studies. Proc. Natl Acad. Sci (2005) 103:13544–13549.

    Tusher V, et al. Significance analyusis of miocrarrays applied to ionizing radiation response. Proc. Natl Acad. Sci (2001) 98:5116–5121.[Abstract/Free Full Text]

    Wang Y, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet (2005) 365:671–9.[Web of Science][Medline]

    Yu K, et al. A modular analysis of breast cancer reveals a novel low-grade molecular signature in estrogen receptor positive tumors. Clin. Cancer Res (2006) 12:3288–3296.[Abstract/Free Full Text]

    Zhao M, et al. Expression profiling of cyclin B1 and D1 in cervical carcinoma. Exp. Oncol (2006) 28:44–48.[Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
W. Pan
Network-based multiple locus linkage analysis of expression traits
Bioinformatics, June 1, 2009; 25(11): 1390 - 1396.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
B. Zhang, H. Li, R. B. Riggins, M. Zhan, J. Xuan, Z. Zhang, E. P. Hoffman, R. Clarke, and Y. Wang
Differential dependency network analysis to identify condition-specific topological changes in biological networks
Bioinformatics, February 15, 2009; 25(4): 526 - 532.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. Noirel, G. Sanguinetti, and P. C. Wright
Identifying differentially expressed subnetworks with MMG
Bioinformatics, December 1, 2008; 24(23): 2792 - 2793.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
C. Li and H. Li
Network-constrained regularization and variable selection for analysis of genomic data
Bioinformatics, May 1, 2008; 24(9): 1175 - 1182.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
G. Sanguinetti, J. Noirel, and P. C. Wright
MMG: a probabilistic tool to identify submodules of metabolic pathways
Bioinformatics, April 15, 2008; 24(8): 1078 - 1084.
[Abstract] [Full Text] [PDF]


Home page
Brief Funct Genomic ProteomicHome page
J. Noirel, S. Y. Ow, G. Sanguinetti, A. Jaramillo, and P. C. Wright
Automated extraction of meaningful pathways from quantitative proteomics data
Brief Funct Genomic Proteomic, March 7, 2008; (2008) eln011v1.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
P. Wei and W. Pan
Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model
Bioinformatics, February 1, 2008; 24(3): 404 - 411.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/12/1537    most recent
btm129v2
btm129v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Wei, Z.
Right arrow Articles by Li, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wei, Z.
Right arrow Articles by Li, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?