Skip Navigation

Bioinformatics 2007 23(13):i468-i478; doi:10.1093/bioinformatics/btm173
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Shiga, M.
Right arrow Articles by Mamitsuka, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Shiga, M.
Right arrow Articles by Mamitsuka, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Annotating gene function by combining expression data with a modular gene network

Motoki Shiga , Ichigaku Takigawa and Hiroshi Mamitsuka *

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: A promising and reliable approach to annotate gene function is clustering genes not only by using gene expression data but also literature information, especially gene networks.

Results: We present a systematic method for gene clustering by combining these totally different two types of data, particularly focusing on network modularity, a global feature of gene networks. Our method is based on learning a probabilistic model, which we call a hidden modular random field in which the relation between hidden variables directly represents a given gene network. Our learning algorithm which minimizes an energy function considering the network modularity is practically time-efficient, regardless of using the global network property. We evaluated our method by using a metabolic network and microarray expression data, changing with microarray datasets, parameters of our model and gold standard clusters. Experimental results showed that our method outperformed other four competing methods, including k-means and existing graph partitioning methods, being statistically significant in all cases. Further detailed analysis showed that our method could group a set of genes into a cluster which corresponds to the folate metabolic pathway while other methods could not. From these results, we can say that our method is highly effective for gene clustering and annotating gene function.

Contact: shiga{at}kuicr.kyoto-u.ac.jp


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Recent progress in genome sciences has led to the development of DNA microarray technology which allows to monitor the expression of thousands of genes simultaneously. Microarray technology by which we can see which genes are active in a given cell or tissue promises the ability to determine the function of each gene. A current popular approach to annotate gene function from gene expression data is clustering genes by expression values based on the assumption that genes with similar expression patterns can be clustered into a group with the same gene function. Existing approaches include typical clustering methods, such as hierarchical clustering (Eisen et al., 1998), k-means (Tavazoie et al., 1999) and self-organizing maps (Tamayo et al., 1999). However, microarray expression data is inevitably noisy, making the clustering result by the above methods unstable (Kerr and Churchill, 2001; Zhang and Zhao, 2000). A possible solution to overcome this problem is to generate many array replicates which are however demanding because of the high experimental cost of microarray experiments.

A more promising direction in current bioinformatics research is to combine microarray expression data with the existing knowledge of gene annotation derived from literature. Typical examples include model-based clustering incorporating GO (Gene Ontology) annotation as priors of model parameters (Pan, 2006), k-means clustering using GO annotation dependent distances (Huang and Pan, 2006) and hierarchical clustering based on the distance defined by both gene expressions and the shortest path on a metabolic (chemical reaction) pathway (Hanisch et al., 2002.), etc. This type of combination is of interest, since dynamic behavior of genes which would be observed from microarray data can be integrated with the literature-derived biological data which is obviously static information. Specifically, dynamic information of microarray expression would change the static gene clusters to fit the experimental condition of given microarray data. However, existing approaches are not methodologically sophisticated enough in combining the two data, i.e. real-valued expression data and literature-derived data, especially gene networks. In addition, the focus of current approaches is placed on the rather local information, such as neighboring genes, of gene networks, and incorporating global information of gene networks might find more appropriate gene clusters.

In light of the above, we present a new method for clustering genes using both microarray expression data, i.e. the dynamic information of a cell, and a given gene network, focusing on network modularity, a global nature of biological networks (Ravasz et al., 2002). Our method is based on learning a probabilistic model which we call a hidden module random field that is well suited to combine the two different types of data. Our model has observable variables for microarray expression data and hidden variables for gene annotation (cluster labels). Concretely, each observable variable is dependent upon a hidden variable which can be dependent upon one or more hidden variables each other, as defined by a given biological network. Learning parameters of our model is based on the EM (Expectation-Maximization) algorithm that minimizes the objective function which considers the modularity of a given gene network. Briefly speaking, our method uses not only the simple gene relations which are directly obtained from literature but also more complex information which can be derived from the entire network structure. We further stress that regardless of focusing on the global nature of a given network, our proposed algorithm for estimating probability parameters of our model is time-efficient.

Using a metabolic network and a wide variety of microarray datasets, we validated the performance of our method, comparing with other methods and analyzed the results obtained by our method. Clusters of genes are variable depending on the condition under which genes are placed, meaning that true cluster labels are unknown. However, to evaluate the performance of our method, we generated two different sets of gene clusters as standard data. Assuming that a good clustering result will be close to them if many microarray observations are used, we compared our method with four different clustering methods: k-means with microarray expression data and three graph partitioning methods using the structure of the metabolic network. We then found that our clustering method outperformed all other competing methods, being statistically significant in all cases. We further examined the clustering result, focusing on the genes in folate biosynthesis to confirm the significance of our method. These results proved that our method of combining microarray expression data with the modularity of a gene network is highly effective for clustering genes and annotating gene functions.


    2 METHOD: PROBABILISTIC MODEL
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Clustering is to assign a gene function (or generally a cluster label) to each gene. Starting with a hidden random field, we will explain our model for clustering genes by combining microarray expression data with a global feature of gene networks. Each node in a gene network is labeled by a gene, and an undirected edge connecting two nodes represents some relation between the two genes labeling the two nodes.

2.1 Hidden random field
A hidden random field is a probabilistic model with two types of variables: observable and hidden variables. Figure 1a shows a model in which an observable variable depends on only one hidden variable and a hidden variable depends on no other variables. This model corresponds to that of a typical clustering method like k-means in which genes are clustered with microarray expression data only, assuming that the expression pattern of a gene is independent of another. However, this assumption is not necessarily true. For example, the expression patterns of two genes located neighborhood on a metabolic network are in most cases strongly correlated with each other (Kharchenko et al., 2005). Figure 1b shows a hidden random field in which hidden variables can be dependent upon each other which would be more natural for clustering genes using both gene expressions and biological networks. In particular, we use a given gene network to represent the relation of hidden variables.


Figure 1
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Hidden random fields with (a) independent hidden variables and with (b) relation between hidden variables.

 
Mathematically, these models can be formalized as follows: let Formula be a set of hidden variables and Formula be a potential function which defines the relation between genes. The probabilistic distribution over Z is given as the Gibbs distribution:


Formula 1

(1)
where Formula and {alpha} is a parameter (weight) for Formula .

We can further define the joint probability of a set of observable variables Formula and Z as follows:


Formula 2

(2)
Practically N is the number of genes. Given a microarray dataset, the size of Formula is the number of observations (experiments), and each value taken in Formula is an observable expression value. K is the number of clusters. A hidden variable zn takes a cluster label out of 1 to K. The conditional probability Formula is assumed to be a normal distribution or the von Mises–Fisher distribution (Mardia and Jupp, 2000).

The probability of X is further given as follows:


Formula

The model of Figure 1b has dependency relation between hidden variables as a network with nodes corresponding to genes and edges connecting nodes. The structure information of this network can be described in Formula of Equation (2).

2.2 Hidden Markov random field (HMaF)
A typical hidden random field is a hidden Markov random field, which we call HMaF in this article.1 We explain this model briefly to show the significance of the dependency structure of hidden variables. In HMaF, a hidden variable for a node depends on the hidden variables for its neighboring (closest) nodes alone, satisfying with the Markov property. The potential function in this model is defined as follows:


Formula 3

(3)
where Sn is a set of neighboring nodes of n and Formula is defined as follows:


Formula

As seen by Equation (3), the value of Formula increases as the neighboring hidden variables take the same value. Using this potential function, the probabilistic distribution over Z in HMaF can be defined as follows:


Formula 4

(4)
where Formula .

A major application of HMaF is image restoration (or image modeling), which is a problem of estimating true pixel labels from an observed image which is usually very noisy (Fjortoft et al., 2003; Zhang et al., 2001). We show the effectiveness of using the network information of hidden variables through an example of image restoration. Figure 2 shows a schematic picture of applying HMaF to image restoration, exactly according to (Zhang et al., 2001) in which observable and hidden variables are used for (b) noisy images and (c) clear true pixels, respectively. Each of (b) and (c) has 2500 (=50 x 50) pixels, and a pixel can take one of three labels as a true value. As shown in Figure 2a, a typical network structure for the relation of hidden variables in image restoration is a grid structure in which a node is connected to neighboring four (left, right, up and down) nodes only, keeping the Markov property.


Figure 2
View larger version (41K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. (a) HMaF for image restoration and (b) observable noisy data and (c) the true image.

 
Figure 3a shows the result obtained by applying HMaF to the data shown in Figure 2b, while Figure 3b shows that by applying k-means (which corresponds to Figure 1a) to the same data. As shown in these figures, HMaF clearly removes the noise in Figure 2b while obviously k-means cannot, indicating that the network structure assumption in HMaF works effectively for image restoration.


Figure 3
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Image restoration by (a) HMaF and by (b) k-means.

 
2.3 Our model: hidden ‘Modular’ random field (HMoF) for gene clustering
As shown in the previous section, the network structure of hidden variables is an important assumption to estimate the values of hidden variables exactly. The hidden structure in image restoration is a grid shape in which the number of edges from a node, i.e. the degree of a node, is basically constant. In contrast, the degree of a node in a gene network varies according to the scale-free nature (Jeong et al., 2000; Ravasz et al., 2002), suggesting that the grid shape is inappropriate for a gene network and the Markov property would be insufficient.

It is already reported that the genes having the same function (cluster) tend to be gathered together on a gene network, especially metabolic networks (Ravasz et al., 2002). This feature is called modularity, a global feature of gene networks. Modularity of a network is defined as a quantity which becomes larger as with increasing the number of edges in a cluster and with decreasing the number of edges between two different clusters (Guimera and Nunes Amaral, 2005; Guimera et al., 2004; Newman and Girvan, 2004). Modularity can then be mathematically defined as follows:


Formula 5

(5)
where L is the total number of edges in a given network, Formula is the number of edges in cluster k and Formula is the total sum of degrees in cluster k. The first term of the right-hand side of Equation (5) increases with the number of edges inside a cluster, playing the same role as that of Equation (3) which becomes larger as the same cluster is assigned to neighboring genes on a gene network. On the other hand, the second term checks the number of edges not only in a cluster but also crossing different clusters. Totally, the right-hand side becomes larger as with increasing the number of edges in a cluster without increasing the number of edges crossing different clusters. Thus Equation (5) can reflect the modularity, a global feature of gene networks.

Thus, we propose a new random field using Formula as a potential function as follows:


Formula 6

(6)
where Formula . We call this hidden random field HMoF which stands for hidden modular random field.


    3 METHOD: LEARNING ALGORITHM FOR HMoF
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 EM algorithm
The EM algorithm is a local optimization method which is widely used for estimating parameters of a lot of probabilistic models with hidden variables (Dempster et al., 1977). According to the EM algorithm, we have to minimize the following so-called Q function:


Formula 7

(7)
where Formula is the set of current parameters, Formula is the set of parameters to be estimated and a {delta}-function is used for Formula . This Q function is true of a lot of other probabilistic models with hidden variables, such as k-means (Kearns et al., 2001).

We assume that the probabilistic distribution of an observable variable follows the von Mises–Fisher distribution (Mardia and Jupp, 2000), since this distribution is used for clustering high-dimensional data such as text documents (Zhong and Ghosh, 2005):


Formula 8

(8)
where, Formula is the center of cluster k and must satisfy with Formula , and Cv is a normalization constant. From Equations (2, 6–8GoGo), we can have the objective function J which should be minimized:


Formula 9

(9)
where {omega} satisfies with Formula and Formula , and Uk is a set of the indices of hidden variables taking cluster k. We note that {omega} is a weight parameter which is not trained from data, and so CD({alpha}) in Equation (6) becomes a constant when we derive the above Equation (9). The EM algorithm suggests a procedure to minimize J by optimizing cluster centers Formula and hidden variables Z separately. More concretely, this algorithm repeats the following two steps alternately: (1) we optimize cluster centers using fixed hidden variables and (2) optimize hidden variables using fixed cluster centers.

For the first step, when hidden variables Z are fixed, the objective function J is minimized by the following Formula :


Formula

where


Formula

The second step is to optimize J when we fix the above cluster centers. We note that potential function Formula is not the sum over all hidden variables Formula , since hidden variables can be dependent upon each other. We then have to find the optimal combination of hidden values, but the computational complexity of this problem is NP-hard. We then use the ICM (iterative conditional modes) method (Besag, 1986) to find an approximation solution of this problem.

3.2 ICM method
The basic procedure of the ICM method repeats the following two steps until the convergence of J: (1) we first randomly permute N hidden variables. (2) In the permuted order, we update the value of the chosen hidden variable so that J should be minimized. The ICM method is a local optimization method, and so we can repeat the above procedure many times using different initial values to find the most optimal solution.

We emphasize that the ICM method for our model is time-efficient, although our model deals with the global feature of a given network. The reason is as follows: we do not have to compute function J at every time of doing the above two steps. Instead, we can compute the difference between the current J and the new J. More concretely, after we choose a hidden variable, say zn and then examine a new hidden (cluster) value, say j, the difference between the current J specified by zn and the new J by j can be given by the function of these two:


Formula

where i is the current value of hidden variable zn and


Formula

where, Formula is the number of edges coming to the nodes in cluster i and Formula is the degree of node n. From the above equations, we note that Formula can be computed by using only the local (neighboring) nodes of node n, although HMoF reflects the global structure of a given biological network. Thus, the practical computation time of HMoF is totally comparable to that of HMaF which just uses the local information of a given network.

Finally, Figure 4 shows our entire algorithm of estimating probabilistic parameters by minimizing J. Line 1 is the input of initial values which are, in our experiments, obtained by repeating randomly assigning cluster labels to the nodes of a given gene network and selecting the set of cluster labels with the highest modularity which can be computed by Equation (5). Lines 3–6 are the ICM method to which, as shown in line 5, we add one constraint that the new cluster label is selected out of the labels of neighboring nodes only. This device incorporates the local information of a given network in updating labels and speeds up the efficiency of this algorithm. Lines 2–9 are the entire EM algorithm.


Figure 4
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Our entire algorithm for estimating parameters of HMoF. Let On be a set of neighbors of node n.

 

    4 EXPERIMENTAL RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
4.1 Data
We used metabolic pathways for a gene network and focused on the data of Saccharomyces cerevisiae in both the metabolic gene network and microarray expression data. We will show the detail of our data subsequently.

4.1.1 Metabolic network
We generated a metabolic network from KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa et al., 2006), by assigning an enzyme gene to a node and connecting the two nodes of two neighboring genes on the KEGG metabolic pathway by an edge. We removed a node with no edges and an isolated network with no more than 15 nodes. The generated network had 636 nodes (genes) and 3104 edges. We note again that this metabolic network is an undirected gene network, meaning that each node corresponds to a gene.

4.1.2 Microarray expression
From the GEO (Gene Expression Omnibus) database (Edgar et al., 2002), we chose five datasets of S. cerevisiae under the condition that each dataset has more than 50 experiments (observations), because a larger number of observations is more reliable when genes with similar expression patterns are clustered in the same group. We further added a dataset (Hughes et al., 2000) which has been often used in microarray analysis (Huang and Pan, 2006; Pan, 2006; Wu et al., 2002; Zhou et al., 2002) since it has 300 observations while a small number of missing values. Tables 1 shows the detail of the above six datasets. A missing value in these datasets was interpolated by using the 10-nearest least square method by (Troyanskaya et al., 2001).


View this table:
[in this window]
[in a new window]

 
Table 1. Microarray expression data summary

 
4.1.3 Standard gene cluster data
Gene clusters are variable depending on the condition under which genes are placed. However, to evaluate the performance of our method for clustering genes, we generated two different sets of standard gene clusters (or true cluster labels) from KEGG and GO (Ashburner et al., 2000), which we call KEGG clusters and GO clusters, respectively.

(1) KEGG clusters: The KEGG pathway maps are classified into six major categories, including metabolism. We used 10 sub-categories under the metabolism category as KEGG clusters. Tables 2 shows the list of KEGG clusters and the number of genes in the corresponding cluster. We note that a gene can be in more than one clusters.


View this table:
[in this window]
[in a new window]

 
Table 2. Ten standard gene clusters from KEGG

 
(2) GO clusters: GO is the ontology of gene functions which are classified into three categories: molecular function, biological process and cellular component. Out of these three major categories, we used the function labels in biological process only. In GO, the relation between gene functions forms a so-called DAG (directed acyclic graph) which is a directed graph with no cycles. Each node of the DAG corresponds to a gene function (cluster label) to which corresponding genes are assigned. Starting with the root of the DAG, we went from a parent to its child, checking the number of genes assigned to each node. Naturally, the number of genes reduces as with going to descendants. We fixed a cutoff value (minimum size of a cluster), and if the number of genes was less than the specified cutoff value, we went back to its parent and stopped there. If the number of genes at the parent is larger than 300, we did not use that node as a cluster because its size is too large to annotate gene function. Tables 3 shows the number of clusters obtained by the above procedure, changing the cutoff value. We selected the cutoff value of 75 to have 19 clusters as GO clusters, because of choosing a little different number from 10, which was the number of KEGG clusters. Figure 5 shows all nodes appeared during the above procedure to obtain 19 clusters which correspond to circles with thick lines in this figure. The number in each circle (node) in Figure 5 shows the number of corresponding genes. The name attached to each node shows the corresponding cluster name. As well as KEGG clusters, a gene can be in more than one clusters.


View this table:
[in this window]
[in a new window]

 
Table 3. Number of gene clusters from GO when changing the minimum size of a cluster

 

Figure 5
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. 19 standard gene clusters (nodes with thick lines) from GO.

 
4.2 Performance evaluation: criterion
In information theory, mutual information is defined as a quantity to measure the amount of information shared between two random variables. This criterion can be applied to check the overlap between our clustering result and standard clusters. This is because the mutual information between two sets of cluster labels becomes larger as one set of clusters is more consistent with the other set of clusters. In general, mutual information is normalized since the range taken by mutual information depends on the size of given sets of clusters (Strehl and Ghosh, 2003). We note that normalized mutual information (NMI) has been widely used in a lot of applications to measure the performance of clustering methods (Zhong and Ghosh, 2003, 2005).

Let Formula be a cluster set of our result, where Gi be a set of genes in cluster i. Let Formula be a standard cluster set, where Formula be a set of genes in standard (true) cluster i. NMI is defined as follows:


Formula

where


Formula

4.3 Performance evaluation: results
4.3.1 Parameter setting and competing methods
We checked three different sizes of clusters: K = 10, 15 and 20, since two sets of standard clusters have 10 and 19 clusters.

As shown in Equation (9), the objective function that must be minimized has parameter {omega} which varies in the range between zero and one. If we set {omega} at zero, J has the first term only, consisting of microarray expression alone. This case, our model and its learning algorithm are almost the same as clustering genes by k-means with microarray expression data, and so hereafter we call {omega} = 0 as k-means. On the other hand, if we set {omega} at one, J has the second term alone, meaning that genes are clustered by a given biological network only. This is similar to graph partitioning in which nodes are clustered by using the connectivity of a given graph. Thus, we changed the value of parameter {omega} from zero to one at every 0.1. We then mainly examined the performance of our method at three values: zero, one and {omega} that gives the highest performance, which we denote {omega}0, {omega}1 and Formula , respectively.

Our method was further compared with two famous graph partitioning methods, called p-METIS (Karypis and Kumar, 1998a) and k-METIS (Karypis and Kumar, 1998b), both of which use only the structure of a given metabolic network, meaning that their NMIs are irrelevant to {omega} and microarray datasets. This is true of {omega}1.

4.3.2 NMI
We first show the result when we used KEGG clusters. Figure 6 shows the NMI of HMoF when we used Hughes as microarray expression data. This figure shows that the NMI of Formula was higher than those of {omega}0 (k-means) and {omega}1 in all three cases of K. This indicates that the clustering result obtained by using both the modularity of the metabolic network and microarray expression data is more consistent with KEGG clusters than that by either of the two types of data. We can further see that the NMI values in Formula were higher than those by p-METIS and k-METIS which are shown as dashed lines in Figure 6, indicating that our clustering result was overlapped more with KEGG clusters than those by the two popular graph partitioning methods. These findings were true of other microarray datasets. Tables 4 summarizes the NMI values for all six microarray datasets used in our experiments. Each Formula value is shown in the corresponding parenthesis. From this table, we can see that the NMI of Formula was the best among five compared methods in all 18 cases.


View this table:
[in this window]
[in a new window]

 
Table 4. NMI results summary when we used KEGG clusters

 

Figure 6
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. NMI changing with {omega} (weight) when we used Hughes, KEGG clusters and K= (a)10, (b)15 and (c) 20.

 
We then show the results obtained by using GO clusters as standard data. Figure 7 shows the NMI of HMoF when we used Hughes, changing K and {omega}. The NMI values of p-METIS and k-METIS are also shown in this figure as dashed lines. From this figure, the advantage of HMoF over other four methods was confirmed again. The NMI values in this figure ranged roughly from 0.15 to 0.2, which were smaller than those by KEGG clusters, probably because the number of GO clusters is 19 which is larger than the number of KEGG clusters, i.e. 10. Tables 5 summarizes the NMI values obtained by all six microarray datasets. From this table, we can see that the NMI of Formula was the highest among the five competing methods in all 18 cases, totally following the result by KEGG clusters. From Tables 4 and 5, we can see that Formula is 0.2 in 21 cases in all 36 cases, meaning that the clustering performance of {omega} at 0.2 is almost always close to that of Formula (or the true maximum). Thus, hereafter we fix {omega} at 0.2, which we denote Formula .


View this table:
[in this window]
[in a new window]

 
Table 5. NMI results summary when we used GO clusters

 

Figure 7
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. NMI changing with {omega} (weight) when we used Hughes, GO clusters and K= (a)10, (b)15 and (c) 20.

 
The above results show the clear advantage of Formula (and Formula ) over k-means ({omega}0), p-METIS and k-METIS, but the difference of NMI between Formula and {omega}1 was rather slight. We then examined the statistical significance in the difference of NMI between Formula and {omega}1. Tables 6 shows the P-values of pairwise t-test (Kreyszig, 1970) between Formula and {omega}1 over all six microarray datasets. This table indicates that the NMI of Formula was higher than that of {omega}1 with a statistical significance of more than around 99.5% in all six cases. Overall we can say that our method of combining microarray expression data with the network modularity achieved a clearly higher performance in gene clustering than other four compared methods which use only either of the two data types.


View this table:
[in this window]
[in a new window]

 
Table 6. P-values of pairwise t-test between {omega}0.2 and {omega}1

 
4.4 Annotating gene function
Once a cluster label is assigned to each gene, we can compute the P-value between each gene function in some ontology database like GO and a cluster containing the gene in question. We can then rank the gene functions according to their P-values, and the top function can be assigned to each gene of the cluster. This manner of function annotation would be helpful for a functionally putative gene. To compute the P-values and rank them, we can use a software called GO Term Finder (Boyle et al., 2004) which uses a set of genes in a cluster as input and outputs a list of gene functions ranked according to P-values.

Tables 7 shows an example of function annotation, i.e. a list of gene functions with P-values for a cluster which is obtained under the condition that the microarray dataset was Hughes and K=20. We show this example, since this cluster has the highest P-value, 1.62E-37, among all 20 clusters obtained under the above condition. The top of this list is ‘protein modification’. In fact, this cluster contained 45 genes including, say ALG11/YNL048W (mannosyltransferase) and ALG6/YOR002W (glucosyltransferase) which add a sugar unit to a protein amino acid, meaning that their function is protein modification.


View this table:
[in this window]
[in a new window]

 
Table 7. An example of function annotation: top five functions for an obtained cluster having, say ALG11/YNL48W (mannosyltransferase)

 
4.5 Analysis on clustering result
Figure 8 shows the three metabolic gene networks obtained by {omega}0, Formula and {omega}1 using Hughes under K = 20. Each node corresponds to a gene, and the edge between two nodes indicates that the genes of these two nodes are neighboring enzyme genes on the KEGG metabolic pathway map. These networks have the same structure but nodes were differently colored. Each color indicates a cluster, meaning that genes labeled with the same color fall into the same cluster. From this figure, we can easily see that nodes with the same color were widely distributed over the entire network in (a), while the nodes with the same color were gathered together in (c) because nodes were clustered by the network structure only.


Figure 8
View larger version (46K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. Clustered genes on metabolic networks when we used Hughes, K = 20 and (a) {omega}0, (b) Figure 8 and (c) {omega}1.

 
We then focused on the square at the top-left part in each of the three networks in Figure 8. Figure 9 shows the enlargement of each of these squares. In Figure 9, two colors, red and green, were merged in (b) while more than two colors were merged in (a) and the two colors were clearly separated in (c). As shown in the previous section, (b), i.e. Formula , was closest to the standard clusters among {omega}0, Formula and {omega}1. We then checked orange colored genes in (b) by using the KEGG database and found that they correspond to those categorized in ‘metabolism of cofactors and vitamins’, more precisely those in the folate biosynthesis pathway. Figure 10 shows this pathway in which, e.g. dihydroneopterin (which is located at almost the center) is generated from GTP, catalyzed by FOL2/YGR267C (GTP-cyclohydrolase I) and PHO8/YDR481C (repressible alkaline phosphatase). These two genes were colored orange in Figure 9b, while not in both (a) and (c), meaning that they were wrongly clustered under {omega}0 and {omega}1. This result is true of other orange-colored genes in (b) such as RAD54/YGL163C (DNA-dependent ATPase) and HIR2/YDR038C (Histone transcription regulator) which are also in the folate biosynthesis pathway but not colored orange in (a) and (c). These four genes were all colored dark blue in (a). A possible reason for this is that in Hughes, they must not be co-expressed with other orange colored genes such as ENA1/YDR040C (ATPase) and RAD3/YER171W (DNA helicase), though the four genes themselves might be co-expressed with each other. On the other hand, a reason why the above four genes were clustered into the green group in (c) is that since only the network structure is considered in (c), MTD1/YKR080W (methylenetetrahydrafolate dehydrogenase), a major hub colored green (located at slightly right from the center) strongly affected other genes linked to this gene. In fact, as shown in Figure 9, YGR267C, YDR481C and YGL163C are directly connected to YKR080W. Interestingly, YDR038C is not linked to YKR080W but to the above all three genes, implying that YDR038C was colored green by the three genes (YGR267C, YDR481C and YGL163C) which were already colored green by YKR080W. On contrary, a possible reason why these four genes were well clustered in (b) might be that YDR038C was first colored orange by, say MET7/YOR241W (folylpolyglutamate synthetase) and/or DFR1/YOR236W (dihydrofolate reductase), and then other three were colored orange by the link to YDR038C or by their tight co-expression with YDR038C. Thus this result and analysis also confirmed the advantage of our method for balancing the two different types of data.


Figure 9
View larger version (32K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. Enlargement of the square located at the top-left part in Figure 8 (a) {omega}0, (b) Figure 9 and (c) {omega}1.

 

Figure 10
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 10. Folate biosynthesis pathway.

 

    5 CONCLUDING REMARKS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have presented a systematic method for gene clustering by combining expression data with the modularity of gene networks, based on learning of a probabilistic model. Experimental results showed that our method successfully integrated the totally different two types of data for clustering genes to accurately annotate gene function. Interesting future work is to use another or other multiple types of gene networks which are not necessarily curated, since KEGG is a curated database, and the KEGG network might be closely related to the gold-standard gene clusters. It would also be interesting to consider a probabilistic model in which more than one cluster labels can be probabilistically assigned to each gene.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work is supported in part by Bioinformatics Education Program ‘Education and Research Organization for Genome Information Science’ and Kyoto University 21st Century COE Program ‘Knowledge Information Infrastructure for Genome Science’ with support from MEXT, Japan.

Conflict of Interest: none declared.


    FOOTNOTES
 
1 A general abbreviation of a hidden Markov random field is HMRF. If we follow this abbreviation, our model which we call a hidden modular random field is also HMRF. So in this article, we use HMRF for a hidden Markov random fiela hidden Markov random field while HMoF for our moded while HMoF for our model. Back


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD: PROBABILISTIC MODEL
 3 METHOD: LEARNING ALGORITHM...
 4 EXPERIMENTAL RESULTS
 5 CONCLUDING REMARKS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Ashburner M, et al. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat. Genet (2000) 25:25–29.[CrossRef][Web of Science][Medline]

    Besag J. On the statistical analysis of dirty pictures. J.R. Statist. Soc. B (1986) 48:259–302.

    Boyle EI, et al. GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics (2004) 20:3710–3715.[Abstract/Free Full Text]

    Brem RB, et al. Genetic dissection of transcriptional regulation in budding yeast. Science (2002) 296:752–755.[Abstract/Free Full Text]

    Dempster AP, et al. Maximum likelihood from incomplete data via the EM algorithm. J.R. Statist. Soc. B (1977) 39:1–38.

    Edgar R, et al. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res (2002) 30:207–210.[Abstract/Free Full Text]

    Eisen MB, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA (1998) 95:14863–14868.[Abstract/Free Full Text]

    Fjortoft R, et al. Unsupervised classification of radar images using hidden Markov chains and hidden Markov random fields. IEEE Trans. Geosci. Remote Sens (2003) 41:675–689.[CrossRef]

    Gasch AP, et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell (2000) 11:4241–4257.[Abstract/Free Full Text]

    Guimera R, Nunes Amaral LA. Functional cartography of complex metabolic networks. Nature (2005) 433:895–900.[CrossRef][Medline]

    Guimera R, et al. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E (2004) 70:025101.[CrossRef]

    Hanisch D, et al. Co-clustering of biological networks and gene expression data. Bioinformatics (2002) 18((Suppl.)):S145–S154.[Abstract]

    Huang D, Pan W. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics (2006) 22:1259–1268.[Abstract/Free Full Text]

    Hughes TR, et al. Functional discovery via a compendium of expression profiles. Cell (2000) 102:109–126.[CrossRef][Web of Science][Medline]

    Jeong H, et al. The large-scale organization of metabolic networks. Nature (2000) 407:651–654.[CrossRef][Medline]

    Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res (2006) 34:D354–D357.[Abstract/Free Full Text]

    Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput (1998a) 20:359–392.[CrossRef]

    Karypis G, Kumar V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput (1998b) 48:96–129.[CrossRef]

    Kearns M, et al. An information-theoretic analysis of hard and soft assignment methods for clustering. In: 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI2001) (2001) 495–520.

    Kerr MK, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. USA (2001) 98:8961–8965.[Abstract/Free Full Text]

    Kharchenko P, et al. Expression dynamics of a cellular metabolic network. Mol. Syst. Biol (2005) msb4100023:E1–E6.

    Kreyszig E. Introductory Mathematical Statistics (1970) John Wiley & Sons.

    Mardia KV, Jupp PE. Directional Statistics (2000) 2nd. JohnWiley & Sons.

    Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys. Rev. E (2004) 69:026113.[CrossRef]

    Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics (2006) 22:795–801.[Abstract/Free Full Text]

    Ravasz E, et al. Hierarchical organization of modularity in metabolic networks. Science (2002) 297:1551–1555.[Abstract/Free Full Text]

    Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell (1998) 9:3273–3297.[Abstract/Free Full Text]

    Storey JD, et al. Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biol (2005) 3:e267.[CrossRef][Medline]

    Strehl A, Ghosh J. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS J. Comput (2003) 15:208–230.[Abstract/Free Full Text]

    Tamayo P, et al. Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA (1999) 96:2907–2912.[Abstract/Free Full Text]

    Tavazoie S, et al. Systematic determination of genetic network architecture. Nat. Genet (1999) 22:281–285.[CrossRef][Web of Science][Medline]

    Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics (2001) 17:520–525.[Abstract/Free Full Text]

    Wu LF, et al. Large-scale prediction of saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet (2002) 31:255–265.[CrossRef][Web of Science][Medline]

    Yvert G, et al. Trans-acting regulatory variation in saccharomyces cerevisiae and the role of transcription factors. Nat. Genet (2003) 35:57–64.[Web of Science][Medline]

    Zhang K, Zhao H. Assessing reliability of gene clusters from gene expression data. Funct. Integr. Genomics (2000) 1:156–173.[CrossRef][Medline]

    Zhang Y, et al. Segmentation of brain MR images through a hidden Markov random field model and the Expectation-Maximization algorithm. IEEE Trans. Med. Imaging (2001) 20:45–57.[CrossRef][Web of Science][Medline]

    Zhong S, Ghosh J. A unified framework for model-based clustering. J. Mach. Learn. Res (2003) 4:1001–1037.[CrossRef]

    Zhong S, Ghosh J. Generative model-based document clustering: a comparative study. Knowl. Inf. Syst (2005) 8:374–384.[CrossRef]

    Zhou X, et al. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA (2002) 99:12783–12788.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Shiga, M.
Right arrow Articles by Mamitsuka, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Shiga, M.
Right arrow Articles by Mamitsuka, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?