Annotating gene function by combining expression data with a modular gene network
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: A promising and reliable approach to annotate gene function is clustering genes not only by using gene expression data but also literature information, especially gene networks.
Results: We present a systematic method for gene clustering by combining these totally different two types of data, particularly focusing on network modularity, a global feature of gene networks. Our method is based on learning a probabilistic model, which we call a hidden modular random field in which the relation between hidden variables directly represents a given gene network. Our learning algorithm which minimizes an energy function considering the network modularity is practically time-efficient, regardless of using the global network property. We evaluated our method by using a metabolic network and microarray expression data, changing with microarray datasets, parameters of our model and gold standard clusters. Experimental results showed that our method outperformed other four competing methods, including k-means and existing graph partitioning methods, being statistically significant in all cases. Further detailed analysis showed that our method could group a set of genes into a cluster which corresponds to the folate metabolic pathway while other methods could not. From these results, we can say that our method is highly effective for gene clustering and annotating gene function.
Contact: shiga{at}kuicr.kyoto-u.ac.jp
| 1 INTRODUCTION |
|---|
|
|
|---|
Recent progress in genome sciences has led to the development of DNA microarray technology which allows to monitor the expression of thousands of genes simultaneously. Microarray technology by which we can see which genes are active in a given cell or tissue promises the ability to determine the function of each gene. A current popular approach to annotate gene function from gene expression data is clustering genes by expression values based on the assumption that genes with similar expression patterns can be clustered into a group with the same gene function. Existing approaches include typical clustering methods, such as hierarchical clustering (Eisen et al., 1998), k-means (Tavazoie et al., 1999) and self-organizing maps (Tamayo et al., 1999). However, microarray expression data is inevitably noisy, making the clustering result by the above methods unstable (Kerr and Churchill, 2001; Zhang and Zhao, 2000). A possible solution to overcome this problem is to generate many array replicates which are however demanding because of the high experimental cost of microarray experiments.
A more promising direction in current bioinformatics research is to combine microarray expression data with the existing knowledge of gene annotation derived from literature. Typical examples include model-based clustering incorporating GO (Gene Ontology) annotation as priors of model parameters (Pan, 2006), k-means clustering using GO annotation dependent distances (Huang and Pan, 2006) and hierarchical clustering based on the distance defined by both gene expressions and the shortest path on a metabolic (chemical reaction) pathway (Hanisch et al., 2002.), etc. This type of combination is of interest, since dynamic behavior of genes which would be observed from microarray data can be integrated with the literature-derived biological data which is obviously static information. Specifically, dynamic information of microarray expression would change the static gene clusters to fit the experimental condition of given microarray data. However, existing approaches are not methodologically sophisticated enough in combining the two data, i.e. real-valued expression data and literature-derived data, especially gene networks. In addition, the focus of current approaches is placed on the rather local information, such as neighboring genes, of gene networks, and incorporating global information of gene networks might find more appropriate gene clusters.
In light of the above, we present a new method for clustering genes using both microarray expression data, i.e. the dynamic information of a cell, and a given gene network, focusing on network modularity, a global nature of biological networks (Ravasz et al., 2002). Our method is based on learning a probabilistic model which we call a hidden module random field that is well suited to combine the two different types of data. Our model has observable variables for microarray expression data and hidden variables for gene annotation (cluster labels). Concretely, each observable variable is dependent upon a hidden variable which can be dependent upon one or more hidden variables each other, as defined by a given biological network. Learning parameters of our model is based on the EM (Expectation-Maximization) algorithm that minimizes the objective function which considers the modularity of a given gene network. Briefly speaking, our method uses not only the simple gene relations which are directly obtained from literature but also more complex information which can be derived from the entire network structure. We further stress that regardless of focusing on the global nature of a given network, our proposed algorithm for estimating probability parameters of our model is time-efficient.
Using a metabolic network and a wide variety of microarray datasets, we validated the performance of our method, comparing with other methods and analyzed the results obtained by our method. Clusters of genes are variable depending on the condition under which genes are placed, meaning that true cluster labels are unknown. However, to evaluate the performance of our method, we generated two different sets of gene clusters as standard data. Assuming that a good clustering result will be close to them if many microarray observations are used, we compared our method with four different clustering methods: k-means with microarray expression data and three graph partitioning methods using the structure of the metabolic network. We then found that our clustering method outperformed all other competing methods, being statistically significant in all cases. We further examined the clustering result, focusing on the genes in folate biosynthesis to confirm the significance of our method. These results proved that our method of combining microarray expression data with the modularity of a gene network is highly effective for clustering genes and annotating gene functions.
| 2 METHOD: PROBABILISTIC MODEL |
|---|
|
|
|---|
Clustering is to assign a gene function (or generally a cluster label) to each gene. Starting with a hidden random field, we will explain our model for clustering genes by combining microarray expression data with a global feature of gene networks. Each node in a gene network is labeled by a gene, and an undirected edge connecting two nodes represents some relation between the two genes labeling the two nodes.
2.1 Hidden random field
A hidden random field is a probabilistic model with two types of variables: observable and hidden variables. Figure 1a shows a model in which an observable variable depends on only one hidden variable and a hidden variable depends on no other variables. This model corresponds to that of a typical clustering method like k-means in which genes are clustered with microarray expression data only, assuming that the expression pattern of a gene is independent of another. However, this assumption is not necessarily true. For example, the expression patterns of two genes located neighborhood on a metabolic network are in most cases strongly correlated with each other (Kharchenko et al., 2005). Figure 1b shows a hidden random field in which hidden variables can be dependent upon each other which would be more natural for clustering genes using both gene expressions and biological networks. In particular, we use a given gene network to represent the relation of hidden variables.
|
Mathematically, these models can be formalized as follows: let
|
| (1) |
is a parameter (weight) for
We can further define the joint probability of a set of observable variables
and Z as follows:
|
| (2) |
The probability of X is further given as follows:
|
|
2.2 Hidden Markov random field (HMaF)
A typical hidden random field is a hidden Markov random field, which we call HMaF in this article.1 We explain this model briefly to show the significance of the dependency structure of hidden variables. In HMaF, a hidden variable for a node depends on the hidden variables for its neighboring (closest) nodes alone, satisfying with the Markov property. The potential function in this model is defined as follows:
|
| (3) |
|
|
|
| (4) |
A major application of HMaF is image restoration (or image modeling), which is a problem of estimating true pixel labels from an observed image which is usually very noisy (Fjortoft et al., 2003; Zhang et al., 2001). We show the effectiveness of using the network information of hidden variables through an example of image restoration. Figure 2 shows a schematic picture of applying HMaF to image restoration, exactly according to (Zhang et al., 2001) in which observable and hidden variables are used for (b) noisy images and (c) clear true pixels, respectively. Each of (b) and (c) has 2500 (=50 x 50) pixels, and a pixel can take one of three labels as a true value. As shown in Figure 2a, a typical network structure for the relation of hidden variables in image restoration is a grid structure in which a node is connected to neighboring four (left, right, up and down) nodes only, keeping the Markov property.
|
Figure 3a shows the result obtained by applying HMaF to the data shown in Figure 2b, while Figure 3b shows that by applying k-means (which corresponds to Figure 1a) to the same data. As shown in these figures, HMaF clearly removes the noise in Figure 2b while obviously k-means cannot, indicating that the network structure assumption in HMaF works effectively for image restoration.
|
2.3 Our model: hidden Modular random field (HMoF) for gene clustering
As shown in the previous section, the network structure of hidden variables is an important assumption to estimate the values of hidden variables exactly. The hidden structure in image restoration is a grid shape in which the number of edges from a node, i.e. the degree of a node, is basically constant. In contrast, the degree of a node in a gene network varies according to the scale-free nature (Jeong et al., 2000; Ravasz et al., 2002), suggesting that the grid shape is inappropriate for a gene network and the Markov property would be insufficient.
It is already reported that the genes having the same function (cluster) tend to be gathered together on a gene network, especially metabolic networks (Ravasz et al., 2002). This feature is called modularity, a global feature of gene networks. Modularity of a network is defined as a quantity which becomes larger as with increasing the number of edges in a cluster and with decreasing the number of edges between two different clusters (Guimera and Nunes Amaral, 2005; Guimera et al., 2004; Newman and Girvan, 2004). Modularity can then be mathematically defined as follows:
|
| (5) |
Thus, we propose a new random field using
as a potential function as follows:
|
| (6) |
| 3 METHOD: LEARNING ALGORITHM FOR HMoF |
|---|
|
|
|---|
3.1 EM algorithm
The EM algorithm is a local optimization method which is widely used for estimating parameters of a lot of probabilistic models with hidden variables (Dempster et al., 1977). According to the EM algorithm, we have to minimize the following so-called Q function:
|
| (7) |
-function is used for
We assume that the probabilistic distribution of an observable variable follows the von Mises–Fisher distribution (Mardia and Jupp, 2000), since this distribution is used for clustering high-dimensional data such as text documents (Zhong and Ghosh, 2005):
|
| (8) |
|
| (9) |
satisfies with
is a weight parameter which is not trained from data, and so CD(
) in Equation (6) becomes a constant when we derive the above Equation (9). The EM algorithm suggests a procedure to minimize J by optimizing cluster centers
For the first step, when hidden variables Z are fixed, the objective function J is minimized by the following
:
|
|
|
|
The second step is to optimize J when we fix the above cluster centers. We note that potential function
is not the sum over all hidden variables
, since hidden variables can be dependent upon each other. We then have to find the optimal combination of hidden values, but the computational complexity of this problem is NP-hard. We then use the ICM (iterative conditional modes) method (Besag, 1986) to find an approximation solution of this problem.
3.2 ICM method
The basic procedure of the ICM method repeats the following two steps until the convergence of J: (1) we first randomly permute N hidden variables. (2) In the permuted order, we update the value of the chosen hidden variable so that J should be minimized. The ICM method is a local optimization method, and so we can repeat the above procedure many times using different initial values to find the most optimal solution.
We emphasize that the ICM method for our model is time-efficient, although our model deals with the global feature of a given network. The reason is as follows: we do not have to compute function J at every time of doing the above two steps. Instead, we can compute the difference between the current J and the new J. More concretely, after we choose a hidden variable, say zn and then examine a new hidden (cluster) value, say j, the difference between the current J specified by zn and the new J by j can be given by the function of these two:
|
|
|
|
Finally, Figure 4 shows our entire algorithm of estimating probabilistic parameters by minimizing J. Line 1 is the input of initial values which are, in our experiments, obtained by repeating randomly assigning cluster labels to the nodes of a given gene network and selecting the set of cluster labels with the highest modularity which can be computed by Equation (5). Lines 3–6 are the ICM method to which, as shown in line 5, we add one constraint that the new cluster label is selected out of the labels of neighboring nodes only. This device incorporates the local information of a given network in updating labels and speeds up the efficiency of this algorithm. Lines 2–9 are the entire EM algorithm.
|
| 4 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
4.1 Data
We used metabolic pathways for a gene network and focused on the data of Saccharomyces cerevisiae in both the metabolic gene network and microarray expression data. We will show the detail of our data subsequently.
4.1.1 Metabolic network
We generated a metabolic network from KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa et al., 2006), by assigning an enzyme gene to a node and connecting the two nodes of two neighboring genes on the KEGG metabolic pathway by an edge. We removed a node with no edges and an isolated network with no more than 15 nodes. The generated network had 636 nodes (genes) and 3104 edges. We note again that this metabolic network is an undirected gene network, meaning that each node corresponds to a gene.
4.1.2 Microarray expression
From the GEO (Gene Expression Omnibus) database (Edgar et al., 2002), we chose five datasets of S. cerevisiae under the condition that each dataset has more than 50 experiments (observations), because a larger number of observations is more reliable when genes with similar expression patterns are clustered in the same group. We further added a dataset (Hughes et al., 2000) which has been often used in microarray analysis (Huang and Pan, 2006; Pan, 2006; Wu et al., 2002; Zhou et al., 2002) since it has 300 observations while a small number of missing values. Tables 1 shows the detail of the above six datasets. A missing value in these datasets was interpolated by using the 10-nearest least square method by (Troyanskaya et al., 2001).
|
4.1.3 Standard gene cluster data
Gene clusters are variable depending on the condition under which genes are placed. However, to evaluate the performance of our method for clustering genes, we generated two different sets of standard gene clusters (or true cluster labels) from KEGG and GO (Ashburner et al., 2000), which we call KEGG clusters and GO clusters, respectively.
(1) KEGG clusters: The KEGG pathway maps are classified into six major categories, including metabolism. We used 10 sub-categories under the metabolism category as KEGG clusters. Tables 2 shows the list of KEGG clusters and the number of genes in the corresponding cluster. We note that a gene can be in more than one clusters.
|
(2) GO clusters: GO is the ontology of gene functions which are classified into three categories: molecular function, biological process and cellular component. Out of these three major categories, we used the function labels in biological process only. In GO, the relation between gene functions forms a so-called DAG (directed acyclic graph) which is a directed graph with no cycles. Each node of the DAG corresponds to a gene function (cluster label) to which corresponding genes are assigned. Starting with the root of the DAG, we went from a parent to its child, checking the number of genes assigned to each node. Naturally, the number of genes reduces as with going to descendants. We fixed a cutoff value (minimum size of a cluster), and if the number of genes was less than the specified cutoff value, we went back to its parent and stopped there. If the number of genes at the parent is larger than 300, we did not use that node as a cluster because its size is too large to annotate gene function. Tables 3 shows the number of clusters obtained by the above procedure, changing the cutoff value. We selected the cutoff value of 75 to have 19 clusters as GO clusters, because of choosing a little different number from 10, which was the number of KEGG clusters. Figure 5 shows all nodes appeared during the above procedure to obtain 19 clusters which correspond to circles with thick lines in this figure. The number in each circle (node) in Figure 5 shows the number of corresponding genes. The name attached to each node shows the corresponding cluster name. As well as KEGG clusters, a gene can be in more than one clusters.
|
|
4.2 Performance evaluation: criterion
In information theory, mutual information is defined as a quantity to measure the amount of information shared between two random variables. This criterion can be applied to check the overlap between our clustering result and standard clusters. This is because the mutual information between two sets of cluster labels becomes larger as one set of clusters is more consistent with the other set of clusters. In general, mutual information is normalized since the range taken by mutual information depends on the size of given sets of clusters (Strehl and Ghosh, 2003). We note that normalized mutual information (NMI) has been widely used in a lot of applications to measure the performance of clustering methods (Zhong and Ghosh, 2003, 2005).
Let
be a cluster set of our result, where Gi be a set of genes in cluster i. Let
be a standard cluster set, where
be a set of genes in standard (true) cluster i. NMI is defined as follows:
|
|
|
|
4.3 Performance evaluation: results
4.3.1 Parameter setting and competing methods
We checked three different sizes of clusters: K = 10, 15 and 20, since two sets of standard clusters have 10 and 19 clusters.
As shown in Equation (9), the objective function that must be minimized has parameter
which varies in the range between zero and one. If we set
at zero, J has the first term only, consisting of microarray expression alone. This case, our model and its learning algorithm are almost the same as clustering genes by k-means with microarray expression data, and so hereafter we call
= 0 as k-means. On the other hand, if we set
at one, J has the second term alone, meaning that genes are clustered by a given biological network only. This is similar to graph partitioning in which nodes are clustered by using the connectivity of a given graph. Thus, we changed the value of parameter
from zero to one at every 0.1. We then mainly examined the performance of our method at three values: zero, one and
that gives the highest performance, which we denote
0,
1 and
, respectively.
Our method was further compared with two famous graph partitioning methods, called p-METIS (Karypis and Kumar, 1998a) and k-METIS (Karypis and Kumar, 1998b), both of which use only the structure of a given metabolic network, meaning that their NMIs are irrelevant to
and microarray datasets. This is true of
1.
4.3.2 NMI
We first show the result when we used KEGG clusters. Figure 6 shows the NMI of HMoF when we used Hughes as microarray expression data. This figure shows that the NMI of
was higher than those of
0 (k-means) and
1 in all three cases of K. This indicates that the clustering result obtained by using both the modularity of the metabolic network and microarray expression data is more consistent with KEGG clusters than that by either of the two types of data. We can further see that the NMI values in
were higher than those by p-METIS and k-METIS which are shown as dashed lines in Figure 6, indicating that our clustering result was overlapped more with KEGG clusters than those by the two popular graph partitioning methods. These findings were true of other microarray datasets. Tables 4 summarizes the NMI values for all six microarray datasets used in our experiments. Each
value is shown in the corresponding parenthesis. From this table, we can see that the NMI of
was the best among five compared methods in all 18 cases.
|
|
We then show the results obtained by using GO clusters as standard data. Figure 7 shows the NMI of HMoF when we used Hughes, changing K and
. The NMI values of p-METIS and k-METIS are also shown in this figure as dashed lines. From this figure, the advantage of HMoF over other four methods was confirmed again. The NMI values in this figure ranged roughly from 0.15 to 0.2, which were smaller than those by KEGG clusters, probably because the number of GO clusters is 19 which is larger than the number of KEGG clusters, i.e. 10. Tables 5 summarizes the NMI values obtained by all six microarray datasets. From this table, we can see that the NMI of
at 0.2 is almost always close to that of
at 0.2, which we denote
|
|
The above results show the clear advantage of
0), p-METIS and k-METIS, but the difference of NMI between
1 was rather slight. We then examined the statistical significance in the difference of NMI between
1. Tables 6 shows the P-values of pairwise t-test (Kreyszig, 1970) between
1 over all six microarray datasets. This table indicates that the NMI of
1 with a statistical significance of more than around 99.5% in all six cases. Overall we can say that our method of combining microarray expression data with the network modularity achieved a clearly higher performance in gene clustering than other four compared methods which use only either of the two data types.
|
4.4 Annotating gene function
Once a cluster label is assigned to each gene, we can compute the P-value between each gene function in some ontology database like GO and a cluster containing the gene in question. We can then rank the gene functions according to their P-values, and the top function can be assigned to each gene of the cluster. This manner of function annotation would be helpful for a functionally putative gene. To compute the P-values and rank them, we can use a software called GO Term Finder (Boyle et al., 2004) which uses a set of genes in a cluster as input and outputs a list of gene functions ranked according to P-values.
Tables 7 shows an example of function annotation, i.e. a list of gene functions with P-values for a cluster which is obtained under the condition that the microarray dataset was Hughes and K=20. We show this example, since this cluster has the highest P-value, 1.62E-37, among all 20 clusters obtained under the above condition. The top of this list is protein modification. In fact, this cluster contained 45 genes including, say ALG11/YNL048W (mannosyltransferase) and ALG6/YOR002W (glucosyltransferase) which add a sugar unit to a protein amino acid, meaning that their function is protein modification.
|
4.5 Analysis on clustering result
Figure 8 shows the three metabolic gene networks obtained by
0,
1 using Hughes under K = 20. Each node corresponds to a gene, and the edge between two nodes indicates that the genes of these two nodes are neighboring enzyme genes on the KEGG metabolic pathway map. These networks have the same structure but nodes were differently colored. Each color indicates a cluster, meaning that genes labeled with the same color fall into the same cluster. From this figure, we can easily see that nodes with the same color were widely distributed over the entire network in (a), while the nodes with the same color were gathered together in (c) because nodes were clustered by the network structure only.
|
We then focused on the square at the top-left part in each of the three networks in Figure 8. Figure 9 shows the enlargement of each of these squares. In Figure 9, two colors, red and green, were merged in (b) while more than two colors were merged in (a) and the two colors were clearly separated in (c). As shown in the previous section, (b), i.e.
0,
1. We then checked orange colored genes in (b) by using the KEGG database and found that they correspond to those categorized in metabolism of cofactors and vitamins, more precisely those in the folate biosynthesis pathway. Figure 10 shows this pathway in which, e.g. dihydroneopterin (which is located at almost the center) is generated from GTP, catalyzed by FOL2/YGR267C (GTP-cyclohydrolase I) and PHO8/YDR481C (repressible alkaline phosphatase). These two genes were colored orange in Figure 9b, while not in both (a) and (c), meaning that they were wrongly clustered under
0 and
1. This result is true of other orange-colored genes in (b) such as RAD54/YGL163C (DNA-dependent ATPase) and HIR2/YDR038C (Histone transcription regulator) which are also in the folate biosynthesis pathway but not colored orange in (a) and (c). These four genes were all colored dark blue in (a). A possible reason for this is that in Hughes, they must not be co-expressed with other orange colored genes such as ENA1/YDR040C (ATPase) and RAD3/YER171W (DNA helicase), though the four genes themselves might be co-expressed with each other. On the other hand, a reason why the above four genes were clustered into the green group in (c) is that since only the network structure is considered in (c), MTD1/YKR080W (methylenetetrahydrafolate dehydrogenase), a major hub colored green (located at slightly right from the center) strongly affected other genes linked to this gene. In fact, as shown in Figure 9, YGR267C, YDR481C and YGL163C are directly connected to YKR080W. Interestingly, YDR038C is not linked to YKR080W but to the above all three genes, implying that YDR038C was colored green by the three genes (YGR267C, YDR481C and YGL163C) which were already colored green by YKR080W. On contrary, a possible reason why these four genes were well clustered in (b) might be that YDR038C was first colored orange by, say MET7/YOR241W (folylpolyglutamate synthetase) and/or DFR1/YOR236W (dihydrofolate reductase), and then other three were colored orange by the link to YDR038C or by their tight co-expression with YDR038C. Thus this result and analysis also confirmed the advantage of our method for balancing the two different types of data.
|
|
| 5 CONCLUDING REMARKS |
|---|
|
|
|---|
We have presented a systematic method for gene clustering by combining expression data with the modularity of gene networks, based on learning of a probabilistic model. Experimental results showed that our method successfully integrated the totally different two types of data for clustering genes to accurately annotate gene function. Interesting future work is to use another or other multiple types of gene networks which are not necessarily curated, since KEGG is a curated database, and the KEGG network might be closely related to the gold-standard gene clusters. It would also be interesting to consider a probabilistic model in which more than one cluster labels can be probabilistically assigned to each gene.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This work is supported in part by Bioinformatics Education Program Education and Research Organization for Genome Information Science and Kyoto University 21st Century COE Program Knowledge Information Infrastructure for Genome Science with support from MEXT, Japan.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
1 A general abbreviation of a hidden Markov random field is HMRF. If we follow this abbreviation, our model which we call a hidden modular random field is also HMRF. So in this article, we use HMRF for a hidden Markov random fiela hidden Markov random field while HMoF for our moded while HMoF for our model.
| REFERENCES |
|---|
|
|
|---|
Ashburner M, et al. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat. Genet, ( (2000) ) 25, : 25–29.[CrossRef][ISI][Medline].
Besag J. On the statistical analysis of dirty pictures. J.R. Statist. Soc. B, ( (1986) ) 48, : 259–302..
Boyle EI, et al. GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, ( (2004) ) 20, : 3710–3715.
Brem RB, et al. Genetic dissection of transcriptional regulation in budding yeast. Science, ( (2002) ) 296, : 752–755.
Dempster AP, et al. Maximum likelihood from incomplete data via the EM algorithm. J.R. Statist. Soc. B, ( (1977) ) 39, : 1–38..
Edgar R, et al. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res, ( (2002) ) 30, : 207–210.
Eisen MB, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, ( (1998) ) 95, : 14863–14868.
Fjortoft R, et al. Unsupervised classification of radar images using hidden Markov chains and hidden Markov random fields. IEEE Trans. Geosci. Remote Sens, ( (2003) ) 41, : 675–689.[CrossRef].
Gasch AP, et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, ( (2000) ) 11, : 4241–4257.
Guimera R, Nunes Amaral LA. Functional cartography of complex metabolic networks. Nature, ( (2005) ) 433, : 895–900.[CrossRef][Medline].
Guimera R, et al. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E, ( (2004) ) 70, : 025101.[CrossRef].
Hanisch D, et al. Co-clustering of biological networks and gene expression data. Bioinformatics, ( (2002) ) 18, ((Suppl.)): S145–S154.[Abstract].
Huang D, Pan W. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics, ( (2006) ) 22, : 1259–1268.
Hughes TR, et al. Functional discovery via a compendium of expression profiles. Cell, ( (2000) ) 102, : 109–126.[CrossRef][ISI][Medline].
Jeong H, et al. The large-scale organization of metabolic networks. Nature, ( (2000) ) 407, : 651–654.[CrossRef][Medline].
Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res, ( (2006) ) 34, : D354–D357.
Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput, ( (1998a) ) 20, : 359–392.[CrossRef].
Karypis G, Kumar V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput, ( (1998b) ) 48, : 96–129.[CrossRef].
Kearns M, et al. An information-theoretic analysis of hard and soft assignment methods for clustering. In: 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI2001), ( (2001) ) 495–520..
Kerr MK, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. USA, ( (2001) ) 98, : 8961–8965.
Kharchenko P, et al. Expression dynamics of a cellular metabolic network. Mol. Syst. Biol, ( (2005) ) msb4100023, : E1–E6..
Kreyszig E. Introductory Mathematical Statistics, ( (1970) ) John Wiley & Sons..
Mardia KV, Jupp PE. Directional Statistics, ( (2000) ) 2nd. JohnWiley & Sons..
Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys. Rev. E, ( (2004) ) 69, : 026113.[CrossRef].
Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics, ( (2006) ) 22, : 795–801.
Ravasz E, et al. Hierarchical organization of modularity in metabolic networks. Science, ( (2002) ) 297, : 1551–1555.
Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, ( (1998) ) 9, : 3273–3297.
Storey JD, et al. Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biol, ( (2005) ) 3, : e267.[CrossRef][Medline].
Strehl A, Ghosh J. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS J. Comput, ( (2003) ) 15, : 208–230.
Tamayo P, et al. Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, ( (1999) ) 96, : 2907–2912.
Tavazoie S, et al. Systematic determination of genetic network architecture. Nat. Genet, ( (1999) ) 22, : 281–285.[CrossRef][ISI][Medline].
Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics, ( (2001) ) 17, : 520–525.
Wu LF, et al. Large-scale prediction of saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet, ( (2002) ) 31, : 255–265.[CrossRef][ISI][Medline].
Yvert G, et al. Trans-acting regulatory variation in saccharomyces cerevisiae and the role of transcription factors. Nat. Genet, ( (2003) ) 35, : 57–64.[ISI][Medline].
Zhang K, Zhao H. Assessing reliability of gene clusters from gene expression data. Funct. Integr. Genomics, ( (2000) ) 1, : 156–173.[CrossRef][Medline].
Zhang Y, et al. Segmentation of brain MR images through a hidden Markov random field model and the Expectation-Maximization algorithm. IEEE Trans. Med. Imaging, ( (2001) ) 20, : 45–57.[CrossRef][ISI][Medline].
Zhong S, Ghosh J. A unified framework for model-based clustering. J. Mach. Learn. Res, ( (2003) ) 4, : 1001–1037.[CrossRef].
Zhong S, Ghosh J. Generative model-based document clustering: a comparative study. Knowl. Inf. Syst, ( (2005) ) 8, : 374–384.[CrossRef].
Zhou X, et al. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA, ( (2002) ) 99, : 12783–12788.
| |||||||||||||||||||||||













