Bioinformatics Advance Access originally published online on November 7, 2007
Bioinformatics 2007 23(24):3335-3342; doi:10.1093/bioinformatics/btm526
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Hierarchical tree snipping: clustering guided by prior knowledge
1Department of Computer Science, Ben Gurion University, Beer Sheva 84105, Israel, 2Department of Biomedical Engineering, 3Center for Advanced Genomic Technology, 4Bioinformatics Program and 5Children's Hospital Boston, Harvard/MIT Program in Health Sciences and Technology, 300 Longwood Avenue, Boston, MA 02115, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Hierarchical clustering is widely used to cluster genes into groups based on their expression similarity. This method first constructs a tree. Next this tree is partitioned into subtrees by cutting all edges at some level, thereby inducing a clustering. Unfortunately, the resulting clusters often do not exhibit significant functional coherence.
Results: To improve the biological significance of the clustering, we develop a new framework of partitioning by snipping—cutting selected edges at variable levels. The snipped edges are selected to induce clusters that are maximally consistent with partially available background knowledge such as functional classifications. Algorithms for two key applications are presented: functional prediction of genes, and discovery of functionally enriched clusters of co-expressed genes. Simulation results and cross-validation tests indicate that the algorithms perform well even when the actual number of clusters differs considerably from the requested number. Performance is improved compared with a previously proposed algorithm.
Availability: A java package is available at http://www.cs.bgu.ac.il/~dotna/ TreeSnipping
Contact: dotna{at}cs.bgu.ac.il
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Hierarchical agglomerative clustering is a widely used tool in computational biology with applications to phylogeny, evolution, functional genomics and analysis of microarray data, among others. In functional genomics, hierarchical clustering is routinely used to cluster genes into groups based on the similarity of their mRNA expression profiles in a set of experiments (Eisen et al., 1998), and is a standard tool in such software packages as GenePattern (Reich et al., 2006). Typically the resulting clusters serve to provide biological insight into the experimental data, or serve as a basis for further analysis, such as predicting the function of a gene from the known functions of other genes in the cluster (Wu et al., 2002). The usefulness of such a clustering derives from the assumption that genes with similar expression profiles have some characteristic in common, such as being functionally related (e.g. involved in a particular biological process or pathway), or being regulated by the same set of transcription factors. However, sometimes genes with similar expression profiles do not have common characteristics (Clare and King, 2002; Gibbons and Roth, 2002; Rodriguez-Trelles et al., 2005). Some possible reasons for this phenomenon are the following. Since expression data are noisy, the observed expression profile of a gene may appear to be more similar to the profile of a functional category that differs from its own. Furthermore, expression profiles of genes in one functional category may be diverse and may to a certain extent overlap expression profiles of another category. In addition, annotations may be erroneous, and a gene may lack a certain annotation simply because it has not been the subject of an appropriate experiment.
One way to address this difficulty is to subject to further analysis only those clusters defined by subtrees of the hierarchical tree that are statistically enriched (Adryan and Schuh, 2004; Buehler et al., 2004; Doherty et al., 2006; Okada et al., 2005; Raychaudhuri et al., 2003; Tavazoie et al., 1999; Toronen, 2004), similarly to methods that seek enriched terms in clusters obtained from more general clustering methods (Curtis et al., 2005; Draghici et al., 2003; Liu et al., 2004).
An alternative approach is to integrate prior biological knowledge into the clustering procedure itself (Cheng et al., 2004; Fang et al., 2006; Hanisch et al., 2002; Huang and Pan, 2006; Kustra and Zagdanski, 2006; Pan, 2006). Some previous studies used model-based integration methodologies in order to explicitly model the joint probability of gene expression and functional labels. Other studies proposed to modify the similarity measure between two genes to be a linear combination of the similarity of their expression profiles and their functional similarity, appropriately defined. These modifications usually lead to different clustering results from the ones produced solely on the basis of gene expression clustering, thereby affecting interpretability. Another shortcoming of some of the previous proposals was that they caused the separation of genes of known function from genes of unknown function, if the latter are considered at all. In contrast, our approach does not require the definition of a new functional similarity measure, nor does it bias against genes of unknown function.
At the heart of the methodology proposed here is a novel partitioning of the tree produced by the hierarchical clustering. Instead of the standard partitioning, which results from making a single horizontal cut through the tree, the clusters are obtained by snipping the tree—cutting selected edges at possibly different levels. The selection of the snips is guided by an objective function, which aims to construct a partition that is maximally consistent with partially available background knowledge. The precise form of the objective function depends on the particular application that is to be served by the partition.
To illustrate our methodology two applications are detailed. The first one concerns the problem of functional annotation of currently unannotated proteins. In this case it is desirable that all annotated genes in a cluster have at least one annotation in common, in order for the prediction to be as reliable as possible. The second application aims to discover functionally enriched clusters of genes with similar expression profiles. In this case, it is acceptable for a cluster to contain a few seemingly misclassified genes, whose annotation differs from the majority annotation, and the objective of the algorithm is to find a partition with the minimum number of misclassified genes. Thus, the drive to enrich an individual cluster is tempered by the need to keep the overall number of misclassified genes low. We present algorithms for both cases, as well as results of their application to biological and simulated data. Our methods can be used as a simple post-processing step to enhance the results of any hierarchical clustering software. For definiteness, we describe our approach in terms of a similarity measure based on gene expression profiles with the prior knowledge about the genes being the biological processes in which the annotated genes are known to participate. However, our methodology is applicable whenever a hierarchical clustering of objects may also take into account a partial labeling of the objects, e.g. clustering of samples with partially known tumor classification (Alizadeh et al., 2000).
| 2 ALGORITHMS |
|---|
|
|
|---|
2.1 Standard hierarchical clustering
A hierarchical (agglomerative) clustering algorithm initially places each gene in a cluster by itself, and then it recursively and greedily merges the clusters that are closest to each other, according to some distance or similarity criterion, until a single tree (dendrogram) is created. Henceforth, we will refer to the tree produced in the course of a hierarchical clustering algorithm as the h-tree. An issue that has to be faced by all hierarchical clustering algorithms is how to induce a clustering of the genes once the h-tree has been constructed. The standard method for doing this is to make a horizontal cut through the h-tree, thereby partitioning it into subtrees, and assigning all genes in a subtree to the same cluster.
Although many other clustering methods exist, hierarchical clustering is one of the most commonly used methods in analysis of microarray data.
2.2 Overview of the proposed framework
We propose here to incorporate into the partitioning of the h-tree partial information about the genes so as to improve the biological significance of the resulting clusters. The partial information is in the form of labels for some of the genes. For example, in the application of hierarchical clustering to functional prediction discussed in the Results section, the labels of the genes are the biological process GO annotations. In other applications the labels could be the transcription factors known to regulate genes, or the developmental phase of the organism, as measured in a temporal gene expression experiment.
Our aim is to obtain clusters that are substantially enriched in genes participating in the same biological process, or more generally having the same label (as pure as possible). We partition the h-tree by snipping—cutting selected edges at possibly different levels so as to maximize cluster purity criteria.
The input to the algorithms contains therefore two types of information: (1) the h-tree, and (2) for each gene a list of all its known labels. The list of an unlabeled gene contains all possible labels, thereby allowing the algorithms complete freedom in selecting the best label that can also serve as the predicted label for further investigation. The output of the algorithms is a partition of the tree into connected components, rather than complete subtrees. Each of the components induces a cluster, which is labeled with any one of the labels that is enriched in this subset, the cluster label. Figure 1B illustrates three labeled clusters obtained from the tree in Figure 1A by snipping two edges.
|
We describe algorithms for two natural variants of partitioning a labeled h-tree into clusters. In the first variant, the objective is to produce a partition of the tree such that all the annotated genes in the same cluster have at least one label in common (additional genes without previous annotation are also allowed). We will call a cluster with this property a pure cluster, and a partition into pure clusters a mosaic partition. Such a partition can, of course, be achieved by putting each leaf into a cluster by itself. Instead, we seek a partition that requires the minimum number of edge snips. For example, the mosaic partition in Figure 1B is minimal. This variant is used in the functional prediction application.
The aim of the second variant is to functionally characterize the different expression profiles appearing in the dataset. For this variant, it is not as important that all genes in a cluster have a label in common. Indeed, genes may lack a particular annotation simply because the relevant experiment has not yet been performed. Assuming that a cluster has been assigned a label, define a leaf (gene) to be misclassified, with respect to its labeled cluster, if none of the gene's labels is the same as the cluster label. The desired partition is one minimizing the total number of misclassified genes. Figure 1C shows the standard partition of the tree into two subtrees with a total of two misclassified leaves. The different partition shown in Figure 1D has only one misclassified leaf, the minimal number for a partition of the given tree into two subtrees.
A similar methodology is routinely practiced in decision tree induction when the initial tree is pruned into the smallest tree that minimizes the overall misclassification rate (Murthy et al., 1994) but the optimization criteria and algorithms are different.
2.3 Snipping for a minimal mosaic partition
The first problem we address is the partitioning of an h-tree into the smallest number of pure components (clusters), a minimal mosaic partition.
The Minimal Mosaic algorithm associates with each node v two variables:
- An integer minSnips(v), which equals the minimum number of edges that have to be snipped in order to construct a mosaic partition of the subtree rooted at v. For a leaf v minSnips(v) = 0.
- A list List(v) of labels containing a label l if and only if there is a minimal mosaic partition of the subtree rooted at v with the property that all genes in the component of the partition that contains v have the label l. For a leaf v, List(v) equals the given list of annotations of v (all annotations if v had no annotation).
The algorithm traverses the tree in bottom-up fashion, computing the values of these variables in greedy fashion, according to the recursion formula given in Figure 2.
|
Once minSnips(root) is computed, it is possible to determine which edges are to be snipped by retracing the optimal values down the tree.
Observe that the problem of finding a minimal mosaic partition can be viewed equivalently as one of finding the assignment of a single annotation to each internal node of the h-tree, as well as a single annotation to each leaf with multiple annotations, such that the total number of internal nodes and leaves whose annotation differs from that of their parent is minimized. Here a leaf without any annotation is considered to have all possible annotations, allowing it to conform to any annotation of its parent node.
Consider for a moment the case that none of the leaves has multiple annotations. Then, the problem just described requires the assignment of a single annotation to each internal node such that the total number of changes of annotation between a node and its parent is minimized.
This is precisely the small parsimony problem of evolutionary phylogenetics, in which the annotations of the leaves are morphological or molecular characters. Our problem is therefore a generalization of the small parsimony problem, in that it allows leaves to have more than one annotation, as well as no annotation whatsoever. The Minimal Mosaic algorithm is indeed but a variation of the small parsimony algorithm proposed by Fitch (1971). Consequently, the proof that it finds a minimal mosaic partition is but a variation of the proof of correctness of the small parsimony algorithm given by Hartigan (1973).
The algorithm runs in O(nL) time, using O(nL) space, where n is the number of genes (tree leaves) and L is the total number of possible labels.
It is of interest to observe that the algorithm constructs a partition having the desirable additional property that its highest snip is closest to the root among all minimal mosaic partitions of the h-tree (see the Supplementary Material for a formal proof).
2.4 Snipping to minimize misclassification
Recall that for the second application, the functional characterization of the different expression profiles appearing in the dataset, the main objective is to assign each cluster a label. It is permissible, however, for a cluster to contain a few seemingly misclassified genes that do not have a label in common with the label assigned to the cluster.
The Minimal Misclassified algorithm for finding a K-partition of the h-tree with the minimal number of misclassified genes, and for each cluster in the partition a cluster label, is based on dynamic programming and traverses the tree in bottom-up fashion.
The algorithm computes for each node v, functional label l and the number of snips k (0
k < K) the value minMis(v,l,k), which equals the minimal number of misclassified leaves when node v is labeled with label l and it is permitted to snip k edges of the subtree rooted at node v [creating a (k + 1)-partition]. Define minNum(v,k) to be the minimum of minMis(v,l,k) over all possible labels l. The initial value of minMis(v,l,k) for each leaf v is 0 when v is labeled with l and 1 otherwise.
The algorithm traverses the tree in bottom-up fashion, computing the values of these variables by the recurrence formulae given in Figure 3. The formula for MinMis(v,l,k) consists of four cases:
- Case 1: neither of the edges from the node v to its children left and right are snipped;
- Case 2: the edge from v to right is snipped;
- Case 3: the edge from v to left is snipped and
- Case 4: the edges from v to both left and right are snipped.
|
Since the tree is traversed in bottom-up fashion, when computing minMis(v,l,k), both minMis(v',l,k) and minNum(v',k) have already been computed for both children v' of v. The minimum number of misclassified leaves for an optimal K-partitioning of the tree is minNum(root, K – 1). Once this number is computed, the appropriate snips can be found by the usual traceback, from the root of the tree down to the leaves. The algorithm runs in O(nLK2) time, and uses O(nLK) space where n is the number of genes (tree leaves), L is the total number of possible labels and K is the requested number of snips. The formal proof, a time complexity analysis and description of the traceback phase are given in the Supplementary Material.
In theory it is possible that the Minimum Misclassification algorithm groups into one cluster genes whose expression profiles differ significantly from each other, since it takes into account only the structure of the tree and the labels of the leaves. Although no such undesirable clusters were observed in our analyses, we incorporated into our software an additional (user-settable) feature that enables the user to limit the diameter of each obtained cluster. Its effect is to first generate a decomposition of the h-tree into subtrees with limited diameter, and then to apply an extended version of the Minimum Misclassification algorithm to the resulting forest.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Functional labeling by snipping
To illustrate the power of the snipping framework we describe two applications. The first concerns functional prediction using microarray data. When a gene without any known function is associated with a cluster of genes that have a function in common, then it is reasonable to hypothesize that the same function can be attributed to that gene. This principle is often referred to as guilt by association. When the aim is to identify new genes that participate in a specific process t, only those clusters that carry the label t are of interest. Therefore, we adopt the following procedure:
- Affix a label t to all genes annotated with t, or any more specific annotation, even if they participate in additional processes. Label all other annotated genes according to the processes they participate in (alternatively, they can simply be labeled not t since only the t clusters are of interest; this will reduce the running time of the algorithm, which is proportional to the number of labels).
- Construct an h-tree for the gene expression dataset, using complete linkage with uncentered correlation as a metric.
- Snip the h-tree into pure clusters using the Minimal Mosaic algorithm.
- Predict the annotation t for each unannotated gene that resides in a cluster of size at least s that contains at least m genes annotated with t.
By assigning a gene annotated with t only the one label, the procedure ensures that all such genes will be assigned to a cluster carrying the label t, even if they participate in additional processes.
3.1.1 Proof of concept: the oogenesis process
We employed this procedure to annotate genes with the GO biological process oogenesis (GO:0048477) (this term was chosen because of an existing collaboration focusing on the process). The h-tree was constructed using the data published by Arbeitman et al., 2002) on the transcriptional profiles of 5005 genes, about one-third of all predicted Drosophila melanogaster genes, for 75 individuals at different developmental stages. Of these microarray data, we used all eights sets for adult females of different ages. The data for a one-hour-old embryo were used as well, since the maternal transcripts dominate at this first developmental stage. The microarray data for a 30-day-old male fly were taken to serve as a reference. The genes were annotated by mapping biological process annotations for fruit-fly (Revision: 1.60, 23 July 2005) from the GO consortium (Harris et al., 2004). All annotated genes were labeled with either the oogenesis label or the not oogenesis label. The unannotated genes were labeled with both labels.
The Minimal Mosaic algorithm identified a total of four clusters containing five genes or more, of which at least two were labeled oogenesis.
One of the clusters identified by the algorithm contained five genes. Two of these genes, Bicaudal D and E2f, are annotated to oogenesis, while three, diskette, CG4050 and CG14005, currently lack annotation to a biological process and are therefore predicted to be annotated to oogenesis. We were able to find corroborating evidence for this prediction for each of the genes.
The gene diskette (also known as dAda3 or Ada3) is a homologue of the human ada3. This gene encodes for one of the Gcn5-containing N-acetyltransferase (GNAT) complex subunits (Kusch et al., 2003). In the same article, the authors suggested that the GNAT components play an important role during oogenesis and in early embryonic gene regulation. Moreover, mosaic analysis suggests that this gene is required for oogenesis, as it is not possible to recover germ-line mosaics from three mutant alleles of ada3 subunits (Shafi et al., 2000).
The gene CG4050 is homologous to the OGT genes in human, Rattus norvegicus and Caenorhabditis elegans. It is therefore likely that CG4050 encodes a UDP-N-acetylglucosamine-peptide N-acetylglucosaminyltransferase. Other fly genes with acetylglucosaminyltransferase activity such as brainiac and egghead are known to take part in oogenesis. Deletion of the OGT gene results in loss of embryonic stem cell viability. Moreover, segregation of OGT alleles in the mouse germ line with ZP3-Cre recombination in oocytes reveals that intact OGT alleles are required for completion of embryogenesis (Aguilera et al., 1999).
The third gene, CG14005, has no significant sequence similarity to any other gene. However, this gene encodes for a protein that has been found to interact with zfh1 (Giot et al., 2003), a protein participating in germ-cell migration, a part of the gametogenesis process. It is noteworthy that among the five genes that were the most highly correlated with diskette, only one was annotated with oogenesis while the four others were annotated with different biological processes. The same was true for CG4050, while among the five genes that had the highest correlation with CG14005, only two genes were annotated with oogenesis while the others were annotated with different biological processes. Thus, none of the three predicted genes would have been connected to the oogenesis process on the basis of these correlations.
For the other three clusters (of sizes 5, 6 and 8) that were identified, the corroborating evidence for annotation to oogenesis of the unannotated genes was not as strong.
3.2 Discovery of functionally enriched clusters by snipping
By design, the overall enrichment of the clusters obtained from our algorithm will be larger than one of clusters obtained from the standard partitioning of the h-tree. The ability of the algorithm to discover functionally enriched clusters was therefore evaluated by cross-validation tests in a simulation study, as well as in a gene function prediction test using actual gene expression profiles.
3.2.1 A simulation study
The purpose of the simulation study was to evaluate the success of the algorithm, as measured by its predictive ability, to cluster labeled data, in settings of variable difficulty determined by various noise parameters. To this end we constructed datasets of points in a 10-dimensional space. Each dataset consisted of 5000 points, generated from a uniform mixture of 10 independent Gaussians. The points generated from the first Gaussian will be referred to as the first cluster, and the points generated from all other Gaussians together will be referred to as the second cluster. Next, we constructed a labeling of the points as follows: 80% of the points from the first cluster was given label 1, and 80% of the points from the second cluster was assigned the label 2. The labels of 10% of the labeled points were reversed, to simulate errors in the labeling of the clusters. The classification of the points that remained unlabeled forms the basis for the performance evaluation of the different clustering methods. We discuss the details of this simulation study in the Supplementary Material.
We then constructed the h-tree using Euclidean distance with complete-linkage, and compared the clustering performances of the standard partitioning, the Minimum Misclassification algorithm, the Minimal Mosaic algorithm and the Bayes classifier. The Bayes classifier labels a point 1 if the probability that the point was generated by the first Gaussian is higher than the probability that it was not generated by that Gaussian. This classifier is based on a complete knowledge of the mixture model employed to generate the points, (it cannot be used on real data, of course). The first three procedures used the h-tree and labeled each resulting cluster by its majority label. The tree-snipping algorithms also used the labeling of the points, as constructed above.
To measure the ability of the different methods to assign the unlabeled points to their true cluster, we used the F-measure. This measure is defined as the harmonic mean of precision and recall, and is commonly used in the performance assessment of functional prediction programs. It is thought to be more informative than the standard accuracy measure, especially when the number of true positives is relatively small compared to the number of true negatives. We define true positives to be points from the first cluster that are correctly predicted to have originated from the first cluster. False negatives are points from the first cluster that are predicted to have originated from the second cluster, and similarly, false positives are points from the second cluster that are predicted to have originated from the first cluster.
Figure 4A compares the average F-measures of the four methods, with the standard method and the Minimum Misclassification algorithm partitioning the h-tree into 10 clusters.
|
The Minimum Misclassification algorithm performed better than the standard partitioning. It also performed better than the Minimal Mosaic algorithm when the Gaussians were relatively well separated, presumably since it is given the advantage of knowing the actual number of clusters in the dataset. However, as the data become more complex, which is simulated here by larger SDs, the clustering performance of the standard partitioning drops rapidly, while the performance of the Minimal Mosaic algorithm decreases the least. These results support the hypothesis that our algorithms are able to adapt to reasonably complex data.
Figure 4B compares the clustering performances of the standard method and the Minimum Misclassification partitioning, as a function of the number of clusters the algorithms are required to produce. The performance of the standard method is poor as long as the number of clusters it partitions into is smaller than the actual number of clusters. In contrast, the Minimum Misclassification partition is much more robust with respect to the number of clusters, suggesting that the algorithm may be useful even when the number of clusters is known only approximately.
Very similar results (data not shown) were obtained in cases when several of the 10 Gaussians were labeled with label 1. By design, the Minimum Misclassification algorithm is able to pick out clusters of points largely carrying the same label from a background of points labeled with a different label, once the pre-specified number of requested snips is large enough. In contrast to the standard h-tree partition, this partition is not influenced by the differences in size and height between subtrees.
In summary, by taking advantage of the known labeling the Minimum Misclassification algorithm was able to exhibit improved classification abilities, greater fault tolerance to labeling errors and decreased sensitivity to the estimate of the actual number of clusters present in the data.
3.2.2 Cross-validation test using gene expression profiles
We further tested the performance of our algorithm in a cross-validation experiment using actual gene expression data. Its performance in this experiment was compared with the performance of a shrinkage algorithm that was recently proposed by Huang and Pan (2006). The input to the latter algorithm is the same as the input to our algorithm, namely a matrix of distances between all pairs of genes, derived from the gene co-expression data, as well as annotation data for some of the genes. The algorithm first computes a new distance matrix for the genes with known annotations, as follows: given a shrinkage parameter r
1, the distance between genes i and j, dij, is reduced to rdij, if i and j share an annotation. The new distance matrix is then used to cluster the annotated genes into K0 clusters. Finally, the unannotated genes are clustered using the original distance matrix while given the option to either join one of the K0 clusters that were created in the first stage, or join one of up to K1 additional clusters. The shrinkage algorithm was chosen for comparison because it is applicable to hierarchical clustering (as well as to other clustering procedures), it has the ability to deal with unannotated genes, and it does not use a model-based integration methodology. In our experiments K1 was set to 0, since in the evaluation procedure of Huang and Pan (2006) the association of a gene with one of the new K1 clusters is treated as an incorrect prediction. Two values of the shrinkage parameter r, 0.8 and 0.9, were tested, since these performed best in Huang and Pan (2006). We used the newly created distance matrix and complete linkage to construct a dendrogram from which K0 clusters were derived by standard partitioning. The unlabeled genes were then assigned to clusters, similarly to the procedure described in the original paper. In order to prevent biasing the results because of different linkage procedures, we used complete linkage when constructing the h-tree that was used in our algorithm as well.
The expression data were taken from a Saccharomyces cerevisiae cell-cycle experiment (Spellman et al., 1998). After log-transforming, the data missing values were replaced with zeros. Biological-process annotations were downloaded from GO (Revision: 1.1310). Similarly to the procedure described by Huang and Pan (2006), we first considered the set of genes annotated with at least one of the two biological processes: cell cycle (GO:0007049) and protein transport (GO:0015031). There are 373 and 232 genes in these categories, respectively. In a subsequent test, a third process annotation, reproduction (GO:0000003), was added to the previous two. There are 242 genes in this third category. In both cases, as a further test of our algorithm, we compared the results with those obtained when all the 1511 yeast genes that currently lack functional annotation participated in the construction of the h-tree. These 2- and 3-category data were then used for a V-fold cross-validation test. In such a test, the dataset is randomly partitioned into V equally sized subsets. The prediction accuracy is calculated as the average accuracy over V tests. In each test, the annotations of each of the genes from one subset are hidden, and then predicted according to the dominant annotation of the cluster to which it is assigned. The two (respectively three) known (chosen) GO categories were used to guide both algorithms. Genes without functional annotation did not participate in the cross-validation evaluation.
Figure 5 summarizes the prediction accuracy of the different algorithms in a 5-fold cross-validation test. Here we took an average over five independent cross-validation tests, and measured the accuracy of classifications following Huang and Pan (2006). The Minimum Misclassification algorithm outperforms the competitor algorithms, in both the 2-class and the 3-class tests. The addition of the unannotated genes tends to improve performance in most cases; possibly this is due to the presence among them of genes that do participate in the tested processes (although it is not known yet), which enhances the corresponding clusters. The performance of the Minimum Misclassification algorithm appears to be relatively insensitive to the number of requested clusters, a desirable property in practice, as discussed in the previous sections.
|
| 4 DISCUSSION |
|---|
|
|
|---|
It is often difficult to detect a weak functional signal in the typically noisy measurements of biological processes. Expression data, for example, produce a relatively faint signal, in the sense that many of the clusters obtained by the standard clustering methods are not enriched with any specific biological process. It is increasingly recognized that computational integration of additional data sources can alleviate this problem. We described a novel data integration methodology that fuses information about genes in the form of labels with similarity information, such as correlation of transcriptional expression profiles.
The utility of the methodology was demonstrated with algorithms for two key applications: functional prediction of genes, and discovering functionally enriched clusters. The functional prediction algorithm was used to generate oogenesis annotation predictions for D.melanogaster genes, which were corroborated by considerable biological evidence. Simulation results show that the algorithm of choice for discovering functionally enriched clusters is the Minimum Misclassification algorithm, and that it performs well on data that are noisy and complex, with large numbers of background profiles that have considerable overlap with the target profiles.
Furthermore, it is able to identify qualitatively different profiles existing within the set of genes annotated to the same term, and it performs well even when the actual number of clusters differs considerably from the requested number. Determining this actual number of clusters, k, is a difficult problem (Bickel, 2003; Bolshakova and Azuaje, 2006; and others). In hierarchical clustering, one often selects a k such that the height of the resulting highest subtree is significantly lower than the highest subtree results from selecting k – 1. We similarly suggest selecting a value of k that provides significantly better performance than the preceding value. The simulation results indicate that our algorithm may be of value in situations in which genes that are annotated with the same biological process have two (or more) distinct expression profiles (Yona et al., 2007). For example, the expression profiles of cell cycle-related genes, measured at successive time points in the cell cycle, naturally fall into qualitatively distinct categories, corresponding to the stage of the cell cycle in which they take part.
The tree-snipping algorithms can also be used for generating a linear ordering of the leaves of the h-tree. By swapping the two subtrees at internal nodes, different orderings that are consistent with the tree can be obtained. Additional criteria can therefore be imposed on the ordering, for example to better display features of the clustering. It was previously proposed to minimize the sum of successive pair-wise distances (Bar-Joseph et al., 2003), in order to generate an ordering with smooth transitions from one gene profile to another. For our purpose of generating biologically meaningful clusters, it is natural to first order the leaves according to the labeled clusters, obtained by a snipping algorithm, while minimizing successive pair-wise distances only within the labeled clusters.
The snipping framework can be applied also to data other than expression data. For example, a hierarchical tree can be constructed for proteins with the similarity of two proteins based on the statistical significance of their having shared partners in a protein–protein interaction network (Samanta and Liang, 2003). Another interesting possibility is to apply the methodology to functional annotations of genes derived from phylogenetic trees on proteins constructed using sequence similarity (Durbin et al., 1998; Kaplan et al., 2004).
The framework described here requires two sources of information: similarity information about all pairs of data items (genes), and labels for some of the items. A number of efforts have shown the value of integration including (Segal et al., 2003; Tamada et al., 2003) among numerous others. Our setting resembles the one for the popular and often extremely effective method of semi-supervised learning, which also attempts to extract information from multiple data sources. Previous approaches to semi-supervised clustering, such as (Bansal et al., 2004; Bilenko et al., 2004; Klein et al., 2002), assumed that each data item has at most one label, and that items which are similar to a labeled item should probably be in the same cluster. In many applications of biological interest these assumptions do not hold. For example, when analyzing data from a microarray experiment, and labeling genes with their GO biological process, it is often the case that a gene has more than one annotation, and that two genes that are annotated with the same process are not similarly expressed.
Our methodology can thus be viewed as a novel instance of semi-supervised learning that integrates partially labeled and unlabeled data with the structural constraints imposed by the hierarchy produced by an agglomerative clustering algorithm. Namely, the partition of the data into partially labeled clusters is guided by both the available labels and the similarity-based clusters. There are a number of useful extensions that will be incorporated in future versions of this methodology, which have the weights on the branches of the tree as well as other factors influence the optimal partitioning decisions. If the weights correspond to probabilities, as in graphical models in the form of trees (Pearl, 1988; Zheng et al., 2005), then the functional prediction problem can be framed as the problem of inferring the missing functional labels in probabilistic trees and can be solved using standard belief propagation. However, such probabilities are more often than not difficult to estimate, and the procedures described in this article provide a viable alternative.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Nir Tzachar and Itamar Cohen for their help. We thank the reviewers for their thoughtful comments. This work is supported by the Lynn and William Frankel Center for Computer Science, the Paul Ivanier center for robotics research and production, NSF grant ITR-048715, NHGRI grants #1R33HG002850-01A1 and R01 HG003367-01A1 and NIH grant U54 LM008748.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on April 22, 2007; revised on October 9, 2007; accepted on October 15, 2007
| REFERENCES |
|---|
|
|
|---|
Adryan B, Schuh R. Gene-Ontology-based clustering of gene expression data. Bioinformatics (2004) 20:2851–2852.
Aguilera M, et al. DADA3: cloning and characterization of a Drosophila melanogaster homolog of a histone-acetylase complex component. A. Dros. Res. Conf (1999) 40:473A.
Alizadeh AA, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000) 403:503–511.[CrossRef][Medline]
Arbeitman MN. Gene expression during the life cycle of Drosophila melanogaster. Science (2002) 297:2270–2275.
Bansal N, et al. Correlation clustering. Mach. Learn (2004) 56:89–113.[CrossRef]
Bar-Joseph Z, et al. K-ary clustering with optimal leaf ordering for gene expression data. Bioinformatics (2003) 19:1070–1078.
Bickel DR. Robust cluster analysis of microarray gene expression data with the number of clusters determined biologically. Bioinformatics (2003) 19:818–824.
Bilenko M, et al. Integrating constraints and metric learning in semi-supervised clustering. (2004) Proceedings of the International Conference on Machine Learning. 81–88.
Bolshakova N, Azuaje F. Estimating the number of clusters in DNA microarray data. Methods Inf. Med (2006) 45:153–157.[Web of Science][Medline]
Buehler EC, et al. The CRASSS plug-in for integrating annotation data with hierarchical clustering results. Bioinformatics (2004) 20:3266–3269.
Cheng J, et al. A knowledge-based clustering algorithm driven by Gene Ontology. J. Biopharm. Stat (2004) 14:687–700.[CrossRef][Medline]
Clare A, King RD. How well do we understand the clusters found in microarray data? In Silico Biol (2002) 2:511–522.[Medline]
Curtis RK, et al. Pathways to the analysis of microarray data. Trends Biotechnol (2005) 23:429–435.[CrossRef][Web of Science][Medline]
Doherty JM, et al. GOurmet: a tool for quantitative comparison and visualization of gene expression profiles based on gene ontology (GO) distributions. BMC Bioinformatics (2006) 7:151.[CrossRef][Medline]
Draghici S, et al. Global functional profiling of gene expression. Genomics (2003) 81:98–104.[CrossRef][Web of Science][Medline]
Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1998) Cambridge, UK: Cambridge University Press.
Eisen M, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998) 95:14863–14868.
Fang Z, et al. Knowledge guided analysis of microarray data. J. Biomed. Inform (2006) 39:401–411.[CrossRef][Web of Science][Medline]
Fitch WM. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool (1971) 20:406–416.
Gibbons FD, Roth FP. Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res (2002) 12:1574–1581.
Giot L, et al. A protein interaction map of Drosophila melanogaster. Science (2003) 302:1727–1736.
Hanisch D, et al. Co-clustering of biological networks and gene expression data. Bioinformatics (2002) 18:145–154.
Harris MA, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res (2004) 32:D258–D261.
Hartigan JA, et al. Minimum mutation fits to a given tree. Biometrics (1973) 29:53–65.[CrossRef][Web of Science]
Huang D, Pan W. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics (2006) 22:1259–1268.
Kaplan N, et al. A functional hierarchical organization of the protein sequence space. BMC Bioinformatics (2004) 5:196.[CrossRef][Medline]
Klein D, et al. From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. (2002) Proceedings of the International Conference on Machine Learning. 307–314.
Kusch T, et al. Two Drosophila Ada2 homologues function in different multiprotein complexes. Mol. Cell. Biol (2003) 23:3305–3319.
Kustra R, Zagdanski A. Incorporating Gene Ontology in Clustering Gene Expression Data. IEEE Symposium on Computer-Based Medical Systems (2006) 555–563.
Liu J, et al. Gene Ontology friendly biclustering of expression profiles. (2004) Proceedings of the IEEE Computational Systems Bioinformatics Conference. 436–447.
Murthy SK, et al. A system for induction of oblique decision trees. J. Artif. Intell. Res (1994) 2:1–33.
Okada Y, et al. Knowledge-assisted recognition of cluster boundaries in gene expression data. Artif. Intell. Med (2005) 35:171–183.[CrossRef][Web of Science][Medline]
Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics (2006) 22:795–801.
Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (1988) San Mateo, CA: Morgan Kaufmann.
Raychaudhuri S, et al. The computational analysis of scientific literature to define and recognize gene expression clusters. Nucleic Acids Res (2003) 31:4553–4560.
Reich M, et al. GenePattern 2.0. Nat. Genet (2006) 38:500–501.[CrossRef][Web of Science][Medline]
Rodriguez-Trelles F, et al. Is ectopic expression caused by deregulatory mutations or due to gene-regulation leaks with evolutionary potential? Bioessays (2005) 27:592–601.[CrossRef][Web of Science][Medline]
Samanta MP, Liang S. Predicting protein functions from redundancies in large-scale protein interaction networks. Proc. Natl Acad. Sci. USA (2003)) 100:12579–12583.
Segal E, et al. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics (2003) 19(Suppl. 1):i264–i271.[Abstract]
Shafi R, et al. The O-GlcNAc transferase gene resides on the X chromosome and is essential for embryonic stem cell viability and mouse ontogeny. Proc. Natl Acad. Sci. USA (2000) 97:5735–5739.
Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell (1998) 9:3273–3297.
Tamada Y, et al. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics (2003) 2):II227–II236.
Tavazoie S, et al. Systematic determination of genetic network architecture. Nat. Genet (1999) 22:281–285.[CrossRef][Web of Science][Medline]
Toronen P. Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics (2004) 5:32.[CrossRef][Medline]
Wu LF, et al. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet (2002) 31:255–265.[CrossRef][Web of Science][Medline]
Yona G, et al. Comparing algorithms for clustering of expression data – how to assess gene clusters. In: Computational Systems Biology (2007) Totowa, NJ: Humana press.
Zheng Y, et al. Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics (2005) 6:243.[CrossRef][Medline]
This article has been cited by other articles:
![]() |
D. Dotan-Cohen, S. Kasif, and A. A. Melkman Seeing the forest for the trees: using the Gene Ontology to restructure hierarchical clustering Bioinformatics, July 15, 2009; 25(14): 1789 - 1795. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Krushevskaya, H. Peterson, J. Reimand, M. Kull, and J. Vilo VisHiC--hierarchical functional enrichment analysis of microarray data Nucleic Acids Res., July 1, 2009; 37(suppl_2): W587 - W592. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






