Bioinformatics Advance Access originally published online on February 15, 2005
Bioinformatics 2005 21(10):2546-2547; doi:10.1093/bioinformatics/bti317
© Published by Oxford University Press 2005.
A knowledge-driven approach to cluster validity assessment
1Department of Computer Science, Trinity College Dublin, Dublin 2, Ireland
2School of Computing and Mathematics, University of Ulster, Jordanstown BT37 0QB, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: This paper presents an approach to assessing cluster validity based on similarity knowledge extracted from the Gene Ontology.
Availability: The program is freely available for non-profit use on request from the authors.
Contact: nadia.bolshakova{at}cs.tcd.ie
Supplementary information: http://www.cs.tcd.ie/Nadia.Bolshakova/GOtool.html
The automated integration of background knowledge is fundamental to supporting the generation and validation of hypotheses about the function of gene products. One such source of prior knowledge is the Gene Ontology (GO), a structured, shared vocabulary that allows the annotation of gene products across different model organisms. The GO comprises three independent hierarchies: molecular function (MF), biological process (BP) and cellular component (CC). Researchers can represent relationships between gene products and annotation terms in these hierarchies. Previous research has applied GO information to detect overrepresented functional annotations in clusters of genes obtained from expression analyses. GO information has also been proposed as a means of assessing gene sequence similarity and expression correlation. For additional information on the GO and its applications, the reader is referred to its website (http://www.geneontology.org) and Wang et al. (2004).
Topological and statistical information extracted from the GO in relation to a set of annotated gene products may be used to measure the similarity between them. Different GO-driven similarity assessment methods may then be implemented to perform clustering or to quantify the quality of the resulting clusters. Cluster validity assessment may rely on data-driven and knowledge-driven methods, both of which aim to estimate the optimal cluster partition from a collection of candidate partitions. Data-driven methods mainly include statistical tests or validity indices applied to the clustered data. Knowledge-driven methods are proposed for enhancing the predictive reliability and biological relevance of the results. A data-driven cluster validity assessment platform was previously reported by Bolshakova and Azuaje (2003).
Traditional GO-based cluster description methods have consisted of statistical analyses of the enrichment of GO terms in a cluster. The application of GO-based similarity to perform clustering and validate clustering outcomes has not been widely investigated. A recent contribution by Speer et al. (2004) presented an algorithm that incorporates GO annotations to cluster genes. They applied the Davies-Bouldin index (Bolshakova and Azuaje, 2003) to estimate the quality of the clusters.
We implemented a knowledge-driven cluster validity assessment system for microarray data clustering. It consists of validity indices that incorporate similarity knowledge originating from the GO (we used only non-IEA annotations from the May 2004 release). A well-known gene expression dataset from the yeast cell cycle (Cho et al., 1998) was analysed to illustrate its application. Several cluster partitions, obtained with the k-means algorithm, were analysed to estimate the optimum number of clusters for this dataset. An information content technique proposed by Resnik (1995) was implemented to measure similarity between gene products based on the GO. Detailed descriptions of this and other GO-based similarity assessment techniques are presented in Wang et al. (2004) and the Supplementary information.
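To make the information content idea concrete, the following is a minimal Python sketch of Resnik-style similarity on a toy ontology. It is not the authors' implementation: the term names, the annotation table and the helper functions are hypothetical, and averaging over all pairs of annotation terms to obtain a gene-product similarity is one common choice assumed here for illustration. The information content of a term is taken as the negative log of the fraction of annotations that fall at that term or below it, and the similarity of two terms is the largest information content among their shared ancestors.

```python
import math
from itertools import product

# Hypothetical toy ontology (not real GO terms): term -> set of parent terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical annotations: gene product -> directly annotated terms.
ANNOTATIONS = {
    "g1": {"dna_binding"},
    "g2": {"rna_binding"},
    "g3": {"kinase"},
    "g4": {"dna_binding", "kinase"},
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(total / count(t)), i.e. -log p(t), where count(t) is the
    number of annotations made to t or to any of its descendants."""
    counts = {term: 0 for term in PARENTS}
    total = 0
    for terms in ANNOTATIONS.values():
        for term in terms:
            total += 1
            for anc in ancestors(term):
                counts[anc] += 1
    return {t: math.log(total / n) for t, n in counts.items() if n > 0}


def term_similarity(t1, t2, ic):
    """Resnik similarity: highest IC among the ancestors shared by t1 and t2."""
    shared = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in shared if t in ic)


def gene_similarity(a, b, ic):
    """Assumed aggregation: mean term similarity over all annotation pairs."""
    pairs = list(product(ANNOTATIONS[a], ANNOTATIONS[b]))
    return sum(term_similarity(t1, t2, ic) for t1, t2 in pairs) / len(pairs)


IC = information_content()
```

In this toy setting, two terms whose only shared ancestor is the root have similarity zero, because the root carries no information. A dissimilarity suitable for a validity index could then be derived from these values, although the exact transformation used in the paper is not stated here.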
This research applies two approaches to calculating cluster validity indices. The first approach processes overall similarity values, which are calculated by taking into account the combined annotations originating from the three GO hierarchies. The second approach is based on the calculation of independent similarity values, which originate from each of these hierarchies. The second approach allows one to estimate the effect of each of the hierarchies on the validation process.
We applied the C-index (Hubert and Schultz, 1976), which is an effective cluster validity estimator for different types of clustering applications. Clustering was performed with the Machaon CVE tool (Bolshakova and Azuaje, 2003). The data comprised 64 genes described by their expression values during the yeast cell cycle (Cho et al., 1998). Previous research has shown that disjoint clusters of genes are significantly expressed in each of the five cell cycle stages: early G1, late G1, S, G2 and M.
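As an illustration of how a C-index can be computed for candidate partitions, the sketch below uses the commonly cited formulation C = (S - S_min) / (S_max - S_min). Here S is the sum of dissimilarities over all within-cluster pairs, and S_min and S_max are the sums of the same number of smallest and largest pairwise dissimilarities overall. The dissimilarity matrix and the candidate partitions are made-up toy values, and the function names are hypothetical. In the knowledge-driven setting the matrix would hold GO-based dissimilarities rather than expression distances. Lower values indicate a better partition.

```python
from itertools import combinations


def c_index(dist, labels):
    """C-index of a partition given a symmetric dissimilarity matrix.

    dist   -- square list of lists, dist[i][j] = dissimilarity of items i and j
    labels -- cluster label of each item, in the same order as dist
    """
    pairs = list(combinations(range(len(labels)), 2))
    all_d = sorted(dist[i][j] for i, j in pairs)
    within = [dist[i][j] for i, j in pairs if labels[i] == labels[j]]
    n_within = len(within)
    if n_within == 0 or n_within == len(all_d):
        raise ValueError("partition must have within- and between-cluster pairs")
    s = sum(within)
    s_min = sum(all_d[:n_within])   # smallest possible within-cluster sum
    s_max = sum(all_d[-n_within:])  # largest possible within-cluster sum
    if s_max == s_min:
        raise ValueError("C-index is undefined when S_max equals S_min")
    return (s - s_min) / (s_max - s_min)


def best_partition(dist, candidates):
    """Return the key of the candidate partition with the lowest C-index."""
    return min(candidates, key=lambda name: c_index(dist, candidates[name]))


# Toy dissimilarities: items 0 and 1 are close, as are items 2 and 3.
DIST = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 1.0],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 1.0, 0.2, 0.0],
]

# Hypothetical candidate partitions of the four items.
CANDIDATES = {
    "compact": [0, 0, 1, 1],
    "mixed": [0, 1, 1, 0],
    "scattered": [0, 1, 0, 1],
}
```

Calling `best_partition(DIST, CANDIDATES)` selects the partition that groups the two close pairs together. The same selection step, applied to partitions with different numbers of clusters, corresponds to choosing the number of clusters at which the index is lowest.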
Figure 1a shows the predictions made by the validity indices at each number of clusters, c, for c = 2 to 6. The bold entries correspond to the optimal values of the indices. The validity indices based on similarity information from the MF, BP and the combined hierarchies indicated that the optimal number of clusters is c = 5, which is consistent with the expected cluster structure (Cho et al., 1998). Only the method based on the CC hierarchy suggested the partition with two clusters as the optimal partition, which confirms that cellular localization information does not adequately reflect relevant functional relationships in this dataset.
The Machaon CVE (Bolshakova and Azuaje, 2003) has been updated to support this technique. It aims to partition samples or genes into groups characterized by similar expression patterns, and to evaluate the quality of the clusters obtained. Figure 1b depicts screenshots from the Machaon CVE. Future research will include the comparison and combination of different data- and knowledge-driven cluster validity indices. This study contributes to the development of techniques for facilitating the statistical and biological validity assessment of data mining results in functional genomics.
| Acknowledgments |
|---|
This research is partly based upon work supported by the Science Foundation Ireland under Grant No. S.F.I.-02IN.1I111.
Received on November 1, 2004; revised on January 24, 2005; accepted on February 8, 2005
| REFERENCES |
|---|
Bolshakova, N. and Azuaje, F. (2003) Machaon CVE: cluster validation for gene expression data. Bioinformatics, 19, 2494-2495.
Cho, R.J., et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 65-73.
Hubert, L. and Schultz, J. (1976) Quadratic assignment as a general data-analysis strategy. Brit. J. Math. Statist. Psychol., 190-241.
Resnik, P. (1995) Using information content to evaluate semantic similarity in a taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448-453.
Speer, N., et al. (2004) A memetic clustering algorithm for the functional partition of genes based on the gene ontology. Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004), San Diego, USA, IEEE Press, pp. 252-259.
Wang, H., et al. (2004) Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, October 7-8, La Jolla, CA, IEEE Press, pp. 25-31.
