Bioinformatics Advance Access originally published online on December 17, 2004
Bioinformatics 2005 21(4):423-429; doi:10.1093/bioinformatics/bti186
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 4 © Oxford University Press 2005; all rights reserved.
Clustering of diverse genomic data using information fusion
Department of Computer Science and Engineering, Pennsylvania State University University Park, PA 16802, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Genome sequencing projects and high-through-put technologies like DNA and Protein arrays have resulted in a very large amount of information-rich data. Microarray experimental data are a valuable, but limited source for inferring gene regulation mechanisms on a genomic scale. Additional information such as promoter sequences of genes/DNA binding motifs, gene ontologies, and location data, when combined with gene expression analysis can increase the statistical significance of the finding. This paper introduces a machine learning approach to information fusion for combining heterogeneous genomic data. The algorithm uses an unsupervised joint learning mechanism that identifies clusters of genes using the combined data.
Results: The correlation between gene expression time-series patterns obtained from different experimental conditions and the presence of several distinct and repeated motifs in their upstream sequences is examined here using publicly available yeast cell-cycle data. The results show that the combined learning approach taken here identifies correlated genes effectively. The algorithm provides an automated clustering method, but allows the user to specify apriori the influence of each data type on the final clustering using probabilities.
Availability: Software code is available by request from the first author.
Contact: jkasturi{at}cse.psu.edu
| INTRODUCTION |
|---|
|
|
|---|
Studies on gene transcription and regulation have been conducted for several years now, but our understanding of the underlying mechanisms for most genes still remains largely incomplete. The availability of a large amount of genomic sequence and expression data has made it possible to conduct large-scale studies to understand the mechanisms underlying gene transcription and regulation. In this respect, the identification of transcription factor binding sites (TFBS) is a crucial step (Babenko et al., 1999; Brazma et al., 1998; Fickett and Wasserman, 2000; Kellis et al., 2003). TFBS are short, conserved sequence elements, usually 525 base pairs in length, typically located upstream of the transcriptional start site of a gene and are often binding sites for certain proteins known as transcription factors (TFs) that regulate a group of genes involved in similar cellular function. Therefore, in order to characterize a promoter region, an essential step is to identify the binding sites or cis-regulatory elements. Several algorithms have been developed to detect statistically significant patterns among sequences (Hughes et al., 2000; Roth et al., 1998).
DNA microarrays are high-throughput experiments whereby it is possible to measure the changes in expression of several thousands of genes simultaneously. Studies combining sequence data with gene expression data are motivated by the idea that genes that have similar shaped expression profiles are likely to be co-regulated and therefore might share similar sequence elements in their promoter regions (Bussemaker et al., 2001; Holmes and Bruno, 2000). These methods can be broadly classified into two basic categories. The first method looks for common sequence elements in the promoter regions of genes that are clustered based on their expression profiles (Chiang et al., 2001; Jakt et al., 2001) while the second method searches for groups of genes that share sequence similarities and then looks for similar expression profiles (Park et al., 2002). In contrast to these methods of making inferences based on a single data source, there have been some recent methods developed that aim to combine both gene expression and upstream sequences (Holmes and Bruno, 2000; Segal et al., 2002; Segal and Koller, 2002). Many gene expression analysis methods (Brazma and Vilo, 2000; Friedman et al., 2000; Murali and Kasif, 2003; Sherlock, 2000) as well as combination data methods are model-based and require assumptions to be made on the data. Such methods are generally computationally intensive since parameters are estimated in an iterative fashion.
The algorithm presented here is a model-free approach to collaborative learning with the use of an unsupervised clustering method such as Self-organizing maps (SOM). In contrast to previous methods, the aim of this algorithm is to allow exploration of the data by making use of as much available data related to a set of genes as possible. Each dataset used is considered a category, like expression data, sequence information in the form of DNA motifs, location information, and even gene ontology information. In this way, extremely diverse, but complementary types of data can be used together to simultaneously discover interesting patterns and possibly new insights in the data, that might not be readily available from analyzing one type of data in isolation. In this paper we use real gene expression data [yeast cell-cycle (Spellman et al., 1998)] to explore and identify associations between patterns of gene expression of genes, and the presence of transcription factor binding sites (possibly multiple occurrences) in the upstream regions of their DNA sequences.
| METHODS |
|---|
|
|
|---|
Let the number of data sets (categories) available for analysis be k. Let N g be the total number of genes being analyzed, corresponding to the rows of the input gene expression matrix. Each of the k categories have dimension N 1, N 2,..., N k respectively. Associated with each category, i is a distance function D i (i=1, 2,..., k).
From symbol to signal
Gene expression data is generally available as intensity ratios, while sequence data requires some abstraction (such as motif extraction) depending on the context and the goal of the algorithm. Here, we aim at finding associations between expression profiles and combinatorial effects of motifs, and hence the input to this algorithm is a motif frequency vector created for each gene in the data set. Given the set of promoter sequences for the genes, motifs are extracted (using any method), and the number of occurrences of each motif in a gene is counted. If a motif is not present in a gene, the corresponding value is zero.
Given a set of promoter sequences of the set of genes, let the number of distinct motifs extracted be M, each labeled as m 1, m 2,..., m M . It is assumed that each gene contains at least one motif. The corresponding motif input vector for a gene g is of the form
where
is number of times motif j occurs in the gene g sequence; and 0 otherwise. (j = 1, 2, ..., M). The matrix of motif counts and memberships is therefore sparse with non-zero values lying only in positions corresponding to the present motifs. For a category r containing expression data from N r experiments, the vector of intensity values is (i 1, i 2, ..., i Nr ).
Weighting scheme
The algorithm combines different data types by allowing a weight to be assigned to each type of data depending on how much influence that data should have on the final clustering results. These weights give prior probabilities for assigning confidence levels to the data types, given by P = p 1, p 2, ..., p k (satisfying properties of a probability distribution) corresponding to the weights for each of the k categories. Explanation of the usage of these weights is in the results section.
Algorithm overview
Here we describe the information fusion algorithm in detail. The method extends the self-organizing map (SOM) learning algorithm to allow for combined learning of different data types. The self-organizing neural network uses an iterative procedure by which the probability distribution of the data is reproduced as closely as possible. At each iteration step, a category is randomly selected based on the weighting scheme P. The chosen category r and its associated distance function d r are used to train the network of neurons and the weights for the entire input tuple (of dimension N 1 + N 2 + ··· + N m ) are updated using the Kohonen learning rule (Kohonen, 1995) although distances are calculated on each segment of the input vector independently using the appropriate distance. Information-theoretic similarity measures such as KullbackLiebler are preferred for clustering of intensity values (Kasturi et al., 2003). To measure the similarity between genes based on their frequency of motif occurrence, we use a measure based on the Extended Jaccard Similarity coefficient. A new probability based method is specified for cluster assignment. The basic fusion algorithm is shown in Figure 1 [refer to Kasturi et al., (2003) for more details on the SOM clustering algorithm].
|
As with any iterative optimization and vector clustering algorithm there are several issues that need to be addressed. The first is the question of the convergence properties of an algorithm when a new distance measure is used for training. This is especially important in the case of combined learning where training is performed using a random choice of category (and hence distance measure) for each epoch. Experimental results show that the algorithm always converges much before the maximum limit of 10 000 epochs (used here), although as expected, there is some perturbation in the SOM weights.
Another aspect to address in such problems is that of normalization between the different data types and the distance measures used. We avoid these problems here since different data types are never actually combined to form a single feature vector, which is then used for clustering. Here, each data type uses a different distance measure appropriate to it.
Distance functions
The Relative entropy or KullbackLeibler divergence measures the inefficiency of assuming that the distribution of a random variable X is q when the true distribution is p, given by
![]() |
To provide a similarity measure between genes based on the frequency of motif occurrence, we define the distance function SDist based on the Extended Jaccard Similarity Coefficient. SDist is given by
![]() |
- 0
SDist(x,y)
1;
- SDist(x,x) = 0;
- SDist(x,y) = SDist(y,x).
Further, if both x and y are identically zero in all coordinates, then SDist(x,y) = 0.
The distance uses the values contained in the nonzero positions in the sparse vectors rather than just noting the number of non-zero positions common to both vectors whose distance is being calculated. This measure however does not depend on the total number of motifs in the data set. This is an important aspect when considering a distance measure for motif frequency since in general a single gene is not expected to contain more than 3 motifs (especially in yeast).
Cluster assignment algorithm
Once training is complete, each data point has to be assigned to one of the N different clusters based on some criterion. We introduce a new method of assigning cluster memberships to data points based on the prior information for each category using the probability distribution P, by assigning a data point to the maximal probability cluster. For gene g and N cluster centroids, first find the closest cluster centroid for each of the k categories (using the corresponding coordinates and distance function). Let the winning cluster using category r j be l j , (j = 1,2,...,k). Let
denote the indicator function for category r, with a value 1 for cluster l and zero for all others. Then, the probability that gene g lies in cluster l (l = 1, 2,..., N) is given by
![]() |
| EXPERIMENTAL DATA |
|---|
|
|
|---|
The yeast genome has now been sequenced and there is a vast amount of experimental data available for analyses. The data used to illustrate our algorithm was obtained from the following sources. (1) The gene expression data set used was an aggregation of data from several experiments on budding yeast taken as a reduced dataset from (Eisen et al., 1998; Spellman et al., 1998). We used the data related to time courses during mitotic cell division cycle after synchronization by alpha factor arrest (18 time points), as well as the Elutriation (14 time points). (2) Known transcription factors (motifs) were searched in TRANSFAC for the genes in dataset. 15 motifs were used to make up the motif frequency input data. The frequency of occurrence of a motif in each of the genes was extracted and a final dataset was created to contain 99 genes (discarding replicated genes and genes with missing values) and 47 columns.
| EXPERIMENTAL SETUP AND RESULTS |
|---|
|
|
|---|
To illustrate the success of the combined learning algorithm, clusters were obtained for different choices of input data. By the use of the weighting scheme introduced earlier, clusters could be obtained either using only a single data set or for all data to be equally involved during learning of the clusters. Results are presented for each of these learning combinations. The basic experimental setup was as follows: (1) Cluster the data using only motif frequencies (weight for motif data = 1); (2) Cluster the data using only expression data (weight for motif data = 0, expression data = 1); and (3) Cluster using both motif and expression data (weight for both = 0.5).
Based on the data used here, it was observed that most (89 genes) of the genes contained only one of the 15 motifs, while 9 genes contained 2 motifs and only 1 gene contained 3 motifs. Figure 2C) shows the histogram of occurrence of each of the motifs in the data set. It was observed that several genes were similar in that they contained the same motifs (see Figure 2A) and hence clustering based on motifs might reveal some interesting insights into the data. It was also of interest to see whether genes with higher number of distinct motifs present in their upstream regions have any correlation with being physically close to one another or the same chromosome (shown in Figure 2B).
|
The first set of results was obtained by looking for correlated patterns in the alpha factor gene expression dataset along with the motif frequency data. The algorithm was first run to find motif clusters alone and hence using only the motif data to perform clustering. The results are shown in Figure 3. There were approximately 17 clusters for motifs and the results are presented. The algorithm was run several times and either 16 or 17 clusters seemed to be significant every time. This shows that the algorithm is robust and results between different runs do not vary a great deal. The clusters are shown by line plots as well as red/green color plots (for log-transformed data). Viewing the results in this manner helps correlate the clusters obtained and also that the algorithm gives consistent clusters in the log-space as well. Whatever choice of data was used for clustering, genes in Cluster 12 were always together as can be seen in the other figures as well. The patterns for Cluster 12 were consistently correlated with all expression and motifs for all of the results presented here. Figure 3C shows that the error does not oscillate much when only one data type is used for clustering as would be expected while the use of more than one data type causes the weights to oscillate a lot during initial training and then steadily lower after that.
|
The next step was to evaluate the results obtained when both data types were used equally for clustering. Here again the data used was the alpha factor gene expression data. Figure 4 shows 19 clusters obtained when clustering using alpha factor data. It can be seen that the gene expression clusters are well categorized, while the corresponding motif data shows much less correlation, although for example, it can be seen that genes in cluster 1 do share a common motif.
|
Figure 5 shows clusters that are now correlated between the two data sets. Genes in the clusters were examined more closely. For example, genes that fell into cluster 8 were analyzed. The MIPS database was used to obtain gene functions for these genes shown in the table. These results were also compared with the original paper that examined the data from (Spellman et al., 1998) and it was found that 4 of the 5 genes here. Namely, CTF4, POL30, HYS2 and POL32 were mentioned in the original paper to be DNA Syn related genes. Therefore combined clustering now shows that genes with similar function might possibly share a common expression pattern under a certain experimental condition and might share a common motif.
|
To continue, the same experiments were carried out by now adding Elutriation (Elu) data also to the clustering. Figure 6A shows the 16 clusters obtained when using only the motif data, Figure 6B shows 20 clusters obtained when using only the Elu data. It can be seen that the similarity among patterns in the Elu data is obviously more than when compared to the Alpha factor dataset. Although that is the case, 5 or 6 clusters seem to be significantly similar for genes between the data sets. It is also interesting to note that correlation between motifs and Elu data is more significant when compared to Alpha data. Figure 6C shows the results when combining both Alpha and Elu data along with motif frequency data, all given equal weights (uniform distribution).
|
Hence through the choice of the weight (or apriori probability on the data) parameter, different aspects of information in the data may be captured. Data integration is then truly possible and combined analysis proves to be extremely useful in further understanding and discovering new insights into the data.
| SUMMARY AND CONCLUSIONS |
|---|
|
|
|---|
The algorithm described here provides an exploratory method for correlating diverse data sources, and is particularly useful for genomics applications. This method is also able to combine data sources by using the weighting scheme. Specifically here, we looked for similarities in gene expression patterns between yeast genes by combining multiple gene expression data sets along with presence of known transcription factors in these genes and their frequencies of occurrence.
The combined clustering approach is able to find better clusters of genes than those obtained when using just a single data source. This algorithm is thus useful for understanding gene regulatory modules by examining several gene expression experiments for a related set of genes.
| Acknowledgments |
|---|
The research was partially supported by grants to Professor Acharya from NSF and PCABC (The Pennsylvania Cancer Alliance Bioinformatics Consortium).
| Footnotes |
|---|
Kasturi and Acharya (2004) Clustering of diverse genomic data using information fusion. Symposium on Applied Computing, Proceedings of the 2004 ACM symposium on Applied computing, 116120; http://doi.acm.org/10.1145/967900.967926
Copyright 2004 Association for Computing Machinery, Inc. Reprinted by permission. Direct permission requests to permissions{at}acm.org
Received on May 21, 2004; accepted on November 11, 2004
| REFERENCES |
|---|
|
|
|---|
Babenko, V.N., Kosarev, P.S., Vishnevsky, O.V., Levitsky, V.G., Basin, V.V., Frolov, A.S. (1999) Investigating extended regulatory regions of genomic DNA sequences. Bioinformatics, 15, 644653
Brazma, A. and Vilo, J. (2000) Gene expression data analysis. FEBS Lett., 480, 1724[CrossRef][ISI][Medline].
Brazma, A., Jonassen, I., Vilo, J., Ukkonen, E. (1998) Predicting gene regulatory elements in silico on a genomic scale. Genome Res., 8, 12021215
Bussemaker, H., Li, H., Siggia, E.D. (2001) Regulatory element using correlation with expression. Nat. Genet., 27, 167174[CrossRef][ISI][Medline].
Chiang, D.Y., Brown, P.O., Eisen, M.B. (2001) Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics, 17, (Suppl. 1), S49S55[Abstract].
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 1486314868
Fickett, J.W. and Wasserman, W.W. (2000) Discovery and modeling of transcriptional regulatory regions. Curr. Opin. Biotechnol., 11, 1924[CrossRef][ISI][Medline].
Friedman, N., Linial, M., Nachman, I. (2000) Using Bayesian networks to analyze gene expression data. J. Comput. Biol., 7, 601620[CrossRef][ISI][Medline].
Holmes, I. and Bruno, W.J. (2000) Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 202210[Medline].
Hughes, J.D., Estep, P.W., Tavazoie, S., Church, G.M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae . J. Mol. Biol., 296, 12051214[CrossRef][ISI][Medline].
Jakt, L.M., Cao, L., Cheah, K.S., Smith, D.K. (2001) Assessing clusters and motifs from gene expression data. Genome Res., 11, 112123
Kasturi, J., Acharya, R., Ramanathan, M. (2003) An information theoretic approach for analyzing temporal patterns of gene expression. Bioinformatics, 19, 449458
Kellis, M., Patterson, N., Endirizzi, M., Birren, B., Lander, E.S. (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423, 241254[CrossRef][Medline].
Kohonen, T. Self-Organizing Maps, (1995) , Berlin Springer Series in Information Sciences, Springer.
Murali, T.M. and Kasif, S. (2003) Extracting conserved gene expression motifs from gene expression data. Pac. Symp. Biocomput., 8, , pp. 7788.
Park, P.J., Butte, A.J., Kohane, I.S. (2002) Comparing expression profiles of genes with similar promoter regions. Bioinformatics, 18, 15761584
Roth, F.R., Hughes, J.D., Estep, P.E., Church, G.M. (1998) Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939945[CrossRef][ISI][Medline].
Segal, E., Barash, Y., Simon, I., Friedman, N., Koller, D. (2002) From promoter sequence to expression: a probabilistic framework. Proceedings of the Sixth Annual International Conference on Computational Biology (RECOMB 2002), , New York, NY ACM Press, pp. 263272.
Segal, E. and Koller, D. (2002) Probabilistic Hierarchical clustering for biological data. Proceedings of the Sixth Annual International Conference on Computational Biology (RECOMB 2002), , New York, NY ACM Press, pp. 273280.
Sherlock, G. (2000) Analysis of large-scale gene expression data. Curr. Opin. Immunol., 12, 201205[CrossRef][ISI][Medline].
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 32733297
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








