Bioinformatics Advance Access originally published online on January 24, 2006
Bioinformatics 2006 22(7):795-801; doi:10.1093/bioinformatics/btl011
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Incorporating gene functions as priors in model-based clustering of microarray gene expression data

Wei Pan

Division of Biostatistics, MMC 303, School of Public Health, University of Minnesota, Minneapolis, MN 55455-0392, USA


    ABSTRACT

Motivation: Cluster analysis of gene expression profiles has been widely applied to clustering genes for gene function discovery. Many approaches have been proposed. The rationale is that the genes with the same biological function or involved in the same biological process are more likely to co-express, hence they are more likely to form a cluster with similar gene expression patterns. However, most existing methods, including model-based clustering, ignore known gene functions in clustering.

Results: To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions as prior probabilities in model-based clustering. In contrast to the global mixture model applied to all the genes in standard model-based clustering, we use a stratified mixture model: one stratum corresponds to the genes of unknown function, while each of the other strata corresponds to the genes sharing the same biological function or pathway; the genes from the same stratum are assumed to have the same prior probability of coming from a cluster, while those from different strata are allowed to have different prior probabilities of coming from the same cluster. We derive a simple EM algorithm that can be used to fit the stratified model. A simulation study and an application to gene function prediction demonstrate the advantage of our proposal over the standard method.

Contact: weip{at}biostat.umn.edu


    1 INTRODUCTION
This article concerns clustering genes for gene function discovery using microarray gene expression data. It has been widely observed that genes with a similar function or involved in the same biological process are likely to co-express; hence, clustering genes’ expression profiles provides a means for gene function discovery; see, e.g. Eisen et al. (1998), Brown et al. (2000), Wu et al. (2002), Xiao and Pan (2005) and references therein. However, most existing approaches ignore known functions of some genes in the process of clustering; a few exceptions in the context of non-model-based clustering include Hanisch et al. (2002), Cheng et al. (2004), Fang et al. (2006) and Huang and Pan (2006). For example, in model-based clustering, all the genes are treated equally a priori; in particular, all the genes are assumed to have an equal prior probability of being in a given cluster (e.g. Li and Hong, 2001; Ghosh and Chinnaiyan, 2002; Pan et al., 2002). As mentioned, if some genes are known to share the same function, it is more likely that they belong to the same cluster. Hence, it seems more plausible to model the genes sharing the same biological function as having an equal prior probability while allowing the genes with different functions to have varying prior probabilities. This provides a more efficient way to account for the association between gene function and co-expression. In this paper, we propose such an approach that uses gene functional annotations as priors for model-based clustering. Specifically, first, the genome is partitioned into several groups, with one group containing the genes of unknown function and each of the other groups containing the genes sharing the same function. Gene functional annotations are readily available from many existing databases, such as the Gene Ontology (GO) (Ashburner et al., 2000) and MIPS (Mewes et al., 2004).
Second, each group is treated as a stratum and a stratified mixture model is used: the genes from the same group are assumed to have the same prior probability of coming from the same cluster, while the prior probabilities for different groups are allowed to be unequal. Because of possible heterogeneity in each gene functional group, we do not assume that the genes from the same functional group come from the same cluster. In fact, the genes in the group of unknown function may come from any cluster. With relatively high noise levels of genomic data, it is recognized that incorporating biological knowledge into statistical analysis is a reliable way to maximize statistical efficiency and enhance the interpretability of the analysis results.

This article is organized as follows. In Section 2, we first briefly review the standard method of model-based clustering with a global mixture model, then propose our stratified mixture model and associated stratified clustering. We derive a simple EM algorithm to fit our stratified model. In Section 3, we demonstrate the advantage of our proposal using simulated data, and then using real data for gene function prediction. We end the paper with a short discussion.


    2 METHODS
2.1 Standard model-based clustering
In model-based clustering, it is assumed that each observation $x$, a $p$-dimensional vector, is drawn from a finite mixture distribution

$$f(x; \Theta) = \sum_{i=1}^{g} \pi_i f_i(x; \theta_i) \qquad (1)$$

with the prior probability $\pi_i$, component-specific distribution $f_i$ and its parameters $\theta_i$. We use $\Theta = \{(\pi_i, \theta_i) : i = 1, \ldots, g\}$ to denote all unknown parameters, with the restriction that $0 \le \pi_i \le 1$ for any $i$ and that $\sum_{i=1}^{g} \pi_i = 1$. Each component of the mixture distribution corresponds to a cluster. The number of clusters, $g$, has to be determined in practice; see Section 2.3.

The finite normal mixture model is most widely used: each component $f_i$ is a normal distribution; we use this model throughout this article. Because we are particularly interested in applications with ‘large p, small n’ (i.e. high-dimensional data with small sample sizes) encountered in genomic studies, we adopt a working independence model for the components of $x$, as in the naive Bayes classifier, though other more sophisticated methods may be preferred (e.g. McLachlan et al., 2003). Specifically, we have

$$f_i(x; \theta_i) = \prod_{q=1}^{p} \frac{1}{\sqrt{2\pi\sigma_q^2}} \exp\left\{-\frac{(x_q - \mu_{iq})^2}{2\sigma_q^2}\right\}$$

where $x = (x_1, \ldots, x_p)'$ and $\theta_i = (\mu_{i1}, \ldots, \mu_{ip}, \sigma_1^2, \ldots, \sigma_p^2)'$.

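Given a dataset $x_j$ for $j = 1, \ldots, n$, the EM algorithm (Dempster et al., 1977) is used to estimate the parameters $\Theta$ iteratively in the standard model-based clustering (McLachlan and Peel, 2002; Fraley and Raftery, 2002). We use the generic notation $\hat{\Theta}^{(m)}$ to represent the parameter estimates at iteration $m$; the EM then works by iterating the following:

$$\hat{\pi}_i^{(m+1)} = \frac{1}{n}\sum_{j=1}^{n} \tau_{ij}^{(m)}, \qquad \hat{\mu}_i^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(m)} x_j}{\sum_{j=1}^{n} \tau_{ij}^{(m)}} \qquad (2)$$

$$\hat{\sigma}_q^{2,(m+1)} = \frac{1}{n}\sum_{i=1}^{g}\sum_{j=1}^{n} \tau_{ij}^{(m)} \left(x_{jq} - \hat{\mu}_{iq}^{(m+1)}\right)^2 \qquad (3)$$

where

$$\tau_{ij}^{(m)} = \frac{\hat{\pi}_i^{(m)} f_i(x_j; \hat{\theta}_i^{(m)})}{f(x_j; \hat{\Theta}^{(m)})} \qquad (4)$$

is the estimated posterior probability of $x_j$ coming from component $i$.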

The above iteration is repeated until convergence, resulting in the maximum likelihood estimate (MLE) $\hat{\Theta}$. Then we use (4) to calculate the posterior probability of any observation $x_j$ belonging to any cluster $i$, and assign the observation to the cluster with the largest such probability. Because multiple local maxima may exist, we need to start the algorithm multiple times with various starting values; in this paper, we use the result from K-means as starting values for the EM. We fit a series of models with various values of $g$, then use a model selection criterion to choose its appropriate value, as discussed in Section 2.3.
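The iteration described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `e_step` and `fit_standard_mixture`, the stopping rule based on the log-likelihood increase, and the small numerical floor are all choices made here for illustration, and the variances are taken to be shared across clusters, as the parameter count in Section 2.3 suggests. Starting means are supplied by the caller (for example, from a K-means run).

```python
import numpy as np


def e_step(X, pi, mu, sigma2):
    """Return posterior probabilities tau (n x g) and the log-likelihood."""
    diff = X[:, None, :] - mu[None, :, :]                      # shape (n, g, p)
    # log f_i(x_j): independent normals, cluster-specific means, shared variances
    log_f = -0.5 * np.sum(diff**2 / sigma2 + np.log(2 * np.pi * sigma2), axis=2)
    log_w = np.log(pi) + log_f                                 # log(pi_i * f_i(x_j))
    log_mix = np.logaddexp.reduce(log_w, axis=1)               # log f(x_j)
    return np.exp(log_w - log_mix[:, None]), log_mix.sum()


def fit_standard_mixture(X, mu_init, max_iter=500, tol=1e-8):
    """EM for the global normal mixture; mu_init is a (g, p) array of starting means."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mu = np.array(mu_init, dtype=float)
    g = mu.shape[0]
    pi = np.full(g, 1.0 / g)
    sigma2 = X.var(axis=0)
    prev = -np.inf
    for _ in range(max_iter):
        tau, loglik = e_step(X, pi, mu, sigma2)
        if loglik - prev < tol:                                # assumed stopping rule
            break
        prev = loglik
        weight = np.maximum(tau.sum(axis=0), 1e-12)            # floor avoids division by zero
        pi = weight / n                                        # prior probability update
        mu = (tau.T @ X) / weight[:, None]                     # mean update
        diff = X[:, None, :] - mu[None, :, :]
        sigma2 = np.einsum("ng,ngp->p", tau, diff**2) / n      # shared variance update
    tau, loglik = e_step(X, pi, mu, sigma2)
    return pi, mu, sigma2, tau, loglik
```

Cluster assignments then follow from `tau.argmax(axis=1)`, and the returned log-likelihood can be passed to a model selection criterion.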

2.2 Stratified model-based clustering
In Section 2.1, all the genes are treated equally a priori. To take advantage of the known gene functions, we propose first partitioning the genome into several groups, say, $G_1, \ldots, G_K$. $G_K$ contains the genes of unknown function while each of the other groups contains the genes sharing the same biological function. Second, rather than a global model (1), we propose using a stratified model: for any gene $j$ in functional group $G_k$,

$$f_{(k)}(x_j; \Theta_{(k)}) = \sum_{i=1}^{g} \pi_{(k),i} f_i(x_j; \theta_i) \qquad (5)$$

for $k = 1, \ldots, K$. Note that the $K$ stratified models differ in using stratum-specific prior probabilities while sharing the same set of component distributions. Hence, we assume that genes from the same group $G_k$ share the same prior probability $\pi_{(k),i}$ of coming from the same cluster $i$ while allowing them to come from different clusters. It is easy to see that the above stratified model reduces to the standard model with $K = 1$.

In practice, $G_k$ can be determined based on the GO (Ashburner et al., 2000), MIPS (Mewes et al., 2004) or other sources of biological knowledge; the choice of the database depends on the application at hand. For example, if the goal is to predict gene function and both the data quality and the terminology of GO are preferred, GO can be used. Because some genes may have multiple functions, we allow a gene to be in multiple groups. A simple method considered here is the following: if a gene has two functions corresponding to groups $G_1$ and $G_2$, we duplicate the gene expression profile for the gene and treat the two observations as two genes, one in $G_1$ and one in $G_2$, in the process of clustering; see the final section for more discussion of this issue.
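The duplication of profiles for genes with multiple functions can be illustrated with a short sketch. The function name `expand_by_group` and the input layout (one list of group indices per gene, with groups coded from 0) are assumptions made for this example only.

```python
import numpy as np


def expand_by_group(X, gene_groups):
    """Duplicate each gene's profile once per functional group it belongs to.

    X           : (n_genes, p) expression matrix
    gene_groups : list of lists; gene_groups[j] holds the group indices of gene j
    Returns the expanded matrix, the stratum label of each row, and the
    index of the original gene behind each row.
    """
    rows, strata = [], []
    for j, groups in enumerate(gene_groups):
        for k in groups:
            rows.append(j)       # repeat gene j for every group it is annotated to
            strata.append(k)
    rows = np.array(rows)
    return np.asarray(X)[rows], np.array(strata), rows
```

Keeping the original gene index makes it possible to trace duplicated rows back to one gene after clustering.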

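Next we derive the EM algorithm for the above model-based clustering. Given data $X = \{x_j : j = 1, \ldots, n\}$, the log-likelihood is

$$\log L(\Theta) = \sum_{j=1}^{n} \log\left[\sum_{i=1}^{g} \pi_{(k),i} f_i(x_j; \theta_i)\right]$$

where for simplicity we suppress the dependence of group index $k = k(j)$ on gene $j$, and we use this convention throughout. Maximization of the above log-likelihood is difficult, and it is common to use the EM algorithm (Dempster et al., 1977) by casting the problem in the framework of missing data. Define $z_{ij}$ as the indicator of whether $x_j$ is from component $i$; i.e. $z_{ij} = 1$ if $x_j$ is indeed from component $i$ and $z_{ij} = 0$ otherwise. If we could observe the missing data $z_{ij}$, then the complete data log-likelihood would be

$$\log L_c(\Theta) = \sum_{i=1}^{g}\sum_{j=1}^{n} z_{ij}\left[\log \pi_{(k),i} + \log f_i(x_j; \theta_i)\right]$$

It is easy to verify that the E-step of the EM yields

$$Q(\Theta; \hat{\Theta}^{(m)}) = \sum_{i=1}^{g}\sum_{j=1}^{n} \tau_{ij}^{(m)}\left[\log \pi_{(k),i} + \log f_i(x_j; \theta_i)\right]$$

where for gene $j \in G_k$,

$$\tau_{ij}^{(m)} = \frac{\hat{\pi}_{(k),i}^{(m)} f_i(x_j; \hat{\theta}_i^{(m)})}{f_{(k)}(x_j; \hat{\Theta}_{(k)}^{(m)})} \qquad (6)$$

is the estimated posterior probability of $x_j$ coming from component $i$.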

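The M-step of the EM maximizes the above $Q$ to update the parameter estimates. For a stratified model, the updates for the mean and variance parameters take the same form as in Section 2.1, with $\tau_{ij}^{(m)}$ now given by (6), but the update for the prior probability is different:

$$\hat{\pi}_{(k),i}^{(m+1)} = \frac{1}{n_k}\sum_{j \in G_k} \tau_{ij}^{(m)} \qquad (7)$$

where $n_k$ is the number of the genes in $G_k$.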
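As a brief check of the stratum-specific update for the prior probabilities, the constrained maximization behind it can be sketched as follows. The Lagrange multiplier $\lambda_k$ is notation introduced only for this sketch; the other symbols follow the text.

```latex
% Terms of Q that involve the priors of stratum k, with the constraint
% \sum_i \pi_{(k),i} = 1 enforced by a Lagrange multiplier \lambda_k:
\mathcal{L}_k = \sum_{j \in G_k} \sum_{i=1}^{g} \tau_{ij}^{(m)} \log \pi_{(k),i}
              + \lambda_k \Bigl( 1 - \sum_{i=1}^{g} \pi_{(k),i} \Bigr)

% Setting the derivative with respect to \pi_{(k),i} to zero:
\frac{\partial \mathcal{L}_k}{\partial \pi_{(k),i}}
  = \frac{\sum_{j \in G_k} \tau_{ij}^{(m)}}{\pi_{(k),i}} - \lambda_k = 0
  \quad \Longrightarrow \quad
  \pi_{(k),i} = \frac{1}{\lambda_k} \sum_{j \in G_k} \tau_{ij}^{(m)}

% Summing over i and using \sum_i \tau_{ij}^{(m)} = 1 for every gene j:
\lambda_k = \sum_{j \in G_k} \sum_{i=1}^{g} \tau_{ij}^{(m)} = n_k
  \quad \Longrightarrow \quad
  \hat{\pi}_{(k),i}^{(m+1)} = \frac{1}{n_k} \sum_{j \in G_k} \tau_{ij}^{(m)}
```

The mean and variance terms of $Q$ do not involve the priors, which is why their updates keep the same form as in the global model.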

As before, the above updates are iterated until convergence with resulting MLE $\hat{\Theta}$. We fit a series of models with various values of $g$, then use a model selection criterion to choose an appropriate value.

We use (6) to calculate the posterior probability of any observation $x_j$ belonging to each cluster, and assign the observation to the cluster with the largest such probability.
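A minimal Python sketch of the stratified EM is given below. It is an illustration under stated assumptions rather than the author's code: the function names, the log-likelihood stopping rule, the numerical floors, and the coding of strata as integers 0, ..., K-1 (each stratum non-empty) are choices made here. The only change from a global-mixture EM is that each gene uses the prior probabilities of its own stratum and that these priors are averaged within strata.

```python
import numpy as np


def posterior(X, strata, pi, mu, sigma2):
    """Posterior probabilities using stratum-specific priors, and the log-likelihood."""
    diff = X[:, None, :] - mu[None, :, :]                      # shape (n, g, p)
    log_f = -0.5 * np.sum(diff**2 / sigma2 + np.log(2 * np.pi * sigma2), axis=2)
    log_w = np.log(pi[strata]) + log_f                         # row j uses priors of stratum k(j)
    log_mix = np.logaddexp.reduce(log_w, axis=1)
    return np.exp(log_w - log_mix[:, None]), log_mix.sum()


def fit_stratified_mixture(X, strata, mu_init, max_iter=500, tol=1e-8):
    """EM for the stratified mixture; strata[j] in {0, ..., K-1} is the group of gene j."""
    X = np.asarray(X, dtype=float)
    strata = np.asarray(strata)
    n = X.shape[0]
    mu = np.array(mu_init, dtype=float)
    g, K = mu.shape[0], int(strata.max()) + 1
    pi = np.full((K, g), 1.0 / g)                              # one prior vector per stratum
    sigma2 = X.var(axis=0)
    prev = -np.inf
    for _ in range(max_iter):
        tau, loglik = posterior(X, strata, pi, mu, sigma2)
        if loglik - prev < tol:                                # assumed stopping rule
            break
        prev = loglik
        for k in range(K):                                     # prior update within each stratum
            pi[k] = np.maximum(tau[strata == k].mean(axis=0), 1e-12)
        weight = np.maximum(tau.sum(axis=0), 1e-12)
        mu = (tau.T @ X) / weight[:, None]                     # same form as the global model
        diff = X[:, None, :] - mu[None, :, :]
        sigma2 = np.einsum("ng,ngp->p", tau, diff**2) / n
    tau, loglik = posterior(X, strata, pi, mu, sigma2)
    return pi, mu, sigma2, tau, loglik
```

Passing a `strata` vector of all zeros gives K = 1, in which case the sketch behaves like the global mixture model.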

2.3 Model selection
In practice, we have to determine the number of components, $g$. This is realized by first fitting a series of models with various numbers of components, and then using a model selection criterion to choose the best one. In model-based clustering, it is common to use the Bayesian information criterion (BIC) (Schwarz, 1978), defined as

$$\mathrm{BIC} = -2 \log L(\hat{\Theta}) + d \log(n)$$

where $d$ is the total number of (effective) parameters (Fraley and Raftery, 1998); the model with the smallest BIC is selected. In the standard model, because we have three sets of parameters, the $\pi_i$'s, $\sigma_q^2$'s and $\mu_{iq}$'s, and we have the constraint $\sum_{i=1}^{g} \pi_i = 1$, we have $d = p + gp + g - 1$; in the stratified model with $K$ strata, instead of a set of $\pi_i$'s, we have $K$ sets of $\pi_{(k),i}$'s, thus $d = p + gp + K(g - 1)$.
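The parameter counts and the criterion can be written out as a small Python sketch. The function names `n_params` and `bic` are hypothetical, and the form of BIC used here (smaller is better) is assumed from the statement elsewhere in the article that the selected number of clusters minimized BIC.

```python
import math


def n_params(p, g, K=1):
    """Effective number of parameters: p variances, g*p means, K*(g-1) free priors."""
    return p + g * p + K * (g - 1)


def bic(loglik, p, g, n, K=1):
    """BIC = -2 log L + d log(n); K = 1 corresponds to the standard (global) model."""
    return -2.0 * loglik + n_params(p, g, K) * math.log(n)
```

For example, with p = 300 and g = 8, the standard model has 2707 parameters and a stratified model with K = 3 strata has 2721, so the stratified model pays a penalty for only 14 additional parameters.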


    3 RESULTS
3.1 Simulated data
We did a simulation study to demonstrate the improved performance of our proposal. There were two univariate clusters with distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively. We had two gene functional groups with $n_1$ and $n_2$ genes, respectively. For a gene in functional group $k$ for $k = 1$ or 2, its prior probability of being in cluster 1 was $\pi_{(k),1}$. Four simulation set-ups were used with various parameter values: three sets of $\pi_{(k),1}$'s were used; $n_1 = n_2 = 50$ in the first three set-ups while $n_1 = 25$ and $n_2 = 75$ in set-up 4; set-ups 2 and 4 were the same except for different $n_1$ and $n_2$.

The results were based on 1000 simulations for each set-up. We considered both the mean and variance of a parameter estimate over 1000 simulations; we also gave the average numbers of observations in a simulated dataset that were incorrectly assigned to clusters different from their true ones.
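One dataset from a simulation of this kind could be generated as sketched below. The default means, standard deviations and the prior probabilities used in the test are placeholders chosen for illustration, not the values behind Table 1; the function names are likewise hypothetical. Counting misassigned observations up to a swap of the two cluster labels is also an assumption of this sketch.

```python
import numpy as np


def simulate(n1, n2, pi11, pi21, mu=(0.0, 2.0), sigma=(1.0, 1.0), seed=0):
    """Draw n1 + n2 univariate observations from two clusters and two functional groups.

    pi11 and pi21 are the prior probabilities of cluster 1 in groups 1 and 2.
    Clusters and groups are coded 0 and 1 in the returned arrays.
    """
    rng = np.random.default_rng(seed)
    strata = np.repeat([0, 1], [n1, n2])
    p_first = np.where(strata == 0, pi11, pi21)
    cluster = np.where(rng.random(n1 + n2) < p_first, 0, 1)   # true cluster of each gene
    x = rng.normal(np.take(mu, cluster), np.take(sigma, cluster))
    return x, strata, cluster


def count_misassigned(true_cluster, assigned):
    """Number of wrongly assigned observations, allowing the two labels to be swapped."""
    wrong = int(np.sum(np.asarray(true_cluster) != np.asarray(assigned)))
    return min(wrong, len(true_cluster) - wrong)
```

Repeating `simulate` over many seeds and averaging `count_misassigned` gives a summary comparable in spirit to the average number of incorrectly assigned observations described above.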

In the first simulation set-up, we had $\pi_{(1),1} = \pi_{(2),1}$; hence the standard mixture model was correct, and the standard method was expected to perform better. Although the stratified mixture model was still correct, it unnecessarily required estimating two separate parameters, $\pi_{(1),1}$ and $\pi_{(2),1}$. It was confirmed that the standard method worked better; more interestingly, however, the performance of our proposed method was quite close to it: the estimates of the mean and variance parameters from the stratified method were almost the same as those from the standard method, but, as expected, the estimate of the prior probability from the former had slightly larger variability.

In set-ups 2–4, the stratified mixture model was correct with $\pi_{(1),1} \ne \pi_{(2),1}$. Hence, with the knowledge of the gene functional groups, it was better to use the stratified model. On the other hand, ignoring the gene functional groups, the global mixture model still held with a corresponding $\pi_1 = 0.5$ or $\pi_1 = 0.35$. Unsurprisingly, the standard method gave estimates with larger estimation errors, e.g. in terms of the mean squared error, which is the sum of the squared bias and the variance; this was especially evident for the mean and variance parameters of the second cluster, both with much larger variances than those from the stratified model. In consequence, the standard method resulted in much larger numbers of misclassified genes. Comparing set-ups 2 and 3, it was noted that the performance difference between the two methods increased with $|\pi_{(1),1} - \pi_{(2),1}|$. With different numbers of genes in the two functional groups (set-up 4), the stratified method still maintained better performance (Table 1).


Table 1 Simulation results for the standard method and the new method

 
3.2 Gene function prediction using gene expression profiles
With the completion of the human genome and other sequencing projects, it has become compelling to learn functions of many newly discovered genes. An important approach is to cluster gene expression profiles drawn from microarray experiments under various conditions; see, e.g. Eisen et al. (1998), Brown et al. (2000), Wu et al. (2001), Zhou et al. (2002), Xiao and Pan (2005) and references therein. The rationale is that genes sharing the same biological function are likely to co-express.

We considered clustering gene expression data to predict gene functions for the yeast Saccharomyces cerevisiae. We used a large dataset containing 300 microarray experiments with gene deletions and drug treatments (Hughes et al., 2000). The data were centered and scaled so that the mean and variance of the expression profile for each gene were 0 and 1, respectively. Gene functions were downloaded from the MIPS database (Mewes et al., 2004). For illustration, we only considered two gene functions, mitotic cell cycle and cell cycle control (with MIPS code 030301) and mitochondrion (with MIPS code 4016), shortened as classes 1 and 2, respectively. There were three strata: the first two strata each contained 200 genes randomly selected from one of the two classes, while the third stratum was a mixture of 119 and 112 genes from the two classes, respectively. Hence, the genes in the first two strata were treated as those with known functions, whereas the third stratum consisted of the genes whose functions were to be predicted.
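The centering and scaling of each gene's expression profile can be sketched as follows. The function name `standardize_rows` is hypothetical, and the use of the sample variance (`ddof=1`) is an assumption, since the text does not say which variance estimate was used.

```python
import numpy as np


def standardize_rows(X, ddof=1):
    """Center and scale each row (gene) to mean 0 and variance 1.

    X is a (genes x experiments) matrix; ddof=1 uses the sample variance.
    Rows with zero variance would produce a division by zero and should be
    removed beforehand.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=1, keepdims=True)
    return centered / centered.std(axis=1, ddof=ddof, keepdims=True)
```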

We did not use the class labels of the genes in the process of clustering. Based on the clustering results, we predicted the functions of the genes in the third stratum. There were two ways to accomplish prediction, called hard classification and soft classification. In hard classification, each gene was assigned to a cluster, and a cluster was classified to the class to which the majority of the genes from the first two strata that were assigned to the cluster belonged; if there was an equal number of the genes from each of the first two strata, we randomly assigned a class label to the cluster. Any gene from stratum 3 that was assigned to a cluster was predicted to be in the same class as that of the cluster. When there were an equal or almost equal number of the genes from the first two strata, there seemed to be a certain degree of randomness in the hard classification. Furthermore, it did not take advantage of the soft-clustering feature of model-based clustering: even if two genes were both assigned to the same cluster, they might have quite different posterior probabilities of being in the cluster. Hence, as an alternative, we used soft classification: first, for each cluster $i$, we calculated the proportion of the genes in class $c$, $p_{i,c}$, for $c = 1$ and 2; second, for any gene $j$ in stratum 3, the expectation of its being in class $c$ is $\sum_{i=1}^{g} \tau_{ij} p_{i,c}$. Summing over all the genes from class $c$ (or the other class), we obtained an expected number of the genes predicted to be in class $c$. To illustrate the effect of various prior probabilities, we also included results using the prior probabilities of the first two strata to calculate $\tau_{ij}$ for each $j \in G_3$; in a table (e.g. Tables 3 and 4), they were respectively denoted as ‘New: $\hat{\pi}_{(k)}$’ for $k = 1, 2, 3$.
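The two prediction rules can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, classes are coded 0 and 1 rather than 1 and 2, the cluster-level class proportions are computed from hard assignments of the known-function genes, and a cluster containing no known-function genes is given equal proportions.

```python
import numpy as np


def cluster_class_proportions(tau_known, class_known, n_classes=2):
    """Class counts and proportions among known-function genes assigned to each cluster."""
    tau_known = np.asarray(tau_known, dtype=float)
    assigned = tau_known.argmax(axis=1)                        # hard cluster assignment
    counts = np.zeros((tau_known.shape[1], n_classes))
    np.add.at(counts, (assigned, np.asarray(class_known)), 1)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: clusters without known-function genes get equal proportions.
    props = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_classes)
    return props, counts


def hard_classify(tau_new, counts, rng):
    """Predict the majority class of the cluster each new gene is assigned to."""
    cluster_class = np.empty(counts.shape[0], dtype=int)
    for i, row in enumerate(counts):
        best = np.flatnonzero(row == row.max())
        cluster_class[i] = rng.choice(best)                    # ties are broken at random
    return cluster_class[np.asarray(tau_new).argmax(axis=1)]


def soft_classify(tau_new, props):
    """Expected class membership of each new gene: sum over clusters of tau * proportion."""
    return np.asarray(tau_new, dtype=float) @ props
```

Summing the rows of the `soft_classify` output over the genes of a given true class gives the expected number of those genes predicted to be in each class.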


Table 2 BIC with various numbers (g) of clusters in the standard and stratified clustering for Hughes’ gene expression data with 300 microarray experiments

 

Table 3 Predictions for stratum 3 based on hard classification using standard clustering and new clustering with g = 8 for Hughes’ gene expression data with 300 microarray experiments

 
As a comparison, we also treated the problem as supervised learning: the genes in the first two strata (200 from each class) and their class labels were training data, while the genes in the third stratum were test data. We applied a classic linear discriminant analysis (LDA) and three new classifiers regarded as among the best: the nearest shrunken centroids (NSC) (Tibshirani et al., 2003), random forests (RF) (Breiman, 2001) and support vector machines (SVM) (Vapnik, 1998). Our purpose here was not to directly compare clustering analyses with these classifiers; rather, these classifiers provided some estimates of the upper bound of the predictive performance of clustering for this particular application. We used the default settings of the R functions lda(), pamr(), randomForest() and svm() implementing the methods, except that 5-fold cross-validation was used to select parameters for NSC and SVM.

3.2.1 Using 300 microarray experiments
First, we considered the use of the full dataset with all 300 microarray experiments included. Table 2 gives the BIC values for the two methods with various numbers of clusters, based on which $g = 8$, which minimized BIC, was selected for both methods. The predictive results based on hard classification for the genes in stratum 3 are given in Table 3, showing that the two methods gave almost the same result. As mentioned earlier, the result of ‘New: $\hat{\pi}_{(k)}$’ was based on using the prior probability estimate $\hat{\pi}_{(k)}$ to calculate the posterior probabilities and thus to make predictions for the genes in stratum 3; in practice, we would simply use ‘New: $\hat{\pi}_{(3)}$’ for stratum 3. Furthermore, based on soft classification (Table 4), the results from the two methods were still almost the same. Tables 5 and 6 give the estimated prior probabilities and the predictive results of the five classifiers, respectively. For NSC, cross-validation selected the number of the microarray experiments in the constructed classifier at 85; as a comparison, we also included the result using all the 300 microarray experiments. All the supervised methods except LDA worked well; the inferior performance of LDA was likely because of a problematic estimation of a large (i.e. 300 × 300) covariance matrix with only a few hundred observations.


Table 4 Predictions for stratum 3 based on soft classification using standard clustering and new clustering with g = 8 for Hughes’ gene expression data with 300 microarray experiments

 

Table 5 Estimated prior probabilities with g = 8 for Hughes’ gene expression data with 300 microarray experiments

 

Table 6 Predictions for a separate test dataset (i.e. stratum 3) using the NSC, LDA, RF and SVM for Hughes’ gene expression data with 300 microarray experiments

 

Table 7 BIC with various numbers (g) of clusters in the standard and stratified clustering for Hughes’ gene expression data with only 10 microarray experiments

 
In summary, using the gene expression data with all the 300 microarray experiments, the standard clustering and our proposed method had similar performance. The close performance between the two methods was presumably because there was enough information in the data for predicting the two functional classes; with fewer microarray experiments or more functional classes, differences might appear. On the other hand, as in the null case (i.e. set-up 1) of the simulation, the results showed that, although our proposed method did not improve over the standard method, it did not result in any significant loss of efficiency either.

3.2.2 Using 10 microarray experiments
We suspected that because of information redundancy in the gene expression data for predicting the two gene functional classes, the influence of the prior probability was negligible; e.g. the cluster centers were all far away from each other, dominating the resulting posterior probability calculations and thus final clustering results. Hence, to increase the difficulty or the noise level of the problem, next we only used the first 10 microarray experiments.

Based on BIC, we chose $g = 5$ clusters for both methods (Table 7). Using hard classification (Table 8), the overall accuracy rates from the two approaches were close, though the new method had a slight edge over the standard method. It seemed that using the estimated prior probabilities for strata 1 and 2 improved the predictive accuracy rates for the two classes respectively, at the expense of a lower accuracy rate for the other class. This trend was more evident with soft classification (Table 9); more importantly, when using the estimated prior probability from stratum 3 as prescribed, our proposed method gave a higher overall accuracy rate than that of the standard method.


Table 8 Predictions for stratum 3 based on hard classification using standard clustering and new clustering with g = 5 for Hughes’ gene expression data with 10 microarray experiments

 

Table 9 Predictions for stratum 3 based on soft classification using standard clustering and new clustering with g = 5 for Hughes’ gene expression data with 10 microarray experiments

 
Table 10 gives prior probability estimates for the two methods. Most of these estimates were close to each other; in particular, $\hat{\pi}_i$ from the standard method was in general close to $\hat{\pi}_{(3),i}$ from the new method, perhaps because stratum 3 had similar proportions of the genes from the two classes as did the total dataset. However, some estimates were different; e.g. for cluster 2, the four prior estimates were 0.461, 0.535, 0.134 and 0.344, respectively. The overall closeness as well as some individual differences among the estimated prior probabilities offered an explanation of why the clustering results were close, but nevertheless different.


Table 10 Estimated prior probabilities with g = 5 for Hughes’ gene expression data with 10 microarray experiments

 
The predictive performances of the four classifiers are summarized in Table 11. Comparing Tables 11 and 8, we found that the two clustering methods, especially our proposed new one, worked well, with results quite close to those of the supervised learning methods, highlighting the promise of using clustering analysis for gene function prediction. Note that there are several other advantages of clustering analysis over supervised learning. First, clustering methods allow the possibility of discovering new classes while predicting for existing classes (Golub et al., 1999). Second, rather than using a hierarchical ontology to describe gene functions, gene associations or linkages have been argued to be an alternative (Fraser and Marcotte, 2004), for which clustering methods are more naturally applicable than supervised learning.


Table 11 Predictions for a separate test dataset based on the NSC, LDA, RF and SVM for Hughes’ gene expression data with 10 microarray experiments

 

    4 DISCUSSION
With relatively high noise levels in genomic data, the importance of incorporating biological knowledge into statistical analysis has been increasingly recognized. However, the current practice in this direction is mainly restricted to using biological knowledge as an evaluation criterion to validate the analysis results after the analysis is done. For example, many systems have been built to assess statistical enrichment of a list of user-supplied genes in any GO categories, as evidenced by two recent reviews by Handl et al. (2005) and Khatri and Draghici (2005); see references therein. There are a few exceptions: e.g. Mootha et al. (2003), Al-Shahrour et al. (2005) and Pan (2005) proposed different stratified/subgroup analysis approaches to detect differential gene expression, all based on the idea of using biological information to group genes to form strata; Lottaz and Spang (2005) considered incorporating GO categories into sample or tumor classifications, though their main motivation was to enhance interpretability of results, which is also an important factor applicable to genomic analyses. Here we extend the idea to clustering genes and demonstrate its improved performance using both simulated and real data.

In practice, one has to determine which gene functional groups to use. As in any serious statistical modeling, some careful thought and trade-offs are needed. There are two factors that need to be balanced. It is desirable to have any group $G_k$ as homogeneous as possible; such a group, however, may contain only a few genes. Furthermore, using smaller groups leads to a larger number of groups; hence more parameters (i.e. the $\pi_{(k)}$'s) have to be estimated based on smaller sample sizes. Here we suggest three approaches. The first is both simple and practical: to avoid using functional groups that are either too big or too small, we specify an acceptable range for the number of genes a functional group can contain, plus possibly some other constraints, and thus determine eligible functional groups. A specific example is the use of ‘informative’ GO categories (Zhou et al., 2002); an informative GO category is one satisfying two conditions: (1) it contains at least $\gamma$ genes and (2) it does not have any child category containing at least $\gamma$ genes, where $\gamma$ is a threshold in the range of 20–40. For example, with $\gamma = 30$, 73 informative MIPS categories were found (Xiao and Pan, 2005), which can be used as priors in our method. Second, among several candidate sets of functional groups, model-selection criteria, such as BIC, can be used to choose the one estimated to give the best predictive performance; this is an advantage of model-based clustering. Third, following the line of Huang et al. (2006), a weighted method can be applied to combine the results of using two sets of functional groups, or more generally, to take account of the hierarchical structure of a gene annotation system. More studies on genomic scales are needed.
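The two conditions defining an informative category can be sketched in Python as follows. The data layout (a dictionary from category to its set of genes and a dictionary from category to its direct child categories) and the function name are assumptions made for this example; real annotation files would need to be parsed into this form first.

```python
def informative_categories(genes_in, children, gamma=30):
    """Return categories with at least gamma genes and no child category that large.

    genes_in : dict mapping category -> set of annotated genes
    children : dict mapping category -> iterable of direct child categories
    """
    def is_large(category):
        return len(genes_in.get(category, ())) >= gamma

    return sorted(
        c for c in genes_in
        if is_large(c) and not any(is_large(child) for child in children.get(c, ()))
    )
```

For instance, with `gamma=2`, a parent category holding three genes is not informative if one of its children already holds two of them; the child is then the informative category.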

In our example, to deal with genes with multiple functions, we simply replicated their expression profiles for each of their functional groups. This paralleled the usual treatment when the expression profile of a gene with multiple functions was used as a test observation. Nevertheless, a downside was that, owing to the introduced correlations among the expression profiles of the same genes, the resulting calculation of BIC was not exact. On the other hand, with only a small number of such genes as in our example, we suspect that the influence was limited. Furthermore, to avoid any potential problem, a simple alternative is to assign a gene randomly to only one of its known functional groups for training data; this will only influence the prior specification by ignoring multiple functions of some genes, but not the validity of final results.

In this article, we have only considered clustering genes, but the idea can be equally applied to model-based clustering of samples for tumor classifications (e.g. Yeung et al., 2001; Ghosh and Chinnaiyan, 2002; McLachlan et al., 2002) if the samples are known a priori to belong to different groups; these groups can be formed based on some diagnostic or prognostic factors. Our proposed method also bears some similarity to clustering partially classified data (McLachlan and Peel, 2000, Section 2.19); see Qu and Xu (2004) and Alexandridis et al. (2004) for the latter's two applications to gene expression data. However, there is a key difference: in partially classified data, all the genes known to have the same class label are assumed to be from the same cluster, whereas, we allow genes sharing the same class label to come from different clusters; we only assume that they share the same prior probability of coming from the same cluster. In other words, in partially classified data, a class label is assumed to be the same as a cluster membership, whereas we do not impose such a restriction. Considering the heterogeneity of various gene functional categories, our modeling assumption is not only weaker, but also more reasonable in the current context.

Many existing clustering methods are not model-based (e.g. Tamayo et al., 1999; Tseng and Wong, 2005). An advantage of model-based clustering is its explicit statistical modeling, which for instance facilitates incorporating subject-matter knowledge as priors into analysis, as we have shown here. In contrast, in non-model-based clustering, biological knowledge was directly incorporated into a distance metric (Hanisch et al., 2002; Cheng et al., 2004; Huang and Pan, 2006). The advantage of the former is its robustness: when the prior specification is incorrect, the final results may still be valid as long as there is enough information in the data (e.g. Carlin and Louis, 2000), while the latter may give completely wrong results. This also partially explains the results in our example: first, with the data containing 300 microarray experiments, the new method gave results similar to those of the standard method, presumably because there was enough information in the data and incorporating biological knowledge as a prior would not make any difference; in contrast, with the data consisting of only 10 microarray experiments, by incorporating biological knowledge, the new method was more efficient than the standard method and thus gave better results. We are currently further evaluating the difference between the two approaches of incorporating biological knowledge in clustering. In addition, our proposed method is an empirical Bayes approach; conceptually it seems possible to generalize our idea to fully Bayesian model-based clustering (Richardson and Green, 1997; Broet et al., 2002; Medvedovic and Sivaganesan, 2002; Fraley and Raftery, 2005, http://www.stat.washington.edu/www/research/reports/), though computational challenges remain. The methodology may also be extended to other more elaborate modeling contexts, such as analysis of time-course microarray data (e.g. Ramoni et al., 2002; Luan and Li, 2003). These are interesting topics to be studied in the future.


    Acknowledgments
 
The author thanks Guanghua Xiao and Peng Wei for helpful discussions and assistance with the gene expression data. The author is grateful to the reviewers for their constructive comments. This work was supported by NIH grant HL65462 and a UM AHC Development grant.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on October 25, 2005; revised on January 16, 2006; accepted on January 16, 2006

    REFERENCES
 

    Alexandridis, R., et al. (2004) Class discovery and classification of tumor samples using mixture modeling of gene expression data. Bioinformatics, 20, 2545–2552.

    Al-Shahrour, F., et al. (2005) Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics, 21, 2988–2993.

    Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.

    Breiman, L. (2001) Random forests. Mach. Learn., 45, 5–32.

    Broet, P., et al. (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol., 9, 671–683.

    Brown, M.P., et al. (2000) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.

    Carlin, B.P. and Louis, T.A. (2000) Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC Press.

    Cheng, J., et al. (2004) A knowledge-based clustering algorithm driven by Gene Ontology. J. Biopharm. Stat., 14, 687–700.

    Dempster, A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39, 1–38.

    Eisen, M., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.

    Fang, Z., et al. (2006) Knowledge guided analysis of microarray data. J. Biomed. Inform. (in press).

    Fraley, C. and Raftery, A.E. (1998) How many clusters? Which clustering methods? Answers via model-based cluster analysis. Comput. J., 41, 578–588.

    Fraley, C. and Raftery, A.E. (2002) Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc., 97, 611–631.

    Fraley, C. and Raftery, A.E. (2005) Bayesian regularization for normal mixture estimation and model-based clustering. Technical report 486, Department of Statistics, University of Washington.

    Fraser, A.G. and Marcotte, E.M. (2004) A probabilistic view of gene function. Nat. Genet., 36, 559–564.

    Ghosh, D. and Chinnaiyan, A.M. (2002) Mixture modeling of gene expression data from microarray experiments. Bioinformatics, 18, 275–286.

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 285, 531–537.

    Handl, J., et al. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics, 21, 3201–3212.

    Hanisch, D., et al. (2002) Co-clustering of biological networks and gene expression data. Bioinformatics, 18, 145–154.

    Huang, D. and Pan, W. (2006) Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Research report 2006-007. Division of Biostatistics, University of Minnesota. Available at http://www.biostat.umn.edu/rrs.php/.

    Huang, D., Wei, P. and Pan, W. (2006) Combining gene annotations and gene expression data in model-based clustering: a weighted method. OMICS (in press).

    Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.

    Khatri, P. and Draghici, S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21, 3587–3595.

    Li, H. and Hong, F. (2001) Cluster-rasch models for microarray gene expression data. Genome Biol., 2, RESEARCH0031.

    Lottaz, C. and Spang, R. (2005) Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics, 21, 1971–1978.

    Luan, Y. and Li, H. (2003) Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics, 19, 474–482.

    McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. John Wiley & Sons, Inc., New York.

    McLachlan, G.J., et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.

    McLachlan, G.J., et al. (2003) Modeling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal., 41, 379–388.

    Medvedovic, M. and Sivaganesan, S. (2002) Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18, 1194–1206.

    Mewes, H.W., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32, D41–D44.

    Mootha, V.K., et al. (2003) PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273.

    Pan, W. (2005) Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat. Appl. Genet. Mol. Biol., 4, Article 12.

    Pan, W., et al. (2002) Model-based cluster analysis of microarray gene-expression data. Genome Biol., 3, RESEARCH0009.

    Qu, Y. and Xu, S. (2004) Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics, 20, 1905–1913.

    Ramoni, M., et al. (2002) Cluster analysis of gene expression dynamics. Proc. Natl Acad. Sci. USA, 99, 9121–9126.

    Richardson, S. and Green, P.J. (1997) On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B, 59, 731–758.

    Schwarz, G. (1978) Estimating the dimension of a model. Annal. Stat., 6, 461–464.

    Tamayo, P., et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912.

    Tibshirani, R., et al. (2003) Class prediction by nearest shrunken centroids, with application to DNA microarrays. Stat. Sci., 18, 104–117.

    Tseng, G.C. and Wong, W.H. (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics, 61, 10–16.

    Vapnik, V. (1998) Statistical Learning Theory. Wiley, New York.

    Wu, L.F., et al. (2002) Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet., 31, 255–265.

    Xiao, G. and Pan, W. (2005) Gene function prediction by a combined analysis of gene expression data and protein–protein interaction data. J. Bioinform. Comput. Biol., 3, 1371–1389.

    Yeung, K.Y., et al. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977–987.

    Zhou, X., et al. (2002) Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA, 99, 12783–12788.




