Bioinformatics Advance Access originally published online on February 15, 2005
Bioinformatics 2005 21(9):2118-2122; doi:10.1093/bioinformatics/bti318
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Applications of beta-mixture models in bioinformatics
1Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center Houston, TX 77030, USA
23M Pharmaceuticals St Paul, Minnesota, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Summary: We propose a beta-mixture model approach to solve a variety of problems related to correlations of gene-expression levels. For example, in meta-analyses of microarray gene-expression datasets, a threshold value of correlation coefficients for gene-expression levels is used to decide whether gene-expression levels are strongly correlated across studies. Ad hoc threshold values such as 0.5 are often used. In this paper, we use a beta-mixture model approach to divide the correlation coefficients into several populations so that the large correlation coefficients can be identified. Another important application of the proposed method is in finding co-expressed genes. Two examples are provided to illustrate both applications. Through our analysis, we also discover that the popular model selection criteria BIC and AIC are not suitable for the beta-mixture model. To determine the number of components in the mixture model, we suggest an alternative criterion, ICLBIC, which is shown to perform better in selecting the correct mixture model.
Contact: yuanji{at}mdanderson.org
Supplementary information: http://odin.mdacc.tmc.edu/~yuanj/highcorgeneanno.html
| 1 INTRODUCTION |
|---|
|
|
|---|
We propose a beta-mixture model approach to solve a variety of problems in bioinformatics related to a large number of correlation coefficients. The correlation coefficients could be computed for the expression levels of the same gene measured under microarrays from different studies, which is often seen in meta-analyses of multiple gene-expression experiments (Ghosh et al., 2003; Kuo et al., 2002; Parmigiani et al., 2004; Pusztai et al., 2003). Alternatively, the correlation coefficients could come from a pathway analysis related to a critical gene, (Sabatti et al., 2002) in which the expression levels of the gene are examined to see whether they correlate with those of other genes in the same array data. Investigators are often asked to draw conclusions based on the magnitudes of these correlations. Specifically, a threshold value is used to decide whether the gene-expression levels are strongly correlated or not. For example, if a correlation coefficient is >0.5, then the gene-expression levels are positively correlated. However, the choice of 0.5 for the threshhold value is hard to justify.
Regardless of the origin, a large number of correlation coefficients are obtained, which usually reflect certain biological behaviors of the genes under investigation, e.g. if the genes are consistently expressed across the studies or if the genes are co-expressed. Naturally, not all genes behave the same way, and as a result, the values of their correlation coefficients differ. Finite mixture models (McLachlan and Peel, 2000) are typically used to analyze data of this type. Specifically, the correlation coefficients can be considered as coming from several underlying probability distributions. Each distribution is a component of the mixture model representing a gene population with similar behavior, and all the components are combined into a comprehensive model by a mixture form. To model the correlation coefficients, we use the beta distribution, which has been known for its flexible shapes and is therefore widely used to describe data from various experiments. In our analyses, we find that a two-component beta-mixture model is usually adequate to fit the correlation coefficients, with one component representing the population of uncorrelated gene expression levels and the other component representing the population of correlated ones.
One key issue in the finite mixture model approach is to decide the number of components in the model. This has been extensively studied for the normal-mixture model. The commonly used Akaike information criterion (AIC) Akaike (1973) and Bayesian information criterion (BIC) (Schwarz, 1978) have been shown to be adequate for deciding the number of components in the normal-mixture model (Leroux, 1992; Roeder and Wasserman, 1997). However, little is known about the performance of these criteria in the case of the beta-mixture model. Our finding through both simulations and real examples is that neither criterion is suitable for the beta-mixture model. Both AIC and BIC seem to overestimate the number of components, leading to mixture models with excessive numbers of components. On the other hand, we find that the integrated classification likelihoodBIC (ICLBIC) (Biernacki et al., 1998) is adequate, always selects the right model in our simulation studies and yields reasonable results when applied to real data.
| 2 THE STATISTICAL MODEL |
|---|
|
|
|---|
The beta-mixture model deals with a vector of correlation coefficients of gene-expression levels. Usually, the dimension of the vector is large, in the order of thousands. The correlation coefficients are assumed to come from multiple underlying probability distributions, in our case, beta distributions. To fit the beta distribution, for each correlation coefficient xi, we apply a linear transformation yi = (xi+1)/2, so that the range of the transformed values is between 0 and 1. The index i represents the gene with respect to which the correlation coefficient y is calculated.
Let {yi},i = 1,...,n, denote the transformed correlation coefficients, where n is the total number of observations. Under a mixture of beta distributions,
![]() |
![]() |
is the beta function. The quantity L is the number of components in the mixture.
We augment the data by introducing the latent indicator variable zil for each gene i, where
![]() |
= (
1,...,
L)'. This is the prior distribution for zi. Let
= (
1, ß1,...,
L, ßL) be the vector containing all the other unknown parameters
and ß. The likelihood function for the complete data is given by
![]() |
![]() | (1) |
| 3 THE EXPECTATIONMAXIMIZATION ALGORITHM |
|---|
|
|
|---|
We use the expectationmaximization (EM) algorithm (Dempster et al., 1977) to iteratively maximize the log-likelihood and update the conditional probability that yi comes from the l-th component, which is defined as
![]() | (2) |
is a maximizer of the log-likelihood (1), for given z*s. We assign yi to the component
. The EM algorithm iterates between an E-step where values
are computed from the data with the current parameter estimates, and an M-step in which the log-likelihood (1), with each zil replaced by its current conditional expectation
, is maximized with respect to the parameters
and
. The detailed algorithm is given as follows:
- Initialize
: This can be done by, for example, assigning the smallest 100/L percentage of yi to the first component, and the next smallest 100/L percentage of yi to the second component, etc.
- M-step: Given
, maximize (1) with respect to the parameters
and
. Specifically, 
We obtain a maximizer
of the log-likelihood in (1) numerically.
- E-step: Given the parameter estimates from the M-step, compute

- Repeat M-step and E-step until the change in the value of the log-likelihood in Equation (1) is negligible.
Maximization with respect to
l and ßl can only be carried out numerically. We use the nlm function in R, which is freely available at http://www.r-project.org/. The EM algorithm yields the final estimated posterior probability
, the value of which represents the posterior probability that correlation coefficient yi comes from component l. We assign yi to component l0 as previously defined, which follows a beta distribution with parameter estimates
and
, also yielded by the EM algorithm. The characteristics of the beta distribution contain information that can be used for inference about the behavior of the genes belonging to the corresponding component. For example, if the beta distribution is closely centered around a positive value, say 0.7, then the correlation coefficients in this component represent a population of strongly correlated genes.
The determination of the number of components L is challenging and is often decided by some model selection criterion, e.g. AIC, BIC or ICLBIC. While AIC and BIC are well known, the definition of ICLBIC is given by
![]() |
is the estimated entropy of the fuzzy classification matrix C = ((zil)). Typically, mixture models are fitted through the EM algorithm with different values of L, and the model with the smallest AIC, BIC or ICLBIC is chosen. We evaluate the performance of these three criteria through a simulation study in the next section. | 4 SIMULATION |
|---|
|
|
|---|
To be more realistic, instead of directly simulating correlation coefficients, we simulated sets of gene-expression levels with different correlation coefficients across datasets from different experiments. Specifically, we considered two experiments in which the same set of 10 000 genes were measured. We simulated 10 expression levels for each gene in each experiment. Let Rik and Sik denote the k-th expression level of gene i in experiments 1 and 2, respectively, i = 1,...,10 000 and k = 1,...,10. Assume that (Rik, Sik)' are independent for different values of i and k and
![]() | (3) |
i is the correlation of the expression levels across two experiments. We let the variance expression level be proportional to the square mean expression level, which is often observed in practice. In the first simulation, we let
i = 0.7 for the first 3000 genes and
i = 0.0 for the last 7000 genes. We sampled each µi independently from a normal distribution with mean 0 and variance 4. In the second simulation, the µi were sampled in the same way. However, we let
= 0.7 for the first 3000 genes,
= 0.0 for the next 4000 genes and
= 0.7 for the last 3000 genes. We sampled (Rik, Sik)' from the distribution in Equation (3) and computed the sample correlation of the vectors (Ri1,...,Ri10) and (Si1,...,Si10) for each gene i. We thus obtained 10 000 sample correlation coefficients, which came from two underlying distributions for the first simulation and three for the second simulation. We computed the values of AIC, BIC and ICLBIC for each simulation after implementing the proposed beta-mixture model with different values of L. The results are presented in Table 1. Both AIC and BIC prefer models with an excessive number of components. The criterion ICLBIC, however, correctly chooses the right model for both simulations.
|
| 5 TWO EXAMPLES |
|---|
|
|
|---|
The proposed method is illustrated with two real bioinformatics studies. The first study requires combining gene expression from different types of microarrays. The second deals with the identification of co-expressed genes.
5.1 Study I: combining gene expression levels
This study was a meta-analysis involving a total of 33 patients with diagnosed stages IIII breast cancer. Two microarray technologies were used in this study: Affymetrix Human Genome U133 chip sets (HG-U133A and HG-U133B), and radioactively labeled cDNA nylon membrane microarrays printed by Millennium Pharmaceuticals Inc. (Cambridge, MA). There were 9285 pairs of genes that were common to both array platforms. One of the study objectives was to determine which pairs were strongly correlated in their expression levels. For each gene on each platform, we had 33 measurements corresponding to the 33 patients. Using these 33 measurements on each platform, we computed correlation coefficients for each gene, which resulted in 9285 correlation coefficients. We applied the beta-mixture model to these correlation coefficients, with the number of components ranging from 1 to 5. ICLBIC chose L = 2, while AIC and BIC both preferred the largest model with L = 5 (Table 2). Figure 1 displays the fitted densities of the beta-mixture distributions with L ranging from 2 to 5. It seems that when L = 2, the fitting is already adequate; the improvement in the fit for the case when L > 2 is negligible.
|
|
We proceeded by fitting a two-component beta-mixture model to the correlation coefficients. The means of the fitted beta distributions respectively equaled 0.54 and 0.76, which corresponded to the values 0.09 and 0.53 on the correlation scale. The standard deviations of both beta distributions equaled 0.1. Therefore, one component (with mean correlation 0.09) corresponded to the genes with weak or no correlations and the other (with mean correlation 0.53) to the genes with strong correlations. The cut-off value based on the posterior probability was 0.31, i.e. if a correlation coefficient was >0.31, it would be considered as coming from the component with strong correlations.
After the two populations of genes were discovered, efforts were taken to explain why some pairs were poorly correlated.
Those corresponding genes with strong correlations were used for further analysis.
5.2 Study II: gene co-expression
The second example used the dataset from the study conducted by Beer et al. (2002). In this study, the authors identified a list of important genes that are related to the survival of patients with adenocarcinoma. We arbitrarily selected one gene, tyrosine hydroxylase (TH), from the original list and computed the correlation coefficients of its expression levels to the remaining genes in the data. We obtained 7128 correlation coefficients. We repeated the same analysis as in the last section, and found that again ICLBIC seemed to select the right mixture model with two components while AIC and BIC seemed to overestimate the number of components. We fitted a two-component beta-mixture model suggested by ICLBIC, the density of which is given in Figure 2. The means of the two beta distributions equaled 0.46 and 0.71, corresponding to the values 0.07 and 0.43, respectively, on the correlation scale. The standard deviations of both distributions equaled 0.1. Therefore, we have two gene populations, one with weak correlations and the other strong. The ones in the strongly correlated population are potentially co-expressed genes, a possibility that needs to be validated by further analysis.
|
Among the top 20 genes highly correlated with the target gene TH, which is related to tyrosine metabolism, we found at least 9 genes that are functionally related to TH. Many of the highly correlated genes are related to muscle development and contraction, including genes KCNQ1, MMP16, TNNI1, SMPD1, GPR68, HTR2C, CUL5, EMD and OXT. Tyrosine hydroxylase is essential in tyrosine metabolism, and tyrosine kinase, which requires tyrosine for its activity, is related to the function of muscle tissue. Du et al. (1994) reported that the expression level of TH is related to the function of muscle. Patients with adenocarcinoma might have a higher level of TH activity than healthy people because the smooth muscle of their lungs needs to do more work to compensate for the lung tissue that is compromised by disease. An annotation table of the top 20 highly correlated genes is provided in the Supplemental information section.
| 6 DISCUSSION |
|---|
|
|
|---|
We make two contributions in this paper. First, we provide a statistical tool, the beta-mixture model, for analyzing a large number of correlation coefficients, which is often required in bioinformatics. To our knowledge, there is no objective method for performing the analysis so far. Second, we show that the model selection criteria AIC and BIC do not work for the beta-mixture model, although they have been proved to be adequate for normal-mixture models. Specifically, both criteria seem to overestimate the number of components in the beta-mixture model, a sign of lack of penalties for the model size. We further find that the criterion ICLBIC seems to work well and selects the right model in our simulations. ICLBIC adds more penalties for a larger model and, therefore, performs better.
As theoretical justification is needed for further work, we hope that the findings provided in this paper will generate more research in this area.
| Acknowledgments |
|---|
This research was partly supported by the National Institutes of Health through the University of Texas M.D. Anderson Cancer Center SPORE in Lung Cancer grant CA070907 and the Prostate Cancer grant CA90270.
Received on January 5, 2005; revised on January 27, 2005; accepted on February 8, 2005
| REFERENCES |
|---|
|
|
|---|
Akaike, H. (1973) Information theory and the extension of the maximum likelihood principle. In Petrov, V. and Csaki, F. (Eds.). Proceedings of the Second International Symposium on Information Theory, , Budapest Akailseoniai-kiudo, pp. 267281.
Beer, D., et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816824[ISI][Medline].
Biernacki, C., et al. (1998) Assessing a mixture model for clustering with the integrated classification likelihood. , Rhone-Alpes Technical Report No. 3521 INRIA.
Dempster, A., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B, 39, 138.
Ghosh, D., et al. (2003) Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Funct. Integr. Genomics, 4, 180188.
Kuo, W., et al. (2002) Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics, 18, 405412
Leroux, B. (1992) Consistent estimation of a mixing distribution. Ann. Stat., 20, 13501360.
McLachlan, G. and Peel, D. Finite Mixture Models, (2000) , NY John Wiley and Sons.
Parmigiani, G., et al. (2004) A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clin. Cancer Res., 10, 29222927
Pusztai, L., et al. (2003) Gene expression profiles obtained from single passage fine needle aspirations (FNA) of breast cancer reliably identify prognostic/predictive markers such as estrogen (er) and HER-2 receptor status and reveal large scale molecular differences between ER-negative and ER-positive tumors. Clin. Cancer Res., 9, 24062415
Roeder, K. and Wasserman, L. (1997) Practical density estimation using mixtures of normals. J. Am. Stat. Assoc., 92, 894902[CrossRef].
Sabatti, C., et al. (2002) Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res., 13, 28862893.
Schwarz, G. (1978) Estimating the dimension of a model. Ann Stat., 6, 461464.
This article has been cited by other articles:
![]() |
J. Zhang, Y. Ji, and L. Zhang Extracting three-way gene interactions from microarray data Bioinformatics, November 1, 2007; 23(21): 2903 - 2909. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










