Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(19):2388-2395; doi:10.1093/bioinformatics/btl393
Semi-supervised learning via penalized mixture model with application to microarray sample classification
1 Division of Biostatistics, School of Public Health, University of Minnesota, MN, USA
2 School of Statistics, University of Minnesota, MN, USA
3 Department of Biostatistics, Vanderbilt University, TN, USA
4 Vascular Biology Center and Division of Hematology-Oncology-Transplantation, University of Minnesota Medical School, MN, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: It is biologically interesting to address whether human blood outgrowth endothelial cells (BOECs) belong to or are closer to large vessel endothelial cells (LVECs) or microvascular endothelial cells (MVECs) based on global expression profiling. An earlier analysis using a hierarchical clustering and a small set of genes suggested that BOECs seemed to be closer to MVECs. By taking advantage of the two known classes, LVEC and MVEC, while allowing BOEC samples to belong to either of the two classes or to form their own new class, we take a semi-supervised learning approach; for high-dimensional data as encountered here, we propose a penalized mixture model with a weighted L1 penalty to realize automatic feature selection while fitting the model.
Results: We applied our penalized mixture model to a combined dataset containing 27 BOEC, 28 LVEC and 25 MVEC samples. Analysis results indicated that the BOEC samples appeared to form their own new class. A simulation study confirmed that, compared with the standard mixture model with or without initial variable selection, the penalized mixture model performed much better in identifying relevant genes and forming corresponding clusters. The penalized mixture model seems to be promising for high-dimensional data with the capability of novel class discovery and automatic feature selection.
Contact: weip{at}biostat.umn.edu
| 1 INTRODUCTION |
|---|
A biologically interesting question is whether human blood outgrowth endothelial cells (BOECs) belong to or are closer to large vessel endothelial cells (LVECs) or microvascular endothelial cells (MVECs). BOECs are being explored for efficacy in endothelial-based gene therapy (Lin et al., 2002), and as being useful for vascular diagnostic purposes (Hebbel et al., 2005); in each case, it is important to know whether BOEC have characteristics of MVECs or of LVECs. Based on the expression of gene CD36, it seems reasonable to characterize BOECs as MVECs (Swerlick et al., 1992). However, CD36 is expressed in endothelial cells, monocytes, some epidermal cells and a variety of cell lines; characterization of BOECs or any other cells using a single gene marker seems unreliable. Jiang (2005) conducted a genome-wide comparison: microarray gene expression profiles for BOEC, LVEC and MVEC samples were clustered; it was found that BOEC samples tended to cluster together with MVEC samples, suggesting that BOECs were closer to MVECs.
There were two potential shortcomings with the above approach. First, the method used was hierarchical clustering, an unsupervised learning technique that ignores the known classes of the LVEC and MVEC samples; it seems more natural to use semi-supervised learning, treating the class labels of the LVEC and MVEC samples as known and those of the BOEC samples as unknown (see McLachlan and Basford, 1988; McLachlan, 1992; Zhu, 2006, http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf for reviews on semi-supervised learning). It is worth pointing out that many semi-supervised learning approaches are not applicable to the current problem, which requires a learning algorithm capable of novel class discovery: the BOEC samples may belong to one of the two known LVEC and MVEC classes, or they may form their own new class. Second, the clustering results were based on the expression levels of 37 genes, which were selected to best discriminate between LVEC and MVEC samples. Importantly, any clustering result may depend critically on the features or genes being used; the small number of genes used might not reflect the whole picture. It would be desirable to start with a larger set of genes to cluster; however, there is a dilemma: using too many genes, most of which may not be informative in discriminating known and unknown classes, would mask some true clustering structures underlying the data, as shown later. For such high-dimensional data, it is thus necessary to have a feature selection mechanism, preferably embedded within the learning framework, as opposed to the usual practice of first selecting features and then fitting a model; as shown later, pre-selecting features can perform poorly because the selected features may not be relevant at all to uncovering interesting clustering structure in the data, largely owing to the separation between the two steps of feature selection and model fitting.
In this work, we propose a new method that overcomes the above two problems: we take a semi-supervised learning approach in the framework of a penalized mixture model, allowing automatic variable selection simultaneously with model fitting. We demonstrate that, with a larger set of genes included in a starting model and with appropriate automatic gene selection, BOEC samples tend to form a separate cluster from those of LVEC and MVEC samples.
Although semi-supervised learning via a finite mixture model has been studied in the statistics and machine learning literature (McLachlan and Peel, 2002, Section 2.19; Nigam et al., 2006), and in particular applied to microarray data analysis (Alexandridis et al., 2004), our proposal of using a penalized likelihood to realize automatic variable selection in this context is novel; in fact, variable selection in this context is a largely neglected topic, though the results may depend critically on which variables are used, as shown later. This work extends the penalized unsupervised learning/clustering analysis method of Pan and Shen (2006, http://www.biostat.umn.edu./rrs.php) to semi-supervised learning. We emphasize that variable selection is not trivial in semi-supervised learning. Most existing methods, such as that of Alexandridis et al. (2004), employ a two-step strategy: first, selecting variables based on labeled samples or some ad hoc heuristics; second, using the selected variables to conduct semi-supervised learning. A serious issue with these approaches is the separation between the two steps: the variable selection step is completely independent of model learning, and hence there is no guarantee that the selected variables are relevant to the subsequent learning task; e.g. if we use only labeled data to select variables, the chosen variables may not be informative for distinguishing novel classes of unlabeled data from known classes of labeled data, as illustrated in our simulation study. In addition, the final results depend on the number of variables selected, and determining how many variables to keep is probably even more challenging. Finally, as demonstrated in regression and classification, variable selection may not be stable, while penalized regression may be more effective, especially when there is a sparse solution (Tibshirani, 1996), as encountered with high-dimensional data, which is our focus here.
In the sequel, we first review the standard mixture model for semi-supervised learning, then propose our new penalized mixture model, along with an EM algorithm to fit the model. Next we compare the performance of the standard and penalized mixture models using simulated data; we illustrate the performance of the methods when applied to the real microarray data containing BOEC, LVEC and MVEC samples to determine whether the BOEC samples belong to one of the two other classes. We end with a short discussion.
| 2 METHODS |
|---|
2.1 Semi-supervised learning via standard mixture model
With partially labeled or classified data, we have K-dimensional feature vectors x1, ..., xn, of which the first n0 do not have class labels while the last n1 do. Suppose that there are g = g0 + g1 classes, of which the first g0 are unknown classes to be discovered while the last g1 are known classes. Suppose zij is the indicator of whether observation xj is in class i: zij = 1 if and only if xj is known to be in class i; zij = 0 if and only if xj is known not to be in class i. Note that the zij are missing for 1 ≤ j ≤ n0, whereas they are observed for n0 < j ≤ n.
A mixture model is commonly used as a generative model for partially labeled data (McLachlan and Peel, 2002): it is assumed that each observation x comes from a finite mixture distribution

$$f(x;\Theta) = \sum_{i=1}^{g} \pi_i f_i(x;\theta_i),$$

with the mixing proportion πi, class-specific distribution fi and its parameters θi; i.e. each observation comes from fi, one of g components/classes/clusters, with prior probability πi. For high-dimensional data with small sample sizes, following Pan and Shen (2006), we propose each class-specific distribution fi as a Normal distribution with a common diagonal covariance matrix:

$$f_i(x;\theta_i) = \frac{1}{(2\pi)^{K/2}|V|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu_i)' V^{-1} (x-\mu_i)\right\},$$

where $\mu_i = (\mu_{i1},\ldots,\mu_{iK})'$, $V = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_K^2)$ and $|V| = \prod_{k=1}^{K}\sigma_k^2$.
To determine which component an unlabeled observation xj comes from, we calculate the posterior probability τij of xj's coming from component i:

$$\tau_{ij} = \frac{\pi_i f_i(x_j;\theta_i)}{\sum_{l=1}^{g} \pi_l f_l(x_j;\theta_l)}.$$

It is clear that, if µ1k = µ2k = ... = µgk for some k, then the terms involving xjk will cancel out from the numerator and the denominator of τij; i.e. feature/attribute k does not contribute to classification. Therefore, a common mean component µik across all the clusters i effectively realizes variable selection. It is noted that this is only possible under the assumption of a common diagonal covariance matrix V across all clusters; e.g. if cluster-specific covariance matrices Vi are used, even if µ1k = µ2k = ... = µgk, the terms involving xjk will not cancel out from τij; i.e. attribute k still contributes to classification.
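The cancellation argument can be checked numerically. The following is a minimal sketch in plain Python, assuming two components that share a diagonal covariance matrix; the function names `log_normal_diag` and `posterior` and all parameter values are illustrative and not taken from the original analysis.

```python
import math

def log_normal_diag(x, mu, var):
    """Log density of a Normal with mean mu and diagonal covariance diag(var)."""
    return sum(-0.5 * math.log(2 * math.pi * v) - 0.5 * (xk - m) ** 2 / v
               for xk, m, v in zip(x, mu, var))

def posterior(x, pis, mus, var):
    """Posterior probability that x belongs to each component."""
    logs = [math.log(p) + log_normal_diag(x, mu, var) for p, mu in zip(pis, mus)]
    top = max(logs)                      # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# The second feature has the same mean (0.3) in both components, so its
# observed value should not affect the posterior probabilities.
pis = [0.4, 0.6]
mus = [[0.0, 0.3], [1.5, 0.3]]
var = [1.0, 2.0]                         # common to both components
tau_a = posterior([0.2, -5.0], pis, mus, var)
tau_b = posterior([0.2, 7.0], pis, mus, var)
```

Here `tau_a` and `tau_b` should agree up to rounding error, because the second feature contributes the same term to every component and therefore cancels in the ratio.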
As usual, the mixture model involves unknown parameters to be estimated. Denote Θ = {(πi, θi): i = 1, ..., g} for all unknown parameters, with the restriction that 0 ≤ πi ≤ 1 for any i and $\sum_{i=1}^{g}\pi_i = 1$. Given the data, the log-likelihood is

$$\log L(\Theta) = \sum_{j=1}^{n_0} \log\left[\sum_{i=1}^{g} \pi_i f_i(x_j;\theta_i)\right] + \sum_{j=n_0+1}^{n} \log\left[\sum_{i=1}^{g} z_{ij}\,\pi_i f_i(x_j;\theta_i)\right].$$

The maximum likelihood estimator (MLE) of Θ can be obtained by maximizing the above log-likelihood using the EM algorithm (Dempster et al., 1977); see McLachlan and Peel (2002, Section 2.19) for details.
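To make the two parts of the log-likelihood concrete, here is a small sketch in plain Python. It assumes that an unlabeled sample is marked with `None` and a labeled one with its class index; this encoding, like the function names, is a hypothetical choice for illustration only.

```python
import math

def log_normal_diag(x, mu, var):
    """Log density of a Normal with mean mu and diagonal covariance diag(var)."""
    return sum(-0.5 * math.log(2 * math.pi * v) - 0.5 * (xk - m) ** 2 / v
               for xk, m, v in zip(x, mu, var))

def semi_supervised_loglik(X, labels, pis, mus, var):
    """Log-likelihood of partially labeled data under the mixture model.

    labels[j] is None for an unlabeled sample, else the known class index.
    """
    total = 0.0
    for x, lab in zip(X, labels):
        terms = [math.log(p) + log_normal_diag(x, mu, var)
                 for p, mu in zip(pis, mus)]
        if lab is None:
            # Unlabeled: log of the full mixture density (log-sum-exp).
            top = max(terms)
            total += top + math.log(sum(math.exp(t - top) for t in terms))
        else:
            # Labeled: only the known component contributes.
            total += terms[lab]
    return total
```

A labeled sample contributes the log of a single weighted component density, whereas an unlabeled sample contributes the log of the whole mixture, which is what lets unlabeled data inform the estimates of every component.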
2.2 Penalized mixture model
Rather than using the standard mixture model, following Pan and Shen (2006), we propose using a penalized mixture model for model regularization, realizing automatic variable selection; i.e. we propose using the maximum penalized likelihood estimator (MPLE) of Θ, as opposed to the MLE, to fit the model. Specifically, we maximize the following penalized log-likelihood with a weighted L1 penalty:

$$\log L_P(\Theta) = \log L(\Theta) - \lambda \sum_{i=1}^{g}\sum_{k=1}^{K} w_{ik}\,|\mu_{ik}|,$$

where λ ≥ 0 is the penalization parameter and the wik ≥ 0 are weights.
Note that, throughout this article, it is assumed that, prior to analysis, we have standardized the data so that each feature has sample mean 0 and sample variance 1. Hence, for any feature k, if the µik are zero for all 1 ≤ i ≤ g, then feature k will not be used; the L1 penalty serves to obtain a sparse solution with many small estimates of the µik automatically set to 0, thus realizing variable selection.
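The standardization assumed above can be sketched as follows in plain Python. The helper name `standardize` is hypothetical, and the sketch assumes that no feature is constant across samples (otherwise the standard deviation is zero and the division fails).

```python
import statistics

def standardize(X):
    """Scale each feature (column) to sample mean 0 and sample variance 1."""
    cols = list(zip(*X))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]   # sample SD, n - 1 denominator
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in X]
```

After this step a mean of zero in every class is the natural "uninformative" value for a feature, which is why shrinking the class means toward zero amounts to variable selection.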
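We derive an EM algorithm to maximize the penalized log-likelihood. To save space, we only summarize the major steps below; in particular, it is confirmed that the L1 penalty results in a thresholding rule with the desired sparsity property. We use the generic notation Θ(m) to represent the parameter estimates at iteration m. The EM iterates the following steps:

$$\hat\pi_i^{(m+1)} = \frac{1}{n}\sum_{j=1}^{n} \tau_{ij}^{(m)}, \tag{1}$$

$$\hat\sigma_k^{2,(m+1)} = \frac{1}{n}\sum_{i=1}^{g}\sum_{j=1}^{n} \tau_{ij}^{(m)} \left(x_{jk} - \hat\mu_{ik}^{(m)}\right)^2, \tag{2}$$

$$\hat\mu_{ik}^{(m+1)} = \mathrm{sign}\left(\tilde\mu_{ik}^{(m+1)}\right) \left(\left|\tilde\mu_{ik}^{(m+1)}\right| - \frac{\lambda\, w_{ik}\, \hat\sigma_k^{2,(m)}}{\sum_{j=1}^{n}\tau_{ij}^{(m)}}\right)_+, \tag{3}$$

$$\tau_{ij}^{(m)} = \begin{cases} \dfrac{\hat\pi_i^{(m)} f_i(x_j;\hat\theta_i^{(m)})}{\sum_{l=1}^{g} \hat\pi_l^{(m)} f_l(x_j;\hat\theta_l^{(m)})}, & 1 \le j \le n_0, \\ z_{ij}, & n_0 < j \le n, \end{cases} \tag{4}$$

$$\tilde\mu_{ik}^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(m)} x_{jk}}{\sum_{j=1}^{n} \tau_{ij}^{(m)}}, \tag{5}$$

where $(a)_+ = \max(a, 0)$.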
The above steps are iterated until convergence, resulting in the MPLE $\hat\Theta$. Then we use (4) to calculate the posterior probability of any unlabeled observation's belonging to each cluster, and assign it to the cluster with the largest probability. Because of the possible existence of multiple local maxima, we started the algorithm multiple times with various starting values for the EM: we used the results from each of multiple K-means runs as starting values for the EM. We fitted a series of models with various values of g0, g1 and λ, then used a model selection criterion to choose their appropriate values, as discussed in the next section.
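If $\lambda w_{ik}\hat\sigma_k^{2,(m)} / \sum_{j}\tau_{ij}^{(m)} \ge |\tilde\mu_{ik}^{(m+1)}|$, then $\hat\mu_{ik}^{(m+1)} = 0$; otherwise, $\hat\mu_{ik}^{(m+1)}$ is obtained by shrinking the unpenalized update $\tilde\mu_{ik}^{(m+1)}$ toward 0 by an amount $\lambda w_{ik}\hat\sigma_k^{2,(m)} / \sum_{j}\tau_{ij}^{(m)}$. It can be seen that, if $\hat\mu_{ik} = 0$ for all i, then the k-th feature does not contribute to classification: it will be cancelled out from the numerator and the denominator of (4).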
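The thresholding rule can be written as a short function. This is a sketch in plain Python, assuming the threshold takes the form λ wik σk² / Σj τij that follows from maximizing the penalized Normal log-likelihood with respect to µik; the function and argument names are hypothetical.

```python
def soft_threshold_mean(mu_tilde, lam, w_ik, sigma2_k, tau_sum_i):
    """Penalized mean update: shrink the unpenalized mean toward 0.

    mu_tilde   unpenalized (weighted-average) mean for class i, feature k
    lam        penalization parameter (lambda)
    w_ik       penalty weight for this mean
    sigma2_k   current variance estimate for feature k
    tau_sum_i  sum of posterior probabilities for class i
    """
    shrink = lam * w_ik * sigma2_k / tau_sum_i
    if abs(mu_tilde) <= shrink:
        return 0.0                      # small means are set exactly to zero
    return mu_tilde - shrink if mu_tilde > 0 else mu_tilde + shrink
```

For example, with `lam=5.0`, unit weight and variance, and 20 samples' worth of posterior mass, the threshold is 0.25: an unpenalized mean of 0.2 is set to zero, while 0.3 is shrunk to about 0.05. Setting `lam=0.0` returns the unpenalized mean unchanged.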
Note that in (3), if we use $\tilde\mu_{ik}^{(m+1)}$, as opposed to $\hat\mu_{ik}^{(m+1)}$, we obtain the MLE at convergence, which is equivalent to using λ = 0. Zou (2005, http://www.stat.umn.edu/~hzou/Papers/AdaLasso.pdf; 2006) proposed using the weighted L1 penalty in the context of supervised learning. Here we extend the idea to the current context: we propose using $w_{ik} = 1/|\tilde\mu_{ik}|^{w}$ with w ≥ 0, where $\tilde\mu_{ik}$ is the unpenalized estimate (MLE) of µik; the standard L1 penalty corresponds to w = 0. The weighted penalty automatically realizes a data-adaptive penalization: it penalizes smaller µik more and larger µik less, thus reducing the bias for the latter and leading to better feature selection and classification performance. As in Zou (2006), we tried several values of w and found only minor differences in results for w > 0; for simplicity we will present results only for w = 0 and w = 1.
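Putting the pieces together, the whole procedure can be sketched as a compact EM loop in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: unlabeled samples are marked with `None`, the starting means are supplied by the caller (the text uses K-means output), the loop runs a fixed number of iterations instead of testing convergence, and the names `soft`, `adaptive_weights` and `penalized_em` are hypothetical. The weight formula 1/|mean|^w is assumed from the adaptive idea described above.

```python
import math

def soft(m, t):
    """Soft-thresholding: shrink m toward 0 by t and truncate at 0."""
    return math.copysign(max(abs(m) - t, 0.0), m)

def adaptive_weights(mle_mus, w):
    """Weights 1 / |mean|^w from unpenalized means (w = 0 gives all ones)."""
    return [[1.0 / max(abs(m), 1e-8) ** w for m in row] for row in mle_mus]

def penalized_em(X, labels, mus, lam, weights=None, n_iter=50):
    """labels[j] is None for an unlabeled sample, else a class index.
    mus holds starting means (g rows of K values)."""
    n, K, g = len(X), len(X[0]), len(mus)
    weights = weights or [[1.0] * K for _ in range(g)]
    mus = [list(row) for row in mus]
    pis, var = [1.0 / g] * g, [1.0] * K
    tau = []
    for _ in range(n_iter):
        # E-step: labeled samples keep their indicators; for unlabeled ones
        # the normalizing constants cancel under a common covariance.
        tau = []
        for x, lab in zip(X, labels):
            if lab is not None:
                tau.append([float(i == lab) for i in range(g)])
                continue
            logs = [math.log(pis[i])
                    - 0.5 * sum((x[k] - mus[i][k]) ** 2 / var[k] for k in range(K))
                    for i in range(g)]
            top = max(logs)
            e = [math.exp(v - top) for v in logs]
            tau.append([v / sum(e) for v in e])
        tau_sum = [sum(t[i] for t in tau) for i in range(g)]
        # M-step: proportions, common variances, then thresholded means.
        pis = [max(s / n, 1e-12) for s in tau_sum]
        new_var = [max(sum(tau[j][i] * (X[j][k] - mus[i][k]) ** 2
                           for j in range(n) for i in range(g)) / n, 1e-8)
                   for k in range(K)]
        for i in range(g):
            if tau_sum[i] < 1e-8:
                continue                 # leave an empty component untouched
            for k in range(K):
                m_tilde = sum(tau[j][i] * X[j][k] for j in range(n)) / tau_sum[i]
                mus[i][k] = soft(m_tilde, lam * weights[i][k] * var[k] / tau_sum[i])
        var = new_var
    return pis, mus, var, tau
```

With `lam=0.0` the loop reduces to the standard semi-supervised EM; a very large `lam` drives every mean to zero, which corresponds to a single cluster. For a weighted fit one could first run with `lam=0.0` and pass `adaptive_weights(mus, 1)` to a second run.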
2.3 Model selection
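In practice, we need to determine the number of components g0 for a mixture model. For the standard mixture model with maximum likelihood estimation, it is most popular to use the Bayesian information criterion (BIC) (Schwarz, 1978), defined as

$$\mathrm{BIC} = -2\log L(\hat\Theta) + \log(n)\, d,$$

where d is the total number of unknown parameters in the model; the model with the minimum BIC is selected. For the penalized mixture model, following Pan and Shen (2006), we use a modified BIC:

$$\mathrm{BIC} = -2\log L(\hat\Theta) + \log(n)\, d_e,$$

where $d_e = d - q$ is the effective number of parameters and q is the number of mean estimates $\hat\mu_{ik}$ that are exactly zero.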
The idea was borrowed from Efron et al. (2004) and Zou et al. (2004, http://stat.stanford.edu/~hastie/pub.htm) who estimated the effective number of parameters for L1 penalized regression, such as LASSO, in a similar way; Pan and Shen found that this modified BIC worked reasonably well for penalized clustering; as to be shown in our simulation, it also worked well here, though more rigorous studies are needed. Although resampling based model selection methods, including cross-validation or data perturbation (Shen and Ye, 2002; Efron, 2004), can be also used, BIC has an advantage of being computationally less demanding.
In summary, we propose using this modified BIC to select the number of components g0 and the penalization parameter λ simultaneously. A grid search to determine an optimal λ is straightforward; as a general strategy, trial and error can be used: first we try a wide range of λ values, and then conduct a finer grid search near the λ with the minimum BIC. Finally, we select the combination of (g, λ) yielding the minimum BIC.
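The selection step can be sketched as follows in plain Python. The parameter count is an assumption made for illustration: g - 1 free mixing proportions, K common variances and gK means, minus the number of means estimated as exactly zero. The function names and the dictionary layout of `fits` are hypothetical.

```python
import math

def modified_bic(loglik, n, mus):
    """BIC with the parameter count reduced by the number of zero means.

    loglik  log-likelihood at the fitted parameters
    n       number of samples
    mus     fitted means, g rows of K values
    """
    g, K = len(mus), len(mus[0])
    d = (g - 1) + K + g * K                 # assumed total parameter count
    q = sum(1 for row in mus for m in row if m == 0.0)
    return -2.0 * loglik + math.log(n) * (d - q)

def select_best(fits):
    """Pick the fit with the smallest BIC from a list of dicts
    with keys 'g', 'lam' and 'bic'."""
    return min(fits, key=lambda f: f["bic"])
```

In use, one would fit the model for every candidate pair (g, λ), record the resulting BIC in such a list, and refine the λ grid around the winner, as described above.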
| 3 RESULTS |
|---|
3.1 Simulated data
3.1.1 Simulation set-ups
We considered four non-null (i.e. g0 > 0) simulation set-ups mimicking the real data: for each set-up, there were 20 observations in each of the g0 = 1 unknown class and the g1 = 2 known classes; there were K = 200 independent attributes, among which 2K1 were informative while the remaining ones were noise variables; the distributions of each of the first K1 informative attributes were N(0,1), N(0,1) and N(1.5,1) for the three classes, respectively, while those of each of the next K1 informative attributes were N(1.5,1), N(0,1) and N(0,1), respectively; each of the K - 2K1 noise variables was distributed as N(0,1). Therefore, the first K1 informative attributes distinguished the third class from the other two, while the next K1 informative attributes discriminated the first class from the other two. The four set-ups corresponded to K1 = 10, 15, 20 and 30, respectively.
We also considered a null case with g0 = 0; the set-up was similar to the above ones: the first K1 attributes were discriminatory for the two known classes with the same distributions as before; however, the remaining K - K1 attributes were all noise variables with N(0,1) distributions for each class. We only considered the case with K1 = 30.
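A non-null set-up of this kind can be generated with a few lines of plain Python. This is a sketch: the function name `simulate`, the seed handling, and the encoding of the unknown class as `None` are illustrative choices, and the first class in the text is taken to be the unlabeled one.

```python
import random

def simulate(K1, K=200, n_per_class=20, shift=1.5, seed=0):
    """Generate one non-null dataset: class 0 is unlabeled, classes 1 and 2 known."""
    rng = random.Random(seed)
    X, labels = [], []
    for cls in range(3):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(K)]   # noise everywhere
            if cls == 2:
                # First K1 attributes: mean 1.5 only in the third class.
                for k in range(K1):
                    x[k] += shift
            if cls == 0:
                # Next K1 attributes: mean 1.5 only in the first class.
                for k in range(K1, 2 * K1):
                    x[k] += shift
            X.append(x)
            labels.append(None if cls == 0 else cls)
    return X, labels
```

Calling `simulate(10)` gives 60 samples with 200 attributes, of which 20 are informative. Only the first group of K1 attributes separates the two known classes, which is the feature of the design that later defeats selection based on labeled data alone.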
For each simulation set-up, 100 independent datasets were generated. Both the standard method without variable selection (i.e. λ = 0) and our proposed penalized method with BIC-selected λ were applied; only w = 0 was considered for the penalized method. For each dataset, the algorithm for each method was run 10 times: at each run, a random partition of the unlabeled data was input to the K-means, whose results were in turn input to the EM or modified EM; the final result was the one with the maximum likelihood or penalized likelihood (for the given λ). For the penalized method, each λ in {0, 2, 4, 6, 8, 10, 12, 15, 20, 25} was tried; the final result was the one minimizing BIC. This set of λ values was determined by trial and error: we searched finer grids on a few datasets and found that the optimal λ based on BIC was well within this range.
3.1.2 Comparison between the standard and penalized methods
Table 1 summarizes the results from 100 independent simulations for each of the five set-ups. For the non-null cases, as K1 increased, the problem became easier. Because of the presence of noise attributes, the standard mixture model incorrectly selected g0 = 0; i.e. it tended to assign observations coming from a new class into one of the two known classes. In contrast, the penalized method, equipped with automatic variable selection, performed much better in identifying correct g0 = 1, and thus correctly determining that unlabeled observations came from a novel class. It was also verified that the penalized method could correctly detect most of the noise attributes. On the other hand, in the few cases where the penalized method incorrectly selected g0 = 0, it was largely owing to over-penalization, leading to discarding even informative attributes.
For the null case, both the standard and penalized methods selected g0 = 0 correctly. Again an advantage of the penalized method is its ability for variable selection: it always correctly detected and discarded 170 noise attributes, though it also incorrectly threw away some informative attributes.
3.1.3 Comparison with variable selection
These simulated data clearly demonstrated the importance of variable selection. We emphasize that an initial variable selection based on labeled data did not help here: for any non-null case, if we selected the variables that discriminated between the two known classes, ideally only the first group of K1 informative attributes would be selected; these, however, were not informative for discriminating the novel class (of the unlabeled data) from the other two, leading to incorrectly selecting g0 = 0. On the other hand, treating the unlabeled data as a separate class and selecting variables to discriminate among the three classes would identify both groups of informative genes in these cases; however, there was no guarantee that it would work for other data because the unlabeled data might simply come from various known classes. More importantly, because of the separation between the two steps of gene selection and fitting the mixture model, it might only identify a model of no interest, as opposed to the one of interest, as shown next.
We did a simulation study using the data from simulation set-up 1. An F-statistic based on the ratio of the between-class sum of squares to the within-class sum of squares (Broet et al., 2004; Huang et al., 2005) could be used to detect genes with differential expression among the classes, and thus to rank the genes. We considered two ways to rank the genes based on the F-statistic: first, by ignoring the unlabeled data, we ranked the genes based on the two known classes for the labeled data (F2); second, by treating the unlabeled data as a new class, we ranked the genes based on the three classes (i.e. two for the labeled data and one for the unlabeled data) (F3). The top K0 genes with the largest F2 or F3 statistics were selected and used in a standard mixture model; in each case, two models corresponding to g0 = 0 and g0 = 1 were fitted, and we selected the one with the minimum BIC. Table 2 gives the frequencies of the models selected for various values of K0. For a small K0, it was more likely to choose g0 = 0, corresponding to incorrectly treating the unlabeled data as coming from one of the two known classes; this happened because the selected attributes tended to be from the first group of informative attributes, which could not distinguish the unlabeled data from one of the two known classes. Hence, a difficulty with this two-step approach was the correct choice of K0. Most strikingly, if K0 was treated as a parameter and we used BIC to select K0, it turned out that K0 = 5 would always be selected; see the last row of Table 2. This happened because the top K0 = 5 attributes were indeed discriminatory for the two known classes and led to a correct and most parsimonious model, which was of no interest but was still selected by BIC.
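The two rankings can be sketched in plain Python. The F-statistic below is the usual one-way form (between-class over within-class mean squares); the degrees-of-freedom factors do not change the ranking within either scheme. The names `f_statistic` and `top_genes` and the `None` encoding for unlabeled samples are hypothetical, and the sketch assumes every gene has non-zero within-class variation.

```python
def f_statistic(values, groups):
    """One-way F-statistic of one gene across the given sample groups."""
    classes = set(groups)
    n, g = len(values), len(classes)
    grand = sum(values) / n
    between = within = 0.0
    for c in classes:
        vals = [v for v, gr in zip(values, groups) if gr == c]
        m = sum(vals) / len(vals)
        between += len(vals) * (m - grand) ** 2
        within += sum((v - m) ** 2 for v in vals)
    return (between / (g - 1)) / (within / (n - g))

def top_genes(X, labels, K0, use_unlabeled):
    """Indices of the K0 genes with the largest F-statistics.

    use_unlabeled=False ranks on labeled samples only (F2);
    use_unlabeled=True treats the unlabeled samples as one extra class (F3).
    """
    rows = [(x, "unlabeled" if lab is None else lab)
            for x, lab in zip(X, labels) if use_unlabeled or lab is not None]
    groups = [gr for _, gr in rows]
    scores = [f_statistic([x[k] for x, _ in rows], groups)
              for k in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda k: -scores[k])[:K0]
```

On a toy dataset where one gene separates the two known classes and another separates only the unlabeled samples, the F2 ranking puts the first gene on top and the F3 ranking the second, mirroring the behavior described for small K0.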
In summary, the two-step approach of first selecting variables and then applying standard semi-supervised learning is problematic: because of the separation of the two steps, the selected variables may identify a correct model that is nevertheless not the one of interest. This issue arises in semi-supervised learning and, as noted by Pan and Shen (2006), in unsupervised learning, but not in supervised learning. In contrast, penalized methods, coupled with automatic variable selection, are much more viable in this respect.
3.2 Real data
Chi et al. (2003) collected 53 cDNA two-channel microarray samples, including 28 large vessel endothelial cell (LVEC) and 25 microvascular endothelial cell (MVEC) samples. Jiang (2005) presented 27 human blood outgrowth endothelial cell (BOEC) samples using Affymetrix U133A microarrays; probe-set expression levels were summarized by the RMA method (Irizarry et al., 2003). There were 9289 unique common genes in the two datasets.
Because of different microarray platforms used in the two studies, it is necessary to normalize the data to possibly eliminate any systematic and inherent difference owing to the two platforms prior to a combined analysis. With the presence of six human umbilical vein endothelial cell (HUVEC) samples from each of the two datasets, Jiang (2005) compared 64 possible combinations of a three-step normalization procedure and identified the best one that maximized the mixing of the 12 HUVEC samples in a hierarchical clustering analysis. Here we used the combined 80 samples normalized by the same method.
In semi-supervised learning, we treated the class labels of the 28 LVEC samples and those of the 25 MVEC samples as known with g1 = 2. We either did not allow the existence of a third class (g0 = 0), or allowed it (g0 = 1); some other combinations of (g0, g1) values were also tried with similar results (data not shown). We considered three scenarios with three sets of genes. For each scenario, we presented six models: the standard model without penalization (i.e. λ = 0) and two penalized models with w = 0 and w = 1, respectively, each with a selected λ minimizing BIC, for (1) (g0 = 0, g1 = 2) and (2) (g0 = 1, g1 = 2), respectively. The EM was randomly started 20 times with the starting values from the K-means output. At the convergence of the EM, we used formula (4) to calculate the posterior probabilities and thus classified the LVEC and MVEC samples in the same way as the BOEC samples. Although the class labels of the LVEC and MVEC samples were known, it was interesting to see how these samples would be classified based on a fitted model, as a partial validation of the fitted model; this was done throughout the tables presented next.
3.2.1 Using 37 genes discriminating LVECs and MVECs
Jiang (2005) used both SAM (Tusher et al., 2002) and PAM (Tibshirani et al., 2003) to identify a list of 37 genes discriminating the LVEC and MVEC samples. Based on a hierarchical clustering analysis of the expression levels of these 37 genes, Jiang found that samples of each type seemed to stay together by themselves, and that the cluster of BOEC samples was first merged with that of the MVEC samples, hence concluding that BOEC samples were closer to MVEC samples. Here we approached the same problem by semi-supervised learning using the same set of the 37 genes.
With the same 37 genes, we used a semi-supervised mixture model with g0 = 0 and g1 = 2 to classify BOEC samples to one of the two known classes. The results confirmed that most (for λ = 0 or w = 0) or even all (for w = 1) BOEC samples were classified into the MVEC class (Table 3). However, if we allowed the existence of a possible new class, half (for the standard method with λ = 0) or more than half (for the penalized methods) of the BOEC samples seemed to form their own class (Table 3). Overall, based on BIC values, the penalized mixture model with (λ = 3, w = 1) for (g0 = 1, g1 = 2) seemed to be the best.
An advantage of the penalized methods was their ability to automatically select genes. In particular, the penalized method with w = 1 was the winner with fewest genes: only 2/3 of the 37 genes were used in the final model (Table 4).
3.2.2 Using top 1000 genes discriminating LVECs and MVECs
It would be interesting to investigate whether the conclusion would remain the same as a larger number of the genes were used. For this purpose, we considered using the top 1000 genes with the largest absolute SAM T-statistics discriminating between the LVEC and MVEC samples (Tusher et al., 2002). If we intended to classify BOEC samples to one of the two known classes, the standard method classified 16 and 11 BOEC samples to one of the two classes respectively (Table 5); however, the two penalized methods estimated the two cluster centers at 0, implying that there was only a single cluster. In fact, the BIC values were all large, e.g. compared with that of using (g0 = 1, g1 = 2), suggesting the existence of more classes. Indeed, if a new class was allowed, it was formed mostly by BOEC samples, largely separate from the LVEC and MVEC classes (Table 5).
Penalized methods gave an impressive performance for gene selection: 85% of the 1000 genes were not used (Table 6).
3.2.3 Using top 1000 genes with largest sample variances
The above analyses were based on the genes that discriminated between LVEC and MVEC samples; it might be argued that some other genes, e.g. those discriminating between BOEC and LVEC/MVEC samples but not between LVEC and MVEC samples, were omitted and should be included. A more objective way was to use some genes with large variations across all the samples; we used top 1000 genes with the largest sample variances across the 80 samples, and similar conclusions were drawn when 2000 genes were used (data not shown). First, if BOEC samples were forced to go into one of the two known classes with (g0 = 0, g1 = 2), the result was consistent with the previous conclusion: BOEC samples formed their own class while the LVEC and MVEC samples were largely mixed together; in fact, one penalized method (w = 0) regarded that there was only a single cluster (Table 7). Better models (with smaller BIC values) were obtained by allowing the existence of a new class with (g0 = 1, g1 = 2); now the three types of the samples formed their own classes, respectively (Table 7).
Again the penalized methods did not use all the 1000 features after automatic variable selection; in particular, a large number of features were regarded as non-informative for the two-class problem (Table 8).
We also conducted a model-based clustering analysis; i.e., we did not use any class labels and tried to group the 80 samples based on their expression profiles only. With three clusters, it was verified that the BOEC samples formed their own cluster while the other two types of samples mixed with each other (data not shown); similar results were obtained with larger numbers of clusters (data not shown). This example clearly demonstrated an apparent advantage of semi-supervised learning over unsupervised learning.
| 4 DISCUSSION |
|---|
As expected, clustering or classification results depend on which features are being used. For our motivating example, with various larger sets of genes, it was found that the BOEC samples seemed to be different from both LVEC and MVEC samples, and formed a new class. Because of different platforms of microarray chips used, this result might only reflect artificial differences between the chips; although a normalization method was developed and applied to eliminate or minimize such platform differences, some differences might still remain. This is a limitation of the current data, and further studies are needed.
The major contribution of this work is the use of penalized mixture model for semi-supervised learning. As in clustering (Pan and Shen, 2006), variable selection in semi-supervised learning is both critical and challenging. The usual practice of first selecting variables and then clustering does not work: because of the separation between variable selection and model fitting, selected variables may not be relevant at all; in contrast, our proposed penalized mixture model accomplishes the two goals simultaneously, leading to much better performance, as shown in our numerical examples.
Our proposed semi-supervised learning method is a natural extension of penalized model-based clustering (Pan and Shen, 2006): both are based on finite normal mixture models, and use penalized likelihood for estimation; the difference is that here we have partially labeled data while only unlabeled data are given in the latter. There are also some similarities between the nearest shrunken centroids (NSC) (Tibshirani et al., 2002, 2003) and these two penalized mixture model methods. First, all three aim to handle high-dimensional (and low-sample-sized) data as encountered in genomics, which partially determines the below similarities. Second, all assume a Normal distribution for each cluster or class. Third, all adopt a common diagonal covariance matrix for all the clusters/classes, which simplifies estimation for such data and facilitates variable selection. Fourth, all use soft-thresholding to realize variable selection. Nevertheless, the three methods are quite different. First of all, obviously, they handle supervised, semi-supervised, and unsupervised learning tasks respectively. Second, the penalization in NSC is largely based on heuristics, while the other two are cast in the general and unified framework of penalized likelihood.
Mixture models have been used in semi-supervised learning with novelty discovery, e.g. for text classifications (Nigam et al., 2006). However, to our knowledge, there is no use of penalization. With a large number of features, as for text classification data and genomic data, we have demonstrated the importance of simultaneous feature selection and model fitting. Here we have only considered a single Normal distribution for each class; a mixture of Normals can be also used (Nigam et al., 2006). Extensions to fully Bayesian implementations of mixture models (Broet et al., 2002) and to incorporating the idea of tight clustering (Tseng and Wong, 2005) into the current context may be also useful. These are interesting topics to be studied in the future.
| Acknowledgments |
|---|
WP was supported by NIH grant HL65462 and a UM AHC Development grant, and AJ and RH by NIH grant P01-HL076540. The authors thank the reviewers for helpful and constructive comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on April 21, 2006; revised on July 10, 2006; accepted on July 11, 2006
| REFERENCES |
|---|
Alexandridis, R., et al. (2004) Class discovery and classification of tumor samples using mixture modeling of gene expression data. Bioinformatics, 20, 2546-2552.
Broet, P., et al. (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol., 9, 671-683.
Broet, P., et al. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics, 20, 2562-2571.
Chi, J.-T., et al. (2003) Endothelial cell diversity revealed by global expression profiling. Proc. Natl Acad. Sci. USA, 100, 10623-10628.
Dempster, A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39, 1-38.
Efron, B. (2004) The estimation of prediction error: covariance penalties and cross-validation. JASA, 99, 619-632.
Efron, B., et al. (2004) Least angle regression. Ann. Stat., 32, 407-499.
Fraley, C. and Raftery, A.E. (1998) How many clusters? Which clustering methods? Answers via model-based cluster analysis. Comp. J., 41, 578-588.
Hastie, T., et al. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
Hebbel, R., et al. (2005) Genetic influence on the systems biology of sickle stroke risk detected by endothelial gene expression. Blood, 106, Suppl, 26a.
Huang, X., et al. (2005) A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics, 6, 205.
Irizarry, R.A., et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
Jiang, A. (2005) Are BOEC cells more like large vessel or microvascular endothelial cells? MS Thesis, Division of Biostatistics, University of Minnesota, MN.
Lin, Y., et al. (2002) Use of blood outgrowth endothelial cells for gene therapy of hemophilia A. Blood, 99, 457-462.
Lin, Y., et al. (2000) Origins of circulating endothelial cells and endothelial outgrowth from blood. J. Clin. Investigation, 105, 71-77.
McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
McLachlan, G.J. and Basford, K.E. (1988) Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
McLachlan, G.J., et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413-422.
McLachlan, G.J. and Peel, D. (2002) Finite Mixture Model. John Wiley & Sons, Inc., New York.
Nigam, K., et al. (2006) Semi-supervised text classification using EM. In Chapelle, O., Scholkopf, B. and Zien, A. (eds), Semi-Supervised Learning. MIT Press, Cambridge, MA, USA.
Pan, W. and Shen, X. (2006) Penalized model-based clustering with application to variable selection. Research Report 20062004, Division of Biostatistics, University of Minnesota.
Schwarz, G. (1978) Estimating the dimensions of a model. Ann. Stat., 6, 461-464.
Shen, X. and Ye, J. (2002) Adaptive model selection. J. Am. Stat. Assoc., 97, 210-221.
Swerlick, R.A., et al. (1992) Human dermal microvascular endothelial but not human umbilical vein endothelial cells express CD36 in vivo and in vitro. J. Immunol., 148, 78-83.
Tibshirani, R. (1996) Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B, 58, 267-288.
Tibshirani, R., et al. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, 99, 6567-6572.
Tibshirani, R., et al. (2003) Class prediction by nearest shrunken centroids, with application to DNA microarrays. Stat. Sci., 18, 104-117.
Tseng, G.C. and Wong, W.H. (2005) Tight Clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics, 61, 10-16.
Tusher, V.G., et al. (2002) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116-5121.
Zhu, X. (2006) Semi-supervised learning literature survey. Technical report 1530, Department of Computer Sciences, University of Wisconsin-Madison.
Zou, H., Hastie, T. and Tibshirani, R. (2004) On the degrees of freedom of the Lasso. Technical report, Statistics Department, Stanford University, CA.
Zou, H. (2005) The adaptive Lasso and its oracle properties. Technical report, School of Statistics, University of Minnesota, MN.
Zou, H. (2006) Feature selection and classification via a hybrid support vector machine. Technical report, School of Statistics, University of Minnesota, MN.