Bioinformatics Advance Access originally published online on February 18, 2007
Bioinformatics 2007 23(8):998-1005; doi:10.1093/bioinformatics/btm053
A multi-stage approach to clustering and imputation of gene expression profiles
Department of Statistics, Macquarie University, NSW 2109, Australia
*To whom correspondence should be addressed.
ABSTRACT
Motivation: Microarray experiments have revolutionized the study of gene expression with their ability to generate large amounts of data. This article describes an alternative to existing approaches to clustering of gene expression profiles; the key idea is to cluster in stages using a hierarchy of distance measures. This method is motivated by the way in which the human mind sorts and so groups many items. The distance measures arise from the orthogonal breakup of Euclidean distance, giving us a set of independent measures of different attributes of the gene expression profile. Interpretation of these distances is closely related to the statistical design of the microarray experiment. This clustering method not only accommodates missing data but also leads to an associated imputation method.
Results: The performance of the clustering and imputation methods was tested on a simulated dataset, a yeast cell cycle dataset and a central nervous system development dataset. Based on the Rand and adjusted Rand indices, the clustering method is more consistent with the biological classification of the data than commonly used clustering methods. The imputation method, at varying levels of missingness, outperforms most imputation methods, based on root mean squared error (RMSE).
Availability: Code in R is available on request from the authors.
Contact: dwong{at}efs.mq.edu.au
1 INTRODUCTION
Microarray experiments allow us to measure the expression of tens of thousands of genes simultaneously, thus having the potential to dramatically increase the efficiency of genome-wide studies. Following the conduct of a microarray experiment, a primary concern of the researcher is the appropriate grouping of similarly expressed genes. The biological motivation for performing clustering lies in the fact that many co-expressed genes are also co-regulated; clustering aids in functional annotation of novel genes, identification of transcription factor binding sites and discovery of complete biological pathways (Boutros and Okey, 2005). A secondary, but related, concern is the need for imputation of missing data. Gene expression profiles, especially those obtained from microarray chips, often include a substantial number of missing values.
Techniques such as hierarchical clustering (Eisen et al., 1998), k-means (Soukas et al., 2000), the cluster affinity search technique (CAST) (Ben-Dor et al., 1999), gene shaving (Hastie et al., 2000), self-organizing maps (SOM) (Tamayo et al., 1999), self-organizing tree algorithms (SOTA) (Herrero et al., 2001) and mixture models (McLachlan et al., 2002; Yeung et al., 2001), to name a few, have been used in the clustering of gene expression profiles. In practice, the clustering methods most commonly used by biologists for gene expression data (Knudsen, 2002) are hierarchical clustering, k-means and SOM. Hierarchical clustering links the genes, based on closest distance, to form a family tree. The k-means method starts by randomly assigning each gene to one of k clusters. The distance between each gene and each cluster centre (or centroid) is calculated and each gene is assigned to the closest centroid. The genes assigned to a centroid become a new cluster. The centroids are then recalculated and genes reassigned until the centroids converge. The SOM method is similar to k-means, the difference being that it is constrained to work on a 2D grid that provides information about the relationship between neighbouring clusters. SOTA is a hierarchical SOM that combines a hierarchical structure with the accuracy and robustness of a neural network. Most clustering techniques, however, are unable to deal with missing data. Samples containing missing data must be omitted or the values imputed.
de Brevern et al. (2004) have shown that the imputation method used affects the final clustering, even at a low rate of missingness. Therefore, choosing an appropriate imputation method is a crucial step in the analysis of gene expression data. Generally, we can categorize imputation methods into two classes: the first uses local information and the second uses global information. The two methods proposed initially by Troyanskaya et al. (2001), namely the k-nearest neighbour (KNN) and singular value decomposition (SVD) imputation methods, are the respective pioneers in these two categories.
The KNN imputation method uses information from the k-nearest neighbours to estimate the missing value. Subsequent articles belonging to this category further developed this idea by altering either the gene selection process or the design of the estimation rule. The KNN method uses Euclidean distance for gene selection and a weighted average (with weights determined by gene similarity) for the estimation rule. Improvements in gene selection include the use of Bayesian variable selection (Zhou et al., 2003), Gaussian mixture clustering (Ouyang et al., 2004) or correlation (Bo et al., 2004). Advancements in the estimation rule include the use of linear models (Scheel et al., 2005; Zhou et al., 2003), non-linear models (Zhou et al., 2003), the expectation-maximization (EM) algorithm (Bo et al., 2004; Ouyang et al., 2004) or least squares methods (Bo et al., 2004; Kim et al., 2005; Nguyen et al., 2004).
The SVD imputation method uses singular value decomposition to obtain mutually orthogonal expression patterns that can be linearly combined to approximate the expression of all genes in the dataset. Estimates of the missing values can be obtained by regressing against this set of genes. Here, a set of genes which can represent the entire dataset is selected and used to estimate the missing value. Further development in this category includes the introduction of Bayesian estimation into principal component analysis (Oba et al., 2003), partial least squares (Nguyen et al., 2004), a covariance-based method to rank genes (Sehgal et al., 2005) and support vector regression (Wang et al., 2006).
Other methods are either a variation on, or a combination of, the above categories. These include a sequential KNN method (Kim et al., 2004) which uses previously imputed values to impute subsequent missing values, use of a convex combination of existing methods (Jornsten et al., 2005), use of information about the quality of the spots (Tom et al., 2005) and use of information from gene ontology (Tuikkala et al., 2006). To date, the KNN approach is the most widely used imputation method due to its simplicity, efficiency and availability.
This article describes an alternative approach to the clustering of microarray data; the method accommodates missing data and also leads to an associated imputation method. This method is adapted from Godfrey et al. (2002), where it has been successfully used in a horticultural context for clustering in genotype-by-environment analyses with missing data. The method outperforms commonly used clustering methods while retaining their simplicity. The associated imputation method also produces promising results.
Section 2 describes the method by first detailing the derivation of the distance measures. This is followed by a modification of the distance measures to accommodate missing data. As an aside, the relationship between the distance measures and the experimental design used is presented. We then describe the clustering and imputation algorithms and introduce the jump factor as a stopping criterion. In Section 3, the results of clustering and imputation using both two-stage and three-stage methods are presented. A short discussion is provided in Section 4 and a brief conclusion given in Section 5.
2 METHODS
The clustering method introduced is based on the simple idea of grouping in stages using a hierarchy of distance measures. The method captures the way in which the human mind sorts and thus groups items using a hierarchy of attributes, so increasing the probability of success. Consider, for example, how we tackle a jigsaw puzzle. It is common to sort the pieces into groups at the outset; sorting may require a number of stages, depending on the complexity of the puzzle. For example, we might first group the pieces based on shape into edge and non-edge pieces. Within these groups, we then sort based on colour. Similarly here, the more complex the design of the experiment, the greater the number of stages required for clustering. The situation can be modelled probabilistically, demonstrating that the probability of accurately grouping the items is always higher when done in stages than when done all at once.
2.1 Distance measures
The distance measures giving rise to the stages are the result of breaking Euclidean distance into a number of orthogonal components. For observations $y_s$, $s = 1, 2, \ldots, S$, we commonly have a decomposition of the total sum of squares into, say, $n$ orthogonal components, as

$$\sum_{s=1}^{S} y_s^2 = SS_1 + SS_2 + \cdots + SS_n .$$
2.1.1 Two-stage decomposition
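Here we describe the simplest situation where there are only two components. Based on the very simple model

$$y_s = \mu + \varepsilon_s, \qquad s = 1, 2, \ldots, S,$$

the total sum of squares decomposes as

$$\sum_{s=1}^{S} y_s^2 = S\bar{y}^2 + \sum_{s=1}^{S} (y_s - \bar{y})^2, \qquad \bar{y} = \frac{1}{S}\sum_{s=1}^{S} y_s .$$

Paralleling this, the partitioned squared Euclidean distance between the $i$th and $j$th gene is given as

$$\sum_{s=1}^{S} (y_{is} - y_{js})^2 = S(\bar{y}_{i\cdot} - \bar{y}_{j\cdot})^2 + \sum_{s=1}^{S} \left[(y_{is} - \bar{y}_{i\cdot}) - (y_{js} - \bar{y}_{j\cdot})\right]^2 .$$

Thus, we have partitioned the squared Euclidean distance between the $i$th and $j$th expression profiles into two squared distance measures, the main effect distance

$$M_{ij}^2 = (\bar{y}_{i\cdot} - \bar{y}_{j\cdot})^2,$$

which measures the difference in overall level, and the interaction distance

$$I_{ij}^2 = \frac{1}{S-1}\sum_{s=1}^{S} \left[(y_{is} - \bar{y}_{i\cdot}) - (y_{js} - \bar{y}_{j\cdot})\right]^2,$$

which measures the difference in shape, so that the squared Euclidean distance equals $S M_{ij}^2 + (S-1) I_{ij}^2$.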
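The two-stage partition can be checked numerically. The following is a minimal sketch, assuming complete profiles and taking the level component as S times the squared difference of the profile means and the shape component as the squared distance between the mean-centred profiles; the function name `two_stage_components` and the example data are hypothetical, and any further scaling of the components is left out.

```python
def two_stage_components(y_i, y_j):
    """Split the squared Euclidean distance between two complete profiles
    into a level (main effect) part and a shape (interaction) part."""
    if len(y_i) != len(y_j):
        raise ValueError("profiles must have the same length")
    n = len(y_i)
    mean_i = sum(y_i) / n
    mean_j = sum(y_j) / n
    # Level component: squared difference of the profile means, times n.
    main_effect = n * (mean_i - mean_j) ** 2
    # Shape component: squared distance between the mean-centred profiles.
    interaction = sum(
        ((a - mean_i) - (b - mean_j)) ** 2 for a, b in zip(y_i, y_j)
    )
    return main_effect, interaction


# Hypothetical expression profiles over four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 2.0, 5.0, 3.0]

main_effect, interaction = two_stage_components(gene_a, gene_b)
total = sum((a - b) ** 2 for a, b in zip(gene_a, gene_b))
print(main_effect, interaction, total)
```

The two components add up to the ordinary squared Euclidean distance, which is what makes it possible to cluster on each attribute separately without losing information.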
2.1.2 Three-stage decomposition
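Using a more elaborate model, we can extend the two-stage method to a three-stage method. Suppose we have $T$ treatments within a gene and $S_t$ repetitions (samples) within treatment $t$. For a given gene, we can express this in the model

$$y_{ts} = \mu + \tau_t + \varepsilon_{ts}, \qquad t = 1, \ldots, T, \quad s = 1, \ldots, S_t .$$

With $S = \sum_{t} S_t$, treatment means $\bar{y}_{t\cdot}$ and overall mean $\bar{y}_{\cdot\cdot}$, the total sum of squares decomposes into three orthogonal components,

$$\sum_{t=1}^{T}\sum_{s=1}^{S_t} y_{ts}^2 = S\bar{y}_{\cdot\cdot}^2 + \sum_{t=1}^{T} S_t(\bar{y}_{t\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{t=1}^{T}\sum_{s=1}^{S_t} (y_{ts} - \bar{y}_{t\cdot})^2 .$$

Paralleling this, the squared Euclidean distance between the $i$th and $j$th genes partitions as

$$\sum_{t=1}^{T}\sum_{s=1}^{S_t} (y_{its} - y_{jts})^2 = S(\bar{y}_{i\cdot\cdot} - \bar{y}_{j\cdot\cdot})^2 + \sum_{t=1}^{T} S_t\left[(\bar{y}_{it\cdot} - \bar{y}_{i\cdot\cdot}) - (\bar{y}_{jt\cdot} - \bar{y}_{j\cdot\cdot})\right]^2 + \sum_{t=1}^{T}\sum_{s=1}^{S_t} \left[(y_{its} - \bar{y}_{it\cdot}) - (y_{jts} - \bar{y}_{jt\cdot})\right]^2 .$$

The three terms on the right correspond, in order, to the main effect, treatment and interaction components, from which the distance measures $M_{ij}$, $T_{ij}$ and $I_{ij}$ are formed.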
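The three-stage partition can be checked in the same way. This is a minimal sketch, assuming complete data and unscaled components (overall level, treatment-mean pattern, and within-treatment pattern); the function name `three_stage_components`, the treatment labels and the example data are hypothetical.

```python
def three_stage_components(y_i, y_j, treatment):
    """Split the squared Euclidean distance between two complete profiles
    into main effect, treatment and interaction parts.

    treatment[s] is the treatment label of sample s.
    """
    n = len(y_i)
    mean_i = sum(y_i) / n
    mean_j = sum(y_j) / n
    main_effect = n * (mean_i - mean_j) ** 2
    treatment_part = 0.0
    interaction = 0.0
    for label in set(treatment):
        idx = [s for s, t in enumerate(treatment) if t == label]
        t_mean_i = sum(y_i[s] for s in idx) / len(idx)
        t_mean_j = sum(y_j[s] for s in idx) / len(idx)
        # Difference in how each treatment mean departs from the gene mean.
        treatment_part += len(idx) * (
            (t_mean_i - mean_i) - (t_mean_j - mean_j)
        ) ** 2
        # Difference in how samples depart from their treatment mean.
        interaction += sum(
            ((y_i[s] - t_mean_i) - (y_j[s] - t_mean_j)) ** 2 for s in idx
        )
    return main_effect, treatment_part, interaction


# Hypothetical profiles: two treatments with three replicates each.
gene_a = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
gene_b = [2.0, 2.0, 2.0, 3.0, 5.0, 4.0]
labels = ["a", "a", "a", "b", "b", "b"]

parts = three_stage_components(gene_a, gene_b, labels)
total = sum((a - b) ** 2 for a, b in zip(gene_a, gene_b))
print(parts, total)
```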
2.2 Distance measures accommodating missing values
The distance measures can be modified to accommodate missing values. This involves calculating the squared Euclidean distance over samples common to both genes. We let sij denote the indices of samples where values for genes i and j are present.
2.2.1 Two-stage decomposition accommodating missing values
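Let $p_{ij}$ be the number of samples common to genes $i$ and $j$ and $\bar{y}_{i(j)}$ be the mean of the $y_{is}$ across these $p_{ij}$ common samples. The orthogonal partition of the squared Euclidean distance using only common samples will be

$$\sum_{s \in s_{ij}} (y_{is} - y_{js})^2 = p_{ij}(\bar{y}_{i(j)} - \bar{y}_{j(i)})^2 + \sum_{s \in s_{ij}} \left[(y_{is} - \bar{y}_{i(j)}) - (y_{js} - \bar{y}_{j(i)})\right]^2,$$

where the first term is the main effect component and the second the interaction component.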
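The restriction to common samples is straightforward to express in code. The sketch below assumes missing values are encoded as `None` and returns unscaled components together with the number of common samples; the function name `common_sample_components` and the data are hypothetical.

```python
def common_sample_components(y_i, y_j):
    """Main effect and interaction components of the squared Euclidean
    distance, computed only over samples observed in both profiles."""
    pairs = [
        (a, b) for a, b in zip(y_i, y_j) if a is not None and b is not None
    ]
    if not pairs:
        raise ValueError("the two profiles share no observed samples")
    p = len(pairs)
    # Means are taken over the common samples only.
    mean_i = sum(a for a, _ in pairs) / p
    mean_j = sum(b for _, b in pairs) / p
    main_effect = p * (mean_i - mean_j) ** 2
    interaction = sum(((a - mean_i) - (b - mean_j)) ** 2 for a, b in pairs)
    return p, main_effect, interaction


# Hypothetical profiles with one missing value each.
gene_a = [1.0, None, 3.0, 4.0, 2.0]
gene_b = [2.0, 5.0, None, 3.0, 4.0]

p, main_effect, interaction = common_sample_components(gene_a, gene_b)
print(p, main_effect, interaction)
```

Because different gene pairs can share different numbers of samples, the components of different pairs are only comparable after dividing by a count that depends on `p`; the choice of divisor is not fixed by this sketch.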
2.2.2 Three-stage decomposition accommodating missing values
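Let $s_{ijt}$ be the indices of samples in treatment $t$ where values for genes $i$ and $j$ are present, $p_{ijt}$ be the number of samples common to genes $i$ and $j$ in treatment $t$, $\bar{y}_{it(j)}$ be the mean of the $y_{its}$, using only the $p_{ijt}$ common samples and $\bar{y}_{i(j)}$ be the overall mean of the $y_{its}$, using only common samples between gene $i$ and gene $j$. Letting $p_{ij} = \sum_{t} p_{ijt}$, we can express the squared Euclidean distance over common samples as the sum of main effect, treatment and interaction components:

$$\sum_{t}\sum_{s \in s_{ijt}} (y_{its} - y_{jts})^2 = p_{ij}(\bar{y}_{i(j)} - \bar{y}_{j(i)})^2 + \sum_{t} p_{ijt}\left[(\bar{y}_{it(j)} - \bar{y}_{i(j)}) - (\bar{y}_{jt(i)} - \bar{y}_{j(i)})\right]^2 + \sum_{t}\sum_{s \in s_{ijt}} \left[(y_{its} - \bar{y}_{it(j)}) - (y_{jts} - \bar{y}_{jt(i)})\right]^2 .$$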
2.3 Relationship between the distance measures and the gene expression model
In this section, we show how the distance measures obtained from the two-stage decomposition are related to a model for gene expression data. This can be extended to the distance measures obtained from higher order decompositions and the associated models.
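We consider a two-factor factorial design with no replicates. Let $Y_{is}$ denote the gene expression for the $i$th gene and $s$th sample. An appropriate model for this design is

$$Y_{is} = \mu + G_i + S_s + GS_{is} + \varepsilon_{is},$$

where $G_i$ is the gene effect, $S_s$ is the sample effect, $GS_{is}$ is the gene-by-sample interaction and $\varepsilon_{is}$ is the error term assumed to be independently and normally drawn with mean zero and variance $\sigma^2$. Since there are no replicates, $GS_{is}$ and $\varepsilon_{is}$ are confounded. The squared Euclidean distance between the expression profiles for genes $i$ and $j$ is

$$\sum_{s=1}^{S} (Y_{is} - Y_{js})^2 = S(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot})^2 + \sum_{s=1}^{S} \left[(Y_{is} - \bar{Y}_{i\cdot}) - (Y_{js} - \bar{Y}_{j\cdot})\right]^2 .$$

It can be shown that $S(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot})^2 / (2\sigma^2)$ follows a non-central $\chi^2$ distribution with one degree of freedom and non-centrality parameter $S(G_i - G_j)^2 / (2\sigma^2)$. Therefore, the expected value of $(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot})^2$ is $(G_i - G_j)^2 + 2\sigma^2/S$. This is a translation of the squared difference in gene expression level. Moreover, the translation value, $2\sigma^2/S$, is usually small relative to $(G_i - G_j)^2$. Consequently, $(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot})^2$ serves as a satisfactory measure of difference in gene level.
Also, $\sum_{s} \left[(Y_{is} - \bar{Y}_{i\cdot}) - (Y_{js} - \bar{Y}_{j\cdot})\right]^2 / (2\sigma^2)$ follows a non-central $\chi^2$ distribution with $S - 1$ degrees of freedom and non-centrality parameter $\sum_{s} (GS_{is} - GS_{js})^2 / (2\sigma^2)$. Thus, the expected value of $\frac{1}{S-1}\sum_{s} \left[(Y_{is} - \bar{Y}_{i\cdot}) - (Y_{js} - \bar{Y}_{j\cdot})\right]^2$ is $\frac{1}{S-1}\sum_{s} (GS_{is} - GS_{js})^2 + 2\sigma^2$. Let $GS_{is} - GS_{js}$, the difference between the G x S interaction of the $i$th and $j$th gene, be considered as an observation. Note that $GS_{is} - GS_{js}$ across the $S$ samples has mean zero, whence $\frac{1}{S-1}\sum_{s} (GS_{is} - GS_{js})^2$ is the sample variance of the $GS_{is} - GS_{js}$. The value of this variance gives us a good indication of the extent of the difference in G x S interaction between the $i$th and $j$th genes. The expected value of the interaction distance is a translation of this variance by $2\sigma^2$. Consequently, the interaction distance serves as a satisfactory measure of difference in G x S interaction.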
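The expected value of the squared difference in gene means discussed above can be derived directly. The short derivation below is a sketch under three assumptions: the model is additive, $Y_{is} = \mu + G_i + S_s + GS_{is} + \varepsilon_{is}$; the interaction terms of each gene sum to zero over the samples; and the errors are independent $N(0, \sigma^2)$. The row-mean symbols $\bar{Y}_{i\cdot}$ and $\bar{\varepsilon}_{i\cdot}$ are notation introduced here.

```latex
\begin{align*}
\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}
  &= (G_i - G_j) + (\bar{\varepsilon}_{i\cdot} - \bar{\varepsilon}_{j\cdot}),
  \qquad
  \bar{\varepsilon}_{i\cdot} - \bar{\varepsilon}_{j\cdot}
  \sim N\!\left(0, \frac{2\sigma^2}{S}\right), \\
E\!\left[(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot})^2\right]
  &= (G_i - G_j)^2
   + \operatorname{Var}\!\left(\bar{\varepsilon}_{i\cdot} - \bar{\varepsilon}_{j\cdot}\right)
   = (G_i - G_j)^2 + \frac{2\sigma^2}{S}.
\end{align*}
```

The overall mean $\mu$ and the sample effects $S_s$ cancel in the difference of row means, which is why only the gene effects and the averaged errors remain.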
2.4 Clustering
The idea underlying the proposed clustering method is to group in stages using a hierarchy of distance measures. The hierarchy begins with the most dominant attribute and progresses through to finer attributes. This mimics the way mail is sorted: first by country, then by state, down to postcode and so on. We can summarize the clustering algorithm as follows:
- Stage 1: Cluster using D1
- Stage 2: Cluster within each first-stage cluster using D2
- Stage 3: Cluster within each second-stage cluster using D3.
2.4.1 Two-stage clustering
We begin by describing the two-stage clustering method, with distance measures of main effect distance and interaction distance. This is summarized as follows:
First stage
- Calculate all main effect distances Mij
- Cluster genes using these main effect distances (this produces level-similar clusters).
Second stage
- Calculate interaction distances Iij for all gene pairs i and j within first-stage clusters
- Cluster genes within each first-stage cluster using the interaction distances (this produces level-similar and shape-similar gene clusters).
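The two stages above can be sketched as follows. This is a minimal illustration and not the authors' R code: it assumes complete data, uses complete-linkage agglomeration as a simple stand-in for Ward's linkage, and takes the number of clusters at each stage as an argument rather than choosing it with the jump factor. All function names and the example profiles are hypothetical.

```python
def agglomerate(items, dist, k):
    """Complete-linkage agglomerative clustering of `items` into k groups."""
    clusters = [[item] for item in items]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: largest pairwise distance between groups.
                d = max(dist(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


def mean(values):
    return sum(values) / len(values)


def main_effect(p, q):
    """Squared difference in overall level."""
    return (mean(p) - mean(q)) ** 2


def interaction(p, q):
    """Squared distance between mean-centred profiles (shape)."""
    mp, mq = mean(p), mean(q)
    return sum(((a - mp) - (b - mq)) ** 2 for a, b in zip(p, q))


def two_stage_clusters(profiles, k_level, k_shape):
    """profiles: dict mapping gene name to a list of expression values."""
    # First stage: level-similar clusters.
    by_level = agglomerate(
        list(profiles),
        lambda g, h: main_effect(profiles[g], profiles[h]),
        k_level,
    )
    # Second stage: shape-similar clusters within each level cluster.
    result = []
    for group in by_level:
        k = min(k_shape, len(group))
        result.extend(
            agglomerate(
                group,
                lambda g, h: interaction(profiles[g], profiles[h]),
                k,
            )
        )
    return result


profiles = {
    "g1": [1.0, 2.0, 3.0], "g2": [3.0, 2.0, 1.0], "g3": [1.1, 2.1, 3.1],
    "g4": [11.0, 12.0, 13.0], "g5": [13.0, 12.0, 11.0], "g6": [11.2, 12.1, 13.0],
}
clusters = two_stage_clusters(profiles, k_level=2, k_shape=2)
print(sorted(sorted(c) for c in clusters))
```

In this toy example the first stage separates the low-level genes from the high-level genes, and the second stage separates rising from falling profiles within each level group.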
2.4.2 Three-stage clustering
When we have three distance measures, namely the main effect, treatment and interaction distances, a three-stage clustering algorithm can be summarized as follows:
First stage
- Calculate all main effect distances Mij
- Cluster genes using these main effect distances (this produces level-similar clusters).
Second stage
- Calculate treatment distances Tij for all gene pairs i and j within first-stage clusters
- Cluster genes within each first-stage cluster using the treatment distances (this produces level-similar and treatment shape-similar gene clusters).
Third stage
- Calculate interaction distances Iij for all gene pairs i and j within second-stage clusters
- Cluster genes within each second-stage cluster using the interaction distances (this produces level-similar, treatment shape-similar as well as interaction shape-similar gene clusters).
2.5 Imputation
For imputation, we cluster using interaction distance modified to accommodate missing values. We use information from the genes in the cluster of the gene with missing data to find an imputed value.
Two-stage imputation
- Step 1: Perform clustering using only interaction distance
- Step 2: For each missing value, identify the gene and the sample to which it corresponds (we call these the target gene and the target sample)
- Step 3: Identify the genes that belong to the same interaction cluster as the target gene (we call these the parent genes)
- Step 4: For each parent gene with an expression value in the target sample, calculate the corresponding overall mean
- Step 5: Find the difference between the expression value in the target sample and the calculated overall mean
- Step 6: Calculate the mean of all values obtained in Step 5
- Step 7: The imputed value is the overall mean of the target gene plus the value calculated in Step 6.
Three-stage imputation
- Step 1: Perform clustering using only interaction distance
- Step 2: For each missing value, identify the gene, the sample and the treatment to which it corresponds (we call these the target gene, the target sample and the target treatment)
- Step 3: Identify the genes that belong to the same interaction cluster as the target gene (we again call these the parent genes)
- Step 4: For each parent gene with an expression value in the target sample, calculate the corresponding target treatment mean
- Step 5: Find the difference between the expression value in the target sample and the calculated target treatment mean
- Step 6: Calculate the mean of all values obtained in Step 5
- Step 7: The imputed value is the target treatment mean of the target gene plus the value calculated in Step 6.
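The first list of steps (the version without treatments) can be sketched as follows. The sketch assumes the interaction clusters have already been computed and are passed in, that missing values are encoded as `None`, that every gene has at least one observed value, and that each "overall mean" is taken over the observed values of that gene. Falling back to the target gene's own mean when no parent gene is observed in the target sample is also an assumption of this sketch. The function name `impute_two_stage` is hypothetical.

```python
def impute_two_stage(profiles, clusters):
    """profiles: dict gene -> list of values, with None for missing values.
    clusters: list of lists of gene names (the interaction clusters)."""

    def observed_mean(values):
        present = [v for v in values if v is not None]
        return sum(present) / len(present)

    imputed = {gene: list(values) for gene, values in profiles.items()}
    for cluster in clusters:
        for target in cluster:
            for s, value in enumerate(profiles[target]):
                if value is not None:
                    continue
                # Parent genes observed in the target sample, and the
                # deviation of each from its own overall mean.
                deviations = [
                    profiles[parent][s] - observed_mean(profiles[parent])
                    for parent in cluster
                    if parent != target and profiles[parent][s] is not None
                ]
                # Average deviation; zero if no parent gene is observed.
                shift = sum(deviations) / len(deviations) if deviations else 0.0
                # Target gene's overall mean plus the average deviation.
                imputed[target][s] = observed_mean(profiles[target]) + shift
    return imputed


profiles = {
    "g1": [1.0, 2.0, None, 4.0],
    "g2": [2.0, 3.0, 4.0, 5.0],
    "g3": [0.0, 1.0, 2.0, 3.0],
}
result = impute_two_stage(profiles, [["g1", "g2", "g3"]])
print(result["g1"])
```

The treatment-based version differs only in replacing each overall mean with the mean over the target treatment.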
2.6 Stopping criterion
A critical challenge is to determine an appropriate number of clusters when no prior knowledge is available. To identify the appropriate number of clusters, we plot the height of the new cluster to be formed against the current number of clusters. Height here corresponds to the criterion used to determine which two clusters are to be merged to form a new cluster. For example, we used Ward's linkage method, where height is the total error sum of squares (ESS) after merging two clusters. Since a hierarchical agglomerative technique starts with each data point as its own cluster and at each iterative step joins the two closest clusters, the height of the new cluster will be the largest height calculated so far. We propose that clustering should stop when the height increases markedly. As a quantitative measure of this, we use a jump factor defined as
|
|
3 RESULTS
In this section, we first illustrate the performance of the two-stage and the three-stage clustering methods. We then demonstrate the performance of the two-stage and the three-stage imputation methods.
3.1 Clustering
We compared the two-stage and three-stage clustering methods to commonly used methods, namely, the hierarchical, k-means, SOM, SOTA and model-based clustering (Yeung et al., 2001) methods. All code was obtained from R packages (cluster, stats, som, mclust) downloadable from the Comprehensive R Archive Network (CRAN), except for SOTA, which we ran on GEPAS, a web-based server for SOTA. The Rand and adjusted Rand indices were used to measure performance. The Rand index (Rand, 1971) is the number of agreements (pairs that are either in the same cluster or in different clusters in both clusterings) divided by the total number of pairs. The adjusted Rand index, proposed by Hubert and Arabie (1985), adjusts the score so that its expected value for random clustering is zero. The maximum value for the Rand and adjusted Rand indices is one; a high index indicates a high level of agreement between the clusterings. The jump factor criterion was used to detect the number of clusters for the hierarchical, two-stage and three-stage methods. The model-based method and SOTA have built-in criteria, while the number of clusters for k-means and SOM is user specified.
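Both indices are short to compute from two label vectors. The following is a minimal sketch using only the Python standard library, with the adjusted index written in the usual contingency-table form of Hubert and Arabie (1985); the function names and the example labels are hypothetical, and returning 1.0 in the degenerate case is a convention chosen for this sketch.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected so that random clusterings score about zero."""
    n = len(labels_a)
    # Pairs placed together in both clusterings (contingency-table cells).
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both clusterings put everything together.
        return 1.0
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 2, 2]
ri = rand_index(truth, found)
ari = adjusted_rand_index(truth, found)
print(ri, ari)
```

For these labels 10 of the 15 pairs agree, so the Rand index is 2/3, while the adjusted index is noticeably lower because part of that agreement is expected by chance.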
3.1.1 Two-stage clustering
To test the two-stage clustering method, we used a simulated dataset placed by Michaud et al. (2003) at http://www.che.udel.edu/eXPatGen/paper/example2.out and the yeast cell cycle data with MIPS criterion (extracted from Cho et al. (2001)) made available by Yeung et al. (2001) at http://faculty.washington.edu/kayee/cluster/. The simulated dataset contains 100 genes and 36 samples and is generated based on known biological features of expression complexity, diversity and interconnectivity. There are 10 clusters in this dataset, with each cluster containing 10 genes. Close examination of this dataset, however, shows that the first two clusters contain genes that are neither repressed nor induced at any point of the experiment. We treat these clusters as identical and so the dataset contains only nine true clusters. The yeast cell cycle dataset contains 237 genes and 17 samples. These genes correspond to four categories in the MIPS database (DNA synthesis and replication, organization of centrosome, nitrogen and sulphur metabolism, and ribosomal proteins); we assume these to be the true clusters. Table 2 shows the Rand and adjusted Rand indices for the two-stage method and the other commonly used methods, against the true clusters for the simulated data and the yeast cell cycle data.
For the simulated data, the two-stage and model-based methods did equally well, with nine clusters detected and only one gene misclassified. Hierarchical clustering detected only eight clusters, SOTA detected seven and the k-means method had the number of clusters pre-specified as nine. All these methods had slightly lower Rand and adjusted Rand indices than the two-stage method. We were unable to force SOM to produce nine clusters; six clusters was the optimal choice. This method produced the lowest Rand and adjusted Rand indices.
For the yeast cell cycle data, the two-stage method detected four clusters (two in the first stage and two in each first-stage cluster in the second stage) and had the highest Rand and adjusted Rand indices. Hierarchical clustering detected three clusters. With four pre-specified as the number of clusters, hierarchical clustering performed slightly better, but still not as well as the two-stage method. The k-means and SOM methods, despite having the advantage of four being pre-specified as the number of clusters, had a lower adjusted Rand index than the two-stage clustering method. The model-based method detected only two clusters; this could be the reason behind its having the lowest Rand and adjusted Rand indices. We did not include SOTA in the results because it produced too many clusters (up to 50).
3.1.2 Three-stage clustering
The central nervous system (CNS) development gene expression data (Wen et al., 1998) made available by Yeung et al. (2001) at http://faculty.washington.edu/kayee/cluster/ was used to test the three-stage method. There are 112 genes known to belong to major gene families deemed important for spinal cord development. There are nine samples in this experiment, taken at embryonic days 11, 13, 15, 18 and 21, postnatal days 0, 7 and 14 and adult (postnatal day 90).
We divided the data into three groups: early embryonic days (E11, E13, E15), late embryonic days (E18, E21) and all data after birth (P0, P7, P14, P90). Table 3 shows the Rand and adjusted Rand indices for the two-stage and three-stage methods. Wen et al. (1998) classified the genes into 14 general functional classes. The two-stage clustering resulted in six clusters, while three-stage clustering yielded 21 clusters. In this dataset, three-stage clustering performed better based on both the Rand and adjusted Rand indices.
3.2 Imputation
Missing values were generated by randomly removing from 1 to 20% of the data. The root mean squared error (RMSE) was calculated to measure the performance of the imputation. To assess the consistency of results, 1000 runs (each with different values removed randomly) were performed and the mean, minimum and maximum RMSE across all runs were obtained.
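The evaluation protocol described above can be sketched as follows. The sketch assumes a complete matrix of true values, hides a fixed fraction of cells at random, and scores an imputation by RMSE over the hidden cells only. Row-mean imputation is used purely as a stand-in imputer, the number of runs is cut from 1000 to 10 for brevity, and all function names and data are hypothetical.

```python
import math
import random


def mask_values(matrix, fraction, rng):
    """Return a copy of `matrix` with a random fraction of cells set to None,
    together with the list of hidden (row, column) positions."""
    cells = [(r, c) for r, row in enumerate(matrix) for c in range(len(row))]
    hidden = rng.sample(cells, round(fraction * len(cells)))
    masked = [list(row) for row in matrix]
    for r, c in hidden:
        masked[r][c] = None
    return masked, hidden


def row_mean_impute(masked):
    """Stand-in imputer: replace each missing value by its row mean."""
    filled = []
    for row in masked:
        present = [v for v in row if v is not None]
        fill = sum(present) / len(present) if present else 0.0
        filled.append([fill if v is None else v for v in row])
    return filled


def rmse(truth, filled, hidden):
    """Root mean squared error over the hidden cells only."""
    squared = sum((truth[r][c] - filled[r][c]) ** 2 for r, c in hidden)
    return math.sqrt(squared / len(hidden))


truth = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 2.5, 3.0, 3.5, 4.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
]

scores = []
for seed in range(10):
    masked, hidden = mask_values(truth, 0.10, random.Random(seed))
    scores.append(rmse(truth, row_mean_impute(masked), hidden))

print(sum(scores) / len(scores), min(scores), max(scores))
```

Any imputer with the same input and output shape can be substituted for `row_mean_impute` to compare methods on identical masks.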
3.2.1 Two-stage imputation
To illustrate the performance of the two-stage imputation method, the simulated dataset and the yeast cell cycle data with MIPS criterion were used. We compared the two-stage imputation method to the commonly used imputation methods of zero, row mean and KNN imputation. More sophisticated current methods, namely the local least squares imputation method (LLSimpute) (Kim et al., 2005), the Bayesian principal component method (BPCA) (Oba et al., 2003) and collateral missing value imputation (CMVE) (Sehgal et al., 2005), were also included to give more credibility to the comparison between methods. The code for KNN imputation was obtained from the R package impute, downloadable from CRAN. LLSimpute, BPCA and CMVE were obtained via downloadable Matlab code available at the associated author websites. An example of the results, with 10% of data missing, is shown in Table 4.
For the simulated dataset, we used k = 10 for the KNN, LLSimpute and CMVE methods since each cluster contains 10 genes, and this value is reported to work well for all three methods (Sehgal et al., 2005). For the yeast cell cycle data, we used k = 8 for the KNN method and for two-stage imputation we used 30 clusters; this corresponds to approximately 8 genes in a cluster, assuming that they are uniformly distributed. This choice produces a markedly low RMSE for both methods. The same k value is used for LLSimpute and CMVE to provide an equitable comparison.
Based on RMSE, the two-stage method outperformed all methods except BPCA. BPCA is a more sophisticated method than the two-stage method and far more computationally intensive. A single run using the simulated dataset with 10% missing values takes BPCA approximately 40 s, while the two-stage imputation method takes on the order of 1 s. Furthermore, Bayesian approaches are highly dependent on the chosen prior distribution; a wrong choice of prior distribution would result in poor performance. In our investigation, BPCA outperforms LLSimpute and CMVE, while Kim et al. (2005) and Sehgal et al. (2005) have reported that their respective methods outperform BPCA. LLSimpute is reported to perform well when k is large; based on the 1000 runs, we found that the optimal k varies extensively. CMVE performed extremely badly on a number of runs, giving an infinitely large RMSE. Moreover, even its minimum RMSE is high in comparison with other methods. This could be due to bugs in the Matlab code or sensitivity to the distribution of the missing data.
3.2.2 Three-stage imputation
To test the performance of the three-stage imputation method, the CNS data was used. The main focus was to compare the three-stage method with the two-stage method. An example of the results, with 10% of data missing, is shown in Table 5.
For both two-stage and three-stage imputation, we used 11 clusters. This number of clusters was chosen because it produces a low RMSE. For 10% of values missing, on average, two-stage imputation performed slightly better than three-stage imputation. The three-stage method, however, has a lower minimum compared to two-stage imputation. Table 6 shows the difference between the performance of the two-stage and the three-stage imputation methods as the percentage of missing values increases from 1 to 20%.
On average, three-stage imputation performed better than two-stage imputation when the percentage of missing values was low (i.e. at 1 and 5%). As the percentage of missing values increases, the performance of three-stage imputation drops. This is because the replication within treatment is very small (i.e. three replications within each treatment). When the percentage of missing values is high, there is a high probability that all replications within a treatment are missing and therefore, the imputed value is the overall mean of the gene expression profile. This reduces the three-stage imputation method to row mean imputation and thus its performance falls.
4 DISCUSSION
Our results suggest that decomposing the profiles into orthogonal components and clustering in stages is a useful approach. Geometrically, the multi-stage approach breaks the S-dimensional space in which the data lie into orthogonal subspaces. The advantage of this approach is most apparent when the scale of the dominant attribute is considerably larger than that of the others. In this case, clustering using Euclidean distance is largely based on the dominant attribute. The multi-stage approach, however, allows the subtlety of all components to be acknowledged. Therefore, extra information is available when performing the clustering.
In this article, for two-stage clustering, we have used main effect distance first and interaction distance second; we refer to this as the top-down order. Altering the order can give very different clustering results, especially when the clusters are indistinct. It is possible that a bottom-up approach could produce better results. The question here is not which order is better, but rather when a certain order is better. Since the process is not reversible, we always begin with the attribute which produces the most distinct clusters. If the separation is unclear in the first stage, the misclustering that occurs is carried on to subsequent stages. Information on the separation of the clusters in each stage is usually unattainable; we resolve this difficulty by assuming that attributes with larger values tend to have more distinct clusters. Thus, a top-down approach should be the default option.
Work in progress involves the implementation of the multi-stage idea into a model-based method. Classical and Bayesian model-based approaches to clustering of gene expression profiles will be studied and compared at a future date.
5 CONCLUSION
We have introduced an alternative approach to the clustering of gene expression profiles, an approach that involves clustering in a number of stages using a hierarchy of distance measures. This enables the clustering method to deal with large datasets in a systematic way.
We have shown that the distance measures are related to the design of experiment employed and reflect different attributes of the data. The multi-stage approach enhances the distinguishing power of the distance measures, because it allows subtle differences to not be masked by a more dominant attribute of the gene expression profiles. Thus, the precision of clustering is improved, as seen in the results displayed.
This clustering method is modified to accommodate missing values and leads to an associated imputation method. The multi-stage imputation method is simple and robust. On the datasets examined, it also outperformed every imputation method compared except BPCA.
The multi-stage approach is not only theoretically grounded but also biologically supported. It achieves this by putting emphasis on shape similarity, so taking into account the fact that co-expressed genes are co-regulated. Furthermore, it can be used on incomplete datasets and brings with it the ability to estimate the missing values.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Martin Bishop
Received on July 30, 2006; revised on January 23, 2007; accepted on February 10, 2007
REFERENCES
Ben-Dor A, et al. (1999) Clustering gene expression patterns. J. Comput. Biol., 6, 281–297.
Bo TH, et al. (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res., 32, e34.
Boutros PC, Okey AB. (2005) Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Brief. Bioinform., 6, 331–343.
Cho RJ, et al. (2001) Transcriptional regulation and function during the human cell cycle. Nat. Genet., 27, 48–54.
de Brevern AG, et al. (2004) Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinformatics, 5, 114.
Eisen MB, et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.
Godfrey AJR, et al. (2002) Two-stage clustering in genotype-by-environment analyses with missing data. J. Agric. Sci., 139, 67–77.
Hastie T, et al. (2000) Gene shaving as a method of identifying distinct sets of genes with similar expression patterns. Genome Biol., 1, research0003.1-0003.21.
Herrero J, et al. (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126–136.
Hubert L, Arabie P. (1985) Comparing partitions. J. Classification, 4, 193–218.
Jornsten R, et al. (2005) DNA microarray data imputation and significance analysis of differential expression. Bioinformatics, 21, 4155–4161.
Kim H, et al. (2005) Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21, 187–198.
Kim KY, et al. (2004) Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics, 5, 160.
Knudsen S. (2002) A Biologist's Guide to Analysis of DNA Microarray Data. New York: John Wiley and Sons, Inc.
McLachlan GJ, et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.
Michaud DJ, et al. (2003) eXPatGen: generating dynamic expression patterns for the systematic evaluation of analytical methods. Bioinformatics, 19, 1140–1146.
Nguyen DV, et al. (2004) Evaluation of missing value estimation for microarray data. J. Data Sci., 2, 347–370.
Oba S, et al. (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19, 2088–2096.
Ouyang M, et al. (2004) Gaussian mixture clustering and imputation of microarray data. Bioinformatics, 20, 917–923.
Rand WM. (1971) Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66, 846–850.
Scheel I, et al. (2005) The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics, 21, 4272–4279.
Sehgal MS, et al. (2005) Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data. Bioinformatics, 21, 2417–2423.
Soukas A, et al. (2000) Leptin-specific patterns of gene expression in white adipose tissue. Genes Dev., 14, 963–980.
Tamayo P, et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA, 96, 2907–2912.
Tom BD, et al. (2005) Quality determination and the repair of poor quality spots in array experiments. BMC Bioinformatics, 6, 234.
Troyanskaya O, et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
Tuikkala J, et al. (2006) Improving missing value estimation in microarray data with gene ontology. Bioinformatics, 22, 566–572.
Wang X, et al. (2006) Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinformatics, 7, 32.
Wen X, et al. (1998) Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA, 95, 334–339.
Yeung KY, et al. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977–987.
Zhou X, et al. (2003) Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics, 19, 2302–2307.