Bioinformatics Advance Access originally published online on January 5, 2006
Bioinformatics 2006 22(5):589-596; doi:10.1093/bioinformatics/btk026
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A multi-step approach to time series analysis and gene expression clustering
1Dipartimento di Scienze Fisiche, University of Naples Federico II Naples, ITALY
2Dipartimento di Matematica e Informatica, University of Salerno Fisciano, Salerno, ITALY
3Institute of Information Transmission Problems, Russian Academy of Sciences Moscow, Russia
4Telethon Institute of Genetics and Medicine Naples, ITALY
5Department of Astronomy, California Institute of Technology Pasadena CA, USA
6INFNIstituto Nazionale Fisica Nucleare Sezione di Napoli, Naples, ITALY
7INAFIstituto Nazionale di Astrofisica Sezione di Napoli, Naples, ITALY
8Department of Physics, Syracuse University Syracuse NY, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The huge growth in gene expression data calls for the implementation of automatic tools for data processing and interpretation.
Results: We present a new and comprehensive machine learning data mining framework consisting in a non-linear PCA neural network for feature extraction, and probabilistic principal surfaces combined with an agglomerative approach based on Negentropy aimed at clustering gene microarray data. The method, which provides a user-friendly visualization interface, can work on noisy data with missing points and represents an automatic procedure to get, with no a priori assumptions, the number of clusters present in the data. Cell-cycle dataset and a detailed analysis confirm the biological nature of the most significant clusters.
Availability: The software described here is a subpackage part of the ASTRONEURAL package and is available upon request from the corresponding author.
Contact: robtag{at}unisa.it
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Scientists have been successful in cataloguing genes through genome sequencing projects, and they can now generate vast quantities of gene expression data using microarrays. However, owing to the sheer size of the datasets involved and the complexity of the problems to be tackled, novel approaches to data mining and understanding, relying on artificial intelligence tools, are necessary.
These tools can be divided in two main families: tools for supervised learning, which make use of prior knowledge to group samples into different classes, and unsupervised tools, which rely only on the statistical properties of the data themselves (Hastie et al., 2001). Both approaches have been used for a variety of applications and both have advantages and disadvantages, the choice of a specific tool depends on the purpose of the investigation and the structure of the data. Among various applications, we recall the following:
- Diagnostic: It finds gene expression patterns specific to given classes mainly dealt with supervised methods (Ando et al., 2002; Mukherjee et al., 1999).
- Clustering: It is aimed at grouping genes that are functionally related without attempting to model the underlying biology (Eisen et al., 1998; Törönen et al., 1999; Tamayo et al., 1999).
- Model-based approach: The generation of a model that justifies the grouping of specific genes and trains the parameters of the model on the dataset (Bussermaker et al., 2001; di Bernardo et al., 2005; Yeung et al., 2001).
- Projection methods: They decompose the dataset into components that have the desired properties (Jolliffe, 2002; Alter et al., 2000; Misra et al., 2002, Chang et al., 2003). To this class belong some methods like principal components analysis (PCA), independent components analysis (ICA) and probabilistic principal surfaces (PPS) which are discussed in this paper.
Regarding the clustering problems, the most commonly used clustering algorithms, such as the hierarchical clustering (Eisen et al., 1998) and k-means (Duda et al., 2001) suffer from some limitations, namely:
- they need an arbitrary and an a priori choice of the correct number of clusters, and this greatly affects the success of the clustering procedure;
- some of them, like hierarchical clustering, cannot properly handle large experimental datasets which are typically noisy and incomplete (i.e. they contain a large fraction of missing data points);
- they do not feature a user-friendly data visualization, which is quite crucial for further analysis and understanding in case of large amount of data.
In this work we propose a new approach, based on a solid mathematical formalism, able to visualize and cluster noisy gene expression data with missing data points. The method can be divided in two parts: the preprocessing of the data, which as widely occurs, is specifically tailored to the characteristics of the dataset under examination, and the clustering and visualization part which is of more general applicability. The method was tested against the microarray dataset of cell-cycle in yeast Saccharomyces cerevisiae made publicly available by Spellman et al. (1998) (see also Cho et al., 1998). The Spellman dataset consists of four synchronization experiments (alpha factor arrest, elutriation and arrest of CDC15 and CDC28 temperature-sensitive mutants) which were performed for a total of 73 microarrays during cell-cycle. A detailed description of the dataset can be found in Spellman et al. (1998).
| 2 PREPROCESSING |
|---|
|
|
|---|
Microarray data are very noisy, and thus preprocessing plays an important role. Preprocessing is needed to filter out noise and to deal with missing data points. We used a two-step preprocessing phase: a preliminary noisy gene rejection, followed by a non-linear PCA features extraction. The structure of noise affecting gene expression data has been studied in a series of papers (Townsend, 2004; Wolfinger et al., 2001) where it has been proved to be quite involved. For the case of Spellman dataset, since inherent replications of the data are not available, it is impossible to perform a careful noise estimation. Furthermore, in this dataset we had only one time sequence for each gene. The number of points in each sequence ranges from 14 to 24. This prevented the application of widely used statistical methods of the noise estimation like ANOVA model to microarrays and bootstrap methods (Kerr and Churchill, 2001; Wolfinger et al., 2001), calculation of the statistical significance of an observed change in expression level using a ratio distribution (Ermolaeva et al., 1998; Chen et al., 1997), Bayesian method for transcript level estimation (Tseng et al., 2001; Townsend and Hartl, 2002), method based on multiple repetition of the signal (Hughes et al., 2000) or finally advanced normalization method to correct for systematic bias (de Lichtenberg et al., 2005).
We illustrate below a simple method for noise estimation relying on the periodicity behavior of the gene expression, followed by a non-linear PCA features extraction.
2.1 Noisy gene rejection
Our input data are the time-courses of the relative abundance of each gene's mRNA in the test sample compared with its abundance in a reference sample (to be precise the log2 of this ratio, and hereafter denoted for simplicity as gene expression). These signals are periodic (the period being determined by the life time from one division of yeast cell to the next division) with each period depending on the temperature and density of the cells culture, the kind of nutrient source, yeast strain and other parameters. According to Spellman et al. (1998) the mean value of this period is 90 min in normal conditions even though each experiment could have its specific period falling in the range of 79101 min. The difference between gene expression in two periods can be considered as an estimate of the noise.
The first goal is thus to estimate the period for each experiment. To this aim, we considered an arbitrary gene a and denoted by fa(t) its expression as a function of time. We split fa(t) into two parts around an arbitrary time point T that can be any value. For T different from the sampled time points we used interpolation, a safe procedure since restricted to period estimation only. For each gene a we compute the Pearson correlation coefficient
a(T) between the restrictions of fa in the time range [0, T/2] and [T, 3/2T], respectively. The value of T for which the cumulative correlation r(T) =
a
a(T) reaches its absolute maximum is our estimate of the period T. As a result of this analysis one gets different values of the period for the different experiments: T
= 86 min, TCDC15 = 109 min, TCDC28 = 75 min (see Supplementary figures ALPHA-corr.gif, CDC15-corr.gif and CDC28-corr.gif). Finally, note that these values have 90 min as average value, and that this procedure cannot be applied to Elutriation since synchronization is lost after one cell-cycle (Spellman et al., 1998). In order to have the same analysis for each experiment, and since
experiment only covers one and half a period we had to choose half period correlation.
Once we obtained the values of T, Fa(t)
(fa(t) + fa(t + T))/2 gives an estimate of the signal and
fa(t)
(fa(t) fa(t + T))/2 of the noise. In supplemnentary figure CDC15-delta.gif, the distribution of
, defined as the average of
fa(t) on the period T, for CDC15 is plotted. The distribution has average equal to 0.01 and variance equal to 0.26. Moreover its skewness and kurtosis are 0.3 and 5.7, respectively. We have checked using a t-test, that the distribution is extracted by a zero average theoretical distribution at 99% confidence level. It is worth observing that a
2 goodness-of-fit test for both Gaussian and asymmetric Laplace distributions (Purdom and Holmes, 2005) does not give a satisfactory significance level, even if, the Laplace distribution provides a much better fit, namely
2
400 for
70 d.o.f. Despite the fact that we cannot separate the additive and multiplicative components, the distribution variance provides an order of magnitude of the noise, which is sufficient for our filtering procedure.
By denoting with
a the root-mean-square for the noise of gene a, we can define the noise-over-signal ratio (n/s) for the gene a as
, where the denominator represents the effective value of the signal (Figure CDC15-n2s.gif, in the Supplementary material, reports the distribution of Ra for CDC15 experiment).
Genes for which all the experiments
, CDC15 and CDC28 yield Ra > 0.5 were excluded from the sample being too noisy, thus reducing the number of genes to be further preprocessed from 6125 to 2679. The 0.5 threshold was found to be a compromise between a reasonable value for the noise-to-signal ratio and also capable to retain most of the genes analyzed by Spellman (Figure CDC15-n2s.gif). Small variations of this threshold would change the number of filtered genes by few hundreds and does not sensibly modify our results.
2.2 Non-linear PCA-based feature extraction
The second step of preprocessing is the extraction of the principal components (eigenvectors) of the autocorrelation matrixes of the genes which passed the filtering procedure. This step is used to deal with missing data points.
Features extraction process is based on a non-linear PCA method which can estimate the eigenvectors from unevenly sampled data. This approach is based on the STIMA algorithm described in Ciaramella et al. (2004) and Tagliaferri et al. (1999, 2001).
2.2.1 Normalization of input vectors
As a first step, the mean value of the raw microarray signal was calculated and subtracted, in order to obtain a vanishing mean process, and divided by its standard deviation. In this way, for each gene we obtained four normalized signals (one for each experiment) with vanishing mean value and SD = 1, which were grouped into a single vector. For genes with no missing points, the resulting vector was 73 dimensional (18, 24, 17 and 14 data points for the alpha, CDC15, CDC28 and elutriation experiments, respectively).
One of the main problems of the microarray's data is the presence of missing points. We used only genes whose microarray signal had more than 10 time points for each experiment, and this reduces the 2679 noise filtered genes to 2445.
2.2.2 PCA neural nets for feature extraction
PCA is a widely used technique in data analysis (Hyvärinen et al., 2001, Ciaramella et al., 2004; Tagliaferri et al., 1999, 2001; Oja et al., 1991, 1996).
PCAs can be neurally realized in various ways (Ciaramella et al., 2004; Tagliaferri et al., 1999, 2001; Karhunen and Joutsensalo, 1994, 1995; Oja et al., 1991). In our case we used a non-linear PCA neural network (NN) consisting of a one-layer feedforward NN able to extract the principal components of the autocorrelation matrix of the input sources.
The parameters of the NN and those we need in the learning process may be summarized as follows:
- the NN is composed by N x p connections, where N and p are the number of input and output neurons, respectively. The parameter N is the embedding dimension calculated considering the features of the signals for each gene and for each experiment (we denote with D the number of elements for each experiment) and p is the number of principal eigenvectors that we need. We denote with xk = [x(k), ..., x(k + N 1)]T the input vector corresponding to the N samples of the input data. In our case we have that each experiment for each gene is an unevenly sampled sequence of D elements and N = 8 is the dimension for each evenly sampled estimated principal component;
- the initial W weight matrix of N x p dimension;
- a valid non-linear cost function f(·) [i.e. log cosh(
·) or |
·|] and
function parameter;
- the learning rate µ;
- the early stopping parameter
.
Briefly, we have to determine, using a learning algorithm (i.e. descent gradient), the weights matrix W = [w(1), ..., w(p)]. These vectors correspond to our p principal components that are in other words the p feature vectors used as input to the PPS approach.
In details, we then have the following general algorithm:
Step 1: the column weight vectors w0(i)
i = 1, ..., p are initialized using the first eight orthonormalized signal patterns since in our experiments we are considering N = 8 and p = 1. The learning threshold
, the learning rate µ and the objective function f(·) [and the corresponding derivative g(·)] are initialized as well. The pattern counter k and the epoch counter t are set to zero. The value N = 8 is fixed by looking for a compromise between different requirements: (1) to get the same number of data points for each gene in each experiment in order to avoid different statistical weights for the experimental datasets; (2) to keep the largest possible time window for each gene expression in order to save the largest amount of information and properly train the NN and (3) to avoid discarding from the analysis a large number of genes owing to the presence of too many missing data points. Moreover, the small dimension of the analyzed time series in addition to the use of N = 8 does not allow to consider p > 1.
Step 2: We input the k-th input pattern
![]() |
Step 3: the output for each neuron at the k time is
![]() |
Step 4:
i = 1, ..., p the weights
![]() |
where wt+1(i) = [wt+1(i, 1), ..., wt+1(i, N)], g(
t) is the derivative of the objective function f(·) and ek(i) is
![]() |
Step 5: k = k + 1. If k is
D N (where D N is the number of patterns) then GO TO STEP 2.
Step 6: convergence test. If
![]() |
Step 7: t = t + 1. GO TO STEP 2.
Step 8. End.
Summarizing, after applying our non-linear PCA-based feature extractor method, we obtained for each of the 2445 genes a 32-dimensional feature vector (one 8-dimensional principal component for each of the four experiments) which can be fed into the PPS model.
| 3 PPS: THEORETICAL DESCRIPTION |
|---|
|
|
|---|
PPS (Chang, 2000; Chang and Ghosh, 2001) are a non-linear extension of principal components, in that each node on the PPS is the average of all data points that project near/onto it. From a theoretical standpoint, the PPS are a generalization of the Generative Topographic Mapping (GTM) (Bishop et al., 1998), which can be seen as a parametric alternative to Self Organizing Maps or SOM (Kohonen, 1995). Some advantages of PPS include their parametric and flexible formulation for any geometry/topology in any dimension, guaranteed convergence (indeed the PPS training is accomplished through the Expectation-Maximization algorithm). PPS are governed by their latent topology and owing to their flexibility a variety of PPS topologies can be created, among which also is the 3-dimensional (3D) sphere. A sphere is finite, unbounded and symmetric, with all nodes distributed at the edge, and it is therefore suitable for emulating the sparseness and peripheral property of high-dimensional data. Furthermore, the sphere topology can be easily comprehended by humans and thereby used for visualizing high-dimensional data.
PPS define a non-linear, parametric mapping y(x; W) from a Q-dimensional latent space (x
Q) to a D-dimensional data space (t
D), where usually Q < D. The (continuous and differentiable) function y(x; W) maps each of the M points in the latent space (where M is the number of latent variables which is fixed a priori) to the correspondent point into the data space. Since the latent space is Q-dimensional, these points will be confined to a Q-dimensional manifold non-linearly embedded into the D-dimensional data space (Fig. 1). PPS build a constrained mixture of Gaussians (where the priors are all fixed to 1/M)
![]() | (1) |
![]() | (2) |
denotes the noise variance. The covariance is defined as
![]() | (3) |
, a clamping factor which determines the orientation of the covariance, is bound in the interval 0 <
< (D/Q). Moreover,
is the set of orthonormal vectors tangent to the manifold at y(x; w) and
is the set of orthonormal vectors orthogonal to the manifold in y(x; w).
|
The complete set of orthonormal vectors
spans
D. The EM algorithm (Dempster et al., 1977) can be used to estimate the PPS parameters W and ß, while the clamping factor is fixed by the user and is assumed to be constant during the EM iterations. The form of the mapping y(x; w) is defined as a generalized linear regression model y(x; w) = W
(x), where the elements of
(x) consist of L fixed basis functions
, and W is a D x L matrix. If Q = 3 is chosen, a spherical manifold (Chang, 2000) can be constructed using a PPS with nodes
arranged regularly on the surface of a sphere in
3 latent space, with the latent basis functions evenly distributed on the sphere at a lower density (Staiano, 2003).
After a PPS model is fitted to the data, several visualization possibilities are available like the projection of the data are projected into the latent space as points onto a sphere. The latent manifold coordinates
n of each data point tn are computed as
![]() | (4) |
![]() | (5) |
m rmn = 1, for n = 1, ..., N, these coordinates lie within a unit sphere, i.e. ||
n||
1. Having projected the data into the latent sphere, a typical task performed by most data analyzers is the localization of the most interesting data points, for instance the ones lying far away from more dense areas (outliers), or those lying in the overlapping regions between clusters, and to investigate their characteristics by linking the data points on the sphere with their position in the original dataset. For instance, in the genetic application described here if the raw data were available, the user might want to visualize the object corresponding to the data point selected on the sphere. The user is also allowed to select a latent variable and color all the points for which that specific latent variable is responsible (Fig. 2). Furthermore, the latent variables responsibilities can be plotted on the sphere in order to obtain the data probability density function visualization (Fig. 3). Note that dense areas (in red) correspond to genes that have similar properties, moreover, they might contain more than one cluster, and this calls for further investigations. Another advantage of the 3D sphere representation is that, unlike 2D plot (i.e. hierarchical clustering, etc.), it is possible to observe the distribution of the different clusters on the sphere and observe which clusters are similar to each other as the dense regions that are in close proximity. All these advanced visualization options and successful applications to other research fields (i.e. in astrophysics) have been discussed in Staiano et al. (2005).
|
|
| 4 NEGENTROPY CLUSTERING (NEC) |
|---|
|
|
|---|
The main purpose of any clustering algorithm is to separate classes arbitrarily distributed in data space, in an unsupervised way. The simplest methods make use of vector quantization techniques (Duda et al., 2001), i.e. k-means, fuzzy C-Means, SOM, etc.
PPS may be combined with an agglomerative approach based on a Negentropy (hereafter NEC) information, to merge similar regions into clusters. The only a priori information that we need is a dissimilarity threshold. This quantity is used as a method to guide the search for substructures in the data. We clusterize for different threshold values ranging in a fixed interval, looking for a region of stability in the number of clusters. The use of a non-Euclidian metric for clustering was also explored in Martins et al. (2004) who used an approach based on Neural Networks and KullbackLeibler divergency, but it is computationally expensive. Futhermore, Negentropy is a very important measure of non-Gaussianity. It is strongly connected to ICA. The goal of ICA is to find a linear representation of non-Gaussian data so that the components are statistically independent, or as independent as possible (Hyvärinen et al., 2001). ICA has been applied to microarray cell-cycle data (Lee and Batzoglou, 2003), and first proposed by Liebermeister (2002). To introduce the concept of Negentropy one needs to define the differential entropy H of a set of random vectors y = (y1, ..., yn)T with density f(y) as follows:
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
The approximation we used is J(y)
[E{G(y)} E{G(
)}]2, where
is a Gaussian variable with vanishing mean value, and unit variance (i.e. standardized). The variable y is assumed to have vanishing mean value and unit variance, as well. We note that choosing a G that does not grow too fast, one obtains more robust estimators. The following choices of G have proved very useful:
![]() | (10) |
Negentropy can be used to agglomerate with an unsupervised method the clusters (regions) found by the PPS approach. The only a priori information is a dissimilarity threshold. In practice, this approach measures whether two clusters can be modeled by one single Gaussian or not, in other words, if the two regions can be considered to be aligned or as part of a greater dataset. In particular, the plot on the top of Figure 4 represents two separated Gaussian distributions and their contours as well as a Gaussian distribution obtained by merging the two distributions. In this case we obtain a Negentropy measure of 2.6261 (using the G1 function with a1 = 0.1). On the other hand, at the bottom of Figure 4 we show two overlapped Gaussian distributions with the contour of the Gaussian distribution obtained on the overall dataset. In this case the Negentropy information is of 0.005 using the same function and the same settings.
|
| 5 RESULTS |
|---|
|
|
|---|
Non-linear PCA was used to extract the first eigenvector of the autocorrelation matrix of the 2445 genes that passed the filtering procedure, followed by a PPS with 98 latent variables and 38 latent basis functions and a clamping factor
set to 0.7. All these parameters were set empirically. After the completion of the training phase we projected the data into the latent space and computed the responsibility for each latent variable. The NEC algorithm started with 98 clusters (the number of latent variables used with PPS, i.e. the number of components of the mixture).
As output we obtained 56 clusters. Figure 5 shows the projection of genes which belong to some relevant clusters. For each of them we derived an archetype, defined as the mean of the data points in each cluster, and the SD with respect to the archetypes. The resulting error bar plots for clusters 23 and 49 are shown in Figure 6, as example (see Supplementary material for the other clusters).
|
|
We also computed the biological P-values of our clusters using the GOTerm Finder tool from the S.cerevisiae database (Christie et al., 2004). GOTerm Finder searches for significant shared Gene Ontology (GO) terms, or parents of the GO terms, used to describe the genes in a given list. The biological reliability of our clustering algorithm is confirmed by the fact that many clusters (22 out of 56) show very low Bonferroni-corrected P-values according to the GOTerm Finder tool (refer to Supplementary files for a complete list of significant biological pathways found in each cluster). Thus confirming that according to database annotations the genes in each cluster share a common biological pathway.
In-depth biological validation of our approach and the discovery of novel functions of known genes are outside the scope of this work that aimed at presenting a new robust methodology for gene clustering. However, as an example of genes that get clustered together by the algorithm, we describe in detail two clusters, namely 23 and 49. The remaining clusters are shown in the Supplementary material.
These clusters are very different: cluster 23, consists of cell-cycle regulated genes, the other, cluster 49 has no intersection with Spellman data. Notice that since the exact composition of all the Spellman's clusters is not available, the following comparison has been carried out by using the only available data.
Cluster 23 consists of 52 genes. In this cluster 24 genes intersect with Spellman's CLN2 cluster. These genes are: YAR007C, YBR070C, YCL024W, YDL003W, YDR528W, YGR109C, YGR152C, YGR221C, YHR153C, YIL066C, YJL181W, YKL045W, YLL022C, YLR103C, YLR313C, YML027W, YML102W, YMR179W, YOL007C, YOL017W, YPL153C, YPL256C, YPR174C and YPR175W. All of these genes are strongly cell-cycle regulated. Their peak expression occurs in mid-G1 phase, they are strongly induced by GAL-CLN3 but are strongly repressed by GAL-CLB2. Out of the remaining 28, 25 genes, are also cell-cycle regulated and their peak of expression occurs in the G1 phase (except of YFL060C with the peak in S-phase). Most of these genes are involved in the processes of DNA repair and replication. The genes YMR048W, YCL022C, YCL023C, YGR151C and YIL141W are cell-cycle regulated, have peak of expression in G1 phase, but have unknown functions. The remaining three genes of the cluster (YLR376C, YLR095C and YMR177W) are not cell-cycle regulated.
Cluster 49 comprises 56 genes with no common intersection with Spellman clusters. A more detailed analysis of the biological significance of this cluster showed that 26 genes out of 56 (i.e. 46%) (ECM1, DBP10, RRP1, SSF2, UTP4, LCP5, CGR1, LSG1, ROK1, NOP7, UTP8, NSR1, RPF1, IMP3, RIX1, RPF2, SOF1, EMG1, RSA3, UTP21, ERB1, NOP15, NOP8, RRP6, RRP12 and NOP53) are involved in the same two cellular processes: (1) cytoplasm organization and biogenesis and (2) ribosome biogenesis and assembly with both p = 2.79 e23. A set of 22 genes (DBP10, RPC53, UTP4, LCP5, CGR1, ROK1, NOP7, UTP8, NSR1, RPF1, IMP3, RIX1, RPF2, SOF1, EMG1, UTP21, ERB1, NOP8, RRP6, RPA43, RRP12 and NOP53) among the 56 genes in this cluster is involved in biopolymer metabolism. Finally, eight genes (YIL096C, YIL110W, YLR287C, YMR310C, YNL313C, YNR046W, YPR136C and YOL022C) have still unknown function.
We also evaluated the noise filtering step in our approach to verify that we did not filter out cell-cycle regulated genes. This was achieved using the benchmark B1, a list of known cell-cycle genes, described in the article of de Lichtenberg et al. (2005). A total number of 84 genes out of 113 of B1 passed our noise filtering step, whereas only 29 genes from B1 were excluded from the analysis as noisy.
| 6 DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
We present a new and complete machine learning data mining framework consisting in a noise filtering procedure, a non-linear PCA NN for feature extraction, and PPS combined with an agglomerative approach based on a Negentropy aimed at preprocessing and clustering gene microarray data. The approach has been applied to microarray dataset of cell-cycle in yeast S.cerevisiae, and could be used for more general time-dependent signals by properly modifying the noise estimation method.
The noisy gene rejection method, which represents the first step of our algorithm, makes use of a noise estimation based on the periodical behavior of cell-cycle regulated gene expression in microarray data. Genes which pass the filtering procedure undergo a features extraction process, based on a non-linear PCA method, which allows us to obtain a matrix of eigenvectors from unevenly sampled data. The results of the above preprocessing are then analyzed by a PPS algorithm which has proven to be a very powerful and efficient model in several data mining activities, and in particular for high-dimensional data visualization and clustering. Spherical PPS, which consists of a spherical latent manifold lying in a 3D latent space, better deal with high-dimensional data. The sphere, in fact, is able to capture the sparsity and periphery of data in large input spaces which are owing to the curse of dimensionality (Bishop, 1995). The regions found by PPS are then agglomerated by NEC in a completely unsupervised way. This 3D sphere is an innovative way of looking at gene expression data as compared with hierarchical clustering that displays a 2D plot. Our visualization captures the information on clusters of genes that do share similar behavior. These would be represented on the sphere as dense regions that are in close proximity.
Other neural-based methods have already been investigated with success on microarray data, e.g. in Törönen et al. (1999) and Tamayo et al. (1999), SOM have been used for both visualization and clustering purposes. However, SOMs, while nice and understandable and appealing in a number of application fields, are less flexible in the sense that gives no way to directly interact with data and exhibits strong border effects on 2D neurons grid since its manifold is bounded.
As shown in the paper by de Lichtenberg et al. (2005) the jungle of available clustering methods often leads to contradictory results, and often, it is difficult to choose a specific benchmark. To understand the result obtained and to test our method we compared our results with those obtained in Spellman et al. (1998). The authors clustered the genes using the hierarchical bottom-up algorithm introduced by Eisen et al. (1998) which groups the genes that have similar relative concentration profiles during cell life cycle. It has to be stressed that the preprocessing adopted by Spellman is not optimal since it is based on Fourier analysis, which cannot be effectively applied to a time series which covers only two or one and half cellular period. The clustering sequence is represented by a hierarchical tree where it is not obvious when to stop the agglomerative process. Stopping it at an early stage could result in a large number of small clusters, thus missing the recognition of underlying similarities among nearby clusters. On the other hand allowing agglomeration to continue for too many iterations, would result in fewer, larger clusters, thus potentially obscuring important substructures. While there is no objective criteria to decide where to stop the hierarchical tree, in our method the number of clusters is automatically determined. Nevertheless, Spellman's classification scheme, in spite of the weaknesses of the adopted clustering algorithm, is strongly supported by a deep analysis of the clusters meaning performed using a biological approach.
| Acknowledgments |
|---|
The authors would like to thank M. Caselle for the useful discussion and comments. The implemented system has been developed within the framework of the Astroneural (http://people.na.infn.it/~astroneural) collaboration: a joint project between the Dipartimento di Matematica e Informatica of the University of Salerno and the Dipartimento di Scienze Fisiche of the University of Naples Federico II.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin John Bishop
Received on October 6, 2005; revised on December 10, 2005; accepted on December 23, 2005
| REFERENCES |
|---|
|
|
|---|
Alter, O., et al. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA, 97, 1010110106
Ando, T., et al. (2002) Fuzzy neural network applied to gene expression profilling for predicting the prognosis of diffuse large B-cell lymphoma. Jpn. J. Cancer Res, . 93, 12071212[CrossRef][Web of Science].
Bishop, C.M. Neural Networks for Pattern Recognition, (1995) , Oxford, UK Oxford University Press.
Bishop, C.M., et al. (1998) GTM: The generative topographic mapping. Neural Comput, . 10, 215234[CrossRef][Web of Science].
Bussermaker, H.J., et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet, . 27, 167171[CrossRef][Web of Science][Medline].
Chang, J.H., et al. (2003) Gene expression pattern analysis via latent variable models coupled with topographic clustering. Genom. Inform, . 1, 3239.
Chang, K. (2000) Nonlinear Dimensionality Reduction Using Probabilistic Principal Surfaces, PhD Thesis, Department of Electrical and Computer Engineering, University of Texas at Austin, USA.
Chang, K. and Ghosh, J. (2001) A unified model for probabilistic principal surfaces. IEEE Trans. Pattern Anal. Mach. Intell, . 23, n1.
Chen, Y., et al. (1997) Ratio-based decision and the quantitative analysis of cDNA microarray images. J. Biomed. Opt, . 364374.
Cho, R.J., et al. (1998) A genom-wide transcriptional analysis of the mitotic cells. Mol. Cell, 2, 6573[CrossRef][Web of Science][Medline].
Christie, K.R., et al. (2004) Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res, . 32, (Database issue) D311D314
Ciaramella, A., et al. (2004) A Multifrequency Analysis of Radio Variability of Blazars. J. Astron. Astrophys, . 419, 485500[CrossRef].
de Lichtenberg, U., et al. (2005) Comparison of computational methods for the identification of cell cycle-regulated genes. [Erratum (2005) Bioinformatics, 21, 3063.]. Bioinformatics, 21, 11641171
Dempster, A.P., et al. (1977) Maximum-Likelihood from Incomplete Data Via the EM Algorithm. J. R. Sta. Soc, . 39, n1.
di Bernardo, D., et al. (2005) Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat. Biotechnol, . 23, 377383[CrossRef][Web of Science][Medline].
Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification, (2001) 2nd edn. , New York John Wiley & Sons Inc.
Eisen, M.B., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 1486314868
Ermolaeva, O., et al. (1998) Data management and analysis for gene expression arrays. Nat. Genet, . 20, 1923[CrossRef][Web of Science][Medline].
Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning, (2001) , New York Springer-Verlag.
Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109126[CrossRef][Web of Science][Medline].
Hyvärinen, A., Karhunen, J., Oja, E. Independent Component Analysis, (2001) , New York John Wiley & Sons.
Lee, S.I. and Batzoglou, S. (2003) Application of independent component analysis to microarrays. Genom. Biol, . 4, R76.
Liebermeister, W. (2002) Linear modes of gene expression determined by independent component analysis. Bioinformatics, 18, 5160
Jolliffe, I.T. Principal Component Analysis, (2002) 2nd edn. , New York Springer-Verlag.
Karhunen, J. and Joutsensalo, J. (1994) Representation and separation of signals using non-linear PCA type learing. Neural Netw, . 7, 113127[CrossRef].
Karhunen, J. and Joutsensalo, J. (1995) Generalizations of principal component analysys, optimization problems and neural networks. Neural Netw, . 8, 549563[CrossRef].
Kerr, M.K. and Churchill, G.A. (2001) Statistical design and the analysis of gene expression microarray data. Genet. Res, . 77, 123128[CrossRef][Web of Science][Medline].
Kohonen, T. Self-Organizing Maps, (1995) , Berlin Springer-Verlag.
Martins, A., de, M., Neto, A.D.D., de Melo, J.D., Costa, J.A.F. (2004) Clustering Using Neural Networks and Kullback-Leibler Divergency. Proceedings of the International Joint Conference on Neural NetworksIJCNNBudapest , PiscatawayNJUSA IEEE, pp. , pp. 16.
Misra, J., et al. (2002) Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res, . 12, 11121120
Mukherjee, M., Tamago, P., Mesirov, J.P., Slorim, D., Verni, A., Poggio, T. (1999) Support vector machine classification of microarray data. Technical Report, Cambridge: MIT, N.182.
Oja, E., Ogawa, H., Wangviwattana, J. (1991) Learning in nonlinear constrained Hebbian network. In Kohonen, T. (Ed.), et al. Artificial Neural Networks, , North-Holland, Amsterdam , pp. 385.
Oja, E., Karhunen, J., Wang, L., Vigario, R. (1996) Principal and independent components in neural networksrecent developments. Proceedings of WIRN'95Vietri, Italy , Singapore World Scientific Publisher, pp. , pp. 1635.
Purdom, E. and Holmes, S.P. (2005) Error distribution for gene expression data. Stat. Appl. Genet. Mol. Biol, . 4, 16.
Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 32733297
Staiano, A. (2003) Unsupervised neural networks for the extraction of scientific information from astronomical data, PhD Thesis, University of Salerno Italy.
Staiano, A., De Vinco, L., Tagliaferri, R., Longo, G. (2005) High-D data visualization methods via probabilistic principal surfaces for data mining applications. Proceedings of Multimedia Database and Image Communication, Series on Software Engineering and Knowledge Engineering, World Scientific Series 17, , pp. 6374.
Tagliaferri, R., et al. (1999) Spectral analysis of stellar light curves by means of neural networks. Astron. Astrophys. Suppl. Ser, . 137, 391405[CrossRef].
Tagliaferri, R., et al. (2001) Soft computing methodologies for spectral analysis in cyclostratigraphy. Comput. Geosci, . 27, 535548[CrossRef].
Tamayo, P., et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci., USA, 96, 29072912
Törönen, P., et al. (1999) Analysis of gene expression data using self-organizing maps. FEBS Lett, . 451, 142146[CrossRef][Web of Science][Medline].
Townsend, J. P. and Hartl, D.L. (2002) Bayesian analysis of gene expression levels: statistical quantification of relative mRNA level across multiple treatments or samples. Genome Biol, . 3, research0071.10071.16.
Townsend, J.P. (2004) Resolution of large and small differences in gene expression using models for the Bayesian analysis of gene expression levels and spotted DNA microarrays. BMC Bioinformatics, 5, 54[CrossRef][Medline].
Tseng, G.C., et al. (2001) Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res, . 29, 25492557
Wolfinger, R.D., et al. (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol, . 8, 625637[CrossRef][Web of Science][Medline].
Yeung, K.Y., et al. (2001) Model based clustering and data transformations for gene expression data. Bioinformatics, 17, 977987
This article has been cited by other articles:
![]() |
D. Sahoo, D. L. Dill, R. Tibshirani, and S. K. Plevritis Extracting binary signals from microarray time-course data Nucleic Acids Res., June 28, 2007; 35(11): 3705 - 3712. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





















