Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(6):730-740; doi:10.1093/bioinformatics/bti067
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Relational patterns of gene expression via non-metric multidimensional scaling analysis
1Department of Physics, Faculty of Science and Technology, Chuo University 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
2Institute for Science and Technology, Chuo University 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
3Department of Physics 1110 W. Green Street, Urbana, IL 61801, USA
4Institute for Genomic Biology, University of Illinois at Urbana-Champaign Urbana, IL 61801, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Microarray experiments result in large-scale data sets that require extensive mining and refining to extract useful information. We demonstrate the usefulness of (non-metric) multidimensional scaling (MDS) method in analyzing a large number of genes. Applying MDS to the microarray data is certainly not new, but the existing works are all on small numbers (<100) of points to be analyzed. We have been developing an efficient novel algorithm for non-metric MDS (nMDS) analysis for very large data sets as a maximally unsupervised data mining device. We wish to demonstrate its usefulness in the context of bioinformatics (unraveling relational patterns among genes from time series data in this paper).
Results: The Pearson correlation coefficient with its sign flipped is used to measure the dissimilarity of the gene activities in transcriptional response of cell-cycle-synchronized human fibroblasts to serum. These dissimilarity data have been analyzed with our nMDS algorithm to produce an almost circular relational pattern of the genes. The obtained pattern expresses a temporal order in the data in this example; the temporal expression pattern of the genes rotates along this circular arrangement and is related to the cell cycle. For the data we analyze in this paper we observe the following. If an appropriate preparation procedure is applied to the original data set, linear methods such as the principal component analysis (PCA) could achieve reasonable results, but without data preprocessing linear methods such as PCA cannot achieve a useful picture. Furthermore, even with an appropriate data preprocessing, the outcomes of linear procedures are not as clear-cut as those by nMDS without preprocessing.
Availability: The FORTRAN source code of the method used in this analysis (pure nMDS) is available at http://www.granular.com/MDS/
Contact: tag{at}granular.com
Supplementary information: http://www.granular.com/MDS/B1_2005.
| INTRODUCTION |
|---|
|
|
|---|
Each DNA microarray experiment can give us information about the relative populations of mRNAs for thousands of genes. This implies that without extensive data mining it is often hard to recognize any useful information from the experimental results. In this paper, we demonstrate that a non-metric multidimensional scaling (nMDS) method can be a powerful unsupervised means to extract relational patterns in gene expression. A data mining procedure may be useful, if it is flexible enough to incorporate any level of supervision, but we believe that the most basic feature required for any good data mining method is to be able to extract recognizable significant patterns reproducibly without supervision. In this sense our nMDS method is clearly demonstrated to be a useful means of data mining.
Since MDS is a very natural tool to visualize the salient features in the data in general, initially we expected many existing published works but to our knowledge actually there are not so many. Besides, these existing examples [only a few published papers in the field of bioinformatics, e.g. Dyrskjot et al., (2002)], are mostly metric MDS examples and the number of points to be imbedded is small (
30). Shmulevich and Zhang, (2002) applies a non-metric MDS method, but again they classify tumor types. That is, they study <100 points to embed (visualize).
We have been developing an efficient nMDS technique for large data sets (Y.-h.Taguchi and Y.Oono, unpublished data, http://www.granular.com/MDS/src/paper.pdf; Taguchi et al., 2001; Taguchi and Oono, 2004). The input is the rank order of (dis)similarities among the objects (genes in the present case). Our algorithm is maximally non-metric in the sense that any introduction of intermediate metric coordinates obtained by monotone regression common to the conventional nMDS methods is avoided.
Data compression is essentially a problem of linear functional analysis as Donoho et al., (1998) stresses. In contrast, we believe that non-linear methods should be important to data mining. There are linear algebraic methods such as the principal component analysis (PCA) for data mining, but it is expected that non-linear methods are, in principle, more powerful than linear methods. The present paper illustrates this point. Indeed, in our case PCA cannot find any comprehensible pattern in low-dimensional spaces without an appropriate data polishing.
| SYSTEMS AND METHODS |
|---|
|
|
|---|
Systems to analyze
The gene activities in transcriptional response of cell-cycle-synchronized human fibroblasts to serum reported by Iyer et al., (1999) are analyzed. The microarray data used in this analysis is available at http://genome-www.stanford.edu/serum/fig2data.txt
Other possible analysis methods
To extract interpretable patterns from microarray data, cluster and linear multivariate analyses seem to be the two major strategies. However, these methods may not always work as shown here in the example of the analysis of time series data.
The cluster analysis seems to be the most common approach (Slonim, 2002). For example, the hierarchical clustering method seems to be popular (Eisen et al., 1998). Classifying the expression patterns as functions of time is often attempted by clustering methods (Spellman et al., 1998). However, it is difficult to see continuous relationship among genes by cluster analysis, because it must classify genes into a limited number of categories. For example, it is difficult to see how genes are expressed temporally. In contrast to clustering analyses, nMDS can place genes in a continuous space, so we can easily recognize the relationship among genes that change in a continuous manner. Thus, the nMDS method can capture features which cluster analysis is hard to find.
The PCA is a method that visualizes the relationship among genes in a continuous vector space. The main idea is to choose a data-adapted basis set, and to make a subspace that can capture salient features of the original data set. In principle, the method could capture the continuous relationship in the gene expression pattern, but the dimension of the subspace may not be low even if the data are on a very low-dimensional manifold. In short, the information compression capability of linear methods is generally weak. This can be illustrated well by the data we wish to analyze in this paper: PCA does not capture the temporal order as clearly as nMDS. Figure 1 exhibits the two-dimensional space spanned by the first two principal components. The temporal expression pattern is hardly recognized in the result. This should be compared with the nMDS results in Figure 4.
|
|
Apparently, however, some linear analysis methods work well. Holter et al., (2000) demonstrated that the singular value decomposition (SVD; a linear method) is remarkably successful in extracting the characteristic modes. The reader must wonder why there is a difference between this result and the one due to PCA that is not successful. The secret is in the highly non-linear polishing of the original data (proposed by Eisen et al., 1998). However, the role of this non-linear polishing must be considered carefully, because it can generate a spurious temporal behavior. Therefore, we relegate the comparison of these linear methods with data preprocessing and our nMDS in the Appendix 2 section. The salient conclusions are as follows:
- Linear methods such as PCA and SVD could perhaps achieve reasonable results, if a data preparation scheme is chosen appropriately. However, even the best linear results are generally fairly inferior to non-linear results.
- The data preparation such as the polishing used by Holter et al., 2000 could actually corrupt the original data (as illustrated in the Appendix 2 section), and should be avoided.
An interesting proposal to extract a temporal order is to use the partial least squares (PLS) regression (Johansson et al., 2003). In this case one may assume a temporal order one wishes to extract (say, a sinusoidal change in time), and the original data are organized around the expected pattern. This is, so to speak, to analyze a set of data according to a certain prejudice. Although during the process of organization no supervision is needed, the pattern to be extracted (i.e. the prejudice) must be presupposed. Furthermore, if a clear objective pattern could be extractable by this method, certainly nMDS can achieve the same goal without any presupposed pattern required by PLS.
Metric MDS methods may also be used, but they depend on the definition of the dissimilarity. Therefore, unless the measure of the dissimilarity is (almost) dictated by the data or by the context of the data analysis, arbitrary elements are introduced. Even if a dissimilarity measure is already given, if the measure is not positive definite (as in the case of the microarray data), the metric MDS requires its conversion to a positive definite metric. Thus, an arbitrary factor may further be introduced, and the metric supposedly capturing detailed information could carry spurious information (disinformation, so to speak) as well. Indeed, this extra factor can completely change the possible geometrical features in the data as can be seen from the following simple example. Take three samples A, B and C with the dissimilarities given as AB = 1, BC = 0 and AC = 1. To convert them into positive numbers we shift the origin. If we add 2, A, B and C can be arranged consistently in 1D. If we add 1.5, the triangle inequality is violated. If we add 3, it can be geometrically realized but not in 1D. nMDS is completely free from this problem, because adding a common number cannot change the ordering of the dissimilarities. Another disadvantage of metric MDS (and of linear multivariate methods) is that these methods are vulnerable to missing or grossly inaccurate data.
The application of non-metric MDS is a very natural idea, but there are only a small number of published examples. Furthermore, to our knowledge there is no paper analyzing large number of objects by nMDS. Dyrskjot et al., (2002) is a rare example of applying nMDS, but, in contrast to ours, they never used MDS to analyze genes; they analyzed 40 tumors with a package in SPSS. Our main purpose is to extract information on genes themselves.
The cluster analysis with the aid of self-organizing maps (SOMs) is definitely a non-linear data analysis method, but as we have seen in Kasturi et al., (2003) it is not generally suitable for extracting temporal order. Kasturi et al. comment that SOM is not particularly better than the ordinary cluster analysis. Besides, as can be seen from the fact that the use of a particular initial condition can be a methodological paper (Kanaya et al., 2001) we must worry about the ad hoc initial-condition dependence of the results.
| ALGORITHM |
|---|
|
|
|---|
Basic idea of algorithm for nMDS
The philosophy of nMDS Shepard, 1962a, b; Kruskal, 1964a, b) is to find a constellation in a certain metric space
of points representing the objects under study (genes in the present case) such that the pairwise distances d of the points in
have the rank order in closest agreement with the rank order of the pairwise dissimilarities
of the corresponding objects that are given as the raw (or the original) input data.
The conventional nMDS methods assume a certain intermediate pair distance
that is chosen as close as possible to d for a given object pair under the condition that it is monotone with respect to the actually given ordering of the dissimilarities
. The choice of
is not unique. The discrepancy between d and
is called the stress, and all the algorithms attempt to minimize it. Depending on the choice of
and on the interpretation of as close as, different methods have been proposed e.g. see (Green et al., 1970; Cox and Cox, 1994; Borg and Groenen, 1997). The choice of
affects the outcome.
is required only by technical reasons for implementation of the basic non-metric idea, so to be faithful to the original idea due to Shepard (1962a,b) we must compare
with d directly. Our motivation is to make an algorithm that is maximally non-metric in the sense that we get rid of
.
The basic idea of this purely non-metric algorithm is as follows (Y.-h.Taguchi and Y.Oono, unpublished data; Taguchi et al., 2004): in a metric space
(in this paper, D-dimensional Euclidean space RD is used) N points representing the N objects are placed as an initial configuration. For this initial trial configuration we compute the pair distances d(i,j), and then rank them according to their magnitudes. Comparing this ranking and that according to the dissimilarity data
(i,j), we compute an appropriate force that moves the points in
to reduce the discrepancy between these two rankings. After moving the points according to the forces, the new forces are computed again, and the whole adjusting process of the object positions in
is iterated until they converge sufficiently. The details are in the Appendix 1 section.
Non-metric MDS can usually recover geometrical objects correctly (up to scaling, orientation and direction) when there are sufficiently many (say,
30) objects. Furthermore, even with metric data converting them into the corresponding rank order data may be a general strategy to extract robust features in the metric data. Therefore, nMDS is a versatile multivariate analysis method.
It is desirable to have a criterion for convergence (analogous to the level of the stress in the conventional nMDS), or a measure of goodness of embedding. To this end let us recall the Kendall statistics K (Hollander and Wolfe, 1999, p. 364),
![]() |

, ß
[i.e.
(also ß)] denotes a pair of objects. Usually, this is used for a statistical test to reject the null hypothesis that {d(i,j)} does not correlate with {
(i,j)}. (The contribution of ties is usually negligible for large data set, so we do not pay any particular attention to tie data.)
Here, we use this value to estimate the number of objects embedded correctly. If N' objects are correctly embedded, and if we may assume that the rest are uncorrelated, then K is expected to be
![]() |


,ß
± 1 that is of order
, if we assume that the central limit theorem applies. This gives the latitude in the above inequalities. If the embedding is successful for the majority of the objects, then n' = O[N2], so we may ignore the contribution of the bad points. Thus, we may estimate
![]() |
as an indicator of the goodness of embedding. Although one might criticize that this is a crude measure of goodness, notice that MDS does not have any clear criterion of the goodness of embedding except for the so-called stress. In contrast to the stress, N' gives an effective number of correctly embedded points.
The plausibility of the above estimate of the number of correctly embedded points may be illustrated as follows: we prepare a data set of 200 objects whose subset A consisting of M objects is embeddable in a 3D space, but whose remaining 200 M objects are not. The subset A consists of M randomly positioned points in 3D; their x,y,z coordinates are uniform random numbers in [1,1]. The dissimilarity between objects i,j
A is defined by
![]() |
As can be seen from the results in Table 1, N' monotonically increases with M. The difference between them decreases as N M decreases. Actually,
. Thus, we can expect that N'/N is asymptotically an effective fraction of correctly embedded points in the N
limit with (N M)/N kept constant.
|
We must also discuss the initial configuration dependence of the result. Our algorithm is not free from the problem of local minima as all of the previously proposed algorithms for nMDS and as high-dimensional non-linear optimization problems in general. However, generally speaking, this dependence has only a very minor effect. This has been checked for the fibroblast data (see below).
| RESULTS |
|---|
|
|
|---|
We have found that the fibroblast data may be embedded in a two-dimensional space approximately as a ring (Fig. 2). The estimated number of correctly embedded genes is about 480 among all the 517 genes (i.e. the goodness of embedding is >90%). In addition, 516 out of 517 genes have P < 0.005 confidence level (see Appendix 1). One might wonder that this is too lax a criterion, because we should apply a multiple comparison criterion when we have so many number of genes. However, the P-value is still very small: we still have 515 genes with the P < 0.005 confidence level of correct embedding under multiple comparison criterion. That is, we need not change the gross upper bound estimate of P.
|
The obtained configuration is insensitive to the initial configurations. To see this we constructed two 2D embedding results starting from two different random initial configurations. With the aid of the Procrustean similarity transformation (Borg and Groenen, 1997) one result is fit to the other (notice that our procedure is non-metric, so to compare two independent results, appropriate scales, orientations, etc., must be optimally chosen). Figure 3 demonstrates the close agreements of x- and y-coordinates of the two results. As illustrated, the dependence on the initial conditions is very weak, and we may regard the embedded structure as a faithful representation of the information in the original data. Thus, we conclude that the obtained configuration is sufficiently reliable.
|
This ring-like arrangement of the genes faithfully represents the temporal expression patterns of the genes as can be seen clearly from the rotation of the expression peaks around the ring (Fig. 4a). It is noteworthy that the angle coordinate assigned to the genes according to the result shown in Figure 2 automatically gives the figure usually obtained only through detailed Fourier analysis (Fig. 4b). These figures attest to the usefulness of nMDS, a non-linear data mining method, for extracting non-trivial patterns in large-scale data sets.
One might criticize that the temporal pattern just mentioned is rather trivial and can be expected intuitively, because the temporal pattern may be recognized, if we pay attention to the peaks only. However, the obtained ordering of the genes along the ring (i.e. the angle coordinate) cannot simply be obtained by ordering genes according to their peak times alone. There are genes that exhibit peaks only at one time, so there is no way to order these genes sharing the peak time (just as there is no way to order genes in one cluster in cluster analyses without inspection). The detailed ordering obtained by nMDS reflects biologically meaningful events as discussed below.
Most genes directly related to cell cycle have peaks at 24 h. All of them are presented in Table 2; the information of the genes were obtained with the aid of AceView (http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/). Thus, we seem to observe the transition from G2 to G1 through M phase along the ring between angle variables 1.8 and 3 (needless to say, the angle coordinate difference of order 0.1 is not significant). That is, the ordering found by nMDS is consistent with some known functions of genes. Since the starved fibroblasts are supposedly in G1, perhaps one full cell cycle may have been captured in the original data. This may explain why the peak positions make one complete rotation. One might question that this full rotation is simply due to the use of all the data. Even if we use the data only up to 16 h from the start, we can actually almost reproduce the gene configuration exhibited in Figure 2. Obviously, the peaks up to 16 h cannot complete any full circle around it. Thus, the observed full rotation in Figure 4a is not an artifact of our data analysis. Thus, we may conclude that nMDS captures, as cluster analyses do, major expression patterns of the genes but with a more natural time ordering in the current example.
|
Although there is a recent paper claiming that the data we are analyzing can provide information about the Gene Ontology (Lagreid et al., 2003) we do not think the data can be used for this purpose (Appendix 3).
As has been clearly demonstrated, the 2D embedding is statistically natural and biologically informative. Still the 2D embedding is not perfect, so it is interesting to see what we might obtain by unfolding the 2D data, adding one more axis. The unfolded result is shown in Figure 5. Here, the angular coordinates
and
of the spherical coordinate system is determined by the xy-plane whose x-(resp., y-)axis is the first (resp., the second) principal component of the 3D embedded result. The total contribution of these two components is 86%. We do not recognize any clear pattern other than that captured in the 2D space. Therefore, we may conclude that the 2D embedding result is sufficiently reliable and informative.
|
However, one might wish to be more quantitative. Traditionally within the MDS methods, there is no very clear way to determine the dimensionality of the embedding space. Here, we propose a method utilizing the correlation among principal coordinates. Basically, the strategy is as follows. Suppose we embed identical data into n-dimensional and (n + 1)-dimensional spaces. Using the embedded results, we can construct principal axes with the aid of PCA. Then, we study the correlation coefficients of the coordinates of the points. It is usually the case that the first n principal axes of the (n + 1)-dimensional embedding result agree well (have high correlation coefficients) with the n principal axes of the n-dimensional embedding result. Now, we see the correlation for the (n + 1)-th axis of the (n + 1)-dimensional embedding result and n principal axes obtained from the n-dimensional embedding. If the correlations are low enough, we may say n-dimensional space captures the main features in the data. We apply this to n = 2. In Table 3 nDX denotes the X-th principal axis for the n-dimensional embedding. As can be seen from the table, we may conclude that 2D is enough to capture the main features in the data. We could even devise a statistical test based on the correlation coefficients, but we do not go into this further in this paper.
|
In this paper, we have demonstrated that nMDS has a capability of extracting a significant pattern without any supervision or preprocessing. However, the data we have analyzed is a very high-quality well-structured data. A natural question is what nMDS can offer for less well-structured data.
As an example, we here show the nMDS analysis result of Cho et al. (2001) on the cell cycle of human foreskin cells. Their data were analyzed in detail by Shedden and Cooper (2002). According to their analysis periodicity of the genes in Cho et al. is statistically insignificant; the data randomly permuted along the time axis exhibit periodic patterns of similar or greater strength than the original experimental data. As was done by Shedden and Cooper, we have selected 300 periodic genes from both original data and randomly time-reordered data. One should keep in mind that it was impossible to distinguish these two data by inspection. The two upper figures in Figure 6 are the results when we applied our nMDS to these two data sets.
|
First, one can notice that both of them exhibit ring-like structures. It is natural because periodic genes are selected. However, in contrast to the analysis by Shedden and Cooper, we can detect some differences between these two figures. Along the ring-like structure in the original data, the distribution of the genes is not uniform and has two dense regions placed diametrically opposite to each other. On the contrary, such a structure cannot be seen in the randomized data. Needless to say, the forced periodicity by an artificial selection of the data is detected in both cases as a ring structure, so one might conclude, as Shedden and Cooper did, the observed periodicity is not statistically significant. However, the difference between the original data and the randomized data clearly tells us the conclusion is premature.
To highlight the capability of nMDS, the same selected data were analyzed by PCA with the preprocessing discussed in the Appendix 2 section (the data standardization by Method 2) (the two middle figures in Fig. 6). We see much more diffuse rings than that obtained by nMDS. It is however interesting to note that PCA captures the non-uniformity of the distribution of genes noted in the above.
Thus, we may conclude that nMDS has a capability of detecting subtle differences. In the example here, this could be detected by PCA as well (with normalization), but it is clear that nMDS is much more crisp than PCA. However, the data analyzed here are not really the original data, but a subset of the original data selected by the good correlation to sinusoidal time dependence. Therefore, as duly criticized by Shedden and Cooper, the periodicity in the analyzed data can largely be an artifact.
To avoid this possible artifact we made a 300 gene subset consisting of genes with largest expression level variations to avoid possible noise effects. The 2D embedding result of these genes with the aid of nMDS is shown in the bottom left of Figure 6. Although diffuse, a ring-like structure is still visible. More interestingly, two distribution peaks diametrically placed on the ring are clearly visible. If we randomly permuted the same data along the time axis, the embedded result (bottom right) loses both the ring-like structure and the peaks. Therefore, we may conclude that the original data contains some information about the cell-cycle-related gene expression contrary to the conservative conclusion by Shedden and Cooper. We have no intention to criticize their study that was admirably critical and careful; we simply wish to demonstrate that nMDS can detect subtle structures.
Iyer et al. (1999) selected the genes to analyze according to the significance of the changes in their expression levels, so the temporal structure detected by our nMDS analysis is not an artifact of the methodology in contrast to that exhibited in Cho et al., (2001).
| CONCLUSION |
|---|
|
|
|---|
We have demonstrated that the nMDS can be a useful tool for data mining. It is unsupervised, and perhaps maximally non-linear. Our algorithm is probably the simplest among the non-metric MDS algorithms and is efficient enough to enable the analysis of a few thousand objects with a small laptop machine.
The nMDS algorithm works on the binary relations among the objects, so if there are N objects, computational complexity is of order N2 at least. Therefore, it is far slower than linear methods such as PCA, although our non-linear algorithm is practically fast enough; we have used for this work a small laptop machine (Mobile Celeron 650 MHz CPU with 256 MB RAM). Typically, the necessary number of iteration is about 100, and it takes only a few minutes for a few hundreds of genes. As has been pointed out and illustrated, with an appropriate data preprocessing a certain linear method could give us a reasonable result with less computational efforts. Although in this paper we have not made any particular effort to reduce computational requirements, a practical way to use nMDS might be to prepare an initial configuration by a linear method with an appropriate data preprocessing method that is verified to be consistent with the full nMDS results.
|
|
|---|
APPENDIX A
Purely non-metric MDS algorithm
Suppose d(i,j) is the distance between objects i and j in
. Let the ranking of
(i,j) among all the input dissimilarity data be n and that of d(i,j) among all the distances between embedded pairs be Tn. If n < Tn (resp., n < Tn), we wish to push the pair i and j farther apart (resp., closer) in
. Intuitively speaking, to this end we introduce an overdamped dynamics of the points in
driven by the following potential function:
![]() |
to make the force simpler; we have selected the one that makes the force formula very simple, if not the simplest.
The
defined by the formula above may be regarded as a counterpart of the stress in the conventional nMDS. As we will see later we can use quantities related to
to evaluate the confidence level of the resultant configuration. However, we do not employ
itself, because this quantity is not the difference between two ranks of independent objects; there are NC2
ij or dij but they are relations only among n potentially independent quantities. Therefore, the statistics of
is not known.
Thus, although certain modifications as discussed above are needed, still in our nMDS algorithm, the optimization process is closely connected to a process that improves the confidence level of the resultant configuration.
The pure nMDS algorithm for n objects may be described as follows:
- Dissimilarities
ij (i, j = 1, ..., N) for n objects are given. Order them as follows: 
- Put n points randomly in
as an initial configuration.
- Scale the position vectors in
such as
, where ri is the current position of object i in
.
- Compute dij for all object pairs (i,j) in
, and then order them as 
- Suppose
ij is the m-th largest in the ordering in (1) and dij is the Tm-th largest in the ordering in (4). Assign Cij = Tm m. Calculate the following displacement vector for i:
where s = 0.1 x N3 typically, and update ri
ri +
ri
- Return to (3), and continue until the potential energy
becomes sufficiently small.
The reader may worry about the handling of tie data. Generally speaking, for a large data set the fraction of tie relations is not significant; furthermore, if the result depends on the handling schemes of tie data, the result is unreliable anyway. Therefore, we do not pay any particular attention to the tie data problem.
In the above algorithm, s is a constant value. In practice, we could choose an appropriate schedule to vary s as is often done in optimization processes. In this paper, for simplicity, we do not attempt such a fine tuning.
In the above algorithm, we can deal with asymmetric data as well, i.e.
ij
ji if we compare
ij with dij while
ji with dji (= dij). Needless to say, if the mismatch between
ij and
ji is large, then representing the pair by a pair of points in a metric space is questionable. Therefore, we will not discuss this problem any further in this paper. The microarray data analyzed in this paper do not have such a problem in any case.
Goodness of embedding
In the text we have already discussed the effective number of correctly embedded objects as a measure of global goodness of embedding. This measure, however, cannot tell us the embedding quality of each object. It is often the case that the majority of objects are embedded well even without sensitive dependence on the initial conditions, but there are a few objects that consistently refuse to be embedded stably. To judge the quality of embedding for each object j we define
![]() |
(i,j) among N 1 pairs (i,j) for a given j, and Tn(j)(j) is the rank order of d(i,j) among N 1 pairs (i,j) for the same j.
(j) can be regarded as a statistical variable for the relative position of the j-th object with respect to the remaining objects (Lehmann, 1975). We can estimate the probability P(
) of
(j) <
with the null hypothesis that the rank ordering of dij (i
{1,2, ..., N} \ {j}) is totally random with respect to the rank ordering of
ij (i
{1,2, ..., N}\{j}). If n is sufficiently large, then
(j) obeys the normal distribution with mean (M3 M)/6 and variance M2 (M + 1)2 (M 1)/36, where M
N 1. For smaller N there is a table for P(
) (Lehmann, 1975). Thus, we can test the null hypothesis with a given confidence level for j-th object.
APPENDIX 2
Limitations and capabilities of linear methods
The limitations and capabilities of PCA with and without data preprocessing are illustrated in this appendix. There is no fundamental difference between PCA and SVD. We consider the following artificial data {sgt}, where g (=1,...,517) denote genes and t (=1,...,11) the observation times:
- [Data set 1]

- [Data set 2]

- [Data set 3]

- [Data set 2]
g, Cig and
ig (i = 1,2,3) are uniform random numbers in [0,1]. That is, Data set 1 is a set of sinusoidal waves with random amplitudes and phases, Data set 2 is the nonlinearly distorted Data set 1, and Data set 3 is a set of periodic functions that are very different from simple oscillatory behaviors. These data sets are analyzed by the following methods:
- Method 1: PCA with the preprocessing used by Holter et al., 2000). The preprocessing procedure is as follows:
- Step 1: Subtract the average,
where
g,t is the average over all genes and experiments, 
- Step 2 (Column normalization): Normalize the data as

- Step 3 (Row normalization): Normalize the data as
Repeat these steps until the following condition is satisfied,
From the resultant sgt correlation matrix Matr.(Cort t') is constructed, and then PCA is performed.
- Step 2 (Column normalization): Normalize the data as
- Method 2: PCA with the preprocessing so that
t sgt = 0 and
for all g. Of course, no iteration is needed for this preprocessing. From the resultant sgt correlation matrix Matr.(Cortt') is constructed, and then PCA is performed.
- Method 3: nMDS as done in the text. That is, the negative of the correlation coefficient Corgg' is used as the dissimilarity and nMDS is applied straightforwardly. Needless to say, no preprocessing of data is needed.
- Step 1: Subtract the average,
The results are shown in Figure 7. The conclusions may be:
- For Data set 1, any method will do.
- For Data set 2, the procedure recommended by (Holter et al., 2000) fails, although ironically simpler Method 2 still works very well. If the amplitude C is distributed in [0,5] instead of [0,1] (i.e. the extent of the non-linear distortion is increased), Method 2 becomes inferior to Method 3, but still Method 2 is adequate.
- For Data set 3, even Method 2 fails. nMDS (Method 3) still exhibits a ring-like structure. The method recommended by (Holter et al., 2000) is obviously out of question.
|
Thus, we may conclude that nMDS is a versatile and all-around data mining method for analyzing periodic temporal data. Furthermore, we can point out that the preprocessing method in Method 1 should not be used, because it could severely distort the original data (as may have been expected from the figures). Suppose there are n genes and four time points. Consider the following example (for the counterexample sake). The first gene has (a,b,b,a) (a > b > 0), and the remaining genes are all give by (1,0,0,1). The N x 4 matrix made from these vectors is polished by an iterative row and column vector normalization procedure. If n is sufficiently large, the first row converges to (0,1,1,0) and the rest to (1,0,0,1), independent of a and b. If b is small, then all the vectors should behave almost the same way, but after polishing the out-of-phase component in the discrepancy between the first row and the rest is enhanced dramatically, resulting in a spurious out of phase temporal behavior. Although the preceding exercise is trivial, the result warns us the danger of using the so-called polishing.
APPENDIX 3
Gene ontology
Since we have seen Lagreid et al. (2003) claiming that gene ontology can be discovered in the same data we are analyzing, we tried to extract more biological information from the nMDS result. We have realized that the distribution of ontologically identified genes (or that of the distribution of 23 ontology tags adopted by Lagreid et al.) along the ring is statistically indistinguishable from the uniform distribution. In this sense, nMDS fails to capture ontological information of the genes. This might suggest that our nMDS method is definitely inferior to the analysis based on the rough set adopted by the above paper. However, we have found that the claimed success is largely illusory. The claimed success is: of the 24 genes for which homology information allows inference of their ontology, 11 genes had one or more classifications that matched this assumption (Table 10). However, actually 8 ± 2.4 is within one
error (note that out of 210 genes used to train the rules based on the rough set, each gene on the average share an ontology with about 70 other genes). That is, 11 out of 24 is statistically hardly significant (the cross-validation does not mean very much because it is only a consistency check of the method). Therefore, nMDS does not fail to capture the significant information that can be obtained by the existing statistical methods.
Thus, we may conclude that the data we have analyzed may be sufficiently informative to follow the cell-cycle-related events, but not to annotate genes ontologically.
Received on July 19, 2003; revised on September 17, 2004; accepted on September 18, 2004
| REFERENCES |
|---|
|
|
|---|
Borg, I. and Groenen, P. Modern Multidimensional Scaling, (1997) , NY Springer.
Cho, R.J., Huang, M., Campbell, M.J., Dong, H., Steinmetz, L., Sapinoso, L., Hampton, G., Elledge, S.J., Davis, R.W., Lockhart, D.J. (2001) Transcriptional regulation and function during the human cell cycle. Nat. Genet., 27, , pp. 4854[ISI][Medline].
Cox, T.F. and Cox, M.A.A. Multidimensional Scaling, (1994) , London Chapman and Hall.
Donoho, D.L., Vetterli, M., DeVore, R.A., Daubechies, I. (1998) Data compression and harmonic analysis. IEEE Trans. Inform. Theory, 44, , pp. 24352476[CrossRef].
Dyrskjot, L., Thykjaer, T., Kruhoffer, M., Jensen, J.L., Marcussen, N., Hamilton-Dutoit, S., Wolf, H., Orntoft, T.F. (2002) Identifying distinct classes of bladder carcinoma using microarrays. Nat. Genet., 33, 9096.
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci., USA, 95, 1486314868
Green, P.E., Carmone, F.J., Jr, Smith, S.M. Multidimensional Scaling: Concepts and Applications, (1970) , Boston, MA Allyn and Bacon.
Hollander, M. and Wolfe, D.A. Nonparametric Statistical Methods, (1999) , NY John Wiley & Sons.
Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R., Fedoroff, N.V. (2000) Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl Acad. Sci., USA, 97, , pp. 84098414
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T.L., Jeffrey, C.F., Trent, J.M., Staudt, L.M., Hudson, J., Jr, Boguski, M.S., et al. (1999) The transcriptional program in the response of human fibroblasts to serum. Science, 283, 8387
Johansson, D., Lindgren, P., Beglund, A. (2003) A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics, 19, 467473
Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H., Ikemura, T. (2001) Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene, 276, 8999[CrossRef][ISI][Medline].
Kasturi, J., Acharya, R., Ramanathan, M. (2003) An information theoretic approach for analyzing temporal patterns of gene expression. Bioinformatics, 19, 449458
Kruskal, J.B. (1964a) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 127[CrossRef][ISI].
Kruskal, J.B. (1964b) Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29, 115129[CrossRef][ISI].
Lagreid, A., Hvidsten, T.R., Midelfart, H., Komorowski, J., Sandvik, A.K. (2003) Predicting gene ontology biological process from temporal gene expression patterns. Genome Res., 13, 965979
Lehmann, E.L. Nonparametrics, (1975) , San Francisco, CA Holden-Day.
Shedden, K. and Cooper, S. (2002) Analysis of cell-cycle-specific gene expression in human cells as determined by microarray and double-thymidine block synchronization. Proc. Natl Acad. Sci., USA, 99, , pp. 43794384
Shepard, R.N. (1962a) The analysis proximities: multidimensional scaling with an unknown distance function, I. Psychometrika, 27, 125140[CrossRef].
Shepard, R.N. (1962b) The analysis proximities: multidimensional scaling with an unknown distance function, II. Psychometrika, 27, 219246[CrossRef].
Shmulevich, I. and Zhang, W. (2002) Binary analysis and optimization-based normalization of gene expression data. Bioinformatics, 18, 555565
Slonim, D.K. (2002) From patterns to pathways: gene expression data analysis of age. Nat. Genet. Suppl., 32, 502508[CrossRef].
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B. (1998) Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 32733297
Taguchi, Y-h. and Oono, Y. (2004) Nonmetric multidimensional scaling as a data-mining Tool: new algorithm and new targets. In Toda, M., Komatsuzaki, T., Konishi, T., Rice, R.S., Berry, S.A. (Eds.). Geometrical Structures of Phase Space Multidimensional Chaos, 130, , pp. 315351 Special Volume of Adv. Chem. Phys..
Taguchi, Y-h., Oono, Y., Yokoyama, K. (2001) New possibilities of non-metric multidimensional scaling. Proc. Inst. Stat. Math., 49, 133153 (in Japanese).
This article has been cited by other articles:
![]() |
P. F. M. Krabbe, J. A. Salomon, and C. J. L. Murray Quantification of Health States with Rank-Based Nonmetric Multidimensional Scaling Med Decis Making, August 1, 2007; 27(4): 395 - 405. [Abstract] [PDF] |
||||
![]() |
M. A. Zapala and N. J. Schork Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables PNAS, December 19, 2006; 103(51): 19430 - 19435. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














