Skip Navigation


Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(6):730-740; doi:10.1093/bioinformatics/bti067
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/6/730    most recent
bti067v2
bti067v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (10)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Taguchi, Y.-h.
Right arrow Articles by Oono, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Taguchi, Y.-h.
Right arrow Articles by Oono, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Relational patterns of gene expression via non-metric multidimensional scaling analysis

Y.-h. Taguchi 1,2,* and Y. Oono 3,4

1Department of Physics, Faculty of Science and Technology, Chuo University 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
2Institute for Science and Technology, Chuo University 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
3Department of Physics 1110 W. Green Street, Urbana, IL 61801, USA
4Institute for Genomic Biology, University of Illinois at Urbana-Champaign Urbana, IL 61801, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 

Motivation: Microarray experiments result in large-scale data sets that require extensive mining and refining to extract useful information. We demonstrate the usefulness of (non-metric) multidimensional scaling (MDS) method in analyzing a large number of genes. Applying MDS to the microarray data is certainly not new, but the existing works are all on small numbers (<100) of points to be analyzed. We have been developing an efficient novel algorithm for non-metric MDS (nMDS) analysis for very large data sets as a maximally unsupervised data mining device. We wish to demonstrate its usefulness in the context of bioinformatics (unraveling relational patterns among genes from time series data in this paper).

Results: The Pearson correlation coefficient with its sign flipped is used to measure the dissimilarity of the gene activities in transcriptional response of cell-cycle-synchronized human fibroblasts to serum. These dissimilarity data have been analyzed with our nMDS algorithm to produce an almost circular relational pattern of the genes. The obtained pattern expresses a temporal order in the data in this example; the temporal expression pattern of the genes rotates along this circular arrangement and is related to the cell cycle. For the data we analyze in this paper we observe the following. If an appropriate preparation procedure is applied to the original data set, linear methods such as the principal component analysis (PCA) could achieve reasonable results, but without data preprocessing linear methods such as PCA cannot achieve a useful picture. Furthermore, even with an appropriate data preprocessing, the outcomes of linear procedures are not as clear-cut as those by nMDS without preprocessing.

Availability: The FORTRAN source code of the method used in this analysis (pure nMDS) is available at http://www.granular.com/MDS/

Contact: tag{at}granular.com

Supplementary information: http://www.granular.com/MDS/B1_2005.


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 
Each DNA microarray experiment can give us information about the relative populations of mRNAs for thousands of genes. This implies that without extensive data mining it is often hard to recognize any useful information from the experimental results. In this paper, we demonstrate that a non-metric multidimensional scaling (nMDS) method can be a powerful unsupervised means to extract relational patterns in gene expression. A data mining procedure may be useful, if it is flexible enough to incorporate any level of supervision, but we believe that the most basic feature required for any good data mining method is to be able to extract recognizable significant patterns reproducibly without supervision. In this sense our nMDS method is clearly demonstrated to be a useful means of data mining.

Since MDS is a very natural tool to visualize the salient features in the data in general, initially we expected many existing published works but to our knowledge actually there are not so many. Besides, these existing examples [only a few published papers in the field of bioinformatics, e.g. Dyrskjot et al., (2002)], are mostly metric MDS examples and the number of points to be imbedded is small (~30). Shmulevich and Zhang, (2002) applies a non-metric MDS method, but again they classify tumor types. That is, they study <100 points to embed (visualize).

We have been developing an efficient nMDS technique for large data sets (Y.-h.Taguchi and Y.Oono, unpublished data, http://www.granular.com/MDS/src/paper.pdf; Taguchi et al., 2001; Taguchi and Oono, 2004). The input is the rank order of (dis)similarities among the objects (genes in the present case). Our algorithm is maximally non-metric in the sense that any introduction of intermediate metric coordinates obtained by monotone regression common to the conventional nMDS methods is avoided.

Data compression is essentially a problem of linear functional analysis as Donoho et al., (1998) stresses. In contrast, we believe that non-linear methods should be important to data mining. There are linear algebraic methods such as the principal component analysis (PCA) for data mining, but it is expected that non-linear methods are, in principle, more powerful than linear methods. The present paper illustrates this point. Indeed, in our case PCA cannot find any comprehensible pattern in low-dimensional spaces without an appropriate data polishing.


    SYSTEMS AND METHODS
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 
Systems to analyze
The gene activities in transcriptional response of cell-cycle-synchronized human fibroblasts to serum reported by Iyer et al., (1999) are analyzed. The microarray data used in this analysis is available at http://genome-www.stanford.edu/serum/fig2data.txt

Other possible analysis methods
To extract interpretable patterns from microarray data, cluster and linear multivariate analyses seem to be the two major strategies. However, these methods may not always work as shown here in the example of the analysis of time series data.

The cluster analysis seems to be the most common approach (Slonim, 2002). For example, the hierarchical clustering method seems to be popular (Eisen et al., 1998). Classifying the expression patterns as functions of time is often attempted by clustering methods (Spellman et al., 1998). However, it is difficult to see continuous relationship among genes by cluster analysis, because it must classify genes into a limited number of categories. For example, it is difficult to see how genes are expressed temporally. In contrast to clustering analyses, nMDS can place genes in a continuous space, so we can easily recognize the relationship among genes that change in a continuous manner. Thus, the nMDS method can capture features which cluster analysis is hard to find.

The PCA is a method that visualizes the relationship among genes in a continuous vector space. The main idea is to choose a data-adapted basis set, and to make a subspace that can capture salient features of the original data set. In principle, the method could capture the continuous relationship in the gene expression pattern, but the dimension of the subspace may not be low even if the data are on a very low-dimensional manifold. In short, the information compression capability of linear methods is generally weak. This can be illustrated well by the data we wish to analyze in this paper: PCA does not capture the temporal order as clearly as nMDS. Figure 1 exhibits the two-dimensional space spanned by the first two principal components. The temporal expression pattern is hardly recognized in the result. This should be compared with the nMDS results in Figure 4.



View larger version (7K):
[in this window]
[in a new window]
 
Fig. 1 PCA results using correlation coefficient matrix. The first two principal components are used as the horizontal and vertical axis, respectively (the cumulative proportion is 70%). Genes whose experimental values are larger than 3.2 are drawn using filled boxes, otherwise, using small dots (the corresponding color figure is available online). From the top the time is, respectively, 15 min, 30 min, 1 h, 2 h, 4 h, 6 h, 8 h, 12 h, 16 h, 20 h and 24 h. The figures are arranged in two columns, but this is solely for the layout purpose; there is no distinction between two columns.

 


View larger version (23K):
[in this window]
[in a new window]
 
Fig. 4 (a) Temporal patterns of gene expression levels visualized with the aid of nMDS. Genes whose experimental values are larger than 1.2 are drawn using filled boxes, otherwise, using small dots. Time sequences are the same as explained in Figure 1. (b) Gene expression data as a function of the angle measured from the vertical axis in (a). The horizontal axis corresponds to t. Genes whose experimental values are larger than 1.6 are drawn using filled boxes, otherwise, using small dots. The figures are arranged in two columns, but this is solely for the layout purpose; there is no distinction between two columns.

 
Apparently, however, some linear analysis methods work well. Holter et al., (2000) demonstrated that the singular value decomposition (SVD; a linear method) is remarkably successful in extracting the characteristic modes. The reader must wonder why there is a difference between this result and the one due to PCA that is not successful. The secret is in the highly non-linear ‘polishing’ of the original data (proposed by Eisen et al., 1998). However, the role of this non-linear polishing must be considered carefully, because it can generate a spurious temporal behavior. Therefore, we relegate the comparison of these linear methods with data preprocessing and our nMDS in the Appendix 2 section. The salient conclusions are as follows:
  1. Linear methods such as PCA and SVD could perhaps achieve reasonable results, if a data preparation scheme is chosen appropriately. However, even the best linear results are generally fairly inferior to non-linear results.
  2. The data preparation such as the ‘polishing’ used by Holter et al., 2000 could actually corrupt the original data (as illustrated in the Appendix 2 section), and should be avoided.

An interesting proposal to extract a temporal order is to use the partial least squares (PLS) regression (Johansson et al., 2003). In this case one may assume a temporal order one wishes to extract (say, a sinusoidal change in time), and the original data are organized around the expected pattern. This is, so to speak, to analyze a set of data according to a certain prejudice. Although during the process of organization no supervision is needed, the pattern to be extracted (i.e. the ‘prejudice’) must be presupposed. Furthermore, if a clear objective pattern could be extractable by this method, certainly nMDS can achieve the same goal without any presupposed pattern required by PLS.

Metric MDS methods may also be used, but they depend on the definition of the dissimilarity. Therefore, unless the measure of the dissimilarity is (almost) dictated by the data or by the context of the data analysis, arbitrary elements are introduced. Even if a dissimilarity measure is already given, if the measure is not positive definite (as in the case of the microarray data), the metric MDS requires its conversion to a positive definite metric. Thus, an arbitrary factor may further be introduced, and the metric supposedly capturing detailed information could carry spurious information (disinformation, so to speak) as well. Indeed, this extra factor can completely change the possible geometrical features in the data as can be seen from the following simple example. Take three samples A, B and C with the dissimilarities given as AB = –1, BC = 0 and AC = 1. To convert them into positive numbers we shift the origin. If we add 2, A, B and C can be arranged consistently in 1D. If we add 1.5, the triangle inequality is violated. If we add 3, it can be geometrically realized but not in 1D. nMDS is completely free from this problem, because adding a common number cannot change the ordering of the dissimilarities. Another disadvantage of metric MDS (and of linear multivariate methods) is that these methods are vulnerable to missing or grossly inaccurate data.

The application of non-metric MDS is a very natural idea, but there are only a small number of published examples. Furthermore, to our knowledge there is no paper analyzing large number of objects by nMDS. Dyrskjot et al., (2002) is a rare example of applying nMDS, but, in contrast to ours, they never used MDS to analyze genes; they analyzed 40 tumors with a package in SPSS. Our main purpose is to extract information on genes themselves.

The cluster analysis with the aid of self-organizing maps (SOMs) is definitely a non-linear data analysis method, but as we have seen in Kasturi et al., (2003) it is not generally suitable for extracting temporal order. Kasturi et al. comment that SOM is not particularly better than the ordinary cluster analysis. Besides, as can be seen from the fact that the use of a particular initial condition can be a methodological paper (Kanaya et al., 2001) we must worry about the ad hoc initial-condition dependence of the results.


    ALGORITHM
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 
Basic idea of algorithm for nMDS
The philosophy of nMDS Shepard, 1962a, b; Kruskal, 1964a, b) is to find a constellation in a certain metric space R of points representing the objects under study (genes in the present case) such that the pairwise distances d of the points in R have the rank order in closest agreement with the rank order of the pairwise dissimilarities {delta} of the corresponding objects that are given as the raw (or the original) input data.

The conventional nMDS methods assume a certain intermediate pair distance that is chosen as close as possible to d for a given object pair under the condition that it is monotone with respect to the actually given ordering of the dissimilarities {delta}. The choice of is not unique. The discrepancy between d and is called the stress, and all the algorithms attempt to minimize it. Depending on the choice of and on the interpretation of ‘as close as’, different methods have been proposed e.g. see (Green et al., 1970; Cox and Cox, 1994; Borg and Groenen, 1997). The choice of affects the outcome. is required only by technical reasons for implementation of the basic non-metric idea, so to be faithful to the original idea due to Shepard (1962a,b) we must compare {delta} with d directly. Our motivation is to make an algorithm that is maximally non-metric in the sense that we get rid of .

The basic idea of this ‘purely non-metric’ algorithm is as follows (Y.-h.Taguchi and Y.Oono, unpublished data; Taguchi et al., 2004): in a metric space R (in this paper, D-dimensional Euclidean space RD is used) N points representing the N objects are placed as an initial configuration. For this initial trial configuration we compute the pair distances d(i,j), and then rank them according to their magnitudes. Comparing this ranking and that according to the dissimilarity data {delta}(i,j), we compute an appropriate ‘force’ that moves the points in R to reduce the discrepancy between these two rankings. After moving the points according to the ‘forces’, the new ‘forces’ are computed again, and the whole adjusting process of the object positions in R is iterated until they converge sufficiently. The details are in the Appendix 1 section.

Non-metric MDS can usually recover geometrical objects correctly (up to scaling, orientation and direction) when there are sufficiently many (say, ≥30) objects. Furthermore, even with metric data converting them into the corresponding rank order data may be a general strategy to extract robust features in the metric data. Therefore, nMDS is a versatile multivariate analysis method.

It is desirable to have a criterion for convergence (analogous to the level of the stress in the conventional nMDS), or a measure of goodness of embedding. To this end let us recall the Kendall statistics K (Hollander and Wolfe, 1999, p. 364),

where the summation is over all the pairs of dissimilarities (distances between objects) <{alpha}, ß> [i.e. {alpha} (also ß)] denotes a pair of objects. Usually, this is used for a statistical test to reject the null hypothesis that {d(i,j)} does not correlate with {{delta}(i,j)}. (The contribution of ties is usually negligible for large data set, so we do not pay any particular attention to tie data.)

Here, we use this value to estimate the number of objects embedded correctly. If N' objects are correctly embedded, and if we may assume that the rest are uncorrelated, then K is expected to be

where n' = N'(N' – 1)/2. This can be shown as follows. Since N' is correctly embedded, n' pairs must be correctly ordered, so the contribution from these correct points to K is n'. We assume the rest is random. Their contribution is the random sum bounded by {sum}<{alpha}> ± 1 that is of order , if we assume that the central limit theorem applies. This gives the latitude in the above inequalities. If the embedding is successful for the majority of the objects, then n' = O[N2], so we may ignore the contribution of the bad points. Thus, we may estimate

Therefore, we adopt as an indicator of the goodness of embedding. Although one might criticize that this is a crude measure of goodness, notice that MDS does not have any clear criterion of the goodness of embedding except for the so-called stress. In contrast to the stress, N' gives an effective number of correctly embedded points.

The plausibility of the above estimate of the number of correctly embedded points may be illustrated as follows: we prepare a data set of 200 objects whose subset A consisting of M objects is embeddable in a 3D space, but whose remaining 200 – M objects are not. The subset A consists of M randomly positioned points in 3D; their x,y,z coordinates are uniform random numbers in [–1,1]. The dissimilarity between objects i,j A is defined by

The dissimilarities between the points both of which are not in A are chosen to be the square root of random numbers uniformly distributed in [0,2]. With these choices, the variance of the dissimilarities is unity for both cases. We embedded these 200 objects into a 3D space using these dissimilarities for several values of M.

As can be seen from the results in Table 1, N' monotonically increases with M. The difference between them decreases as N M decreases. Actually, . Thus, we can expect that N'/N is asymptotically an effective fraction of correctly embedded points in the N -> {infty} limit with (N M)/N kept constant.


View this table:
[in this window]
[in a new window]
 
Table 1 The dependence of the effective number N' of correctly embedded points upon the number M of geometrically embeddable points

 
We must also discuss the initial configuration dependence of the result. Our algorithm is not free from the problem of local minima as all of the previously proposed algorithms for nMDS and as high-dimensional non-linear optimization problems in general. However, generally speaking, this dependence has only a very minor effect. This has been checked for the fibroblast data (see below).


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 
We have found that the fibroblast data may be embedded in a two-dimensional space approximately as a ring (Fig. 2). The estimated number of correctly embedded genes is about 480 among all the 517 genes (i.e. the goodness of embedding is >90%). In addition, 516 out of 517 genes have P < 0.005 confidence level (see Appendix 1). One might wonder that this is too lax a criterion, because we should apply a multiple comparison criterion when we have so many number of genes. However, the P-value is still very small: we still have 515 genes with the P < 0.005 confidence level of correct embedding under multiple comparison criterion. That is, we need not change the gross upper bound estimate of P.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 2 Two-dimensional embedding result obtained using nMDS.

 
The obtained configuration is insensitive to the initial configurations. To see this we constructed two 2D embedding results starting from two different random initial configurations. With the aid of the Procrustean similarity transformation (Borg and Groenen, 1997) one result is fit to the other (notice that our procedure is non-metric, so to compare two independent results, appropriate scales, orientations, etc., must be optimally chosen). Figure 3 demonstrates the close agreements of x- and y-coordinates of the two results. As illustrated, the dependence on the initial conditions is very weak, and we may regard the embedded structure as a faithful representation of the information in the original data. Thus, we conclude that the obtained configuration is sufficiently reliable.



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 3 Comparison between the nMDS embedding results with two different initial configurations after Procrustean similarity transformation. The horizontal (resp., vertical) coordinates are compared in the left (resp., right) figure. In each figure, x-axis corresponds to the result from one initial condition and the y-axis the other.

 
This ring-like arrangement of the genes faithfully represents the temporal expression patterns of the genes as can be seen clearly from the rotation of the expression peaks around the ring (Fig. 4a). It is noteworthy that the angle coordinate assigned to the genes according to the result shown in Figure 2 automatically gives the figure usually obtained only through detailed Fourier analysis (Fig. 4b). These figures attest to the usefulness of nMDS, a non-linear data mining method, for extracting non-trivial patterns in large-scale data sets.

One might criticize that the temporal pattern just mentioned is rather trivial and can be expected intuitively, because the temporal pattern may be recognized, if we pay attention to the peaks only. However, the obtained ordering of the genes along the ring (i.e. the angle coordinate) cannot simply be obtained by ordering genes according to their peak times alone. There are genes that exhibit peaks only at one time, so there is no way to order these genes sharing the peak time (just as there is no way to order genes in one cluster in cluster analyses without inspection). The detailed ordering obtained by nMDS reflects biologically meaningful events as discussed below.

Most genes directly related to cell cycle have peaks at 24 h. All of them are presented in Table 2; the information of the genes were obtained with the aid of AceView (http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/). Thus, we seem to observe the transition from G2 to G1 through M phase along the ring between angle variables 1.8 and 3 (needless to say, the angle coordinate difference of order 0.1 is not significant). That is, the ordering found by nMDS is consistent with some known functions of genes. Since the starved fibroblasts are supposedly in G1, perhaps one full cell cycle may have been captured in the original data. This may explain why the peak positions make one complete rotation. One might question that this full rotation is simply due to the use of all the data. Even if we use the data only up to 16 h from the start, we can actually almost reproduce the gene configuration exhibited in Figure 2. Obviously, the peaks up to 16 h cannot complete any full circle around it. Thus, the observed full rotation in Figure 4a is not an artifact of our data analysis. Thus, we may conclude that nMDS captures, as cluster analyses do, major expression patterns of the genes but with a more natural time ordering in the current example.


View this table:
[in this window]
[in a new window]
 
Table 2 Relationship between gene functions and the two-dimensional arrangement

 
Although there is a recent paper claiming that the data we are analyzing can provide information about the Gene Ontology (Lagreid et al., 2003) we do not think the data can be used for this purpose (Appendix 3).

As has been clearly demonstrated, the 2D embedding is statistically natural and biologically informative. Still the 2D embedding is not perfect, so it is interesting to see what we might obtain by ‘unfolding’ the 2D data, adding one more axis. The unfolded result is shown in Figure 5. Here, the angular coordinates {varphi} and {theta} of the spherical coordinate system is determined by the xy-plane whose x-(resp., y-)axis is the first (resp., the second) principal component of the 3D embedded result. The total contribution of these two components is 86%. We do not recognize any clear pattern other than that captured in the 2D space. Therefore, we may conclude that the 2D embedding result is sufficiently reliable and informative.



View larger version (19K):
[in this window]
[in a new window]
 
Fig. 5 3D unfolding of the temporal pattern of gene expression level with the aid of nMDS (3D). Experimental values are normalized as explained in Figure 3. Genes whose experimental values are larger than 1.6 are drawn using filled boxes, otherwise drawn using small dots (the corresponding color figure is available online). The horizontal (resp., vertical) axis represents {varphi} (resp., {theta}). See the text for details.

 
However, one might wish to be more quantitative. Traditionally within the MDS methods, there is no very clear way to determine the dimensionality of the embedding space. Here, we propose a method utilizing the correlation among principal coordinates. Basically, the strategy is as follows. Suppose we embed identical data into n-dimensional and (n + 1)-dimensional spaces. Using the embedded results, we can construct principal axes with the aid of PCA. Then, we study the correlation coefficients of the coordinates of the points. It is usually the case that the first n principal axes of the (n + 1)-dimensional embedding result agree well (have high correlation coefficients) with the n principal axes of the n-dimensional embedding result. Now, we see the correlation for the (n + 1)-th axis of the (n + 1)-dimensional embedding result and n principal axes obtained from the n-dimensional embedding. If the correlations are low enough, we may say n-dimensional space captures the main features in the data. We apply this to n = 2. In Table 3 nDX denotes the X-th principal axis for the n-dimensional embedding. As can be seen from the table, we may conclude that 2D is enough to capture the main features in the data. We could even devise a statistical test based on the correlation coefficients, but we do not go into this further in this paper.


View this table:
[in this window]
[in a new window]
 
Table 3 The correlation coefficients between principal axes of embedding in D-dimensional space

 
In this paper, we have demonstrated that nMDS has a capability of extracting a significant pattern without any supervision or preprocessing. However, the data we have analyzed is a very high-quality well-structured data. A natural question is what nMDS can offer for less well-structured data.

As an example, we here show the nMDS analysis result of Cho et al. (2001) on the cell cycle of human foreskin cells. Their data were analyzed in detail by Shedden and Cooper (2002). According to their analysis periodicity of the genes in Cho et al. is statistically insignificant; the data randomly permuted along the time axis exhibit periodic patterns of similar or greater strength than the original experimental data. As was done by Shedden and Cooper, we have selected 300 periodic genes from both original data and randomly time-reordered data. One should keep in mind that it was impossible to distinguish these two data by inspection. The two upper figures in Figure 6 are the results when we applied our nMDS to these two data sets.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 6 Example of less well-structured data. Upper row, nMDS; middle row, PCA with normalization (Method 2 in the Appendix 2 section); lower row: most expressive/depressive genes (with larger variance): left column: original data; right column: randomized data. See the text for details.

 
First, one can notice that both of them exhibit ring-like structures. It is natural because periodic genes are selected. However, in contrast to the analysis by Shedden and Cooper, we can detect some differences between these two figures. Along the ring-like structure in the original data, the distribution of the genes is not uniform and has two dense regions placed diametrically opposite to each other. On the contrary, such a structure cannot be seen in the randomized data. Needless to say, the forced periodicity by an artificial selection of the data is detected in both cases as a ring structure, so one might conclude, as Shedden and Cooper did, the observed periodicity is not statistically significant. However, the difference between the original data and the randomized data clearly tells us the conclusion is premature.

To highlight the capability of nMDS, the same selected data were analyzed by PCA with the preprocessing discussed in the Appendix 2 section (the data standardization by Method 2) (the two middle figures in Fig. 6). We see much more diffuse rings than that obtained by nMDS. It is however interesting to note that PCA captures the non-uniformity of the distribution of genes noted in the above.

Thus, we may conclude that nMDS has a capability of detecting subtle differences. In the example here, this could be detected by PCA as well (with normalization), but it is clear that nMDS is much more crisp than PCA. However, the data analyzed here are not really the original data, but a subset of the original data selected by the good correlation to sinusoidal time dependence. Therefore, as duly criticized by Shedden and Cooper, the periodicity in the analyzed data can largely be an artifact.

To avoid this possible artifact we made a 300 gene subset consisting of genes with largest expression level variations to avoid possible noise effects. The 2D embedding result of these genes with the aid of nMDS is shown in the bottom left of Figure 6. Although diffuse, a ring-like structure is still visible. More interestingly, two distribution peaks diametrically placed on the ring are clearly visible. If we randomly permuted the same data along the time axis, the embedded result (bottom right) loses both the ring-like structure and the peaks. Therefore, we may conclude that the original data contains some information about the cell-cycle-related gene expression contrary to the conservative conclusion by Shedden and Cooper. We have no intention to criticize their study that was admirably critical and careful; we simply wish to demonstrate that nMDS can detect subtle structures.

Iyer et al. (1999) selected the genes to analyze according to the significance of the changes in their expression levels, so the temporal structure detected by our nMDS analysis is not an artifact of the methodology in contrast to that exhibited in Cho et al., (2001).


    CONCLUSION
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 
We have demonstrated that the nMDS can be a useful tool for data mining. It is unsupervised, and perhaps maximally non-linear. Our algorithm is probably the simplest among the non-metric MDS algorithms and is efficient enough to enable the analysis of a few thousand objects with a small laptop machine.

The nMDS algorithm works on the binary relations among the objects, so if there are N objects, computational complexity is of order N2 at least. Therefore, it is far slower than linear methods such as PCA, although our non-linear algorithm is practically fast enough; we have used for this work a small laptop machine (Mobile Celeron 650 MHz CPU with 256 MB RAM). Typically, the necessary number of iteration is about 100, and it takes only a few minutes for a few hundreds of genes. As has been pointed out and illustrated, with an appropriate data preprocessing a certain linear method could give us a reasonable result with less computational efforts. Although in this paper we have not made any particular effort to reduce computational requirements, a practical way to use nMDS might be to prepare an initial configuration by a linear method with an appropriate data preprocessing method that is verified to be consistent with the full nMDS results.


   
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 
APPENDIX A
‘Purely’ non-metric MDS algorithm
Suppose d(i,j) is the distance between objects i and j in R. Let the ranking of {delta}(i,j) among all the input dissimilarity data be n and that of d(i,j) among all the distances between embedded pairs be Tn. If n < Tn (resp., n < Tn), we wish to ‘push’ the pair i and j farther apart (resp., closer) in R. Intuitively speaking, to this end we introduce an ‘overdamped dynamics’ of the points in R driven by the following potential function:

Here, the summation is over all the pairs. In the actual implementation of the algorithm, simpler forces are adopted than the one obtained from this potential as seen below, because the latter is complicated for numerical simulation to the extent of being not practical. There are many ways to modify {Delta} to make the force simpler; we have selected the one that makes the force formula very simple, if not the simplest.

The {Delta} defined by the formula above may be regarded as a counterpart of the stress in the conventional nMDS. As we will see later we can use quantities related to {Delta} to evaluate the confidence level of the resultant configuration. However, we do not employ {Delta} itself, because this quantity is not the difference between two ranks of independent objects; there are NC2{delta}ij or dij but they are relations only among n potentially independent quantities. Therefore, the statistics of {Delta} is not known.

Thus, although certain modifications as discussed above are needed, still in our nMDS algorithm, the optimization process is closely connected to a process that improves the confidence level of the resultant configuration.

The ‘pure nMDS’ algorithm for n objects may be described as follows:

  1. Dissimilarities {delta}ij (i, j = 1, ..., N) for n objects are given. Order them as follows:


  2. Put n points randomly in R as an initial configuration.
  3. Scale the position vectors in R such as , where ri is the current position of object i in R.
  4. Compute dij for all object pairs (i,j) in R, and then order them as


  5. Suppose {delta}ij is the m-th largest in the ordering in (1) and dij is the Tm-th largest in the ordering in (4). Assign Cij = Tm m. Calculate the following displacement vector for i:

    where s = 0.1 x N–3 typically, and update ri -> ri + {delta}ri

  6. Return to (3), and continue until the ‘potential energy {Delta}’ becomes sufficiently small.

The reader may worry about the handling of tie data. Generally speaking, for a large data set the fraction of tie relations is not significant; furthermore, if the result depends on the handling schemes of tie data, the result is unreliable anyway. Therefore, we do not pay any particular attention to the tie data problem.

In the above algorithm, s is a constant value. In practice, we could choose an appropriate schedule to vary s as is often done in optimization processes. In this paper, for simplicity, we do not attempt such a fine tuning.

In the above algorithm, we can deal with asymmetric data as well, i.e. {delta}ij != {delta}ji if we compare {delta}ij with dij while {delta}ji with dji (= dij). Needless to say, if the mismatch between {delta}ij and {delta}ji is large, then representing the pair by a pair of points in a metric space is questionable. Therefore, we will not discuss this problem any further in this paper. The microarray data analyzed in this paper do not have such a problem in any case.

Goodness of embedding
In the text we have already discussed the effective number of correctly embedded objects as a measure of ‘global goodness of embedding’. This measure, however, cannot tell us the embedding quality of each object. It is often the case that the majority of objects are embedded well even without sensitive dependence on the initial conditions, but there are a few objects that consistently refuse to be embedded stably. To judge the quality of embedding for each object j we define

Here, n(j) is the rank order of {delta}(i,j) among N – 1 pairs (i,j) for a given j, and Tn(j)(j) is the rank order of d(i,j) among N – 1 pairs (i,j) for the same j.

{Delta}(j) can be regarded as a statistical variable for the relative position of the j-th object with respect to the remaining objects (Lehmann, 1975). We can estimate the probability P({varepsilon}) of {Delta}(j) < {varepsilon} with the null hypothesis that the rank ordering of dij (i {1,2, ..., N} \ {j}) is totally random with respect to the rank ordering of {delta}ij (i {1,2, ..., N}\{j}). If n is sufficiently large, then {Delta}(j) obeys the normal distribution with mean (M3 M)/6 and variance M2 (M + 1)2 (M – 1)/36, where M {equiv} N – 1. For smaller N there is a table for P({varepsilon}) (Lehmann, 1975). Thus, we can test the null hypothesis with a given confidence level for j-th object.

APPENDIX 2
Limitations and capabilities of linear methods
The limitations and capabilities of PCA with and without data preprocessing are illustrated in this appendix. There is no fundamental difference between PCA and SVD. We consider the following artificial data {sgt}, where g (=1,...,517) denote genes and t (=1,...,11) the observation times:

[Data set 1]


[Data set 2]


[Data set 3]


In the above, Cg, {delta}g, Cig and {delta}ig (i = 1,2,3) are uniform random numbers in [0,1]. That is, Data set 1 is a set of sinusoidal waves with random amplitudes and phases, Data set 2 is the nonlinearly distorted Data set 1, and Data set 3 is a set of periodic functions that are very different from simple oscillatory behaviors.

These data sets are analyzed by the following methods:

Method 1: PCA with the preprocessing used by Holter et al., 2000). The preprocessing procedure is as follows:
Step 1: Subtract the average,

where <>g,t is the average over all genes and experiments,


Step 2 (Column normalization): Normalize the data as


Step 3 (Row normalization): Normalize the data as

Repeat these steps until the following condition is satisfied,

From the resultant sgt correlation matrix Matr.(Cort t') is constructed, and then PCA is performed.


Method 2: PCA with the preprocessing so that {sum}t sgt = 0 and for all g. Of course, no iteration is needed for this preprocessing. From the resultant sgt correlation matrix Matr.(Cortt') is constructed, and then PCA is performed.
Method 3: nMDS as done in the text. That is, the negative of the correlation coefficient Corgg' is used as the dissimilarity and nMDS is applied straightforwardly. Needless to say, no preprocessing of data is needed.

The results are shown in Figure 7. The conclusions may be:

  1. For Data set 1, any method will do.
  2. For Data set 2, the procedure recommended by (Holter et al., 2000) fails, although ironically simpler Method 2 still works very well. If the amplitude C is distributed in [0,5] instead of [0,1] (i.e. the extent of the non-linear distortion is increased), Method 2 becomes inferior to Method 3, but still Method 2 is adequate.
  3. For Data set 3, even Method 2 fails. nMDS (Method 3) still exhibits a ring-like structure. The method recommended by (Holter et al., 2000) is obviously out of question.



View larger version (42K):
[in this window]
[in a new window]
 
Fig. 7 Comparison of linear and non-linear methods. Method 1, PCA with polishing (Holter et al. 2000); Method 2, PCA with normalization; Method 3, 2D space embedding with the aid of nMDS. See the text for Data sets and Methods. For Methods 1 and 2, horizontal and vertical axes are the first and second principal components, respectively, and the percentages describe cumulative proportions. For Method 3, the percentages are the indicators of goodness defined in the text.

 
Thus, we may conclude that nMDS is a versatile and all-around data mining method for analyzing periodic temporal data. Furthermore, we can point out that the preprocessing method in Method 1 should not be used, because it could severely distort the original data (as may have been expected from the figures). Suppose there are n genes and four time points. Consider the following example (for the counterexample sake). The first gene has (a,b,–b,–a) (a > b > 0), and the remaining genes are all give by (1,0,0,–1). The N x 4 matrix made from these vectors is polished by an iterative row and column vector normalization procedure. If n is sufficiently large, the first row converges to (0,1,–1,0) and the rest to (1,0,0,–1), independent of a and b. If b is small, then all the vectors should behave almost the same way, but after polishing the out-of-phase component in the discrepancy between the first row and the rest is enhanced dramatically, resulting in a spurious out of phase temporal behavior. Although the preceding exercise is trivial, the result warns us the danger of using the so-called polishing.

APPENDIX 3
Gene ontology
Since we have seen Lagreid et al. (2003) claiming that gene ontology can be discovered in the same data we are analyzing, we tried to extract more biological information from the nMDS result. We have realized that the distribution of ontologically identified genes (or that of the distribution of 23 ontology tags adopted by Lagreid et al.) along the ring is statistically indistinguishable from the uniform distribution. In this sense, nMDS fails to capture ontological information of the genes. This might suggest that our nMDS method is definitely inferior to the analysis based on the rough set adopted by the above paper. However, we have found that the claimed success is largely illusory. The claimed success is: of the 24 genes for which homology information allows inference of their ontology, ‘11 genes had one or more classifications that matched this assumption (Table 10)’. However, actually 8 ± 2.4 is within one {sigma} error (note that out of 210 genes used to train the rules based on the rough set, each gene on the average share an ontology with about 70 other genes). That is, 11 out of 24 is statistically hardly significant (the cross-validation does not mean very much because it is only a consistency check of the method). Therefore, nMDS does not fail to capture the significant information that can be obtained by the existing statistical methods.

Thus, we may conclude that the data we have analyzed may be sufficiently informative to follow the cell-cycle-related events, but not to annotate genes ontologically.

Received on July 19, 2003; revised on September 17, 2004; accepted on September 18, 2004

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 ALGORITHM
 RESULTS
 CONCLUSION

 REFERENCES
 

    Borg, I. and Groenen, P. Modern Multidimensional Scaling, (1997) , NY Springer.

    Cho, R.J., Huang, M., Campbell, M.J., Dong, H., Steinmetz, L., Sapinoso, L., Hampton, G., Elledge, S.J., Davis, R.W., Lockhart, D.J. (2001) Transcriptional regulation and function during the human cell cycle. Nat. Genet., 27, , pp. 48–54[Web of Science][Medline].

    Cox, T.F. and Cox, M.A.A. Multidimensional Scaling, (1994) , London Chapman and Hall.

    Donoho, D.L., Vetterli, M., DeVore, R.A., Daubechies, I. (1998) Data compression and harmonic analysis. IEEE Trans. Inform. Theory, 44, , pp. 2435–2476[CrossRef].

    Dyrskjot, L., Thykjaer, T., Kruhoffer, M., Jensen, J.L., Marcussen, N., Hamilton-Dutoit, S., Wolf, H., Orntoft, T.F. (2002) Identifying distinct classes of bladder carcinoma using microarrays. Nat. Genet., 33, 90–96.

    Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci., USA, 95, 14863–14868[Abstract/Free Full Text].

    Green, P.E., Carmone, F.J., Jr, Smith, S.M. Multidimensional Scaling: Concepts and Applications, (1970) , Boston, MA Allyn and Bacon.

    Hollander, M. and Wolfe, D.A. Nonparametric Statistical Methods, (1999) , NY John Wiley & Sons.

    Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R., Fedoroff, N.V. (2000) Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl Acad. Sci., USA, 97, , pp. 8409–8414[Abstract/Free Full Text].

    Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T.L., Jeffrey, C.F., Trent, J.M., Staudt, L.M., Hudson, J., Jr, Boguski, M.S., et al. (1999) The transcriptional program in the response of human fibroblasts to serum. Science, 283, 83–87[Abstract/Free Full Text].

    Johansson, D., Lindgren, P., Beglund, A. (2003) A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics, 19, 467–473[Abstract/Free Full Text].

    Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H., Ikemura, T. (2001) Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene, 276, 89–99[CrossRef][Web of Science][Medline].

    Kasturi, J., Acharya, R., Ramanathan, M. (2003) An information theoretic approach for analyzing temporal patterns of gene expression. Bioinformatics, 19, 449–458[Abstract/Free Full Text].

    Kruskal, J.B. (1964a) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27[CrossRef][Web of Science].

    Kruskal, J.B. (1964b) Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29, 115–129[CrossRef][Web of Science].

    Lagreid, A., Hvidsten, T.R., Midelfart, H., Komorowski, J., Sandvik, A.K. (2003) Predicting gene ontology biological process from temporal gene expression patterns. Genome Res., 13, 965–979[Abstract/Free Full Text].

    Lehmann, E.L. Nonparametrics, (1975) , San Francisco, CA Holden-Day.

    Shedden, K. and Cooper, S. (2002) Analysis of cell-cycle-specific gene expression in human cells as determined by microarray and double-thymidine block synchronization. Proc. Natl Acad. Sci., USA, 99, , pp. 4379–4384[Abstract/Free Full Text].

    Shepard, R.N. (1962a) The analysis proximities: multidimensional scaling with an unknown distance function, I. Psychometrika, 27, 125–140[CrossRef].

    Shepard, R.N. (1962b) The analysis proximities: multidimensional scaling with an unknown distance function, II. Psychometrika, 27, 219–246[CrossRef].

    Shmulevich, I. and Zhang, W. (2002) Binary analysis and optimization-based normalization of gene expression data. Bioinformatics, 18, 555–565[Abstract/Free Full Text].

    Slonim, D.K. (2002) From patterns to pathways: gene expression data analysis of age. Nat. Genet. Suppl., 32, 502–508[CrossRef].

    Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B. (1998) Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297[Abstract/Free Full Text].

    Taguchi, Y-h. and Oono, Y. (2004) Nonmetric multidimensional scaling as a data-mining Tool: new algorithm and new targets. In Toda, M., Komatsuzaki, T., Konishi, T., Rice, R.S., Berry, S.A. (Eds.). Geometrical Structures of Phase Space Multidimensional Chaos, 130, , pp. 315–351 Special Volume of Adv. Chem. Phys..

    Taguchi, Y-h., Oono, Y., Yokoyama, K. (2001) New possibilities of non-metric multidimensional scaling. Proc. Inst. Stat. Math., 49, 133–153 (in Japanese).


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
GeneticsHome page
C. Zhu and J. Yu
Nonmetric Multidimensional Scaling Corrects for Population Structure in Association Mapping With Different Sample Types
Genetics, July 1, 2009; 182(3): 875 - 888.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S. Rajaram
A novel meta-analysis method exploiting consistency of high-throughput experiments
Bioinformatics, March 1, 2009; 25(5): 636 - 642.
[Abstract] [Full Text] [PDF]


Home page
Med Decis MakingHome page
P. F. M. Krabbe, J. A. Salomon, and C. J. L. Murray
Quantification of Health States with Rank-Based Nonmetric Multidimensional Scaling
Med Decis Making, August 1, 2007; 27(4): 395 - 405.
[Abstract] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
M. A. Zapala and N. J. Schork
Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables
PNAS, December 19, 2006; 103(51): 19430 - 19435.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/6/730    most recent
bti067v2
bti067v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (10)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Taguchi, Y.-h.
Right arrow Articles by Oono, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Taguchi, Y.-h.
Right arrow Articles by Oono, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?