Skip Navigation


Bioinformatics Advance Access originally published online on April 4, 2006
Bioinformatics 2006 22(12):1540-1542; doi:10.1093/bioinformatics/btl117
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/12/1540    most recent
btl117v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (44)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Suzuki, R.
Right arrow Articles by Shimodaira, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Suzuki, R.
Right arrow Articles by Shimodaira, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Pvclust: an R package for assessing the uncertainty in hierarchical clustering

Ryota Suzuki 1,2,* and Hidetoshi Shimodaira 1

1 Department of Mathematical and Computing Sciences, Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan
2 Ef-prime, Inc. 2-17-5 Nihonbashi-Kayabacho, Chuo-ku, Tokyo 103-0025, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 REFERENCES
 

Summary: Pvclust is an add-on package for a statistical software R to assess the uncertainty in hierarchical cluster analysis. Pvclust can be used easily for general statistical problems, such as DNA microarray analysis, to perform the bootstrap analysis of clustering, which has been popular in phylogenetic analysis. Pvclust calculates probability values (p-values) for each cluster using bootstrap resampling techniques. Two types of p-values are available: approximately unbiased (AU) p-value and bootstrap probability (BP) value. Multiscale bootstrap resampling is used for the calculation of AU p-value, which has superiority in bias over BP value calculated by the ordinary bootstrap resampling. In addition the computation time can be enormously decreased with parallel computing option.

Availability: The program is freely distributed under GNU General Public License (GPL) and can directly be installed from CRAN (http://cran.r-project.org/), the official R package archive. The instruction and program source code are available at http://www.is.titech.ac.jp/~shimo/prog/pvclust

Contact: ryota.suzuki{at}is.titech.ac.jp

Cluster analysis is a statistical method which aims to classify several objects into some groups (clusters) according to similarities between them. While it has been widely used in many applications such as DNA microarray analysis, the uncertainty of results caused by sampling error of data has not generally been evaluated in practice. Pvclust is an implementation of bootstrap analysis on a statistical software R to assess the uncertainty in hierarchical cluster analysis.

The importance of uncertainty assessment has been well-recognized in phylogenetic analysis. It is a special form of hierarchical clustering for inferring the history of evolution as a dendrogram. Thousands of bootstrap samples are generated by randomly sampling elements of the data, and bootstrap replicates of the dendrogram are obtained by repeatedly applying the cluster analysis to them (Efron, 1979; Felsenstein, 1985). The bootstrap probability (BP) value of a cluster is the frequency that it appears in the bootstrap replicates. The multiscale bootstrap resampling was developed recently (Efron et al., 1996; Shimodaira, 2002, 2004) for calculating approximately unbiased (AU) probability values (p-values) as implemented in a software CONSEL (Shimodaira and Hasegawa, 2001).

Although these bootstrap-based approaches are applicable to broad range of statistical problems, their usage have been limited since implementations of these methods are focused only on phylogenetic analysis. Pvclust is designed for general hierarchical clustering problems, so users can easily obtain bootstrap-based p-values for their own dataset and preferred clustering method. (Suzuki and Shimodaira, 2004).

R is ‘a free software environment for statistical computing and graphics’ (the R project website, http://www.r-project.org/), which runs on several platforms such as Windows, MacOS and UNIX/Linux. Several add-on packages are available via CRAN, the official R package archive.

Package pvclust is included in CRAN packages and can be easily installed on the fly. Once the package is installed, pvclust can be run with the command:

library (pvclust)

In R system data are stored in a ‘data object’. For example, to read a data file data.txt into a data object data, type the following command:

data < - read.table("data.txt")

The bootstrap analysis is performed by applying the function pvclust to the object data with the number of bootstrap replications being B = 10 000:

result < - pvclust(data, nboot=10000)

The result is stored in the object result, which can be shown graphically with the command:

plot(result)

We have applied the above commands to the DNA microarray data of Garber et al. (2001), and the result is shown in Figure 1. By default, pvclust performs bootstrapping at K = 10 different data sizes, and the hierarchical clustering is repeated by K x B times. It took 430 min on a single processor (Athlon MP 2000+). The parallel version parPvclust is available with the help of the snow package (Rossini et al., 2003), an add-on package of R for parallel computation. It took only 24 min using 20 processors; the speed-up value is 430/24 = 18, indicating very efficient parallel computing.


Figure 1
View larger version (41K):
[in this window]
[in a new window]
 
Fig. 1 Hierarchical clustering of 73 lung tumors. The data are expression pattern of 916 genes of Garber et al. (2001). Values at branches are AU p-values (left), BP values (right), and cluster labels (bottom). Clusters with AU ≥ 95 are indicated by the rectangles. The fourth rectangle from the right is a cluster labeled 62 with AU = 0.99 and BP = 0.96.

 
It is recommended to use nboot = 1000 (default) for testing at first, followed by 10 000 for smaller errors. To determine an appropriate size of B (nboot), standard errors of p-values are helpful. Function seplot provides a graphical interface for examining standard errors, while print gives more detailed information about p-values in text-based format. See online instruction on our website for the usage of these facilities.

In the multiscale bootstrap resampling, we intentionally alter the data size of bootstrap samples to several values. Let N be the original data size, and N' be that for bootstrap samples. In the example of Figure 1, N = 916, and N' = 458, 549, 641, 732, 824, 916, 1007, 1099, 1190 and 1282. For each cluster, an observed BP value is obtained for each value of N', and we look at change in z = –{Phi}–1 (BP) values, where {Phi}–1 (·) is the inverse function of {Phi} (·), the standard normal distribution function. For the cluster labeled 62 in Figure 1, the observed BP values are 0.8554, 0.8896, 0.9132, 0.9335, 0.9498, 0.9636, 0.9656, 0.9756, 0.9795 and 0.9859 (Fig. 2). Then, a theoretical curve Formula is fitted to the observed values, and the coefficients v, c are estimated for each cluster. The AU p-value is computed by AU = {Phi}(–v + c). For the cluster labeled 62, v = –2.01, c = 0.26, and thus AU = {Phi}(2.01 + 0.26) = {Phi}(2.27) = 0.988, where BP = 0.964 for N' = N. An asymptotic theory proves that the AU p-value is less biased than the BP value.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 2 Diagnostic plot of the multiscale bootstrap for the cluster labeled 62. The observed z-values are plotted for Formula, and the theoretical curve is obtained by the weighted least squares fitting. This plot is obtained by command: msplot(result, edges = 62). When the curve fitting is poor, a breakdown of the asymptotic theory may be suspected.

 
Currently only the simplest form of the bootstrapping, i.e. the non-parametric bootstrap resampling, is implemented in pvclust. More elaborate models designed for specific applications, such as that of Kerr and Churchill (2001) for DNA microarray analysis, should be incorporated into the program in a future work.


    Acknowledgments
 
This work is supported in part by Grant KAKENHI (14702061, 17700276) from MEXT of Japan.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Satoru Miyano

Received on August 30, 2005; revised on March 6, 2006; accepted on March 25, 2006

    REFERENCES
 TOP
 ABSTRACT
 REFERENCES
 

    Efron, B. (1979) Bootstrap methods: another look at the jackknife. Ann. Stat, . 7, 1–26.

    Efron, B., et al. (1996) Bootstrap confidence levels for phylogenetic trees. Proc. Natl Acad. Sci., USA, 93, 13429–13434[Abstract/Free Full Text].

    Felsenstein, J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39, 783–791[CrossRef][Web of Science].

    Garber, M., et al. (2001) Diversity of gene expression in adenocarcinoma of the lung. [Erratum (2002) Proc. Natl Acad. Sci. USA, 99, 1098.] Proc. Natl Acad. Sci. USA, 98, 13784–13789.

    Kerr, M.K. and Churchill, G.A. (2001) Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci. USA, 98, 8961–8965[Abstract/Free Full Text].

    Rossini, A., et al. (2003) Simple parallel statistical computing in R. UW Biostatistics Working Paper Series, , WA Paper 193 University of Washington.

    Shimodaira, H. and Hasegawa, M. (2001) CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics, 17, 1246–1247[Abstract/Free Full Text].

    Shimodaira, H. (2002) An approximately unbiased test of phylogenetic tree selection. Syst. Biol, . 51, 492–508[CrossRef][Web of Science][Medline].

    Shimodaira, H. (2004) Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. Ann. Stat, . 32, 2616–2641[CrossRef].

    Suzuki, R. and Shimodaira, H. (2004) An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: how accurate are these clusters? In proceedings by the Fifteenth International Conference on Genome Informatics (GIW 2004) p. P034.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Rheumatology (Oxford)Home page
H.-J. van den Ham, W. de Jager, J. W. J. Bijlsma, B. J. Prakken, and R. J. de Boer
Differential cytokine profiles in juvenile idiopathic arthritis subtypes revealed by cluster analysis
Rheumatology, August 1, 2009; 48(8): 899 - 905.
[Abstract] [Full Text] [PDF]


Home page
Cancer Res.Home page
S. Heck, J. Rom, V. Thewes, N. Becker, B. Blume, H. P. Sinn, U. Deuschle, C. Sohn, A. Schneeweiss, and P. Lichter
Estrogen-Related Receptor {alpha} Expression and Function Is Associated with the Transcriptional Coregulator AIB1 in Breast Carcinoma
Cancer Res., June 15, 2009; 69(12): 5186 - 5193.
[Abstract] [Full Text] [PDF]


Home page
Physiol. GenomicsHome page
K. Motoyama, Y. Nakai, T. Miyashita, Y. Fukui, M. Morita, K. Sanmiya, H. Sakakibara, I. Matsumoto, K. Abe, T. Yakabe, et al.
Isolation stress for 30 days alters hepatic gene expression profiles, especially with reference to lipid metabolism in mice
Physiol Genomics, April 10, 2009; 37(2): 79 - 87.
[Abstract] [Full Text] [PDF]


Home page
Mol. Cell. ProteomicsHome page
K. Kim, P. Aronov, S. O. Zakharkin, D. Anderson, B. Perroud, I. M. Thompson, and R. H. Weiss
Urine Metabolomics Analysis for Kidney Cancer Detection and Biomarker Discovery
Mol. Cell. Proteomics, March 1, 2009; 8(3): 558 - 570.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
B. Kaczkowski, E. Torarinsson, K. Reiche, J. H. Havgaard, P. F. Stadler, and J. Gorodkin
Structural profiles of human miRNA families from pairwise clustering
Bioinformatics, February 1, 2009; 25(3): 291 - 294.
[Abstract] [Full Text] [PDF]


Home page
Molecular Cancer TherapeuticsHome page
J. M. Flanagan, J. M. Funes, S. Henderson, L. Wild, N. Carey, and C. Boshoff
Genomics screen in transformed stem cells reveals RNASEH2A, PPAP2C, and ADARB1 as putative anticancer drug targets
Mol. Cancer Ther., January 1, 2009; 8(1): 249 - 260.
[Abstract] [Full Text] [PDF]


Home page
JEMHome page
V. Brune, E. Tiacci, I. Pfeil, C. Doring, S. Eckerle, C. J.M. van Noesel, W. Klapper, B. Falini, A. von Heydebreck, D. Metzler, et al.
Origin and pathogenesis of nodular lymphocyte-predominant Hodgkin lymphoma as revealed by global gene expression analysis
J. Exp. Med., September 29, 2008; 205(10): 2251 - 2268.
[Abstract] [Full Text] [PDF]


Home page
DevelopmentHome page
A. Kawaguchi, T. Ikawa, T. Kasukawa, H. R. Ueda, K. Kurimoto, M. Saitou, and F. Matsuzaki
Single-cell gene profiling defines differential progenitor subclasses in mammalian neurogenesis
Development, September 15, 2008; 135(18): 3113 - 3124.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
A. W. MacInnes, A. Amsterdam, C. A. Whittaker, N. Hopkins, and J. A. Lees
Loss of p53 synthesis in zebrafish tumors with ribosomal protein gene mutations
PNAS, July 29, 2008; 105(30): 10408 - 10413.
[Abstract] [Full Text] [PDF]


Home page
J. Clin. Microbiol.Home page
F. Grenouillet, L. Millon, J.-M. Bart, S. Roussel, I. Biot, E. Didier, A.-S. Ong, and R. Piarroux
Multiple-Locus Variable-Number Tandem-Repeat Analysis for Rapid Typing of Candida glabrata
J. Clin. Microbiol., November 1, 2007; 45(11): 3781 - 3784.
[Abstract] [Full Text] [PDF]


Home page
Hum ReprodHome page
A.-N. Spiess, C. Feig, W. Schulze, F. Chalmel, H. Cappallo-Obermann, M. Primig, and C. Kirchhoff
Cross-platform gene expression signature of human spermatogenic failure reveals inflammatory-like response
Hum. Reprod., November 1, 2007; 22(11): 2936 - 2946.
[Abstract] [Full Text] [PDF]


Home page
Mol Cancer ResHome page
A. A.N. Rose, F. Pepin, C. Russo, J. E. Abou Khalil, M. Hallett, and P. M. Siegel
Osteoactivin Promotes Breast Cancer Metastasis to Bone
Mol. Cancer Res., October 1, 2007; 5(10): 1001 - 1014.
[Abstract] [Full Text] [PDF]


Home page
Eukaryot CellHome page
S. Ohnuki, S. Nogami, H. Kanai, D. Hirata, Y. Nakatani, S. Morishita, and Y. Ohya
Diversity of Ca2+-Induced Morphology Revealed by Morphological Phenotyping of Ca2+-Sensitive Mutants of Saccharomyces cerevisiae
Eukaryot. Cell, May 1, 2007; 6(5): 817 - 830.
[Abstract] [Full Text] [PDF]


Home page
J Exp BotHome page
B. Zhou, Y. Li, Z. Xu, H. Yan, S. Homma, and S. Kawabata
Ultraviolet A-specific induction of anthocyanin biosynthesis in the swollen hypocotyls of turnip (Brassica rapa)
J. Exp. Bot., May 1, 2007; 58(7): 1771 - 1781.
[Abstract] [Full Text] [PDF]


Home page
Cancer Res.Home page
A. Gaur, D. A. Jewell, Y. Liang, D. Ridzon, J. H. Moore, C. Chen, V. R. Ambros, and M. A. Israel
Characterization of MicroRNA Expression Levels and Their Biological Correlates in Human Cancer Cell Lines
Cancer Res., March 15, 2007; 67(6): 2456 - 2468.
[Abstract] [Full Text] [PDF]


Home page
Genes Dev.Home page
J. R. Neilson, G. X.Y. Zheng, C. B. Burge, and P. A. Sharp
Dynamic regulation of miRNA expression in ordered stages of cellular development
Genes & Dev., March 1, 2007; 21(5): 578 - 589.
[Abstract] [Full Text] [PDF]


Home page
Plant Physiol.Home page
M. A. Hannah, D. Wiese, S. Freund, O. Fiehn, A. G. Heyer, and D. K. Hincha
Natural Genetic Variation of Freezing Tolerance in Arabidopsis
Plant Physiology, September 1, 2006; 142(1): 98 - 112.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/12/1540    most recent
btl117v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (44)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Suzuki, R.
Right arrow Articles by Shimodaira, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Suzuki, R.
Right arrow Articles by Shimodaira, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?