Bioinformatics Advance Access originally published online on May 7, 2007
Bioinformatics 2007 23(14):1801-1806; doi:10.1093/bioinformatics/btm233
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure
Center for Computational Medicine and Biology, Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Clustering of individuals into populations on the basis of multilocus genotypes is informative in a variety of settings. In population-genetic clustering algorithms, such as BAPS, STRUCTURE and TESS, individual multilocus genotypes are partitioned over a set of clusters, often using unsupervised approaches that involve stochastic simulation. As a result, replicate cluster analyses of the same data may produce several distinct solutions for estimated cluster membership coefficients, even though the same initial conditions were used. Major differences among clustering solutions have two main sources: (1) label switching of clusters across replicates, caused by the arbitrary way in which clusters in an unsupervised analysis are labeled, and (2) genuine multimodality, truly distinct solutions across replicates.
Results: To facilitate the interpretation of population-genetic clustering results, we describe three algorithms for aligning multiple replicate analyses of the same data set. We have implemented these algorithms in the computer program CLUMPP (CLUster Matching and Permutation Program). We illustrate the use of CLUMPP by aligning the cluster membership coefficients from 100 replicate cluster analyses of 600 chickens from 20 different breeds.
Availability: CLUMPP is freely available at http://rosenberglab.bioinformatics.med.umich.edu/clumpp.html
Contact: mjakob{at}umich.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
A variety of population-genetic applications—such as association mapping, molecular ecological studies and studies of human evolution—make use of the clustering of individual multilocus genotypes into populations. Many clustering algorithms have now been developed for employing population-genetic data to assign individuals—and fractions of individuals—to clusters (Anderson and Thompson, 2002; Chen et al., 2006, 2007; Corander and Marttinen, 2006; Corander et al., 2003, 2004; Dawson and Belkhir, 2001; Falush et al., 2003; François et al., 2006; Huelsenbeck and Andolfatto, 2007; Pella and Masuda, 2006; Pritchard et al., 2000). The result of a single cluster analysis is typically possible to represent as a matrix, where each individual is given a membership coefficient for each cluster—interpreted as a probability of membership, or as a fraction of the genome with membership in the cluster, depending on the setting—with membership coefficients summing to 1 across K clusters. The number of clusters is predefined by the user for some methods, and for others it is inferred.
Because clustering algorithms may incorporate stochastic simulation as part of the inference, independent analyses of the same data may result in several distinct outcomes, even though the same initial conditions were used. The main differences across replicates are of two types: label switching and genuine multimodality. Label switching refers to a scenario in which different replicates obtain the same membership coefficient estimates, except with a different permutation of the cluster labels (Jasra et al., 2005; Stephens, 2000). In unsupervised cluster analyses, because the meaning of each cluster label is not known in advance, a clustering algorithm may be equally likely to reach any of K ! permutations of the same collection of estimated membership coefficients.
It is also possible for replicate cluster analyses to arrive at truly distinct solutions that are not equivalent up to permutation. Unlike label switching, which is simply a nuisance that makes interpretation of the clustering results more difficult, this genuine multimodality may result from real biological factors that cause multiple parts of the space of possible membership coefficients to provide similarly appropriate explanations for the data.
Regardless of the source of differences in clustering outcomes, some method is needed for handling the results from replicate analyses. We develop three algorithms for searching for optimal alignments of R replicate cluster analyses of the same data, and we have implemented these algorithms in the computer program CLUMPP. Our program takes as input the estimated cluster membership coefficient matrices of multiple runs of a clustering program, for any number of clusters. It outputs these same matrices, permuted so that all replicates have as close a match as possible. Thus, CLUMPP strips away the label switching heterogeneity so that the genuine multimodality can be detected and quantified. The input file for CLUMPP is a file similar to the output from STRUCTURE (Falush et al., 2003; Pritchard et al., 2000), and the output from CLUMPP can be used directly as input by the cluster visualization program DISTRUCT (Rosenberg, 2004).
| 2 ALGORITHMS |
|---|
|
|
|---|
We refer to the CxK matrix of membership coefficients for a single cluster analysis as the Q-matrix, with the C rows corresponding to individuals (or populations) and the K columns corresponding to clusters. CLUMPP attempts to maximize a measure of similarity of the Q-matrices of R creplicates over all (K !)R–1 possible alignments of the replicates.
Consider a pair of Q-matrices, Qi and Qj for runs i and j, where the value in the cth row and kth column of Qi is the membership coefficient for individual c in cluster k as inferred in run i. Each matrix consists of non-negative entries, and the sum of the entries in any row is 1. We define the pairwise similarity of matrices Qi and Qj as follows:
|
| (1) |
|
| (2) |
Using G to measure similarity, the optimal alignment of matrices Qi and Qj is defined as the permutation of the columns of Qj that maximizes the similarity G over all permutations P in the set SK of permutations of K clusters. The maximum value, or
|
| (3) |
For a collection of R replicates, the average pairwise similarity is defined as
|
| (4) |
|
| (5) |
We make use of three algorithms for attempting to find the optimal alignment of R replicates. In decreasing order of the extent of the search, and in increasing order of computational speed, these algorithms are termed FullSearch, Greedy and LargeKGreedy. These algorithms supersede earlier methods that we described in Nordborg et al. (2005).
Note that our approach can proceed analogously using alternative functions to measure similarity in place of G. Although it is undefined when one of the two matrices equals W, G is designed to have large negative values when one of the two runs reflects substantial population structure and the other has relatively little structure (i.e. little difference from W). We can define a second similarity function G', which is guaranteed to lie in [0,1]:
|
| (6) |
|
|
0 and
The quantities SSC', H' and rmSSCR' can then be defined by replacing G in Equations (3–5![]()
) with G'. We proceed to describe our algorithms using the G statistic to measure similarity; to instead use G', the approach is analogous with G', SSC', H' and rmSSCR' in place of G, SSC, H, and SSCR.
FullSearch
The FullSearch algorithm computes H for each of the (K!)R–1 alignments of the K clusters in R replicates. Considering all possible vectors of permutations (I, P2, P3, ..., PR), the algorithm computes H(I(Q1),P2(Q2),P3(Q3),...,PR(QR)) and returns the vector of permutations that maximizes SSCR. As the number of possible vectors grows quickly with K and R, however, even for moderate values of K and R, it is unrealistic to test every alignment. The FullSearch algorithm runs in time proportional to TFullSearch = (K!)R–1[R(R–1)/2]KC: the number of permutation vectors is (K!)R–1, the number of computations of G in each evaluation of Equation (4) is R(R–1)/2, and the time required for each computation of G is proportional to KC. Although it proceeds slowly, unlike our other algorithms, the FullSearch algorithm is guaranteed to find the optimal alignment of clusters across multiple runs.
Greedy
The Greedy algorithm employed by CLUMPP proceeds as follows:
- Choose one run, Q1.
- Choose a second run, Q2, and fix the permutation, P2, that maximizes G(P1(Q1),P(Q2)) over all possible permutations P (where P1 is the identity permutation).
- Continue sequentially with each remaining run, Qx, where x=3,...,R, and fix the permutation, Px of Qx, that maximizes the average similarity with the previously fixed x–1 runs, or
over all permutations P.
(7)
This algorithm runs in time proportional to TGreedy=(K !)[R(R–1)/2]KC. The number of permutations tested for each run from 2 to R is K !. For run r (r ranging from 2 to R), the number of computations of G performed for each permutation is r–1. Thus, considering all runs from 2 to R, the total number of computations of G performed is (K !)[R(R–1)]/2. Each computation of G runs in time proportional to KC.
Because the order in which the runs are considered can affect the result, several different sequences of runs should be tested. CLUMPP offers three options for testing different sequences: test all possible sequences of runs, test a pre-defined number of random sequences and test a specific set of user-defined sequences.
LargeKGreedy
When K
15, the number of permutations, K !, is very large, and it may not be possible to calculate G for all permutations of a particular pair of Q-matrices. Instead of testing every permutation as in the Greedy algorithm, the LargeKGreedy algorithm proceeds as follows:
- (1)Choose one run, Q1.
- (2a)Choose a second run, Q2. Compute G for all pairs of columns, one from Q1 and one from Q2. This computation is simply the value of G for two columns—hence no permutations of Q2 are computed, unlike in step 2 for the Greedy algorithm.
- (2b)Pick the pair of columns Q1,y1 and Q2,z1 with highest G-value and fix these columns (Q1,y1 refers to column y1 of matrix Q1). Then pick the pairs of columns Q1,y2 and Q2,z2 with the next highest G-value, ignoring all G-values of pairs containing either of the previously chosen columns Q1,y1 and Q2,z1. Repeat this procedure until K pairs of columns, one each from Q1 and Q2, have been picked, and fix the permutation of Q2 that matches up these pairs of columns, or P2(Q2).
- (3a)Continue sequentially with each remaining run, Qx, where x = 3,..., R. For each y and z from 1 to K, compute the average similarity,
where P
(8)
,y denotes column y of the permuted matrix P
(Q
). This quantity is the similarity of column z of Qx to column y of each of the previously fixed permutations, averaged across all runs previously considered. No permutations of Qx are computed, unlike in Step 3 for the Greedy algorithm.
- (3b)Pick the pair of columns y1 of P1(Q1),P2(Q2),...,Px–1(Qx–1) and z1 of Qx with highest average similarity in Equation (8). Then pick the columns y2 and z2 with the next highest similarity in Equation (8), ignoring similarity scores of pairs containing either of the previously chosen columns y1 and z1. Repeat this procedure until K pairs of columns, one for the matrices P1(Q1),P2(Q2),...,Px–1(Qx–1) and one for Qx, have been picked. Fix the permutation of Qx that matches up these pairs of columns, or Px(Qx).
- (2a)Choose a second run, Q2. Compute G for all pairs of columns, one from Q1 and one from Q2. This computation is simply the value of G for two columns—hence no permutations of Q2 are computed, unlike in step 2 for the Greedy algorithm.
A candidate for the vector of permutations of the R runs that maximizes H across all possible vectors has now been constructed. This algorithm runs in time proportional to TLargeKGreedy=[R(R–1)/2]K2C. The number of pairs of columns, one from the run under consideration and one from the previously fixed runs, is K2. For run r (r ranging from 2 to R), the number of computations of G performed for each pair of columns is r–1. Considering all runs from 2 to R, the total number of computations of G performed is K2[R(R–1)]/2. Since G is computed only for columns rather than for whole matrices, the time of computation of G is proportional only to C rather than to KC, as in the other algorithms.
As is true for the Greedy algorithm, the order in which the runs are considered can affect the result. For the LargeKGreedy algorithm CLUMPP offers the same three options for selecting the input sequence of runs as it provides for the Greedy algorithm.
To get an idea of which algorithm to use, we have found it useful to compute the quantity D=TN for each algorithm, where T is a quantity proportional to the time required by an algorithm (as described above) and N is the number of input sequences to be tested (for FullSearch, N=1). If D
1013 for FullSearch, then this algorithm is fast enough and is preferred; otherwise the Greedy algorithm can be used. If D
1013 for the Greedy algorithm, then this algorithm is probably also too slow. In that case, the LargeKGreedy algorithm should be used, as it can handle K>20, R>100 and C>1000 in reasonable time. We recommend testing at least 100 input sequences for the Greedy and LargeKGreedy algorithms—and preferably many more—to find the approximately highest SSCR value.
Example
Rosenberg et al. (2001) analyzed a data set of 27 microsatellites genotyped in 600 chickens from 20 breeds. Using STRUCTURE (Pritchard et al., 2000), they inferred membership coefficients of the 600 chickens in 19 clusters. By repeating the cluster analysis 100 times, they assessed the robustness of the clusters, and Figure 1A displays the cluster membership estimates from 25 of the 100 replicates (randomly chosen). From Figure 1A, the label-switching problem is apparent. Focusing on Runs 15 and 16, we could quickly match up the clusters (i.e. the colored segments). However, because of some genuine multimodality, it would be not only tedious, but also difficult, to visually align the clusters for all replicates in an optimal manner. CLUMPP is designed to address exactly this type of situation.
|
Using CLUMPP and the LargeKGreedy algorithm on the 100 replicates (10 sets of 100 random input sequences of runs—for each set this takes
6 min on a 2.4 GHz processor), we find the highest value of H to be in the range of 0.51–0.54 (and the highest value of H' to be between 0.69 and 0.70). If we allow the algorithm to run longer (30 000 random input sequences), the highest H equals 0.5546. The highest-scoring input sequences tend to produce quite similar alignments of the replicates. Excluding the input sequence that produced the highest H-value, the next 10 highest-scoring input sequences all produced H-values of at least 0.5441, and had on average 2.0% differences compared to the input sequence that led to the highest H. In other words, considering the permuted position of a randomly chosen cluster (among 19) in a randomly chosen run (among 100) based on the output of CLUMPP using a randomly chosen input sequence (among the 10 next highest-scoring sequences, after the one with the highest H-value), the cluster had a 98.0% chance of being aligned in the same way that it was aligned when using the highest-scoring of all input sequences. The permuted membership coefficients of the 25 runs in Figure 1A for the input sequence among the 30 000 that leads to the highest H are shown in Figure 1B. Comparing the permuted membership coefficients for this input sequence to the clustering behavior in Table 2 of Rosenberg et al. (2001), we recapture the same patterns of clustering. For example, Brown-egg layer lines C and D always share a cluster, and this cluster sometimes (in 8 of 100 runs) includes Godollo Nhx (Run 11 in Fig. 1).
When we consider the LargeKGreedy algorithm with 10 000 fixed input sequences (chosen randomly among the 30 000 described above), for each of the 10 000 sequences, the alignment of the replicates obtained by CLUMPP is identical regardless of which of two statistics—H or H'—is used. The same input sequence that produces the highest H-value (H=0.5508) produces the highest H'-value (H'=0.7099). Excluding the input sequence that produced the highest values of H and H', the next 10 highest-scoring input sequences—the same sequences for both statistics—all produced H-values of at least 0.5392 and H'-values of at least 0.7022, and had on average 7.0% differences compared to the input sequence that led to the highest H and H'. These results suggest that although matrices can be constructed so that the two statistics can lead to different alignments, in practice, their properties are extremely similar. The larger number of differences for the highest-scoring input sequences from among 10 000 sequences compared to the highest-scoring among the 30 000 sequences described above highlights the importance of employing a large number of input sequences whenever possible.
3 DISCUSSION
Interpreting the results of multiple population genetic cluster analyses is not always straightforward (Corander et al., 2004; Evanno et al., 2005; Pritchard et al., 2000; Rosenberg et al., 2001). One could argue that the single replicate with the highest likelihood (according to the criterion of the clustering program) is the optimal clustering solution, and that solutions with lower likelihood can be discarded. However, for complex data sets that produce several distinct solutions with similar likelihoods, this strategy may not be satisfactory, as it disregards both uncertainties of the analysis and important population structure information that may only be visible in different modes. From a Bayesian perspective, a possible approach would be to weight different solutions by their likelihoods. However, this approach can be difficult to apply with individual membership coefficients in the case of genuine multimodality. A third approach is to summarize data on the outcomes of many replicates, such as in Table 2 of Rosenberg et al. (2001). Regardless of the choice of procedure for employing multiple cluster analyses to make biological interpretations from population-genetic data, CLUMPP can provide a useful way of assessing the similarity of the outcomes of individual runs.
We note that our focus here has been on the setting in which the clustering algorithm explores only one mode during the iterative process that constitutes a single replicate analysis, but may explore different modes in different replicates. As was recognized by Pritchard et al. (2000), this setting typically applies to the STRUCTURE approach, for which the results of individual replicates are reasonably straightforward to summarize and interpret, but for which difficulties may arise in interpreting differences across replicates. It is possible to envision an alternative scenario in which multiple important modes (potentially including multiple permutations of each of the genuinely distinct modes) are explored during the course of a single analysis. In this situation, replicate analyses would produce similar results, but because the meaning of a cluster label could vary across intermediate steps in the analysis, it would be challenging to summarize the outcome of any single replicate. Although our analysis has used the language of multiple replicate analyses rather than that of multiple intermediate steps of a single analysis, it is worth noting that the algorithms we have proposed for the matching of clusters can also be applied with each replicate described above corresponding to a specific state of the membership coefficient estimates encountered during the course of a single analysis. Thus, in addition to their uses in interpreting multiple replicate analyses, our algorithms can augment existing approaches to the analysis of individual replicates (e.g. Dawson and Belkhir, 2001; Huelsenbeck and Andolfatto, 2007) and can contribute new approaches for the problem of summarizing the intermediate steps of a single run.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank O. François and two anonymous reviewers for helpful comments on a previous version of the manuscript. This work was supported by NIH grant HL084729, by grants from the Burroughs Wellcome Fund and the Alfred P. Sloan Foundation, and by a postdoctoral fellowship to M.J. from the University of Michigan Center for Genetics in Health and Medicine.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Received on December 28, 2006; revised on March 14, 2007; accepted on April 25, 2007
| REFERENCES |
|---|
|
|
|---|
Anderson EC, Thompson EA. A model-based method for identifying species hybrids using multilocus genetic data. Genetics (2002) 160:1217–1229.
Chen C, et al. FASTRUCT: model-based clustering made faster. Mol. Ecol. Notes (2006) 6:980–983.[CrossRef][Web of Science]
Chen C. Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study. Mol. Ecol. Notes (2007) doi: 10.1111/j.1471-8286.2007.01769.x.
Corander J, Marttinen P. Bayesian identification of admixture events using multilocus molecular markers. Mol. Ecol. (2006) 15:2833–2843.[Medline]
Corander J, et al. Bayesian analysis of genetic differentiation between populations. Genetics (2003) 163:367–374.
Corander J, et al. BAPS 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics (2004) 20:2363–2369.
Dawson KJ, Belkhir K. A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genet. Res. (2001) 78:59–77.[CrossRef][Web of Science][Medline]
Evanno G, et al. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol. Ecol. (2005) 14:2611–2620.[CrossRef][Medline]
Falush D, et al. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics (2003) 164:1567–1587.
François O, et al. Bayesian clustering using hidden Markov random fields in spatial population genetics. Genetics (2006) 174:805–816.
Golub GH, Van Loan CF. Matrix Computations (1996) 3rd. Johns Hopkins University Press, Baltimore.
Huelsenbeck JP, Andolfatto P. Inference of population structure under a Dirichlet process model. Genetics (2007) 175:1787–1802.
Jasra A, et al. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Stat. Sci. (2005) 20:50–67.[CrossRef][Web of Science]
Nordborg M, et al. The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol. (2005) 3:1289–1299.[Web of Science]
Pella J, Masuda M. The Gibbs and split-merge sampler for population mixture analysis from genetic data with incomplete baselines. Can. J. Fish. Aquat. Sci. (2006) 63:576–596.[CrossRef]
Pritchard JK, et al. Inference of population structure using multilocus genotype data. Genetics (2000) 155:945–959.
Rosenberg NA. DISTRUCT: a program for the graphical display of population structure. Mol. Ecol. Notes (2004) 4:137–138.[CrossRef][Web of Science]
Rosenberg NA, et al. Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics (2001) 159:699–713.
Rosenberg NA, et al. Genetic structure of human populations. Science (2002) 298:2381–2385.
Stephens M. Dealing with label switching in mixture models. J. R. Stat. Soc. B (2000) 62:795–809.[CrossRef]
This article has been cited by other articles:
![]() |
C. Dalvit, M. De Marchi, E. Zanetti, and M. Cassandro Genetic variation and population structure of Italian native sheep breeds undergoing in situ conservation J Anim Sci, December 1, 2009; 87(12): 3837 - 3844. [Abstract] [Full Text] [PDF] |
||||
![]() |
S.-F. Chen, G. Jones, and S. J. Rossiter Determinants of echolocation call frequency variation in the Formosan lesser horseshoe bat (Rhinolophus monoceros) Proc R Soc B, November 7, 2009; 276(1674): 3901 - 3909. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Durand, F. Jay, O. E. Gaggiotti, and O. Francois Spatial Inference of Admixture Proportions and Secondary Contact Zones Mol. Biol. Evol., September 1, 2009; 26(9): 1963 - 1973. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. R. Boyko, R. H. Boyko, C. M. Boyko, H. G. Parker, M. Castelhano, L. Corey, J. D. Degenhardt, A. Auton, M. Hedimbi, R. Kityo, et al. Complex population structure in African village dogs and its implications for inferring dog domestication history PNAS, August 18, 2009; 106(33): 13903 - 13908. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Durand, C. Chen, and O. Francois Comment on 'On the inference of spatial structure from population genetics data' Bioinformatics, July 15, 2009; 25(14): 1802 - 1804. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. L. Evans, K. J. Gaston, A. C. Frantz, M. Simeoni, S. P. Sharp, A. McGowan, D. A. Dawson, K. Walasz, J. Partecke, T. Burke, et al. Independent colonization of multiple urban centres by a formerly forest specialist bird species Proc R Soc B, July 7, 2009; 276(1666): 2403 - 2410. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Guillot Inference of structure in subdivided populations at low levels of genetic differentiation--the correlated allele frequencies model revisited Bioinformatics, October 1, 2008; 24(19): 2222 - 2228. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. X. Pico, B. Mendez-Vigo, J. M. Martinez-Zapater, and C. Alonso-Blanco Natural Genetic Variation of Arabidopsis thaliana Is Geographically Structured in the Iberian Peninsula Genetics, October 1, 2008; 180(2): 1009 - 1021. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








