Bioinformatics Advance Access originally published online on May 29, 2008
Bioinformatics 2008 24(16):1765-1771; doi:10.1093/bioinformatics/btn244
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Efficient functional clustering of protein sequences using the Dirichlet process
1Department of Bioengineering, UC Berkeley and 2Merck & Co., Inc., 1700 Owens St, San Francisco, CA 94158, USA
| ABSTRACT |
|---|
|
|
|---|
Motivation: Automatic clustering of protein sequences is an important problem in computational biology. The recent explosion in genome sequences has given biological researchers a vast number of novel protein sequences. However, the majority of these sequences have no experimental evidence for their molecular function in the cell, and the responsibility for correctly annotating these sequences falls upon the bioinformatics community. Ideally, we would like to be able to group sequences of similar or identical molecular function in an automatic fashion, without relying on experimental evidence.
Results: In this article I present a novel probabilistic framework that models subfamilies within a known protein family. Given a multiple sequence alignment, the model uses Dirichlet mixture densities to estimate amino acid preferences within subfamily clusters, and places a Dirichlet process prior on the overall set of clusters. Based on results from several datasets, the model breaks data accurately into functional subgroups.
Availability: The algorithm is implemented as c++ software available at bpg-research.berkeley.edu/~duncanb/dpcluster/
Contact: duncan_brown{at}merck.com
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
The vast majority of known protein sequences have no experimental evidence for their function, but rather are annotated by homology to sequences of known function. As sequence databases continue to grow, the field of comparative genomics will only become more important. As a result, there has been considerable interest in automatic methods for grouping sequences of similar function.
Many methods for protein clustering depend on the derivation of a tree or graph linking the given set of sequences (Abascal and Valencia, 2002; Enright and Ouzounis, 2000; Krause and Vingron, 1998; Li and Godzik, 2006; Wicker et al., 2001; Yona et al., 1998). These methods tend to use pairwise sequence similarity to define the graph and the clusters, since it is computationally inexpensive, and require setting a similarity threshold parameter to cut the tree into clusters. (Sjölander, 1998) proposed a truly automatic method that minimizes an encoding cost calculated from a Dirichlet mixture. However, all these algorithms depend on producing a graph topology that adequately reflects the functional groups, which is by no means guaranteed.
A recent publication described a probabilistic model that uses a mixture of Gaussian distributions under a Dirichlet process (DP) prior to detect sequence clusters (Dubey et al., 2004) within protein families. The model allows the number of clusters to be defined by the data, and is not constrained by tree topology. However, categorical data such as protein sequences are not natively amenable to inference using the Gaussian. The algorithm transforms sequences into Fisher score vectors using parameters from an HMM trained on the input sequences and then greatly reduces the dimensionality of the data using principal component analysis.
This article proposes a novel solution to the protein clustering problem that is also based on the DP mixture model, but which models the sequence data directly. I focus on the problem of sub- family detection within a known protein family, given a multiple sequence alignment as input. The model efficiently detects latent classes of sequences within a protein family, which very often correspond to functionally distinct subtypes. I compare its perform- ance to several available algorithms on three distinct datasets.
| 2 APPROACH |
|---|
|
|
|---|
2.1 Dirichlet processes
I give here a brief intuitive overview of the DP mixture model; for more details see Neal (2000). Let G0 be a probability measure defined over a measurable space
. G(
, G0) is defined to be a DP if, for any finite partition A=(A1,...,Al) of
, the probability of A is distributed as a Dirichlet density with parameters (
G0(A1),...,
G0(Al)). This definition is precise, but not particularly helpful in visualizing the DP object itself or in understanding its utility. The DP functions as a distribution on distributions; a particular draw from a DP is itself a complete probability distribution from which we can sample. Sethuraman (1994) showed that any draw from a DP must be a countably infinite discrete distribution, even when G0 is continuous. This discretization property makes the DP extremely useful as a prior in clustering applications, since it allows multiple samples to possess the same value, and does not fix a priori the number of clusters in the model.
Blackwell and MacQueen (1973) showed that it is possible to generate data samples
i from a DP according to a Pólya urn model:
|
| (1) |
<i is the set of
values preceding i, and
is the delta function. The probability of assigning
i to an existing c is proportional to the number of points already assigned to c, and the probability of assigning it to a new cluster is proportional to
. This process generates a set c of clusters, where all
assigned to a given cluster c have the same value,
c. The limiting distribution of the
i is a countably infinite mixture, in which unique component values
c are drawn from G0, and the component proportions correspond to the number of
in each cluster (Fig. 1). The cluster sizes can also be modeled directly with a stick-breaking distribution (Neal, 2000).
|
We can utilize this clustering effect by incorporating the DP into a mixture model framework. We treat a set of data y=(y1,...,yn) as independent draws from a mixture of distributions F(y,
), where
is drawn from G (and therefore from the pool of discrete
), rather than from G0, as would normally be the case. As with the standard model, when the underlying distribution G0 is conjugate to F, we may integrate out
to obtain the posterior distribution of the clusters given only the observed data.
|
| (2) |
|
| (3) |
is the number of items in the cluster c to which i has been assigned, and Hi,c(
) is the posterior distribution of
given G0 and data yj such that cj=c and j<i. This reduces the state space of the model to the set of cluster indicators c.
2.2 The protein clustering model
In this section, I define parameterizations for the DP protein model. The algorithm assumes a multiple sequence alignment as input. The yi are individual sequences within the alignment. Each yi thus takes the form of a multivariate multinomial variable, where each column of the alignment corresponds to one dimension of y, and yij indexes the amino acid in sequence i at position j. Gaps in the alignment are ignored. The distribution of a sequence yi for given
is
|
|
j is a multinomial distribution for column j. I set G0 to be a Dirichlet mixture distribution, in which individual components model groups of amino acids that have been observed to substitute for each other in benchmarked alignments (Sjölander et al., 1996). In the experiments reported below, I used the recode4.comp Dirichlet mixture provided as part of the SAM HMM software (Karplus et al., 1997).
A Dirichlet mixture containing M mixture components takes the form
|
| (4) |
|
| (5) |
is the vector of observed counts for cluster c in column j:
|
| (6) |
(yij,k) is the delta function that evaluates to 1 when yij corresponds to amino acid k and 0 otherwise, and s
(yij) picks out the element of s
that corresponds to amino acid yij. P(βm|s
) is the posterior probability of the m-th Dirichlet component given the counts s
.
|
| (7) |
|
| (8) |
2.3 Sampling from the posterior
2.3.1 Gibbs sampling
Exact inference on the full posterior distribution for the DP mixture model is intractable, but we can sample from it relatively easily using MCMC methods (Neal, 2000). The most straightforward sampling algorithm for this model is Gibbs sampling, although the Gibbs algorithm mixes poorly and must be augmented (Jain and Neal, 2000). Gibbs updates are easily obtained from (2) and (3) and are given in the Supplementary Materials.
2.3.2 Split-merge sampling
The Gibbs sampler is straightforward to implement, but when the dimensionality of the data becomes large, as in this model, the Gibbs sampler can often fail to mix well and will become stuck for long periods of time in one configuration. This occurs because Gibbs sampling updates only single data points; the sampler must pass through low-probability states in which a small number of data points break away to establish their own cluster. Jain and Neal (2000) proposed a Metropolis–Hastings algorithm that overcomes this issue by updating multiple data points at once, providing effective mixing within the state space.
Their algorithm chooses points yi and yj at random from the dataset and if these are in the same cluster proposes a split move, assigning all other points yk in the cluster to join yi or yj with probability proportional to the likelihood of yk under either new cluster. Conversely, if the points are in different clusters, the algorithm proposes to merge the two clusters. The method uses split and merge updates to complement Gibbs sampling scans; the split–merge update is efficient in jumping between cluster configurations, while the Gibbs scan is efficient in assigning single data points to their most probable cluster. For results reported here, I chose to use a variant of the split–merge algorithm proposed by Dahl (2003), which was significantly faster for this model. Algorithm details and equations for the proposal, prior and likelihood distributions are given in the Supplementary Materials.
The split–merge samplers described in Jain and Neal (2000) and Dahl (2003) alternate between performing split–merge updates and full Gibbs scans across the whole dataset. However, I found this to be inefficient for the protein model. Sequences were infrequently moved between clusters during Gibbs scans, and these usually followed accepted split–merge updates. I reduced the frequency of Gibbs sampling scans to one per 50 split–merge updates. This seemed to be a good balance, in that most Gibbs scans still did not reassign data points (and thus were not too infrequent), and the average time per iteration was reduced approximately 20-fold. Note that this procedure still produces a valid chain, since the split–merge updates and Gibbs scans each sample from the posterior.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Datasets
The goal of the algorithm is to accurately cluster functionally similar sequences within a protein family. I tested the performance of the split–merge sampler on two datasets that have been previously used to evaluate sequence protein clustering solutions (Brown et al., 2007), and one novel set. The EXPERT set consisted of five extensively curated protein families from three different resources, with additional subdivisions of two of the families to create a total of eight classifications of expert-defined subtypes. These included the amine and secretin families from the GPCRDB (Horn et al., 2003), the nuclear hormone receptor (NHR) family from NucleaRDB (Horn et al., 2001), and the enolase and crotonase enzyme families from the Structure–Function Linkage Database (SFLD; Pegg et al. (2006)), a manually curated resource that incorporates mechanistic, sequence and structural information for several diverse enzyme families to derive functional subtypes based on conserved chemical mechanism.
The second dataset was a larger, uncurated set of 57 PFAM family alignments, matched to SCOP superfamilies by scoring SCOP superfamily members against the PFAM HMM. The set is non-redundant at the level of SCOP fold. Families were divided into functional subtypes based on all four digits of members EC number (Webb et al., 1992). Families with fewer than two EC numbers were removed. The algorithms were given the entire family as input, but were only scored on sequences which possessed EC numbers and aligned over at least 75% of the alignment.
For the final dataset, I examined the algorithms ability to divide sequences in SCOP superfamilies into their family designations. I chose SCOP superfamilies having seven or more families, for a total of 50 superfamilies in the set. Superfamily sequences were taken from Astral 90 (Chandonia et al., 2004) and aligned with Muscle (Edgar, 2004) before submission to the algorithms.
3.2 Comparing partitions
I used several scoring functions to evaluate the performance of clustering methods. These were also used in (Brown et al., 2007). Subfamily purity is measured as the fraction of subfamilies that contain only one expert subtype or EC number.
The variation of information (VI) is a distance metric on partitions (Meila, 2003). As such, it obeys the triangle inequality: VI(A, B)+VI(B, C)
VI(A, C) for partitions A, B, C. Given two partitions, the VI index measures the amount of information in each partition that is not shared between them. It is calculated as
|
| (9) |
|
| (10) |
|
| (11) |
The edit distance is defined as the minimum number of split or merge operations required to transform one partition into the other. A split or merge affecting multiple data points is considered one operation. The edit distance is calculated as
|
| (12) |
The purity and distance scores are complimentary: an algorithm that places each sequence in its own cluster will have perfect purity, but very poor distance scores. The relationship between the VI and edit distances is somewhat more subtle. The VI distance places more weight on large clusters, producing high scores when they are incorrect, whereas the edit distance weights each cluster equally, therefore penalizing methods producing many singletons.
I ran the sampler for 10000 iterations on each family alignment, performing a Gibbs scan over all sequences every 50 iterations. I then evaluated the posterior likelihood of each distinct c state and chose the state with the highest likelihood for assessment.
3.3 EXPERT set
Using the EXPERT set, I compared DPCluster, SCI-PHY (Brown et al., 2007), Ncut (Abascal and Valencia, 2002), Secator (Wicker et al., 2001) and two similarity thresholds for CD-HIT (Li and Godzik, 2006): 40% and 70% within-cluster identity. The results show that DPCluster is definitively the best of the methods tested (Table 1). CD-HIT 70 consistently achieved the highest purity, but also greatly over-divided the families, based on its high distance scores. DPCluster had the next-best subfamily purity in four of the eight classifications, and also achieved the lowest VI distance in three of the eight, and the lowest edit distance in five. No method performed as consistently well overall.
|
The performance of the Ncut algorithm was quite inconsistent. In some cases, such as for the crotonase family, it had low VI and edit distance, but relatively poor purity. In the enolases, however, the algorithm greatly over-divided the family, receiving high purity, but also quite high distance scores. In the amine family, Ncut produced only eight non-singleton subfamilies, seven of which were pure. However, the one impure subfamily contained 317 sequences, 89% of the family.
DPCluster results for the crotonase family were particularly good; only one (cluster 10) of the 26 clusters mixed subtypes, and 88 of the 89 sequences in cluster 10 were of the same subtype. The one sequence in error was the only member of its subtype. The singleton was an outlier within its cluster; it shared at most 35% identity with any other sequence. However, cluster 10 as a whole showed conservation at several positions that were not conserved across the whole family, and agreement at these positions likely explains why the singleton was placed in the cluster. This illustrates the point that in general, singleton clusters will be unlikely under this model, unless the sequence is quite different from the other members of the family. This is part of the reason why the Gibbs sampler performs poorly on its own. In fact, DPCluster produced the fewest singletons in every family. This is a significant improvement over the other methods, which tended to produce far too many singletons for this dataset. This can be a problem if the true classification contains many singletons, but note that DPCluster had the lowest VI and edit distances for the very specific NHR Level 3 classification, which was divided into 77 subtypes.
To demonstrate this anti-singleton effect, I split cluster 10 to place the singleton into its own cluster and then assessed the likelihood of the new cluster state. The new configuration received a substantially lower likelihood than the original. I then introduced an additional copy of the singleton into both states, including the copy into either cluster 10 or the new cluster. With two copies, the split configuration received a higher likelihood than incorporating both copies into cluster 10. This bias could be addressed by setting
to a higher value, but this is not a general solution. The likelihood surface is extremely steep; very small changes in cluster structure (such as creation of a singleton) tended to cause tens or hundreds of log value changes in the likelihood. Setting
to a particular (high) value to counteract this effect would not generalize to other families. However, it may be possible to address the issue by placing a prior on
and integrating it out.
The functional subtypes defined in these families are generally representative of the types of divisions that are useful to biologists, but the concept of functional similarity is somewhat arbitrary. For example, the three levels of classification within the nuclear receptor family represent increasingly more specific functional distinctions: Level 1 divides the families into general classes of ligand, such as estrogen-like and thyroid-hormone-like; Level 2 subdivides the Level 1 subtypes into finer gradations, for instance separating the estrogen-like group into estrogen, estrogen-related and gluco-corticoid receptors; finally Level 3 splits Level 2 groups into subtypes that bind specific ligand molecules. As shown in Table 1, the set of clusters produced by the DPCluster algorithm fell between Levels 2 and 3 in specificity. The performance of the algorithm may be tuned to reflect a particular level of functional specificity through the exact choice of the Dirichlet mixture prior G0. It may also be possible to adjust for the anti-singleton effect seen in the Crotonase family in the same way. I explore this idea further in Section 3.6.
3.4 EC and SCOP sets
Results for the EC dataset, while not as definitive as for the EXPERT set, support the same conclusions (Table 2). DPCluster recieved higher purity scores than SCI-PHY, while maintaining quite similar distance scores. In fact, DPCluster was the only algorithm to show both better purity and better edit distance than another method. In addition, DPCluster shows similar purity scores to CD-HIT 40 (P<0.06), and much better VI and edit distance (P<10–5 in both cases).
|
All algorithms except Secator (which collapsed most families together) received high purity scores on the SCOP dataset (Table 3). Both CD-HIT versions had perfect purity but quite high distance scores, suggesting that they over-divided the families. DPCluster achieved the next-highest purity (0.93) and the lowest distance scores among the four high-purity algorithms.
|
3.5 DPCluster sampling efficiency
Overall, DPCluster is fast: 10000 iterations required between 19 and 132 min on a standard Mac for the families in the Expert set, depending on the size of the alignment. This compares favorably with clustering based on phylogenetic trees, which can require significantly more computation time to estimate (although DPCluster is still far slower than distance methods such as neighbor joining). The run time per iteration is O(nci, cjLM) where M is the number of components in the Dirichlet mixture G0, L is the number of columns in the input alignment and nci,cj is the number of sequences in the cluster(s) under consideration in the current iteration. Clearly, the clustering structure of the set features significantly in the run time: when the number of clusters is large, the cost is reduced significantly.
Based on trace plots, the sampler quickly reaches equilibrium even when the input alignment is large (Fig. 2), and for most runs the cluster structure was essentially unchanged after 2000 iterations. Figure 2 also shows that the chain mixes well between alternate high-probability states. Runs of up to 100 000 iterations showed very similar findings: no major changes in the clustering structure were apparent after the initial burn-in phase of approximately 2000 iterations (data not shown).
|
3.6 The effect of the Dirichlet mixture prior G0
It is worth noting the effect of the Dirichlet mixture prior G0 on the performance of the model. Standard Dirichlet mixture priors are trained on large sets of alignment data and reflect the types of conservative substitutions observed in evolution. Each component of the mixture represents a group of amino acids that tend to substitute for each other; observations that match a particular prior component will give that component high posterior probability compared to the other prior components. The prior then in effect provides pseudocounts for amino acids similar to those observed in the data, allowing the posterior to generalize accurately even when there are few observations. In most models, such as standard HMM parameter estimation, the contribution of the prior will quickly shrink as the number of observations grows, and the posterior distribution will tend to reflect the observed amino acid frequencies in this case. However, in the DP model, each cluster generates its own posterior and the number of observed amino acids is dependent only on the size of the cluster. This means that the prior can have a large effect on the posterior likelihood, no matter how many sequences are present in the alignment.
To demonstrate this, I reran the algorithm using a null prior having just one Dirichlet component, with small total mass. This prior essentially rejects any substitutions; posterior amino acid distributions will tend to be based solely on the observed frequencies. Runs for most families produced between 50% and 150% more clusters, most of which had high internal conservation. The large effect of the choice of prior on the results allows us to tune performance depending on the extent of clustering we expect to see. Very specific functional attributes, such as ligand specificity, or fine taxonomic distinctions, might require a prior that rewards individual amino acid conservation, while more general attributes would use a more forgiving prior. This is analogous to the choice of substitution matrices in pairwise homology searches.
| 4 CONCLUSION |
|---|
|
|
|---|
I have presented a highly accurate probabilistic model for clustering protein sequences. The algorithm, based on the DP infinite mixture model, is able to separate sequences into functionally distinct clusters, given a multiple sequence alignment, and achieves excellent accuracy on a set of five manually classified protein families.
For all families, multiple runs produced very similar output, suggesting that the algorithm is converging to optimal or near-optimal maxima despite the very large state space. As noted earlier, due to the high dimensionality of the data, the likelihood surface is very steep, and favorable proposals often have an acceptance probability of unity. This suggests that the columns of the alignment may be redundant; it may be possible to achieve a faster algorithm by reducing the dimensionality of the data under consideration at each update step.
Finally, as with many sequence-based models, the quality of the input alignment is naturally critical to the success of the algorithm; incorrect alignments will degrade performance and lead to artificial clustering. Several algorithms have been proposed that marry alignment sampling to another model (Lunter et al., 2005; Metzler et al., 2001), and I believe that it is possible to extend this algorithm in a similar manner.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
Deep thanks to Kimmen Sjölander for support during this work. Thanks also to several anonymous reviewers for their insights.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Burkhard Rost
Received on March 10, 2008; revised on May 22, 2008; accepted on May 22, 2008
| REFERENCES |
|---|
|
|
|---|
Abascal F, Valencia A. Clustering of proximal sequence space for the identification of protein families. Bioinformatics (2002) 18:908–921.
Blackwell D, MacQueen JB. Ferguson distributions via Polya Urn schemes. Ann. Stat (1973) 1:353–355.[CrossRef]
Brown DP, et al. Automated protein subfamily identification and classification. PLoS Comput. Biol (2007) 3:e160.[CrossRef][Medline]
Chandonia JM, et al. Structural genomics and structural biology: compare and contrast. Genome Biol (2004) 5:343.[CrossRef][Medline]
Dahl DB. An improved merge-split sampler for conjugate dirichlet process mixture models. In: Technical Report 1086. (2003) Madison: University of Wisconsin.
Dubey A, et al. Clustering protein sequence and structure space with infinite Gaussian mixture models. Pac. Symp. Biocomput (2004) 399–410.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004) 32:1792–1797.
Enright AJ, Ouzounis CA. GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics (2000) 16:451–457.
Horn F, et al. Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res (2001) 29:346–349.
Horn F, et al. GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res (2003) 31:294–297.
Jain S, Neal RM. A split-merge markov chain Monte Carlo procedure for the dirichlet mixture model. In: Technical Report 2003. (2000) Toronto, Canada: University of Toronto.
Karplus K, et al. Predicting protein structure using hidden Markov models. Proteins (1997) (Suppl. 1):134–139.
Krause A, Vingron M. A set-theoretic approach to database searching and clustering. Bioinformatics (1998) 14:430–438.
Li W, Godzik A. cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (2006) 22:1658–1659.
Lunter G, et al. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics (2005) 6:83.[CrossRef][Medline]
Meila M. Comparing clusterings by the variation of information. In: Learning Theory And Kernel Machines. Vol. 2777 of Lecture Notes In Artificial Intelligence.—Carbonell JG, Siekmann J, eds. (2003) 2777. Heidelberg, Germany: Springer-Verlag. 173–187.
Metzler D, et al. Assessing variability by joint sampling of alignments and mutation rates. J. Mol. Evol (2001) 53:660–669.[CrossRef][Web of Science][Medline]
Neal RM. Markov chain sampling methods for Dirichlet process mixture. J. Comput. Graph. Stat (2000) 9:249–265.[CrossRef]
Webb EC, Committee of the International Union of Biochemistry and Molecular Biology. Enzyme nomenclature 1992: recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes. (1992) San Diego: Academic Press.
Pegg SC, et al. Leveraging enzyme structure-function relationships for functional inference and experimental design: the structure-function linkage database. Biochemistry (2006) 45:2545–2555.[CrossRef][Web of Science][Medline]
Sethuraman J. A constructive definition of dirichlet priors. Stat. Sin (1994) 4:639–650.
Sjölander K. Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol (1998) 6:165–174.[Medline]
Sjölander K, et al. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci (1996) 12:327–345.
Wicker N, et al. Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol. Biol. Evol (2001) 18:1435–1441.
Yona G, et al. A map of the protein space–an automatic hierarchical classification of all protein sequences. Proc. Int. Conf. Intel. Syst. Mol. Biol (1998) 6:212–221.[Medline]
This article has been cited by other articles:
![]() |
B. Andreopoulos, A. An, X. Wang, and M. Schroeder A roadmap of clustering algorithms: finding a match for a biomedical application Brief Bioinform, May 1, 2009; 10(3): 297 - 314. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







