Bioinformatics Advance Access originally published online on January 12, 2005
Bioinformatics 2005 21(9):1876-1890; doi:10.1093/bioinformatics/bti244
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
On the quality of tree-based protein classification
Computational Biology Department, Applied Biosystems 850 Lincoln Centre Drive, Foster City, CA 94404, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Phylogenetic analysis of protein sequences is widely used in protein function classification and delineation of subfamilies within larger families. In addition, the recent increase in the number of protein sequence entries with controlled vocabulary terms describing function (e.g. the Gene Ontology) suggests that it may be possible to overlay these terms onto phylogenetic trees to automatically locate functional divergence events in protein family evolution. Phylogenetic analysis of large datasets requires fast algorithms; and even fast, approximate distance matrix-based phylogenetic algorithms are slow on large datasets since they involve calculating maximum likelihood estimates of pairwise evolutionary distances. There have been many attempts to classify protein sequences on the family and subfamily level without reconstructing phylogenetic trees, but using hierarchical clustering with simpler distance measures, which also produce trees or dendrograms. How can these trees be compared in their ability to accurately classify protein sequences?
Results: Given a reference classification or group membership labels for a set of related protein sequences as well as a tree describing their relationships (e.g. a phylogenetic tree), we propose a method for dividing the tree into monophyletic or paraphyletic groups so as to optimize the correspondence between the reference groups and the tree-derived groups. We call the achieved optimal correspondence the accuracy of a tree-based classification (TBC), which measures the ability of a tree to separate proteins of similar function into monophyletic or paraphyletic groups. We apply this measure to compare classical NJ and UPGMA phylogenetic trees with the trees obtained from hierarchical clustering using different protein similarity measures. Our preliminary analysis on a set of expert-curated protein families and alignments suggests that there is no uniformly superior algorithm, and that simple protein similarity measures combined with hierarchical clustering produce trees with reasonable and often the most accurate TBC. We used our measure to help us to design TIPS, a tree-building algorithm, based on agglomerative clustering with a similarity measure derived from profile scoring. TIPS is comparable with phylogenetic algorithms in terms of classification accuracy and is much faster on large protein families. Due to its time scalability and acceptable accuracy, TIPS is being used in the large-scale PANTHER protein classification project. The trees produced by different algorithms for different protein families can be viewed at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp. For every tree and every level of classification granularity we provide the optimal TBC along with the reference classification.
Availability: The script that evaluates the accuracy of TBC is available at http://panther.appliedbiosystems.com/pub/tree_quality/index.jsp
Contact: betty.lazareva{at}fc.celera.com
| INTRODUCTION |
|---|
|
|
|---|
With the current scale of genome sequencing the problem of large-scale automatic protein classification has gained practical importance. It is well known that protein function can be classified using phylogenetic analysis of a gene tree combined with the corresponding species tree (Eisen, 1998; Zmasek et al., 2001). Protein classification based on phylogenetic information, known as phylogenomics, requires however, intensive calculations. Even approximate, distance matrix-based phylogenetic algorithms (see Li, 1997) have a performance time bottleneck, namely the maximum likelihood (ML) estimation of evolutionary distances between pairs of proteins. There have been many attempts in protein classification to circumvent the problem of phylogenetic tree reconstruction by using protein sequence similarity measures, rather than evolutionary distances (Heger et al., 2001; Krause et al., 2000; Remm et al., 2001; Sasson et al., 2002; Tatusov et al., 2001). All these similarity-based measures can, in principle, be used to build trees using hierarchical clustering. How do these different methods compare in their ability to partition protein family members into subfamilies that are reflective of the different function classes within the family? In the context of the tree topology, this question becomes: given a tree and the true classification of each sequence in it, how should we divide the tree into subtrees (subfamilies) such that these subtrees best match the true grouping into classes? From the point of view of classification, a tree should place all sequences of similar function in common subtrees, i.e. in monophyletic groups. Since in practice this does not always happen for a given tree, we suggest a method for finding optimal subtrees that minimizes the tree-based classification (TBC) error, i.e. the number of incorrectly classified sequences. We then use this measure to assess different tree-building algorithms for a number of protein families for which reference (true) classifications have been determined. Finally, we give examples of how the optimal TBC clusters can be used to make predictions about protein function. We treat the more complicated case of paraphyletic groups in Appendix B.
The accuracy of classifications derived from hierarchical clustering has been addressed in (Yona et al., 1999; Sasson et al., 2002). Clusters, in these papers, were determined in unsupervised fashion: agglomeration and formation of new clusters stopped when some external criterion was met (e.g. when the dissimilarity between clusters or the number of non-singleton clusters exceeded a certain threshold). Thus, the set of resulting clusters was obtained as a snapshot in the course of agglomeration. Only after clusters had been built was the actual information on functional classification used to evaluate their quality. In contrast, we use the classification information itself to best determine functionally homogeneous clusters. These optimal clusters, therefore, do not correspond to any snapshot in the agglomeration process. Though a snapshot approach is common in cluster analysis, this approach is strictly applicable only when the clusters to be found are assumed to be of the same nature, i.e. when the space is homogeneous. However this does not hold for protein space: some subfamilies in a protein family may be very tight, i.e. the distances between their members are very small, while other subfamilies in the same family can be very broad, with relatively high distances between their members.
The approach described here is novel not only in the way it identifies clusters but also in the way it addresses classification quality. Assessment of the quality of inferred classification, given a reference classification, was previously considered in Gracy and Argos (1998) and Yona et al. (1999). In these works every reference category C was assigned to an inferred cluster that maximizes the number of sequences from C in the cluster (tp, true positives), minus the sum of errors: the number of sequences from the cluster that do not belong to C (fp, false positives) and the number of sequences that belong to C, but do not fall into the cluster (fn, false negatives). A quality index for category C contained three values: the ratio of tp, fp and fn to the sum of those values. To get a quality index of inferred classification these ratios were averaged over all categories. Using these definitions, it is important to note that a false positive for one category is also a false negative for another category (or categories). Thus the indices for different categories are not independent and their average, i.e. the overall quality measure, can be difficult to interpret.
We deal with the case when all sequences to be classified are equally important as they all belong to one protein family and need to be classified into subfamilies. In this case simply taking the total number of true positives for all categories would yield a relevant measure of classification quality. Unlike the approach above, we assign a category C to a cluster that maximizes the number of true positives, if the proportions fp/(tp+fp) and fn/(tp+fn) do not exceed some predefined level, i.e. the level of allowed contamination. In order for the results of this assignment to be well-behaved, it is clear that the allowed impurity should be
0.5. Indeed, it is natural to assume that the number of contaminating sequences (fp) in a cluster corresponding to some subfamily should not exceed the number of true sequences in the cluster (tp), and that the subfamily has to be represented in the cluster by at least half of its sequences. In the Methods section, we prove the somewhat counterintuitive fact that the minimal classification error is achieved with the maximal allowed contamination equal to 0.5.
To illustrate our approach in the Implementation section, we compare eight trees (two phylogenetic and six produced by hierarchical clustering with different similarity measures) built for several protein families, for which curated functional annotation and multiple sequence alignments are publicly available. We refer to the trees built using different sequence similarity measures as protein dendrograms, as they are strictly speaking not phylogenetic trees, since they do not utilize the notion of evolutionary distance between sequences. While evolutionary relationships are highly correlated with the measures of functional similarity they are not the same, since they compare sequences from different perspectives. Ideally, functional similarity should be derived from functionally constrained and the most informative columns of a multiple sequence alignment (MSA) (Hannenhalli and Russel, 2000), while the evolutionary distance should be inferred from both constrained and unconstrained, i.e. variable, positions. Likewise, aligned columns with gaps are usually disregarded in phylogenetic analysis and considered as censored data. However, in classification, gaps can be used to group sequences of common function, e.g. when indels have functional consequences. In Figure 1 we plot pairwise scores derived from an MSA of secretin-like GPCRs using the BLOSUM62 scoring matrix versus evolutionary distances. It is clear that the relationship between pairwise similarity scores and evolutionary distances is monotonic with little noise, but is not linear. This non-linearity, along with noise, can result in a different tree topology using sequence similarity rather than evolutionary distance.
|
| METHODS |
|---|
|
|
|---|
Accuracy of TBC
While it is not easy to derive a single measure that captures the accuracy of phylogenetic trees, i.e. the accuracy of reconstructing actual evolutionary events, it is possible to measure a classification accuracy of a protein tree, i.e. the ability of the tree to place sequences of the same category in one subtree. We measure this classification accuracy in terms of the number, or percentage of misclassified sequences according to the procedure outlined below.
Assume we have a set of sequences, where each sequence has been classified by experts into one and only one category, the reference category. Assume further that we have a rooted tree T with the sequences at the leaves. By a TBC (monophyletic), we call a procedure in which we identify certain non-overlapping subtrees and assign the same category to all leaves in a given subtree. The assigned, or tree-based, category is chosen from the set of all possible reference categories. The goal is to find an optimal TBC that gives the greatest number of sequences whose assigned category coincides with the reference one.
We call a subtree perfect for a category if it includes all sequences from this curated category and only those sequences. It is clear that if for every category there exists a perfect subtree, then there exists a perfect TBC and we say that the tree has classification error E(T) = 0. However, in practice it is often impossible to identify a perfect subtree for every category and one has to deal with the issue of contamination: there is a subtree that represents a category reasonably well but it is missing a few sequences and/or it contains a few extra sequences from other categories. We call such a subtree good for the category and set certain natural restrictions on the level of its impurity. In addition there can occasionally be a category, whose sequences are so dispersed on the tree that it is impossible to find a single subtree that represents it.
In more detail, let t be a subtree of a tree T and G be a category. We denote by |t| the total number of sequences that t spans, by |t(G)| the number of sequences of category G that t spans and by |G| the total number of sequences that belong to category G. If we decide that a subtree t represents a category G, we assign category G to all sequences in t. Thus, |t(G)| would be the number of correctly classified sequences or tp. The number of missing sequences in the subtree, or fn = |G| |t(G)|, while the number of contaminating sequences, or fp = |t| |t(G)|. For a category G to be represented by a subtree t certain natural requirements on the subtree composition should be satisfied:
DEFINITION 1
A subtree t is (, ß)-good for a category G if
where
![]()
1/2, ß
1/2.
The parameters
and ß control the maximum allowed portion of fp and fn predictions, respectively for every category. In the most permissive case, when
= ß =
, a good subtree for a category is still represented by more than half of the sequences from this category and those sequences constitute the majority in the subtree. It follows from the first inequality that a subtree t can be good for at most one category. Therefore any good subtree can be characterized by a set of three numbers (tp(t), fp(t), fn(t)), which will be referred to as the error composition of t.
The overall number of correctly classified sequences is the sum of tp over all good subtrees and thus the remaining sequences are of two types:
- Incorrectly classified: Sequences of category G that belong to a good subtree for category G.
- Unclassified: There is no good subtree that spans any of the sequences. We call those sequences misclassified and measure the quality of the optimal TBC in terms of the number (or percentage) E(T) of misclassified sequences.
- Unclassified: There is no good subtree that spans any of the sequences. We call those sequences misclassified and measure the quality of the optimal TBC in terms of the number (or percentage) E(T) of misclassified sequences.
DEFINITION 2
A (, ß) TBC (monophyletic),
![]()
![]()
, ß
![]()
, for a rooted tree T is a set of non-overlapping (
, ß)-good subtrees {tG1, tG2,...} for some categories G1, G2,... respectively. A (
, ß) TBC is optimal if the total number of tp |tG1 (G1)| + |tG2 (G2)| +
is maximal, or the classification error E(T) = |T| |tG1 (G1)| |tG2 (G2)|
is minimal.
An example of the optimal TBC is given in Figure 2. We illustrate our definitions on two hypothetical trees shown in Figure 2. In this figure we use lower case letters to denote expert assigned reference categories, we use apostrophes to denote paralogs within a reference category and upper case letters to denote different species. The tree on the left is perfect since it can be decomposed into subtrees, each containing all sequences from the same category. The tree on the right has an optimal TBC with three good subtrees. Subtree 1 is good for category a, having two false positives: b_X and b_Y. Subtree 2 is good for category d with one false positive from c and subtree 3 is good for category c with two false negatives. Thus there are a total of four misclassified sequences (in a monophyletic approach). It is important to note that the nesting of the good subtree 4 in the good subtree 1 may not be an artifact of poor tree-building performance, but rather a result of real evolutionary events. Indeed, the paralogortholog labeling in this example suggests that first there were two gene duplications that gave rise to three paralogs a, a', a'', followed by functional divergence of a'' into b and then followed by speciation into X and Y. In this case, the nested sequences from category b labeled as subtree 4, should be considered as true positives and not false positives in subtree 1 as in a purely monophyletic classification. To allow for nesting of good subtrees into other subtrees, which we call spanning subtrees, we consider paraphyletic TBC, described in detail in Appendix B. According to a paraphyletic TBC, the two sequences in subtree 4 are treated as true positives and thus the corresponding classification error Ep = 2.
|
We show that under the above natural restrictions on the composition of good subtrees, the optimal TBC can be found by a greedy bottom-to-top traversal algorithm that at every step maximizes the number of correctly classified sequences. Observe that the first inequality and the requirement
in Definition 1 of a good subtree ensures that a subtree t can be good for no more than one category. It follows also from the second inequality in Definition 1, together with ß
, that there cannot be two or more disjoint subtrees that are good for the same category. These properties imply that the optimal TBC can be found independently for disjoint subtrees and, therefore, the following lemma holds.
LEMMA 1
Let t be a bifurcating subtree, with tl and tr being its left and right splits, respectively. The (, ß) optimal TBC for t is
- t itself, if t is (
, ß)-good for some category G and |t| |t(G)| < E(tl) + E(tr)
- the union of optimal TBCs for tl and tr, otherwise.
Thus, if t is (
,ß)-good E(t) = min {|t| |t(G)|, E(tl) + E(tr)}, and E(t) = E(tl) + E(tr) otherwise.
Applying Lemma 1 recursively, from bottom to top, one can get the optimal TBC E(t) for the whole tree T. The quality of the optimal TBC depends on parameters (
, ß) and in the following lemma we show a counterintuitive fact that the highest quality (or the smallest error) can be achieved in the most permissive situation when
and ß are maximal.
LEMMA 2
The error of (, ß) optimal TBC achieves its minimum, when
= ß =
.
PROOF
Let us index the error of (,
)-classification by E0. We show by induction that E0 (t)
E(t) for any subtree t and
![]()
![]()
, ß
![]()
. This inequality holds trivially for the leaves. Let us prove it now for a subtree t if it holds for its left and right splits: E0(tl)
E(tl) and E0(tr)
E(tr). If the subtree t is (
,
)-good for some category, it can be either (
,ß)-good for the same category or not (
, ß)-good for any category; and from Lemma 1 we get E0 (t) = min {|t| |t(G)|, E0 (tl) + E0 (tr)}
min {|t| |t(G)|, E(tl) + E(tr)}
E(t). If the subtree t is not (
,
)-good for any category, it is not (
, ß)-good for any category and again from Lemma 1 E0 (t) = E0 (tl) + E0 (tr)
E(tl) + E(tr) = E(t).
The above lemma allows to introduce the TBC accuracy that corresponds to (
,
)-optimal classification and that does not depend on (
, ß).
We now describe the tree-building methods and input reference classification sets that we analyze in the following sections.
| IMPLEMENTATION |
|---|
|
|
|---|
Tree-building methods
In the Results section, we use E(T), our measure of TBC accuracy, to compare trees built by eight different methods. These methods are of different types. Neighbor-Joining (NJ) (Saitou and Nei, 1987) and Unweighted Pair Group Method Average (UPGMA) (Sokal and Michener, 1958) use phylogenetic distances and a MSA as a starting input. TIPS, BETE, MSA_score and MSA_norm_score all use MSAs to calculate sequence similarity-based (rather than phylogenetic) distances. SW_score and SW_identity use pairwise alignments (rather than an MSA) to generate sequence similarity-based distances. The last four algorithms in our list build dendrograms using UPGMA with distance measures equal to the negative similarity plus a large enough constant term to preserve positive distances. Note that the tree topology, which is the focus of this paper, does not depend on the added constant. We consider six protein dendrograms built using different similarity measures derived from MSA or SmithWaterman alignments. Though SmithWaterman alignment is slow and in practice is not feasible in large scale experiments, we still include SW-based similarity measures in the comparison for a few reasons. First, there are many protein clustering approaches that start with unaligned sequences and use BLAST score as a similarity measure to approximate SW score. Second, we wanted to test to what extent the classification ability of a tree becomes weaker when one builds a tree from a pairwise alignment. Since in this work we are interested only in the classification ability of a tree it is still possible that trees built using pairwise alignments produce reasonable classification.
NJ and UPGMA phylogenetic algorithms
Both NJ and UPGMA, from the Phylip 3.6 Package, are progressive clustering algorithms based on a distance matrix (Felsenstein, 2002). The pairwise distances in the matrix are evolutionary times between corresponding pairs of proteins, estimated via the ML approach using the PAM evolutionary model. In the Phylip implementation, first a distance matrix is estimated, using a program PROTDIST and then a tree is constructed using either NJ or UPGMA (for more detail, see Li, 1997). Our approach applies only to rooted trees and NJ trees are unrooted. To outgroup a NJ tree we modified the matrix of pairwise distances produced by PROTDIST, by adding one more sequence, an outroot, having an arbitrary, fixed large distance to all sequences in the original set.
BETE algorithm
BETE is an agglomerative clustering algorithm. As a distance measure it uses symmetrized total relative entropy between profiles created around clusters (Sjolander, 1998). Since BETE uses profileprofile distances it takes advantage of functionally important positions. BETE was shown to perform very well in deriving a tree topology for known SH2 domains that correlates better with SH2 domain binding specificity than standard tree-building methods. However, when we tested BETE on a larger scale, we found that for certain families the trees often had problematic topologies for functional classification: sequences similar in both sequence and function appeared in different clades, or subtrees. This led us to design TIPS algorithm, which also utilizes the notion of profile but with the similarity measure based on HMM scoring.
TIPS algorithm
TIPS (Diemer et al., 2001 http://ismb01.cbs.dtu.dk/poster_abstracts.html), described in detail in Appendix A, is an agglomerative clustering algorithm that bases the similarity measure between two clusters on the scoring of sequences from one cluster against the profile of another cluster. The similarity measure used in TIPS is an intermediate case between two extremes: measuring distances between profiles (BETE) and averaging pairwise sequence similarities (e.g. guide trees in CLUSTALW). The former utilizes the notion of conserved and therefore important columns of alignment, while the latter captures the sequence content in the clusters. Our approach uses both types of information and takes advantage of the fact that profiles and HMMs are designed for scoring amino acids against them, and thus they are sensitive in recognizing close sequences.
MSA_score and MSA_norm_score
These are UPGMA algorithms (Phylip 3.6 implementation) with similarity measures based on pairwise BLOSUM62 score in a multiple alignment with gap-open and gap-extension penalties = 12 and 2, respectively. The similarity measure in MSA_score is the pairwise score itself (note that it can be negative), while the similarity measure in MSA_norm_score is the pairwise score normalized by the length of the overlap. We introduce the normalization to allow an adjustment for protein fragments, or partial alignments, in an MSA.
SW_score and SW_identity
These are UPGMA algorithms (Phylip 3.6 implementation) with similarity measures based on SmithWaterman scores using the BLOSUM62 scoring matrix with gap-open and gap-extension penalties = 12 and 2, respectively. The similarity measure in SW_score is the SmithWaterman score itself and in SW_identity it is just the pairwise identity.
Expert-curated classifications for protein families
We identified protein family resources available on the web that are representative, have expert-curated classifications (into subfamilies or other categories) and have curated MSAs. The existence of curated multiple alignments was crucial for our choice since behavior of tree-building algorithms strongly depend on the quality of alignment and a poor alignment can bias the algorithm performance. Most families we have chosen have more than one level of classification. For example, the Protein Kinase family has two levels of classification, e.g. a sequence can be classified as CaMK on the first level and as CaMK Group I on the second level. Though we evaluated the TBC error independently for every level, one should keep in mind that they are strongly dependent. The total number of non-singleton categories and singletons for each level can be found in Table 2.
|
GPCR
GPCRDB (http://www.gpcr.org/7tm/) CLASS A (rhodopsin-like) has 1047 sequences and four levels of classification. CLASS B (secretin-like) has 163 sequences and two levels of classification.
Nuclear Receptors
NuclearDB (http://receptors.ucsf.edu/NR/) has 428 sequences and three levels of classification. There were a few sequences whose evolutionary distance to some other proteins in the set was infinite according to PROTDIST (Phylip 3.6). Therefore, we excluded those sequences for the phylogenetic analysis and built NJ and UPGMA trees on a set of only 411 sequences.
Protein kinases
The Hanks Classification (http://pkr.sdsc.edu/html/pk_classification/pk_catalytic/pk_hanks_class.html) has 390 sequences and two levels of classification.
CYS-Loop Receptors
(http://www.pasteur.fr/recherche/banques/LGIC/cys-loop.html) The anionic group has 98 sequences and three classification levels. The cationic group has 102 sequences and three classification levels. This site provides two types of classification: one based on sequence nomenclature and the other inferred from a phylogenetic analysis (NJ trees and Parsimony). To prevent our analysis from becoming circular, we derived a three-level classification from the sequence nomenclature, e.g. the sequence GABa3 was classified into GABA receptors on the first level, GABA receptor alpha subunit on the second level and GABA receptor alpha 3 subunit on the third level. Certain sequences, such as GABf01c12_1 on the second and third levels were classified as singletons, i.e. the only representative of the category. This nomenclature-based classification might cause errors if the nomenclature is not quite accurate, which proved to be the case for the cationic receptor family. The TBC error was high in every algorithm for this family. Indeed the TBC manually assigned by experts on the corresponding website is quite different from the nomenclature-based classification adopted in our analysis.
| RESULTS |
|---|
|
|
|---|
The comparison of different algorithms in terms of the optimal monophyletic TBC error E(T) is summarized in Table 1, in terms of the number and percentage of misclassified sequences for every family and algorithm under consideration. All the underlying trees obtained in our experiments, together with the optimal TBC for every classification level, can be explored at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp using an interactive TreeViewer. This allows one to view a tree topology, compare the reference classification of sequences with the assigned TBC and see immediately the source of classification errors.
|
For four of the six families we consider (GPCR group B, cationic and anionic Cys-Loop receptors, and protein kinases), all of the MSA-based methods (i.e. excluding SW alignment-based distances) perform comparably. Not only is the error rate similar, but the set of sequences that are misclassified is nearly identical for all methods suggesting modifications to the existing nomenclature in a few cases. For these families, there is little ambiguity about the phylogenetic relationships between sequences. Only the SmithWaterman based methods, especially SW_ident, perform sometimes relatively poorly. However the GPCR class A family and the nuclear receptor family show significantly different results for the different methods. When analyzing the table one should have in mind that tree quality in MSA-based algorithms depends on the quality of the MSA. In the case of nuclear receptors, e.g. closely related sequences are well aligned relative to each other, while more distantly related groups are aligned relatively poorly, which explains the relatively high error rates in the first (coarsest) level of classification. This also explains why the TBCs are accurate in the finer levels, especially for the TIPS method.
SW-based distances do not depend on the quality of multiple alignment and combined with score-based similarity produce TBCs that are generally comparable to the MSA-based methods, with only one major exception (Cys-Loop anionic receptors, Levels 2 and 3). However, when the multiple alignment is reliable, the MSA-based methods slightly outperform the SW_score distances for these examples. For a number of families (GPCR class A, nuclear receptors, protein kinases) SW_identity trees have a significantly higher classification error than the trees based on other distance measures. This is not surprising, as percentage identity is a simple but much less informative measure than BLOSUM scores that capture degrees of similarity other than binary (identical versus not identical). We included identity measure primarily to show that more sophisticated distance measures have an appreciable advantage in the families we considered here. BETE, one of several current algorithms based on profileprofile distances, performed poorly on our dataset. Our tests with other possible settings of profileprofile distances, made us believe that there is an inherent limitation of algorithms in this class: profiles are designed for scoring sequences against them and apparently should be used as such. TIPS is one possible implementation of agglomerative clustering that utilizes sequence-profile distances with acceptable classification accuracy.
In the class of MSA-based methods, score-based algorithms not only outperform phylogenetic algorithms in performance time, but often suggest comparable or even more accurate protein classifications. On this small set of examples, TIPS appears to perform the best on average, particularly for finer levels of classification.
When all tree-building methods make the same error, the actual error may be in the reference classification. For example, in the case of anionic receptor, Level 3 one sequence GABa3 (Heliothis virescens) is placed by all methods in an RDL-like subtree. This may be an artifact of inaccurate or outdated nomenclature assignment of the sequence. The case where one tree method misclassifies sequences, while other methods do not, suggests that the first tree may not be reliable. For example, the NJ algorithm also misclassifies two other anionic receptor sequences: one in GABa6 category and one in GLYa1 category (they are missing in the otherwise complete subtrees for those categories). It is interesting to note that the phylogenetic analysis on the Cys-Loop website (http://www.pasteur.fr/recherche/banques/LGIC/cys-loop.html), was performed with bootstrapping separately for smaller alignments of Glycine receptor subunits (15 sequences) and GABA receptor alphagamma subunits (37 sequences) and the two missing sequences were indeed placed in the correct subtrees.
A more detailed view on the source of errors, in the case of monophyletic classification, is given in Table 2. Because of the space limitation we consider NJ, UPGMA and TIPS trees, the only trees in our set that are used in practice: NJ and UPGMA as part of the Phylip package and TIPS as the algorithm behind the PANTHER protein classification (Thomas et al., 2003). As has been discussed above, every good subtree for a category is characterized by an error composition: tp, fp and fn. If the error compositions of good subtrees for some category did not agree for the NJ, UPGMA and TIPS trees, we listed this category in Table 2 along with the corresponding error compositions. For the rest of the good subtrees, for which the error compositions agreed, we gave the overall number of true positives and false positives (the total number of false negatives is hard to interpret since some of them appear among false positives and some among nonclassified). In our approach the total number of true positives, which is highlighted in the Table 2, reflects the classification quality. The remainder are misclassified sequences contributing to classification error E(T).
Table 2 reveals some informative details underlying the summary results from Table 1. For example, the TIPS tree has a significantly lower E(T) than either NJ or UPGMA for the Level 1 subfamily clusters of the GPCR_A family. This is primarily because of the near perfect monophyletic representation of the peptide subfamily in the TIPS tree as opposed to the NJ and UPGMA trees. As is evident from Table 2, this subfamily is nearly complete (only 33 missing, or fn) with 92 apparent fp. The optimal peptide subfamily on the NJ tree, in contrast, has 63 fp and 162 fn. Further, of the 92 fp in the TIPS tree, many belong to either the thyrotropin, gonadotropin-releasing, or melatonin subfamilies, which are all actually peptide ligands but were broken out separately from the other peptide hormone receptors by the curators. If the thyrotropin, gonadotropin and melatonin subfamilies had been Level 2 classes instead of Level 1, each could be a monophyletic sub-subfamily nested inside the Peptide subfamily. Most of the remaining peptide fp in the TIPS tree are singleton subfamilies, i.e. single sequences that were not classified into any of the larger groups. These fp singletons can be interpreted as predictions, i.e. the TIPS tree predicts that 35 (60 singletons in reference classifications minus 25 that appear as singleton subfamilies on the tree) of these are likely to be orphan GPCRs with peptide ligands, as they fall clearly within the optimal monophyletic peptide subfamily.
| DISCUSSION |
|---|
|
|
|---|
We have proposed an approach for using a gold standard set of protein classifications to guide the division of the tree into groups (subfamilies) that optimally match these classifications. In the case where the protein classifications are taken to be correct, the disagreement between the optimal subfamilies and the classifications can be thought of as a measure of the error in a tree representation of protein relationships (either phylogenetic or sequence-similarity based). We showed how this metric can be used to compare trees in their ability to classify proteins into functionally related groups. We have shown that the metric can be applied to make a decision which hierarchical approach (or which parameter settings within one algorithm) to choose for classification purposes. The E(T) metric demonstrates why the TIPS algorithm was chosen over the BETE algorithm for the building of the PANTHER protein classification (Thomas et al., 2003).
We have also shown that having an optimal definition of a monophyletic group, in which most members share a common function, may suggest the function for any unannotated sequences in that group. Our method requires a known reference classification for the majority of sequences, so it is not appropriate for fully automated classification. However, given the rapid advance in methods for assigning controlled vocabulary terms to proteins on a large scale (Ashburner et al., 2000; Mi et al., 2003; Camon et al., 2004), methods that define optimal tree pruning into subfamilies (our definition of optimality is but one possibility) could be combined with protein family tree modeling to identify functional divergence events without the intervention of a human curator as in Thomas et al. (2003). Further, different specificity levels of an ontology will result in different optimal partitioning into subfamilies (or nested subfamilies) as observed here (e.g. in the GPCR_A example above, peptide receptor versus melatonin receptor).
Using our measure, E(T), we independently confirm the conclusion of Feng and Doolittle (1987) that in general trees built using multiple sequence alignments are more accurate than those using pairwise similarity-based distances. In the class of phylogenetic algorithms, UPGMA, which is considered inferior in overall tree reconstruction, slightly outperformed NJ in its classification ability. It is interesting to note that UPGMA was also preferred over NJ in the design of MUSCLE (Edgar, 2004), one of the most accurate up-to-date multiple alignment tools. Our preliminary analysis performed on available curated public datasets suggests that protein dendrograms and in particular UPGMA trees built with simple protein similarity measures, often provide relatively low classification errors and are much faster to build than phylogenetic trees. In our experiments we were limited to the choice of protein families for which not only the category assignments, but also the MSA, were curated. As more curated protein families become available over time, we expect our measure to be of increasing utility and to be able to test our preliminary conclusions on a larger, more representative set of protein families.
Our approach is novel in the way it identifies subfamilies and in the way it quantifies classification error. We illustrate this in a particular example of a UPGMA tree for the secretin-like GPCR family (http://www.gpcr.org/7tm). An interactive version of the tree can be accessed at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp. There were 163 sequences that fell into 14 non-singleton subfamilies. We identified 13 non-overlapping good subtrees, or nodes (Node_1, Node_2,...,Node_13), that best correspond to the subfamilies. Figure 3A shows tree collapsed at these nodes (collapsed nodes do not show descending trees). There were 9 false negatives that did not fall into any of the good subtrees and showed up as singletons (Node_14,...,Node_22). There is no node on the tree that corresponds to gastric inhibitory peptide (GIPR) subfamily (three sequences), since all its sequences are nested within Node_6 corresponding to the glucagon subfamily (Fig. 3B). Due to this nesting the number of false positives for the glucogon subfamily is 3, as listed in the corresponding table column. Thus the overall number of misclassified sequences in this example is 12 (9 false negatives and 3 false positives). Further, we explain why a standard snapshot clustering fails on this example. For each node we provide two distances: the distance from the bottom of the tree to the node itself and to its parent. Those distances are also plotted in Figure 4. In a standard approach with a distance threshold D to identify all the subfamilies, one needs to choose D such that it falls between the two distances for every node, which is clearly impossible. In Figure 5 we give the number of incorrectly classified sequences, i.e. the total number of sequences minus correctly classified ones and the number of identified clusters for various thresholds D. The upper line corresponds to the case where every subfamily is mapped to a single best cluster, which relates to the Qsingle quality index in Yona et al. (1999). The lower line corresponds to the case where every cluster is mapped to its best subfamily, similar to the Qset index defined in Yona et al. (1999). In this case, there can be a set of arbitrarily small clusters for a particular subfamily (i.e. one subfamily can map to multiple subtrees). So, of course, in the case where there are as many clusters as sequences there are no errors. For comparison, the black square shows the results of our optimal (i.e. smallest number of errors) cut of the tree. Using this method each subcluster is defined using the curated subfamily assignments, so different subclusters can span different evolutionary distances. This minimizes the number of clusters while minimizing the error. In our method, a given subfamily can map to at most one subtree (or zero if the sequences in a curated subfamily are dispersed in the tree or nested in other subtrees, as is the case of the gastric inhibitory peptide subfamily, in this example), so it is most comparable to the upper curve Qsingle. The standard uniform cut of the tree has roughly four times the error rate for the same number of clusters.
|
|
|
The case of the GIPR subfamily also gives an example of where our monophyletic approach outlined above is insufficient. A paraphyletic classification is sometimes necessary to properly represent functional classes. Indeed, since in this particular case for most of the sequences the paralog/species assignment is known, it is clear from the analysis of Node 6 in Figure 3B, that during evolution there were originally four paralogs: GLP, GLP1, GLP2 and the fourth one, which evolved a new function giving rise to subfamily GIPR. Only afterwards speciation events occurred, which resulted in the nesting of the GIPR subfamily in the GLP subfamily. The monophyletic classification treats all GIPR sequences as fp, while their nesting is clearly justified and should not be penalized. This example calls for different treatment of nested good subtrees and in Appendix B we introduce paraphyletic TBC, which does not penalize nesting of good subtrees for some categories in the spanning subtrees for other categories. We emphasize here that without exact knowledge of the paralog species assignment it is impossible to say which classification should be applied: monophyletic or paraphyletic. However, monophyletic and paraphyletic classifications can be viewed as two extremes: the former being most restrictive and the latter being most permissive. Since real examples of paraphyletic protein subfamilies are more rare than monophyletic subfamilies and errors in tree-building can often appear as apparent paraphyletic groups, we focused on monophyletic classification to judge the quality of tree-building algorithms. The case of paraphyletic groups is treated in Appendix B.
APPENDIX A
TIPS algorithm
TIPS builds trees starting with a multiple alignment of sequences. It is an agglomerative clustering algorithm whose similarity measure is based on HMM scoring. In an agglomerative clustering procedure each sequence initially forms its own cluster. The two most similar clusters are joined iteratively until all clusters have been joined together. In TIPS the similarity between two clusters of sequences K and M is derived from an average normalized score of sequences in one cluster against the profiles of the other:
![]() |
(s,C) (either
(sk,M), or
(sm, K) above), a normalized score of a sequence s to a profile of cluster C, is defined as follows. Let i be any column in the MSA of length L and Ci be its subcolumn that includes only sequences from C. If Ci does not consist of all gaps, we define for it a profile with gaps (Gribskov et al., 1987). We define
, where the first 20 components of the vector form a Dirichlet mixture profile with nine components for amino acids (a1,a2,...,a20) (Sjolander et al., 1996) and p(gap_open) and p(gap_ext) are probabilities of gap-open and gap-extension, respectively. For additive scoring against this profile we normalize the first 20 components by a NULL model p0=(p0(a1),p0(a2),...,p0(a20)) and take the logarithm, so that the corresponding scoring vector becomes
![]() |
If Ci consists of only gaps there can be three possible cases. In the first case, all gaps are terminal, then effectively there should be no scoring against profile, or q(Ci) = (0,0,...,0). In the second case, there is at least one gap open among gaps, then we set q(Ci) = (q(gap_open),...,q(gap_open),0,0) to penalize any amino acid but not gaps. In the third case, there are no gaps open and at least one gap extension, then we set q(Ci) = (q(gap_ext),...,q(gap_ext),0,0).
We define
(s,C), a normalized score of a sequence s = (s1,s2,...,sL) to a profile of cluster C, as a score of the sequence against the extended profile, normalized by the overall number of scored positions (including scored gaps)
![]() |
In its simplest form, deriving the amino acid probability distribution for each position in a profile assumes independence of observed amino acids in the corresponding column. In practice, however, sequences are strongly related to each other either via evolution or just due to database redundancy (there can appear many copies of the same sequence in the alignment). To prevent sequence relatedness in an alignment to bias the resulting profile a heuristic sequence weighting can be applied: an amino acid that comes from i-th sequence is counted wi times (instead of 1), where wi is the weight, or independent count, of the i-th sequence. Sequence weighting heuristics are used to estimate the total number of independent counts (this can be much smaller than the actual number of sequences) and then the relative weights of every sequence, in our application derived according to Hennikoff and Hennikoff (1994). It should be noted that the choice of a NULL model and the details of sequence weighting were important for the performance of TIPS. We obtained the optimal results when the NULL model was a geometric average of profiles in all alignment positions as in SAM 3.2 package (Barrett et al., 1997) and when we fixed the total number of independent counts equal to one. Though limiting the number of independent counts seems to violate the principles of Bayesian calculations by down-weighting observations relative to priors, we argue that this is a simple way to overcome problems with scoring against close competitive models. Indeed, assume for simplicity, that in the course of agglomeration a sequence has to be merged with one of two clusters H or L, which have high and low number of independent counts, respectively. In H conserved profiles are more peaked and thus every amino acid match is more rewarded. Therefore, a sequence that has higher homology to the consensus of L might still score higher to H since all matches in H get higher scores. As a result, the sequence might be incorrectly placed in H though it is more homologous to L. To summarize, without limiting the number of independent counts larger groups would have an unfair advantage in attracting new sequences.
APPENDIX B
Paraphyletic TBC
In the main paper we were interested in disjoint subtrees that optimally comprise sequences from the same curated categories. In such an approach we assume that sequences from the same category form a clade, or monophyletic group, i.e. all descendents of their most recent common ancestor (MRCA) belong to the same category. However, monophyletic classification does not allow for all evolutionary events. For example, suppose that there were two successive gene duplications giving rise to three paralogs: a, a', a''. It was possible that one of the paralogs a'', being freed from functional constrains, developed different functions and formed subfamily b. Assume that afterwards a speciation event occurred, giving rise to two species X and Y. It is clear that the two orthologs b_X and b_Y will be nested in a subtree that comprises two pairs of othologs from subfamily a: a_X, a_Y and a'_X, a'_Y. This hypothetical example is illustrated in the Methods section in Figure 2 and a similar realistic example is given in Discussion section in Figure 3B. In monophyletic classification the two nested sequences from b will be considered as false positives for subfamily a, which would be wrong. To allow for possible nesting of subtrees defined for different categories we introduce paraphyletic TBC, which treats sequences from nested subtrees as true positives allowing more flexible tree pruning. As seen above, the correct accounting of true positives, when nesting occurs, relies heavily on the paralog/species information, which our approach does not assume to have access to. Thus, while monophyletic classification might turn out to be too restrictive the paraphyletic classification might be too permissive, by assuming that all nesting of subfamilies are evolutionary valid. Monophyletic protein subfamilies are more common than paraphyletic ones and in many cases the nesting occurs as an artifact of having wrong tree topology.
To treat the case of paraphyletic classification we introduce the following bookkeeping of true positives in the case of nested subtrees. Suppose that in a subtree t we select certain nested good subtrees for some categories (may be none). We denote by
a subtree in t complementary to the selected good subtrees. The spanning subtree t can be assigned to some category G iff |
(G)| >
· |t| and |
(G)| >
· |G| (since for monophyletic classification we showed that the optimum is achieved when
= ß =
, we assume the same values for the paraphyletic classification). Note that a good subtree for G is a particular case of spanning subtree for G, if we choose to select no good subtrees nested in it. The total number of true positives in the spanning subtree t would be the sum of the G sequences in
plus the sum of all true positives from the selected nested good subtrees. Finally, the overall quality of classification would be the sum of true positives over all disjoint spanning subtrees. However, this natural definition does not allow simple calculation of the optimal classification recursively from bottom to top. The tree in Figure A1 illustrates the complexity of this general setting.
|
The problem of finding the optimal paraphyletic classification can be simplified and solved recursively, once we allow nesting of only those good subtrees that have no false positives. We will refer to such good subtrees as clean.
DEFINITION A1
Let t be a subtree and {t1,...,tk} be a selected set of disjoint clean subtrees for some categories that are nested in t. We say that a subtree t is spanning for category G, given {t1,...,tk}, iffIn paraphyletic classification the number of tp for the spanning subtree t given (t1,...,tk) is tp (t|t1,...,tk) = |t(G)| + |t1 |+
+ |tk|.
The first inequality ensures that the number of G sequences in the remainder of the subtree (since |
(G)| = |t(G)|) spans over half of the total number in the given category and the second inequality is the same as in Definition 1. In the last formula for the number of tp tp(t|t1,...,tk) we omit explicit dependence on G, since G is unique for the spanning subtree t given (t1,...,tk). Observe that, if no nested clean subtrees are selected, the spanning subtree for G is just a good subtree for G.
DEFINITION A2
A paraphyletic classification is a set of disjoint subtrees, such that
is a spanning subtree for some category given nested clean subtrees
.
The quality of paraphyletic classification is the sum of tp over all spanning subtrees,
.
A paraphyletic classification is optimal, if its quality is maximal, or the error
is minimal over all choices of spanning and nested clean subtrees.
It follows from the above definitions that every optimal monophyletic TBC is also a special case of a paraphyletic classification, but it is not necessarily optimal for the paraphyletic case. Therefore Ep (T)
E(T). It can be shown that the optimal paraphyletic classification defined above can be found recursively from top to bottom similarly to Lemma 1. The requirement of only clean nested subtrees simplifies the general setting in two ways: for every subtree t it is straightforward to identify all clean subtrees nested in it and it is straightforward then to optimally select nested clean subtrees and a category for which t would be spanning with a maximal number of true positives.
| Acknowledgments |
|---|
We thank Tom Hatton for his involvement in the early design of TIPS algorithm and for many useful discussions. We would like to thank Michael Campbell for validating whether our objective measure correlates with a subjective assessment of tree quality. We are also very thankful to Anushya Muruganujan for making it possible for us to analyze the trees with TreeViewer and for making the trees involved in our experiments available online. Finally, we thank an anonymous reviewer of this paper for helpful suggestions on the manuscript.
Received on January 15, 2004; revised on December 2, 2004; accepted on December 17, 2004
| REFERENCES |
|---|
|
|
|---|
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000) Gene ontology: tool for the unification of biology. Gene Ontology Consortium. Nat. Genet., 25, 2529[CrossRef][Web of Science][Medline].
Barrett, C., Hughey, R., Karplus, K. (1997) Scoring hidden Markov models. CABIOS, 13, 191199.
Camon, E., Barrell, D., Lee, V., Dimmer, E., Apweiler, R. (2004) The Gene Ontology Annotation (GOA) Databasean integrated resource of GO annotations to the UniProt Knowledgebase. Silico Biol., 4, 56.
Diemer, K., Hatton, T., Thomas, P. (2001) Using profile scores to determine a tree representation of protein relationships. Proceedings of the 9th International Conference on Intelligent Systems for Molecular BiologyJuly 2125Poster 118, Copenhagen, Denmark (ISMB 2001).
Edgar, R. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, , pp. 17921797
Eisen, J.A. (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res., 8, 163167
Felsenstein, J. (2002) PHYLIP: Phylogeny Inference Package, Version 3.6. , Seattle, WA University of Washington.
Feng, D.-F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351360[Web of Science][Medline].
Gracy, J. and Argos, P. (1998) Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment. Bioinformatics,, 14, 164173
Gribskov, M., McLachlan, A.D., Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 43554358
Hannenhalli, S.S. and Russel, R.B. (2000) Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol., 303, 6176[CrossRef][Web of Science][Medline].
Heger, A. and Holm, L. (2001) Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272279
Hennikoff, S. and Hennikoff, J. (1994) Position based sequence weights. J. Mol. Biol., 243, 574578[CrossRef][Web of Science][Medline].
Krause, A., Stoye, J., Vingron, M. (2000) The SYSTERS protein cluster set. Nucleic Acids Res., 28, 270272
Li, W. Molecular Evolution, (1997) , Sunderland, MA Sinauer Associates, Inc.
Mi, H., Vandergriff, J., Campbell, M., Narechania, A., Majoros, W., Lewis, S., Thomas, P.D., Ashburner, M. (2003) Assessment of genome-wide protein function classification for Drosophila melanogaster. Genome Res., 12, 21182128.
Remm, M., Storm, C.E.V., Sonnhammer, E.L.L. (2001) Automatic clustering of orthologs and in paralogs from pairwise species comparisons. J. Mol. Biol., 314, 10411052[CrossRef][Web of Science][Medline].
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol, 4, 406425[Abstract].
Sasson, O., Linial, N., Linial, M. (2002) The metric space of proteinscomparative study of clustering algorithms. Bioinformatics, 18, S14S21[Abstract].
Sjolander, K. (1998) Phylogenetic inference in protein superfamilies: analysis of SH2 domains. In Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, 165174.
Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., Haussler, D. (1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comp. Appl. Biosci., 12, 327345.
Sokal, R.R. and Michener, C.D. (1958) A statistical method for evaluation systematic relationships. Univ. Kansas Sci. Bull., 28, 14091438.
Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., Koonin, E.V. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29, 2228
Thomas, P.D., Campbell, M.C., Kejariwal, A., Mi, H., Karlak, B., Daveran, R., Diemer, K., Muruganujan, A., Narechania, A. (2003) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res., 13, 21292141
Yona, G. and Linial.N. and Linial, M. (1999) ProtoMap: automatic classification of protein sequences, a hierarchy of protein families and local maps of protein space. Proteins, 37, 360378[CrossRef][Web of Science][Medline].
Zmasek, C.M. and Eddy, S.R. (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, 17, 821828
This article has been cited by other articles:
![]() |
Y. Loewenstein, E. Portugaly, M. Fromer, and M. Linial Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space Bioinformatics, July 1, 2008; 24(13): i41 - i49. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Mi, N. Guo, A. Kejariwal, and P. D. Thomas PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways Nucleic Acids Res., January 12, 2007; 35(suppl_1): D247 - D252. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


is maximal, or the classification error E(T) = |T| |tG1 (G1)| |tG2 (G2)| 








, such that
is a spanning subtree for some category given nested clean subtrees
.
.
is minimal over all choices of spanning and nested clean subtrees.
