Bioinformatics Advance Access originally published online on July 12, 2006
Bioinformatics 2006 22(17):2081-2086; doi:10.1093/bioinformatics/btl366
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
An initial strategy for comparing proteins at the domain architecture level
MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University Beijing 100875, China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Ideally, only proteins that exhibit highly similar domain architectures should be compared with one another as homologues or be classified into a single family. By combining three different indices, the Jaccard index, the Goodman-Kruskal γ function and the domain duplicate index, into a single similarity measure, we propose a method for comparing proteins based on their domain architectures.
Results: Evaluation of the method using the eukaryotic orthologous groups of proteins (KOGs) database indicated that it allows the automatic and efficient comparison of multiple-domain proteins, which are usually refractory to classic approaches based on sequence similarity measures. As a case study, the PDZ and LRR_1 domains are used to demonstrate how proteins containing promiscuous domains can be clearly compared using our method. For the convenience of users, a web server was set up where three different query interfaces were implemented to compare different domain architectures or proteins with domain(s), and to identify the relationships among domain architectures within a given KOG from the Clusters of Orthologous Groups of Proteins database.
Conclusion: The approach we propose is suitable for estimating the similarity of domain architectures of proteins, especially those of multidomain proteins.
Availability: http://cmb.bnu.edu.cn/pdart/
Contact: linkui{at}bnu.edu.cn
Supplementary Information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
The public release of an increasing number of genomes has led to a huge amount of protein sequence data, which requires increasing expertise to understand. The goal of functional genomics is to determine the function of the proteins predicted from the genes identified in these sequenced genomes. To this end, it is essential to construct a comprehensive evolutionary classification of these proteins, because members of the same protein family may have or perform identical biochemical functions (Hegyi and Gerstein, 1999). Such a classification scheme is based on homologous relationships between genes. Many methods are currently available for clustering proteins into families; see Apic et al. (2001), Copley et al. (2002a), Liu and Rost (2003) and Ouzounis et al. (2003) for reviews. Most of those approaches rely on sequence similarity measures, such as those obtained with BLAST (Altschul et al., 1997) or hidden Markov models (Eddy, 1996). Because many proteins contain multiple domains, many of these methods of protein clustering result in the establishment of incorrect families. This problem is complicated in metazoan proteomes, and the human proteome in particular, where multidomain proteins abound (Lander et al., 2001).
Domains are the building blocks of all globular proteins and present one of the most useful levels at which protein function can be understood (Copley et al., 2002b). Although the concept of domain now permeates biological descriptions, there are several definitions directed at different levels of the protein. In structural biology, a domain is defined as a spatially distinct, compact and stable protein structural unit that could conceivably fold and function in isolation (Branden and Tooze, 1999). On the other hand, domains are often defined as distinct regions of protein sequence that are highly conserved throughout evolution. These are described as sequence homologues and are often present in different molecular contexts (Ponting and Russell, 2002). Sequence-based domain definitions, which are central to the methods of domain identification and assignment, represent one of the most convenient and practically important levels at which the evolution and function of both proteins and domains can be understood. Failure to consider the extent of sequence similarity and its relationship to functional domains and motifs, among other things, can lead to the inference of an incorrect protein function (Bork and Koonin, 1998; Brenner, 1999; Devos and Valencia, 2001; Ponting and Dickens, 2001). There is a limited repertoire of types of domains (Chothia, 1992; Wolf et al., 2000), and the domains from this set are duplicated and combined in different ways to form the respective proteomes of various genomes in life. However, the presence of a shared domain within a group of proteins does not necessarily imply that these proteins perform the same or similar biochemical functions (Henikoff et al., 1997). For instance, many proteins sharing the so-called promiscuous domains, which are small, quite widespread protein modules (e.g. LRR_1, SH3_1, WD40, PDZ), have very different functions (Marcotte et al., 1999).
The multidomain nature of many proteins poses two obstacles to our understanding of the functions of uncharacterized proteins compared with those of the well-annotated proteins in various databases. The arrangement of different domains in protein sequences usually produces a plethora of significant alignments when the protein of interest is compared against protein databases, although few of the matched proteins have the same domain architecture as the query protein. Most of these matches are associated with proteins with different domain arrangements, whereas others are multiple hits within the same sequences where duplicated domains or repeats are present. The extent to which known functional information can be usefully transferred is more or less subjective, although the higher the proportion of domains shared by two proteins, the more similar their functions (Hegyi and Gerstein, 2001). Furthermore, differences in domain architecture among multidomain proteins often raise the question of whether these proteins are orthologous, even though they have clearly arisen, at least in part, from a common ancestor. Considering all these problems, it has been suggested that the concept of orthology is applicable only at the level of domains rather than at the level of proteins (Koonin et al., 2000; Ponting and Russell, 2002), except for proteins with identical domain architectures.
In this article, we define the domain architecture of a protein as the sequential order of the domains and their sequence from the N-terminus to the C-terminus (computationally based on existing domain databases, such as Pfam). This definition is equivalent to the domain accretion given by Koonin and colleagues (Koonin et al., 2002). According to this definition, the domain architecture of a protein contains not only information about the domain composition of the protein but also information about domain arrangements and domain duplications. For example, the two domain architectures, ABCCCD and ABCCD, where each capital letter represents a domain, should be considered to be two nearly identical but different architectures. Here, we propose a method that compares two proteins by measuring the similarity between their whole domain architectures. As expected, proteins with identical or near-identical domain architectures are more likely to be homologous, either orthologous or paralogous, than those with only partly identical domain architectures or different orders of their domains. We used the Pfam database (Bateman et al., 2002, 2004) as the source of protein domain definitions in our work. In principle, to compare two proteins with domain(s), we need only compare their domain architectures. Thus, for two different domain architectures, the method is implemented by combining three different indices (see Methods for details): the Jaccard index, which measures how many common domains the two architectures contain; the Goodman-Kruskal γ function, which estimates the similarity of the arrangement of the distinct domains shared by two architectures; and the domain duplicate index, which assesses how similar in duplication the individual domains are between the two architectures. The Jaccard index and Goodman-Kruskal γ index are both natural and common graph-theoretic measures that have been used for hierarchical clustering (Wolf et al., 2002). To test the effectiveness of the method, it was benchmarked using the eukaryotic orthologous groups of proteins (KOGs) database (Tatusov et al., 2000, 2003). For the convenience of users, a web server was set up, at which three different query interfaces are implemented. The comparison of proteins containing promiscuous domains is also demonstrated in detail, as a case study. Accordingly, the method is suitable for measuring the similarity of proteins with complex domain architectures at the level of the whole domain architecture rather than at the level of individual domains.
| 2 METHODS |
|---|
2.1 Data preparation
To compare protein domain architectures, the constituent elements of the domain architecture of a protein must first be strictly defined and pre-described. Recently, the establishment of several comprehensive databases of protein domains has been undertaken, including Pfam (Bateman et al., 2002), SUPERFAMILY (Gough, 2002; Madera et al., 2004), SMART (Letunic et al., 2002), InterPro (Mulder et al., 2002), and CDD (Marchler-Bauer et al., 2003). In this study, we used Pfam (Bateman et al., 2002, 2004) as the source of protein domain definitions. The Pfam database is a database of protein domain families with manually annotated multiple sequence alignments of high quality. As of July 2005, Pfam version 17.0 consists of 7868 Pfam families and 24 733 domain architectures. The protein sequences on which Pfam 17.0 is based are from the composite of Swiss-Prot release 46.0 and SP-TrEMBL release 29.0. Statistically, 75.24% of all 1 756 632 proteins in Pfam 17.0 contain a match to at least one Pfam entry.
2.2 Notations
Assume that proteins P and Q have $N_P$ and $N_Q$ individual domains, respectively. Counting the duplicated domains once only, they have $N'_P$ and $N'_Q$ distinct domains, respectively. Among these $N'_P + N'_Q$ distinct domains, they have $N'_{PQ}$ distinct domains in common. For each i-th of these $N'_P + N'_Q - N'_{PQ}$ distinct domains, we assign to $P_i$ the number of its duplicates within P and to $Q_i$ the number of its duplicates within Q, where $P_i = 0$ or $Q_i = 0$ when P or Q does not contain the i-th domain.
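The notation above can be made concrete with a short sketch. It assumes an architecture is written as a sequence of domain labels from the N-terminus to the C-terminus and reads "number of duplicates" as the copy number of a domain. The function name `architecture_counts` and the dictionary keys are illustrative choices, not part of the original method description.

```python
from collections import Counter

def architecture_counts(p, q):
    """Return the quantities of the Notations section for two architectures.

    Each architecture is a sequence of domain labels ordered from the
    N-terminus to the C-terminus, e.g. "ABCCCD" or ["PDZ", "LRR_1"].
    """
    p_counts, q_counts = Counter(p), Counter(q)
    distinct_p = set(p_counts)            # distinct domains of P
    distinct_q = set(q_counts)            # distinct domains of Q
    union = sorted(distinct_p | distinct_q)
    return {
        "n_p": len(p),                    # individual domains in P
        "n_q": len(q),                    # individual domains in Q
        "distinct_p": len(distinct_p),
        "distinct_q": len(distinct_q),
        "shared": len(distinct_p & distinct_q),
        # (copies in P, copies in Q) per distinct domain; 0 when absent
        "copies": {d: (p_counts[d], q_counts[d]) for d in union},
    }

counts = architecture_counts("ABCCCD", "ABCCD")
print(counts["shared"], counts["copies"]["C"])  # 4 (3, 2)
```

For the two example architectures from the Introduction, ABCCCD and ABCCD, all four distinct domains are shared and only the copy number of C differs.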
2.2.1 The Jaccard index
The Jaccard index is the index most frequently used to compare two sets of objects, in this article, the two sets of domains derived from two proteins. It is defined as the ratio of the number of shared domains to the number of distinct domains in the two proteins. Thus, the Jaccard index can be expressed as
$J_{PQ} = \dfrac{N'_{PQ}}{N'_P + N'_Q - N'_{PQ}}$ | (1) |
where $N'_{PQ}$ is the number of domains shared by the two proteins and $N'_P$ ($N'_Q$) is the number of distinct domains of protein P (Q); $J_{PQ} \in [0, 1]$. Thus, the Jaccard index measures how many domains are common to the two architectures.
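As a minimal sketch of this definition, the Jaccard index of two architectures can be computed from their sets of distinct domains. The function name `jaccard_index` and the return value for two empty architectures are assumptions made for illustration.

```python
def jaccard_index(p, q):
    """Jaccard index of the distinct-domain sets of architectures p and q."""
    distinct_p, distinct_q = set(p), set(q)
    union = distinct_p | distinct_q
    if not union:
        # Assumption: two proteins without any annotated domain score 0.
        return 0.0
    return len(distinct_p & distinct_q) / len(union)

print(jaccard_index("ABC", "BCD"))  # 0.5
```

Because duplicated domains are counted only once, ABCCCD and ABCCD have a Jaccard index of 1; the difference between them is left to the duplication index.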
2.2.2 The Goodman-Kruskal γ index
The importance of domain duplication, insertion and deletion during evolution can be inferred from the repetitive and piecemeal nature of extant proteins. Insertion of the same domain into an ancestral protein at different locations usually produces different domain architectures with the same domain content. In evolution, the order of each particular domain pair is fixed initially, and is conserved thereafter, with a small number of exceptions in which the domains occur in both orientations relative to each other (Vogel et al., 2004). To estimate the similarity of the order of two distinct domains between protein P and protein Q, we count the number of same-order pairs and the number of reversed pairs of domains in P and Q, after masking all duplicate domains except the one closest to the N-terminus. These two numbers, counted for all pairs of domains, are denoted $N_s$ and $N_r$, respectively. For example, in the two proteins P = ABC and Q = BCA, there are five different pairs of domain: AB, AC, BC, BA and CA. For the pair AB, there is no same-order pair in both P and Q, but there is one reversed pair (AB in P and BA in Q). Counting the occurrence of all five pairs in this way gives $N_s$ and $N_r$ for the example. The Goodman-Kruskal γ function then is calculated as follows,
$GK_{PQ} = \dfrac{N_s - N_r}{N_s + N_r}$ | (2) |
where $GK_{PQ} \in [-1, 1]$. $GK_{PQ}$ is defined as 0 if P and Q share zero or one domain.
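The order comparison can be sketched as follows. This is one possible reading, not necessarily the authors' exact counting convention: each unordered pair of shared distinct domains is counted once, as same-order or reversed, using the position of the copy closest to the N-terminus, and the standard Goodman-Kruskal form (same minus reversed, over their sum) is applied. The function name is hypothetical.

```python
from itertools import combinations

def goodman_kruskal_gamma(p, q):
    """Order similarity of the distinct domains shared by p and q."""
    # Keep only the first (most N-terminal) position of every domain,
    # which masks all further duplicates.
    first_p, first_q = {}, {}
    for pos, dom in enumerate(p):
        first_p.setdefault(dom, pos)
    for pos, dom in enumerate(q):
        first_q.setdefault(dom, pos)

    shared = sorted(set(first_p) & set(first_q))
    if len(shared) < 2:
        return 0.0  # defined as 0 for zero or one shared domain

    same = reversed_pairs = 0
    for x, y in combinations(shared, 2):
        if (first_p[x] < first_p[y]) == (first_q[x] < first_q[y]):
            same += 1
        else:
            reversed_pairs += 1
    return (same - reversed_pairs) / (same + reversed_pairs)

print(goodman_kruskal_gamma("ABCD", "ABCD"))  # 1.0
print(goodman_kruskal_gamma("AB", "BA"))      # -1.0
```

Under this convention, P = ABC and Q = BCA have one same-order pair (B before C) and two reversed pairs (A/B and A/C), so the sketch returns -1/3; a convention that counts ordered pairs separately would give different raw counts.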
2.2.3 Domain duplication similarity
Domain duplication events are often observed in multidomain proteins. Duplication has significant implications for both gene evolution and protein function. To assess the similarity of protein P and protein Q with respect to the amount of duplication of a shared domain, we have devised a simple index, DPQ, to measure the duplication similarity between the two proteins, which is defined as follows
![]() | (3) |
![]() |
2.2.4 Similarity between two domain architectures
With the three indices defined above, JPQ, GKPQ, and DPQ, we then defined a similarity measure to assess how similar two proteins are at the level of their whole domain architectures. At present, this is done by combining these three indices, each normalized to [0, 1], into a simple linear function with weighted factors a, b and c
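$S_{PQ} = a\,J_{PQ} + b\,GK_{PQ} + c\,D_{PQ}, \quad a + b + c = 1$ | (4) |
where $S_{PQ}$ denotes the similarity between the domain architectures of P and Q, and each index enters after normalization to [0, 1].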
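A minimal sketch of the weighted combination is given below. It takes the three index values as inputs that have already been normalized to [0, 1], because the normalization itself is not spelled out here. The function name is hypothetical, and the default weights are the combination (0.36, 0.01, 0.63) reported in the Implementation section.

```python
def combined_similarity(j, gk, d, a=0.36, b=0.01, c=0.63):
    """Weighted linear combination of three indices, each in [0, 1]."""
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights a, b and c are expected to sum to 1")
    for name, value in (("j", j), ("gk", gk), ("d", d)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return a * j + b * gk + c * d

print(combined_similarity(1.0, 1.0, 0.5))
```

With these weights the result is dominated by domain composition and duplication similarity, while domain order contributes at most 0.01.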
2.3 Distance matrix and clustering approach
To compare a list of proteins based on their domain architecture similarities, the distance-based neighbour-joining clustering method (Saitou and Nei, 1987) was used. Other approaches can also be used, such as UPGMA (Sokal and Sneath, 1973) and the Markov cluster algorithm (Dongen, 1998). In this article, the distance between a pair of proteins or a pair of domain architectures, denoted by P and Q, is simply defined as follows,
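$d_{PQ} = 1 - S_{PQ}$ | (5) |
where $S_{PQ}$ is the similarity between the two domain architectures given by formula (4).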
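The construction of a distance matrix for a list of architectures can be sketched as below. The sketch assumes that distance is one minus similarity and accepts any pairwise similarity function returning values in [0, 1]. In the usage example a plain Jaccard index stands in for the full similarity measure, purely for illustration; all names are hypothetical.

```python
def distance_matrix(architectures, similarity):
    """Symmetric matrix of pairwise distances, assumed to be 1 - similarity."""
    n = len(architectures)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - similarity(architectures[i], architectures[j])
            dist[i][j] = dist[j][i] = d
    return dist

def jaccard_stand_in(p, q):
    """Stand-in similarity: Jaccard index of the distinct-domain sets."""
    union = set(p) | set(q)
    return len(set(p) & set(q)) / len(union) if union else 0.0

archs = ["ABCCCD", "ABCCD", "ABE", "XY"]
matrix = distance_matrix(archs, jaccard_stand_in)
for name, row in zip(archs, matrix):
    print(name, [round(x, 2) for x in row])
```

A matrix of this form is the input that distance-based methods such as neighbour-joining or UPGMA operate on.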
| 3 IMPLEMENTATION |
|---|
3.1 Effectiveness benchmark using the KOGs database
Deciphering orthologous and paralogous relationships among genes is critical for functional genomics and evolutionary genomics. The COGs database of proteins detects candidate sets of orthologues among genes from 43 prokaryotic genomes and seven fully sequenced eukaryotic genomes (KOGs) (Tatusov et al., 2000, 2003). The COG system has become a widely used tool in the functional annotation of newly sequenced genomes and evolutionary analyses on a genome-wide scale. To test the effectiveness of our method of comparing homologous proteins, we performed a benchmark using KOGs on the assumption that proteins with identical or near-identical domain architectures are more likely to be homologues. As of August 2003, KOGs contained 4852 clusters of orthologous groups (http://www.ncbi.nlm.nih.gov/COG/new/). To guarantee computational reliability, all protein fragments in Pfam 17.0 were excluded from the analysis, and only proteins shared by both Pfam 17.0 and KOGs with identical sequences were included in the evaluation. After filtering with the above criteria, there were 4011 (out of 4852) groups from KOGs with more than one protein annotated by Pfam 17.0, and 4661 domain architectures present in these 4011 KOGs. For each domain architecture, we counted the number of KOGs containing it and found that 81% (3791 of 4661) of domain architectures were present in only a single KOG, which is consistent with the supposition that proteins with identical domain architectures are usually homologous. On the other hand, we also calculated the number of different domain architectures present in each of the 4011 KOGs. The distribution of KOGs that contain different domain architectures is shown in Figure 1. Surprisingly, only 65% (2608 of 4011) of KOGs contain exactly one domain architecture; we would have expected that most, if not all, of the 4011 KOGs would share identical domain architectures.
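The two counts described above (KOGs per domain architecture, and domain architectures per KOG) can be sketched on toy data as follows. The group identifiers and architectures in the example are invented placeholders, not entries from the KOGs database, and the function name is hypothetical.

```python
from collections import defaultdict

def architecture_statistics(kog_to_architectures):
    """Count architectures found in one KOG and KOGs with one architecture."""
    arch_to_kogs = defaultdict(set)
    for kog, archs in kog_to_architectures.items():
        for arch in archs:
            arch_to_kogs[arch].add(kog)
    return {
        "architectures": len(arch_to_kogs),
        "architectures_in_single_kog": sum(
            1 for kogs in arch_to_kogs.values() if len(kogs) == 1
        ),
        "kogs_with_one_architecture": sum(
            1 for archs in kog_to_architectures.values()
            if len(set(archs)) == 1
        ),
    }

# Toy input: group id -> architecture of each annotated member protein.
toy = {
    "group_1": ["AB", "AB", "AB"],
    "group_2": ["ABC", "ABCC"],
    "group_3": ["AB", "XY"],
}
print(architecture_statistics(toy))
```

In the toy data, AB occurs in two groups, so three of the four architectures are confined to a single group and only group_1 contains exactly one architecture.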
To obtain an optimal combination of the parameters a, b and c in formula (4) and for simplicity of computation, the 20% (802 of 4011) of KOGs with exactly two different domain architectures were examined further. Thus, 802 pairs of domain architectures were analyzed. We found that 77 of the 802 pairs of domain architectures shared no common domain (Supplementary Table S1). Of the 725 remaining pairs, we tested 4851 (99 × 98/2) different combinations of a, b and c in formula (4) by allowing a and b to vary from 0.01 to 0.98 in steps of 0.01. We calculated all distributions of the numbers of the 725 KOGs over 10 equally divided similarity intervals (bins) (see Fig. 2 as reference). To find an optimal combination of a, b and c in formula (4), we simply grouped the 10 similarity bins into three categories: near-identical (0.7 < sim ≤ 1), similar (0.3 < sim ≤ 0.7) and dissimilar (sim ≤ 0.3) domain architectures. Based on the numbers of KOGs classified into these three categories, we searched for combinations of a, b and c in formula (4) that maximized the difference obtained when the number of the 725 KOGs classified in the dissimilar category was subtracted from the number classified in the near-identical category. We identified two different combinations, (0.36, 0.01, 0.63) and (0.35, 0.01, 0.64), that satisfy the optimality criterion. The two combinations produce the same results when protein domain architectures are compared and we chose the first one for the analyses in this article. Figure 2 shows the distribution of the 725 KOGs over the 10 pooled bins. Most (96%) of these KOGs contain two nearly identical domain architectures. This observation, together with the observation that 3791 of 4661 (81%) domain architectures are only present in a single KOG, suggests that identical or nearly identical domain architectures can be used to infer homologous proteins, including orthologues and paralogues.
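The parameter search can be sketched as a small grid search. The sketch assumes that c is fixed by c = 1 - a - b and must be at least 0.01, which reproduces the 4851 combinations, and that each pair of architectures is summarized by its three index values already normalized to [0, 1]. The function names and the toy triples are illustrative only.

```python
def weight_grid():
    """All (a, b, c) in steps of 0.01 with a, b, c >= 0.01 and a + b + c = 1."""
    grid = []
    for a100 in range(1, 99):          # a = 0.01 .. 0.98
        for b100 in range(1, 99):      # b = 0.01 .. 0.98
            c100 = 100 - a100 - b100
            if c100 >= 1:
                grid.append((a100 / 100, b100 / 100, c100 / 100))
    return grid

def objective(weights, index_triples):
    """Near-identical pairs (sim > 0.7) minus dissimilar pairs (sim <= 0.3)."""
    a, b, c = weights
    near_identical = dissimilar = 0
    for j, gk, d in index_triples:
        sim = a * j + b * gk + c * d
        if sim > 0.7:
            near_identical += 1
        elif sim <= 0.3:
            dissimilar += 1
    return near_identical - dissimilar

def best_weights(index_triples):
    """Every weight combination that reaches the maximum objective value."""
    grid = weight_grid()
    scores = [objective(w, index_triples) for w in grid]
    top = max(scores)
    return [w for w, s in zip(grid, scores) if s == top]

print(len(weight_grid()))  # 4851
```

As in the text, several combinations can tie for the maximum, so `best_weights` returns a list rather than a single triple.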
Interestingly, 77 KOGs contained two completely unrelated domain architectures (Supplementary Table S1), i.e. the two domain architectures shared no common domain and their similarities were therefore calculated to be 0 by our method. However, the Pfam clan classification groups together different domain families that may have a common origin, so domain architectures that share no common Pfam-A domains may still share Pfam clans. Therefore, we reassigned the domains for the proteins of the 77 KOGs using the clan classification, to analyze the possibility that the proteins within a given KOG may share common clan(s). In this way, we found 33 (47%) of 77 KOGs with related domain architectures at the Pfam clan level (Supplementary Table S1a), indicating that other signatures, in addition to domains, are required to better understand protein evolution with our method.
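The clan-level reassignment can be illustrated with a small sketch: every domain that belongs to a clan is replaced by its clan label before the architectures are compared. The domain and clan names below are invented placeholders rather than real Pfam entries, and the function names are hypothetical.

```python
def to_clan_level(architecture, domain_to_clan):
    """Replace each domain by its clan; domains without a clan are kept."""
    return [domain_to_clan.get(dom, dom) for dom in architecture]

def share_any_label(p, q):
    """True if two architectures have at least one label in common."""
    return bool(set(p) & set(q))

# Hypothetical mapping: two families assumed to belong to the same clan.
domain_to_clan = {"family_1": "clan_X", "family_2": "clan_X"}

p = ["family_1", "family_3"]
q = ["family_2", "family_4"]

print(share_any_label(p, q))  # False
print(share_any_label(to_clan_level(p, domain_to_clan),
                      to_clan_level(q, domain_to_clan)))  # True
```

The two toy architectures share no domain at the family level but become related once both families are mapped to the same clan, which is the situation described for part of the 77 KOGs.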
3.2 Web server for comparing similar protein domain architectures
To allow users to compare domain architectures or proteins conveniently, a web service called Protein Domain Architecture Retrieval Tool (PDART; http://cmb.bnu.edu.cn/pdart/index.html) was implemented. At present, three different query interfaces are provided. Users can input proteins (UniProt accession numbers), domains (Pfam-A IDs or accession numbers) or KOGs (KOG IDs) to search the domain libraries of proteins with identical or similar domain architectures on our server. For a set of domain architectures, a distance matrix is computed online, and the neighbour-joining clustering approach (Saitou and Nei, 1987) is used via the PHYLIP software (Felsenstein, 2004). The query results are visualized as a dendrogram to depict the relationships of the set of domain architectures of interest. As exemplified with KOG0019, which contains HATPase_c and HSP90 proteins, both a network (Fig. 3a) and a dendrogram (Fig. 3b) representation of the relationships of the domain architectures in the KOG are visualized. A map of domains of proteins with the same domain architecture can be displayed by clicking on the respective hyperlink of the domain architecture on the dendrogram (Fig. 4b–d), whereby more specific information about either the domain or the protein can be accessed from the remote Pfam or UniProt servers by clicking on the respective hyperlink.
3.3 A case study: comparison of proteins with promiscuous domains
As mentioned above, widespread, typically repetitive domains such as WD40 (PF00400), SH2 (PF00017), SH3_1 (PF00018), PDZ (PF00595), LRR_1 (PF00560) and TPR (PF00515) often prevent sequence-based similarity algorithms from performing accurately and efficiently. For most comparison methods based on sequence similarity, masking these domains is required a priori to ensure the robust comparison or classification of homologous proteins, particularly in eukaryotic proteomes. Comparisons of these common promiscuous domains will otherwise result in the spurious lumping of numerous homologous hits, as for example in the construction of the COGs database (Tatusov et al., 2003). In this article, we use the PDZ and LRR_1 domains to demonstrate how proteins containing these promiscuous domains can be clearly compared using our method.
PDZ domains (80–90 amino acids in length) are found in diverse signalling proteins in bacteria, yeasts, plants, insects and vertebrates (Ponting, 1997). PDZ domains can occur in one or multiple copies and are nearly always found in cytoplasmic proteins. They bind either the C-terminal sequences of proteins or internal peptide sequences (Ponting et al., 1997). Leucine-rich repeats (LRR_1) are sequence motifs of 20–29 residues, present in tandem arrays in a number of proteins with diverse functions, including in hormone–receptor interactions, enzyme inhibition, cell adhesion and cellular trafficking. These repeats are usually involved in protein–protein interactions.
By extracting information for the corresponding domains from Pfam 17.0, we found that there are only 14 proteins that contain both PDZ and LRR_1 domains in three different domain architectures. Figure 4a shows the relationships of these three related domain architectures. These 14 proteins are from Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster and Caenorhabditis elegans. Only one protein with two PDZ domains, encoded in the mouse genome, is identified in Pfam 17.0 (Fig. 4c) and there are nine proteins containing a PDZ domain (Fig. 4d). The other four proteins contain four PDZ domains and are encoded in the Drosophila, human and mouse genomes (Fig. 4c).
| 4 DISCUSSION |
|---|
The ultimate reason for delineating protein domain families is to better understand protein functions. Currently, there are various domain-based resources available, such as Pfam (Bateman et al., 2002), SCOP (Murzin et al., 1995; Lo Conte et al., 2002), SMART (Letunic et al., 2002), CDD (Marchler-Bauer et al., 2003), InterPro (Mulder et al., 2003), and SUPERFAMILY (Gough, 2002). Although the approaches based on single domains have yielded many insights into the constraints and mechanisms of protein function and evolution, it is more reasonable to cluster single- or multiple-domain proteins into a single family only if they exhibit highly similar domain architectures.
In this article, we propose a method for measuring the similarity among protein domain architectures based on their Pfam-A domain annotations. Although the Pfam database may contain a small proportion of false positives and false negatives, it is currently one of the most useful domain annotation databases for protein sequences. The performance of our method is impressive, as demonstrated using KOGs. The method is also effective in resolving some problems that have confounded traditional sequence-based comparison approaches, such as the comparison of proteins with promiscuous domain(s). However, there are several caveats for our method and its implementation. It fails for proteins for which no annotated domain information is available and is weakened if the domains of a protein are incompletely annotated. The second caveat is the weakness of Pfam in handling repeats. Short repeated motifs such as the LRR_1 or TPR repeats often fall just below Pfam's threshold with the result that Pfam sometimes reports different numbers of adjacent repeats between a pair of very similar proteins. In such cases, the distance between the domain architectures ABCCCD and ACCD is larger than that between ABEEED and AEED, where E represents a repeat (e.g. LRR_1 or TPR) and C represents a non-repeat domain. Similarly, domain architectures defined in Pfam that share no common domain(s) may contain related domains from the Pfam clans. For example, Pfam subdivides serine/threonine protein kinases into subfamilies in a partly arbitrary manner. Although the subfamilies belong to the same clan, the current method fails to measure this relationship between such subfamilies. To overcome these shortcomings, one possible solution is to include more domain and/or motif features by integrating more domain annotations from other domain databases, such as InterPro (Mulder et al., 2003), although the relationships are more complicated between the domains/motifs in the InterPro database. 
Besides the sequence-based domain information, structural domain information is also highly important in the functional annotation of proteins. Thus, our method should be more useful if extant structural domain information can also be integrated. These types of structural data include SCOP (Andreeva et al., 2004) for known three-dimensional structures, SUPERFAMILY (Gough, 2002; Madera et al., 2004) for completely sequenced genomes and many others. The third caveat results from the combination of the three indices to produce a distance measure, clustering domain architectures based on a semi-metric property. Therefore, in this study, the clustering tree obtained by our method does not imply any phylogenetic relationships among the set of domain architectures. The fourth caveat is that the arrangement of the domains compared between two proteins, measured by the Goodman-Kruskal γ index, may not be comprehensive or satisfactory because we can currently only compare the order of two domains at a time along the sequence. The efficient comparison of more than two domains in different arrangements simultaneously remains unsolved at present. Another caveat may involve the issues of convergent evolution (e.g. homoplasy or parallelism), because domain architecture, like other protein features, might also be susceptible to this complexity.
Interestingly, this method could be used to explore the underlying evolutionary relationships among proteins at the level of their whole domain architectures, rather than at the single-domain level. This may be important in improving the quality of the annotation and classification of proteins, because the problem of inherited annotations via partial sequence matches occurs frequently in the extant protein classification databases (Bork and Koonin, 1998; Brenner, 1999; Devos and Valencia, 2001), and many of these annotation errors have been propagated throughout other molecular databases. A database that classifies proteins by domains using our method is under construction, although it fails when proteins contain no domains/motifs. On the other hand, a huge number of protein sequences are currently being predicted from various completely sequenced genomes. Among these, many important protein domain architectures and their corresponding functions require investigation. One phenomenon that may be biologically critical in these different domain architectures occurs when two domain architectures have an inverted order of domains. For example, such domain insertion processes have been demonstrated by analyzing, with our method, a family of proteins typified by both SH3_1 (Src homology 3) and PX (phox) domains (K. Lin, unpublished data). The observation that domain duplication and rearrangement occur more often than independent domain re-acquisition in genomes during the course of evolution suggests that the acquisition of either the SH3_1 or PX domain (depending on which is more ancient) might be inferred from the relationships of the domain architectures containing these two domains. We believe that, coupled to phylogenetic analysis, our method can facilitate better understanding of the evolution and biological functions of proteins and their domain architectures.
| Acknowledgments |
|---|
The authors thank two anonymous reviewers for their valuable comments. This research was supported by NSFC (Grant 30571037) and by Beijing Normal University. Funding to pay the Open Access publication charges was provided by Beijing Normal University.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Christos Ouzounis
Received on April 2, 2006; revised on June 5, 2006; accepted on July 2, 2006
| REFERENCES |
|---|
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Andreeva, A., et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, D226–D229.
Apic, G., et al. (2001) An insight into domain combinations. Bioinformatics, 17 (Suppl. 1), S83–S89.
Bateman, A., et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
Bork, P. and Koonin, E.V. (1998) Predicting functions from protein sequences – where are the bottlenecks? Nat. Genet., 18, 313–318.
Branden, C. and Tooze, J. (1999) Introduction to Protein Structure. Garland Publishing, New York.
Brenner, S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.
Chothia, C. (1992) Proteins. One thousand families for the molecular biologist. Nature, 357, 543–544.
Copley, R.R., et al. (2002a) Protein domain analysis in the era of complete genomes. FEBS Lett., 513, 129–134.
Copley, R.R., et al. (2002b) Sequence analysis of multidomain proteins: past perspectives and future directions. Adv. Protein Chem., 61, 75–98.
Devos, D. and Valencia, A. (2001) Intrinsic errors in genome annotation. Trends Genet., 17, 429–431.
Dongen, S.V. (1998) A New Cluster Algorithm for Graphs. Centrum voor Wiskunde en Informatica (CWI), The Netherlands.
Eddy, S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, 361–365.
Felsenstein, J. (2004) Phylogeny Inference Package.
Gough, J. (2002) The SUPERFAMILY database in structural genomics. Acta Crystallogr. D Biol. Crystallogr., 58, 1897–1900.
Hegyi, H. and Gerstein, M. (1999) The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol., 288, 147–164.
Hegyi, H. and Gerstein, M. (2001) Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res., 11, 1632–1640.
Henikoff, S., et al. (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science, 278, 609–614.
Koonin, E.V., et al. (2000) The impact of comparative genomics on our understanding of evolution. Cell, 101, 573–576.
Koonin, E.V., et al. (2002) The structure of the protein universe and genome evolution. Nature, 420, 218–223.
Lander, E.S., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Letunic, I., et al. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res., 30, 242–244.
Liu, J. and Rost, B. (2003) Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol., 7, 5–11.
Lo Conte, L., et al. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.
Madera, M., et al. (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res., 32, D235–D239.
Marchler-Bauer, A., et al. (2003) CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res., 31, 383–387.
Marcotte, E.M., et al. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.
Mulder, N.J., et al. (2002) InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief. Bioinform., 3, 225–235.
Mulder, N.J., et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.
Murzin, A.G., et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Ouzounis, C.A., et al. (2003) Classification schemes for protein structure and function. Nat. Rev. Genet., 4, 508–519.
Ponting, C.P. (1997) Evidence for PDZ domains in bacteria, yeast, and plants. Protein Sci., 6, 464–468.
Ponting, C.P. and Dickens, N.J. (2001) Genome cartography through domain annotation. Genome Biol., 2, comment2006.
Ponting, C.P. and Russell, R.R. (2002) The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct., 31, 45–71.
Ponting, C.P., et al. (1997) PDZ domains: targeting signalling molecules to sub-membranous sites. Bioessays, 19, 469–479.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Sokal, R. and Sneath, P. (1973) Numerical Taxonomy. Freeman, San Francisco.
Tatusov, R.L., et al. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.
Tatusov, R.L., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
Vogel, C., et al. (2004) Supra-domains: evolutionary units larger than single protein domains. J. Mol. Biol., 336, 809–823.
Wolf, Y.I., et al. (2000) Estimating the number of protein folds and families from complete genome data. J. Mol. Biol., 299, 897–905.
Wolf, Y.I., et al. (2002) Scale-free networks in biology: new insights into the fundamentals of evolution? Bioessays, 24, 105–109.