Bioinformatics Advance Access originally published online on July 12, 2006
Bioinformatics 2006 22(17):2081-2086; doi:10.1093/bioinformatics/btl366

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

An initial strategy for comparing proteins at the domain architecture level

Kui Lin *, Lei Zhu and Da-Yong Zhang

MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University Beijing 100875, China

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 DISCUSSION
 REFERENCES
 

Motivation: Ideally, only proteins that exhibit highly similar domain architectures should be compared with one another as homologues or be classified into a single family. By combining three different indices, the Jaccard index, the Goodman–Kruskal γ function and the domain duplicate index, into a single similarity measure, we propose a method for comparing proteins based on their domain architectures.

Results: Evaluation of the method using the eukaryotic orthologous groups of proteins (KOGs) database indicated that it allows the automatic and efficient comparison of multiple-domain proteins, which are usually refractory to classic approaches based on sequence similarity measures. As a case study, the PDZ and LRR_1 domains are used to demonstrate how proteins containing promiscuous domains can be clearly compared using our method. For the convenience of users, a web server was set up where three different query interfaces were implemented to compare different domain architectures or proteins with domain(s), and to identify the relationships among domain architectures within a given KOG from the Clusters of Orthologous Groups of Proteins database.

Conclusion: The approach we propose is suitable for estimating the similarity of domain architectures of proteins, especially those of multidomain proteins.

Availability: http://cmb.bnu.edu.cn/pdart/

Contact: linkui@bnu.edu.cn

Supplementary Information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
The public release of an increasing number of genomes has led to a huge amount of protein sequence data, which requires increasing expertise to understand. The goal of functional genomics is to determine the function of the proteins predicted from the genes identified in these sequenced genomes. To this end, it is essential to construct a comprehensive evolutionary classification of these proteins, because members of the same protein family may have or perform identical biochemical functions (Hegyi and Gerstein, 1999). Such a classification scheme is based on homologous relationships between genes. Many methods are currently available for clustering proteins into families; see Apic et al. (2001), Copley et al. (2002a), Liu and Rost (2003) and Ouzounis et al. (2003) for reviews. Most of those approaches rely on sequence similarity measures, such as those obtained with BLAST (Altschul et al., 1997) or hidden Markov models (Eddy, 1996). Because many proteins contain multiple domains, many of these methods of protein clustering result in the establishment of incorrect families. This problem is complicated in metazoan proteomes, and the human proteome in particular, where multidomain proteins abound (Lander et al., 2001).

Domains are the building blocks of all globular proteins and present one of the most useful levels at which protein function can be understood (Copley et al., 2002b). Although the concept of ‘domain’ now permeates biological descriptions, there are several definitions directed at different levels of the protein. In structural biology, a domain is defined as a spatially distinct, compact and stable protein structural unit that could conceivably fold and function in isolation (Branden and Tooze, 1999). In sequence analysis, on the other hand, domains are often defined as distinct regions of protein sequence that are highly conserved throughout evolution. These are described as sequence homologues and are often present in different molecular contexts (Ponting and Russell, 2002). Sequence-based domain definitions, which are central to the methods of domain identification and assignment, represent one of the most convenient and practically important levels at which the evolution and function of both proteins and domains can be understood. Failure to consider the extent of sequence similarity and its relationship to functional domains and motifs, among other things, can lead to the inference of an incorrect protein function (Bork and Koonin, 1998; Brenner, 1999; Devos and Valencia, 2001; Ponting and Dickens, 2001). There is a limited repertoire of types of domains (Chothia, 1992; Wolf et al., 2000), and the domains from this set are duplicated and combined in different ways to form the proteomes encoded by the various genomes. However, the presence of a shared domain within a group of proteins does not necessarily imply that these proteins perform the same or similar biochemical functions (Henikoff et al., 1997). For instance, many proteins sharing the so-called ‘promiscuous domains’, which are small, quite widespread protein modules (e.g. LRR_1, SH3_1, WD40, PDZ), have very different functions (Marcotte et al., 1999).

The multidomain nature of many proteins poses two obstacles to our understanding of the functions of uncharacterized proteins compared with those of the well-annotated proteins in various databases. The arrangement of different domains in protein sequences usually produces a plethora of significant alignments when the protein of interest is compared against protein databases, although few of the matched proteins have the same domain architecture as the query protein. Most of these matches are associated with proteins with different domain arrangements, whereas others are multiple hits within the same sequences where duplicated domains or repeats are present. The extent to which known functional information can be usefully transferred is more or less subjective, although the higher the proportion of domains shared by two proteins, the more similar their functions (Hegyi and Gerstein, 2001). Furthermore, differences in domain architecture among multidomain proteins often raise the question of whether these proteins are orthologous, even though they have clearly arisen, at least in part, from a common ancestor. Considering all these problems, it has been suggested that the concept of orthology is applicable only at the level of domains rather than at the level of proteins (Koonin et al., 2000; Ponting and Russell, 2002), except for proteins with identical domain architectures.

In this article, we define the domain architecture of a protein as the domains of the protein and their sequential order from the N-terminus to the C-terminus (determined computationally from existing domain databases, such as Pfam). This definition is equivalent to the domain accretion given by Koonin and colleagues (Koonin et al., 2002). According to this definition, the domain architecture of a protein contains not only information about the domain composition of the protein but also information about domain arrangements and domain duplications. For example, the two domain architectures, ‘ABCCCD’ and ‘ABCCD’, where each capital letter represents a domain, should be considered to be two nearly identical but different architectures. Here, we propose a method that compares two proteins by measuring the similarity between their whole domain architectures. As expected, proteins with identical or near-identical domain architectures are more likely to be homologous, either orthologous or paralogous, than those with only partly identical domain architectures or different orders of their domains. We used the Pfam database (Bateman et al., 2002, 2004) as the source of protein domain definitions in our work. In principle, to compare two proteins with domain(s), we need only compare their domain architectures. Thus, for two different domain architectures, the method is implemented by combining three different indices (see Methods for details): the Jaccard index, which measures how many common domains the two architectures contain; the Goodman–Kruskal γ function, which estimates the similarity of the arrangement of the distinct domains shared by two architectures; and the domain duplicate index, which assesses how similar in duplication the individual domains are between the two architectures. The Jaccard index and Goodman–Kruskal γ index are both natural and common graph-theoretic measures that have been used for hierarchical clustering (Wolf et al., 2002).
To test the effectiveness of the method, it was benchmarked using the eukaryotic orthologous groups of proteins (KOGs) database (Tatusov et al., 2000, 2003). For the convenience of users, a web server was set up, at which three different query interfaces are implemented. The comparison of proteins containing promiscuous domains is also demonstrated in detail, as a case study. The results indicate that the method is suitable for measuring the similarity of proteins with complex domain architectures at the level of the whole domain architecture rather than at the level of individual domains.


    2 METHODS
 
2.1 Data preparation
To compare protein domain architectures, the constituent elements of the domain architecture of a protein must first be strictly defined and pre-described. Recently, the establishment of several comprehensive databases of protein domains has been undertaken, including Pfam (Bateman et al., 2002), SUPERFAMILY (Gough, 2002; Madera et al., 2004), SMART (Letunic et al., 2002), InterPro (Mulder et al., 2002), and CDD (Marchler-Bauer et al., 2003). In this study, we used Pfam (Bateman et al., 2002, 2004) as the source of protein domain definitions. The Pfam database is a database of protein domain families with manually annotated multiple sequence alignments of high quality. As of July 2005, Pfam version 17.0 consists of 7868 Pfam families and 24 733 domain architectures. The protein sequences on which Pfam 17.0 is based are from the composite of Swiss-Prot release 46.0 and SP-TrEMBL release 29.0. Statistically, 75.24% of all 1 756 632 proteins in Pfam 17.0 contain a match to at least one Pfam entry.

2.2 Notations
Assume that proteins P and Q have $N_P$ and $N_Q$ individual domains, respectively. Counting the duplicated domains once only, they have $N'_P$ and $N'_Q$ distinct domains, respectively. Among these distinct domains, they have $N'_{PQ}$ distinct domains in common, so that together they contain $N'_P + N'_Q - N'_{PQ}$ distinct domains. For the i-th of these distinct domains, we assign to $P_i$ the number of its duplicates within P and to $Q_i$ the number of its duplicates within Q, where $P_i = 0$ or $Q_i = 0$ when P or Q does not contain the i-th domain.

2.2.1 The Jaccard index
The Jaccard index is the index most frequently used to compare two sets of objects; in this article, these are the two sets of domains derived from two proteins. It is defined as the ratio of the number of shared domains to the number of distinct domains in the two proteins. Thus, the Jaccard index can be expressed as

$$J_{PQ} = \frac{N'_{PQ}}{N'_P + N'_Q - N'_{PQ}} \qquad (1)$$
where $N'_{PQ}$ is the number of distinct domains shared by the two proteins, $N'_P$ ($N'_Q$) is the number of distinct domains of protein P (Q), and $J_{PQ} \in [0, 1]$. Thus, the Jaccard index measures how many domains are common to the two architectures.
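
As an illustration of the definition above, the following minimal Python sketch computes the Jaccard index for two architectures written as strings of single-letter domain labels, as in the ‘ABCCCD’ example. The function name `jaccard_index` and the choice to return 0 for two empty architectures are assumptions made for this sketch, not part of the published method.

```python
def jaccard_index(arch_p, arch_q):
    """Jaccard index of two domain architectures.

    Each architecture is a sequence of domain labels, N- to C-terminus.
    Duplicated domains are counted once only.
    """
    domains_p = set(arch_p)  # distinct domains of P
    domains_q = set(arch_q)  # distinct domains of Q
    union = domains_p | domains_q
    if not union:
        # Assumption for this sketch: two empty architectures score 0.
        return 0.0
    return len(domains_p & domains_q) / len(union)


print(jaccard_index("ABCCCD", "ABCCD"))  # 1.0: same domain content
print(jaccard_index("ABC", "BCD"))       # 0.5: 2 shared of 4 distinct
```

Because duplicates are collapsed, ‘ABCCCD’ and ‘ABCCD’ are indistinguishable by this index alone; the difference between them is left to the duplication index.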

2.2.2 The Goodman–Kruskal γ index
The importance of domain duplication, insertion and deletion during evolution can be inferred from the repetitive and piecemeal nature of extant proteins. Insertion of the same domain into an ancestral protein at different locations usually produces different domain architectures with the same domain content. In evolution, the order of each particular domain pair is fixed initially, and is conserved thereafter, with a small number of exceptions in which the domains occur in both orientations relative to each other (Vogel et al., 2004). To estimate the similarity of the order of two distinct domains between protein P and protein Q, we count the number of same-order pairs and the number of reversed pairs of domains in P and Q, after masking all duplicate domains except the one closest to the N-terminus. These two numbers, counted for all pairs of domains, are denoted $N_s$ and $N_r$, respectively. For example, in the two proteins P = ‘ABC’ and Q = ‘BCA’, there are five different pairs of domains: ‘AB’, ‘AC’, ‘BC’, ‘BA’ and ‘CA’. For the pair ‘AB’, there is no same-order pair in both P and Q, but there is one reversed pair (‘AB’ in P and ‘BA’ in Q). Counting the occurrences of all five pairs gives the values of $N_s$ and $N_r$. The Goodman–Kruskal γ function is then calculated as follows:

$$\gamma = \frac{N_s - N_r}{N_s + N_r} \qquad (2)$$
It is finally normalized to [0, 1] and denoted $GK_{PQ}$, where $GK_{PQ} = (1 + \gamma)/2$. $GK_{PQ}$ is defined as 0 if P and Q share zero or one domain.
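
The order comparison can be sketched in Python as follows. This is only an illustration under stated assumptions: each unordered pair of shared distinct domains is counted once as either same-order or reversed (the pair-counting convention of the original implementation may differ), and γ is mapped to [0, 1] by (1 + γ)/2. The names `first_positions` and `gk_similarity` are hypothetical.

```python
from itertools import combinations


def first_positions(arch):
    """Map each distinct domain to the index of its N-terminal-most copy.

    This implements the masking of all duplicates except the first one.
    """
    positions = {}
    for index, domain in enumerate(arch):
        positions.setdefault(domain, index)
    return positions


def gk_similarity(arch_p, arch_q):
    """Normalized Goodman-Kruskal gamma for the order of shared domains."""
    pos_p = first_positions(arch_p)
    pos_q = first_positions(arch_q)
    shared = sorted(set(pos_p) & set(pos_q))
    if len(shared) < 2:
        return 0.0  # defined as 0 when zero or one domain is shared

    same_order = reversed_order = 0
    # Assumption: every unordered pair of shared domains is counted once.
    for x, y in combinations(shared, 2):
        if (pos_p[x] < pos_p[y]) == (pos_q[x] < pos_q[y]):
            same_order += 1
        else:
            reversed_order += 1

    gamma = (same_order - reversed_order) / (same_order + reversed_order)
    return (1 + gamma) / 2  # map [-1, 1] onto [0, 1]


print(gk_similarity("ABC", "ABC"))  # 1.0: identical order
print(gk_similarity("AB", "BA"))    # 0.0: fully reversed
```

Under this counting convention, P = ‘ABC’ and Q = ‘BCA’ give one same-order pair (B, C) and two reversed pairs (A, B and A, C), so γ = -1/3 and the normalized value is 1/3.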

2.2.3 Domain duplication similarity
Domain duplication events are often observed in multidomain proteins. Duplication has significant implications for both gene evolution and protein function. To assess the similarity of protein P and protein Q with respect to the amount of duplication of a shared domain, we have devised a simple index, $D_{PQ}$, to measure the duplication similarity between the two proteins, which is defined as follows

[Formula (3): definition of $D_{PQ}$; equation image not recoverable]

2.2.4 Similarity between two domain architectures
With the three indices defined above, $J_{PQ}$, $GK_{PQ}$ and $D_{PQ}$, we then defined a similarity measure to assess how similar two proteins are at the level of their whole domain architectures. At present, this is done by combining these three indices, each normalized to [0, 1], into a simple linear function with weighted factors a, b and c:

$$S_{PQ} = a\,J_{PQ} + b\,GK_{PQ} + c\,D_{PQ} \qquad (4)$$
where a + b + c = 1, a > 0, b > 0, c > 0. By testing various combinations of values for the parameters a, b and c, the combination (0.36, 0.01, 0.63) was selected as an optimally weighted scheme by benchmarking using the Clusters of Orthologous Groups of Proteins (COGs) database (see Implementation for details).
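
The weighted combination can be written as a short Python sketch. The three indices are taken as already-computed inputs in [0, 1], so the sketch makes no assumption about how the duplication index is calculated; the function name `combined_similarity` and the validation of the weights are illustrative choices.

```python
def combined_similarity(j, gk, d, weights=(0.36, 0.01, 0.63)):
    """Linear combination of the three indices, each in [0, 1].

    j  -- Jaccard index
    gk -- normalized Goodman-Kruskal gamma index
    d  -- domain duplication similarity
    The default weights (a, b, c) are the combination selected in the text.
    """
    a, b, c = weights
    if min(a, b, c) <= 0 or abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    return a * j + b * gk + c * d


# Identical architectures score 1 on every index, hence 1 overall.
print(round(combined_similarity(1.0, 1.0, 1.0), 6))  # 1.0
```

With these weights the domain order contributes at most 0.01 to the score, so two architectures with the same content and duplication counts but a fully reversed order still score 0.99.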

2.3 Distance matrix and clustering approach
To compare a list of proteins based on their domain architecture similarities, the distance-based neighbour-joining clustering method (Saitou and Nei, 1987) was used. Other approaches can also be used, such as UPGMA (Sokal and Sneath, 1973) and the Markov cluster algorithm (Dongen, 1998). In this article, the distance between a pair of proteins or a pair of domain architectures, denoted by P and Q, is simply defined as follows,

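$$\mathrm{Dist}_{PQ} = 1 - S_{PQ} \qquad (5)$$
where $S_{PQ}$ is the similarity between P and Q given by formula (4).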
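
A pairwise distance matrix of the kind passed to a clustering program can be sketched as below. Two things are assumptions here: the distance is taken to be one minus the similarity, and `toy_similarity` is a stand-in (a plain Jaccard index) used only to make the example runnable; in practice the combined similarity measure would be supplied instead.

```python
def distance_matrix(architectures, similarity):
    """Symmetric matrix of pairwise distances with a zero diagonal.

    Assumption for this sketch: distance = 1 - similarity.
    """
    n = len(architectures)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - similarity(architectures[i], architectures[j])
            dist[i][j] = dist[j][i] = d
    return dist


def toy_similarity(arch_p, arch_q):
    """Stand-in similarity (Jaccard index on distinct domains)."""
    p, q = set(arch_p), set(arch_q)
    return len(p & q) / len(p | q) if (p | q) else 0.0


matrix = distance_matrix(["ABC", "ABCC", "XYZ"], toy_similarity)
for row in matrix:
    print(row)
```

The resulting matrix can then be handed to any distance-based clustering method, such as neighbour-joining or UPGMA.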


    3 IMPLEMENTATION
 
3.1 Effectiveness benchmark using the KOGs database
Deciphering orthologous and paralogous relationships among genes is critical for functional genomics and evolutionary genomics. The COGs database of proteins identifies candidate sets of orthologues among genes from 43 prokaryotic genomes and seven fully sequenced eukaryotic genomes (KOGs) (Tatusov et al., 2000, 2003). The COG system has become a widely used tool in the functional annotation of newly sequenced genomes and evolutionary analyses on a genome-wide scale. To test the effectiveness of our method of comparing homologous proteins, we performed a benchmark using KOGs on the assumption that proteins with identical or near-identical domain architectures are more likely to be homologues. As of August 2003, KOGs contained 4852 clusters of orthologous groups (http://www.ncbi.nlm.nih.gov/COG/new/). To guarantee computational reliability, all protein fragments in Pfam 17.0 were excluded from the analysis, and only proteins present in both Pfam 17.0 and KOGs with identical sequences were included in the evaluation. After filtering with the above criteria, there were 4011 (out of 4852) groups from KOGs with more than one protein annotated by Pfam 17.0, and 4661 domain architectures present in these 4011 KOGs. For each domain architecture, we counted the number of KOGs containing it and found that 81% (3791 of 4661) of domain architectures were present in only a single KOG, which is consistent with the supposition that proteins with identical domain architectures are usually homologous. We also calculated the number of different domain architectures present in each of the 4011 KOGs. The distribution of KOGs that contain different domain architectures is shown in Figure 1. Surprisingly, only 65% (2608 of 4011) of KOGs contain exactly one domain architecture; we would have expected that most, if not all, of the 4011 KOGs would each contain a single domain architecture.
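
The two tallies described above (how many KOGs contain each architecture, and how many architectures each KOG contains) can be sketched in Python as follows. The input format, the function name `architecture_statistics` and the labels in the toy data are hypothetical; they are not taken from the KOGs or Pfam files.

```python
from collections import defaultdict


def architecture_statistics(assignments):
    """Summarize (kog_id, architecture) pairs, one pair per protein.

    Returns the fraction of architectures found in exactly one KOG and
    the fraction of KOGs containing exactly one architecture.
    """
    kogs_per_arch = defaultdict(set)
    archs_per_kog = defaultdict(set)
    for kog, arch in assignments:
        kogs_per_arch[arch].add(kog)
        archs_per_kog[kog].add(arch)
    if not kogs_per_arch:
        raise ValueError("no assignments given")

    single_kog_archs = sum(1 for k in kogs_per_arch.values() if len(k) == 1)
    single_arch_kogs = sum(1 for a in archs_per_kog.values() if len(a) == 1)
    return {
        "frac_archs_in_one_kog": single_kog_archs / len(kogs_per_arch),
        "frac_kogs_with_one_arch": single_arch_kogs / len(archs_per_kog),
    }


# Toy data with made-up group and architecture labels.
toy = [("K1", "AB"), ("K1", "AB"), ("K2", "AB"), ("K2", "ABC"), ("K3", "D")]
print(architecture_statistics(toy))
```

In the toy data, architecture ‘AB’ occurs in two groups and group ‘K2’ holds two architectures, so both fractions are 2/3.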


 
Fig. 1 Distribution of KOGs according to the number of different domain architectures.

 
To obtain an optimal combination of the parameters a, b and c in formula (4) and for simplicity of computation, the 20% (802 of 4011) of KOGs with exactly two different domain architectures were examined further. Thus, 802 pairs of domain architectures were analyzed. We found that 77 of the 802 pairs of domain architectures shared no common domain (Supplementary Table S1). For the 725 remaining pairs, we tested 4851 (99 × 98/2) different combinations of a, b and c in formula (4) by allowing a and b to vary from 0.01 to 0.98 in steps of 0.01. For each combination, we calculated the distribution of the 725 KOGs over 10 equally divided similarity intervals (bins) (see Fig. 2 as reference). To find an optimal combination of a, b and c in formula (4), we simply grouped the 10 similarity bins into three categories: near-identical (0.7 < sim ≤ 1), similar (0.3 < sim ≤ 0.7) and dissimilar (sim ≤ 0.3) domain architectures. Based on the numbers of KOGs classified into these three categories, we searched for combinations of a, b and c in formula (4) that maximized the difference between the number of the 725 KOGs classified in the near-identical category and the number classified in the dissimilar category. We identified two different combinations, (0.36, 0.01, 0.63) and (0.35, 0.01, 0.64), that attain this maximum. The two combinations produce the same results when protein domain architectures are compared, and we chose the first one for the analyses in this article. Figure 2 shows the distribution of the 725 KOGs over the 10 pooled bins. Most (96%) of these KOGs contain two nearly identical domain architectures. This observation, together with the observation that 3791 of 4661 (81%) domain architectures are only present in a single KOG, suggests that identical or nearly identical domain architectures can be used to infer homologous proteins, including orthologues and paralogues.
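
The parameter search described above can be sketched in Python. The grid enumerates every (a, b, c) with a and b in steps of 0.01 and c = 1 - a - b > 0, which yields the 4851 combinations mentioned in the text. The function names, and the representation of each pair of architectures as a precomputed triple of index values, are assumptions of this sketch; the index values in the example are invented.

```python
def weight_grid(total=100):
    """Yield all (a, b, c) in steps of 1/total with a, b, c > 0, a + b + c = 1."""
    for a in range(1, total - 1):      # a = 0.01 .. 0.98
        for b in range(1, total - a):  # leaves c >= 0.01
            c = total - a - b
            yield a / total, b / total, c / total


def objective(weights, index_triples):
    """Near-identical count minus dissimilar count for one weighting."""
    a, b, c = weights
    near_identical = dissimilar = 0
    for j, gk, d in index_triples:
        sim = a * j + b * gk + c * d
        if sim > 0.7:
            near_identical += 1
        elif sim <= 0.3:
            dissimilar += 1
    return near_identical - dissimilar


def best_weights(index_triples):
    """Return every weighting that attains the maximum objective (ties kept)."""
    scored = [(objective(w, index_triples), w) for w in weight_grid()]
    top = max(score for score, _ in scored)
    return [w for score, w in scored if score == top]


print(len(list(weight_grid())))  # 4851
# Invented index triples (J, GK, D) for three pairs of architectures.
pairs = [(1.0, 1.0, 0.9), (0.5, 1.0, 1.0), (0.2, 0.0, 0.1)]
print(len(best_weights(pairs)))
```

Returning all tied weightings rather than a single winner mirrors the situation reported in the text, where two combinations attained the same optimum.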


 
Fig. 2 Distribution of KOGs with exactly two domain architectures, based on the similarity of the two architectures. Dissimilarity of two domain architectures means that they share no common domains.

 
Interestingly, 77 KOGs contained two completely unrelated domain architectures (Supplementary Table S1), i.e. the two domain architectures shared no common domain and, therefore, their similarities were calculated to be 0 by our method. However, the Pfam clan classification groups together domain families that may have a common origin. Domain architectures that share no common Pfam-A domains may still share domains at the level of Pfam clans. Therefore, we reassigned the domains for the proteins of the 77 KOGs using the clan classification, to analyze the possibility that the proteins within a given KOG may share common clan(s). In this way, we found 33 (47%) of 77 KOGs with related domain architectures at the Pfam clan level (Supplementary Table S1a), indicating that other signatures, in addition to domains, are required to better understand protein evolution with our method.
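
The clan-level reassignment can be illustrated with a small Python sketch. The family-to-clan mapping is passed in as a plain dictionary, and all identifiers in the example (`FAM_A`, `CLAN_X` and so on) are invented placeholders rather than real Pfam accessions; families without a clan are assumed to keep their own identifier.

```python
def to_clan_architecture(arch, family_to_clan):
    """Replace each family ID by its clan ID when the family has a clan."""
    return [family_to_clan.get(domain, domain) for domain in arch]


def share_any_domain(arch_p, arch_q):
    """True if the two architectures have at least one label in common."""
    return bool(set(arch_p) & set(arch_q))


# Invented placeholder identifiers, not real Pfam families or clans.
family_to_clan = {"FAM_A": "CLAN_X", "FAM_B": "CLAN_X"}
arch_p = ["FAM_A", "FAM_C"]
arch_q = ["FAM_B"]

print(share_any_domain(arch_p, arch_q))  # False at the family level
print(share_any_domain(
    to_clan_architecture(arch_p, family_to_clan),
    to_clan_architecture(arch_q, family_to_clan),
))  # True at the clan level
```

After the mapping, the same similarity indices can be applied to the clan-level architectures without further changes.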

3.2 Web server for comparing similar protein domain architectures
To allow users to compare domain architectures or proteins conveniently, a web service called Protein Domain Architecture Retrieval Tool (PDART; http://cmb.bnu.edu.cn/pdart/index.html) was implemented. At present, three different query interfaces are provided. Users can input proteins (UniProt accession numbers), domains (Pfam-A IDs or accession numbers) or KOGs (KOG IDs) to search the domain libraries of proteins with identical or similar domain architectures on our server. For a set of domain architectures, a distance matrix is computed online, and the neighbour-joining clustering approach (Saitou and Nei, 1987) is used via the PHYLIP software (Felsenstein, 2004). The query results are visualized as a dendrogram to depict the relationships of the set of domain architectures of interest. As exemplified with KOG0019, which contains HATPase_c and HSP90 proteins, both a network (Fig. 3a) and a dendrogram (Fig. 3b) representation of the relationships of the domain architectures in the KOG are visualized. A map of domains of proteins with the same domain architecture can be displayed by clicking on the respective hyperlink of the domain architecture on the dendrogram (Fig. 4b–d), whereby more specific information about either the domain or the protein can be accessed from the remote Pfam or UniProt servers by clicking on the respective hyperlink.


 
Fig. 3 The similarity relationships of domain architectures in KOG0019 which contain HATPase_c and HSP90. A network representation (a) and a dendrogram representation (b) are shown. Each domain architecture is denoted by an integer that is defined internally in Pfam 17.0 and the number accompanying each edge in the network is the similarity between the two architectures.

 

 
Fig. 4 Fourteen homologous proteins that belong to three different domain architectures containing two different promiscuous domains, PDZ (PF00595, in blue) and LRR_1 (PF00560, in dark green). (a) The similarity relationships of the three different domain architectures inferred by our method. (b–d) The homologous proteins of each domain architecture. Each of these three domain architectures is denoted by an integer that is defined internally in the Pfam database.

 
3.3 A case study: comparison of proteins with promiscuous domains
As mentioned above, widespread, typically repetitive domains such as WD40 (PF00400), SH2 (PF00017), SH3_1 (PF00018), PDZ (PF00595), LRR_1 (PF00560) and TPR (PF00515) consistently prevent most sequence-based similarity algorithms from performing accurately and efficiently. For most comparison methods based on sequence similarity, masking these domains is required a priori to ensure the robust comparison or classification of homologous proteins, particularly in eukaryotic proteomes. Comparisons of these common ‘promiscuous’ domains will otherwise result in the spurious lumping of numerous homologous hits, as for example in the construction of the COGs database (Tatusov et al., 2003). In this article, we use the PDZ and LRR_1 domains to demonstrate how proteins containing these promiscuous domains can be clearly compared using our method.

PDZ domains (80–90 amino acids in length) are found in diverse signalling proteins in bacteria, yeasts, plants, insects and vertebrates (Ponting, 1997). PDZ domains can occur in one or multiple copies and are nearly always found in cytoplasmic proteins. They bind either the C-terminal sequences of proteins or internal peptide sequences (Ponting et al., 1997). Leucine-rich repeats (LRR_1) are sequence motifs of 20–29 residues, present in tandem arrays in a number of proteins with diverse functions, including in hormone receptor interactions, enzyme inhibition, cell adhesion and cellular trafficking. These repeats are usually involved in protein–protein interactions.

By extracting information for the corresponding domains from Pfam 17.0, we found that there are only 14 proteins that contain both PDZ and LRR_1 domains in three different domain architectures. Figure 4a shows the relationships of these three related domain architectures. These 14 proteins are from Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster and Caenorhabditis elegans. Only one protein with two PDZ domains, encoded in the mouse genome, is identified in Pfam 17.0 (Fig. 4c) and there are nine proteins containing a PDZ domain (Fig. 4d). The other four proteins contain four PDZ domains and are encoded in the Drosophila, human and mouse genomes (Fig. 4c).


    4 DISCUSSION
 
The ultimate reason for delineating protein domain families is to better understand protein functions. Currently, there are various domain-based resources available, such as Pfam (Bateman et al., 2002), SCOP (Murzin et al., 1995; Lo Conte et al., 2002), SMART (Letunic et al., 2002), CDD (Marchler-Bauer et al., 2003), InterPro (Mulder et al., 2003), and SUPERFAMILY (Gough, 2002). Although the approaches based on single domains have yielded many insights into the constraints and mechanisms of protein function and evolution, it is more reasonable to cluster single- or multiple-domain proteins into a single family only if they exhibit highly similar domain architectures.

In this article, we propose a method for measuring the similarity among protein domain architectures based on their Pfam-A domain annotations. Although the Pfam database may contain a small proportion of false positives and false negatives, it is currently one of the most useful domain annotation databases for protein sequences. The performance of our method is impressive, as demonstrated using KOGs. The method is also effective in resolving some problems that have confounded traditional sequence-based comparison approaches, such as the comparison of proteins with promiscuous domain(s). However, there are several caveats for our method and its implementation. It fails for proteins for which no annotated domain information is available and is weakened if the domains of a protein are incompletely annotated. The second caveat is the weakness of Pfam in handling repeats. Short repeated motifs such as the LRR_1 or TPR repeats often fall just below Pfam's threshold with the result that Pfam sometimes reports different numbers of adjacent repeats between a pair of very similar proteins. In such cases, the distance between the domain architectures ‘ABCCCD’ and ‘ACCD’ is larger than that between ‘ABEEED’ and ‘AEED’, where ‘E’ represents a repeat (e.g. LRR_1 or TPR) and ‘C’ represents a non-repeat domain. Similarly, domain architectures defined in Pfam that share no common domain(s) may contain related domains from the Pfam clans. For example, Pfam subdivides serine/threonine protein kinases into subfamilies in a partly arbitrary manner. Although the subfamilies belong to the same clan, the current method fails to measure this relationship between such subfamilies. To overcome these shortcomings, one possible solution is to include more domain and/or motif features by integrating more domain annotations from other domain databases, such as InterPro (Mulder et al., 2003), although the relationships are more complicated between the domains/motifs in the InterPro database. 
Besides the sequence-based domain information, structural domain information is also very important in the functional annotation of proteins. Thus, our method should be more useful if extant structural domain information can also be integrated. These types of structural data include SCOP (Andreeva et al., 2004) for known three-dimensional structures, SUPERFAMILY (Gough, 2002; Madera et al., 2004) for completely sequenced genomes and many others. The third caveat results from the combination of the three indices to produce a distance measure, so that domain architectures are clustered on the basis of a semi-metric property. Therefore, in this study, the clustering tree obtained by our method does not imply any phylogenetic relationships among the set of domain architectures. The fourth caveat is that the arrangement of the domains compared between two proteins, measured by the Goodman–Kruskal γ index, may not be comprehensive or satisfactory because we can currently only compare the order of two domains at a time along the sequence. The efficient comparison of more than two domains in different arrangements simultaneously remains unsolved at present. Another caveat may involve the issues of convergent evolution (e.g. homoplasy or parallelism), because domain architecture, like other protein features, might also be susceptible to this complexity.

Interestingly, this method could be used to explore the underlying evolutionary relationships among proteins at the level of their whole domain architectures, rather than at the single-domain level. This may be important in improving the quality of the annotation and classification of proteins, because the problem of inherited annotations via partial sequence matches occurs frequently in the extant protein classification databases (Bork and Koonin, 1998; Brenner, 1999; Devos and Valencia, 2001), and many of these annotation errors have been propagated throughout other molecular databases. A database that classifies proteins by domains using our method is under construction, although it fails when proteins contain no domains/motifs. On the other hand, a huge number of protein sequences are currently being predicted from various completely sequenced genomes. Among these, many important protein domain architectures and their corresponding functions require investigation. One phenomenon that may be biologically critical in these different domain architectures occurs when two domain architectures have an inverted order of domains. For example, such domain insertion processes have been demonstrated by analyzing, with our method, a family of proteins typified by both SH3_1 (Src homology 3) and PX (phox) domains (K. Lin, unpublished data). The observation that domain duplication and rearrangement occur more often than independent domain re-acquisition in genomes during the course of evolution suggests that the acquisition of either the SH3_1 or PX domain (depending on which is more ancient) might be inferred from the relationships of the domain architectures containing these two domains. We believe that, coupled to phylogenetic analysis, our method can facilitate better understanding of the evolution and biological functions of proteins and their domain architectures.


    Acknowledgments
 
The authors thank two anonymous reviewers for their valuable comments. This research was supported by NSFC (Grant 30571037) and by Beijing Normal University. Funding to pay the Open Access publication charges was provided by Beijing Normal University.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Christos Ouzounis

Received on April 2, 2006; revised on June 5, 2006; accepted on July 2, 2006

    REFERENCES

    Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Andreeva, A., et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, D226–D229.

    Apic, G., et al. (2001) An insight into domain combinations. Bioinformatics, 17, Suppl. 1, S83–S89.

    Bateman, A., et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

    Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

    Bork, P. and Koonin, E.V. (1998) Predicting functions from protein sequences: where are the bottlenecks? Nat. Genet., 18, 313–318.

    Branden, C. and Tooze, J. (1999) Introduction to Protein Structure. Garland Publishing, New York.

    Brenner, S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.

    Chothia, C. (1992) Proteins. One thousand families for the molecular biologist. Nature, 357, 543–544.

    Copley, R.R., et al. (2002a) Protein domain analysis in the era of complete genomes. FEBS Lett., 513, 129–134.

    Copley, R.R., et al. (2002b) Sequence analysis of multidomain proteins: past perspectives and future directions. Adv. Protein Chem., 61, 75–98.

    Devos, D. and Valencia, A. (2001) Intrinsic errors in genome annotation. Trends Genet., 17, 429–431.

    Dongen, S.V. (1998) A New Cluster Algorithm for Graphs. Centrum voor Wiskunde en Informatica (CWI), The Netherlands.

    Eddy, S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, 361–365.

    Felsenstein, J. (2004) Phylogeny Inference Package.

    Gough, J. (2002) The SUPERFAMILY database in structural genomics. Acta Crystallogr. D Biol. Crystallogr., 58, 1897–1900.

    Hegyi, H. and Gerstein, M. (1999) The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol., 288, 147–164.

    Hegyi, H. and Gerstein, M. (2001) Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res., 11, 1632–1640.

    Henikoff, S., et al. (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science, 278, 609–614.

    Koonin, E.V., et al. (2000) The impact of comparative genomics on our understanding of evolution. Cell, 101, 573–576.

    Koonin, E.V., et al. (2002) The structure of the protein universe and genome evolution. Nature, 420, 218–223.

    Lander, E.S., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

    Letunic, I., et al. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res., 30, 242–244.

    Liu, J. and Rost, B. (2003) Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol., 7, 5–11.

    Lo Conte, L., et al. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

    Madera, M., et al. (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res., 32, D235–D239.

    Marchler-Bauer, A., et al. (2003) CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res., 31, 383–387.

    Marcotte, E.M., et al. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.

    Mulder, N.J., et al. (2002) InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief. Bioinform., 3, 225–235.

    Mulder, N.J., et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.

    Murzin, A.G., et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

    Ouzounis, C.A., et al. (2003) Classification schemes for protein structure and function. Nat. Rev. Genet., 4, 508–519.

    Ponting, C.P. (1997) Evidence for PDZ domains in bacteria, yeast, and plants. Protein Sci., 6, 464–468.

    Ponting, C.P. and Dickens, N.J. (2001) Genome cartography through domain annotation. Genome Biol., 2, Comment 2006.

    Ponting, C.P. and Russell, R.R. (2002) The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct., 31, 45–71.

    Ponting, C.P., et al. (1997) PDZ domains: targeting signalling molecules to sub-membranous sites. Bioessays, 19, 469–479.

    Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.

    Sokal, R. and Sneath, P. (1973) Numerical Taxonomy. Freeman, San Francisco.

    Tatusov, R.L., et al. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.

    Tatusov, R.L., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.

    Vogel, C., et al. (2004) Supra-domains: evolutionary units larger than single protein domains. J. Mol. Biol., 336, 809–823.

    Wolf, Y.I., et al. (2000) Estimating the number of protein folds and families from complete genome data. J. Mol. Biol., 299, 897–905.

    Wolf, Y.I., et al. (2002) Scale-free networks in biology: new insights into the fundamentals of evolution? Bioessays, 24, 105–109.



This article has been cited by other articles:


D. Wichadakul, S. Numnark, and S. Ingsriswang
d-Omix: a mixer of generic protein domain analysis tools
Nucleic Acids Res., July 1, 2009; 37(suppl_2): W417-W421.

B. Lee and D. Lee
DAhunter: a web-based server that identifies homologous proteins by comparing domain architecture
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W60-W64.

T. Rattei, P. Tischler, R. Arnold, F. Hamberger, J. Krebs, J. Krumsiek, B. Wachinger, V. Stumpflen, and W. Mewes
SIMAP structuring the network of protein similarities
Nucleic Acids Res., January 11, 2008; 36(suppl_1): D289-D292.

A. E. Vinogradov
'Genome design' model and multicellular complexity: golden middle
Nucleic Acids Res., November 6, 2006; 34(20): 5906-5914.

