Bioinformatics Advance Access originally published online on May 19, 2005
Bioinformatics 2005 21(15):3213-3216; doi:10.1093/bioinformatics/bti509
Exon–domain correlation and its corollaries
GPC Biotech AG Fraunhoferstrasse 20, 82152 Martinsried, Germany
*To whom correspondence should be addressed.
| Abstract |
|---|
Contact: andrei.grigoriev{at}gpc-biotech.com
Functional proteins are known to contain stretches of amino acid sequences highly conserved in different protein families and across species. These conserved sequences constitute protein domains that are generally integral structural units, conferring specific functionalities and often self-folding. The observation of highly conserved protein domains dispersed across different families and organisms provoked the question whether the domains were reused in some way during evolution.
A compelling mechanism for reusing the domains was put forth by the exon shuffling theory (Gilbert, 1978). In many eukaryotic species, the coding sequences of genes, or exons, are frequently interrupted by non-coding sequences, the introns. When the exon–intron split structure of genes correlates with the organization of protein domains, i.e. the exons match the domains, then duplication, permutation and rearrangement of such exons would create novel genes with reused functional properties. Shuffling of exons could be accomplished through various biological processes such as illegitimate recombination or retrotransposition (van Rijk and Bloemendal, 2003). Studies on individual proteins, groups of ancient proteins and genome-wide surveys of alpha helices and beta strands indicated a correlation between gene structure and protein structure (Barik, 2004; de Souza et al., 1996; Holmes and Parham, 1985). Intron phase match studies suggested that domains tend to be bounded by symmetrically phased exons, another hallmark of the exon shuffling process (de Souza et al., 1998; Holmes and Parham, 1985; Kaessmann et al., 2002; Patthy, 1996).
A one domain–one exon match can often be obscured by either the loss of bordering introns or the insertion of other introns into the domain-encoding sequence. To compensate for these effects, one can analyze the match only between the borders of domains and exons, thus taking into account cases such as one domain–many exons and one exon–many domains. High-quality domain border annotations can be obtained by analyzing protein sequences with the domain definitions from the Pfam database, which is built from conserved protein sequences in a wide spectrum of species. Using this approach, we have recently demonstrated that correlation between the borders of protein domains and their encoding exons is a genome-wide phenomenon in multiple eukaryotic organisms (Liu and Grigoriev, 2004). Further, we have shown that exon-bordering domains probably contributed more to the expansion and diversification of proteomes than other domains as a result of duplications and exon shuffling, as they preferentially expanded into more genes than other domains during evolution (Liu et al., 2005). To highlight their remarkable genomic mobility, we use the terms mobile domains and exon-bordering domains interchangeably in the text below. In this study, we consider two main corollaries of this exon–domain correlation: (1) the impact of mobile domains on the domain network and (2) possible refinement of definitions of individual protein domains.
| EXON–DOMAIN CORRELATION AND DOMAIN NETWORKS |
|---|
Global properties of various networks, such as the World Wide Web or biological networks (of interacting and co-occurring protein domains, as well as metabolic networks and those of protein–protein interactions, transcriptional regulation, etc.), have received significant attention in recent years (Apic et al., 2001; Jeong et al., 2000; Luscombe et al., 2002; Ye and Godzik, 2004). Graph representation of such networks abstracts them as nodes connected by edges (e.g. proteins connected by their interactions); the number of edges of a given node is called the node's degree; nodes with many edges are often called hubs. Network properties are mainly analyzed from the perspective of their node connectivity (or degree distribution). Many of the networks have been shown to possess the scale-free property, which means that their degree distribution follows a power law
$$N(d) \sim d^{-c} \tag{1}$$
where N(d) is the number of nodes with degree d and c is a positive constant.
Earlier, we showed (Liu et al., 2005) that, as a result of preferential amplification, exon-bordering domains became on average more abundant and present in more genes than other protein domains. Exon-bordering domains also co-occur with a larger number of different domains to form mosaic proteins with diverse domain architectures. This property suggests that exon-bordering domains should be found among the highly connected hubs and that the evolution of domain networks (at least in terms of degree distribution) is likely to be largely driven by the evolution of exon-bordering domains and their propagation into genes via exon shuffling and duplication mechanisms.
Indeed, many properties of the network of co-occurring protein domains, where each human domain is a node and an edge represents co-occurrence of two domains (not necessarily adjacent) in one protein, are similar to those of other biological networks described. As in many of these networks, there is one large component, containing >42% of all nodes and >93% of all edges (Fig. 1A), 172 much smaller components with 311 total edges and many singletons (44% of all nodes). We found that this undirected network is also scale-free [data not shown; this result is analogous to already published reports (Apic et al., 2001; Ye and Godzik, 2004)]. In addition to this property, we also observed that the distribution of the number of different pairs of domains contained in human proteins also follows a power law [Equation (1)], in this case N(d) being the number of domain pairs connected by d proteins and c = 2.08 (Fig. 1B). Thus, most of the domain pairs can be found only in one protein per pair. For example, out of 4677 detected domain pairs, only 200 pairs occur in more than 10 human proteins. Such proteins, however, are often domain-rich.
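The pair distribution described above can be illustrated with a minimal Python sketch. It assumes a hypothetical mapping `protein_domains` from protein identifiers to lists of domain names, and it estimates the exponent c of Equation (1) by ordinary least squares on log-transformed values, which is one common way to obtain a power-law trendline and not necessarily the exact procedure used in this study.

```python
import math
from collections import Counter
from itertools import combinations


def pair_protein_counts(protein_domains):
    """For each unordered domain pair, count the proteins containing both domains."""
    counts = Counter()
    for domains in protein_domains.values():
        # Co-occurrence ignores order, adjacency and repeated copies of a domain.
        for a, b in combinations(sorted(set(domains)), 2):
            counts[(a, b)] += 1
    return counts


def fit_power_law(histogram):
    """Fit log N(d) = log k - c * log d by least squares; return (c, r_squared).

    `histogram` maps d to N(d). Entries with d <= 0 or N(d) <= 0 are skipped.
    """
    points = [(math.log(d), math.log(n)) for d, n in histogram.items() if d > 0 and n > 0]
    if len({x for x, _ in points}) < 2:
        raise ValueError("need at least two distinct values of d")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return -slope, r_squared


# Toy input (hypothetical proteins and domains, not data from this study).
toy = {"P1": ["A", "B", "C"], "P2": ["A", "B"], "P3": ["A"]}
pair_counts = pair_protein_counts(toy)
# N(d): number of domain pairs that co-occur in exactly d proteins.
n_of_d = Counter(pair_counts.values())
c, r2 = fit_power_law(n_of_d)
```

With real data, `n_of_d` would be the histogram plotted in Fig. 1B and `c` the fitted exponent.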
We also calculated the expected distribution of co-occurring pairs by modeling domain co-occurrence as a Bernoulli process (where a pair frequency would be proportional to the product of frequencies of individual domains, derived in this case from the number of proteins containing a domain, rather than domain numbers). The best-fit power law trendline for the expected distribution was a much less steep curve (c = 0.88, R^2 = 0.74), indicating that a large proportion of pairwise domain combinations are underrepresented in human proteins. Similar findings obtained by other methods have been very recently published for the domain families in SCOP (Vogel et al., 2005), which annotates domains based on structural data, in contrast to the sequence-based Pfam that we used.
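As a rough illustration of the independence model described above, the following Python sketch computes an expected number of proteins per domain pair. It assumes that the proportionality constant is 1/N, with N the total number of proteins, so that the expectation for a pair is n_A × n_B / N; the text states only proportionality, so this normalization and the name `protein_domains` are assumptions.

```python
from itertools import combinations


def expected_pair_counts(protein_domains):
    """Expected number of proteins containing both domains of each pair,
    if domains were assigned to proteins independently of one another."""
    total = len(protein_domains)
    containing = {}  # domain -> number of proteins containing it at least once
    for domains in protein_domains.values():
        for domain in set(domains):
            containing[domain] = containing.get(domain, 0) + 1
    return {
        (a, b): containing[a] * containing[b] / total
        for a, b in combinations(sorted(containing), 2)
    }


# Toy input (hypothetical): four proteins, three domains.
toy = {"P1": ["A", "B"], "P2": ["A"], "P3": ["B"], "P4": ["C"]}
expected = expected_pair_counts(toy)
```

Comparing these expectations with the observed pair counts shows which combinations are over- or underrepresented.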
As a group, exon-bordering domains show a much higher connectivity (Fig. 1) and, as expected, they comprise most of the hubs in the network. We analyzed the level of network fragmentation after the removal of mobile domains by calculating the number of components, or remaining connected subgraphs, and average degree. We also estimated the distributions of these parameters for 1000 networks obtained from the network we studied by removal of the corresponding number of random nodes. Removal of mobile domains results in substantial fragmentation of the network and a drop in the average degree, significantly different from random node removals (Fig. 1C). Thus, mobile domains appear to be the major determinants of the network topology and evolution.
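The removal experiment can be sketched in Python as follows. The graph is assumed to be stored as a dictionary of neighbor sets, isolated nodes are counted as components of their own, and the seed and function names are illustrative choices; none of these details is specified in the text.

```python
import random


def remove_nodes(adj, removed):
    """Return a copy of the undirected graph without the given nodes."""
    removed = set(removed)
    return {
        node: {nbr for nbr in nbrs if nbr not in removed}
        for node, nbrs in adj.items()
        if node not in removed
    }


def count_components(adj):
    """Number of connected subgraphs (isolated nodes are counted too)."""
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return components


def average_degree(adj):
    """Mean number of edges per remaining node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj) if adj else 0.0


def random_removal_baseline(adj, k, trials=1000, seed=0):
    """Components and average degree after removing k random nodes, repeated."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    results = []
    for _ in range(trials):
        pruned = remove_nodes(adj, rng.sample(nodes, k))
        results.append((count_components(pruned), average_degree(pruned)))
    return results


# Toy star graph (hypothetical): one hub domain co-occurring with four others.
star = {"hub": {"a", "b", "c", "d"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
without_hub = remove_nodes(star, ["hub"])
baseline = random_removal_baseline(star, k=1, trials=50)
```

Placing the values obtained after removing the mobile domains within the distribution in `baseline` gives an empirical measure of how unusual the observed fragmentation is.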
| EXON–DOMAIN CORRELATION COULD HELP RESOLVE CONFLICTS BETWEEN DOMAIN DEFINITIONS FROM DIFFERENT DATABASES |
|---|
In our method, exon–domain correlation was analyzed as follows. For computational prediction of protein domains in human proteins retrieved from the Ensembl (Birney et al., 2004) genome database, we used HmmPfam (Eddy, 1998) with domain definitions from Pfam (Bateman et al., 2002). For comparison, we also used CD-search (Marchler-Bauer and Bryant, 2004) with domain definitions from SMART (Letunic et al., 2004), a database that uses a mixture of analyses to create protein alignments and domain definitions for a relatively small number of signaling and extracellular domains. We collected statistics for only one multi-exon transcript per gene whose protein translation had at least one domain. Domain prediction may not be exact, so for each domain border, frequencies of exon boundaries were calculated for a window of 10 amino acids immediately outside the domain and 10 amino acids inside it (denoted as the [-10;10] window). If a window contained one or more amino acid positions with frequencies significantly different from random expectation (threshold of P < 10^-7), the domain was deemed exon bordering.
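The windowed test can be sketched in Python under several stated assumptions. Offsets are taken as negative outside the domain and positive inside it, with no position 0. The P-value is computed here with an exact two-sided binomial test against a hypothetical background rate `expected_rate`; this is a stand-in, since the statistic actually used is described in the earlier publications and is not reproduced in this text.

```python
import math
from collections import Counter

HALF_WIDTH = 10  # 10 residues outside and 10 inside each domain border


def border_offset(position, border, is_start):
    """Signed distance of a residue from a domain border.

    Assumed convention: -1 is the first residue outside the domain,
    +1 the first residue inside it; 0 is never returned.
    """
    if is_start:
        return position - border + 1 if position >= border else position - border
    return border - position + 1 if position <= border else border - position


def tally_offsets(instances):
    """instances: iterable of (domain_start, domain_end, exon_boundary_positions)."""
    counts = Counter()
    n_windows = 0
    for start, end, boundaries in instances:
        for border, is_start in ((start, True), (end, False)):
            n_windows += 1
            for pos in boundaries:
                offset = border_offset(pos, border, is_start)
                if -HALF_WIDTH <= offset <= HALF_WIDTH:
                    counts[offset] += 1
    return counts, n_windows


def binom_pmf(k, n, p):
    """Binomial probability computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def two_sided_p(k, n, p):
    """Twice the smaller binomial tail, capped at 1 (stand-in test)."""
    upper = sum(binom_pmf(i, n, p) for i in range(k, n + 1))
    lower = sum(binom_pmf(i, n, p) for i in range(0, k + 1))
    return min(1.0, 2.0 * min(upper, lower))


def has_significant_position(counts, n_windows, expected_rate, threshold=1e-7):
    """True if any window position deviates from expectation below the threshold."""
    offsets = [o for o in range(-HALF_WIDTH, HALF_WIDTH + 1) if o != 0]
    return any(two_sided_p(counts.get(o, 0), n_windows, expected_rate) < threshold
               for o in offsets)


# Toy input (hypothetical): one domain at residues 100-200, exon boundaries at 99, 150, 201.
counts, n_windows = tally_offsets([(100, 200, [99, 150, 201])])
```

Because the test is two-sided, a position with far fewer exon borders than expected is flagged as well, which corresponds to the negative correlation discussed for the Pfam Ig domain.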
Remarkably, when we detected correlation of the borders of protein domains with encoding exons, it was nearly always positive, i.e. we observed significantly higher numbers of exon borders in the domain border boxes than expected. However, there was one notable exception: the immunoglobulin domain displayed a negative correlation with exons, with the number of exon borders contained in its border boxes being much smaller than expected. This was rather surprising since the immunoglobulin domain was considered to be mobile and its bounding introns to have phase 1–1, which is characteristic of mobile domains (Kolkman and Stemmer, 2001). Upon further investigation, we noticed that the Pfam definition of the Ig domain was actually 8–20 amino acids shorter than its counterpart domain definition from the SMART database. For this reason, the amino acid positions immediately outside the Ig domain border boxes as defined by Pfam were actually right inside the domain border boxes as defined by SMART. This indicates a preference for the SMART domain definition because we consistently observed lower numbers of exon borders inside Pfam-specific Ig domain border boxes (Fig. 2A) than expected.
When we switched to using the SMART domain definition for Ig domains, we discovered that the two most prevalent Ig-related domains in SMART, IGc1 and IG, were ranked #2 and #7, respectively, out of all human mobile domains, with both having positive correlation with exons in contrast to the results obtained from Pfam's Ig domain definition. This contrast is even more obvious on the correlation graph for these domains (Fig. 2A). SMART's IG definition produced perfect correlation with exons with the peak correlation position at -1, the first amino acid outside the domain. IGc1 also has a strong positive correlation peak outside the domain, while Pfam's Ig domain showed a negative correlation with exon borders at every position in the domain border box, both inside and outside of the domain borders.
In addition, if we separate statistics collected from the domain border boxes at the start and at the end of domains, we can produce a correlation graph that gives us information on where the exon borders preferentially fall at the start and end of domains (data not shown). Interestingly, the SMART IG domain has a few peaks from positions -10 to -6 amino acids outside the starting position of the IG domain, and has a major peak at the first amino acid outside the ending position of the domain. This suggests that the exons correlate with the actual IG domain quite well and that additional analysis of residue conservation between positions -10 and -1 preceding the IG domain might further improve the domain annotation.
Another interesting example is the FA58C (F5/8 type C) domain present in blood coagulation factors that is thought to be involved in cell adhesion. Our study identified FA58C as a mobile domain that correlates with multiple exons and displays a very strong preference for phase 1–1 introns. The majority of the exon borders inside the two domain border boxes for FA58C fall onto positions 3 to 8, somewhat distant from the domain borders. We investigated the properties of this domain and found that the Pfam prediction of FA58C could be improved by taking into account the exon border positions. In the illustrated example (Fig. 2B), the DDR2 protein contains a Pfam-annotated FA58C domain at the N-terminus (amino acids 33–182). However, the highly conserved cysteine residues at both ends of the domain that form a disulfide bond were actually at positions 30 and 185, both excluded by the HmmPfam prediction yet right inside the exon borders at both ends of the FA58C domain. We also found that the domain annotation by the pfscan program using the Prosite profile is closer to the exon borders and includes the two cysteine residues.
From these examples, it is apparent that, at least for identified mobile domains, exon borders could in some cases serve as indicators of domain coordinates to improve (or choose between) predictions of computational tools. In fact, the presence of exon borders in the vicinity of domain borders could potentially be used by the prediction tools themselves.
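One simple way to choose between competing predictions using exon borders is sketched below in Python. The scoring rule (summed distance from each predicted border to the nearest exon border) is an illustrative assumption, not a method from this study. In the toy input, only the Pfam coordinates 33–182 come from the text; the exon border positions and the alternative prediction are hypothetical placeholders.

```python
def nearest_distance(position, exon_borders):
    """Distance in residues from a position to the closest exon border."""
    return min(abs(position - border) for border in exon_borders)


def border_mismatch(prediction, exon_borders):
    """Summed distance of the predicted start and end to their nearest exon borders."""
    start, end = prediction
    return nearest_distance(start, exon_borders) + nearest_distance(end, exon_borders)


def closest_to_exons(predictions, exon_borders):
    """Name of the prediction whose borders lie closest to exon borders."""
    return min(predictions, key=lambda name: border_mismatch(predictions[name], exon_borders))


# Pfam coordinates are from the DDR2 example; everything else is hypothetical.
predictions = {
    "pfam": (33, 182),
    "alternative": (30, 185),  # hypothetical prediction that includes both cysteines
}
exon_borders = [29, 186]  # hypothetical exon border positions in protein coordinates
best = closest_to_exons(predictions, exon_borders)
```

A score of this kind would only make sense for domains already identified as exon bordering, in line with the restriction to mobile domains stated above.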
| Acknowledgments |
|---|
We would like to thank Jonathon Blake for helping with the data collection and early discussions and two anonymous reviewers for helpful suggestions.
Conflict of Interest: none declared.
Received on March 11, 2005; revised on May 16, 2005; accepted on May 18, 2005
| REFERENCES |
|---|
Apic, G., et al. (2001) An insight into domain combinations. Bioinformatics, 17, Suppl. 1, S83–S89.
Barik, S. (2004) When proteome meets genome: the alpha helix and the beta strand of proteins are eschewed by mRNA splice junctions and may define the minimal indivisible modules of protein architecture. J. Biosci., 29, 261–273.
Bateman, A., et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
Birney, E., et al. (2004) An overview of Ensembl. Genome Res., 14, 925–928.
de Souza, S.J., et al. (1996) Introns and gene evolution. Genes Cells, 1, 493–505.
de Souza, S.J., et al. (1998) Toward a resolution of the introns early/late debate: only phase zero introns are correlated with the structure of ancient proteins. Proc. Natl Acad. Sci. USA, 95, 5094–5099.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
Gilbert, W. (1978) Why genes in pieces? Nature, 271, 501.
Grigoriev, A. (2004) Understanding the yeast proteome: a bioinformatics perspective. Expert Rev. Proteom., 1, 133–145.
Holmes, N. and Parham, P. (1985) Exon shuffling in vivo can generate novel HLA class I molecules. EMBO J., 4, 2849–2854.
Jeong, H., et al. (2000) The large-scale organization of metabolic networks. Nature, 407, 651–654.
Kaessmann, H., et al. (2002) Signatures of domain shuffling in the human genome. Genome Res., 12, 1642–1650.
Kolkman, J.A. and Stemmer, W.P. (2001) Directed evolution of proteins by exon shuffling. Nat. Biotechnol., 19, 423–428.
Letunic, I., et al. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142–D144.
Liu, M. and Grigoriev, A. (2004) Protein domains correlate strongly with exons in multiple eukaryotic genomes – evidence of exon shuffling? Trends Genet., 20, 399–403.
Liu, M., et al. (2005) Significant expansion of exon-bordering protein domains during animal proteome evolution. Nucleic Acids Res., 33, 95–105.
Luscombe, N.M., et al. (2002) The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol., 3, RESEARCH0040.
Marchler-Bauer, A. and Bryant, S.H. (2004) CD-Search: protein domain annotations on the fly. Nucleic Acids Res., 32, W327–W331.
Patthy, L. (1996) Exon shuffling and other ways of module exchange. Matrix Biol., 15, 301–310; discussion 311–312.
van Rijk, A. and Bloemendal, H. (2003) Molecular mechanisms of exon shuffling: illegitimate recombination. Genetica, 118, 245–249.
Vogel, C., et al. (2005) The relationship between domain duplication and recombination. J. Mol. Biol., 346, 355–365.
Ye, Y. and Godzik, A. (2004) Comparative analysis of protein domain organization. Genome Res., 14, 343–353.



2 P-value calculated from expected and observed numbers of exon boundaries for each amino acid position as described previously (