Bioinformatics Advance Access originally published online on August 12, 2004
Bioinformatics 2005 21(1):2-9; doi:10.1093/bioinformatics/bth475
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.
Origins of introns based on the definition of exon modules and their conserved interfaces
The Beagle Armada Postbus 964, 4600 AZ Bergen op Zoom, The Netherlands
| ABSTRACT |
|---|
Summary: Central to the unraveling of the early evolution of the genome is the origin and role of introns. The evolution of the genome can be characterized by a continuous expansion of functional modules that occurs without the interruption of existing processes. The design-by-contract methodology of software development offers a modular approach to design that seeks to increase flexibility by focusing on the design of constant interfaces between functional modules. Here, it is shown that design-by-contract can offer a framework for genome evolution. The definition of an ancient exon module with identical splice sites leads to a relatively simple sequence of events that explains the role of introns, intron phase differences and the evolution of multi-exon proteins in an RNA world. An interaction of the experimentally defined six-nucleotide splicing consensus sequence together with a limited number of primitive ribozymes can account for a rapid creation of protein diversity.
Contact: albert.de.roos{at}thebeaglearmada.nl
| INTRODUCTION |
|---|
One of the most intriguing questions in unravelling genomic evolution is whether the intron/exon structure of eukaryotic genes reflects their ancient assembly by exon shuffling or whether the introns have been inserted into preformed genes. Several theories have been put forward to explain the role of introns and exons in evolution (reviewed in Mattick, 1994; Logsdon, 1998; Rzhetsky and Ayala, 1999; Fedorova and Fedorov, 2003). There are now two main competing theories that try to explain the role of introns, both based on the involvement of DNA-based introns and exons. The introns-early theory, or exon theory of genes, states that the introns are ancient and have subsequently been lost in prokaryotes (Gilbert, 1987; Gilbert et al., 1996, 1997). In this theory, the first exons coded for ancient protein modules from which multi-modular proteins were assembled by means of exon shuffling and recombination. Introns facilitated this process by providing the actual sites of recombination. On the other hand, the introns-late theory maintains that the spliceosomal introns were inserted into eukaryotic genes later in evolution (Palmer and Logsdon, 1991; Cavalier-Smith, 1991; Cho and Doolittle, 1997; Logsdon, 1998), after the evolution of multi-modular proteins. In introns-late, the appearance of introns could also have aided in the creation of diversity by facilitating recombination. No conclusive evidence has been found to prove or disprove introns-early or introns-late, although these theories are based on completely different genome architectures and mechanisms of evolution.
The genome has evolved from a simple RNA-based self-replicating system, the RNA world (Gilbert, 1986; Joyce, 2002), to a complex system of multi-exon genes coding for multi-modular proteins. During this evolutionary process, numerous new functions were added or modified without disrupting the functioning of older systems. The evolution from strands of RNA to multi-exon genes with sophisticated expression systems implies that the genome was able to increase in size and complexity by many orders of magnitude without losing flexibility. Any genome architecture meant to form the basis of genome evolution should therefore be flexible and robust in order to meet the requirements for virtually unlimited expansion of size and function.
Modern software designs seek to increase flexibility by using a modular approach which allows for the addition, replacement and changing operations within individual modules. Complex software architectures are based on a methodology in which a software system is viewed as a set of communicating modules whose interaction is based on precisely defined interfaces. The interfaces can be viewed as specifications of the mutual obligations or contracts. The effect of constant interfaces across modules is a reduction of the interdependences across modules or components and a reduction in the risk that changes within one module will create unanticipated changes in other modules. This methodology is also known as design-by-contract (Meyer, 1997). Since the characteristics of the design-by-contract methodology are similar to those required in genome evolution, it is hypothesized here that genome architecture reflects the paradigms of design-by-contract: definition of functional modules that interact with each other by well-defined interfaces.
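The software idea described above can be illustrated with a minimal Python sketch. It is only an illustration of the general principle, not code from Meyer (1997): the names `Module`, `Upper`, `Reverse` and `run_pipeline` are hypothetical, and the "contract" is reduced to a single promise that every module accepts a string and returns a string.

```python
from typing import Protocol


class Module(Protocol):
    """The contract: every module turns an input string into an output string."""

    def process(self, data: str) -> str: ...


class Upper:
    """Hypothetical module: upper-cases its input."""

    def process(self, data: str) -> str:
        return data.upper()


class Reverse:
    """Hypothetical module: reverses its input."""

    def process(self, data: str) -> str:
        return data[::-1]


def run_pipeline(modules: list[Module], data: str) -> str:
    """Chain modules that know nothing about each other except the interface."""
    for module in modules:
        result = module.process(data)
        # Postcondition check: the interface promises a string.
        assert isinstance(result, str), "module broke its contract"
        data = result
    return data


print(run_pipeline([Upper(), Reverse()], "abc"))  # prints CBA
```

Because the modules depend only on the shared interface, either one can be replaced or a new one added without touching the others, which is the property the hypothesis below attributes to genome architecture.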
| MODULARITY AND INTERFACES IN THE GENOME |
|---|
The basic unit of genetic information, the gene, can be regarded as a self-contained module with a well-defined interface. A gene contains all the necessary information from which the encoded protein can be generated, whereas the highly conserved genetic code functions as the interface between gene and protein. Eukaryotic genes themselves consist of stretches of coding sequence, the exons, interrupted by non-coding sequence, the introns (Fig. 1A). The introns have to be spliced out in order to form a continuous coding sequence, the mRNA, that can be recognized by the translation machinery. In principle, an intron contains all the necessary information to be spliced out, which enables it to function independently of the exon sequence. The intron can therefore be regarded as a self-contained module with a well-defined (conserved) interface, the splice recognition site (Fig. 1B), which is located exclusively in the intron. This configuration enables the excision of introns independently of exon sequence.
Exons are, in contrast to introns, dependent upon information that lies outside of the exons, since the splice recognition sites of the intron determine the span of the exon. A dependence on intron sequences would severely hamper independent movement and exchange of coding sequences between genes. However, extensive recombination of exons by exon shuffling is believed to play an important role in the creation of genetic diversity (Patthy, 2003; Sudhof et al., 1985; Kolkman and Stemmer, 2001), and many of the proteins with functionally divergent domains were established before the division of prokaryotes from eukaryotes (Ohno, 1987). In order to be inserted into random nucleotide sequences, the exon module should preferably behave like a self-contained module. The exon gains considerable independence when the conserved intron sequences that flank both ends of the exon are included as part of the exon (Fig. 1C), enabling it to function as an independent coding module, or ancient exon module.
| MOLECULAR VIEW ON THE ANCIENT EXON |
|---|
The ends of the proposed ancient exon module were studied in more detail at a molecular level using an intron–exon database (Clark, 2003, http://www.maths.uq.edu.au/~fc/datasets/) generated from GenBank release 127 (Benson et al., 2002). The last nucleotides on either side of the exon module are represented by the intron–exon boundary, and possible remnants of a consensus sequence were determined by looking at nucleotide triplets from the intron and the exon part of the intron–exon boundary. The tri-nucleotide sequences with the highest frequencies of several species are shown in Tables 1 and 2. Looking at the overall similarity between the sequences on both ends of the exon module and the conservation of these sequences between species, a bias towards the sequence CAG|GTG can be discerned both in the sequence preceding the exon and in the one following the exon. No significant differences were observed between sequences from intron–exon boundaries with different intron phases (data not shown).
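The kind of triplet tally described above can be sketched in a few lines of Python. This is a hedged illustration only and not the analysis actually performed on the database: the function name `boundary_triplets`, the `|` junction marker in the input strings and the three toy sequences are all assumptions made for the example.

```python
from collections import Counter


def boundary_triplets(boundaries):
    """Count the triplets on either side of a set of boundaries.

    Each boundary is a string with '|' marking the junction,
    e.g. 'TTCAG|GTGAA'. Returns two Counters: the last three
    nucleotides before the junction and the first three after it.
    """
    before, after = Counter(), Counter()
    for boundary in boundaries:
        upstream, downstream = boundary.upper().split("|")
        # Skip records too short to yield a full triplet on both sides.
        if len(upstream) >= 3 and len(downstream) >= 3:
            before[upstream[-3:]] += 1
            after[downstream[:3]] += 1
    return before, after


# Hypothetical toy boundaries, NOT taken from the database used in the paper.
toy = ["TTCAG|GTGAA", "ACCAG|GTAAG", "GGAAG|GTGCC"]
before, after = boundary_triplets(toy)
print(before.most_common(1))  # most frequent triplet before the junction
print(after.most_common(1))   # most frequent triplet after the junction
```

Ranking the two counters by frequency for each species would give tables of the same shape as Tables 1 and 2, although the actual values depend entirely on the real data.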
Based on the data in Tables 1 and 2, it is proposed that the conserved sequences of both ends of this ancient exon module functioned as the ancient exon recognition site with an original sequence CAGGUG (Fig. 2A). This consensus sequence could have served as a cleavage recognition site enabling the splicing out of the coding sequence, creating the substrate for the translation machinery (Fig. 2B). A cleavage in the middle of the sequence CAGGUG would result in a spliced out coding RNA sequence that is always surrounded by the remaining parts of the recognition sequence, GUG at the start and CAG at the end of the exon. Concatenation of these ancient exon modules after cleavage of the recognition sites joins the remaining parts of the recognition sequence (Fig. 2C), forming multi-exon mRNA.
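The cleavage and concatenation steps can be followed on a toy sequence. The sketch below is an illustration under stated assumptions rather than a model from the paper: the RNA string is invented, the function name `cleave` is hypothetical, and the choice of which fragments count as exon modules (every second fragment) is simply assumed for the example.

```python
SITE = "CAGGUG"  # proposed ancient recognition sequence (from the text)


def cleave(rna, site=SITE):
    """Cut rna in the middle of every occurrence of site (CAG|GUG)."""
    half = len(site) // 2
    pieces, start = [], 0
    pos = rna.find(site)
    while pos != -1:
        cut = pos + half
        pieces.append(rna[start:cut])
        start = cut
        pos = rna.find(site, pos + 1)
    pieces.append(rna[start:])
    return pieces


# Invented toy RNA: flank, site, exon 1, site, spacer, site, exon 2, site, flank.
rna = ("AAUU" + SITE + "GCUGCU" + SITE + "UUAACC"
       + SITE + "AAAGGG" + SITE + "UU")

pieces = cleave(rna)
# Assumption for this toy layout: fragments 1 and 3 are the exon modules.
exons = pieces[1::2]
mrna = "".join(exons)

print(pieces)
print(exons)  # each module starts with GUG and ends with CAG
print(mrna)   # the join between the two modules restores CAGGUG
```

In this toy case every excised module carries GUG at its start and CAG at its end, and joining two modules recreates a single CAGGUG at the junction, as described for Figure 2C.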
Support for the existence of the ancient splice site can be provided by the fact that the codon GUG still acts as a translation start site in bacteria (Gold, 1988) and can still function as one in other organisms (Mehdi et al., 1990; Peabody, 1989). Moreover, the most common start codon AUG differs by only one nucleotide from GUG, and a single mutation of the first nucleotide of the hypothetical ancient end sequence CAG is needed to convert it into the amber stop codon UAG. Other support for the role of the ancient splice site comes from the intron-less genes of prokaryotes. It has been shown that the coding sequences around the positions of intron insertion in their eukaryotic counterparts also show a consensus sequence CAG|GT, originally dubbed the proto-splice site (Dibb and Newman, 1989). If introns were lost during evolution in an RNA world with a mechanism closely related to splicing (cf. Fig. 2C), the proposed ancient splice site would also be retained.
| EXON PHASE AND FRAME-SHIFT |
|---|
The joining of two exon modules as shown in Figure 2C implies that part of the consensus splicing sites becomes part of the coding sequence (Fig. 3A), and every module would be connected by a fixed series of 6 nt, formed by the sequence CAGGUG. The two codons in this sequence (CAG and GUG) would always be translated into the amino acids glutamine (Q) and valine (V). In our design-by-contract model, the recognition sequence represents the interface for the splicing of the exons and therefore any mutation in this sequence would be deleterious, since it would result in the inactivation of the splice site (Fig. 3B) and a resulting loss of function of the encoded protein. On the other hand, mutation of the amino acid sequence would be advantageous for the evolutionary process, since it would relieve the obligatory translation of the ancient splice site into the amino acids Q and V. A phase shift enables the reading-through of the recognition sequence in another way (Fig. 3C), leading to a different amino acid sequence between exons with identical recognition consensus sequences.
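The effect of a phase shift on the fixed junction can be worked through with the standard genetic code. The sketch below is an illustrative assumption-laden example, not a reproduction of Figure 3C: the function name `junction_residues` is hypothetical, and phase is taken in the usual sense of the number of nucleotides of a codon that lie upstream of the CAG|GUG join.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered with U, C, A, G at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

SITE = "CAGGUG"  # junction formed when two exon modules are joined


def junction_residues(phase, site=SITE):
    """Codons lying entirely within the 6-nt junction, with their amino acids.

    phase is the number of nucleotides (0, 1 or 2) of a codon that
    precede the CAG|GUG join.
    """
    start = (3 - phase) % 3
    codons = [site[i:i + 3] for i in range(start, len(site) - 2, 3)]
    return [(codon, CODON_TABLE[codon]) for codon in codons]


for phase in (0, 1, 2):
    print(phase, junction_residues(phase))
```

With these definitions, phase 0 reads the junction as CAG and GUG (Q and V), phase 1 places GGU (G) across the join, and phase 2 places AGG (R) across it. The codons that only partly overlap the junction depend on the neighbouring exon sequence and are therefore not listed.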
The actual distribution of amino acids at splice junctions was investigated using an exon–intron database containing phase information (Sakharkar et al., 2000) derived from GenBank 122 (Benson et al., 2002). Figure 4 shows that the last amino acid of an exon has a phase-dependent preference for specific amino acids. In each phase, the last amino acid follows closely the ones that can be predicted from a phase shift based on a constant splice recognition sequence (Fig. 3C). Note that at the nucleotide level, the intron–exon boundary does not exhibit phase-dependent differences (data not shown). Table 3 shows that the amino acid positions that would have been affected by an ancient phase shift still show a bias towards their predicted phase. This effect is even stronger when the effect of a phase shift is viewed in both exons simultaneously, up to the point that almost 95% of the amino acid sequences Q|V around a splice site are in phase 0.
Since splicing out of introns is necessary for correct translation, intronless mRNA can be considered as a well-conserved interface to the translation machinery. The generation of intronless mRNA by a concatenation of different coding RNA modules in random intron RNA sequence (Fig. 2) would not change this interface and could take place without affecting translation. Also, the separate development of functional protein modules, followed by an assembly of these modules, would be inherently less complex and more flexible (Gilbert, 1987; Patthy, 2003). Phase shift could be viewed as an outcome of a genomic evolution model based on the design-by-contract methodology, since phase shifts could provide a means for creating more protein diversity without affecting the established splicing interface. The development of a splicing machinery that would confine the splicing recognition sequence exclusively to the intron (as is presently the case, cf. Fig. 1B) would ultimately enable the completely independent evolution of the coding ends of the exon.
The degree of conservation at the boundaries of exons flanking introns has been shown earlier and has been interpreted as a derived result of evolution for efficient splicing (Long et al., 1997), the preferred insertion site for introns (Dibb and Newman, 1989) or as functional splice sites that existed in the coding sequence of genes prior to the insertion of introns (Sadusky et al., 2004). Intron phase has been shown to be correlated to the codon position (Long et al., 1995, Tomita et al., 1996) and hypothesized to be related to exon shuffling between exons in the same phase (Long et al., 1995).
| FUNDAMENTAL STEPS IN EVOLUTION BASED ON A SINGLE TEMPLATE |
|---|
It is proposed here that the sequence CAGGUG acted as the ancient cleavage recognition site for a ribozyme. Ribozymes can interact with their targets by a complementary RNA sequence, primarily based on Watson–Crick base pairing (Guerrier-Takada et al., 1989; Cech, 1987). Based on the sequence of the ancient splice site, an antiparallel arrangement of this sequence could interact with itself (Fig. 4A), making a single recognition sequence act as both the target site and the target recognition sequence. At a molecular level, this interaction could be stabilized by four Watson–Crick base pairs while leaving two G-pairs unpaired.
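The pairing count stated above can be checked directly. The following minimal sketch, with the hypothetical helper `antiparallel_pairs`, aligns CAGGUG against a second copy of itself running in the opposite direction and classifies each position; only the four canonical Watson–Crick pairs are treated as paired, which is an assumption of the example.

```python
WC_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def antiparallel_pairs(strand_a, strand_b):
    """Pair strand_a (5'->3') with strand_b read 3'->5', position by position.

    Returns a list of (base_a, base_b, is_watson_crick) tuples.
    """
    return [(x, y, (x, y) in WC_PAIRS)
            for x, y in zip(strand_a, reversed(strand_b))]


SITE = "CAGGUG"
pairs = antiparallel_pairs(SITE, SITE)
for base_a, base_b, ok in pairs:
    print(base_a, base_b, "paired" if ok else "unpaired")

paired = sum(ok for _, _, ok in pairs)
unpaired = [(x, y) for x, y, ok in pairs if not ok]
print(paired, unpaired)
```

The alignment gives C-G, A-U, U-A and G-C pairs at the outer positions and two opposing G residues in the centre, in agreement with the four base pairs and two unpaired G-pairs described in the text.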
The splicing out of the RNA sequences between the exon modules, equivalent to intron splicing, is an important step in genome evolution. Figure 5B shows how the antiparallel arrangement of two adjacent exon modules could facilitate splicing. In addition to an intra-strand cleavage between G residues, a religation of the G's to the opposing strand would concatenate the two exons, a process that could be facilitated by the close physical proximity of the G's involved.
Another important step in the evolution of proteins is the exchange of coding sequences between different genes resulting in the recombination of genes. A mechanism identical to intron splicing as shown in Figure 5B but followed by an in trans religation would lead to the exchange of RNA strands (Fig. 5C) between RNA molecules. In this way, ancient ribozymes could have played an active role in the generation of the diversity of proteins.
Thus, based on a six-nucleotide proto-splice site and relatively simple ribozymes that could cleave and religate this sequence, three important events in the exon-centric evolution of multi-domain proteins can be explained: (i) the splicing out of the exon modules, yielding short exonic mRNA, (ii) the splicing out of RNA sequences between exons, thereby concatenating exon modules to multi-exon mRNA and (iii) the active recombination of exons. The classes of ribozymes that could catalyse the cleavage and ligation reactions proposed in Figure 5 have been shown to occur naturally (Symons, 1992; Guerrier-Takada et al., 1983). Ribozyme RNAse T1 cleaves a double-stranded complementary RNA sequence at unpaired G residues, and apart from several naturally occurring RNA ligases (Yoshida, 2001; Hager et al., 1996), it has also been shown that complex ligases can evolve from group I ribozyme domains (Jaeger et al., 1999) and from small random RNA sequences (Ekland et al., 1995).
The proto-splice site can act as a starting point for the evolution of multifunction proteins when the consensus sequence of the proposed proto-splice site arises randomly in strands of RNA. Two splice sites in close proximity could then lead to the first functional single-exon genes. The transformation of the coding parts of the proto-splice site sequences into start (GUG to AUG) and stop codons (CAG to UAG) and vice versa back to a functional proto-splice site could facilitate a stepwise concatenation of exons (cf. Fig. 2).
The introns that arose early in evolution as a consequence of a concatenation of exons (Fig. 2) could be lost further in evolution, but their presence at conserved positions would still reflect their ancient origins. The evolution of transposons from introns, both able to function as relatively independent functional units, may account for many of the observations attributed to the introns-late theory (Cho and Doolittle, 1997, Logsdon, 1998).
| EVOLUTION ON A DESIGN-BY-CONTRACT THEORY |
|---|
The application of the design-by-contract methodology, by viewing the exon as a module that interacts with its environment through its interface, led to a series of logical steps explaining the intron–exon structure of genes and intron phase differences. It suggests that evolution behaved according to a design pattern that separates functional modules from each other by well-defined interfaces. The dependence of vital functions on interfaces prevents changes in the interfaces and forces evolution into an architecture that reflects design-by-contract rules. It also proposes that the major events leading to a diversification of proteins were situated in an RNA world. The next fundamental step in genome evolution, the transition from the RNA world to the RNA/DNA world, can also be explained in line with design-by-contract. In order to keep all the interfaces that were created in the RNA world intact, the entire RNA genome could have been copied verbatim into DNA.
Received on March 4, 2004; revised on July 22, 2004; accepted on August 5, 2004
| REFERENCES |
|---|
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., Wheeler, D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.
Cavalier-Smith, T. (1991) Intron phylogeny: a new hypothesis. Trends Genet., 7, 145–148.
Cech, T.R. (1987) The chemistry of self-splicing RNA and RNA enzymes. Science, 236, 1532–1539.
Cho, G. and Doolittle, R.F. (1997) Intron distribution in ancient paralogs supports random insertion and not random loss. J. Mol. Evol., 44, 573–584.
Clark, F. (2003) Gene data sets derived from GenBank.
Dibb, N.J. and Newman, A.J. (1989) Evidence that introns arose at proto-splice sites. EMBO J., 8, 2015–2021.
Ekland, E.H., Szostak, J.W., Bartel, D.P. (1995) Structurally complex and highly active RNA ligases derived from random RNA sequences. Science, 269, 364–370.
Fedorova, L. and Fedorov, A. (2003) Introns in gene evolution. Genetica, 118, 123–131.
Gilbert, W. (1986) The RNA World. Nature, 319, 618.
Gilbert, W. (1987) The exon theory of genes. Cold Spring Harb. Symp. Quant. Biol., 52, 901–905.
Gilbert, W., Marchionni, M., McKnight, G. (1986) On the antiquity of introns. Cell, 46, 151–153.
Gilbert, W., de Souza, S.J., Long, M. (1997) Origin of genes. Proc. Natl Acad. Sci., USA, 94, 7698–7703.
Gold, L. (1988) Posttranscriptional regulatory mechanisms in Escherichia coli. Annu. Rev. Biochem., 57, 199–233.
Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N., Altman, S. (1983) The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell, 35, 849–857.
Guerrier-Takada, C., Lumelsky, N., Altman, S. (1989) Specific interactions in RNA enzyme–substrate complexes. Science, 246, 1578–1584.
Hager, A.J., Pollard, J.D., Szostak, J.W. (1996) Ribozymes: aiming at RNA replication and protein synthesis. Chem. Biol., 3, 717–725.
Jaeger, L., Wright, M.C., Joyce, G.F. (1999) A complex ligase ribozyme evolved in vitro from a group I ribozyme domain. Proc. Natl Acad. Sci., USA, 96, 4712–4717.
Joyce, G.F. (2002) The antiquity of RNA-based evolution. Nature, 418, 214–221.
Kolkman, J.A. and Stemmer, W.P. (2001) Directed evolution of proteins by exon shuffling. Nat. Biotechnol., 19, 423–428.
Logsdon, J.M. (1998) The recent origins of spliceosomal introns revisited. Curr. Opin. Genet. Dev., 8, 637–648.
Long, M., Rosenberg, C., Gilbert, W. (1995) Intron phase correlations and the evolution of the intron/exon structure of genes. Proc. Natl Acad. Sci., USA, 92, 12495–12499.
Long, M., de Souza, S.J., Gilbert, W. (1997) The yeast splice site revisited: new exon consensus from genomic analysis. Cell, 12, 739–740.
Mattick, J.S. (1994) Introns: evolution and function. Curr. Opin. Genet. Dev., 4, 823–831.
Mehdi, H., Ono, E., Gupta, K.C. (1990) Initiation of translation at CUG, GUG, and ACG codons in mammalian cells. Gene, 91, 173–178.
Meyer, B. (1997) Object-Oriented Software Construction, 2nd edn. Prentice-Hall, NY.
Ohno, S. (1987) Early genes that were oligomeric repeats generated a number of divergent domains on their own. Proc. Natl Acad. Sci., USA, 84, 6486–6490.
Palmer, J.D. and Logsdon, J.M. (1991) The recent origins of introns. Curr. Opin. Genet. Dev., 1, 470–477.
Patthy, L. (2003) Modular assembly of genes and the evolution of new functions. Genetica, 118, 217–231.
Peabody, D.S. (1989) Translation initiation at non-AUG triplets in mammalian cells. J. Biol. Chem., 264, 5031–5035.
Rzhetsky, A. and Ayala, F.J. (1999) The enigma of intron origins. Cell. Mol. Life Sci., 55, 3–6.
Sakharkar, M., Long, M., Tan, T.W., de Souza, S.J. (2000) ExInt: an Exon/Intron database. Nucleic Acids Res., 28, 191–192.
Sadusky, T., Newman, A.J., Dibb, N.J. (2004) Exon junction sequences as cryptic splice sites: implications for intron origin. Curr. Biol., 14, 505–509.
Sudhof, T.C., Goldstein, J.L., Brown, M.S., Russell, D.W. (1985) The LDL receptor gene: a mosaic of exons shared with different proteins. Science, 228, 815–822.
Symons, R.H. (1992) Small catalytic RNAs. Annu. Rev. Biochem., 61, 641–671.
Tomita, M., Shimizu, N., Brutlag, D.L. (1996) Introns and reading frames: correlation between splicing sites and their codon positions. Mol. Biol. Evol., 13, 1219–1223.
Yoshida, H. (2001) The ribonuclease T1 family. Methods Enzymol., 341, 28–41.