Bioinformatics Advance Access originally published online on February 24, 2006
Bioinformatics 2006 22(10):1211-1216; doi:10.1093/bioinformatics/btl067
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Automated classification of alternative splicing and transcriptional initiation and construction of visual database of classified patterns

Hideki Nagasaki 1,*, Masanori Arita 1,2, Tatsuya Nishizawa 3, Makiko Suwa 1 and Osamu Gotoh 1,4

1 Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology Tokyo 135-0064, Japan
2 Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo Kashiwa 277-8561, Japan
3 Information and Mathematical Science Laboratory Inc. Tokyo 171-0014, Japan
4 Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University Kyoto 606-8501, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 ALGORITHM
 SYSTEMS AND METHODS
 RESULTS
 DISCUSSION
 CONCLUSION
 REFERENCES
 

Motivation: Large-scale detection and classification of alternative splicing and transcriptional initiation (ASTI) is the first step towards detailed studies of the functional implication and mechanisms of these phenomena.

Results: We have developed an algorithm that classifies all observed units of ASTI into an extendable set of distinct types (e.g. cassette type) by converting a collection of alignments between a genomic DNA sequence and cDNA sequences into binary description. This description system can uniquely and compactly encode not only typical patterns but also any rare patterns that are usually collectively assigned to ‘others.’ More than 150 distinct ASTI types were found when this system was applied to genome-wide detection of ASTI units in human and five other eukaryotes.

Availability: The data detected by this system are available through ASTRA (http://alterna.cbrc.jp/), a database equipped with a Java-based browser that can interactively reorganize the order of displayed splicing patterns on demand.

Contact: h-nagasaki{at}aist.go.jp


    INTRODUCTION
Alternative splicing (AS) and alternative transcriptional initiation (ATI, also called alternative promoter or alternative 5' end) have increasingly attracted researchers' attention since the discovery that the number of genes in a genome is not linearly correlated with the structural, behavioral or developmental complexity of an organism. Alternative splicing and transcriptional initiation (ASTI) is widely observed in eukaryotic cells and the abundance of each product in a cell is exquisitely or loosely controlled according to tissue type, developmental stage and other conditions (Johnson et al., 2003; Xu et al., 2002). Black (2003) recently reviewed a number of examples of AS in relation to their regulatory mechanisms.

ASTI is realized by alternative uses of transcriptional initiation, donor and/or acceptor splice sites that are detected by comparing mRNA sequences derived from the same gene locus with each other or by comparing mRNA sequences with the genomic sequence. Compared with the former procedure that sometimes fails to resolve alternative patterns unequivocally, the latter provides full resolution of ASTI patterns and is the preferred procedure for use. Depending on the combination of alternatively used exon boundaries, various ASTI patterns can be generated. It has been customary to identify only such typical patterns as alternative donor or acceptor splice sites, cassette (exon skipping or cryptic exon), mutually exclusive exons, terminal exons with alternate polyadenylation sites and retained intron (Breitbart et al., 1987), whereas other patterns are collectively classified as ‘miscellaneous’ or ‘others.’ Although some complicated atypical patterns may be decomposed into several simpler elementary units (Thanaraj et al., 2004), the most basic concept of an ‘ASTI unit’ is, to this day, not well established.

We propose herein the definition of an ASTI unit as a minimal span of variably expressed genomic region flanked by common exonic or extragenic region(s), and we show a simple and efficient algorithm that enables compact and extendable representation of the delineated and classified patterns (types) of ASTI units in the form of a vector with two integer components. Using this encoding system, not only typical patterns but also any rare or novel patterns are uniquely classified. To detect putative ASTI units in human transcripts, we mapped human mRNAs (cDNA sequences) onto human genomic sequences. Applying our algorithm, we found 11 498 instances of AS units that were classified into as many as 124 distinct types, and 4627 instances of ATI units that were classified into 52 types. These data and the data from five other eukaryotes are stored in the database named Alternative Splicing and Transcription Archives (ASTRA). Each ASTRA record contains ASTI patterns, sequence data and annotation, and can be retrieved through a Java-based graphical user interface via the Internet.


    ALGORITHM
Notation of delineated patterns of ASTI
ASTI variants (isoforms) are mature mRNA forms that retain partially different portions of the same template gene after transcription and processing. For simplicity, we exclude partial, immature or degraded products from our consideration. In addition, we concentrate most of our attention on pairwise comparison, although complicated combinatory patterns may appear when many isoforms from a single gene are compared simultaneously.

Consider some isoforms that are aligned onto the genomic sequence. From the alignment, exon–intron structures of respective isoforms are immediately derived. Our algorithm starts with labeling each nucleotide in each variant either 1 or 0 depending on whether that nucleotide lies in an exon or a non-exon (intron or extragenic region), respectively. This procedure produces two-dimensional bit arrays in which each column corresponds to a position in the genomic sequence, each row corresponds to a distinct mRNA and the label indicates the exonic status. The bit pattern is then compressed so that adjacent columns with the same values are combined (Fig. 1; I, I' and I'').
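The labeling and compression steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `label_bases` and `compress_columns`, the half-open `(start, end)` exon coordinates and the toy isoforms are hypothetical and do not come from the authors' implementation.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open [start, end) genomic interval (assumed convention)


def label_bases(exons: List[Exon], length: int) -> List[int]:
    """Label every genomic position 1 (exon) or 0 (intron or extragenic)."""
    bits = [0] * length
    for start, end in exons:
        for pos in range(start, end):
            bits[pos] = 1
    return bits


def compress_columns(rows: List[List[int]]) -> List[Tuple[int, ...]]:
    """Combine runs of adjacent columns that carry the same values."""
    compressed: List[Tuple[int, ...]] = []
    for column in zip(*rows):
        if not compressed or compressed[-1] != column:
            compressed.append(column)
    return compressed


# Two toy isoforms on a 20-base region; the second skips the middle exon.
isoform_a = [(2, 6), (9, 12), (15, 18)]
isoform_b = [(2, 6), (15, 18)]
rows = [label_bases(isoform_a, 20), label_bases(isoform_b, 20)]
columns = compress_columns(rows)

# Print one compressed bit series per isoform.
for i in range(len(rows)):
    print("".join(str(column[i]) for column in columns))
```

Each row of the two-dimensional array is one mRNA and each compressed column is a run of genomic positions with the same exonic status in every isoform.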


Fig. 1 Outline of the algorithm for classifying ASTI patterns. The algorithm first converts each exon–intron structure into a binary (0 or 1) array (I, I' and I''). Redundant rows with the same bit series within a delineated region are thinned out (II). Each pair of the non-redundant bit series is compressed again, e.g. two consecutive (0, 0) columns in II are combined into the underlined column (III). Finally, binary series are converted into decimal numbers to identify an ASTI type by a two-dimensional integer vector (IV). Solid arrows indicate the regions defined as AS units whereas broken arrows indicate those not defined as AS units. In a more efficient procedure actually used, exon boundaries are processed from left to right in the order determined by a priority queue. Numbers in circles indicate the order of visits in this example. A flip-flop switch indicates the exonic status of each isoform.

 
Isoforms usually share a majority of exons and introns in common; however, we are interested in only the difference between them. Moreover, apparently complicated variations in exon–intron organizations may be decomposed into simpler units. Hence, we define an ASTI unit as a minimal span of distinct exon–intron structures flanked by either common exon(s) or extragenic region(s). With the compressed bit arrays mentioned above, the region is represented by a pair of non-identical shortest bit series flanked by either (1, 1) or ‘extragenic’ (0, 0) at one end of a transcript. (To save space, bit arrays are shown side by side rather than up and down.) In the example shown in Figure 1, we find four ASTI units indicated by solid bidirectional arrows (the left solid arrow corresponds to three pairwise ASTI units), whereas the regions indicated by broken arrows are not ASTI units because the bit series are identical (the right broken arrow) or as discussed in the next paragraph (first arrow). ASTI units may be classified into many types, some of which correspond to typical AS patterns, such as alternative donor and acceptor splice sites, cassette, mutually exclusive exons, terminal exons with alternate polyadenylation sites and retained intron (Breitbart et al., 1987). With our system, these typical patterns are represented by relatively short bit arrays, as shown in Figure 2a. Note that bit arrays encoding an AS type are flanked by (1, 1) at both ends, whereas those encoding variants with different transcriptional initiation and termination sites have (0, 0) at the left and right ends, respectively. In fact, the terminal (0, 0) column is dispensable for the unique identification of distinct types because its immediate neighbor cannot be (1, 1) and hence never denotes an AS type. A pair of bit arrays flanked by (0, 0) at both ends without any (1, 1) in between corresponds to nested genes and is not counted as an ASTI type.
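One way to read the definition of an ASTI unit is as a scan over the compressed columns of a pair of isoforms. The sketch below is a simplified interpretation, not the authors' code: the names `compress_pair` and `find_units` are hypothetical, only the first and last columns are treated as extragenic flanks, and no attempt is made to filter out wobble or truncated transcript ends.

```python
from typing import List, Tuple


def compress_pair(row_a: str, row_b: str) -> List[Tuple[str, str]]:
    """Pair up two bit series and merge adjacent identical columns."""
    columns: List[Tuple[str, str]] = []
    for column in zip(row_a, row_b):
        if not columns or columns[-1] != column:
            columns.append(column)
    return columns


def find_units(row_a: str, row_b: str) -> List[Tuple[str, str]]:
    """Return the bit-series pairs of candidate ASTI units for two isoforms."""
    columns = compress_pair(row_a, row_b)
    last = len(columns) - 1
    common_exon = ("1", "1")
    extragenic = ("0", "0")
    # Flanks are common exons, or the extragenic column at either end.
    anchors = [
        i for i, column in enumerate(columns)
        if column == common_exon or (i in (0, last) and column == extragenic)
    ]
    units = []
    for left, right in zip(anchors, anchors[1:]):
        if columns[left] != common_exon and columns[right] != common_exon:
            continue  # (0, 0) at both ends with no (1, 1) between: nested genes
        span = columns[left:right + 1]
        bits_a = "".join(column[0] for column in span)
        bits_b = "".join(column[1] for column in span)
        if bits_a != bits_b:  # identical series are not ASTI units
            units.append((bits_a, bits_b))
    return units


print(find_units("0101010", "0100010"))  # one span between the two common exons
print(find_units("1101011", "1001111"))  # two adjacent spans sharing an exon
```

Because consecutive anchors are used as flanks, every reported span is minimal in the sense that it contains no common exon in its interior.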


Fig. 2 Examples of ASTI types detected in human transcripts. (a) The seven representative AS types proposed by Breitbart et al. (1987) displayed in the descending order of abundance in human transcripts. From left to right: the AS type, the binary representation, the decimal representation, the number of AS units detected and the relative abundance of the AS type. (b) The 10 most abundant AS types classified as ‘others.’ (c) The five most abundant ATI types. A black or gray box indicates an alternative exon or exon part, whereas a white box indicates a constitutive exon or exon part. A retained intron is indicated by a thick line.

 
To improve human recognition, we convert a bit series into the corresponding decimal number (Fig. 1; IV), whereby the smaller number is defined to be the first component and the larger one is the second component of the two-dimensional integer vector. In this conversion, we omit the terminal (0, 0) column without loss of information, as noted above. Each decimal representation is specific to each ASTI type. For instance, the aforementioned typical patterns, i.e. alternative donor and acceptor splice sites, cassette, mutually exclusive exons, two patterns of terminal exons with alternate polyadenylation sites and retained intron, are represented by (9, 13), (9, 11), (17, 21), (69, 81), (17, 20), (9, 12) and (5, 7), respectively (Fig. 2a). In a computer program, we can use a hash function or an associative array to compactly deal with such an extendable set of multi-component variables.
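The conversion to a two-dimensional integer vector and the use of an associative array can be illustrated as follows. This is a minimal sketch: the function name `asti_type` and the dictionary `TYPICAL_TYPES` are hypothetical, while the seven integer pairs and their labels are the ones listed in the text.

```python
from collections import Counter
from typing import Tuple

# Typical types and their decimal representations as listed in the text.
TYPICAL_TYPES = {
    (9, 13): "alternative donor splice site",
    (9, 11): "alternative acceptor splice site",
    (17, 21): "cassette",
    (69, 81): "mutually exclusive exons",
    (17, 20): "terminal exons with alternate polyadenylation sites (first pattern)",
    (9, 12): "terminal exons with alternate polyadenylation sites (second pattern)",
    (5, 7): "retained intron",
}


def asti_type(bits_a: str, bits_b: str) -> Tuple[int, int]:
    """Encode a pair of bit series as (smaller, larger) decimal numbers."""
    # A unit ends either with (1, 1) or with the terminal (0, 0) column,
    # so a trailing (0, 0) is the terminal column and can be dropped.
    if bits_a.endswith("0") and bits_b.endswith("0"):
        bits_a, bits_b = bits_a[:-1], bits_b[:-1]
    x, y = int(bits_a, 2), int(bits_b, 2)
    return (min(x, y), max(x, y))


# An associative array keyed by the integer vector accepts any new type.
counts = Counter()
for pair in [("10101", "10001"), ("1001", "1101"), ("1000001", "1010101")]:
    counts[asti_type(*pair)] += 1

for key, n in counts.items():
    print(key, TYPICAL_TYPES.get(key, "others"), n)
```

A leading (0, 0) column needs no special handling here because leading zeros do not change the decimal value.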

ATI sites are usually regulated by different promoters of a gene, and the molecular mechanisms of transcriptional initiation are quite different from those of splicing (Landry et al., 2003). On the other hand, transcriptional termination is tightly coupled with the upstream splicing patterns. Hence, we consider variations in transcriptional initiation separately from the other variations in the analyses described below, although a common classification system is used in both cases.

Conversion of mapping data into bit arrays
The primary information obtained by mapping cDNA sequences onto genomic sequence is the coordinates (positions) of exon boundaries. Base-wise conversion of this information into bit arrays, as suggested above, is obviously inefficient. Thus, we developed a simple algorithm similar to that used in merge sort for combining two or more already sorted arrays into a single array. We treated a set of isoforms pairwise or collectively. In either case, the boundary coordinates are processed from left to right with a set of switches that indicate exonic status. In the collective procedure, a priority queue is used to indicate the exon boundary to be processed next (Fig. 1). This procedure converts the boundary coordinates into the compressed form of bit arrays in O(KN log(N)) time, where N and K denote the number of isoforms and the average number of boundaries per isoform, respectively. Regions delineated by exonic regions common to all isoforms are treated separately. Rows with the same bit series within such a region are merged to eliminate redundant computations involved in the all-by-all comparisons for the detection of ASTI units. Thus, although the overall computational complexity is O(KN²), practical computational time may be considerably shortened when ASTI units are sparsely distributed. Implementation of this algorithm contributed to a 6.6-fold reduction in execution time compared with the original implementation by round robin comparisons of splice variants. A Perl script implementing the above algorithm is available from the authors (H.N. or O.G.) upon request.
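The collective left-to-right sweep can be sketched in Python as shown below. This is not the authors' Perl script: the function names are hypothetical, exons are assumed to be half-open `(start, end)` intervals listed in sorted order, and the standard-library `heapq.merge`, which performs a heap-based merge of sorted inputs, stands in for the priority queue.

```python
import heapq
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open [start, end) interval (assumed convention)


def _emit(columns: List[Tuple[int, ...]], state: List[int]) -> None:
    """Append the current switch settings unless they repeat the last column."""
    column = tuple(state)
    if columns[-1] != column:
        columns.append(column)


def boundaries_to_columns(isoforms: List[List[Exon]]) -> List[Tuple[int, ...]]:
    """Sweep exon boundaries from left to right and return compressed columns.

    Each isoform must list its exons in sorted, non-overlapping order.
    """
    # One already-sorted stream of (coordinate, isoform index) per isoform.
    streams = [
        [(pos, idx) for exon in exons for pos in exon]
        for idx, exons in enumerate(isoforms)
    ]
    state = [0] * len(isoforms)   # one flip-flop switch per isoform
    columns = [tuple(state)]      # extragenic column to the left of all exons
    previous = None
    for pos, idx in heapq.merge(*streams):
        if previous is not None and pos != previous:
            _emit(columns, state)  # status that held between previous and pos
        state[idx] ^= 1            # entering or leaving an exon
        previous = pos
    _emit(columns, state)          # extragenic column to the right
    return columns


isoform_a = [(2, 6), (9, 12), (15, 18)]
isoform_b = [(2, 6), (15, 18)]
print(boundaries_to_columns([isoform_a, isoform_b]))
```

The sweep touches each boundary once and never expands the alignment base by base, so the work depends on the number of boundaries rather than on the length of the genomic region.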


    SYSTEMS AND METHODS
In silico mapping of mRNA sequences onto genomic sequences
To collect ASTI variants, we mapped full-length mRNA sequences in UniGene and other public databases onto respective genomic sequences by the combined use of MEGABLAST (version 2.2.1) (Zhang et al., 2000) and ALN (Gotoh, 2000). We used only those mRNA sequences that satisfy the following conditions: (1) each matched region flanking a putative intron must be at least 25 bp with at most a single mismatch, (2) the length of an intron must be ≥30 bp and (3) its terminal dinucleotides must be GT.AG, GC.AG or AT.AC. The detailed procedures are described elsewhere (Nagasaki et al., 2005).
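The three acceptance conditions can be expressed as a simple filter. The sketch below is illustrative only: the `Intron` record and its field names are hypothetical, it is assumed that the mismatch limit applies to each flanking matched region separately, and the output format of MEGABLAST or ALN is not modeled.

```python
from dataclasses import dataclass
from typing import List

# Accepted intron termini (first two and last two intron bases).
ALLOWED_TERMINI = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
MIN_FLANK_LENGTH = 25    # bp, condition (1)
MAX_FLANK_MISMATCHES = 1 # condition (1)
MIN_INTRON_LENGTH = 30   # bp, condition (2)


@dataclass
class Intron:
    """Hypothetical summary of one putative intron in a cDNA-genome alignment."""
    length: int
    donor: str             # first two bases of the intron
    acceptor: str          # last two bases of the intron
    left_flank_length: int
    left_flank_mismatches: int
    right_flank_length: int
    right_flank_mismatches: int


def intron_passes(intron: Intron) -> bool:
    """Check conditions (1) to (3) for a single putative intron."""
    flanks_ok = (
        intron.left_flank_length >= MIN_FLANK_LENGTH
        and intron.right_flank_length >= MIN_FLANK_LENGTH
        and intron.left_flank_mismatches <= MAX_FLANK_MISMATCHES
        and intron.right_flank_mismatches <= MAX_FLANK_MISMATCHES
    )
    length_ok = intron.length >= MIN_INTRON_LENGTH
    termini_ok = (intron.donor.upper(), intron.acceptor.upper()) in ALLOWED_TERMINI
    return flanks_ok and length_ok and termini_ok


def mrna_passes(introns: List[Intron]) -> bool:
    """Accept an mRNA only if every putative intron passes (assumed reading)."""
    return all(intron_passes(intron) for intron in introns)


good = Intron(120, "GT", "AG", 40, 0, 25, 1)
short = Intron(29, "GT", "AG", 40, 0, 40, 0)
odd_termini = Intron(500, "GG", "AG", 40, 0, 40, 0)
print(mrna_passes([good]), mrna_passes([good, short]), mrna_passes([odd_termini]))
```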

The predicted gene structures were sorted in the order of chromosomal location for each direction of transcription. Identical gene structures were merged into a single entity and a conceptual multiple alignment of exon boundaries was constructed from gene structures that shared putative exonic regions. Such an alignment was then subjected to the analysis of ASTI patterns as mentioned in the previous subsection.


    RESULTS
ASTI units observed in human mRNAs
As a representative example, we briefly describe the results of our analyses of human transcripts. We found 65 041 mRNAs that satisfied the criteria described in ‘Systems and Methods’, and mapped them onto 15 371 disjoint loci (putative genes) of human genomic sequences. Of the 65 041 mRNAs, 15 903 mRNAs mapped onto 4931 distinct loci were associated with AS, where variation in transcriptional initiation was not counted in this category. After removing redundancy, the number of unique variants involved in AS was reduced to 12 470. Thus, ~32.1% (4931/15 371) of human genes were estimated to undergo AS when mature transcripts were compared, and ~2.5 (12 470/4931) splicing variants were found for each locus that underwent AS. The above values are underestimated because we considered apparently full-length mRNAs only and adopted stringent criteria to identify cognate transcripts. We deliberately avoided using EST sequences to obtain a conservative view of the diversity of AS types.

We applied the algorithm described in the ‘Algorithm’ section to the variant pairs and found 11 498 independent AS units that were classified into 124 distinct AS types. The numbers and fractions of observed AS units belonging to seven typical AS types are shown in Figure 2a. In human, typical AS units other than the retained introns tended to occur within the protein coding sequence (CDS) regions rather than within the untranslated regions (UTRs). The difference in alternative exon lengths was most frequently in multiples of three, and this tendency was most prominent when the alternative exons were embedded within the CDSs (Nagasaki et al., 2005).

Our algorithm can detect not only these typical AS types but also 1843 units that were classified into 117 ‘other’ AS types. Thus, a significant fraction of the AS units we found (16.0% of the total) belong to the rare types. The 10 most abundant rare AS types are shown in Figure 2b. Three of the 10 rare types are the extended cassette type containing more than one additional exon in one of the variants. The most abundant rare type is also of this type with two insert exons, encoded as (65, 85: 1000001, 1010101).

As noted earlier, we categorized the variation at transcriptional initiation sites separately from AS. Applying our system to the same dataset as above, we found 4627 ATI units that were classified into 52 types. These ATI units were derived from 2474 distinct loci. Hence, at least 16.1% (= 2474/15 371) of human genes were estimated to have more than one transcriptional initiation site. Variations because of wobble or possible truncated ends were disregarded in these counts. The numbers of loci in which only ATI, only AS, both AS and ATI and neither AS nor ATI units were detected amounted to 854, 3251, 1601 and 9623, respectively. To reduce potential artifacts, unspliced single-exon variants were omitted in these calculations. The χ² test indicates strong statistical association between AS and ATI loci (χ² = 1522, d.f. = 1, p < 10⁻³⁰⁰). Such strong association was also observed when we restricted the samples to the loci with only two supporting variants (p < 4.5 × 10⁻²⁴) or other specific numbers of variants (data not shown). In a recent study, Kornblihtt et al. (2004) suggested the possible interaction between RNA polymerase and SR proteins as a regulatory mechanism of transcription and splicing. By carefully analyzing those loci that have both ATI and AS units, we hope to gain some insights into the mechanism of coupling between transcriptional initiation and splicing.
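The reported test statistic can be reproduced from the four locus counts given above. The sketch assumes the ordinary Pearson statistic for a 2 × 2 table without continuity correction, which the text does not state explicitly; the function name is hypothetical. The p-value is not computed because it lies below the range of double-precision floating point.

```python
def chi_square_2x2(both: int, only_as: int, only_ati: int, neither: int) -> float:
    """Pearson chi-square statistic for a 2 x 2 table (no continuity correction)."""
    n = both + only_as + only_ati + neither
    numerator = n * (both * neither - only_as * only_ati) ** 2
    denominator = (
        (both + only_as)        # loci with AS
        * (only_ati + neither)  # loci without AS
        * (both + only_ati)     # loci with ATI
        * (only_as + neither)   # loci without ATI
    )
    return numerator / denominator


# Locus counts quoted in the text: only ATI, only AS, both, neither.
only_ati, only_as, both, neither = 854, 3251, 1601, 9623
chi2 = chi_square_2x2(both, only_as, only_ati, neither)
print(f"chi-square = {chi2:.1f} with 1 degree of freedom")
```

With these counts the statistic comes out at approximately 1522, in agreement with the value quoted in the text.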

ASTRA, a visual database of ASTI patterns
In addition to the human ASTI units briefly described above, we also detected the ASTI units for five organisms (Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana and Oryza sativa). These data were stored in a database named ASTRA. The ASTI units were pre-computed from full-length cDNA sequences in the UniGene Database and from genomic sequences at NCBI, Sanger Center and TIGR Institute (Nagasaki et al., 2005). All data can be searched by their ASTI types, gene names, GO terms, GenBank accession numbers, UniGene IDs, OMIM IDs and some properties of AS such as association with NMD (nonsense-mediated mRNA decay) (Lewis et al., 2003) and NAGNAG (Zavolan et al., 2003; Hiller et al., 2004) (Fig. 3). On the front page, ASTRA also reports some statistical features characteristic of each species, such as the fractional representations of ASTI types.


Fig. 3 Snapshots of the graphical interface of ASTRA. (1) ‘Gene Viewer’ representing exon–intron structures defined by genome-cDNA mapping. (2) ‘Navigation Window’ showing all variants of which those within the yellow-colored area are presented in the upper frame of the Gene Viewer. (3) ‘Control Panel’ used to scroll and zoom in/out of Gene Viewer. (4) ‘Annotation Window’ indicating annotation and sequence of the relevant cDNA. Links to GenBank and Ensembl databases are also indicated.

 
ASTI units represent the most elementary localized information of ASTI. However, when a single gene undergoes widely different ASTI events, their underlying mechanisms might be better understood with a graphical display of the ASTI patterns whose alignment order can be customized depending on the user's personal purposes. To satisfy this requirement, we designed the graphical interface of ASTRA. The system consists of two components: (1) an SQL-based database engine to provide visual classification of ASTI patterns and (2) an interactive Java-based browser for rearranging ASTI patterns on the client side at the user's command.

The browser is launched when a user chooses a specific UniGene locus in the database. The browser supports zooming in and out of the overview for the chosen locus. By double-clicking the exon and intron boxes, the user can retrieve DNA sequences and the amino acid translations of the chosen splicing variant. Because the order of the splicing diagrams can be rearranged, the user can focus on the splicing patterns of interest.


    DISCUSSION
Standardization of classification system
Several research groups have reported large-scale detection of AS (Black, 2000; Clark and Thanaraj, 2002; Kan et al., 2001; Modrek and Lee, 2002; Xu et al., 2002; Zavolan et al., 2003). Whereas the classification of observed events according to their patterns should precede detailed studies of the functional implication and mechanisms of AS (Smith and Valcarcel, 2000) and ATI, different research groups have adopted different categories for the classification (Stamm et al., 2000; Breitbart et al., 1987; Modrek and Lee, 2002; Huang et al., 2002; Kan et al., 2001; Hide et al., 2001). Variants of transcriptional initiation were included in the categories of Kan et al. (2001) and Zavolan et al. (2003); only terminal variations were included in the original category of Breitbart et al. (1987); or both were excluded from the categories of Huang et al. (2002) and Modrek and Lee (2002). This anarchic situation impedes the direct comparison of independent observations and brings about unnecessary confusion. Our proposed notation realizes a systematic, objective description of ASTI patterns based only on sequences, and renders their automatic classification feasible.

Diversity in ASTI patterns in human transcripts
We found as many as 124 distinct types of elementary AS patterns in human mRNAs. This striking diversity was revealed for the first time by the automatic classification system we have developed. A vast majority of these divergent types are considered genuine for three reasons. First, the transcripts we used are full-length or nearly full-length mRNAs as annotated in the UniGene database. This subset of UniGene entries of high quality may effectively prevent various troubles associated with ESTs and other imperfect sequences. Second, we adopted stringent criteria to identify cognate transcripts in order to obtain a relatively conservative view. Finally, of the 124 AS types, 68 (58%) were represented by more than one independent AS unit. Moreover, 17 883 of 20 330 (88.0%) intron-flanking boundary pairs that comprise the 9498 AS units were supported by multiple mRNA or EST sequences when we searched the 50 bp-long boundary sequences against the entire UniGene entries by BLASTN (Altschul et al., 1997). These observations indicate that experimental error or mistakes in data processing are not likely to account for the observed diversity in AS patterns.

AS types classified as ‘others’
Atypical AS types were observed in all six species we examined, and are most prevalent in human with respect to both number and kind (Nagasaki et al., 2005). Recently, Sharov et al. (2005) confirmed 23 distinct ASTI types in mouse transcripts that were detected by a method based on our preliminary proposal (Nagasaki et al., 2003). Thus, atypical ASTI events are not exceptional but rather common phenomena observed widely in eukaryotes. Our observation that ~16% of AS units found in human mRNAs belong to the ‘others’ types suggests their significant contribution to the multi-functionality of many genes. Because these atypical AS patterns produce structural variations that are generally more complicated than common patterns, they may have a greater impact on the diversification of products generated from a single gene. What, then, is the biological significance of atypical AS types? Here we introduce a few interesting examples.

The Monarch-1 (NBD-LRR/NACHT/PYPAF) gene on human chromosome 19 is genetically linked to immunological disorders (Williams et al., 2003). A cDNA clone, Hs#S4623591 (Accession No.: AY116294), derived from Monarch-1 gene has three leucine-rich repeat domains encoded in the 7–9th exons inside the CDS region (Fig. 4a). By comparison with the transcript Hs#S4623588 (Accession No.: AY116207), our algorithm detected a rare type of AS denoted by (100000001, 101010101). The lengths of all the exons are 171 bp, causing no frame shift. The Etandem program included in EMBOSS package 2.9.0 diagnosed that those exons are tandem repeats, although mutual similarities are weak, ranging from 61 to 67% identities at the nucleotide sequence level, and from 62 to 74% identities at the translated amino acid sequence level (Fig. 4b). AS variants that lack one or two of the tandem repeat exons were also found (Fig. 4a). Hence, in addition to the above-mentioned AS type with three exon insertions, there exist one (100000101, 101010001)-type, three (1000001, 1010101)-type, one (1010001, 1000101)-type and four (10001, 10101)-type distinct AS units within this genomic region. Each tandem repeat exon encodes a single domain called ribonuclease-inhibitor (RI)-like Leucine-rich repeat as suggested by CD-Search (Marchler-Bauer and Bryant, 2004) on NCBI online BLASTP search site (Fig. 4c). Although Williams et al. reported that Monarch-1 transcripts encode leucine-rich repeats in the AS position (Williams et al., 2003), they did not refer to the tandem repeats of alternative exons. Monarch-1 gene is known to be a global inducer of MHC-I (Major Histocompatibility Complex, class I) genes. As a leucine-rich repeat functions as a protein recognition motif, the variation in the number of repeat exons may modulate the induction levels of MHC-I genes.


Fig. 4 Analysis of tandem repeat exons of human Monarch-1 gene. Each exon encodes a leucine-rich repeat with a conserved sequence motif. (a) Exon–intron structures of human Monarch-1 gene. Black bars, blue boxes, green boxes and green boxes with red frame indicate introns, 5'- and 3'-UTRs, CDSs and the tandem repeat exons, respectively. The three tandem repeats are labeled repeat1, 2 and 3, and indicated by blue, green and orange arrows, respectively. (b) Dot matrix plots between tandem repeat exons of human Monarch-1 gene. (c) Multiple alignment of translated tandem repeat sequences. The conserved motif (LxxLxLxxN/CxL) is boxed.

 
An atypical AS unit of the (100011, 101001) type was found in human interleukin 28 receptor alpha (IL-28RA) gene. There is another isoform that lacks both internal exonic regions of these variants, and hence one (10001, 10101)-type and one (1001, 1011)-type AS unit are also present in this region. The middle exon of the former encodes a transmembrane domain, and the extra exonic region in the latter encodes an intracellular domain (Sheppard et al., 2003).

A similar situation was observed in human Wilms' tumor 1 (WT1) gene, where an atypical AS unit of the (100001, 110101) type is associated with one (1101, 1001)-type and one (10001, 10101)-type AS unit. The internal AS exon encodes 17 amino acids that supposedly modify the transcriptional regulatory property of WT1 (Wang et al., 1995).

In all the cases of Monarch-1, IL-28RA and WT1 genes, each member of an atypical AS unit (i.e. each splice variant of the gene) is also involved in typical AS units, such as cassette and alternative donor/acceptor sites, if a third isoform is used as the counterpart. Of the 912 atypical internal AS units we found, 400 are isolated ones whereas 512 share the same genomic region with some other typical/atypical AS units. If we also take atypical alternative transcriptional terminations into account, 446 units are isolated ones whereas 586 units are associated with some other AS units. Thus, >55% of the genomic regions corresponding to atypical AS units are associated with other typical/atypical AS units as well. In this sense, a majority of atypical AS units may be viewed as composite products of simpler AS events, often observed at the ‘hot spots’ of AS events.

Utility of ASTRA for analysis of complex ASTI patterns
The primary task of ASTRA database is to store ASTI units detected in various organisms and to provide a user-friendly graphic interface for the retrieval of these data together with related information such as the relevant genomic sequence. In addition, one important function of ASTRA is to provide the user with an interactive tool for the analysis of complex ASTI patterns.

As discussed above, some genomic regions can be regarded as ASTI hot spots within which various splicing variations are observed (e.g. Fig. 4a). Because our algorithm for detecting ASTI units is based on the pairwise comparison of isoforms, it is not convenient to grasp an overall ASTI pattern represented by many isoforms. The visual interface of ASTRA can compensate for the limitation of pairwise analyses. For example, the pairwise analysis detects 10 AS units in the genomic region shown in Figure 4a, while their mutual relationships are easily recognized by the graphical representation. The flexible user interface of ASTRA (e.g. easy access to nucleotide/amino acid sequences, and rearrangement of the order of the aligned splice variants) would help researchers get new hints from their investigations.

Another function of ASTRA is to present species-specific statistical properties, such as the fractional representations of various ASTI types. Because ASTI is a mechanism that enhances the complexity of transcriptome of an organism with a limited number of genes, the complexity in ASTI is expected to be correlated with the functional and structural complexity of the organism. Our recent study with six eukaryotes generally supports this idea (Nagasaki et al., 2005). Preparations are under way to add more information to ASTRA, to deepen our understanding of the species specificity of ASTI mechanisms.


    CONCLUSION
AS is controlled by extremely complicated mechanisms (Black, 2003). The very divergent AS patterns revealed herein certainly reflect this mechanistic complexity. Although we have not yet accumulated sufficient knowledge to correlate an observed pattern with its underlying mechanisms, our new classification system may help clarify this poorly understood correlation. In the future, we must perform more detailed analyses of AS with respect to tissue and stage specificity, functional classification of relevant genes, structural difference in variant proteins, conservation among related species and so forth. We also demonstrated that the relative orders of abundance in individual ASTI types were considerably different between evolutionarily distant species, such as between mammals and plants (Nagasaki et al., 2005). All these future directions also apply to the analyses of ATI, which is also shown herein to have patterns as divergent as those of AS.


    Acknowledgments
 
The authors thank Drs T. Aita, T. Kin, K. Fukui and K. Asai for helpful discussions. The authors also thank Mr T. Kumagai for help in setting up the ASTRA hardware. This work was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas (C) ‘Genome Information Science’ from the Ministry of Education, Culture, Sports, Science and Technology of Japan and by the Institute for Bioinformatics Research and Development (BIRD) of the Japan Science and Technology Agency (JST).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on December 15, 2005; revised on February 6, 2006; accepted on February 21, 2006

    REFERENCES
 

    Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Black, D. (2000) Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell, 103, 367–370.

    Black, D.L. (2003) Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem., 72, 291–336.

    Breitbart, R.E., et al. (1987) Alternative splicing: a ubiquitous mechanism for the generation of multiple protein isoforms from single genes. Annu. Rev. Biochem., 56, 467–495.

    Clark, F. and Thanaraj, T.A. (2002) Categorization and characterization of transcript-confirmed constitutively and alternatively spliced introns and exons from human. Hum. Mol. Genet., 11, 451–464.

    Gotoh, O. (2000) Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics, 16, 190–202.

    Hide, W.A., et al. (2001) The contribution of exon-skipping events on chromosome 22 to protein coding diversity. Genome Res., 11, 1848–1853.

    Hiller, M., et al. (2004) Widespread occurrence of alternative splicing at NAGNAG acceptors contributes to proteome plasticity. Nat. Genet., 36, 1255–1257 [Erratum (2005) Nat. Genet., 37, 106].

    Huang, Y.H., et al. (2002) PALS db: putative alternative splicing database. Nucleic Acids Res., 30, 186–190.

    Johnson, J.M., et al. (2003) Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science, 302, 2141–2144.

    Kan, Z., et al. (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res., 11, 889–900.

    Kornblihtt, A.R., et al. (2004) Multiple links between transcription and splicing. RNA, 10, 1489–1498.

    Landry, J.R., et al. (2003) Complex controls: the role of alternative promoters in mammalian genomes. Trends Genet., 19, 640–648.

    Lewis, B.P., et al. (2003) Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc. Natl Acad. Sci. USA, 100, 189–192.

    Marchler-Bauer, A. and Bryant, S.H. (2004) CD-Search: protein domain annotations on the fly. Nucleic Acids Res., 32, W327–W331.

    Modrek, B. and Lee, C. (2002) A genomic view of alternative splicing. Nat. Genet., 30, 13–19.

    Nagasaki, H., et al. (2005) Species-specific variation of alternative splicing and transcriptional initiation in six eukaryotes. Gene, 364, 53–62.

    Nagasaki, H. and Suwa, M., et al. (2003) An algorithm for classification of alternative splicing and transcriptional initiation and its genome-wide application. Genome Inform., 14, 424–425.

    Sharov, A.A., et al. (2005) Genome-wide assembly and analysis of alternative transcripts in mouse. Genome Res., 15, 748–754.

    Sheppard, P., et al. (2003) IL-28, IL-29 and their class II cytokine receptor IL-28R. Nat. Immunol., 4, 63–68.

    Smith, C.W. and Valcarcel, J. (2000) Alternative pre-mRNA splicing: the logic of combinatorial control. Trends Biochem. Sci., 25, 381–388.

    Stamm, S., et al. (2000) An alternative-exon database and its statistical analysis. DNA Cell Biol., 19, 739–756.

    Thanaraj, T.A., et al. (2004) ASD: the alternative splicing database. Nucleic Acids Res., 32, D64–D69.

    Wang, Z.Y., et al. (1995) Products of alternatively spliced transcripts of the Wilms' tumor suppressor gene, wt1, have altered DNA binding specificity and regulate transcription in different ways. Oncogene, 10, 415–422.

    Williams, K.L., et al. (2003) Cutting edge: Monarch-1: a pyrin/nucleotide-binding domain/leucine-rich repeat protein that controls classical and nonclassical MHC class I genes. J. Immunol., 170, 5354–5358.

    Xu, Q., et al. (2002) Genome-wide detection of tissue-specific alternative splicing in the human transcriptome. Nucleic Acids Res., 30, 3754–3766.

    Zavolan, M., et al. (2003) Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res., 13, 1290–1300.

    Zhang, Z., et al. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

