Bioinformatics Advance Access originally published online on December 10, 2004
Bioinformatics 2005 21(8):1393-1400; doi:10.1093/bioinformatics/bti207

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

LongSAGE analysis significantly improves genome annotation: identifications of novel genes and alternative transcripts in the mouse

Matthias B. Wahl 1,†, Ulrich Heinzmann 2 and Kenji Imai 1,*

1Institute of Developmental Genetics, GSF-National Research Center for Environment and Health Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany
2Institute of Pathology, GSF-National Research Center for Environment and Health Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany

*To whom correspondence should be addressed.


    Abstract

Motivation: Owing to their increased length, LongSAGE tags are expected to be more reliable in direct assignment to genome sequences. Therefore, we evaluated the use of LongSAGE data in genome annotation by using our LongSAGE dataset of 202 015 tags (consisting of 41 718 unique tags), experimentally generated from mouse embryonic tail libraries.

Results: A fraction of LongSAGE tags could not be unambiguously assigned to a single gene, owing to the presence of widely conserved sequences downstream of particular CATG anchor sites. The presence of alternative forms of transcripts was confirmed in 45% of all detected genes. Surprisingly, a large fraction of LongSAGE tags with hits to the genome (66%) could not be assigned to any gene annotated in EnsEMBL. Among such cases, 2098 LongSAGE tags fell into a region containing a putative gene predicted by GenScan, providing experimental evidence for the presence of real genes, while 9112 genes were found to be missing from, or wrongly annotated by, the EnsEMBL pipeline.

Conclusions: LongSAGE transcriptome data can significantly improve genome annotation by identifying novel genes and alternative transcripts, even in the best-characterized organisms to date, such as the mouse.

Contact: imai{at}gsf.de


    INTRODUCTION
The availability of whole genomic sequences of humans (Lander et al., 2001) and major model organisms like mice (Waterston et al., 2002), together with the large datasets of human and mouse full-length cDNA (Imanishi et al., 2004; Okazaki et al., 2002) and expressed sequence tag (EST) sequences (Wheeler et al., 2004), has revolutionized biomedical research. However, the challenge of the so-called ‘post-genomic’ era will be to extract biological information on a large scale from the available sequence data. This endeavor includes annotation of genes to the genome (reviewed in Stein, 2001) and large-scale gene expression screens (reviewed by Kanehisa and Bork, 2003), and might finally allow biological processes to be modeled and simulated in conjunction with functional data, i.e. systems biology (reviewed by Kitano, 2002). While the latter is still difficult to achieve, significant progress has been made in genome annotation and gene expression profiling in recent years.

The annotation of eukaryotic genomes is a very difficult task. The BLAST algorithm alone (Altschul et al., 1990) cannot be used to predict gene structures, since the algorithm has no model for splice sites and hence exon boundaries. Furthermore, the presence of repetitive sequences within transcript sequences, high homology to paralogs of the same protein family, and pseudogenes, whose nucleotide sequences are very similar to their transcribed counterparts, complicate the identification of genes on the genomic sequence with homology-based programs. On the other hand, de novo gene prediction programs miss a substantial fraction of genes, especially non-protein-coding genes (reviewed by Zhang, 2002), and conversely produce many false-positive predictions (Rogic et al., 2001). Comparative gene prediction programs greatly increase gene prediction specificity (Alexandersson et al., 2003); however, they miss genes outside conserved syntenic regions as well as organism-specific genes. Therefore, a combination of ab initio gene finding algorithms and sequence alignments is currently used for genome annotation, as in GeneWise (Birney et al., 2004b), which is used for the EnsEMBL genome annotation project (Curwen et al., 2004).

Real gene identification is ultimately dependent on experimentally generated nucleotide or protein sequence data. Nevertheless, since even millions of EST sequences in humans and mice were not sufficient to cover every single gene in the genome, leaving out especially genes that are rarely expressed or expressed at low abundance, more efficient experimental methods are required. Recently it has been shown that Serial Analysis of Gene Expression (SAGE) (Velculescu et al., 1995), a method initially developed for gene expression profiling, is more powerful in transcript detection than EST sequencing (Sun et al., 2004). It is well established that in a significant fraction of cases a short sequence of 10 bases, immediately following the 3'-most NlaIII recognition site (i.e. 5'-CATG-3') in a transcript sequence, can be assigned to a single gene as a SAGE tag. Such tag-to-gene assignments have been made and provided as the SAGEmap database (Lash et al., 2000). The improved version of SAGE (LongSAGE) generates tags with a length of 21 bases, which theoretically can be uniquely assigned to a single genomic position (Saha et al., 2002); LongSAGE might therefore also assist in the correct identification of the genomic locus corresponding to a certain transcript. In the present study, we performed the first LongSAGE analysis in the mouse by generating multiple libraries from mouse embryonic tail tissues. Subsequently, we analyzed various aspects of the dataset in detail and addressed the use of LongSAGE in annotating genes to the genome as well as in detecting novel genes. Our analyses clearly indicate that LongSAGE transcriptome data can significantly improve genome annotation, despite the fact that information on genes and transcripts in the mouse is the most advanced among vertebrate models.


    MATERIALS AND METHODS
LongSAGE library construction and data acquisition
LongSAGE library construction and data acquisition are described in detail in the accompanying paper (Wahl et al., 2005). The complete LongSAGE tag dataset used in this study is available from the Gene Expression Omnibus (GEO) database at NCBI under the accession number GSM26978.

Tag-to-gene mapping
All programs for the data processing described in the following paragraphs were written in Perl using the BioPerl (Stajich et al., 2002) and EnsEMBL Perl APIs (Stabenau et al., 2004).

UniGene: Instead of the SAGEmap database, we used our own algorithm to define tag-to-gene assignments: in contrast to the SAGEmap database, we used only sequences reliably containing the 3' end of a transcript. To this end, all sequences of a UniGene release (build #133) were categorized according to their descriptions into full-length, 3', 5' and others, and were examined for the presence of a canonical polyadenylation signal (up to 50 bases upstream of a polyadenylated tail) and/or the polyadenylated tail. All full-length and 3' annotated sequences were taken when they contained at least a polyadenylation signal or a polyadenylated tail, while 5' sequences were considered only when they harbored both. Sequences were converted to 5' to 3' orientation only when they contained both a polyadenylation signal and a tail in reverse-complement orientation. For each sequence of a UniGene cluster passing the above criteria, the 10- to 17-base sequence downstream of the 3'-most CATG was taken as the SAGE tag (‘UniGene all’, Fig. 1). For the dataset of ‘reliable’ tag-to-gene mappings (‘UniGene reliable’, Fig. 1), we considered only those SAGE tags that were derived from at least 10% of all 3' sequences of the corresponding UniGene cluster or from a full-length cDNA sequence.
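
The filtering and tag extraction rules above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl implementation: all function names are hypothetical, AATAAA is assumed to be the only polyadenylation signal checked, a tail is assumed to be a terminal run of at least 8 A's, and sequences in the category 'others' are assumed to be discarded.

```python
import re

ANCHOR = "CATG"  # recognition site of the anchoring enzyme


def has_polya_tail(seq, min_len=8):
    """True if seq ends in a run of at least min_len A's (threshold is an assumption)."""
    return re.search(r"A{%d,}$" % min_len, seq) is not None


def has_polya_signal(seq, window=50, signal="AATAAA"):
    """True if the signal lies within `window` bases upstream of the terminal A run
    (or of the 3' end when the sequence does not end in A)."""
    tail = re.search(r"A+$", seq)
    tail_start = tail.start() if tail else len(seq)
    return signal in seq[max(0, tail_start - window):tail_start]


def keep_sequence(seq, category):
    """Apply the category-dependent 3'-end criteria described in the text."""
    signal, tail = has_polya_signal(seq), has_polya_tail(seq)
    if category in ("full-length", "3'"):
        return signal or tail
    if category == "5'":
        return signal and tail
    return False  # assumption: 'others' are not used


def extract_tag(seq, tag_len=17):
    """Return the tag_len bases directly downstream of the 3'-most CATG, or None."""
    pos = seq.rfind(ANCHOR)
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None
```

With tag_len set to 10 the same function yields an original SAGE tag; with 17 it yields the variable part of a 21-base LongSAGE tag (CATG plus 17 bases).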



Fig. 1 Uniqueness of SAGE tags of different length to the transcriptome and the genome. Percentages of no-hit, single-hit and multiple-hit cases are shown for SAGE tags of different length, ranging from 14 bases (original SAGE) to 21 bases (LongSAGE) against UniGene (A and B) and the genome (C). Reliable assignments to UniGene are those cases, where a LongSAGE tag was detected in at least 10% of the 3' most sequences and/or in a full-length cDNA clone. For the definitions of the datasets ‘UniGene all’, ‘UniGene reliable’ and ‘Genome’, see Materials and Methods section.

 
MGD: For the entries in the Mouse Genome Database (MGD; based on the MGD report file released on January 16, 2004 and downloaded from ftp.jax.org), the corresponding UniGene cluster(s) were determined from the MGD database report. To link an MGD ID to EnsEMBL genes and EST genes, the ‘MarkerSymbol’ entry in the EnsEMBL db_xref database table, which corresponds to the official symbol in MGD, was taken. All LongSAGE tags mapped onto the UniGene cluster(s) and/or EnsEMBL gene(s)/EST gene(s) of an MGD ID were assigned to that MGD ID. In order to avoid multiple-hit cases, a LongSAGE tag was taken only when it could not be assigned to another MGD ID by the same procedure.

EnsEMBL: EnsEMBL genes and EST genes (database version 19_30, based on the NCBI mouse genome assembly, build 30) that had an exon completely overlapping with an experimentally obtained LongSAGE tag on the genomic sequence were associated with the LongSAGE tag. When exons of two or more EnsEMBL genes and/or EST genes overlapped with the same LongSAGE tag on the genomic sequence, such genes were considered redundant, which was often the case for EnsEMBL and EST genes. The gene with the longest cDNA sequence was taken as the representative hit. When a LongSAGE tag was mapped onto an EnsEMBL gene or EST gene at two different genomic locations, it was considered a ‘multiple hit’.
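
As a rough illustration of the exon overlap rule, the following Python sketch assigns a tag to the gene whose exon fully contains it and resolves redundant genes by cDNA length. The `Exon` record, its fields and both function names are hypothetical and do not correspond to the EnsEMBL API; strand is ignored for brevity.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    gene_id: str
    chrom: str
    start: int     # 1-based, inclusive
    end: int       # inclusive
    cdna_len: int  # cDNA length of the gene this exon belongs to


def representative_gene(exons, chrom, tag_start, tag_end):
    """Gene with the longest cDNA among genes whose exon fully contains the tag."""
    hits = [e for e in exons
            if e.chrom == chrom and e.start <= tag_start and tag_end <= e.end]
    if not hits:
        return None
    return max(hits, key=lambda e: e.cdna_len).gene_id


def assign_tag(exons, tag_hits):
    """tag_hits: list of (chrom, start, end) genomic positions of one tag."""
    genes = [representative_gene(exons, *hit) for hit in tag_hits]
    genes = [g for g in genes if g is not None]
    if not genes:
        return ("no gene", None)
    if len(genes) > 1:
        return ("multiple hit", None)  # annotated genes at two or more locations
    return ("single hit", genes[0])
```

For example, a tag lying inside overlapping exons of an EnsEMBL gene and an EST gene at one locus is reported once, under the gene with the longer cDNA.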

Pseudogenes
When a single UniGene cluster was represented by multiple distinct EnsEMBL genes and/or EST genes in the EnsEMBL database, we considered only one of these EnsEMBL genes to correspond to the real gene, whereas the others were considered to be pseudogenes. To identify those cases, we first determined the EnsEMBL gene or EST gene corresponding to every UniGene cluster. To this end, the representative sequence of each UniGene cluster (Mm.seq.uniq) was compared by BLAST against all EnsEMBL transcripts. For each UniGene sequence, the best hit with a minimum of 92% identity over at least 250 bp in the same orientation was considered as the corresponding EnsEMBL gene (‘primary’ hit), thereby allowing multiple UniGene clusters per EnsEMBL gene/EST gene, but only one gene/EST gene per UniGene cluster. Next, a second comparison was conducted in a similar manner with those EnsEMBL genes and EST genes that had not yet been assigned to a UniGene cluster as a primary hit. If one of those genes, not assigned in the first round, shared a sequence similarity of at least 92% over at least 250 bp with a UniGene cluster (‘secondary’ EnsEMBL hit for the UniGene cluster), this particular UniGene cluster had to be represented by multiple genes in the EnsEMBL genome annotations. Otherwise, i.e. if the two (or more) genes corresponded to two distinct transcribed genes, two (or more) distinct UniGene clusters would have to exist for those genes. To take into account the redundancy between EnsEMBL genes and EST genes, ‘secondary’ hits with a chromosomal localization overlapping that of a ‘primary’ hit were left out. All UniGene clusters associated with ‘primary’ and ‘secondary’ hits were considered as gene–pseudogene pairs.
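
The two-round assignment can be sketched in Python as below, assuming the BLAST results have already been parsed into simple records. The `Hit` record and the use of a score to pick the best hit are assumptions made for illustration; the same-orientation requirement and the removal of secondary hits that overlap a primary hit on the chromosome are omitted here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    cluster: str     # UniGene cluster ID
    gene: str        # EnsEMBL gene or EST gene ID
    identity: float  # percent identity of the alignment
    length: int      # alignment length in bp
    score: float     # value used to rank hits (assumed)


def passes(hit, min_identity=92.0, min_length=250):
    """Thresholds from the text: at least 92% identity over at least 250 bp."""
    return hit.identity >= min_identity and hit.length >= min_length


def primary_and_secondary(hits):
    good = [h for h in hits if passes(h)]

    # Round 1: best qualifying hit per UniGene cluster (one gene per cluster).
    primary = {}
    for h in good:
        best = primary.get(h.cluster)
        if best is None or h.score > best.score:
            primary[h.cluster] = h
    primary_genes = {h.gene for h in primary.values()}

    # Round 2: qualifying hits to genes that are not a primary hit of any cluster.
    secondary = {}
    for h in good:
        if h.gene not in primary_genes:
            secondary.setdefault(h.cluster, set()).add(h.gene)

    return {c: h.gene for c, h in primary.items()}, secondary
```

Clusters that appear as keys of the second dictionary are the candidate gene–pseudogene cases.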

Number of LongSAGE tags per gene
EnsEMBL: All LongSAGE tags derived from the embryonic tail libraries that could be directly and uniquely mapped onto an EnsEMBL gene or EST gene were evaluated. If multiple LongSAGE tags were assigned to the same EnsEMBL gene or EST gene, the location of these LongSAGE tags in the EnsEMBL gene was determined. In the case of multiple overlapping genes, the representative one was used (see above). If the gene had overlapping exons (e.g. different polyadenylation sites in the last exon), the longest one was used.

MGD: For each MGD entry, the mapping LongSAGE tags were retrieved and the number of unique tags in the dataset per MGD ID was counted.

Support of GenScan prediction by LongSAGE tags
Only those LongSAGE tags that could not be assigned to either a UniGene cluster or an EnsEMBL gene or EST gene, but mapped onto the genomic sequence were used for this analysis. For each hit to the genome, genes predicted by GenScan were retrieved through the EnsEMBL database (predictionTranscripts). All LongSAGE tags within GenScan predictions were directly associated and LongSAGE tags between two predicted exons or up to 2000 bp downstream of the predicted gene were considered as potential hits.
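
A minimal Python sketch of this classification is given below, assuming that the tag and the predicted gene lie on the same strand and that exon coordinates are given as forward-strand (start, end) pairs. The function name and the category labels are hypothetical; the 'partial overlap' category is an addition of this sketch for tags that straddle an exon boundary, a case the text does not describe.

```python
def classify_tag(tag_start, tag_end, exons, strand="+", max_downstream=2000):
    """Classify a tag relative to one GenScan prediction.

    exons: list of (start, end) tuples, inclusive forward-strand coordinates.
    """
    exons = sorted(exons)
    gene_start, gene_end = exons[0][0], exons[-1][1]

    if any(s <= tag_start and tag_end <= e for s, e in exons):
        return "within exon"
    if any(tag_start <= e and s <= tag_end for s, e in exons):
        return "partial overlap"  # not covered by the text
    if gene_start <= tag_start and tag_end <= gene_end:
        return "between exons"

    # Distance past the last predicted exon in the direction of transcription.
    dist = tag_start - gene_end if strand == "+" else gene_start - tag_end
    if 0 < dist <= max_downstream:
        return "downstream"
    return "unrelated"
```

For a minus-strand prediction, 'downstream' means lower forward-strand coordinates than the first listed exon, which is why the distance is computed from `gene_start` in that case.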

Verification of cDNA/EST alignments by LongSAGE tag
For every LongSAGE tag mapped onto the genome (including the multiple hits), we examined whether one or more cDNA or EST sequences could be aligned to the genomic position of the LongSAGE tag. We first retrieved the overlapping dna_align_features from the EnsEMBL databases core, estgene and est, which hold all high-scoring sequence pairs (HSPs) of a global BLAT analysis of all transcript sequences against the genome. To ensure that the alignment between transcript and genome sequence was real, we then required the presence of a splice donor (GT) or splice acceptor (AG) immediately downstream or upstream of the aligned segments. In the case of potential single-exon genes, alignments were also accepted if the whole sequence (at least 95%) could be aligned to the genomic position without any gap (Supplementary Figure 1). For each HSP, a minimum identity of 97% was required. LongSAGE tags overlapping with a dna_align_feature at multiple different genomic locations were not processed. Next, the aligned transcript sequence(s) were compared to annotated genes or EST genes. A LongSAGE tag was considered to correspond to a particular gene/EST gene in the following cases. First, if one transcript sequence overlapping with a LongSAGE tag also overlapped with an EnsEMBL exon associated with at least one splice donor or acceptor, on the same or the opposite strand, at a maximal distance of 10 kb from the LongSAGE tag (Supplementary Figure 2A). In the latter case (opposite strand), the LongSAGE tag was categorized as an antisense tag (Supplementary Figure 2B). Second, if the transcript sequence showed a minimum identity of 95% over at least 150 bp to an EnsEMBL transcript at most 10 kb away from the LongSAGE tag on the same or a different strand (Supplementary Figure 2C and D). For more detailed descriptions of antisense tags and genes, see the accompanying paper (Wahl et al., 2005).
All transcript sequence(s)/LongSAGE tag combinations that could not be assigned to any EnsEMBL gene in the above strategy were considered as ‘novel’. LongSAGE tags with multiple hits to the genome were included in the final numbers, only when for one single genomic locus one or more transcript sequence(s) could be aligned.
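
The splice-signal and identity requirements for an aligned segment can be illustrated with a short Python sketch. It assumes a forward-strand genomic string and 0-based, end-exclusive block coordinates; both function names are hypothetical, and minus-strand alignments (where the signals appear reverse-complemented) are not handled.

```python
def has_splice_signal(genome, seg_start, seg_end):
    """True if GT directly follows the aligned block or AG directly precedes it.

    genome: forward-strand sequence; seg_start/seg_end: 0-based, end-exclusive.
    """
    donor = genome[seg_end:seg_end + 2] == "GT"
    acceptor = seg_start >= 2 and genome[seg_start - 2:seg_start] == "AG"
    return donor or acceptor


def accept_hsp(genome, seg_start, seg_end, identity, min_identity=97.0):
    """Combine the 97% identity threshold from the text with the splice-signal check."""
    return identity >= min_identity and has_splice_signal(genome, seg_start, seg_end)
```

Ungapped single-exon alignments, which the text accepts without a splice signal, would need a separate rule based on the aligned fraction of the transcript.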


    RESULTS
Assignment of LongSAGE tags to genes
Table 1 shows the statistics for the reliable assignment of LongSAGE tags to UniGene clusters (Wheeler et al., 2004), genomic sequence and EnsEMBL genes (Birney et al., 2004a). The percentage of LongSAGE tags that were not mapped onto genes (UniGene and/or EnsEMBL genes) was inversely proportional to the tag abundance, which is in concordance with previous reports (Margulies et al., 2001; Wahl et al., 2004).


Table 1 Assignment of LongSAGE tags to UniGene clusters, genomic sequence and EnsEMBL genes

 
A detailed analysis of the 41 714 types of LongSAGE tags collected in our experiment against the genomic sequence is summarized in Table 2. Unique genome hits were confirmed in 21 904 cases (53%), whereas 4087 tags (10%) hit multiple genome sites and 15 723 tags (38%) did not show any hit in the genome. Of the 21 904 tags with a unique hit to the genomic sequence, 6317 tags matched to both UniGene and EnsEMBL databases, 2510 LongSAGE tags could be assigned only to an EnsEMBL gene and 3729 tags detected in the genomic sequence could not be mapped onto any EnsEMBL gene, but only to a UniGene cluster. Interestingly, 9348 tags could not be mapped onto either UniGene or EnsEMBL genes. Further investigations on these 9348 tags are presented later in separate sections.
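
The counts in this paragraph are internally consistent, which can be checked with a few lines of Python. All numbers are taken from the paragraph above; only the variable names are introduced here.

```python
# Tags grouped by the number of hits to the genome.
genome_classes = {"unique hit": 21904, "multiple hits": 4087, "no hit": 15723}

# Breakdown of the tags with a unique genome hit.
unique_breakdown = {
    "UniGene and EnsEMBL": 6317,
    "EnsEMBL only": 2510,
    "UniGene only": 3729,
    "neither": 9348,
}

total = sum(genome_classes.values())
percentages = {name: round(100 * n / total) for name, n in genome_classes.items()}

print(total)                           # all tag types analyzed
print(percentages)                     # rounded shares of each genome class
print(sum(unique_breakdown.values()))  # should equal the unique-hit count
```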


Table 2 Analysis of LongSAGE tags against genomic sequence

 
Surprisingly, a considerable portion of LongSAGE tags had no hit to the genomic sequence (38%). To determine whether this observation was due to gaps or sequencing errors in the genome sequence assembly, or due to exon-spanning LongSAGE tags, we analyzed how many of these LongSAGE tags were represented by a UniGene cluster. We could identify only 1097 cases (out of 15 723) where a tag without a genome hit was included in transcript sequences (UniGene). Since the phenomenon of tags without a genome hit was predominantly seen in single-count tags, we assumed that it was due to sequence errors introduced during the PCR or sequencing step. Even though the base-calling program Phred (Ewing et al., 1998) was used and the raw ditags were processed by the SAGEScreen algorithm, which tries to correct erroneous LongSAGE tags (Akmaev and Wang, 2004), this possibility remains. Indeed, an analysis of the abundance of all derivatives of linker tags (allowing up to two substitutions and up to one deletion or insertion) relative to the count of their parent before SAGEScreen correction indicated an error rate of 11–15%, which could be extrapolated to 22 000–30 000 tags. This observation is consistent with the considerations and analyses of other SAGE data (Akmaev and Wang, 2004).

Relationship between tag length and unique assignment
Figure 1 depicts the frequencies of no-hits, single-hits and multiple-hits to either UniGene clusters or the genome sequence, in order to provide a basis for determining the minimum length of a SAGE tag (including the CATG anchor sequence) to be considered unique to the corresponding database. In order to avoid artificial LongSAGE tags due to PCR and/or sequencing artefacts, we concentrated on LongSAGE tags with a minimum count of three for this analysis. These LongSAGE tags were artificially trimmed by up to 7 bases at the 3'-end. For the assignment to UniGene, a plateau was reached at a tag length of 16–17 bases (Fig. 1A and B), and interestingly, an increased tag length did not reduce the number of multiple hits. In contrast, for the mapping to the genome, each additional base increased specificity, whereas the number of no-hit cases remained low (Fig. 1C). In the case of single-hit matching to the genome sequence, with increasing tag lengths from 16 to 20, we observed a significant increase in the percentage of unique genome assignments from ~20 to ~75% in our experimental dataset. However, this increase reached a plateau with the tag length of 20, and the incidence of single-hit matching did not significantly increase any more with 21-base LongSAGE tags.
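
The trimming experiment can be sketched in Python as follows. This is an illustrative toy version, not the authors' pipeline: the function names are hypothetical, tags are assumed to be 21-base strings that include the CATG anchor, and only exact matches on the forward strand of a single sequence string are counted.

```python
def count_occurrences(tag, genome):
    """Count all (possibly overlapping) exact occurrences of tag in genome."""
    n, pos = 0, genome.find(tag)
    while pos != -1:
        n += 1
        pos = genome.find(tag, pos + 1)
    return n


def hit_class(n_hits):
    if n_hits == 0:
        return "no hit"
    return "single hit" if n_hits == 1 else "multiple hits"


def uniqueness_by_length(tags, genome, min_len=14, max_len=21):
    """For each tag length, trim tags at the 3' end and tally the hit classes."""
    result = {}
    for length in range(min_len, max_len + 1):
        counts = {"no hit": 0, "single hit": 0, "multiple hits": 0}
        for tag in tags:
            counts[hit_class(count_occurrences(tag[:length], genome))] += 1
        result[length] = counts
    return result
```

Dividing each tally by the number of tags gives the percentages plotted for one dataset; a real analysis would also search the reverse strand and use an index rather than repeated string scans.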

Uniqueness of a LongSAGE tag in the presence of pseudogenes
Pseudogenes are copies of transcribed genes with very high sequence similarity to the coding region of the corresponding real gene, but are not transcribed. Therefore, it is often difficult to discriminate a real, transcribed gene represented by EST or cDNA sequences from its non-transcribed pseudogene(s) on the genome sequence. Thus, we asked how many LongSAGE tags derived from a real gene with one or more highly similar pseudogene(s) were found only once in the genome sequence, i.e. were present only in the transcribed gene but not in the pseudogene(s). We first identified potential pseudogenes annotated in EnsEMBL with a strategy explained in detail in the Materials and Methods section. As summarized in Table 3, out of 1144 potential gene/pseudogene(s) pairs with one or more LongSAGE tag(s) observed in our dataset, for 457 (339 + 118) genes at least one LongSAGE tag was unique to the genome, but in 687 cases, an identical LongSAGE tag was observed in both gene and pseudogene(s).


Table 3 Uniqueness of LongSAGE tags for genes with potential pseudogenes

 
Number of tags per gene
As we often observed that different LongSAGE tags were mapped onto the same gene, we analyzed the number of LongSAGE tags for every gene. For this purpose, the direct use of UniGene clusters was not the best choice, since it was known that a significant number of genes were represented in multiple UniGene clusters. Therefore, we started with all genes with entries in the MGD (Bult et al., 2004) and retrieved the possible LongSAGE tags through the UniGene clusters assigned to entries in MGD. Of a total of 9088 MGD entries with LongSAGE tags detected in our libraries, in ~45% (4165 tags) of the cases, more than one kind of tag was observed in the dataset, and ~21% (1896 tags) were represented three or more times (Table 4, ‘MGD entries’).


Table 4 Alternative tags per gene

 
To determine the exact localizations of the LongSAGE tags within the genes, we performed the same analysis on EnsEMBL genes, for which the exon–intron structure was known (Table 4, ‘Mapped to EnsEMBL’). Altogether, fewer LongSAGE tags could be assigned to EnsEMBL genes, resulting in a total of 7179 genes analyzed and only 1933 genes with multiple unique tags. Among these, 1066 (848 + 218) EnsEMBL genes had LongSAGE tags located in different exons and in 1303 (1085 + 218) cases, more than one LongSAGE tag was assigned to the same exon.

LongSAGE tags providing experimental supports for predicted genes
Since 9348 LongSAGE tags that were mapped onto the genome lacked a reliable hit to UniGene and could not be associated with an EnsEMBL gene (Table 2), we analyzed how many of these LongSAGE tags overlapped with an ab initio prediction by GenScan (Burge and Karlin, 1997). As GenScan is known to fail in detecting some of the exons of a gene and is unable to predict untranslated regions (UTRs), we also included those LongSAGE tags located between two predicted exons or downstream of the last predicted exon (possibly in the 3'-UTR), as long as the distance was not >2000 bp. As illustrated in Figure 2, a total of 2098 LongSAGE tags were located in the region of a predicted gene, including 111 LongSAGE tags within predicted exons, 1039 LongSAGE tags between two predicted exons and 948 LongSAGE tags downstream of the last predicted exon.



Fig. 2 Verification of GenScan predicted genes by LongSAGE tags. Out of the 9348 LongSAGE tags that could not be assigned to UniGene and EnsEMBL genes, a total of 2098 tags were located in the same orientation either within (A and B) or closely downstream of (C) putative genes predicted by GenScan. (A) In 111 cases, LongSAGE tags were found within a predicted exon. (B) In 1039 cases, LongSAGE tags were located between predicted exons. (C) In 948 cases, LongSAGE tags were localized within 2000 bp downstream of the last exon of predicted genes. Long horizontal lines: genomic sequence; hatched boxes: exons of a GenScan predicted gene (the transcriptional direction is from left to right); bold arrows: LongSAGE tags.

 
Evidence for novel genes, novel exons and alternative polyadenylation sites
Surprisingly, of the 21 904 LongSAGE tags with a single hit to the genome sequence, 13 077 (9348 + 3729; Table 2) tags could not be assigned to an EnsEMBL gene. However, 3729 of these LongSAGE tags without an EnsEMBL gene hit could be assigned to a UniGene cluster. This observation reflected the fact that a certain fraction of UniGene entries were not included in the EnsEMBL genome annotation, and at the same time suggested that LongSAGE data might be useful to link such non-annotated UniGene entries to their genome locations. Therefore, we addressed this possibility by determining transcript sequences that could be aligned to the chromosomal position of each LongSAGE tag mapped onto the genome. As shown in Figure 3, for a total of 18 205 unique LongSAGE tags, a transcript sequence overlapped with the LongSAGE tag on the genomic sequence. In 9112 cases, there was no EnsEMBL gene annotated at the chromosomal position of the LongSAGE tag (Fig. 3A). In 207 cases, the LongSAGE tag supported by a transcript sequence was located between two exons of an annotated gene (Fig. 3C), and in 480 cases downstream of the 3'-UTR of an EnsEMBL gene (Fig. 3D).



Fig. 3 Alignment of transcript sequences to the genomic position of a LongSAGE tag. We identified all EST and/or cDNA sequences that could be aligned to the genomic position of each LongSAGE tag, following splicing rules and requiring a complete overlap with the whole LongSAGE tag. We observed four different classes of LongSAGE tags with respect to their locations: (A) outside of an EnsEMBL gene, with no annotated EnsEMBL gene corresponding to the aligned transcript sequence (‘novel’ gene); (B) within an exon of an annotated EnsEMBL gene (known gene); (C) between two exons (novel exon); and (D) downstream of the last exon of an EnsEMBL gene (alternative polyadenylation), but with at least one aligned transcript sequence derived from the same gene. The numbers depicted are the result of an analysis of all 25 991 LongSAGE tags with one or more hits to the genome. When a LongSAGE tag had multiple hits to the genome, but only one was associated with aligned transcript sequence(s) at a single genomic position, it was included in the numbers of this figure. Long horizontal lines: genomic sequence; short horizontal lines: segments of transcript sequences matching the genomic sequence; open boxes: exons of an EnsEMBL gene; bold arrows: LongSAGE tags.

 

    DISCUSSION
Optimal length of SAGE tags
For the best performance of SAGE, it is necessary to generate tags that are as short as possible but can still be reliably assigned to the corresponding genes. As illustrated in Figure 1, against both transcript (UniGene) and genome databases, beyond a certain tag length the number of multiple hits does not significantly decrease with increased tag length, reflecting the fact that transcript and genome sequences are not purely random. The fact that the graph approaches a plateau could be explained by the existence of highly homologous genes or repetitive sequences, leading to the same SAGE tags for multiple transcript sequences or for multiple genomic locations. This notion is in agreement with a previous report on invertebrates such as Caenorhabditis elegans and Drosophila melanogaster (Pleasance et al., 2003). This indicates that not all transcripts will be covered by a single anchoring enzyme (here: Hsp92II, cutting at the recognition sequence CATG). However, by generating parallel LongSAGE libraries from the same mRNA material using different anchoring enzymes, this limitation can be overcome to some extent, whereby the additional tiny fraction of transcripts that do not contain a CATG may also be covered. According to the results shown in Figure 1, a 17-base SAGE tag (including CATG) is sufficient for a unique assignment to transcript sequences. Owing to the larger size and higher complexity of the genome sequence, a SAGE tag has to be longer to be uniquely assigned to the genome. Since the number of multiple hits decreases only marginally (and accordingly the percentage of single-hit matching to the genome does not improve) between 20 and 21 bases, LongSAGE tags appear to be adequate. However, this will have to be experimentally verified by generating even longer SAGE tags with the recently published method SuperSAGE (Matsumura et al., 2003), which utilizes 26-base tags.
Concerning tag uniqueness in the genome, the maximal incidence of unique assignment calculated theoretically in the original LongSAGE article (Saha et al., 2002) is almost 100%. This theoretical maximum is also reached with 20-base tags, but it is significantly higher than the ~75% observed in the experimental dataset of this study. This difference may again reflect the non-randomness of genomic sequences.
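
To give a feeling for why a purely random model predicts near-complete uniqueness, the following Python sketch estimates the probability that a tag has no chance match elsewhere in the genome. It is a back-of-the-envelope model and not the calculation from Saha et al. (2002): it assumes independent, equally frequent bases, a genome size of 2.5 × 10^9 bp (an assumed round figure), and a Poisson approximation for the number of chance matches. The function name is hypothetical.

```python
import math


def p_unique(tag_len, genome_size=2.5e9, anchor="CATG"):
    """Approximate probability that a tag has no chance match in a random genome.

    tag_len includes the anchor, as in the text (e.g. 21 for LongSAGE).
    """
    variable = tag_len - len(anchor)
    # CATG is its own reverse complement, so each site yields a tag on both strands.
    n_sites = 2 * genome_size / 4 ** len(anchor)
    expected_chance_matches = n_sites / 4 ** variable
    return math.exp(-expected_chance_matches)


for length in (14, 17, 20, 21):
    print(length, round(p_unique(length), 4))
```

Under these assumptions the probability is close to 1 for 20- and 21-base tags and close to 0 for 14-base tags, so the gap to the observed ~75% has to come from features that a random model ignores, such as repeats and duplicated genes.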

Number of alternative tags per gene
It has been previously reported that most genes have alternative transcripts. By EST data mining, it has turned out that at least 59% of human and 41% of mouse multi-exon genes have alternative splice forms (Brett et al., 2002; Zavolan et al., 2003), and 28.6% of human genes show alternative polyadenylation (Beaudoing and Gautheret, 2001). In our dataset, ~45% of the genes detected had alternative transcripts, which is consistent with above numbers. However, it should be noted that LongSAGE cannot detect all isoforms, i.e. only those leading to the use of alternative 3' most CATG sites. Therefore, most of the differences observed are limited to the 3' part of the corresponding gene. This explains the observed predominance of alternative polyadenylation (LongSAGE tags usually within the same last exon) over alternative splicing (tags in different exons). Interestingly, many alternatively spliced and/or polyadenylated transcripts co-exist in the same tissue (the embryonic mouse tail). This raises the question about the biological significance of both alternative splicing and alternative polyadenylation. It is shown in the classical example of the Drosophila Sex-lethal gene, which, due to alternative splicing, has a male- and a female-specific transcript, thereby determining the sex of the individual (reviewed by Penalva and Sanchez, 2003). Nevertheless, observations that alternative transcripts specific for a certain cell type have a different function might not be the general rule. Since LongSAGE predominantly records changes to the 3' part of a gene (3' alternative splicing and alternative polyadenylation), the coding sequence might not be affected by the observed events. On the other hand, the 3'-UTRs have been reported to be involved in processes like translational regulation, mRNA stability and sub-cellular localization (reviewed by Kuersten and Goodwin, 2003).

Impacts of LongSAGE data on genome annotation
As mentioned in the Introduction section, experience in genome annotation of higher vertebrates has revealed the need for experimentally generated DNA or protein sequences to annotate all genes in the genome. Therefore, we assessed whether LongSAGE could fulfill its promise as a method to assist genome annotation (Saha et al., 2002). Our data strongly suggest that, even at the current state of transcript (full-length cDNA and EST) sequencing projects in the mouse and with the current tools for in silico gene prediction, a significant number of genes may not yet have been annotated in the genome.

By comparing GenScan-predicted genes with the LongSAGE tags that had no hit to either EnsEMBL genes or UniGene clusters, we found a complete overlap in 111 cases (Fig. 2A). Since gene prediction programs often fail to detect some exons, and especially the 3'-UTR of a gene (Rogic et al., 2001), those LongSAGE tags falling between two predicted exons (1039 cases, Fig. 2B) or downstream of the last exon of a predicted transcript (i.e. in the 3'-UTR; 948 cases, Fig. 2C) could be part of these predicted genes, thereby providing experimental support for their being real. As none of the LongSAGE tags used for this analysis was found in any of the gene or transcript databases used, those genes can be considered completely novel. Moreover, we have shown that LongSAGE tags can be utilized to identify the genomic locus of genes that are not annotated through the EnsEMBL pipeline. Therefore, we propose that the genes of 9112 LongSAGE tags supported by aligned cDNA/ESTs (Fig. 3A) are not (or wrongly) annotated through EnsEMBL. As we have described in the accompanying paper (Wahl et al., 2005), among the LongSAGE/transcript sequence pairs, 1260 potential antisense genes (1468 antisense transcripts) were identified, of which only 296 (23%) are included in EnsEMBL. The fact that our strategy could confirm >98% (conflicting in only 95 out of 8406 cases) of the EnsEMBL gene annotations demonstrates the feasibility of our approach.
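The three positional categories distinguished above (tag within a predicted exon, between two predicted exons, or downstream of the last predicted exon) can be sketched as a simple interval test. This is an illustrative sketch only and not the pipeline used in this study. The function `classify_tag`, the category names and in particular the `downstream_window` cut-off are hypothetical assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def classify_tag(tag: Interval, exons: List[Interval], strand: str = "+",
                 downstream_window: int = 2000) -> str:
    """Classify a genome-mapped tag relative to one predicted transcript.

    Returns 'within_exon', 'between_exons', 'downstream' or 'other'.
    downstream_window is a hypothetical cut-off, not a value from the article.
    """
    tag_start, tag_end = tag
    exons = sorted(exons)

    # Tag lies completely inside a predicted exon.
    if any(s <= tag_start and tag_end <= e for s, e in exons):
        return "within_exon"

    # Tag lies completely inside the gap between two consecutive exons.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if left_end < tag_start and tag_end < right_start:
            return "between_exons"

    # Tag lies downstream of the last exon, in transcript orientation.
    if strand == "+":
        last_end = exons[-1][1]
        if last_end < tag_start and tag_end <= last_end + downstream_window:
            return "downstream"
    else:
        first_start = exons[0][0]
        if first_start - downstream_window <= tag_start and tag_end < first_start:
            return "downstream"

    return "other"


predicted = [(100, 200), (500, 600), (900, 1000)]
print(classify_tag((150, 170), predicted))    # inside the first exon
print(classify_tag((300, 320), predicted))    # in the first gap
print(classify_tag((1500, 1520), predicted))  # beyond the last exon
```

Tags that only partially overlap an exon boundary fall into the 'other' category here; how such cases should be treated is a design choice that this sketch leaves open.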


    CONCLUSIONS
 
In conclusion, LongSAGE data are very useful and efficient in identifying and locating transcribed units in the genome. Together with further evidence from gene prediction programs or from transcript sequence alignments to the genome, the complete structure of a gene can be determined independently of any comparative analysis with other species, which would leave out species-specific genes. Even in the presence of pseudogenes, which often cannot be discriminated from their transcribed counterparts by the EnsEMBL algorithm, a LongSAGE tag derived from the expressed gene could in a considerable number of cases be assigned to only one of the candidates. Finally, it should be noted that a significant number of LongSAGE tags with a unique hit to the genome were still not associated with a gene by any of the above approaches, suggesting that many genes have still not been recognized.


    Acknowledgments
 
We thank Rudi Balling (GBF) for valuable comments on this work. This work was supported by the GSF.


    Footnotes
 
{dagger}Present address: Stowers Institute for Medical Research, 1000 E. 50th Street, Kansas City, MO 64110, USA

Received on September 4, 2004; revised on November 23, 2004; accepted on December 2, 2004

    REFERENCES
 

    Akmaev, V.R. and Wang, C.J. (2004) Correction of sequence-based artifacts in serial analysis of gene expression. Bioinformatics, 20, 1254–1263.

    Alexandersson, M., Cawley, S., Pachter, L. (2003) SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res., 13, 496–502.

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

    Beaudoing, E. and Gautheret, D. (2001) Identification of alternate polyadenylation sites and analysis of their tissue distribution using EST data. Genome Res., 11, 1520–1526.

    Birney, E., Andrews, T.D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., et al. (2004a) An overview of Ensembl. Genome Res., 14, 925–928.

    Birney, E., Clamp, M., Durbin, R. (2004b) GeneWise and genomewise. Genome Res., 14, 988–995.

    Brett, D., Pospisil, H., Valcarcel, J., Reich, J., Bork, P. (2002) Alternative splicing and genome complexity. Nat. Genet., 30, 29–30.

    Bult, C.J., Blake, J.A., Richardson, J.E., Kadin, J.A., Eppig, J.T., Baldarelli, R.M., Barsanti, K., Baya, M., Beal, J.S., Boddy, W.J., et al. (2004) The Mouse Genome Database (MGD): integrating biology with the genome. Nucleic Acids Res., 32, D476–D481.

    Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.

    Curwen, V., Eyras, E., Andrews, T.D., Clarke, L., Mongin, E., Searle, S.M., Clamp, M. (2004) The Ensembl automatic gene annotation system. Genome Res., 14, 942–950.

    Ewing, B., Hillier, L., Wendl, M.C., Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res., 8, 175–185.

    Imanishi, T., Itoh, T., Suzuki, Y., O'Donovan, C., Fukuchi, S., Koyanagi, K.O., Barrero, R.A., Tamura, T., Yamaguchi-Kabata, Y., Tanino, M., et al. (2004) Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol., 2, E162.

    Kanehisa, M. and Bork, P. (2003) Bioinformatics in the post-sequence era. Nat. Genet., 33, suppl., 305–310.

    Kitano, H. (2002) Computational systems biology. Nature, 420, 206–210.

    Kuersten, S. and Goodwin, E.B. (2003) The power of the 3' UTR: translational control and development. Nat. Rev. Genet., 4, 626–637.

    Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

    Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J., Altschul, S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.

    Margulies, E.H., Kardia, S.L., Innis, J.W. (2001) A comparative molecular analysis of developing mouse forelimbs and hindlimbs using serial analysis of gene expression (SAGE). Genome Res., 11, 1686–1698.

    Matsumura, H., Reich, S., Ito, A., Saitoh, H., Kamoun, S., Winter, P., Kahl, G., Reuter, M., Kruger, D.H., Terauchi, R. (2003) Gene expression analysis of plant host–pathogen interactions by SuperSAGE. Proc. Natl Acad. Sci. USA, 100, 15718–15723.

    Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 420, 563–573.

    Penalva, L.O. and Sanchez, L. (2003) RNA binding protein sex-lethal (Sxl) and control of Drosophila sex determination and dosage compensation. Microbiol. Mol. Biol. Rev., 67, 343–359.

    Pleasance, E.D., Marra, M.A., Jones, S.J. (2003) Assessment of SAGE in transcript identification. Genome Res., 13, 1203–1215.

    Rogic, S., Mackworth, A.K., Ouellette, F.B. (2001) Evaluation of gene-finding programs on mammalian sequences. Genome Res., 11, 817–832.

    Saha, S., Sparks, A.B., Rago, C., Akmaev, V., Wang, C.J., Vogelstein, B., Kinzler, K.W., Velculescu, V.E. (2002) Using the transcriptome to annotate the genome. Nat. Biotechnol., 20, 508–512.

    Stabenau, A., McVicker, G., Melsopp, C., Proctor, G., Clamp, M., Birney, E. (2004) The Ensembl core software libraries. Genome Res., 14, 929–933.

    Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res., 12, 1611–1618.

    Stein, L. (2001) Genome annotation: from sequence to biology. Nat. Rev. Genet., 2, 493–503.

    Sun, M., Zhou, G., Lee, S., Chen, J., Shi, R.Z., Wang, S.M. (2004) SAGE is far more sensitive than EST for detecting low-abundance transcripts. BMC Genomics, 5, 1.

    Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.

    Wahl, M., Heinzmann, U., Imai, K. (2005) LongSAGE analysis revealed the presence of a large number of novel antisense genes in the mouse genome. Bioinformatics, 21, 1389–1392.

    Wahl, M., Shukunami, C., Heinzmann, U., Hamajima, K., Hiraki, Y., Imai, K. (2004) Transcriptome analysis of early chondrogenesis in ATDC5 cells induced by bone morphogenetic protein 4. Genomics, 83, 45–58.

    Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

    Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., et al. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res., 32, D35–D40.

    Zavolan, M., Kondo, S., Schonbach, C., Adachi, J., Hume, D.A., Hayashizaki, Y., Gaasterland, T. (2003) Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res., 13, 1290–1300.

    Zhang, M.Q. (2002) Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet., 3, 698–709.





