Bioinformatics Advance Access originally published online on December 10, 2004
Bioinformatics 2005 21(8):1393-1400; doi:10.1093/bioinformatics/bti207
LongSAGE analysis significantly improves genome annotation: identifications of novel genes and alternative transcripts in the mouse

1Institute of Developmental Genetics, GSF-National Research Center for Environment and Health Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany
2Institute of Pathology, GSF-National Research Center for Environment and Health Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Owing to their increased length, LongSAGE tags are expected to be more reliable in direct assignment to genome sequences. Therefore, we evaluated the use of LongSAGE data in genome annotation by using our LongSAGE dataset of 202 015 tags (consisting of 41 718 unique tags), experimentally generated from mouse embryonic tail libraries.
Results: A fraction of LongSAGE tags could not be unambiguously assigned to a gene, owing to the presence of widely conserved sequences downstream of particular CATG anchor sites. The presence of alternative forms of transcripts was confirmed in 45% of all detected genes. Surprisingly, a large fraction of LongSAGE tags with hits to the genome (66%) could not be assigned to any gene annotated in EnsEMBL. Among such cases, 2098 LongSAGE tags fell into a region containing a putative gene predicted by GenScan, providing experimental evidence for the presence of real genes, while 9112 genes were found to be missing from or wrongly annotated by the EnsEMBL pipeline.
Conclusions: LongSAGE transcriptome data can significantly improve the genome annotation by identifying novel genes and alternative transcripts, even in the case of thus far best-characterized organisms like the mouse.
Contact: imai{at}gsf.de
| INTRODUCTION |
|---|
The availability of whole genomic sequences of humans (Lander et al., 2001) and major model organisms like mice (Waterston et al., 2002), together with the large datasets of human and mouse full-length cDNA (Imanishi et al., 2004; Okazaki et al., 2002) and expressed sequence tag (EST) sequences (Wheeler et al., 2004), has revolutionized biomedical research. However, the challenge of the so-called post-genomic era will be to extract biological information on a large scale from the available sequence data. This endeavor includes annotation of genes to the genome (reviewed in Stein, 2001) and large-scale gene expression screens (reviewed by Kanehisa and Bork, 2003), and might finally allow, in conjunction with functional data, the modeling and simulation of biological processes, i.e. systems biology (reviewed by Kitano, 2002). While the latter is still difficult to achieve, significant progress has been made in genome annotation and gene expression profiling in recent years.
The annotation of eukaryotic genomes is a very difficult task. The BLAST algorithm alone (Altschul et al., 1990) cannot be used to predict gene structures, since the algorithm has no model for splice sites and hence exon boundaries. Furthermore, the presence of repetitive sequences within transcript sequences, high homology to paralogs of the same protein family and pseudogenes, whose nucleotide sequences are very similar to their transcribed counterparts, complicate the identification of genes on the genomic sequence by homology-based programs. On the other hand, de novo gene prediction programs miss a substantial fraction of genes, especially non-protein coding genes (reviewed by Zhang, 2002), and conversely produce many false-positive predictions (Rogic et al., 2001). Comparative gene prediction programs greatly increase gene prediction specificity (Alexandersson et al., 2003); however, they miss genes outside conserved syntenic regions as well as organism-specific genes. Therefore, a combination of ab initio gene finding algorithms and sequence alignments is currently used for genome annotation, like GeneWise (Birney et al., 2004b), which is used for the EnsEMBL genome annotation project (Curwen et al., 2004).
Real gene identification is ultimately dependent on experimentally generated nucleotide or protein sequence data. Nevertheless, since even millions of EST sequences in humans and mice were not sufficient to cover every single gene in the genome, leaving out especially rarely or low-abundantly expressed genes, more efficient experimental methods are required. Recently it has been shown that Serial Analysis of Gene Expression (SAGE) (Velculescu et al., 1995), a method initially developed for gene expression profiling, is more powerful in transcript detection than EST sequencing (Sun et al., 2004). It is well established that in a significant fraction of the cases a short sequence of 10 bases can be assigned to a single gene as a SAGE tag, immediately following the 3' most NlaIII recognition site (i.e. 5'-CATG-3') in a transcript sequence. Such tag-to-gene assignments have been made and provided as the SAGEmap database (Lash et al., 2000). With the improved version of SAGE, generating tags with the length of 21 bases (LongSAGE), which theoretically can be uniquely assigned to a single genomic position (Saha et al., 2002), LongSAGE might also assist in the correct identification of the genomic locus corresponding to a certain transcript. In the present study, we performed the first LongSAGE analysis in the mouse by generating multiple libraries from mouse embryonic tail tissues. Subsequently, we analyzed various aspects of the dataset in detail and addressed the use of LongSAGE in annotating genes to the genome as well as in detecting novel genes. Our analyses clearly indicate that LongSAGE transcriptome data can significantly improve the genome annotation, despite the fact that information on genes and transcripts in the mouse is the most advanced among vertebrate models.
| MATERIALS AND METHODS |
|---|
LongSAGE library construction and data acquisition
LongSAGE library construction and data acquisition are described in detail in the accompanying paper (Wahl et al., 2005). The complete LongSAGE tag dataset used in this study is available at the Gene Expression Omnibus (GEO) database at NCBI under the accession number GSM26978.
Tag-to-gene mapping
All programs for the data processing described in the following paragraphs were written in Perl using the BioPerl (Stajich et al., 2002) and EnsEMBL Perl APIs (Stabenau et al., 2004).
UniGene: Instead of the SAGEmap database, we used our own algorithm to define tag-to-gene assignments: in contrast to the SAGEmap database, we used only sequences reliably containing the 3' end of a transcript. Therefore, all sequences of a UniGene release (build #133) were categorized according to their descriptions into full-length, 3', 5' and others, and were examined for the presence of a canonical polyadenylation signal (up to 50 bases upstream of a polyadenylated tail) and/or the polyadenylated tail. All full-length and 3' annotated sequences were taken when they contained at least a polyadenylation signal or a polyadenylated tail, while 5' sequences were considered only when they harbored both. Only sequences containing both polyadenylation signal and tail in a reverse-complement orientation were changed to 5' to 3' orientation. For each sequence of a UniGene cluster passing the above criteria, the 10- to 17-base sequence downstream of the 3' most CATG was taken as the SAGE tag (UniGene all, Fig. 1). For the dataset of reliable tag-to-gene mappings (UniGene reliable, Fig. 1), we considered only those SAGE tags that were derived from at least 10% of all 3' sequences of the corresponding UniGene cluster or from a full-length cDNA sequence.
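The tag extraction and the 10% reliability rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Perl code used in the study: the function names `extract_tag` and `reliable_tags` are hypothetical, the input sequences are assumed to be already oriented 5' to 3' and filtered for polyadenylation evidence, and tags cut short by the end of the sequence are simply discarded.

```python
from collections import Counter
from typing import Iterable, Optional, Set

ANCHOR = "CATG"  # anchor site named in the text (NlaIII recognition sequence)


def extract_tag(transcript: str, tag_length: int = 17) -> Optional[str]:
    """Return the bases directly downstream of the 3'-most CATG.

    `tag_length` counts only bases after the anchor. Returns None when the
    transcript has no CATG or too few bases follow it (an assumption of
    this sketch; the source does not say how truncated tags were handled).
    """
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR)  # rfind gives the 3'-most occurrence
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def reliable_tags(three_prime_tags: Iterable[Optional[str]],
                  full_length_tags: Iterable[Optional[str]],
                  min_fraction: float = 0.10) -> Set[str]:
    """Keep tags seen in >= min_fraction of a cluster's 3' sequences,
    plus every tag derived from a full-length cDNA sequence."""
    three_prime = list(three_prime_tags)
    counts = Counter(t for t in three_prime if t is not None)
    total = len(three_prime)
    keep = {t for t, n in counts.items() if total and n / total >= min_fraction}
    keep.update(t for t in full_length_tags if t is not None)
    return keep
```

Using `rfind` keeps the sketch close to the wording of the text ("the 3' most CATG"); a production version would also need to decide whether a tag may run into the polyadenylated tail.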
MGD: For the entries in the Mouse Genome Database (MGD, based upon the MGD report file, release of January 16, 2004; downloaded from ftp.jax.org), the corresponding UniGene cluster(s) were determined from the MGD database report. To link an MGD ID to EnsEMBL genes and EST genes, the MarkerSymbol entry in the EnsEMBL db_xref database table, which corresponds to the official symbol in MGD, was taken. All LongSAGE tags mapped onto the UniGene cluster(s) and/or EnsEMBL gene(s)/EST gene(s) of an MGD ID were assigned to that MGD ID. In order to avoid multiple hit cases, a LongSAGE tag was taken only when it could not be assigned to another MGD ID by the same procedure.
EnsEMBL: EnsEMBL genes and EST genes (database version 19_30, based on the NCBI mouse genome assembly, build 30), which had an exon completely overlapping with an experimentally obtained LongSAGE tag on the genomic sequence, were associated with the LongSAGE tag. When exons of two or more EnsEMBL genes and/or EST genes overlapped with the same LongSAGE tag on the genomic sequence, such genes were considered redundant, which was often the case for EnsEMBL and EST genes. The gene with the longest cDNA sequence was taken as the representative hit. When a LongSAGE tag was mapped onto an EnsEMBL gene or EST gene at two different genomic locations, it was considered a multiple hit.
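A minimal Python sketch of the exon-containment test and the choice of a representative gene is given below. The `Gene` record, its fields and the function names are hypothetical and do not reflect the EnsEMBL Perl API; coordinates are assumed to be inclusive positions on a single chromosome, and strand is ignored for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


@dataclass(frozen=True)
class Gene:
    gene_id: str
    cdna_length: int
    exons: Tuple[Interval, ...]


def exon_contains_tag(exon: Interval, tag_start: int, tag_end: int) -> bool:
    """True when the tag lies completely inside the exon."""
    return exon[0] <= tag_start and tag_end <= exon[1]


def representative_gene(genes: Iterable[Gene],
                        tag_start: int, tag_end: int) -> Optional[Gene]:
    """Among genes with an exon fully covering the tag, return the one
    with the longest cDNA; None when no gene qualifies."""
    hits = [g for g in genes
            if any(exon_contains_tag(e, tag_start, tag_end) for e in g.exons)]
    return max(hits, key=lambda g: g.cdna_length, default=None)
```

Detecting multiple hits would be a separate step: the same tag sequence would have to be located at more than one genomic position before this function is applied to each position.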
Pseudogenes
When a single UniGene cluster was represented by multiple distinct EnsEMBL genes and/or EST genes in the EnsEMBL database, we considered only one of these EnsEMBL genes to correspond to the real gene, whereas the others were considered to be pseudogenes. To identify those cases, we first determined the EnsEMBL gene or EST gene corresponding to every UniGene cluster. To this end, the representative sequence of each UniGene cluster (Mm.seq.uniq) was analyzed for sequence homology by BLAST against all EnsEMBL transcripts. For each UniGene sequence, the best hit, having a minimum of 92% identity over at least 250 bp in the same orientation, was considered as the corresponding EnsEMBL gene (primary hit), thereby allowing multiple UniGene clusters per EnsEMBL gene/EST gene, but only one gene/EST gene per UniGene cluster. Next, a second comparison with those EnsEMBL genes and EST genes that had not yet been assigned to a UniGene cluster as primary hit was conducted in a similar manner. If one of those genes, not assigned in the first round, shared a sequence similarity of at least 92% over at least 250 bp with a UniGene cluster (secondary EnsEMBL hit for the UniGene cluster), this particular UniGene cluster had to be represented by multiple genes in the EnsEMBL genome annotations. Otherwise, i.e. if the two (or more) genes corresponded to two distinct transcribed genes, two (or more) distinct UniGene clusters would have to exist for those genes. To take into account the redundancy between EnsEMBL genes and EST genes, those secondary hits having an overlapping chromosomal localization with primary hits were left out. All UniGene clusters associated with primary and secondary hits were considered as gene-pseudogene pairs.
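The two-round assignment can be illustrated with the following Python sketch, which operates on already parsed BLAST hits. Several points are assumptions made for illustration: the `Hit` record and function names are hypothetical, "best hit" is taken to mean the highest alignment score, the same-orientation requirement is applied in both rounds, and the final removal of secondary hits that overlap a primary hit on the chromosome is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Hit:
    cluster: str       # UniGene cluster ID
    gene: str          # EnsEMBL gene or EST gene ID
    identity: float    # percent identity of the alignment
    length: int        # aligned length in bp
    same_strand: bool  # alignment in the same orientation
    score: float       # alignment score used to rank hits (assumed)


def passes(hit: Hit, min_identity: float = 92.0, min_length: int = 250) -> bool:
    """Thresholds quoted in the text: >= 92% identity over >= 250 bp."""
    return (hit.same_strand and hit.identity >= min_identity
            and hit.length >= min_length)


def primary_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Round 1: one best-scoring gene per UniGene cluster."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if passes(h) and (h.cluster not in best or h.score > best[h.cluster].score):
            best[h.cluster] = h
    return {cluster: h.gene for cluster, h in best.items()}


def secondary_hits(hits: Iterable[Hit], primary: Dict[str, str]) -> Dict[str, Set[str]]:
    """Round 2: genes not used as a primary hit that still match a cluster."""
    assigned = set(primary.values())
    secondary: Dict[str, Set[str]] = {}
    for h in hits:
        if passes(h) and h.cluster in primary and h.gene not in assigned:
            secondary.setdefault(h.cluster, set()).add(h.gene)
    return secondary
```

Clusters that appear as keys in the result of `secondary_hits` are the candidates for gene-pseudogene pairs in this sketch.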
Number of LongSAGE tags per gene
EnsEMBL: All LongSAGE tags derived from the embryonic tail libraries that could be directly and uniquely mapped onto an EnsEMBL gene or EST gene were evaluated. If multiple LongSAGE tags were assigned to the same EnsEMBL gene or EST gene, the location of these LongSAGE tags in the EnsEMBL gene was determined. In the case of multiple overlapping genes, the representative one was used (see above). If the gene had overlapping exons (e.g. different polyadenylation sites in the last exon), the longest one was used.
MGD: For each MGD entry, the mapping LongSAGE tags were retrieved and the number of unique tags in the dataset per MGD ID was counted.
Support of GenScan prediction by LongSAGE tags
Only those LongSAGE tags that could not be assigned to either a UniGene cluster or an EnsEMBL gene or EST gene, but mapped onto the genomic sequence were used for this analysis. For each hit to the genome, genes predicted by GenScan were retrieved through the EnsEMBL database (predictionTranscripts). All LongSAGE tags within GenScan predictions were directly associated and LongSAGE tags between two predicted exons or up to 2000 bp downstream of the predicted gene were considered as potential hits.
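The three positional categories used for GenScan predictions can be sketched in Python as shown below. The function name, the category labels and the treatment of strand are hypothetical choices made for this illustration; coordinates are assumed to be inclusive, and a tag that only partially overlaps a predicted exon is left unclassified, a case the text does not address.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


def classify_tag(tag_start: int, tag_end: int, exons: List[Interval],
                 strand: int = 1, max_downstream: int = 2000) -> Optional[str]:
    """Classify a tag relative to one predicted gene.

    Returns "within_exon", "between_exons", "downstream" or None.
    """
    exons = sorted(exons)
    gene_start, gene_end = exons[0][0], exons[-1][1]

    # Direct association: tag completely inside a predicted exon.
    if any(s <= tag_start and tag_end <= e for s, e in exons):
        return "within_exon"

    # Potential hit: tag inside the gene span but touching no exon.
    overlaps_exon = any(s <= tag_end and tag_start <= e for s, e in exons)
    if not overlaps_exon and gene_start < tag_start and tag_end < gene_end:
        return "between_exons"

    # Potential hit: tag up to max_downstream bp past the 3' end of the gene.
    if strand == 1 and 0 < tag_start - gene_end <= max_downstream:
        return "downstream"
    if strand == -1 and 0 < gene_start - tag_end <= max_downstream:
        return "downstream"
    return None
```

For example, with predicted exons at 1000-1200 and 2000-2300 on the forward strand, a tag at 1500-1520 falls between two predicted exons, while a tag at 2500-2520 lies 200 bp downstream of the last predicted exon.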
Verification of cDNA/EST alignments by LongSAGE tag
We examined for every LongSAGE tag mapped onto the genome (including the multiple-hits) whether one or more cDNA or EST sequences could be aligned to the genomic position of the LongSAGE tag. We first retrieved the overlapping dna_align_features from the EnsEMBL databases core, estgene and est, which hold all high-scoring sequence pairs (HSPs) of a global BLAT analysis of all transcript sequences against the genome. To ensure that the alignment between transcript and genome sequence is real, we then required the presence of splicing donor (GT) or splicing acceptor (AG) immediately downstream or upstream of the aligned segments. In the case of potential single-exon genes, alignments were also accepted if the whole sequence (at least 95%) could be aligned to the genomic position without any gap (Supplementary Figure 1). For each HSP, a minimum percent identity of 97% was required. LongSAGE tags overlapping with a dna_align_feature at multiple different genomic locations were not processed. Next, the aligned transcript sequence(s) was compared to annotated genes or EST genes. A LongSAGE tag was considered to correspond to a particular gene/EST gene in the following cases. First, if one transcript sequence overlapping with a LongSAGE tag also overlapped with an EnsEMBL exon associated with at least splice donor or acceptor on the same or the opposite strand with a maximal distance of 10 kb to the LongSAGE tag (Supplementary Figure 2A). In the latter case (opposite strand), the LongSAGE tag was categorized as an antisense tag (Supplementary Figure 2B). Second, if the transcript sequence showed a minimum percent identity of 95% over at least 150 bp to the EnsEMBL transcript maximal 10 kb distant from the LongSAGE tag on the same or different strand (Supplementary Figure 2C and D). For more detailed descriptions about antisense tags and genes, see the accompanying paper (Wahl et al., 2005). 
All transcript sequence(s)/LongSAGE tag combinations that could not be assigned to any EnsEMBL gene in the above strategy were considered as novel. LongSAGE tags with multiple hits to the genome were included in the final numbers, only when for one single genomic locus one or more transcript sequence(s) could be aligned.
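The acceptance criteria for an aligned segment can be illustrated with a short Python sketch. It is a simplification under explicit assumptions: only the forward strand is handled, coordinates are 0-based and end-exclusive, and the names `has_splice_signal` and `accept_alignment` as well as the `aligned_fraction` and `gapped` parameters are hypothetical.

```python
def has_splice_signal(genome: str, start: int, end: int) -> bool:
    """Check for an acceptor (AG) directly upstream or a donor (GT)
    directly downstream of the aligned segment genome[start:end]."""
    genome = genome.upper()
    acceptor = genome[max(start - 2, 0):start] == "AG"
    donor = genome[end:end + 2] == "GT"
    return acceptor or donor


def accept_alignment(genome: str, start: int, end: int, identity: float,
                     aligned_fraction: float = 0.0, gapped: bool = True,
                     min_identity: float = 97.0,
                     min_fraction: float = 0.95) -> bool:
    """Accept a high-scoring pair if it is >= 97% identical and either
    flanked by a splice signal or part of an ungapped alignment covering
    at least 95% of the transcript (potential single-exon gene)."""
    if identity < min_identity:
        return False
    if has_splice_signal(genome, start, end):
        return True
    return (not gapped) and aligned_fraction >= min_fraction
```

A reverse-strand alignment would need the complementary dinucleotides (CT and AC) checked on the opposite sides, which this sketch leaves out.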
| RESULTS |
|---|
Assignment of LongSAGE tags to genes
Table 1 shows the statistics for the reliable assignment of LongSAGE tags to UniGene clusters (Wheeler et al., 2004), genomic sequence and EnsEMBL genes (Birney et al., 2004a). The percentage of LongSAGE tags that were not mapped onto genes (UniGene and/or EnsEMBL genes) was inversely proportional to the tag abundance, which is in concordance with previous reports (Margulies et al., 2001; Wahl et al., 2004).
A detailed analysis of the 41 714 types of LongSAGE tags collected in our experiment against the genomic sequence is summarized in Table 2. Unique genome hits were confirmed in 21 904 cases (53%), whereas 4087 tags (10%) hit multiple genome sites and 15 723 tags (38%) did not show any hit in the genome. Of the 21 904 tags with a unique hit to the genomic sequence, 6317 tags matched to both UniGene and EnsEMBL databases, 2510 LongSAGE tags could be assigned only to an EnsEMBL gene and 3729 tags detected in the genomic sequence could not be mapped onto any EnsEMBL gene, but only to a UniGene cluster. Interestingly, 9348 tags could not be mapped onto either UniGene or EnsEMBL genes. Further investigations on these 9348 tags are presented later in separate sections.
Surprisingly, a considerable portion of LongSAGE tags had no hit to the genomic sequence (38%). To determine whether this observation was due to gaps or sequencing errors in the genome sequence assembly, or due to exon-spanning LongSAGE tags, we analyzed how many of these LongSAGE tags were represented by a UniGene cluster. We could only identify 1097 cases (out of 15 723) where a no-hit tag to the genome was included in transcript sequences (UniGene). Since the phenomenon of no-hit tags to the genome was predominantly seen in single-count tags, we assumed that it was due to sequence errors introduced during the PCR or sequencing step. Even though the base-calling program Phred (Ewing et al., 1998) was used and the raw ditags were processed by the SAGEScreen algorithm, which tries to correct wrong LongSAGE tags (Akmaev and Wang, 2004), this possibility remains. Indeed, an analysis of the abundance of all derivatives of linker tags (allowing up to two substitutions and up to one deletion or insertion) relative to the count of their parent before SAGEScreen correction yielded an error rate of 11-15%, which could be extrapolated to 22 000-30 000 tags. This observation is consistent with the considerations and analyses of other SAGE data (Akmaev and Wang, 2004).
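The extrapolation can be checked with a short calculation. Reading the error rate as 11 to 15% and assuming that it is applied to the total of 202 015 sequenced tags given in the Abstract (the text does not state the base explicitly), the expected number of erroneous tags is:

```latex
\begin{align*}
0.11 \times 202\,015 &\approx 22\,222 \\
0.15 \times 202\,015 &\approx 30\,302
\end{align*}
```

Both values agree with the stated range of roughly 22 000 to 30 000 tags.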
Relationship between tag length and unique assignment
Figure 1 depicts the frequencies of no-hits, single-hits and multiple-hits to either UniGene clusters or the genome sequence, in order to provide a basis for determining the minimum length of a SAGE tag (including the CATG anchor sequence) to be considered unique to the corresponding database. In order to avoid artificial LongSAGE tags due to PCR and/or sequencing artefacts, we concentrated on LongSAGE tags with a minimum count of three for this analysis. These LongSAGE tags were artificially trimmed by up to 7 bases at the 3'-end. For the assignment to UniGene, a plateau was reached at a tag length of 16-17 bases (Fig. 1A and B), and interestingly, an increased tag length did not reduce the number of multiple hits. In contrast, for the mapping to the genome, each additional base increased specificity, whereas the number of no-hit cases remained low (Fig. 1C). In the case of single-hit matching to the genome sequence, with increasing tag lengths from 16 to 20, we observed a significant increase in the percentage of unique genome assignments from ~20 to ~75% in our experimental dataset. However, this increase reached a plateau with the tag length of 20, and the incidence of single-hit matching did not significantly increase any more with 21-base LongSAGE tags.
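The trimming experiment can be mimicked on a toy scale with the Python sketch below. It is an illustration only: the function names are hypothetical, matching is exact and restricted to the forward strand of a single reference string, and real data would call for an indexed search rather than repeated `str.find` calls.

```python
from collections import Counter
from typing import Dict, List


def count_hits(tag: str, reference: str) -> int:
    """Count exact (possibly overlapping) occurrences of tag in reference."""
    count, pos = 0, reference.find(tag)
    while pos != -1:
        count += 1
        pos = reference.find(tag, pos + 1)
    return count


def hit_class(n_hits: int) -> str:
    """Map a hit count to the three categories used in Figure 1."""
    if n_hits == 0:
        return "no-hit"
    return "single-hit" if n_hits == 1 else "multiple-hit"


def uniqueness_by_trim(tags: List[str], reference: str,
                       max_trim: int = 7) -> Dict[int, Counter]:
    """For each number of bases trimmed from the 3' end (0..max_trim),
    tally how many tags fall into each hit category."""
    result: Dict[int, Counter] = {}
    for trim in range(max_trim + 1):
        tally: Counter = Counter()
        for tag in tags:
            trimmed = tag[:len(tag) - trim]
            tally[hit_class(count_hits(trimmed, reference))] += 1
        result[trim] = tally
    return result
```

In such a tally, the share of single-hit tags typically shrinks as more bases are trimmed, because shorter tags are more likely to occur at several positions.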
Uniqueness of a LongSAGE tag in the presence of pseudogenes
Pseudogenes are copies of transcribed genes with very high sequence similarity to the coding region of the corresponding real gene, but are not transcribed. Therefore, it is often difficult to discriminate a real, transcribed gene represented by EST or cDNA sequences from its non-transcribed pseudogene(s) on the genome sequence. Thus, we asked how many LongSAGE tags derived from a real gene that had one or more highly similar pseudogene(s) were found only once in the genome sequence, i.e. were present only in the transcribed gene, but not in the pseudogene(s). We first identified potential pseudogenes annotated in EnsEMBL with a strategy explained in detail in the Materials and Methods section. As summarized in Table 3, out of 1144 potential gene/pseudogene(s) pairs with one or more LongSAGE tag(s) observed in our dataset, for 457 (339 + 118) genes at least one LongSAGE tag was unique to the genome, but in 687 cases, an identical LongSAGE tag was observed in both gene and pseudogene(s).
Number of tags per gene
As we often observed that different LongSAGE tags were mapped onto the same gene, we analyzed the number of LongSAGE tags for every gene. For this purpose, the direct use of UniGene clusters was not the best choice, since a significant number of genes are known to be represented in multiple UniGene clusters. Therefore, we started with all genes with entries in the MGD (Bult et al., 2004) and retrieved the possible LongSAGE tags through the UniGene clusters assigned to entries in MGD. Of a total of 9088 MGD entries with LongSAGE tags detected in our libraries, in ~45% (4165 tags) of the cases, more than one kind of tag was observed in the dataset, and ~21% (1896 tags) were represented three or more times (Table 4, MGD entries).
To determine the exact localizations of the LongSAGE tags within the genes, we performed the same analysis on EnsEMBL genes, for which the exon-intron structure was known (Table 4, Mapped to EnsEMBL). Altogether, fewer LongSAGE tags could be assigned to EnsEMBL genes, resulting in a total of 7179 genes analyzed and only 1933 genes with multiple unique tags. Among these, 1066 (848 + 218) EnsEMBL genes had LongSAGE tags located in different exons and in 1303 (1085 + 218) cases, more than one LongSAGE tag was assigned to the same exon.
LongSAGE tags providing experimental support for predicted genes
Since 9348 LongSAGE tags that were mapped onto the genome lacked a reliable hit to UniGene and could not be associated with an EnsEMBL gene (Table 2), we analyzed how many of these LongSAGE tags overlapped with an ab initio prediction by GenScan (Burge and Karlin, 1997). As GenScan is known to fail in detecting some of the exons of a gene and is unable to predict untranslated regions (UTRs), we also included those LongSAGE tags in between two predicted exons or downstream of the last predicted exon (possibly in the 3'-UTR), as long as the distance was not >2000 bp. As illustrated in Figure 2, a total of 2098 LongSAGE tags were located in the region of a predicted gene, including 111 LongSAGE tags within predicted exons, 1039 LongSAGE tags between two predicted exons and 948 LongSAGE tags downstream of the last predicted exon.
Evidence for novel genes, novel exons and alternative polyadenylation sites
Surprisingly, of the 21 904 LongSAGE tags with a single hit to the genome sequence, 13 077 (9348 + 3729; Table 2) tags could not be assigned to an EnsEMBL gene. However, 3729 of these LongSAGE tags without an EnsEMBL gene hit could be assigned to a UniGene cluster. This observation reflected the fact that a certain fraction of UniGene entries were not included in the EnsEMBL genome annotation, and at the same time suggested that LongSAGE data might be useful to link such non-annotated UniGene entries to their genome locations. Therefore, we addressed this possibility by determining transcript sequences that could be aligned to the chromosomal position of each LongSAGE tag mapped onto the genome. As shown in Figure 3, for a total of 18 205 unique LongSAGE tags, a transcript sequence overlapped with the LongSAGE tag on the genomic sequence. In 9112 cases, there was no EnsEMBL gene annotated to the chromosomal position of the LongSAGE tag (Fig. 3A). In 207 cases, the LongSAGE tag supported by a transcript sequence was located between two exons of an annotated gene (Fig. 3C), and in 480 cases downstream of the 3'-UTR of an EnsEMBL gene (Fig. 3D).
| DISCUSSION |
|---|
Optimal length of SAGE tags
For the best performance of SAGE, it is necessary to generate tags as short as possible that can still be reliably assigned to the corresponding genes. As illustrated in Figure 1, against both transcript (UniGene) and genome databases, beyond a certain tag length the number of multiple hits does not significantly decrease with increased tag length, reflecting the fact that transcript and genome sequences are not purely random. The fact that the graph approaches a plateau could be explained by the existence of highly homologous genes or repetitive sequences, leading to the same SAGE tags for multiple transcript sequences or for multiple genomic locations. This notion is in agreement with a previous report on invertebrates such as Caenorhabditis elegans and Drosophila melanogaster (Pleasance et al., 2003). This indicates that not all transcripts will be covered by a single anchoring enzyme (here: Hsp92II, cutting at the recognition sequence CATG). However, by generating parallel LongSAGE libraries from the same mRNA material using different anchoring enzymes, this limitation can be overcome to some extent, whereby the additional tiny fraction of transcripts that do not contain a CATG may also be covered. According to the results shown in Figure 1, a 17-base SAGE tag (including CATG) is sufficient for a unique assignment to transcript sequences. Owing to the larger size and higher complexity of the genome sequence, a SAGE tag has to be longer to be uniquely assigned to the genome. Since the number of multiple hits only marginally decreases (and accordingly the percentage of single-hit matching to the genome does not improve) between 20 and 21 bases, LongSAGE tags seem to be adequate. However, this will have to be experimentally verified by generating even longer SAGE tags with the recently published method SuperSAGE (Matsumura et al., 2003), which utilizes 26-base long tags.
Concerning the tag uniqueness in the genome, the maximal incidence of unique assignment theoretically calculated in the original LongSAGE article (Saha et al., 2002) is almost 100%. This maximal level is also reached with 20-base long tags, but is significantly higher than the one from our observation (~75%) based on the experimental dataset from this study. This difference may again reflect the non-randomness of the genomic sequences.
Number of alternative tags per gene
It has been previously reported that most genes have alternative transcripts. EST data mining has shown that at least 59% of human and 41% of mouse multi-exon genes have alternative splice forms (Brett et al., 2002; Zavolan et al., 2003), and 28.6% of human genes show alternative polyadenylation (Beaudoing and Gautheret, 2001). In our dataset, ~45% of the genes detected had alternative transcripts, which is consistent with the above numbers. However, it should be noted that LongSAGE cannot detect all isoforms, but only those leading to the use of alternative 3' most CATG sites. Therefore, most of the differences observed are limited to the 3' part of the corresponding gene. This explains the observed predominance of alternative polyadenylation (LongSAGE tags usually within the same last exon) over alternative splicing (tags in different exons). Interestingly, many alternatively spliced and/or polyadenylated transcripts co-exist in the same tissue (the embryonic mouse tail). This raises the question of the biological significance of both alternative splicing and alternative polyadenylation. Such significance is shown in the classical example of the Drosophila Sex-lethal gene, which, owing to alternative splicing, has a male- and a female-specific transcript, thereby determining the sex of the individual (reviewed by Penalva and Sanchez, 2003). Nevertheless, the observation that alternative transcripts specific for a certain cell type have a different function might not be the general rule. Since LongSAGE predominantly records changes to the 3' part of a gene (3' alternative splicing and alternative polyadenylation), the coding sequence might not be affected by the observed events. On the other hand, the 3'-UTRs have been reported to be involved in processes like translational regulation, mRNA stability and sub-cellular localization (reviewed by Kuersten and Goodwin, 2003).
Impacts of LongSAGE data on genome annotation
As mentioned in the Introduction section, the experience in genome annotation of higher vertebrates has revealed the need for experimentally generated DNA or protein sequences to annotate all genes to the genome. Therefore, we assessed whether LongSAGE could fulfill its promise as a method to assist genome annotation (Saha et al., 2002). Our data strongly suggest that even at the current state of transcript (full-length cDNA and EST) sequencing projects in the mouse and with the current tools for in silico gene prediction, a significant number of genes may not have been annotated in the genome.
By comparing GenScan predicted genes and the LongSAGE tags with no hit to either EnsEMBL genes or UniGene clusters, we found a complete overlap in 111 cases (Fig. 2A). Since gene prediction programs often fail to detect some exons and especially the 3'-UTR of a gene (Rogic et al., 2001), those LongSAGE tags falling in between two predicted exons (1039 cases, Fig. 2B) or downstream of the last exon of a predicted transcript (i.e. 3'-UTR; 948 cases, Fig. 2C) could be part of these predicted genes, thereby providing experimental support for them being real. As none of the LongSAGE tags used for this analysis was found in any of the gene or transcript databases used, those genes can be considered completely novel. Moreover, we have shown that LongSAGE tags can be utilized to identify the genomic locus of genes that are not annotated through the EnsEMBL pipeline. Therefore, we propose that the genes of 9112 LongSAGE tags supported by aligned cDNA/ESTs (Fig. 3A) are not (or wrongly) annotated through EnsEMBL. As we have described in the accompanying paper (Wahl et al., 2005), among the LongSAGE/transcript sequence pairs, 1260 potential antisense genes (1468 antisense transcripts) were identified, of which only 296 (23%) are included in EnsEMBL. The fact that our strategy could confirm >98% (conflicting in only 95 out of 8406 cases) of the EnsEMBL gene annotations supports the feasibility of our approach.
| CONCLUSIONS |
|---|
In conclusion, LongSAGE data are very useful and efficient in identifying and locating transcribed units in the genome. Together with further evidence from gene prediction programs or transcript sequence alignments to the genome, the complete structure of a gene can be determined, independent of any comparative analysis to other species, which would leave out species-specific genes. Even in the presence of pseudogenes, which often cannot be discriminated from their transcribed copies by the EnsEMBL algorithm, in a considerable number of cases a LongSAGE tag derived from the expressed gene could be assigned to only one of the candidates. Finally, it should be noted that a significant number of LongSAGE tags with a unique hit to the genome were still not associated with a gene by any of the above approaches, suggesting that many genes have still not been recognized.
| Acknowledgments |
|---|
We thank Rudi Balling (GBF) for valuable comments on this work. This work was supported by the GSF.
| Footnotes |
|---|
Present address: Stowers Institute for Medical Research, 1000 E. 50th Street, Kansas City, MO 64110, USA
Received on September 4, 2004; revised on November 23, 2004; accepted on December 2, 2004
| REFERENCES |
|---|
Akmaev, V.R. and Wang, C.J. (2004) Correction of sequence-based artifacts in serial analysis of gene expression. Bioinformatics, 20, 1254-1263.
Alexandersson, M., Cawley, S., Pachter, L. (2003) SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res., 13, 496-502.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Beaudoing, E. and Gautheret, D. (2001) Identification of alternate polyadenylation sites and analysis of their tissue distribution using EST data. Genome Res., 11, 1520-1526.
Birney, E., Andrews, T.D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., et al. (2004a) An overview of Ensembl. Genome Res., 14, 925-928.
Birney, E., Clamp, M., Durbin, R. (2004b) GeneWise and genomewise. Genome Res., 14, 988-995.
Brett, D., Pospisil, H., Valcarcel, J., Reich, J., Bork, P. (2002) Alternative splicing and genome complexity. Nat. Genet., 30, 29-30.
Bult, C.J., Blake, J.A., Richardson, J.E., Kadin, J.A., Eppig, J.T., Baldarelli, R.M., Barsanti, K., Baya, M., Beal, J.S., Boddy, W.J., et al. (2004) The Mouse Genome Database (MGD): integrating biology with the genome. Nucleic Acids Res., 32, D476-D481.
Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78-94.
Curwen, V., Eyras, E., Andrews, T.D., Clarke, L., Mongin, E., Searle, S.M., Clamp, M. (2004) The Ensembl automatic gene annotation system. Genome Res., 14, 942-950.
Ewing, B., Hillier, L., Wendl, M.C., Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res., 8, 175-185.
Imanishi, T., Itoh, T., Suzuki, Y., O'Donovan, C., Fukuchi, S., Koyanagi, K.O., Barrero, R.A., Tamura, T., Yamaguchi-Kabata, Y., Tanino, M., et al. (2004) Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol., 2, E162.
Kanehisa, M. and Bork, P. (2003) Bioinformatics in the post-sequence era. Nat. Genet., 33, suppl., 305-310.
Kitano, H. (2002) Computational systems biology. Nature, 420, 206-210.
Kuersten, S. and Goodwin, E.B. (2003) The power of the 3' UTR: translational control and development. Nat. Rev. Genet., 4, 626-637.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J., Altschul, S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051-1060.
Margulies, E.H., Kardia, S.L., Innis, J.W. (2001) A comparative molecular analysis of developing mouse forelimbs and hindlimbs using serial analysis of gene expression (SAGE). Genome Res., 11, 1686-1698.
Matsumura, H., Reich, S., Ito, A., Saitoh, H., Kamoun, S., Winter, P., Kahl, G., Reuter, M., Kruger, D.H., Terauchi, R. (2003) Gene expression analysis of plant host-pathogen interactions by SuperSAGE. Proc. Natl Acad. Sci. USA, 100, 15718-15723.
Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 420, 563-573.
Penalva, L.O. and Sanchez, L. (2003) RNA binding protein sex-lethal (Sxl) and control of Drosophila sex determination and dosage compensation. Microbiol. Mol. Biol. Rev., 67, 343359
Pleasance, E.D., Marra, M.A., Jones, S.J. (2003) Assessment of SAGE in transcript identification. Genome Res, 13, 12031215
Rogic, S., Mackworth, A.K., Ouellette, F.B. (2001) Evaluation of gene-finding programs on mammalian sequences. Genome Res., 11, 817832
Saha, S., Sparks, A.B., Rago, C., Akmaev, V., Wang, C.J., Vogelstein, B., Kinzler, K.W., Velculescu, V.E. (2002) Using the transcriptome to annotate the genome. Nat. Biotechnol., 20, 508512[CrossRef][ISI][Medline].
Stabenau, A., McVicker, G., Melsopp, C., Proctor, G., Clamp, M., Birney, E. (2004) The Ensembl core software libraries. Genome Res., 14, 929933
Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res., 12, 16111618
Stein, L. (2001) Genome annotation: from sequence to biology. Nat. Rev. Genet., 2, 493503[ISI][Medline].
Sun, M., Zhou, G., Lee, S., Chen, J., Shi, R.Z., Wang, S.M. (2004) SAGE is far more sensitive than EST for detecting low-abundance transcripts. BMC Genomics, 5, 1[CrossRef][Medline].
Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W. (1995) Serial analysis of gene expression. Science, 270, 484487
Wahl, M., Heinzmann, U., Imai, K. (2005) LongSAGE analysis revealed the presence of a large number of novel antisense genes in the mouse genome. Bioinformatics, 21, 13911394.
Wahl, M., Shukunami, C., Heinzmann, U., Hamajima, K., Hiraki, Y., Imai, K. (2004) Transcriptome analysis of early chondrogenesis in ATDC5 cells induced by bone morphogenetic protein 4. Genomics, 83, 4558[CrossRef][ISI][Medline].
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520562[CrossRef][Medline].
Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., et al. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res., 32, D35D40
Zavolan, M., Kondo, S., Schonbach, C., Adachi, J., Hume, D.A., Hayashizaki, Y., Gaasterland, T. (2003) Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res., 13, 12901300
Zhang, M.Q. (2002) Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet., 3, 698709[CrossRef][ISI][Medline].