Bioinformatics Advance Access originally published online on February 3, 2007
Bioinformatics 2007 23(7):903-905; doi:10.1093/bioinformatics/btm023
Automatic correspondence of tags and genes (ACTG): a tool for the analysis of SAGE, MPSS and SBS data
1Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, 2Laboratory of Computational Biology, Ludwig Institute for Cancer Research, São Paulo Branch, Brazil, 3Department of Genetics, Harvard Medical School, 4Howard Hughes Medical Institute, Harvard Medical School, 5Decision Systems Group, Brigham and Women's Hospital, 6Division of Health Sciences and Technology, Harvard Medical School and MIT, Boston, 7Department of Organismic and Evolutionary Biology/FAS, Harvard University, and 8Department of Developmental Biology, Harvard School of Dental Medicine, Harvard University, Cambridge, MA, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: A critical step in any SAGE, MPSS or SBS data analysis is tag-to-gene assignment. Currently available tools are limited by a tag-by-tag annotation process and/or do not provide the dataset that is used to produce a complete tag-to-gene mapping. We developed ACTG, a web-based application that allows large-scale tag-to-gene mapping using several reference datasets. ACTG can annotate SAGE (14 or 21 bp), MPSS (17 or 20 bp) and SBS (16 bp) data for both human and mouse.
Availability: http://retina.med.harvard.edu/ACTG/
Contact: pgalante{at}ludwig.org.br
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Serial analysis of gene expression (SAGE) (Velculescu et al., 1995), massively parallel signature sequencing (MPSS) (Brenner et al., 2000) and sequencing-by-synthesis (SBS) (www.illumina.com/pages.ilmn?ID=201) determine the expression level of genes by measuring the frequency of sequence tags derived from polyadenylated transcripts. SAGE allows the construction of a comprehensive expression profile in which each mRNA is defined by a specific 14 or 21 bp sequence adjacent to the 3' most NlaIII site. MPSS generates 17–20 bp sequences adjacent to the 3' most DpnII site from millions of mRNA molecules in a sample, providing a quantitative assessment of transcript abundance. SBS is a new technology that is similar to MPSS (it also generates millions of 20-bp-long tags corresponding to sequences adjacent to the 3' most DpnII site). SAGE and MPSS have been widely used to study gene expression in pathological tissues (Hermeking, 2003) as well as general aspects of gene expression in different organisms (Hermeking, 2003; Jongeneel et al., 2005; Meyers et al., 2004). An essential step in the analysis of SAGE, MPSS and SBS data is the correct assignment (mapping) of each tag to a gene. There are two strategies for this process: annotation based on data from websites such as SAGE Genie (Boon et al., 2002) and SAGEmap (Lash et al., 2000), or annotation based on databases constructed using in-house computational approaches (Blackshaw et al., 2004; Silva et al., 2004). SAGEmap and SAGE Genie are web-based tools that map tags to cDNAs and tags to gene/cDNA references, respectively (these web tools are exclusive to SAGE data). These strategies for tag-to-gene assignment either do not produce a complete annotation or cannot batch process a large number of tags. To address these limitations, we have created a web-based application called automatic correspondence of tags and genes (ACTG).
ACTG is designed to map a large number of tags against several reference datasets. It is user-friendly and generates output that is simple to analyze and can be integrated with other applications.
| 2 OVERVIEW OF ACTG |
|---|
ACTG is a collection of Perl scripts, Perl + CGI scripts, shell scripts and HTML code that performs three main tasks: (i) the assembly of the virtual tag databases used by ACTG, (ii) the uploading and processing of information submitted by the user and (iii) the mapping of the submitted tag list and generation of the output files.
ACTG datasets are composed of virtual tags (computer predictions of the tags produced by a SAGE, MPSS or SBS experiment) extracted from three major sources: SAGE Genie, SAGEmap and almost all public cDNAs from GenBank. The following are the main steps for the assembly of each dataset. For SAGE Genie: (i) download of the files containing the best virtual tag matches to UniGene clusters, (ii) parsing of the raw data and (iii) assembly of the data and generation of the final dataset. For SAGEmap, the process is similar: (i) download of the files containing the reliable virtual tag matches to UniGene clusters, (ii) parsing of the raw data, (iii) selection of the best tag-to-cDNA assignment (based on the SAGEmap tag-to-gene score) and (iv) assembly of the final dataset. For the public cDNA sequences, the process is more complex: (i) download of all cDNAs from UniGene (Boguski and Schuler, 1995), RefSeq (Pruitt et al., 2005), MGC (Strausberg et al., 1999) and dbEST (Boguski et al., 1993), (ii) splitting of the dbEST ESTs and the RefSeq, MGC and UniGene mRNAs into subsets presenting a poly(A) tail (at least five adenosines at the cDNA 3' end), a canonical poly(A) signal (AAUAAA or AUUAAA within the 3' most 50 bp segment), or both a poly(A) tail and a signal and (iii) extraction of the virtual tags and assembly of the final datasets in the ACTG format (for dbEST data, only ESTs containing a poly(A) tail, or both a poly(A) tail and a signal, have been included). These subdivisions of the cDNA datasets, based on the 3' end information, are important for evaluating the tag-to-cDNA mapping, since a well-defined 3' end produces a more reliable tag-to-gene match (Boon et al., 2002).
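The virtual tag extraction and 3' end classification described above can be illustrated with a minimal Python sketch. It is not the ACTG implementation (which is written in Perl and is not reproduced in this article); the function names are hypothetical. It assumes that the stated tag length includes the 4 bp recognition site (CATG for NlaIII, GATC for DpnII), that cDNAs are given 5' to 3' in the DNA alphabet, and that the 50 bp window for the poly(A) signal is measured after trimming the poly(A) tail.

```python
import re

# Anchoring-enzyme recognition sites named in the text:
# NlaIII (CATG) for SAGE, DpnII (GATC) for MPSS and SBS.
SITES = {"SAGE": "CATG", "MPSS": "GATC", "SBS": "GATC"}


def classify_3prime_end(cdna, min_tail=5, window=50):
    """Return 'tail+signal', 'tail', 'signal' or 'none' for a cDNA 3' end."""
    seq = cdna.upper()
    # Poly(A) tail: at least `min_tail` adenosines at the very 3' end.
    tail = re.search(r"A{%d,}$" % min_tail, seq)
    # Assumption: the signal window is taken after removing the tail.
    body = seq[:tail.start()] if tail else seq
    last = body[-window:]
    # AAUAAA / AUUAAA written in the DNA alphabet.
    signal = "AATAAA" in last or "ATTAAA" in last
    if tail and signal:
        return "tail+signal"
    if tail:
        return "tail"
    if signal:
        return "signal"
    return "none"


def extract_virtual_tag(cdna, tag_type, tag_len):
    """Return the tag starting at the 3' most site, or None if unavailable."""
    seq = cdna.upper()
    pos = seq.rfind(SITES[tag_type])  # rfind gives the 3' most occurrence
    if pos == -1:
        return None  # no recognition site in this cDNA
    tag = seq[pos:pos + tag_len]
    # Assumption: a tag truncated by the end of the sequence is discarded.
    return tag if len(tag) == tag_len else None


# Toy cDNA (invented for illustration): two CATG sites, a poly(A) signal, a tail.
cdna = "GGCATGAAACCCGGG" "CATGACGTACGTAC" "TTAATAAAGG" "AAAAAAAA"
print(extract_virtual_tag(cdna, "SAGE", 14))  # CATGACGTACGTAC
print(classify_3prime_end(cdna))              # tail+signal
```

Only the second (3' most) CATG site contributes a tag, which is why a reliable 3' end matters: a truncated cDNA could expose an internal site instead.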
Since ACTG presents virtual tags from commonly used sources of cDNA data, some users may find it difficult to make full use of the datasets and to interpret the redundancy among them. We therefore created an additional virtual tag dataset: a non-redundant tag list that merges all ACTG datasets and removes the redundancy among them. In addition, we classified the virtual tags into three categories of tag-to-gene assignment according to their reliability (high, medium and low), thus producing a non-redundant and ranked virtual tag dataset (details of the non-redundant tag list are described on the ACTG website). These new datasets should simplify the tag mapping process and the user's interpretation of results.
We have also created an additional dataset of putative artifactual tags. This dataset contains tags similar (allowing one mismatch) to the sequence of the linker (used in the construction of SAGE libraries) and ambiguous tags, i.e. tags mapped to two or more genes. Putative artifactual tags are identified by special characters in the output files and should be considered unreliable in tag-to-gene assignments. These features are also available in SAGE Genie (Boon et al., 2002).
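As a rough illustration of how such tags could be flagged, the sketch below marks a tag as linker-like when it differs from a linker-derived tag at no more than one position, and as ambiguous when it maps to two or more distinct genes. The function names, the flag labels and the linker-derived tag are hypothetical placeholders; the actual linker sequences and the special characters used in ACTG output are not given in this article.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def flag_artifacts(tag_to_genes, linker_tags, max_mismatches=1):
    """Return {tag: [flags]} with 'linker' and/or 'ambiguous' where applicable."""
    flags = {}
    for tag, genes in tag_to_genes.items():
        marks = []
        # Linker-like: within `max_mismatches` of any linker-derived tag.
        if any(len(tag) == len(linker) and hamming(tag, linker) <= max_mismatches
               for linker in linker_tags):
            marks.append("linker")
        # Ambiguous: the tag maps to two or more distinct genes.
        if len(set(genes)) >= 2:
            marks.append("ambiguous")
        flags[tag] = marks
    return flags


# Placeholder linker-derived tag, invented for this example.
LINKER_TAGS = ["CATGAAAACCCCGG"]
mapping = {
    "CATGAAAACCCCGT": ["Hs.2"],          # one mismatch to the placeholder linker
    "CATGTTTTGGGGCC": ["Hs.3", "Hs.4"],  # maps to two genes
}
print(flag_artifacts(mapping, LINKER_TAGS))
```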
Figure 1 illustrates the main steps necessary to map a list of tags using ACTG. The first step is the submission of a file with a list of tags (a plain text file containing the tag sequences and optional columns, for example, the tag frequencies). The second is the selection of an organism (human or mouse), a tag type (SAGE, MPSS or SBS) and at least one database against which to map the tags. The final steps are the execution of the program (run ACTG) and the download of the output files. ACTG produces three kinds of output: (1) one file per selected database containing the tag mapping, (2) a file that merges all mapping results and (3) a file containing statistics of the mapping process.
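The workflow above can be mimicked offline with a short sketch. It assumes a whitespace-separated input file whose first column is the tag, and it represents each reference database as a plain dictionary from virtual tag to gene identifier; both the file layout details and the names `read_tag_list` and `map_tags` are assumptions made for illustration, not the ACTG formats.

```python
import io


def read_tag_list(handle):
    """Parse a tag list: first column is the tag, remaining columns are kept."""
    tags = {}
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            tags[fields[0].upper()] = fields[1:]
    return tags


def map_tags(tags, databases):
    """Return per-database mappings, a merged mapping and simple statistics."""
    # (1) One mapping per selected database.
    per_db = {name: {t: db[t] for t in tags if t in db}
              for name, db in databases.items()}
    # (2) Merged view: tag -> {database name: gene}.
    merged = {}
    for name, hits in per_db.items():
        for tag, gene in hits.items():
            merged.setdefault(tag, {})[name] = gene
    # (3) Statistics of the mapping process.
    stats = {name: len(hits) for name, hits in per_db.items()}
    stats["submitted"] = len(tags)
    stats["mapped_any"] = len(merged)
    return per_db, merged, stats


# Toy input and toy databases, invented for this example.
text = "CATGAAAACCCCGG\t12\nCATGTTTTGGGGCC\t3\ncatgacgtacgtac\t1\n"
databases = {
    "mRNA_polyA": {"CATGAAAACCCCGG": "Hs.1"},
    "EST_polyA": {"CATGAAAACCCCGG": "Hs.1", "CATGACGTACGTAC": "Hs.9"},
}
per_db, merged, stats = map_tags(read_tag_list(io.StringIO(text)), databases)
print(stats)
```

Keeping the optional columns attached to each tag means that frequencies can be carried through to the merged output without being interpreted by the mapper.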
Below is a simplified guide to making the best use of ACTG and to interpreting its mapping results. A user interested in: (1) identifying known genes and producing the most reliable set of tag-to-gene mappings should use databases containing virtual tags from mRNA sequences with poly(A) tails, or with both poly(A) tails and poly(A) signals; (2) identifying new genes and/or new transcripts should use databases containing virtual tags from ESTs with poly(A) tails, or with both poly(A) tails and poly(A) signals; and (3) performing the most complete mapping should use databases containing virtual tags from SAGE Genie, SAGEmap or the non-redundant tag lists. With regard to the interpretation of the results, a critical aspect to inspect carefully is the ambiguity of tag-to-gene assignments. If a tag is mapped to virtual tags from two or more different cDNAs, but all sequences are in the same UniGene cluster, this redundancy is acceptable and will not influence the tag-to-gene match. However, the assignment is ambiguous if a tag is mapped to distinct UniGene clusters.
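The ambiguity rule in the preceding paragraph reduces to counting the distinct UniGene clusters hit by a tag. A minimal sketch follows; the function name, the return labels and the accession and cluster identifiers are invented for illustration.

```python
def classify_assignment(hits):
    """Classify one tag from its (cDNA accession, UniGene cluster) hits."""
    clusters = {cluster for _accession, cluster in hits}
    if not clusters:
        return "unmapped"
    # Several cDNAs in a single cluster: redundancy, not ambiguity.
    if len(clusters) == 1:
        return "unambiguous"
    # Hits in distinct clusters: the tag-to-gene assignment is ambiguous.
    return "ambiguous"


same_cluster = [("cDNA_A", "Hs.100"), ("cDNA_B", "Hs.100")]
two_clusters = [("cDNA_A", "Hs.100"), ("cDNA_C", "Hs.200")]
print(classify_assignment(same_cluster))  # unambiguous
print(classify_assignment(two_clusters))  # ambiguous
```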
For further details on the functions of ACTG, the construction of the ACTG datasets, the mapping of a submitted tag list and the interpretation of results, please visit the HELP & FAQ section of the ACTG website.
| 3 APPLICATION |
|---|
To evaluate our tool, we submitted to ACTG sets of tags from two recent publications (Blackshaw et al., 2004; Jongeneel et al., 2005). Blackshaw et al. (2004) were unable to map 37 813 distinct SAGE tags from 12 retina libraries. Using ACTG, we were able to map 16 927 of these tags (databases: SAGE Genie, UniGene mRNAs and ESTs with poly(A) tail). Jongeneel et al. (2005) constructed an atlas of human gene expression based on tags from 32 MPSS libraries in which, for each library, the tags were assigned to transcribed regions (genes). For example, tags from cerebellum were mapped to 8183 genes and tags from bone marrow were mapped to 7182 genes. Using ACTG, tags from cerebellum and bone marrow were mapped to 13 773 and 12 689 genes, respectively (databases: mRNAs from UniGene, RefSeq and MGC, ESTs with poly(A) tail and poly(A) signal, and ESTs with poly(A) tail). The annotations for these datasets are available in the Publication section of the ACTG website as supplementary data.
| 4 SUMMARY |
|---|
ACTG, a web-based tool, allows the user to quickly annotate a complete SAGE (14 or 21 bp long), MPSS (17 or 20 bp long) or SBS (20 bp long) library using complete datasets for both human and mouse. In addition, ACTG can filter redundant and artifactual tags and generate a report of the mapping process. ACTG is a simple, publicly available tag-to-gene mapping tool packaged in a user-friendly environment that addresses an essential step in SAGE, MPSS and SBS data analyses.
| ACKNOWLEDGEMENTS |
|---|
We thank Arthur Ramos for bioinformatics support and Noboru Jo Sakabe for discussions. PAFG was supported by a FAPESP fellowship. This study was supported in part by grant 5D43TW007015-02 from the Fogarty International Center, NIH. WPK was supported by an HSDM Dean's Scholar Award. Funding to pay the Open Access publication charges was provided by grant 5D43TW007015-02 from the Fogarty International Center, NIH.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on October 23, 2006; revised on December 22, 2006; accepted on January 19, 2007
| REFERENCES |
|---|
Blackshaw S, et al. (2004) Genomic analysis of mouse retinal development. PLoS Biol., 2, E247.
Boguski MS, Schuler GD. (1995) Establishing a human transcript map. Nat. Genet., 10, 369–371.
Boguski MS, et al. (1993) dbEST – database for "expressed sequence tags". Nat. Genet., 4, 332–333.
Boon K, et al. (2002) An anatomy of normal and malignant gene expression. Proc. Natl Acad. Sci. USA, 99, 11287–11292.
Brenner S, et al. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol., 18, 630–634.
Hermeking H. (2003) Serial analysis of gene expression and cancer. Curr. Opin. Oncol., 15, 44–49.
Jongeneel CV, et al. (2003) Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc. Natl Acad. Sci. USA, 100, 4702–4705.
Jongeneel CV, et al. (2005) An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res., 15, 1007–1014.
Lash AE, et al. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.
Meyers BC, et al. (2004) The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Res., 14, 1641–1653.
Pruitt KD, et al. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.
Silva AP, et al. (2004) The impact of SNPs on the interpretation of SAGE and MPSS experimental data. Nucleic Acids Res., 32, 6104–6110.
Strausberg RL, et al. (1999) The mammalian gene collection. Science, 286, 455–457.
Velculescu VE, et al. (1995) Serial analysis of gene expression. Science, 270, 484–487.


