Bioinformatics Advance Access originally published online on October 6, 2005
Bioinformatics 2005 21(23):4302-4303; doi:10.1093/bioinformatics/bti705
GARSA: genomic analysis resources for sequence annotation
1DBBM, Instituto Oswaldo Cruz, Fiocruz, Brazil
2Laboratório de Bioinformática and Laboratório de Protozoologia, MIP/CCB, Universidade Federal de Santa Catarina, Brazil
3Departamento de Ciência da Computação/Núcleo de Computação Eletrônica, UFRJ, Brazil
4Engenharia de Bioprocessos e Biotecnologia, Universidade Federal do Paraná, Brazil
5Instituto Militar de Engenharia, Brazil
6Bioinformatics and Molecular Evolutionary Genetics Group, Brazil
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The growth of genome data, and of the possibilities for analyzing it, has made it harder for scientists to understand, integrate and handle this ever-increasing information. GARSA was conceived to facilitate the integration, analysis and presentation of genomic information from several bioinformatics tools and genomic databases in a flexible way. GARSA is a user-friendly web-based system designed to analyze genomic data in the context of a pipeline. EST and GSS data can be analyzed, since the system accepts (1) chromatograms, (2) sequences downloaded from GenBank, (3) Fasta files stored locally or (4) a combination of all three. Quality evaluation of chromatograms, vector removal and clustering are easily performed as part of the pipeline. A number of local and customizable Blast and CDD analyses can be performed, as well as Interpro searches, complemented by phylogenetic analyses. GARSA is being used for the analyses of Trypanosoma vivax (GSS and EST), Trypanosoma rangeli (GSS, EST and ORESTES), Bothrops jararaca (EST), Piaractus mesopotamicus (EST) and Lutzomyia longipalpis (EST).
Availability: The GARSA system is freely available under the GPL license (http://www.biowebdb.org/garsa/). For download requests, visit that page or contact Dr Alberto Dávila.
Contact: davila{at}fiocruz.br
The increasing amount of genome data and the consequent possibilities for genome analyses have made it harder for scientists to understand, integrate and handle this ever-increasing information. One of the main problems is manipulating and processing different file formats with a number of tools that usually do not communicate easily with each other. Researchers have to deal with dozens of sequence formats (Rice et al., 2000) and several different software packages to analyze nucleotide sequences within a typical bioinformatics pipeline. As a consequence, to overcome heterogeneity, redundancy and low productivity, biologists use alternative strategies such as scripts or the adaptation and reuse of available modules (e.g. Bioperl). Although effective, such approaches are far from ideal: the intermediate files generated throughout the process are usually not properly stored and organized, and the resulting large number of files and versions can lead to processing errors, wrong analyses and/or wrong inferences. The use of database management systems adds facilities such as integrity constraints, transaction management and query languages (SQL), amongst others. Several analysis pipelines have been described in the literature, such as the EST pipeline system (Xu et al., 2003), ESTAP (Mao et al., 2003) and ESTWeb (Paquola et al., 2003), but as far as we know none of them was specifically designed for GSS analyses or for a combination of GSS with transcriptome projects.
Considering the above-mentioned problems and the increasing number of network-based projects, in which laboratories can be geographically dispersed, we have conceived a web-based environment named GARSA (genomic analysis resources for sequence annotation). It aims to facilitate the analysis, integration and presentation of genomic information by combining several bioinformatics tools and sequence databases in a flexible and user-friendly way.
The GARSA system is specifically designed to analyze genomic data. It presents a pipeline (also called a workflow) composed of selected bioinformatics software packages, together with an intuitive web-based interface. Its underlying platform includes Perl, Bioperl, CGI, Apache and MySQL, as well as several Linux-based bioinformatics packages. In the current version, the system can analyze EST, ORESTES and GSS data, accepting as input (1) chromatograms, (2) downloads from GenBank, (3) Fasta files stored locally or (4) a combination of these. GARSA uses the Phred/Phrap package (http://www.phrap.org/phredphrapconsed.html) to process chromatograms, evaluate the quality of traces and remove vector contamination. The CAP3 program (Huang and Madan, 1999) is used for clustering, while for gene finding the system employs the Yacop metatool (Tech and Merkl, 2003), which includes programs such as Critica, Glimmer and ZCURVE. Selected programs of the EMBOSS package (Transeq, Geecee, Cusp) (Rice et al., 2000) are used to translate sequences and to estimate G + C content and codon usage for both predicted ORFs and clusters. Clusters are submitted to (standalone) Blast similarity searches (http://www.ncbi.nlm.nih.gov/BLAST/) against NR, NT, UniProt and any other custom database built by the user. Conserved domain searches are also performed using NCBI's CDD tool and Interpro.
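To make the G + C content and codon usage step concrete, the following minimal Python sketch computes both quantities for a single sequence. It only illustrates the kind of statistics that tools such as Geecee and Cusp report; it is not the EMBOSS implementation, and the function names `gc_content` and `codon_usage` and the example ORF are assumptions introduced here for illustration.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def codon_usage(orf: str) -> Counter:
    """Count complete, unambiguous codons, reading from the first base."""
    orf = orf.upper()
    codons = (orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    return Counter(c for c in codons if set(c) <= set("ACGT"))


# Hypothetical ORF used only for illustration.
example_orf = "ATGGCCGCCTAA"
print(f"G+C fraction: {gc_content(example_orf):.3f}")
for codon, count in sorted(codon_usage(example_orf).items()):
    print(codon, count)
```

Ambiguous bases such as N are skipped in both functions, which is one possible convention; a production tool may treat them differently.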
Similarity search results are stored in the corresponding GARSA database tables and then users can select them and build multiple alignments using ClustalW (Thompson et al., 1994). Alignments are presented in ClustalW, Phylip and WebLogo (Crooks et al., 2004) formats for download and then users can do further analyses with them. Phylogenetic trees are built using SeqBoot, Dnadist/Protdist, Neighbor and Consense programs of the Phylip package (Felsenstein, 1989). A feature for registered users to enter comments or annotations on any clusters has also been added. Several gene discovery projects can easily be included in the system, as GARSA can simultaneously deal with multiple projects.
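The first stage of the phylogeny step, SeqBoot, produces bootstrap replicates of an alignment before distances and trees are computed. The Python sketch below shows the underlying idea, resampling alignment columns with replacement, under simplifying assumptions: the alignment is held as a dictionary of equal-length strings, and the function name `bootstrap_alignment` and the example sequences are hypothetical. It does not reproduce the Phylip code or its file formats.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Build one bootstrap replicate by sampling columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


# Hypothetical three-sequence alignment used only for illustration.
alignment = {"seqA": "ATG-CA", "seqB": "ATGGCA", "seqC": "TTG-CA"}
replicate = bootstrap_alignment(alignment, random.Random(0))
for name, seq in replicate.items():
    print(name, seq)
```

Each replicate keeps whole columns intact, so the relationship between sequences at a given site is preserved; a consensus tree is then built from the trees inferred on many such replicates.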
The Library table identifies each genomic library used in the project. Sequences from each library are uploaded to the Reads table. These sequences may come from local experiments or GenBank downloads. CAP3 results are stored in the Clustering, Clusters and Clusters_Fasta tables. Predicted ORFs identified by Yacop or Glimmer are stored in the ORF_Predicted table. Each similarity search execution is registered in the Similarity_Search table, and its hits are stored in the Blast_Hit and Interpro_Hit tables. The Taxonomy table contains classification data of organisms, which are referred to by annotations stored in the Annotation table. Besides defining parameters, running the analysis pipeline and checking its progress, users can query the results stored in the database. They can retrieve and manually check clusters with hits and without hits, using E-value, score, query_frame, query_start, query_end and/or description as parameters.
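The kind of query described above can be sketched as follows. This is a minimal illustration under several assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, it models only the Clusters and Blast_Hit tables, and the column layout, the key names (`cluster_id`, `hit_id`), the helper functions and the sample rows are all hypothetical. The actual GARSA schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified versions of two GARSA tables.
conn.executescript("""
CREATE TABLE Clusters (cluster_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Blast_Hit (
    hit_id INTEGER PRIMARY KEY,
    cluster_id INTEGER REFERENCES Clusters(cluster_id),
    evalue REAL, score REAL, query_frame INTEGER,
    query_start INTEGER, query_end INTEGER, description TEXT
);
""")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO Clusters VALUES (?, ?)",
    [(1, "Contig1"), (2, "Contig2"), (3, "Contig3")],
)
conn.executemany(
    "INSERT INTO Blast_Hit (cluster_id, evalue, score, query_frame,"
    " query_start, query_end, description) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1e-50, 210.0, 1, 1, 300, "placeholder description A"),
        (2, 0.5, 28.0, -2, 10, 90, "placeholder description B"),
    ],
)


def clusters_with_hits(conn, max_evalue):
    """Names of clusters with at least one hit at or below the E-value cutoff."""
    rows = conn.execute(
        "SELECT DISTINCT c.name FROM Clusters c"
        " JOIN Blast_Hit h ON h.cluster_id = c.cluster_id"
        " WHERE h.evalue <= ? ORDER BY c.name",
        (max_evalue,),
    )
    return [row[0] for row in rows]


def clusters_without_hits(conn, max_evalue):
    """Names of clusters with no hit at or below the E-value cutoff."""
    rows = conn.execute(
        "SELECT c.name FROM Clusters c WHERE NOT EXISTS ("
        " SELECT 1 FROM Blast_Hit h"
        " WHERE h.cluster_id = c.cluster_id AND h.evalue <= ?)"
        " ORDER BY c.name",
        (max_evalue,),
    )
    return [row[0] for row in rows]


print(clusters_with_hits(conn, 1e-5))
print(clusters_without_hits(conn, 1e-5))
```

Keeping hits in a relational table, rather than in loose Blast output files, is what allows the "no-hit" set to be computed directly with a `NOT EXISTS` subquery; the same pattern extends to filters on score, frame or description.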
GARSA is being used for the sequence analyses of Trypanosoma vivax (GSS and EST) (Guerreiro et al., 2005), Trypanosoma rangeli (GSS, EST and ORESTES), Bothrops jararaca (EST), Piaractus mesopotamicus (EST) and Lutzomyia longipalpis (EST). Installation and usage documentation are available at http://www.biowebdb.org/garsa/documentation.html
Furthermore, the GARSA system is unique in integrating (1) gene finders, (2) phylogeny software, (3) a multiproject environment and (4) user-based authenticated access. A new version oriented towards comparative genomics analyses is being developed. It will integrate more software packages into the pipeline, such as GO tools (http://www.geneontology.org/GO.tools.shtml), RepeatMasker (http://repeatmasker.org) and eukaryotic gene finders, and will provide a self-extracting installer for local installation.
| Acknowledgments |
|---|
We would like to thank Dr José Marcos Ribeiro (NIAID/NIH) for suggestions and for sharing his experience on EST analysis, João Setubal (VBI and LBI/IC/UNICAMP) for allowing us to modify the algorithm for processing EST chromatograms and MCT/CNPq, IAEA, CIRAD and FAPESP for financial support.
Conflict of Interest: none declared.
Received on July 10, 2005; revised on September 7, 2005; accepted on October 5, 2005
| REFERENCES |
|---|
Crooks, G.E., et al. (2004) WebLogo: a sequence logo generator. Genome Res., 14, 1188-1190.
Felsenstein, J. (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics, 5, 164-166.
Guerreiro, L.T., et al. (2005) Exploring the genome of Trypanosoma vivax through GSS and in silico comparative analysis. OMICS, 9, 116-128.
Huang, X. and Madan, A. (1999) CAP3: a DNA sequence assembly program. Genome Res., 9, 868-877.
Mao, C., et al. (2003) ESTAP - an automated system for the analysis of EST data. Bioinformatics, 19, 1720-1722.
Paquola, A.C., et al. (2003) ESTWeb: bioinformatics services for EST sequencing projects. Bioinformatics, 19, 1587-1588.
Rice, P., et al. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276-277.
Tech, M. and Merkl, R. (2003) YACOP: enhanced gene prediction obtained by a combination of existing methods. In Silico Biol., 3, 441-451.
Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Xu, H., et al. (2003) EST pipeline system: detailed and automated EST data processing and mining. Genomics Proteomics Bioinformatics, 1, 236-242.