Bioinformatics Advance Access originally published online on February 15, 2005
Bioinformatics 2005 21(10):2566-2567; doi:10.1093/bioinformatics/bti326
MamMiBase: a mitochondrial genome database for mammalian phylogenetic studies
1Laboratório Nacional de Computação Científica LNCC/MCT Petrópolis, RJ, Brazil
2Instituto de Bio-Manguinhos, FIOCRUZ Rio de Janeiro, RJ, Brazil
3UFRJ, Departamento de Genética, Instituto de Biologia, Universidade Federal do Rio de Janeiro Rio de Janeiro, RJ, Brazil
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: MamMiBase, the mammalian mitochondrial genome database, is a relational database of complete mitochondrial genome sequences of mammalian species. The database is useful for phylogenetic analysis because it allows ready retrieval of individual nucleotide and amino acid alignments of the 13 protein-coding mitochondrial genes in three formats (NEXUS for PAUP, MEGA and PHYLIP). Users may download the alignments that suit their needs on the basis of parameter values that MamMiBase also reports, such as sequence length, p-distance, base content, transition/transversion ratio and the gamma parameter. A simple phylogenetic tree (a neighbor-joining tree with Jukes-Cantor distances) is also available for download and is useful for parameter calculations and other simple tasks.
Availability: MamMiBase is available at http://www.mammibase.lncc.br
Contact: atrv{at}lncc.br
| INTRODUCTION |
|---|
In recent years, there has been an unprecedented increase in the production of molecular sequences, and GenBank, one of the most popular data banks, now holds tens of billions of deposited bases. Unfortunately, until they are properly analyzed, most of these sequences will contribute little to the understanding of their biological significance. Reliable phylogenetic analyses may serve as the basis for understanding biological patterns and the processes that govern their evolution. Nevertheless, for a particular phylogenetic problem, several data sets are usually available that may yield conflicting results even if properly analyzed. Sequence length, variability levels, gaps and transition/transversion ratio values have been shown to be critical to gene performance (Russo et al., 1996).
More restrictive data banks play a central role in lessening this problem, since they provide fast retrieval of the set of sequences the user intends to analyze. In this paper we present a mitochondrial protein database that is of particular interest to those who aim to reconstruct phylogenies. At this point the database does not include individual genes that have been sequenced in isolation; it enables the user to promptly select individual gene alignments from the mitochondrial genomes sequenced so far. The user may select genes by statistical parameter values, such as GC content and the gamma parameter. For this purpose, we have gathered all protein-coding genes from the complete mammalian mitochondrial genome sequences, together with those of two other tetrapods included as outgroups.
| MamMiBase |
|---|
MamMiBase is a relational database with a user-friendly interface. It is presented as a tree menu that lets the user select a particular set of species from all the mammalian mitochondrial genomes stored in a MySQL database. The web interface was programmed in PHP (PHP: Hypertext Preprocessor), a server-side scripting language. Distance values are computed with PERL (practical extraction and report language) scripts that use the common gateway interface (CGI, for the web) and the database interface (DBI). The database contains tables for mammalian organism information, amino acid and nucleotide sequences from mitochondrial DNA, gi numbers, related bibliographic information and other protein information (e.g. name, length and gene relations). All DNA sequences were obtained from GenBank. During the development of MamMiBase, mitochondrial DNA data files were processed with PERL scripts. The BioPerl toolkit was used to parse the files downloaded from GenBank and to run programs such as CLUSTALW for the alignments.
We decided to exclude both tRNA and rRNA sequences, as well as partial genomes. All protein-coding genes were aligned in advance on the basis of their translated amino acid sequences. The multiple alignments were generated by CLUSTALW and inspected by eye.
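Aligning protein-coding genes through their translated sequences usually implies copying the gaps of the amino acid alignment back onto the nucleotide sequences, one codon per residue. The source does not describe how MamMiBase performs this step, so the following is only a minimal sketch under that assumption; the function name `thread_codons` is hypothetical.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an unaligned coding sequence onto its aligned protein sequence.

    Each residue in the protein alignment consumes one codon (3 bases) of
    the coding sequence; each protein gap becomes a 3-base gap.
    Assumption: the coding sequence starts in frame. Any bases left over
    after the last residue (e.g. a stop codon) are simply dropped.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is too short for this protein")

    pieces = []
    position = 0  # index of the next unused base in the coding sequence
    for aa in aligned_protein:
        if aa == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(cds[position:position + 3])
            position += 3
    return "".join(pieces)


# Hypothetical toy input: a two-residue protein aligned with one gap.
codon_alignment = thread_codons("M-K", "ATGAAA")
```

Because every gap spans a whole codon, the resulting nucleotide alignment keeps the reading frame intact, which matters when codon positions are analyzed separately.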
| Contents of MamMiBase |
|---|
MamMiBase is designed to store mammalian mitochondrial genome sequences and to provide rapid access to gene alignments, along with useful phylogenetic information such as p-distances and transition/transversion ratios. It provides a taxonomic hierarchy specifically designed to facilitate the selection of mammalian organisms. MamMiBase contains the mitochondrial genomes of mammals and of two outgroups, the chicken Gallus gallus and the African clawed frog Xenopus laevis. Other mammalian species will be included as their finished genome records become available in GenBank.
The most interesting aspect of the database is the retrieval of alignments based on statistical parameter values for the selected mammalian species. The alignments are available for nucleotide and amino acid sequences in MEGA, PHYLIP and NEXUS (i.e. PAUP) formats. To reduce computing time, pre-processed results are stored in dedicated tables. Pairwise p-distances and transition/transversion ratios, for instance, are calculated in advance, and the values for all pairwise comparisons are already stored.
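The two precomputed pairwise statistics have simple definitions: the p-distance is the proportion of compared sites at which two aligned sequences differ, and each difference is either a transition (purine to purine or pyrimidine to pyrimidine) or a transversion. The sketch below illustrates those definitions in Python; it is not the MamMiBase PERL code, and skipping any site with a gap or ambiguous base is an assumption made here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def compare_pair(seq_a: str, seq_b: str):
    """Return (p_distance, transitions, transversions) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")

    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with gaps or ambiguity codes are ignored.
        if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")
    p_distance = (transitions + transversions) / compared
    return p_distance, transitions, transversions


# Hypothetical toy pair: 7 comparable sites, 2 transitions, 1 transversion.
p, ts, tv = compare_pair("ACGTAC-T", "ACATAT-G")
ts_tv_ratio = ts / tv if tv else None  # the ratio is undefined without transversions
```

Storing the transition and transversion counts separately, rather than only their ratio, avoids a division by zero for closely related pairs with no observed transversions.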
MamMiBase also provides parameters that must be computed for each particular set of species, such as average base content and the gamma parameter. PAML 3.13 (Yang, 1997) is used to estimate the gamma parameter in a maximum likelihood framework, with four rate categories approximating the gamma distribution. This approach requires a phylogenetic tree, which is inferred by the LinTree program (Takezaki et al., 1995). A PERL script translates the LinTree output file into a parenthetic (Newick format) tree file with branch lengths. This file is then passed to PAML, along with the nucleotide sequences, to estimate the gamma parameter. Once all parameters have been calculated, a list of genes with their respective parameter values is produced, and the user may download amino acid or nucleotide sequence alignments (or the tree files) for the selected organisms on the basis of any of these parameters.
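Average base content depends on which species the user selects, which is why it cannot be precomputed. As a rough illustration, the sketch below pools the base counts of all selected sequences and reports their relative frequencies. Whether MamMiBase pools counts in this way or averages per-sequence frequencies is not stated in the source, so the pooling is an assumption, and the function name `base_content` is hypothetical.

```python
from collections import Counter


def base_content(sequences):
    """Return pooled frequencies of A, C, G and T over a set of sequences.

    Assumption: gaps and ambiguity codes are excluded from the totals.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in "ACGT"}


# Hypothetical toy selection of two aligned sequences.
freqs = base_content(["AACG", "TT-G"])
gc_content = freqs["G"] + freqs["C"]
```

GC content, one of the selection parameters mentioned earlier, follows directly as the sum of the G and C frequencies.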
In addition to these parameters, the database makes the phylogenetic tree available for download. It is a neighbor-joining tree built from Jukes-Cantor (1969) distances with complete deletion. MamMiBase uses the LinTree program to generate a flat file with the topology and branch lengths of the tree. The tree can be downloaded in text (.njb, for the LinTree program), Newick (for the PAML program package) and PostScript (.ps, for publication) formats. We emphasize that we discourage the use of this automatically generated tree in phylogenetic studies; it may be used for simpler tasks, such as parameter computations, or merely as a guide for the multiple-alignment steps. MamMiBase thus provides a useful additional resource for comparative analyses.
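The distance behind this tree can be illustrated briefly. Under the Jukes-Cantor model, an observed proportion of differing sites p is corrected to d = -(3/4) ln(1 - 4p/3), and complete deletion means that every alignment column containing a gap in any sequence is removed before distances are computed. The Python sketch below shows both steps; it is not the LinTree implementation, and treating only "-" and "?" as missing data is an assumption.

```python
import math


def complete_deletion(alignment, missing="-?"):
    """Remove every column in which at least one sequence has missing data."""
    kept = [col for col in zip(*alignment)
            if not any(ch in missing for ch in col)]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into rows.
    return ["".join(row) for row in zip(*kept)]


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance between two gap-free aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical toy alignment: the gapped column is dropped for all sequences.
rows = complete_deletion(["ACGTACGTAC", "ACGTACGTAT", "ACG-ACGTAC"])
d = jukes_cantor_distance(rows[0], rows[1])
```

With complete deletion every pairwise distance is computed over the same set of sites, at the cost of discarding any column in which even a single species has a gap.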
| Acknowledgments |
|---|
MamMiBase was developed at LNCC (National Laboratory of Scientific Computation). We thank Darcy F. de Almeida for his help with the final preparation of this manuscript. This work was supported by grants from the National Research Council of the Brazilian Ministry of Science and Technology (CNPq/MCT) and the Rio de Janeiro Science Foundation (FAPERJ) to A.T.V. and C.A.M.R.
Received on September 10, 2004; revised on October 27, 2004; accepted on February 10, 2005
| REFERENCES |
|---|
Russo, C.A.M., et al. (1996) Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol., 13, 525-536.
Takezaki, N., et al. (1995) Phylogenetic test of molecular clock and linearized trees. Mol. Biol. Evol., 12, 823-833.
Yang, Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS, 13, 555-556.
