Bioinformatics Advance Access originally published online on September 11, 2006
Bioinformatics 2006 22(22):2835-2837; doi:10.1093/bioinformatics/btl471

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SPEED: a molecular-evolution-based database of mammalian orthologous groups

Eric J. Vallender 1, Justin E. Paschall 2, Christine M. Malcom 1,3, Bruce T. Lahn 1 and Gerald J. Wyckoff 2,*

1 Department of Human Genetics and Committee on Genetics, Howard Hughes Medical Institute, University of Chicago 920 East 58th Street, Chicago, IL 60637, USA
2 Division of Molecular Biology and Biochemistry, University of Missouri-Kansas City 5007 Rockhill Road, Kansas City, MO 64110, USA
3 Department of Anthropology, University of Chicago 920 East 58th Street, Chicago, IL 60637, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: The abundance of nucleotide sequence information available has expanded horizons of inquiry for molecular evolution; however, the full potential of whole-genome analysis has not been realized because of inadequate tools. Here, we present one of the first toolkits to aid multidisciplinary high-throughput analysis.

Summary: SPEED was created to integrate molecular evolutionary data with existing genetic resources and provide a straightforward user interface to 17 352 orthologous gene groups, containing representatives from eight mammalian species and an avian outgroup.

Availability: SPEED can be accessed at http://bioinfobase.umkc.edu/speed/

Contact: wyckoffg{at}umkc.edu

Supplementary information: A larger version of the data model and a site map are available online.


    1 INTRODUCTION
 
Mammalian genomes are being sequenced at an increasingly rapid rate. High-quality analyses of five mammalian genomes have already been released [human (Lander et al., 2001; Venter et al., 2001), mouse (Waterston et al., 2002), rat (Gibbs et al., 2004), chimpanzee (The Chimpanzee Sequencing and Analysis Consortium, 2005) and dog (Lindblad-Toh et al., 2005)], with two more (rhesus macaque and cow) in publicly available draft versions. These diverse species open up many avenues of exploration for comparative evolution; studies of changes in mutation rates, chromosomal rearrangements and non-coding regions are all prevalent.

The accessibility of whole-genome assemblies is changing the face of protein studies. Largely, these studies have focused on deciphering patterns of gene divergence across chromosomes (Malcom et al., 2003; Vallender and Lahn, 2004; Webster et al., 2004) and detecting positive selection (Bustamante et al., 2005; Clark et al., 2003; Nielsen et al., 2005). Although the methodology of whole-genome analysis has broad applications, studies to date have often used different datasets that are difficult or impossible to recapitulate and compare. The format and size of these datasets often create prohibitive hurdles for researchers without the expertise or tools necessary to interpret and manipulate them.

Several platforms, such as the National Center for Biotechnology Information (NCBI) HomoloGene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene) (Wheeler et al., 2006) and Ensembl (http://www.ensembl.org) (Birney et al., 2006), have integrated evolutionary analysis of orthologs into their presentations of gene information. Although both platforms serve as the lingua franca of the genetics community, neither gives more than basic evolutionary information. Molecular biologists still need to enlist evolutionists to probe beyond basic summary statistics and contextualize information.

Here, we present a database that houses a large set of mammalian orthologs, plus an array of evolutionary information for each gene. The Searchable Prototype Experimental Evolutionary Database (http://bioinfobase.umkc.edu/speed) (SPEED) integrates information currently scattered throughout disparate sources (e.g. domain structure, expression, disease phenotype and function), and marries these preexisting data with molecular evolutionary parameters. SPEED offers the benefit of a web interface that places this information easily within the grasp of researchers, enabling them to devise and conduct complete evolutionary studies of mammalian protein evolution.


    2 METHODS
 
SPEED is constructed as a MySQL relational database with a PHP-based web front end. Genes from each species are assigned unique sequence ids. These are grouped into orthologs and then assigned a unique orthologous group id (SPEED id). Links to outside databases and information unique to species-specific sequences are constructed based on sequence ids, whereas information related to the orthologous groups, such as the evolutionary information, is linked to the SPEED ids. This robust structure lends itself to expansion, both in terms of the amount of data stored (i.e. orthologous sequences and species included) and the queryable parameters available (i.e. evolutionary, physiological and other characteristics).
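The two-level id scheme described above can be illustrated with a minimal sketch. It is not the SPEED schema: the table names, column names and inserted values are hypothetical placeholders, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: names are illustrative only, not taken from SPEED.
# sqlite3 (in memory) stands in for the MySQL back end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ortholog_group (
    speed_id INTEGER PRIMARY KEY          -- one row per orthologous group
);
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,      -- one row per species-specific gene
    speed_id    INTEGER NOT NULL REFERENCES ortholog_group(speed_id),
    species     TEXT NOT NULL,
    external_id TEXT                      -- link out to another database
);
CREATE TABLE pairwise_divergence (        -- group-level evolutionary data
    speed_id INTEGER NOT NULL REFERENCES ortholog_group(speed_id),
    seq_a    INTEGER NOT NULL REFERENCES sequence(sequence_id),
    seq_b    INTEGER NOT NULL REFERENCES sequence(sequence_id),
    ka REAL,
    ks REAL
);
""")

# Placeholder rows (made-up ids and values, for illustration only).
conn.execute("INSERT INTO ortholog_group VALUES (1)")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?, ?)",
    [(10, 1, "Homo sapiens", "GENE_H"), (11, 1, "Mus musculus", "GENE_M")],
)
conn.execute("INSERT INTO pairwise_divergence VALUES (1, 10, 11, 0.02, 0.5)")


def species_in_group(connection, speed_id):
    """Return the species represented in one orthologous group, sorted by name."""
    rows = connection.execute(
        "SELECT species FROM sequence WHERE speed_id = ? ORDER BY species",
        (speed_id,),
    )
    return [row[0] for row in rows]


members = species_in_group(conn, 1)
```

Keeping species-specific attributes on the sequence table and evolutionary values on tables keyed by the group id means that a new species or a new parameter adds rows or tables without altering the existing ones.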

The primary interface for most users is the web-based front end. Designed to be user friendly without limiting access to any of the data, it is partitioned into gene search and data display sections. Orthologous groups of interest are identified via the gene search portals and then displayed. A complete data model is included as Supplementary Figure 1, and a site map for the SPEED homepage is included as Supplementary Figure 2.

Datasets used were obtained from Ensembl v36 (Birney et al., 2006). A total of seven mammalian datasets were used: human, Homo sapiens (NCBI 35); chimp, Pan troglodytes (PanTro 1.0); rhesus macaque, Macaca mulatta (Mmul 0.1); mouse, Mus musculus (NCBI m34); rat, Rattus norvegicus (RGSC 3.4); dog, Canis familiaris (CanFam 1.0); cow, Bos taurus (Btau 2.0). Two additional species were included where possible: opossum, Monodelphis domestica (source: MonDom 2.0) as a basal mammalian representative and chicken, Gallus gallus (source: WASHUC1) as a true outgroup.

Orthologous groups were identified using BLAST reciprocal best hits (RBH) (Tatusov et al., 1997, 2003). Clusters of RBHs were generated by first requiring that all genes show RBH with all other genes in the cluster. Additional genes were then added by requiring RBH to at least three members of an orthologous group. To verify putative orthologous relationships, crude synteny maps were created using orthologs as anchors. These maps were in agreement with previously published synteny maps (Brudno et al., 2004). A total of 17 352 orthologous groups were identified. Among these, 5370 have orthologs identified in each of the nine species interrogated and 9401 represent groups containing orthologs for all seven mammalian species. A total of 13 506 groups contain orthologs in all three sequenced primate species, with 10 929 of these also containing mouse and rat orthologs.
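The clustering rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the best-hit table is assumed to have been parsed from BLAST output beforehand, all gene and function names are hypothetical, and candidates are tested against the growing cluster, a detail the text does not specify.

```python
from itertools import combinations


def rbh_pairs(best_hit, species_of):
    """Return the set of reciprocal best-hit pairs.

    best_hit maps gene -> {target species: best-scoring gene in that species}.
    species_of maps gene -> its own species.
    """
    pairs = set()
    for gene, hits in best_hit.items():
        for hit in hits.values():
            # The hit must point back to the query gene in the query's species.
            if best_hit.get(hit, {}).get(species_of[gene]) == gene:
                pairs.add(frozenset((gene, hit)))
    return pairs


def extend_cluster(core, candidates, pairs, min_links=3):
    """Grow a fully connected core by adding genes with enough RBH links."""
    if not all(frozenset(p) in pairs for p in combinations(core, 2)):
        raise ValueError("core genes must all be reciprocal best hits of each other")
    cluster = set(core)
    for gene in candidates:
        links = sum(frozenset((gene, member)) in pairs for member in cluster)
        if links >= min_links:
            cluster.add(gene)
    return cluster


# Toy data: d1 is a reciprocal best hit of h1 only, so it is not added.
species_of = {"h1": "human", "c1": "chimp", "m1": "mouse", "r1": "rat", "d1": "dog"}
best_hit = {
    "h1": {"chimp": "c1", "mouse": "m1", "rat": "r1", "dog": "d1"},
    "c1": {"human": "h1", "mouse": "m1", "rat": "r1", "dog": "d2"},
    "m1": {"human": "h1", "chimp": "c1", "rat": "r1", "dog": "d2"},
    "r1": {"human": "h1", "chimp": "c1", "mouse": "m1", "dog": "d2"},
    "d1": {"human": "h1", "chimp": "c1", "mouse": "m1", "rat": "r1"},
}
pairs = rbh_pairs(best_hit, species_of)
cluster = extend_cluster({"h1", "c1", "m1"}, ["r1", "d1"], pairs)
```

In this toy case the core of human, chimp and mouse genes gains the rat gene, which is a reciprocal best hit of all three, while the dog gene is left out.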

Alignments were accomplished using CLUSTAL (Chenna et al., 2003) on translated sequences frame-aligned with nucleotide sequences, preserving gaps in the amino acid alignment. Divergence data were calculated using the modified Li method (Li, 1993). Sliding-window analyses were conducted using a window size of 33 codons and a step size of 10 codons. Expression information was drawn from the Cancer Genome Anatomy Project (CGAP) (Riggins and Strausberg, 2001) and information on metabolic pathways was culled from Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2006).
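Two of the steps above lend themselves to a brief sketch: threading a coding sequence onto its aligned protein so that gaps are preserved, and enumerating sliding windows of 33 codons with a step of 10. The function names are hypothetical. The sketch assumes that the coding sequence excludes the stop codon and that only full-length windows are reported; neither detail is stated in the text.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto its gapped protein alignment.

    Each residue consumes one codon; each '-' becomes '---'.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("protein and coding sequence lengths disagree")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter) for aa in aligned_protein)


def sliding_windows(n_codons, window=33, step=10):
    """Return (start, end) codon indices, end exclusive, for each full window."""
    return [(start, start + window) for start in range(0, n_codons - window + 1, step)]


codon_alignment = thread_codons("M-K", "ATGAAA")
windows = sliding_windows(60)
```

Working in codon units keeps every window in frame, so divergence values computed per window refer to whole codons.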


    3 IMPLEMENTATION
 
Six primary portals for gene information retrieval are provided: (1) gene name, symbol or description; (2) associated disease; (3) Pfam conserved domain; (4) tissue of expression; (5) functional pathway; or (6) protein amino acid motif. Upon selection of a single orthologous group, both summary information on individual species and data on evolutionary comparisons of orthologs are provided. This summary information includes gene names and descriptions for each ortholog as well as links out to additional databases (such as RefSeq and EntrezGene). Also provided are the sequence length, cytogenetic position, a global multi-way alignment (of both the nucleotide and amino acid sequences), expression and pathway information. For each orthologous group, every pair-wise comparison is made and values are presented for Ka, Ks, Ka/Ks, as well as numbers of amino acid and synonymous changes (raw and corrected for multiple hits). A breakdown of amino acid changes (with residue-specific information) and a sliding-window display of sequence evolution are also available. Aligned sequences can be downloaded for further local analysis.
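As a small illustration of what "every pair-wise comparison" implies, the sketch below enumerates the species pairs for a group with all nine species and computes a Ka/Ks ratio. The function names are hypothetical, and returning None when Ks is zero is an assumption of this sketch, not a documented SPEED behaviour.

```python
from itertools import combinations

SPECIES = [
    "human", "chimp", "rhesus macaque", "mouse", "rat",
    "dog", "cow", "opossum", "chicken",
]


def all_pairs(species):
    """Return every unordered pair of species, in input order."""
    return list(combinations(species, 2))


def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks


pairs_for_full_group = all_pairs(SPECIES)
```

A group with representatives from all nine species therefore yields 36 pair-wise comparisons, and groups with fewer members yield correspondingly fewer.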


    4 DISCUSSION
 
Additional genomes and the large-scale data analysis projects that accompany them are producing valuable information that is broadly applicable to the life sciences. SPEED is a conduit through which molecular evolutionary data can be made available to researchers regardless of discipline.

Earlier versions of SPEED have formed the backbone for a number of large-scale evolutionary studies (Choi et al., 2005; Malcom et al., 2003; Vallender and Lahn, 2004; Wyckoff et al., 2005) as well as studies on specific genes of interest (Dorus et al., 2004; Gilbert et al., 2005). SPEED can be used to determine orthologous genes across species and display the similarities, differences and rates of evolution between species. It allows for consideration of variation in these factors within genes, whether by focusing on positional heterogeneity, evolutionary parameters of individual domains or even properties of mutations between individual amino acids. The orthologous groups are tied into many traditionally useful databases, allowing genes and their evolutionary signatures to be considered in context. We believe SPEED will greatly aid inter-disciplinary studies in the post-genomic era.


    Acknowledgments
 
This work was supported by a University of Chicago William Rainey Harper Dissertation Fellowship (to E.J.V.), a University of Missouri Research Board grant (to G.J.W.) and a Searle Scholarship and Burroughs Wellcome Career Award (to B.T.L.).

Conflict of Interest: none declared.


    FOOTNOTES
 
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Associate Editor: Chris Stoeckert

Received on June 1, 2006; revised on August 29, 2006; accepted on August 31, 2006

    REFERENCES
 

    Birney, E., et al. (2006) Ensembl 2006. Nucleic Acids Res., 34, D556–D561.

    Brudno, M., et al. (2004) Automated whole-genome multiple alignment of rat, mouse, and human. Genome Res., 14, 685–692.

    Bustamante, C.D., et al. (2005) Natural selection on protein-coding genes in the human genome. Nature, 437, 1153–1157.

    Chenna, R., et al. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res., 31, 3497–3500.

    Choi, S.S., et al. (2005) Robust signals of coevolution of interacting residues in mammalian proteomes identified by phylogeny-aided structural analysis. Nature Genet., 37, 1367–1371.

    Clark, A.G., et al. (2003) Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science, 302, 1960–1963.

    Dorus, S., et al. (2004) Accelerated evolution of nervous system genes in the origin of Homo sapiens. Cell, 119, 1027–1040.

    Gibbs, R.A., et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428, 493–521.

    Gilbert, S.L., et al. (2005) Genetic links between brain development and brain evolution. Nature Rev. Genet., 6, 581–590.

    Kanehisa, M., et al. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34, D354–D357.

    Lander, E.S., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

    Li, W.H. (1993) Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol., 36, 96–99.

    Lindblad-Toh, K., et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438, 803–819.

    Malcom, C.M., et al. (2003) Genic mutation rates in mammals: local similarity, chromosomal heterogeneity, and X-versus-autosome disparity. Mol. Biol. Evol., 20, 1633–1641.

    Nielsen, R., et al. (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol., 3, e170.

    Riggins, G.J. and Strausberg, R.L. (2001) Genome and genetic resources from the Cancer Genome Anatomy Project. Hum. Mol. Genet., 10, 663–667.

    Tatusov, R.L., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.

    Tatusov, R.L., et al. (1997) A genomic perspective on protein families. Science, 278, 631–637.

    The Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature, 437, 69–87.

    Vallender, E.J. and Lahn, B.T. (2004) Effects of chromosomal rearrangements on human-chimpanzee molecular evolution. Genomics, 84, 757–761.

    Venter, J.C., et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.

    Waterston, R.H., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

    Webster, M.T., et al. (2004) Gene expression, synteny, and local similarity in human noncoding mutation rates. Mol. Biol. Evol., 21, 1820–1830.

    Wheeler, D.L., et al. (2006) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 34, D173–D180.

    Wyckoff, G.J., et al. (2005) A highly unexpected strong correlation between fixation probability of nonsynonymous mutations and mutation rate. Trends Genet., 21, 381–385.


This article has been cited by other articles:

T. Hachiya, Y. Osana, K. Popendorf, and Y. Sakakibara. Accurate identification of orthologous segments among multiple genomes. Bioinformatics, April 1, 2009; 25(7): 853 - 860.

S. K. Bag, S. Paul, S. Ghosh, and C. Dutta. Reverse Polarization in Amino acid and Nucleotide Substitution Patterns Between Human Mouse Orthologs of Two Compositional Extrema. DNA Res, September 25, 2007; (2007) dsm015v1.

