Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(3):418-420; doi:10.1093/bioinformatics/bti010

Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.

PhenomicDB: a multi-species genotype/phenotype database for comparative phenomics

Abdullah Kahraman 1, Andrey Avramov 2, Lyubomir G. Nashev 2, Dimitar Popov 2, Rainer Ternes 3, Hans-Dieter Pohlenz 3 and Bertram Weiss 3,*

1 Department of Bioinformatics, University of Applied Science Giessen 35596 Giessen, Germany
2 Metalife AG Im Metapark 1, 79297 Winden, Germany
3 Research Laboratories, Schering AG 13342 Berlin, Germany

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 MOTIVATION AND CONCEPT
 IMPLEMENTATION
 DISCUSSION AND OUTLOOK
 REFERENCES
 

Summary: We have created PhenomicDB, a multi-species genotype/phenotype database, by merging public genotype/phenotype data from a wide range of model organisms and Homo sapiens. Until now these data were available only in distinct organism-specific databases (e.g. WormBase, OMIM, FlyBase and MGI). We compiled this wealth of data into a single integrated resource by coarse-grained semantic mapping of the phenotypic data fields, by including common gene indices (NCBI Gene), and by using the associated orthology relationships. With its use-case-oriented user interface, PhenomicDB allows scientists to compare and browse known phenotypes for a given gene or a set of genes from different organisms simultaneously.

Availability: PhenomicDB has been implemented at Schering AG as described below. A PhenomicDB implementation differing in some technical details has been set up for the public at Metalife AG (http://www.phenomicDB.de).

Contact: bertram.weiss{at}schering.de

Supplementary information: database model, semantic mapping table.


    MOTIVATION AND CONCEPT
 
More and more phenotypic data are being generated for both model and non-model organisms. New technologies such as RNAi now make genome-wide knock-down studies feasible and have already been applied in a high-throughput manner, for instance, to Homo sapiens (Berns et al., 2004; Fraser, 2004; Paddison et al., 2004). Valuable resources for phenotypic data are already available, but only for a given organism, for example, OMIM (Hamosh et al., 2002), WormBase (Harris et al., 2004), FlyBase (The FlyBase Consortium 2003) and MGD (Blake et al., 2003). Scientists have realized that there is an additional need to make phenotypic data from different organisms simultaneously searchable, visible and, most importantly, comparable (Lussier and Li, 2004).

Currently, research scientists looking for genes involved in a given disease have to search several different phenotype databases. They need to work out the orthology relationships among all genes concerned by hand in order to understand how a certain gene affects the phenotype in different organisms. These species-specific databases are scattered over the Internet, tailored to different objectives, and store phenotypic data in different formats. Tedious manual work is therefore necessary to compare the phenotype of a gene in different organisms. A simple meta-search engine over these databases does not, on its own, solve this problem; providing the missing cross-species comparison is exactly the functionality we aimed to develop.

The different source databases all use different gene loci description systems (i.e. gene indices) and the orthology relationships are not always obvious, so that many important phenotypic relationships may be difficult to discover. As others (Claustres et al., 2002; Lussier and Li, 2004) have already stated, a common data model combining the data with a common gene index is required. Orthology data must be available and a use-case-oriented user interface should facilitate access to phenotypic data. Most of the data are available but, to the best of our knowledge, an integrative system as described here is not yet available.

To remedy this situation, we set out to gather phenotype and genotype data from the different public resources and to map the data semantically into a single data model. To allow for direct comparison of phenotypes of orthologous genes from yeast to humans, we also loaded these mapped data together with a gene index-like database [NCBI Gene (Pruitt and Maglott, 2001)] and the associated orthology data [HomoloGene database (Wheeler et al., 2004)].

PhenomicDB is intended as a first step towards comparative phenomics and should improve our understanding of gene function by combining the knowledge about phenotypes from several organisms. PhenomicDB has to compromise between data depth, as available in the source databases, and data compatibility. It is not intended to compete with the much more dedicated primary source databases but tries to compensate for its partial loss of depth by linking back to the primary sources. The basic functional concept of PhenomicDB is an integrated meta-search engine for phenotypes.

Users should be aware that comparing genotypes or even phenotypes between organisms as different as yeast and humans may involve serious scientific hurdles. Nevertheless, finding, for instance, that the phenotype of a given mouse gene is described as ‘similar to psoriasis’ and at the same time that the human orthologue has been described as a gene linked to skin defects can lead to novel and interesting hypotheses. Similarly, a gene involved in cancer in mammalian organisms could show a proliferation phenotype in a lower organism such as yeast, and this knowledge may lead to further insights.


    IMPLEMENTATION
 
We implemented scripts to download phenotype/genotype data from public databases for Mus musculus (MGD), H. sapiens (OMIM), Drosophila melanogaster (FlyBase), Caenorhabditis elegans and Caenorhabditis briggsae (WormBase), Arabidopsis thaliana (MAtDB) (Schoof et al., 2004) and Saccharomyces cerevisiae (CYGD) (Mewes et al., 2004). In addition, NCBI Gene and HomoloGene were downloaded. Whenever possible, the given source genotypes (here meant as the equivalent of gene loci) were mapped to an NCBI Gene entry.

We performed coarse-grained semantic field mapping to bring the very heterogeneous phenotype data from different organisms into a common data model. Fields in the different source databases with the same content type (e.g. containing the phenotype description) were identified and, irrespective of their original name in the source, loaded into the corresponding PhenomicDB data field (e.g. ‘phenotype description’). Details of all semantic mappings and of how we connected the data are shown in the semantic mapping table and the database schemata, both available as Supplementary Information.
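
The idea of coarse-grained field mapping can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source field names in FIELD_MAP (clinical_synopsis, rnai_phenotype, phenotypic_class, allele_symbol) and the helper map_record are invented for illustration, and the actual mappings are those in the supplementary semantic mapping table.

```python
# Hypothetical mapping table: (source database, source field) -> common field.
# The source field names are invented for illustration only.
FIELD_MAP = {
    ("OMIM", "clinical_synopsis"): "phenotype description",
    ("WormBase", "rnai_phenotype"): "phenotype description",
    ("FlyBase", "phenotypic_class"): "phenotype keyword",
    ("MGD", "allele_symbol"): "symbol",
}


def map_record(source, record):
    """Return the record re-keyed by common field names.

    Source fields without a mapping are dropped, which mirrors the partial
    loss of depth that a coarse-grained mapping accepts.
    """
    mapped = {"source": source}
    for field, value in record.items():
        common = FIELD_MAP.get((source, field))
        if common is not None:
            # Several source fields may map to one common field; collect them.
            mapped.setdefault(common, []).append(value)
    return mapped


worm = map_record("WormBase", {"rnai_phenotype": "embryonic lethal", "lab": "XY"})
human = map_record("OMIM", {"clinical_synopsis": "skin defects"})
```

In this sketch both records end up with a ‘phenotype description’ entry regardless of what the field was called in the source, while the unmapped field (lab) is discarded.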

PhenomicDB was designed as a normalized relational Oracle v. 8.1.7.4 database. The database schema comprises three parts: common data, genotype data and phenotype data. The common part is used for information shared between genotypes and phenotypes, e.g. name, symbol, organism or literature references. The genotype-specific part contains specific genotype information (e.g. gene description, chromosomal location, gene ontology, etc.). Each genotype entry can relate to an NCBI Gene identifier in order to bin identical genes that are represented by, for example, different transcript identifiers in the source databases. The use of NCBI Gene identifiers is a prerequisite for making use of the orthology relationships loaded from the HomoloGene database. These relationships are captured as pairs of orthologous NCBI Gene identifiers (as determined by HomoloGene). The phenotype-specific part stores the free-text phenotype descriptions and the data that describe the underlying experiments (e.g. mutagenesis, RNAi and k.o. mice). Associated phenotype keywords or catalogue terms are stored as well. The genotype and phenotype parts are treated separately in our database. The connection between the two parts is implemented by genotype and phenotype id-mapping. Owing to the nature of the Oracle RDBMS, all knowledge-containing fields are searchable. For performance reasons, those search functions which are accessible through the user interface are supported by interMedia text indices (Oracle).
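
The three-part layout described above can be sketched as a toy schema. The following Python example uses the standard-library sqlite3 module in place of Oracle, and all table names, column names and gene identifiers (1001, 2001) are assumptions made for illustration; the actual data model is the one given in the Supplementary Information.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (              -- common part, shared by both entry types
    entry_id INTEGER PRIMARY KEY,
    name TEXT, symbol TEXT, organism TEXT
);
CREATE TABLE genotype (           -- genotype-specific part
    genotype_id INTEGER PRIMARY KEY REFERENCES entry(entry_id),
    description TEXT,
    ncbi_gene_id INTEGER          -- optional link to an NCBI Gene identifier
);
CREATE TABLE phenotype (          -- phenotype-specific part
    phenotype_id INTEGER PRIMARY KEY REFERENCES entry(entry_id),
    description TEXT,
    experiment TEXT
);
CREATE TABLE genotype_phenotype ( -- id-mapping between the two parts
    genotype_id INTEGER REFERENCES genotype(genotype_id),
    phenotype_id INTEGER REFERENCES phenotype(phenotype_id)
);
CREATE TABLE orthology (          -- pairs of orthologous gene identifiers
    gene_id_a INTEGER, gene_id_b INTEGER
);
""")

# Invented example rows; the gene identifiers are not real NCBI Gene ids.
conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?)", [
    (1, "mouse gene", "Ga", "Mus musculus"),
    (2, "human gene", "GA", "Homo sapiens"),
    (3, "mouse phenotype", None, "Mus musculus"),
    (4, "human phenotype", None, "Homo sapiens"),
])
conn.executemany("INSERT INTO genotype VALUES (?, ?, ?)",
                 [(1, "mouse locus", 1001), (2, "human locus", 2001)])
conn.executemany("INSERT INTO phenotype VALUES (?, ?, ?)",
                 [(3, "similar to psoriasis", "k.o. mice"),
                  (4, "skin defects", None)])
conn.executemany("INSERT INTO genotype_phenotype VALUES (?, ?)", [(1, 3), (2, 4)])
conn.execute("INSERT INTO orthology VALUES (?, ?)", (1001, 2001))

# Phenotypes recorded for the orthologues of gene 1001.
rows = conn.execute("""
    SELECT e.organism, p.description
    FROM orthology o
    JOIN genotype g ON g.ncbi_gene_id = o.gene_id_b
    JOIN genotype_phenotype gp ON gp.genotype_id = g.genotype_id
    JOIN phenotype p ON p.phenotype_id = gp.phenotype_id
    JOIN entry e ON e.entry_id = g.genotype_id
    WHERE o.gene_id_a = ?
""", (1001,)).fetchall()
```

The query only works because both genotypes carry a shared gene identifier, which is the role the text assigns to NCBI Gene as a prerequisite for using the orthology pairs.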

The user is presented with a search interface that allows free-text search with the ability to restrict the search to certain fields (e.g. gene name, organism, etc.) or to show only genotypes that have an associated phenotype, or vice versa. The result is a list of genotypes with their associated phenotypes. From the result list, the user can select genotypes/phenotypes of further interest and expand these with their orthologues and the phenotypes of their orthologues. This allows direct and simultaneous comparison of all available phenotypes for a certain gene in all available organisms. Each result is hyperlinked to a dedicated genotype/phenotype report and directly to the source database. Genotype/phenotype reports contain the associated data about the genotype (usually a gene) and the associated phenotype(s), including phenotype experiments.
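
The expansion of a result list with orthologues and their phenotypes might look roughly like the following Python sketch. The data, the gene identifiers and the function names (build_orthologue_index, expand_with_orthologues) are hypothetical, and only direct orthologue pairs are followed; how PhenomicDB itself resolves orthology groups is not described at this level of detail.

```python
from collections import defaultdict

# Invented toy data: gene id -> (organism, phenotype descriptions).
GENES = {
    101: ("Mus musculus", ["similar to psoriasis"]),
    202: ("Homo sapiens", ["skin defects"]),
    303: ("Saccharomyces cerevisiae", []),
}
# Invented orthologue pairs, stored as unordered pairs of gene ids.
ORTHOLOGUE_PAIRS = [(101, 202), (202, 303)]


def build_orthologue_index(pairs):
    """Index each gene id to the set of its direct orthologues."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index


def expand_with_orthologues(selected, index, genes):
    """For each selected gene, list it first, then its direct orthologues."""
    report = {}
    for gene_id in selected:
        rows = []
        for gid in [gene_id] + sorted(index.get(gene_id, ())):
            organism, phenotypes = genes[gid]
            rows.append((gid, organism, phenotypes))
        report[gene_id] = rows
    return report


index = build_orthologue_index(ORTHOLOGUE_PAIRS)
report = expand_with_orthologues([101], index, GENES)
```

Here the mouse gene is shown next to its human orthologue, so the two phenotype descriptions can be read side by side; the yeast gene is not included because it is paired only with the human gene in this toy data.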


    DISCUSSION AND OUTLOOK
 
Data-integrative approaches over a range of organisms decrease conceptual accuracy or eliminate data: a genotype in our database can mean a gene, a mutated gene or a chromosomal region. Phenotype descriptions range from the mere mention of ‘non-viability’ in yeast to the detailed characterization of a knockout mouse including all the experimental details. However, only these general concepts allow for data integration. The detailed and extensive description of the data, or dedicated mining tools such as PhenoBlast (Gunsalus et al., 2004), should stay within the realm of the organism-specific source databases. Others (Lussier and Li, 2004) have started with the integration of phenotypic notation and terminology over several species or have proposed common semantics for genome-wide phenotype databases (Claustres et al., 2002). They have also discussed in more detail the difficulties of integrating phenotypic terminology that differs significantly between the organism-specific research communities. We adopted a practical approach clearly intended to allow for high-level data integration and easy integration of newly available data. The data content will be updated every 8 weeks. PhenomicDB now has to prove that the method of integration applied here can add value to the scientific exploitation of phenome data.

Most of the valuable phenotypic data reside in the public literature not captured in databases. Effective text mining is needed to gather these data as well. A prerequisite for text mining, however, is the availability of specified thesauri, catalogues and validated terms. Those are not yet available for phenotypic data (Gunsalus et al., 2004). First steps are underway (Lussier and Li, 2004) and PhenomicDB could be used as a resource to extract such phenotype-specific vocabulary. We have started compiling thesauri from PhenomicDB to use them for the extraction of phenotypic data from literature by text mining.


    Acknowledgments
 
We thank Dr Bernard Haendler (Schering AG) for useful discussion of the manuscript, and Dr Stephan Brock (Metalife) and Prof. Michael Schoenemann (Metalife) for their continuous support.

Received on June 25, 2004; revised on August 11, 2004; accepted on August 30, 2004

    REFERENCES
 

    Berns, K., Hijmans, E.M., Mullenders, J., Brummelkamp, T.R., Velds, A., Heimerikx, M., Kerkhoven, R.M., Madiredjo, M., Nijkamp, W., Weigelt, B., et al. (2004) A large-scale RNAi screen in human cells identifies new components of the p53 pathway. Nature, 428, 431–437.

    Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A., Eppig, J.T. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res., 31, 193–195.

    Claustres, M., Horaitis, O., Vanevski, M., Cotton, R.G. (2002) Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases. Genome Res., 12, 680–688.

    The FlyBase Consortium. (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res., 31, 172–175.

    Fraser, A. (2004) RNA interference: human genes hit the big screen. Nature, 428, 375–378.

    Gunsalus, K.C., Yueh, W.C., MacMenamin, P., Piano, F. (2004) RNAiDB and PhenoBlast: web tools for genome-wide phenotypic mapping projects. Nucleic Acids Res., 32, D406–D410.

    Hamosh, A., Scott, A.F., Amberger, J., Bocchini, C., Valle, D., McKusick, V.A. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52–55.

    Harris, T.W., Chen, N., Cunningham, F., Tello-Ruiz, M., Antoshechkin, I., Bastiani, C., Bieri, T., Blasiar, D., Bradnam, K., Chan, J., et al. (2004) WormBase: a multi-species resource for nematode biology and genomics. Nucleic Acids Res., 32, D411–D417.

    Lussier, Y.A. and Li, J. (2004) Terminological mapping for high throughput comparative biology of phenotypes. Pac. Symp. Biocomput., 202–213.

    Mewes, H.W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G., Munsterkotter, M., Pagel, P., Strack, N., Stumpflen, V., Warfsmann, J., Ruepp, A. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32, D41–D44.

    Paddison, P.J., Silva, J.M., Conklin, D.S., Schlabach, M., Li, M., Aruleba, S., Balija, V., O’Shaughnessy, A., Gnoj, L., Scobie, K. (2004) A resource for large-scale RNA-interference-based screens in mammals. Nature, 428, 427–431.

    Pruitt, K.D. and Maglott, D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137–140.

    Schoof, H., Ernst, R., Nazarov, V., Pfeifer, L., Mewes, H.W., Mayer, K.F. (2004) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics. Nucleic Acids Res., 32, D373–D376.

    Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, L.M., Schriml, G.D., Sequeira, E., et al. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res., 32, D35–D40.




This article has been cited by other articles:

M. G. Kann
Protein interactions and disease: computational approaches to uncover the etiology of diseases
Brief Bioinform, September 1, 2007; 8(5): 333 - 346.

S. Y. Kim and W. C. Hahn
Cancer genomics: integrating form and function
Carcinogenesis, July 1, 2007; 28(7): 1387 - 1392.

P. Groth, N. Pavlova, I. Kalev, S. Tonov, G. Georgiev, H.-D. Pohlenz, and B. Weiss
PhenomicDB: a new cross-species genotype/phenotype resource
Nucleic Acids Res., January 12, 2007; 35(suppl_1): D696 - D699.

Y. A. Lussier and Y. Liu
Computational Approaches to Phenotyping: High-Throughput Phenomics
Proceedings of the ATS, January 1, 2007; 4(1): 18 - 25.

A. Ng, B. Bursteinas, Q. Gao, E. Mollison, and M. Zvelebil
Resources for integrative systems biology: from data through databases to networks and dynamic system models
Brief Bioinform, December 1, 2006; 7(4): 318 - 330.
