Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(19):2421-2429; doi:10.1093/bioinformatics/btl405
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Bio-Ontology and text: bridging the modeling gap
1 Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
2 Department of Pathology, The University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637, USA
3 Department of Radiation Oncology, The University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637, USA
4 Department of Medicine, Center for Biomedical Informatics, The University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637, USA
*To whom correspondence should be addressed.
Motivation: Natural language processing (NLP) techniques are increasingly being used in biology to automate the capture of new biological discoveries in text, which are being reported at a rapid rate. Yet, information represented in NLP data structures is classically very different from information organized with ontologies as found in model organisms or genetic databases. To facilitate the computational reuse and integration of information buried in unstructured text with that of genetic databases, we propose and evaluate a translational schema that represents a comprehensive set of phenotypic and genetic entities, as well as their closely related biomedical entities and relations as expressed in natural language. In addition, the schema connects different scales of biological information, and provides mappings from the textual information to existing ontologies, which are essential in biology for integration, organization, dissemination and knowledge management of heterogeneous phenotypic information. A common comprehensive representation for otherwise heterogeneous phenotypic and genetic datasets, such as the one proposed, is critical for advancing systems biology because it enables acquisition and reuse of unprecedented volumes of diverse types of knowledge and information from text.
Results: A novel representational schema, PGschema, was developed that enables translation of phenotypic, genetic and their closely related information found in textual narratives to a well-defined data structure comprising phenotypic and genetic concepts from established ontologies along with modifiers and relationships. Evaluation for coverage of a selected set of entities showed that 90% of the information could be represented (95% confidence interval: 86-93%; n = 268). Moreover, PGschema can be expressed automatically in an XML format using natural language techniques to process the text. To our knowledge, we are providing the first evaluation of a translational schema for NLP that contains declarative knowledge about genes and their associated biomedical data (e.g. phenotypes).
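As a rough check on the reported coverage figure, the sketch below computes a 95% confidence interval for a binomial proportion with n = 268. Two things here are assumptions rather than statements from the abstract: the count of 241 represented items (chosen because 241/268 rounds to 90%) and the use of the Wilson score interval (the abstract does not name the interval method). Under those assumptions the result is consistent with the reported interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 241 of the 268 evaluated items were representable (about 90%).
# The abstract reports only the percentage and n, not the raw count.
represented, total = 241, 268
low, high = wilson_interval(represented, total)
print(f"coverage = {represented / total:.1%}")
print(f"95% CI   = {low:.1%} to {high:.1%}")
```

Other interval methods (for example the normal approximation or an exact binomial interval) would give slightly different bounds, so this is an illustration of the arithmetic rather than a reproduction of the authors' calculation.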
Availability: http://zellig.cpmc.columbia.edu/PGschema
Contact: Friedman{at}dbmi.columbia.edu or Lussier{at}uchicago.edu
Received on May 15, 2006; revised on July 20, 2006; accepted on July 21, 2006
This article has been cited by other articles:
Y. A. Lussier and Y. Liu. Computational Approaches to Phenotyping: High-Throughput Phenomics. Proceedings of the ATS, January 1, 2007; 4(1): 18-25.