Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(19):2421-2429; doi:10.1093/bioinformatics/btl405
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Bio-Ontology and text: bridging the modeling gap
,*
1 Department of Biomedical Informatics, Columbia University New York, NY 10032, USA
2 Department of Pathology The University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637,USA
3 Department of Radiation Oncology The University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637,USA
4 Department of Medicine, Center for Biomedical Informatics The University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637,USA
*To whom correspondence should be addressed.
ABSTRACT
Motivation: Natural language processing (NLP) techniques are increasingly being used in biology to automate the capture of new biological discoveries in text, which are being reported at a rapid rate. Yet, information represented in NLP data structures is classically very different from information organized with ontologies as found in model organisms or genetic databases. To facilitate the computational reuse and integration of information buried in unstructured text with that of genetic databases, we propose and evaluate a translational schema that represents a comprehensive set of phenotypic and genetic entities, as well as their closely related biomedical entities and relations as expressed in natural language. In addition, the schema connects different scales of biological information, and provides mappings from the textual information to existing ontologies, which are essential in biology for integration, organization, dissemination and knowledge management of heterogeneous phenotypic information. A common comprehensive representation for otherwise heterogeneous phenotypic and genetic datasets, such as the one proposed, is critical for advancing systems biology because it enables acquisition and reuse of unprecedented volumes of diverse types of knowledge and information from text.
Results: A novel representational schema, PGschema, was developed that enables translation of phenotypic, genetic and their closely related information found in textual narratives to a well-defined data structure comprising phenotypic and genetic concepts from established ontologies along with modifiers and relationships. Evaluation for coverage of a selected set of entities showed that 90% of the information could be represented (95% confidence interval: 86–93%; n = 268). Moreover, PGschema can be expressed automatically in an XML format using natural language techniques to process the text. To our knowledge, we are providing the first evaluation of a translational schema for NLP that contains declarative knowledge about genes and their associated biomedical data (e.g. phenotypes).
Availability: http://zellig.cpmc.columbia.edu/PGschema
Contact: Friedman{at}dbmi.columbia.edu or Lussier{at}uchicago.edu
1 INTRODUCTION
New biological discoveries are being reported at an extremely rapid rate. This new information is found in diverse resources that encompass a broad array of journal articles and public databases associated with different sub-disciplines within biology and medicine. The integration of biological knowledge and information is recognized as a critical gap in science (Pennisi, 2005). It is also essential for the future of the field, because both automated applications and researchers need to access, connect and deploy this diverse information (Gardner, 2005; Gopalacharyulu et al., 2005). In addition, a large quantity of biological information resides in unstructured or semi-structured textual databases, thus posing a frequent, yet special, category of integration problem, which is addressed in this paper. While linguistic knowledge is computable using NLP, it does not allow for the same quality of inference as declarative knowledge. Thus, it is essential to translate linguistic structures generated by NLP into ontology-anchored declarative datasets to enable large-scale or cross-disciplinary inferences that are otherwise unattainable.
There are several requirements for high-throughput large-scale integration of textual information with biological knowledge: (1) the existence of ontologies or terminologies (Ashburner et al., 2000; Blake, 2004) that specify and describe biological concepts, (2) NLP methods that automatically acquire biomedical information occurring in unstructured text (Cohen et al., 2005; Hirschman et al., 2005), (3) the existence of a comprehensive information model specifying the biological entities and relations as described in text (Gkoutos et al., 2004), (4) methods, likely based on a biological information schema, allowing for translation of the data structures produced by NLP into those of structured and ontology-anchored databases and (5) integration and knowledge management tools that are based on coded data associated with established databases (Cantor et al., 2005). Therefore, to achieve reusability, it is critical that automated systems that structure textual information also map the information to a representation that provides codes linking the information in text to established ontologies. In addition, the representation must be rich enough to model the complex relationships that are typically described in text. Such a representation entails at least two levels of specification: (1) representation of the biomedical concepts via identifiers that correspond to existing ontologies or controlled terminologies and (2) representation of salient contextual information and relations, such as information that modifies and connects the coded concepts, because these are critical for accurate representation of biological information. The second level provides for fine-grained representation of information and relations, which is necessary for enabling expressiveness, such as that found in natural language, and also for enabling subsequent fine-grained retrieval of structured information that was extracted from text.
While a substantial amount of work has been devoted to the first level, which involves formal knowledge and reasoning within traditional ontologies, much less progress has been made at the second level, which is the focus of this paper.
In this paper, we propose an ontology-anchored representational schema for biological information called Phenotype-Genetic Schema (PGschema), which enables the translation of both phenotypic and genetic information found in the language of biological text and concepts closely related to phenotypic and genetic information (e.g. modifiers of phenotypes and diseases). For simplicity of notation and discussion, we will treat these related concepts as subsumed by the terms phenotypic and genetic information in this paper. This schema, focused on phenotypic and genetic information, represents (1) individual concepts, (2) modifiers of the concepts, (3) identifiers associated with external ontologies and (4) relationships between these concepts. Most importantly, it incorporates relevant external ontological identifiers as building blocks in order to represent more complex and expressive relations. Thus, PGschema is intended to utilize existing ontologies while serving as a bridge between natural language and the more formal bio-ontologies related to phenotypic and genetic information. Further, the schema can be realized automatically using an NLP technology that generates a compatible form of XML output (Lussier et al., 2006). Once information is in PGschema form, many automated applications become possible. For example, it would be possible to reliably find relations between genes and cellular processes, or to find relations between genes, functions and anatomical locations.
1.1 Natural language processing systems
A number of NLP and text mining systems have been described that extract limited information from biological text. For example, there are many systems that recognize or identify the names of biomolecular entities (BNER) (Hirschman et al., 2005), while other systems extract interactions between biomolecular entities (Rzhetsky et al., 2004; Hirschman et al., 2005), capture subcellular locations of proteins from text (Craven et al., 1999), or capture the kinase, substrate and residue associated with phosphorylation (Narayanaswamy et al., 2005). These systems require a relatively straightforward representational model. For example, a BNER system may insert tags around the entities in text where the tags specify the corresponding semantic classes and possibly unique identifiers. Similarly, a system that captures interactions can represent an interaction as a triplet interaction entity1 entity2, where the entities may or may not be coded. However, systems that capture more comprehensive informational relations generally do require representational schemas, particularly if convergence, completeness and integration with many different systems are objectives. Since we are currently developing an NLP system called BioMedLEE, which aims to capture a broad range of phenotypic–genetic entities and relations, we require a schema to represent the extracted information. Furthermore, for interoperability purposes, the schema should be well-defined and use an established specification language so that other applications can access the information appropriately. The BioMedLEE NLP system that uses PGschema has been implemented and evaluated for use with an application called PhenoGO (Lussier et al., 2006), providing a proof of concept that PGschema can be realized automatically. PhenoGO utilizes BioMedLEE to obtain information that augments gene-GO relations in the GO annotations database with additional context, such as cellular and other anatomical information.
However, the focus of the work reported here is the representational schema and not the NLP component or the applications using it.
Two other efforts involving integration of NLP and ontologies are the Obol effort (Mungall, 2004), and the GENIA effort (Kim, 2003). Obol has a different focus from our work because the aim is to assist in ontological development. More specifically, Obol uses NLP technology to process the ontological terms in order to discover unique computable definitions for them, to elicit relations between the elements composing the ontological terms, and to facilitate reasoning over the ontology. GENIA maintains an annotated corpus of biological entities, which substantially furthers the development of NLP systems. The entity types conform to a model consisting of substances and biological locations involved in protein interactions. The GENIA model has a semantic category Other for entities such as disease, process and phenotypic descriptions, which is our primary focus (thus PGschema has numerous semantic types for this semantic category of GENIA).
1.2 Ontologies
There are substantial efforts in the biological community for organizing biological concepts as controlled terminologies or ontologies (Ashburner et al., 2000; Blake, 2004; Stevens et al., 2002), and for developing tools that provide interoperability among different ontologies (Bodenreider, 2004; Cantor et al., 2005) in order to support intra- and interoperability among the different research communities. This is critical for the field because there are so many different groups working on the same model organism, different model organisms or different scales of biology. Some integrative ontologies concerned with biomolecular entities are UniProt (Bairoch et al., 2005), and UniGene (Wheeler et al., 2005), while Gene Ontology (Ashburner et al., 2000) is concerned with biomolecular functions, processes and subcellular components. Other ontologies are associated with phenotypic traits, such as mouse anatomy (MA) (Evsikov et al., 2004), mammalian phenotype ontology (MP) (Smith et al., 2005b), cell ontology (CL) (Bard et al., 2005), the Unified Medical Language System (UMLS) (Lindberg et al., 1993), and SNOMED (Spackman, 2004). In general, these efforts involve specification of the individual concepts so that they are associated with non-ambiguous unique identifiers, and are appropriately situated within a classification or part–whole hierarchy. The Open Biological Ontologies (OBO) (http://obo.sourceforge.net/) consortium hosts over 50 open source ontologies associated with phenotypic and biomolecular information. One of the OBO ontologies, called Phenotype, Attribute and Trait Ontology (PAtO) (Gkoutos et al., 2004), is a general ontology for describing phenotypes that can be measured either quantitatively or qualitatively. What is significant about PAtO is that it is species-independent. PAtO actually consists of two components, where one is the model, and the other is the attribute ontology.
It contains an Entity-Attribute-Value (EAV) representation where three ontological terms are linked together to form a description. The Entity component is the phenotype being described, and, most importantly, it can be associated with an ontology that is external to PAtO. In contrast to the entity component, the Attribute and Value components generally correspond to concepts internal to PAtO. There also has been work concerning the ontology of relations in biomedical ontologies (Smith et al., 2005a). This work differs from the treatment of relations in PGschema because in PGschema the relations are linguistically based and represent terms, such as cause, and play a role in, which connect different observations or events in text whereas the relations specified by Smith and colleagues provide consistent and formal ontological definitions. A more complete discussion of general issues concerning ontologies for biological concepts is found in (Baker et al., 1999), and a fuller discussion of issues associated with requirements for clinical terminologies can be found in (Chute et al., 1999).
1.3 Representation schemas for biomedical language
In addition to development of ontologies for individual concepts, there have been efforts in the clinical domain to model the complex clinical information associated with the language of patient documents. Models have been developed to represent information in specific medical domains, such as radiology (Evans et al., 1994; Friedman et al., 1994; Rector et al., 1995), anatomy (Rosse et al., 1998), and surgical procedures (Rodrigues et al., 1997), as well as for the broad medical domains (Campbell et al., 1994; Friedman et al., 1999). These models represent specific relations among concepts so that a clinical event or observation may be associated with multiple qualifiers (e.g. equivalent to PAtO attributes) and values, which denote different types of informational qualifiers, such as negation, time, severity, frequency, body location and descriptive information. These different types of modifiers are critical for automated applications that use structured information because they are needed to achieve highly precise retrieval results. Negation, uncertainty and previous events occur frequently in clinical documents, and therefore, an application that seeks to detect a current clinical condition must retrieve reports containing that condition and filter out ones that have occurred in the past or that have not been asserted. For example, in rule out pneumonia, the condition pneumonia is not being asserted, and should not be retrieved. Similarly, anatomical and other qualifiers are also critical when high accuracy is needed. For example, in worsening left lower lobe pneumonia, the lack of improvement of pneumonia may be important to capture along with the specific lobular location.
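The role of modifiers in retrieval can be illustrated with a small sketch. This is not MedLEE's data model or query language: the `Finding` class, the modifier keys `certainty` and `time`, and the value sets below are all hypothetical names chosen for illustration. The sketch only shows why a query for a current, asserted condition has to inspect modifiers as well as the concept itself.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One structured observation with its modifiers (hypothetical layout)."""
    concept: str
    modifiers: dict = field(default_factory=dict)


# Hypothetical modifier values that mean "do not retrieve as a current condition".
NOT_ASSERTED = {"no", "rule out"}
NOT_CURRENT = {"past", "history of"}


def currently_asserted(findings, concept):
    """Return findings of `concept` that are asserted and not in the past."""
    hits = []
    for finding in findings:
        if finding.concept != concept:
            continue
        if finding.modifiers.get("certainty") in NOT_ASSERTED:
            continue  # e.g. "rule out pneumonia" is not an assertion
        if finding.modifiers.get("time") in NOT_CURRENT:
            continue  # a previous event, not a current condition
        hits.append(finding)
    return hits
```

With this layout, a finding built from "rule out pneumonia" is filtered out, while one built from "worsening left lower lobe pneumonia" is retained together with its change and body-location modifiers.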
A clinical informational schema was developed for the MedLEE NLP system (Friedman et al., 1994), a natural language extraction and encoding system, which covers a broad range of clinical information (Friedman et al., 1999). Numerous evaluations have demonstrated that MedLEE performs similarly to medical experts (Hripcsak et al., 1995; Knirsch et al., 1998). A critical factor for achieving high performance was that retrieval of the information encoded by MedLEE was fine-grained owing to the way the extracted information was modeled. The evaluation studies that were performed were designed for clinical applications associated with decision support tasks, and they relied on queries to retrieve the structured output generated by MedLEE. What is significant about these queries is that they required complex medical logic, which included selecting and then filtering out cases based on clinical conditions along with various modifier combinations, such as certainty, time, anatomical locations and other contextual modifiers.
Our schema, PGschema, is framed on that of MedLEE (Friedman et al., 1999), but differs significantly from it in that PGschema is specifically designed to represent genotypic and phenotypic information, as well as compound relations and functions instead of clinical events. There are many similarities between phenotypic information and clinical events, and therefore representational schemas for clinical information are highly relevant. For example, each of them includes anatomical, morphological and functional entities, many of which are associated with similar modifier types, such as degree, change and certainty. PGschema is not an ontology, but a schema that represents compositional and contextual aspects of terms where the terms may be associated with ontological concepts in external ontologies. While PGschema, like PAtO, represents observations and qualifiers (e.g. attributes in PAtO) that have values, PGschema differs from PAtO in several ways: (1) in PGschema, an observation may have many qualifiers representing different types of information, (2) whenever possible, the value of a qualifier may be associated with codes from external ontologies, (3) a qualifier may have nested qualifiers, providing a mechanism for representation of very complex information, (4) an observation may be a phenotype, biomolecular entity, relation, or function, (5) complex entities, such as functions and relations, are represented as having arguments, which may be nested, that are also associated with directionality, (6) an observation can be but does not have to be associated with an external code and (7) the schema is based on information and relationships that occur in the language of biological text.
In summary, PGschema is designed to bridge the representational gap between heterogeneous ontologies and the natural way in which their genetic and phenotypic concepts are used and related to one another in the far more expressive and complex statements of biological narratives. As a result, the information coded in PGschema imparts the computational formalism of ontologies and the expressiveness of language.
2 METHODS
Since it is not currently possible to formally represent all information that occurs in text, we modeled a broad but selective set of biological concepts focused on genetic and phenotypic types of entities and relations as expressed in the literature. We have divided the methods into two sections: Iterative Conceptualization and Evaluation.
2.1 Iterative Conceptualization of PGschema
The model was developed iteratively using a random sample of 50 abstracts selected from a corpus of 3705 MEDLINE abstracts, where the corpus consisted of articles annotated for functional information by the Mouse Genome Informatics group (MGI) (Blake et al., 2003). First, an initial schema was established using the MedLEE schema as a foundation because clinical information has many similarities to phenotypic information. However, certain entities found in our prior work over clinical narratives, such as recommendation and demographic information, were removed because they were not applicable. Similarly, certain modifiers were also removed, such as family history. We then performed a manual analysis of the information in the sample corpus. Based on the sample and knowledge of biology, we revised the initial schema accordingly by determining the basic types of entities in the language of the biological text that were important to represent (e.g. gene, gene product, anatomy, process, cell), and then the types of information that modify the basic entities. For example, in text, a mutated allele may occur in the context of a specific cell type (e.g. p53−/− T cell), a gene may occur with organism information (e.g. mouse Ror2), and a phenotypic trait may occur with negation or anatomical information (e.g. absence of limbs, stiffness in joints).
After a revised design was established, the sample articles were analyzed again and the relevant information in the text was manually mapped into the model. Several rounds of refinements were made based on results of the manual mapping activity. Whenever relevant information could not be represented, the schema was revised accordingly, if possible. Once the modeling of the basic entities and their modifiers was deemed satisfactory, modeling of the relations and functions was performed in a similar manner. However, in addition to modifiers, a mechanism for representing arguments of the relations and functions as well as their directionality was specified. For example, in Dexamethasone induced cell death of T-cells, the function induce is represented so that it has an agent argument (e.g. the substance dexamethasone), and a target argument (e.g. the process cell death of T-cells). After several more rounds of analysis and refinement, we determined that the model was adequate for representing information captured automatically by an NLP system. We then modified the BioMedLEE NLP system so that it would automatically structure biological information in text in accordance with PGschema. BioMedLEE generates output in XML form that is compatible with the representational schema. A document type definition (DTD) was created that specifies the entities and relationships in conformance with PGschema, as shown in Figure 1.
2.2 Evaluation
We performed an initial evaluation of PGschema for coverage, which consisted of assessing the completeness of the modifiers associated with the various types of entities. For this effort, we chose the entities that were most important to represent for the NLP applications we were currently working on. These included the following types: biophysiological abnormality (diseases, morphologies, symptoms, phenotypic descriptions), process, anatomical body location, cell, organism and biomolecular entity. A set of sentences corresponding to each type was randomly selected for manual analysis. The set was obtained by first collecting a set of 16 851 MEDLINE abstracts related to the gene-GO annotations recorded in the GO database for the human. We chose a different dataset than the mouse genomic dataset that was used to develop PGschema to establish that it was generalizable to another organism. In order to facilitate the collection process for each type of entity, we used BioMedLEE to obtain structured output for the 178 686 distinct sentences. BioMedLEE was used because it identifies entities and their types based on lexical lookup, which is independent of the process of recognizing the modifiers of the entities and the relationships. Thus, BioMedLEE was used only as a tool to facilitate identification of sentences containing the type of entity to be analyzed. For each entity type, a program selected sentences containing a tag in the XML output that corresponded to the type of entity. For example, to select sentences containing a cell entity, sentences containing the tag cell in the XML output were chosen. Once the sentences were collected into sets, all tags were removed so that the sets consisted of the original sentences and references to the full abstracts. From each set, 100 sentences were randomly selected. The first 50 sentences were chosen for manual analysis, and the remaining sentences in each set formed a reserve set.
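The selection procedure described above can be sketched as follows. This is a minimal illustration and not the program used in the study: the function name `sample_sentences_for_type`, the representation of the input as (sentence, XML output) pairs, and the fixed random seed are assumptions. Only the tag-based selection, the random sample of 100, and the split into 50 analysis sentences plus a reserve set come from the text.

```python
import random
import xml.etree.ElementTree as ET


def sample_sentences_for_type(tagged, entity_tag, sample_size=100,
                              analysis_size=50, seed=0):
    """Pick sentences whose XML output contains `entity_tag`, then sample them.

    `tagged` is a list of (sentence_text, xml_string) pairs (hypothetical format).
    Returns (analysis_set, reserve_set) containing only the original sentences.
    """
    candidates = []
    for text, xml_string in tagged:
        root = ET.fromstring(xml_string)
        # root.iter(tag) walks the whole tree, including the root element.
        if next(root.iter(entity_tag), None) is not None:
            candidates.append(text)  # tags are discarded; the sentence is kept

    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    return sample[:analysis_size], sample[analysis_size:]
```

Calling `sample_sentences_for_type(tagged, "cell")` would return the analysis and reserve sets for the cell entity type.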
A different set of 25 sentences for each type of entity was also chosen, which was used to train the expert performing the evaluation. The curator, with expertise in biology, was not involved in the development of PGschema. First, the expert was taught some principles of linguistics and English grammar, and was then given guidelines that were developed to promote consistency in the evaluation. During the training session, problematic areas in the evaluation were identified and the guidelines revised accordingly.
The expert performed the manual analysis by reading each sentence in each set (and the abstract where they were collected from if necessary), identifying term(s) associated with the corresponding entity type, finding the modifiers of those terms, determining their semantic types if possible and finally determining whether they were specified in PGschema. The expert reviews were then analyzed qualitatively and quantitatively. The quantitative analysis consisted of estimating the coverage of PGschema over the set. This involved computing the ratio of instances that were covered over the instances that should have been covered. Confidence intervals were computed for a proportion p with sample size n ≥ 30 using the method described in (Walpole et al., 1978) where a (1 − α)*100% confidence interval for the binomial parameter p is approximately

$$\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}\,\hat{q}}{n}} \;<\; p \;<\; \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}\,\hat{q}}{n}}$$

In this formulation, $\hat{p}$ is the proportion of successes in a random sample of size n, $\hat{q} = (1 - \hat{p})$ and $Z_{\alpha/2}$ is the value of the standard normal curve leaving an area of α/2 to the right.
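The large-sample interval described above can be computed with a few lines of code. The sketch below is an illustration under stated assumptions, not the authors' analysis script: the function name `proportion_confidence_interval` is hypothetical, and the critical value is obtained from Python's standard `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, alpha=0.05):
    """Approximate (1 - alpha)*100% confidence interval for a binomial proportion."""
    if n < 30:
        raise ValueError("the normal approximation is intended for n >= 30")
    p_hat = successes / n          # observed proportion of successes
    q_hat = 1.0 - p_hat
    # Value of the standard normal curve leaving an area of alpha/2 to the right.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width
```

For example, `proportion_confidence_interval(90, 100)` gives the 95% interval for 90 covered instances out of 100; these counts are illustrative and are not taken from the study.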
Using the same set of 16 851 abstracts, a qualitative analysis was also performed to determine how much of the genotypic–phenotypic information PGschema actually covers. Five abstracts were randomly chosen from the set. The expert read them, manually determined the entities and their types, and then determined whether or not they were in PGschema.
3 RESULTS AND DISCUSSION
3.1 Schema description
PGschema was developed to represent a variety of information associated with biomolecular and phenotypic entities, modifiers and relationships that are found in biological text. Simplified overviews of the entity and modifier types are shown in Tables 1 and 2, but the actual representation is an XML form, which can be generated automatically as a result of processing text or through manual analysis by a curator. An example of a few specifications of the XML representation is shown in Figure 1 in the form of a document type definition (DTD). There are currently 29 types of entities or types of information that are represented. Table 1 lists the entity types and Table 2 lists some common types of modifiers, MD1, that were grouped for convenience to simplify specification of modifiers in Table 1. The tables provide examples of each type, and show what types of modifiers each entity type can have. For example, the entity ORG(anism) may have a GD (genetic descriptor) modifier (e.g. homozygous mice), a temporal modifier (e.g. newborn mice), a STR(ain) (e.g. C57BL/6J mice), and a MUTG (mutated gene/allele) modifier as in p53−/− mice. Another entity type is process, which can have several modifiers including temporal (e.g. embryonic stage development), ANAtomy (e.g. liver development, hepatocyte proliferation), change (e.g. increased proliferation) and certainty (e.g. failure to develop). Note that change and certainty are included in the MD1 group. The entity cell may have different modifiers than the other types of anatomical entities because it can have a genetic descriptor (e.g. wild-type fibroblast cells), or specify a gene's allele that has been modified (e.g. Traf5cells). The entity type GGP (gene_gproduct) is an artifact useful when it is not possible to determine whether an occurrence of an entity is a gene or gene product. This situation typically occurs when the precise molecular nature of the gene or protein mentioned in the text is ambiguous to the expert or the NLP system.
PGschema allows for representing a gene or its corresponding protein as a single type of construct. Column 1 of Table 1 is used to group types for the convenience of specifying ones with similar modifiers, as shown in column 4. In addition, the full term for the abbreviation is specified in parentheses in column 1. Some types of entities can only occur as modifiers because they do not correspond to independent observations or entities (strain, certainty). These are noted in Tables 1 and 2 by adding a single * following the name. The types without an * can occur in text as either an observation or a modifier (process, anatomy, gene). One type of modifier, named code, is different than the others because it does not occur in the text, but is used as metadata to associate identifiers of an entity with an external ontology. For phenotypes, the identifier may consist of three fields (e.g. MP:0000351 increased cell proliferation) where the first field specifies the applicable ontology (MP), the second the identifier of the concept in the ontology, and the third the preferred name of the concept according to the ontology, which is shown for improved readability. In the above example, MP is an abbreviation representing the mammalian phenotype ontology. For genes, the identifier may have an additional fourth field, which specifies the taxonomic code of the organism.
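The fields of a code modifier can be modeled as a small record. The sketch below is an illustration only: the class name `OntologyCode`, its attribute names, and the helper `from_fields` are hypothetical, and the sketch deliberately does not assume how the fields are delimited in the actual XML output beyond the `ONTOLOGY:ID` form seen in the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyCode:
    """Fields of a code modifier (attribute names are hypothetical)."""
    ontology: str                 # applicable ontology, e.g. "MP"
    concept_id: str               # identifier of the concept in that ontology
    preferred_name: str           # preferred name, kept for readability
    taxon: Optional[str] = None   # fourth field, used for genes only

    @classmethod
    def from_fields(cls, prefixed_id, preferred_name, taxon=None):
        """Split an 'ONTOLOGY:ID' string and attach the remaining fields."""
        ontology, _, concept_id = prefixed_id.partition(":")
        if not ontology or not concept_id:
            raise ValueError(f"expected 'ONTOLOGY:ID', got {prefixed_id!r}")
        return cls(ontology, concept_id, preferred_name, taxon)
```

For the example in the text, `OntologyCode.from_fields("MP:0000351", "increased cell proliferation")` yields a record whose `ontology` is `MP` and whose `taxon` is empty, since the fourth field applies only to genes.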
More complex information is represented by the entities FUN(ction) (inhibit, bind) and REL(ation) (correlate with, play role in). These entities may be further qualified by degree, change, certainty, temporal and anatomical modifiers (e.g. high level of activation, decreased activation, not activated, expression in liver), but, notably, they also specify arguments with directionality or order. PGschema was designed to have a mechanism for specifying these phenomena, which commonly occur in describing molecular and biological functions and processes, which cannot be easily represented using other NLP models. An argument is different from a modifier because the meaning of the function or relation substantially depends on the arguments and their roles. An example is the sentence Tenascin-C regulates cell proliferation, where the function regulate is represented so that it has an argument Tenascin-C belonging to the class GGP which is the agent of regulate, and an argument cell proliferation, which is a process that is the target. Tenascin-C is specified as an argument by adding a metadata tag arg, which has the value agent to the GGP element, and a metadata tag arg with the value target to the process element. Similarly, in Tenascin-C plays a role in cell proliferation the relation play role in would have two arguments, where the first argument would be the GGP element and the second argument would be the process element. Specification of argument order is accomplished by adding a metadata tag arg to each element and assigning it a value 1 or 2 corresponding to the order of the argument in the text. A specific role, such as agent or target, is not assigned to the arguments of relation at this point because the actual role would depend on knowledge of the particular relation, in which case post-processing or additional knowledge would be necessary.
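The difference between role-labeled arguments of a function and order-labeled arguments of a relation can be sketched with Python's standard `xml.etree.ElementTree` module. The element names `function`, `relation` and `ggp`, and the choice to store `arg` and `v` as XML attributes, are assumptions made for illustration; the actual element names are defined by the PGschema DTD, which is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_function(verb, agent, target):
    """Build a function element whose arguments carry explicit roles.

    `agent` and `target` are (element_name, value) pairs.
    """
    fun = ET.Element("function", {"v": verb})
    for role, (tag, value) in (("agent", agent), ("target", target)):
        ET.SubElement(fun, tag, {"v": value, "arg": role})
    return fun


def build_relation(name, first, second):
    """Build a relation element whose arguments carry only their textual order."""
    rel = ET.Element("relation", {"v": name})
    for order, (tag, value) in enumerate((first, second), start=1):
        ET.SubElement(rel, tag, {"v": value, "arg": str(order)})
    return rel


def argument(element, arg_value):
    """Return the child element with the given arg value, or None."""
    return next((child for child in element if child.get("arg") == arg_value), None)
```

Under these assumptions, `build_function("regulate", ("ggp", "Tenascin-C"), ("process", "cell proliferation"))` marks Tenascin-C as the agent and cell proliferation as the target, whereas `build_relation("play role in", ...)` with the same two arguments records only that they are arguments 1 and 2.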
Although Tables 1 and 2 show the entities in the schema in tabular form, the actual representation is an XML form that can be generated automatically by BioMedLEE when processing articles. Figure 1 illustrates examples of the DTD for two elements of PGschema. The element biophysiological_abn has optional elements, such as anatomy, code, and process, which are nested structures and are considered modifiers or qualifiers of biophysiological_abn. The v attribute of biophysiological_abn is a string corresponding to a textual term that denotes the value of the informational type, such as enlarged. Note that the change entity type can modify the type biophysiological_abn, and that change can only be modified by degree, certainty and temporal types of information.
Figure 2 illustrates a simplified form of the XML output obtained as a result of processing Hepatocytic proliferation was increased in livers of newborn C/EBPalpha knockout mice. Note that the XML output is consistent with Tables 1 and 2. In the XML form, the types are specified as tags and the instances as values of the tags. The primary observation for the information in the sample sentence is a process whose value is proliferation. In addition, it has several modifiers that are represented as nested elements. One modifier is an entity cell whose value is hepatocyte, which is linked to a code CL:0000182 from the cell ontology. Other modifiers of proliferation are a change entity increase, which is not linked to any code, and an anatomy entity liver, which is linked to a code MP:0000598 corresponding to the mammalian phenotype anatomy. In addition, a code MP:0000351 corresponding to increased cell proliferation has been specified, which is associated with the proliferation structure. This code is the most specific code found by BioMedLEE for the process structure. Note that the anatomy tag includes nested modifiers. Thus, organism, whose value is mouse, modifies liver; similarly, newborn, which corresponds to temporal information, modifies mouse, as does the gene product C/EBPalpha, which is a type called mutg because it has a required genetic descriptor modifier knockout.
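The nesting described for the sample sentence can be made concrete with a short Python sketch. The XML string below is a reconstruction from the prose only, not a copy of Figure 2: the tag `genetic_descriptor`, the use of a `v` attribute on every element and the placement of `code` as a child element are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the structure described in the text.
SAMPLE = """
<process v="proliferation">
  <cell v="hepatocyte"><code v="CL:0000182"/></cell>
  <change v="increase"/>
  <anatomy v="liver">
    <code v="MP:0000598"/>
    <organism v="mouse">
      <temporal v="newborn"/>
      <mutg v="C/EBPalpha"><genetic_descriptor v="knockout"/></mutg>
    </organism>
  </anatomy>
  <code v="MP:0000351"/>
</process>
"""


def walk(element, depth=0):
    """Yield (depth, type, value, codes) for an element and its nested modifiers."""
    # Codes attached directly to this element (not to its modifiers).
    codes = [c.get("v") for c in element.findall("code")]
    yield depth, element.tag, element.get("v"), codes
    for child in element:
        if child.tag != "code":
            yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE)
rows = list(walk(root))
for depth, tag, value, codes in rows:
    print("  " * depth + f"{tag}={value} {codes}")
```

Walking the tree depth-first keeps each modifier next to the entity it qualifies, so a query can read off, for example, that newborn qualifies mouse rather than liver.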
3.2 Summary of PGschema features
There are several features of PGschema that are important to note. The schema allows for some redundancies in representations in order to accommodate natural language. The issue of focus or different viewpoints arises frequently in natural language because the expressiveness of language incorporates such flexibility. Since our schema is based on relations as expressed in natural language, it has entity-modifier combinations where the entity and modifier types can be reversed. For example, according to Table 1, a process, which is a PO (phenotypic observation), can be qualified by another PO. Thus, when representing the information abnormal development, development, which is a process, could be considered the primary entity, and abnormal, which is a problem, its qualifier. However, in abnormal in development, the primary entity could be considered the problem abnormal and the qualifier development. There are two ways that this redundancy could be handled subsequent to natural language processing. One way would be to allow the redundancy to remain as is, and to formulate queries that retrieve the structured output so that the queries account for the different possible combinations. Another way would be to write transformations that map the NLP output to a uniform representation or to one that conforms to another ontology. For example, another representational system, such as PAtO, may view the appropriate representation of hepatocyte proliferation differently than the one shown in Figure 2. In the PAtO representation, the primary observation would be hepatocyte and the qualifier proliferation. In addition, the view where cell is modified by liver, which in turn is modified by mouse, may also be considered incorrectly represented based on world-knowledge. By transforming the XML structure appropriately, the appropriate view could be obtained allowing for modifiers spanning multiple scales of phenotypic expression (e.g. cellular, tissular, organism/species, etc.). 
While PAtO is designed to provide a biologically organized view of phenotypic and genetic information, it is not designed to support the representation required to organize the information contained in complex sentences, which translates in PGschema as including highly nested and multiple textual modifiers that span many scales of biology.
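The second option mentioned above, a transformation that maps the NLP output to a different view, can be sketched as follows. This is only an illustration of reversing an entity-modifier pair; it does not implement the PAtO format, and the tag and attribute names are the same hypothetical ones used for illustration, not taken from the PGschema DTD.

```python
import copy
import xml.etree.ElementTree as ET


def promote(root, modifier_tag):
    """Return a new tree in which the first direct modifier of type
    `modifier_tag` becomes the primary entity and the former primary
    entity becomes its qualifier. The input tree is left unchanged."""
    old_primary = copy.deepcopy(root)
    modifier = old_primary.find(modifier_tag)
    if modifier is None:
        return old_primary  # nothing to promote
    old_primary.remove(modifier)
    modifier.append(old_primary)
    return modifier


source = ET.fromstring(
    '<process v="proliferation">'
    '<cell v="hepatocyte"/><change v="increase"/>'
    "</process>"
)
view = promote(source, "cell")
print(ET.tostring(view, encoding="unicode"))
```

Because the function works on a copy, both views remain available: the original one, centered on the process, and the derived one, centered on the cell.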
Another feature of our schema is that it is permissive and allows combinations that may be unlikely, such as embryonic diabetes mellitus. The purpose of PGschema by design is to represent general compositional and relational aspects of various types of phenotypic and genetic information in the text rather than to represent specific knowledge concerning individual concepts. PGschema therefore provides a semantic bridge between language and ontological concepts, and enables further computations associated with biological or pragmatic considerations. For example, with further processing, illegitimate combinations may be filtered out, or the representational structure could be simplified to the appropriate granularity for an application. Thus, while an ontology may not permit a concept such as embryonic diabetes mellitus, in PGschema, it would be permissible in general for disease type information to be qualified by temporal type information. A third feature of PGschema is that external ontologies are currently represented as identifiers, but are not integral to the model. It may be advantageous in the future to link them directly to the source ontologies using URLs, a method that would be in keeping with the semantic Web. This approach will be explored in future studies.
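The further processing mentioned above, in which unlikely combinations are filtered out after extraction, might look like the following sketch. The `Term` type, the blocklist and its single entry are hypothetical; in practice the constraints would come from an ontology or from curated domain knowledge rather than from a hard-coded set.

```python
from typing import NamedTuple


class Term(NamedTuple):
    type: str   # semantic type, e.g. "disease" or "temporal"
    value: str  # textual term


# Hypothetical blocklist of (entity, modifier) pairs judged implausible.
DISALLOWED = {
    (Term("disease", "diabetes mellitus"), Term("temporal", "embryonic")),
}


def filter_pairs(pairs, disallowed=DISALLOWED):
    """Keep only the (entity, modifier) pairs that are not blocklisted."""
    return [pair for pair in pairs if pair not in disallowed]


extracted = [
    (Term("disease", "diabetes mellitus"), Term("temporal", "embryonic")),
    (Term("disease", "hypertension"), Term("degree", "moderate")),
]
kept = filter_pairs(extracted)
print(kept)
```

The schema itself stays permissive: the disease-temporal combination is still representable, and only this downstream filter decides which instances an application accepts.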
Some information in text is not represented in PGschema by design in an effort to balance completeness versus efficiency. If the information was highly detailed because it modified a modifier and we did not foresee that it would be useful for an application, it was omitted from PGschema in order to avoid unnecessarily increasing the complexity of the corresponding NLP system or incurring additional complexity to the applications that would be required to query the output. For example, the amount of change associated with an entity is not captured, although the change itself is (e.g. 5% in increased by 5% is not captured), and certain modifiers of measures are not represented (e.g. approximately in approximately 5%).
3.3 Evaluation
Table 3 shows the results of the quantitative analysis for coverage of a random sample of selected entity types in PGschema. The evaluation required 6 weeks, which included training followed by manual curation of the test dataset. Column 1 shows the type of entity; column 2 shows the percent of modifiers that were judged to be covered in PGschema followed by the total number of modifiers found to correspond to that type (in parentheses). The average coverage for all the types combined is 92% (CI: 88–95%). In conformity with statistical sampling theory, the 95% confidence interval provides a reliable estimate of the coverage of PGschema over the complete set of 16 851 PubMed abstracts. Because we utilized, by design, a mouse dataset for training and a human dataset for testing, our precision might have been higher had we performed the evaluation using a mouse dataset. Careful examination of the manual evaluation and automated analysis revealed that human errors during expert evaluation affected our results. Some inconsistency in manual analysis arose because of the complexity of the evaluation process itself. Our experience demonstrated that the evaluation task required expertise in four disciplines: linguistics, biology, medicine and ontology, and thus was very difficult and time-consuming. In the future, by making a great effort to identify multiple experts with adequate background preparation, and by carefully training them according to guidelines, we will conduct more extensive evaluations in our continued development of this technology.
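For readers who want to see how a confidence interval for a coverage proportion can be computed, the following Python sketch uses the Wilson score interval. The article does not state which interval method was used, and the counts below are hypothetical placeholders, not the study's data.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical counts: 92 of 100 sampled modifiers judged to be covered.
low, high = wilson_interval(92, 100)
print(f"coverage 92%, 95% CI: {low:.1%} to {high:.1%}")
```

The Wilson interval is asymmetric around the observed proportion when that proportion is close to 0 or 1, which is the typical situation for a high coverage rate.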
One type of difficulty occurred because the expert had to determine the semantic categories of the terms that modified the entity being evaluated. The semantic classes of many of the modifier terms were clear cut, such as limb, mouse, hepatocyte and tumor. But the semantic types of certain terms (e.g. direct, specific) were vague and difficult to ascertain, and this occasionally led to errors in the manual analysis. The second type of difficulty was encountered because the expert performing the manual analysis had to determine whether a multi-word term was compositional in meaning and thus consisted of an entity and modifiers, or whether it was an atomic unit. This required knowledge of both biology and medicine. For example, essential hypertension should be considered an atomic unit in medicine, and not as denoting hypertension with a modifier essential. In contrast, moderate hypertension should be considered to be compositional, and therefore as an entity hypertension with a degree type of modifier moderate. The third type of difficulty involved the ambiguity in determining which entities were the modifiers and which were being modified, as different interpretations or viewpoints were allowed. The lowest coverage, which was in the biophysiological abnormality entity type, was primarily owing to a curation issue and occurred when the head noun was not the entity type being evaluated. As a result, the modifier was inappropriately determined to modify the abnormality entity type instead of the head noun. For example, in A172 glioblastoma cell line, A172 was judged by the curator to modify glioblastoma but it actually modified cell line. According to linguistics, the focus of a noun phrase is typically the head noun, and adjuncts typically modify the head noun. Thus, when analyzing A172 glioblastoma cell line, cell line should be considered the observed entity, and its modifiers the name A172 and the abnormality glioblastoma. However, there are exceptions. For example, in adenocarcinoma occurrences, the head noun, occurrences, is not the observed entity but modifies adenocarcinoma. We will have to analyze the different situations and revise the guidelines accordingly.
In the qualitative analysis of the five randomly selected abstracts, there were a total of 1115 words. The most frequent occurrences were genes or gene products (113), functions (70), sequence or structural modifiers (28), cell components (17) and biophysiological abnormalities (17). There were only two occurrences of information not covered in that set. One was a description of the relative size of a protein (e.g. shorter protein) and the other was information stating that the experiment was in vivo. PGschema was designed for genotype and phenotype associations, but we noted other types of information in the abstract, such as methods used (e.g. differential hybridization), and information concerning an interpretation or level of understanding of the underlying mechanisms (e.g. our understanding is limited).
PGschema significantly enhanced the development of the BioMedLEE NLP system in several ways. First, by specifying the entity types, corresponding modifiers and relations, it determined the types of information that exist in text, and therefore that need to be recognized and potentially extracted. Second, the entity types, corresponding modifiers and relations determine the elements of the language patterns (although not the actual patterns) that the system should handle and therefore be trained for. Last, the schema specifies the output format that the NLP system should generate. PGschema was the basis for the NLP component of the PhenoGO application because queries based on PGschema were written to obtain the appropriate information. For example, queries were written to retrieve relations between biomolecular entities, functions and cellular locations as well as anatomical locations. For the PhenoGO application, the high performance (e.g. 92% precision and 91% recall) would not have been possible if the relations between genes, functions and anatomical locations were not accurately represented. In addition, the translation by BioMedLEE of the abstract into the structured representation occurred in a reasonable time. It took 11.5 s on a Sun Blade 2000 workstation with 2 GB RAM and dual 1.05 GHz CPUs to process an abstract. The workstation was running other processes, and therefore with a dedicated machine a reduced running time would be likely. Although PGschema itself is complex, queries needed for a particular application are unlikely to utilize the full complexity because generally not all the modifiers or relations are required. For example, for PhenoGO, many of the modifiers of genes, cells and anatomical entities were not needed, although the relations between the entities and some of the modifiers were critical.
We evaluated the precision of PGschema within its design, which covers a broad range of gene–phenotype entities and relationships. Additional and perhaps less common entities and relationships may exist and require an extension of PGschema. A larger-scale determination of how much of the overall genotypic–phenotypic information in abstracts is covered by PGschema will be pursued in future work. In future work we also plan further refinement and evaluation of PGschema, as well as expansion of PGschema with environmental factors to complete the representation of the processes of genetic information transmission and expression as conceived in the central dogma of molecular biology. For example, narratives of pharmacogenomic studies require an understanding of phenotypes, genotypes and the environment (e.g. medication used) as well. Very few genetic or model organism databases provide environmental conditions with their genetic–phenotypic models; however, the scientific literature contains abundant instances of gene–phenotype associations and their environmental conditions. Translational schemas between NLP and ontologies, such as the one proposed, could enable mining of the scientific literature to augment genetic and model organism databases with computable knowledge of related environmental conditions.
In summary, the evaluation demonstrated a high level of precision for the designed coverage of PGschema for the phenotypic and genetic entities that were studied, as well as their closely related concepts, found in complex texts and provided an estimate of the coverage for the MEDLINE abstracts associated with the GO annotation databases. More importantly, PGschema not only represents phenotypic and genetic entities and their corresponding ontological codes, but also represents modifiers of the entities as well as the relations between the entities so that more complete information can be retained and subsequently reused.
| 4 CONCLUSIONS |
|---|
We have developed a novel informational schema called PGschema, which is capable of representing phenotypic and genotypic entities, modifiers, relationships and their closely related concepts as found in scientific journals. PGschema is unique in several ways: (1) it can be realized automatically using NLP techniques (Lussier et al., 2006), (2) it bridges the gap between language and ontologies; while it provides compositional expressiveness similar to that found in natural language, it also links to formal ontologies, which are required for reasoning and specification of external declarative and world knowledge, (3) it is in the form of XML, which is textual and easy to read and (4) it connects diverse biological scales of information. Manual expert evaluation demonstrated a high rate of coverage for the entities analyzed, and revealed the importance of implementing guidelines for the evaluation. Evaluation is a complex and labor-intensive task, which requires experts to have expertise in biology, medicine, linguistics and knowledge representation.
Rapid technological improvements of biomedical ontologies and natural language processing should lead to a profound transformation in the reuse of heterogeneous narrative information when it occurs in the form of curated and highly computational knowledge stored in specialized biomedical databases. Thus, the proposed schema should result in accelerated reuse of phenotypic and genetic knowledge by grouping and organizing the output of natural language processing systems into biologically highly computable and biologically relevant semantic types. To our knowledge, this is the first translational schema between NLP data structures and genetic or model organism databases. Moreover, technological standardization of declarative knowledge and the semantic Web have profoundly accelerated the development cycles in computational semantics, resulting in ontology-anchored databases that could be automatically transformed with a common expressive information schema, such as the one proposed. As the gap between linguistic and declarative knowledge is bridged with highly expressive and computable information schemas, such schemas are poised to produce a paradigm shift. Indeed, comprehensive information models are likely to enable rapid large-scale computational analyses of unprecedented volumes of fine-grained information and knowledge.
| Acknowledgments |
|---|
This work was supported in part by grants R01 LM07659, R01 LM08635, 1K22 LM008308-01 and 1U54CA121852-01A1. Funding to pay the Open Access publication charges was provided by the National Library of Medicine Grants R01 LM007659 and R01 LM008635.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
The authors wish it to be known that, in their opinion, the first and last authors should be regarded as joint First Authors. Associate Editor: Martin Bishop
Received on May 15, 2006; revised on July 20, 2006; accepted on July 21, 2006
| REFERENCES |
|---|
Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
Bairoch, A., et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.
Baker, P.G., et al. (1999) An ontology for bioinformatics applications. Bioinformatics, 15, 510–520.
Bard, J., et al. (2005) An ontology for cell types. Genome Biol., 6, R21.
Blake, J.A. (2004) Bio-ontologies-fast and furious. Nat. Biotechnol., 22, 773–774.
Blake, J.A., et al. (2003) MGD: The Mouse Genome Database. Nucleic Acids Res., 31, 193–195.
Bodenreider, O. (2004) The Unified Medical Language System (UMLS), integrating biomedical terminology. Nucleic Acids Res., 32, D267–D270.
Campbell, K., et al. (1994) A logical foundation for representation of clinical data. J. Am. Med. Inf. Assoc., 1, 218–232.
Cantor, M.N., et al. (2005) Genestrace: phenomic knowledge discovery via structured terminology. Pac. Symp. Biocomput., 103–114.
Chute, C.G., et al. (1999) Desiderata for a clinical terminology server. Proc. AMIA Symp., 42–46.
Cohen, A.M., et al. (2005) A survey of current work in biomedical text mining. Brief. Bioinform., 6, 57–71.
Craven, M., et al. (1999) Constructing biological knowledge bases by extracting information from text sources. Proc. Int. Conf. Intell. Syst. Mol. Biol., 77–86.
Evans, D.A., et al. (1994) Toward a medical concept representation language. J. Am. Med. Inf. Assoc., 1, 207–217.
Evsikov, A.V., et al. (2004) Systems biology of the 2-cell mouse embryo. Cytogenet. Genome Res., 105, 240–250.
Friedman, C., et al. (1994) A general natural language text processor for clinical radiology. J. Am. Med. Inf. Assoc., 1, 161–174.
Friedman, C., et al. (1999) Representing information in patient reports using NLP and the extensible markup language. J. Am. Med. Inf. Assoc., 6, 76–87.
Gardner, S.P. (2005) Ontologies and semantic data integration. Drug Discov. Today, 10, 1001–1007.
Gkoutos, G.V., et al. (2004) Building mouse phenotype ontologies. Pac. Symp. Biocomput., 178–189.
Gopalacharyulu, P.V., et al. (2005) Data integration and visualization system for enabling conceptual biology. Bioinformatics, 21, Suppl. 1, i177–i185.
Hirschman, L., et al. (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6, Suppl. 1, S1.
Hripcsak, G., et al. (1995) Unlocking clinical data from narrative reports. Ann. of Int. Med., 122, 681–688.
Knirsch, C.A., et al. (1998) Respiratory isolation of tuberculosis patients using clinical guidelines and an automated decision support system. Infect. Control Hospital Epidemiol., 19, 94–100.
Kim, J.D., et al. (2003) GENIA corpus - semantically annotated corpus for bio-textmining. Bioinformatics, 19, Suppl. 1, i180–i182.
Lindberg, D.A.B., et al. (1993) The Unified Medical Language System. Meth. Inform. Med., 32, 281–291.
Lussier, Y.A., et al. (2006) PhenoGO: Assigning phenotypic context to gene ontology annotations with natural language processing. Pac. Symp. Biocomp., 64–75.
Mungall, C.J. (2004) Obol: integrating language and meaning in bio-ontologies. Comp. Funct. Genom., 5, 509–520.
Narayanaswamy, M., et al. (2005) Beyond the clause: extraction of phosphorylation information from medline abstracts. Bioinformatics, 21, Suppl. 1, i319–i327.
Pennisi, E. (2005) How will big pictures emerge from a sea of biological data? Science, 309, 94.
Rector, A.L., et al. (1995) Medical-concept models and medical records: an approach based on GALEN. J. Am. Med. Inform. Assoc., 2, 19–35.
Rzhetsky, A., et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J. Biomed. Inf., 37, 43–53.
Rodrigues, J.M., et al. (1997) Galen-In-Use: an EU Project applied to the development of a new national coding system for surgical procedures: NCAM. Stud. Health Technol. Inform., 43, 897–901.
Rosse, C., et al. (1998) The digital anatomist foundational model: principles for defining and structuring its concept domain. Proc. AMIA Symp., 820–824.
Smith, B., et al. (2005a) Relations in biomedical ontologies. Genome Biol., 6, R46.15.
Smith, C.L., et al. (2005b) The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol., 6, R7.
Spackman, K.A. (2004) SNOMED CT milestones: endorsements are added to already-impressive standards credentials. Health Inform., 21, 54–56.
Stevens, R., et al. (2002) Building a bioinformatics ontology using OIL. IEEE Trans. Inf. Technol. Biomed., 6, 131–141.
Walpole, R.E. and Myers, R.H. (1978) Probability and Statistics for Engineers and Scientists, 2nd edn. Macmillan, pp. 210.
Wheeler, D.L., et al. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 33, D39–D45.