Bioinformatics Advance Access originally published online on January 21, 2006
Bioinformatics 2006 22(6):665-670; doi:10.1093/bioinformatics/btl010
Automatic extension of Gene Ontology with flexible identification of candidate terms
Computer Science Division and AITrc, KAIST 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, South Korea
*To whom correspondence should be addressed.
ABSTRACT
Motivation: Gene Ontology (GO) has been manually developed to provide a controlled vocabulary for gene product attributes. It continues to evolve with new concepts that are compiled mostly from existing concepts in a compositional way. Given the relatively slow growth rate of GO in the face of the fast accumulation of biological data, it is highly desirable to provide an automatic means for predicting new concepts from the existing ones.
Results: We present a novel method that predicts more detailed concepts by utilizing syntactic relations among the existing concepts. We propose a validation measure for the automatically predicted concepts by matching the concepts to biomedical articles. We also suggest how to find a suitable direction for the extension of a constantly growing ontology such as GO.
Availability: http://autogo.biopathway.org
Contact: park{at}nlp.kaist.ac.kr
Supplementary information: Supplementary materials are available at Bioinformatics online.
INTRODUCTION
With the exceedingly fast accumulation of biological data, it has become essential to provide users with convenient access to such data through various biological databases, such as FlyBase (The FlyBase Consortium, 2003), MGD (Blake et al., 2003), TAIR (Rhee et al., 2003), SGD (Christie et al., 2004) and UniProt (Apweiler et al., 2004). Since the terms used in these databases are inherently diverse, researchers in the biological domain have developed a common terminology of biology, often in the form of ontologies such as Gene Ontology (GO) (The Gene Ontology Consortium, 2004), for access to such databases. However, the growth rate of such an ontology is rather slow, partly due to the burden on both developers and curators of manually incorporating new pieces of information, despite the help of convenient computational tools such as DAG-Edit (http://www.godatabase.org/dev/java/dagedit/docs) and AmiGO (http://godatabase.org). To address this problem, we present a system that automatically extends and validates a standard biological ontology (namely, GO), using existing resources.
Previous methods for ontology extension usually work by first collecting candidate terms from the literature and then either locating them in the existing hierarchical structures of ontologies or clustering them into a hierarchy. Based on the observation that many ontological terms are noun phrases with an expected frequency, Witschel (2005) presented a method that identifies noun phrases with a regular grammar, selects candidate terms based on their frequencies and locates them in a term hierarchy by utilizing co-occurrence features from large corpora. Cimiano and Staab (2005) presented a method that clusters the candidate terms into a hierarchy by identifying evidence for hypernymy relations among them in the literature. Since the biology domain has many well-known ontologies, however, there is clearly a need for expanding the ontologies by adding more specific terms as hyponyms of existing terms. Of course, methods such as that of Witschel (2005) can be used to add those specific terms, but they have trouble generating standard terms and locating the candidate terms at appropriate positions.
Instead, we present a novel method that generates candidate terms from existing terms by utilizing relations among the existing terms. For example, the hypernymy relation between the GO terms chemokine binding and C-C chemokine binding also suggests a hypernymy relation between their subterms chemokine and C-C chemokine, and this relation between the subterms can be used to generate candidate terms from other GO terms that also include chemokine. We utilize this method to extend GO automatically as a proof of concept for the automatic extension of an ontology.
GO is a well-known biological ontology with a controlled vocabulary, usually working as a semantic model for the annotation of gene products (i.e. RNAs or proteins from gene expressions) with its controlled terms. Many researchers have presented work that utilizes the biomedical literature for automatic ontological annotation of gene products: Blaschke and Valencia (2002) with a clustering model for statistically extracted information, Raychaudhuri et al. (2002) with a maximum entropy model, Chiang and Yu (2003) with an automatic construction of patterns that represent syntactic relationships between gene products and GO terms, Kim and Park (2004) with a dependency parser for identifying syntactic relationships between ontological terms and protein names, and Koike et al. (2005) with a shallow parser for analyzing ACTOR and OBJECT relationships. In particular, BioCreAtIvE (Hirschman et al., 2005) provided an evaluation task for identifying text passages that support annotations of gene products with GO concepts. The methods for ontological annotation can be applied to validating term extension in that they provide relevant gene products for the candidate terms. Such methods generally deal with two distinguishable subtasks: recognizing terms in the literature and associating them with gene products. In this paper, we focus on term identification that weakly validates candidate terms by looking for their occurrences in the literature. In particular, we do not consider the issue of associating them with gene products, because the focus of this paper is only on the extension of GO with novel concepts.
As for term identification, the participants at BioCreAtIvE presented their own methods for the identification of GO concepts in the literature, e.g. Couto et al. (2005) with an unsupervised method utilizing evidence content of ontological terms, Ehrler et al. (2005) with a sentence classifier computing distances between sentences and GO concepts, and Krallinger et al. (2005) with a rule-based method producing word-level variations of GO terms. However, these methods still have room for further improvement, in the sense that they only deal with occurrences of GO terms within sentence boundaries and that they do not seriously consider relations among the component words of GO terms.
The task of term identification from the literature rather requires dealing with all the possible linguistic variations of terms, including morphological, syntactic and discourse-related variations, synonyms, hyponyms and hypernyms. Consider sentence (1) below.
- (1) While lactate dehydrogenase, alkaline phosphatase and glucose-6-phosphate dehydrogenase exhibit a relatively low affinity to the carrier, alcohol dehydrogenase, glutamate dehydrogenase and urease were found to form stabile complexes with the polymer that are enzymatically active. [PMID:1022129; The relevant words are italicized for emphasis.]
Among the three subontologies of GO (i.e. molecular function, cellular component and biological process), biological process appears to grow faster and more steadily than the other two, when we consider the average growth rate.1 These different growth rates among the three subontologies come from the fact that terms in biological process can be induced by composition of internal information of GO, while extensions of both molecular function and cellular component are usually made as a result of direct interactions with information from external resources. Therefore, we have paid more attention to biological process in order to justify the need for an automatic extension of GO.
In this paper, we present novel methods to automatically generate candidate concepts from existing GO concepts and to recognize the candidate concepts in the biomedical literature. We show experimental results of the methods and suggest how to find a suitable direction for the possible extension of GO.
METHODS
We present a system for the automatic extension of GO and its validation. This system works in two steps: (1) The first step generates more detailed concepts from existing GO concepts by utilizing syntactic relations among the existing concepts and (2) the second step validates the predicted concepts by consulting them in the literature.
Extension
As shown in Figure 1, the relations between existing GO concepts can be inferred from relations between subconcepts of the GO concepts. For example, the hyponymy relation between two concepts chemokine binding and C-C chemokine binding can be inferred from the hyponymy relation between the subconcepts chemokine and C-C chemokine. We call such subconcepts of GO concepts concept units (CUs). In a relation between two CUs, we call the general one between the two CUs the upper CU and the specific one the lower CU. If an existing GO concept includes upper CUs, we can generate candidate concepts from the existing GO concept, called source GO concept, by replacing the upper CUs with their corresponding lower CUs.
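The substitution described above can be illustrated with a minimal Python sketch. It is not the published implementation: the function names induce_cu_pair and substitute_cu are hypothetical, the suffix-stripping heuristic for finding CUs is an assumption made for this example, and the source term chemokine receptor activity is used only as an illustrative string.

```python
import re


def induce_cu_pair(parent_term, child_term):
    """Strip the shared trailing words of a parent/child term pair.

    Assumption for this sketch: at least one word is kept on each side,
    so 'chemokine binding' vs 'C-C chemokine binding' yields the pair
    ('chemokine', 'C-C chemokine') as (upper CU, lower CU).
    """
    p, c = parent_term.split(), child_term.split()
    k = 0
    while k < len(p) - 1 and k < len(c) - 1 and p[-1 - k] == c[-1 - k]:
        k += 1
    return " ".join(p[:len(p) - k]), " ".join(c[:len(c) - k])


def substitute_cu(source_term, upper_cu, lower_cu):
    """Replace the upper CU in a source term by the lower CU, or return None."""
    if lower_cu in source_term:
        return None  # the source term is already as specific as the lower CU
    # Match the upper CU only as a whole word (hyphens count as word characters).
    pattern = r"(?<![\w-])" + re.escape(upper_cu) + r"(?![\w-])"
    if re.search(pattern, source_term) is None:
        return None
    return re.sub(pattern, lambda _m: lower_cu, source_term, count=1)


upper, lower = induce_cu_pair("chemokine binding", "C-C chemokine binding")
candidate = substitute_cu("chemokine receptor activity", upper, lower)
print(upper, "->", lower)
print(candidate)
```

Applied without any restriction, such a substitution would fire on every term that contains the upper CU, which is the overgeneration problem that the contextual rules address.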
However, the relations between CUs cannot be straightforwardly applied to all GO concepts due to the overgeneration of a massive number of meaningless candidate concepts. To overcome this problem, we define meta-level rules that encode the surrounding context of relevant CUs, thus providing them as a contextual restriction for the CUs, as exemplified in Table 1. The variable X in a rule indicates the upper CU, and the variable Y indicates the corresponding lower CU. In this case, the CU of the variable Y can be added to GO concepts that match the side on the left of the arrow in the rule. The rules are context sensitive, in the sense that the words (e.g. a, b) work as contextual restriction.
The system has automatically constructed such rules, which encode relations between CUs, where such relations can be induced from hierarchical relations among GO concepts. The system generates specific concepts from the source GO concepts by utilizing these rules.
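Since Table 1 is not reproduced here, the exact rule format is unknown. The following Python sketch therefore assumes a hypothetical representation in which a rule stores the words required immediately to the left and right of the upper CU (X) and the lower CU (Y) that replaces it. The class ContextRule, the function apply_rule and the example terms are illustrative assumptions, not the system's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ContextRule:
    left: Tuple[str, ...]   # context words required directly before X
    upper: str              # X, the upper CU
    lower: str              # Y, the lower CU
    right: Tuple[str, ...]  # context words required directly after X


def apply_rule(rule: ContextRule, term: str) -> Optional[str]:
    """Return the candidate term if the rule's left-hand side matches, else None."""
    if rule.lower in term:
        return None  # the term already contains the lower CU
    words = term.split()
    lhs = list(rule.left) + rule.upper.split() + list(rule.right)
    rhs = list(rule.left) + rule.lower.split() + list(rule.right)
    for i in range(len(words) - len(lhs) + 1):
        if words[i:i + len(lhs)] == lhs:
            return " ".join(words[:i] + rhs + words[i + len(lhs):])
    return None


# Hypothetical rule: "X binding -> Y binding" with X = chemokine, Y = C-C chemokine.
rule = ContextRule(left=(), upper="chemokine", lower="C-C chemokine", right=("binding",))
print(apply_rule(rule, "regulation of chemokine binding"))
print(apply_rule(rule, "chemokine receptor activity"))
```

In this sketch the second call returns None because the required context word binding does not directly follow chemokine, which is how a contextual restriction can suppress a meaningless candidate.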
Validation
Our system tries to validate the candidate terms by looking for them in the biomedical literature. Since the candidate terms are more complex than the source GO terms, we cannot expect them to appear in the literature as they are. Instead, there may be many levels of variations.
The system recognizes word-level variations by utilizing hand-made rules that deal with morphology (e.g. protein/proteins, cytosol/cytosolic, transcriptional activator activity/trans-activate), and by consulting a manually constructed resource both for semantic similarity (e.g. negative regulation/down-regulate, cell movement/cell transmigration, transporter/carrier, protein kinase/phosphorylation) and for semantic hierarchy (e.g. metabolism/hydrolase, transport/permease).
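As a rough illustration of word-level matching, the Python sketch below combines a few toy suffix rules with a tiny synonym table. Both SUFFIX_RULES and SYNONYMS are invented for this example and are far smaller and cruder than the hand-made rules and the manually constructed resource described above.

```python
import re

# Toy morphological rules (assumptions for illustration only).
SUFFIX_RULES = [(r"ies$", "y"), (r"ic$", ""), (r"s$", "")]

# Toy synonym table mapping a variant to a canonical form.
SYNONYMS = {"carrier": "transporter"}


def canonical(word):
    """Lower-case a word, strip at most one suffix, then apply the synonym table."""
    w = word.lower()
    for pattern, replacement in SUFFIX_RULES:
        if len(w) > 4 and re.search(pattern, w):
            w = re.sub(pattern, replacement, w)
            break
    return SYNONYMS.get(w, w)


def same_word(a, b):
    """Two words match if they share a canonical form."""
    return canonical(a) == canonical(b)


print(same_word("proteins", "protein"))
print(same_word("cytosolic", "cytosol"))
print(same_word("carriers", "transporter"))
```

Hierarchical relations such as metabolism/hydrolase would need a separate hyponym lookup and are not covered by this sketch.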
A syntactic variation of a GO term is a certain combination of component words of the GO term, where the component words are syntactically related to each other within a sentence according to their syntactic relations within the GO term (cf. Jacquemin, 1999). Consider the following sentences:
- (2) The E2A protein E47 is known to be involved in the regulation of tissue-specific gene expression and cell differentiation. (PMID:10781029)
- (3) To identify molecules regulating this interaction, we generated FDC-staining monoclonal antibodies (mAbs) and screened them for their ability to block FDC-mediated costimulation of growth and differentiation of CD40-stimulated B cells. (PMID:10727470)
This approach that utilizes syntactic dependencies has the advantage of analyzing relations among component words of ontological terms, compared to the bag-of-words approach. Consider the following sentence:
- (4) Ligation of retinoic acid receptor alpha regulates negative selection of thymocytes by inhibiting both DNA binding of nur77 and synthesis of bim. (PMID:12646620)
To identify the syntactic dependencies among the component words in a GO term, our system employs a dependency parser that identifies the dependency structure among the words in the sentence (Kim and Park, 2004). The parser is implemented in a combinatory categorial grammar (CCG) framework (Steedman, 2000). In the presence of a given candidate term for a sentence, the parser does not analyze base noun phrases in the sentence that do not include any component words of the term in order to improve efficiency in parsing, where a base noun phrase is a noun phrase that consists only of nouns, determiners, and adjectives.
Note that the syntactic dependencies among the component words of a GO term are frequently scattered across multiple sentences. Consider the following paragraph in a MEDLINE abstract:
- (5) Spindle elongation is crucial to normal chromosome separation in eukaryotes; ... We have characterized male meiotic spindle lengths in wild-type and the ask1-1 mutant plants. (PMID:11402192)
We offer two methods for the validation of candidate terms. The first method (named sentence validation) identifies the dependency structure of a sentence with our dependency parser, and then checks if the component words of a candidate term show syntactic dependencies in the sentential structure according to their syntactic relations within the term. For example, the system identifies the syntactic dependencies among block, cell and growth in sentence (3) and verifies that the dependencies in the sentence are identical to those in the GO term regulation of cell growth, where block is a hyponym of regulation.
The second method (named abstract validation) also identifies the dependency structures of sentences in an abstract, cross-links such structures by connecting component words of each candidate term, and then checks if the component words of the candidate term show syntactic dependencies in the cross-linked structures. Notice that the first method deals with syntactic variations, while the second method also deals with discourse variations.
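The difference between the two validation methods can be sketched in Python under strong simplifying assumptions. Dependency structures are given as hand-written (head, dependent) word pairs rather than produced by the CCG-based parser, a term dependency counts as satisfied when a directed path links the two words, and word-level variation is reduced to a small mapping named canon. All function names and the toy edges are hypothetical.

```python
from collections import defaultdict


def has_path(edges, head, dependent):
    """True if a directed path of dependency edges leads from head to dependent."""
    graph = defaultdict(set)
    for h, d in edges:
        graph[h].add(d)
    stack, seen = [head], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for nxt in graph[node]:
            if nxt == dependent:
                return True
            stack.append(nxt)
    return False


def covers(edges, term_edges, canon):
    """Check every dependency of the term against the (normalized) edges."""
    mapped = [(canon.get(h, h), canon.get(d, d)) for h, d in edges]
    return all(has_path(mapped, h, d) for h, d in term_edges)


def sentence_validation(sentences, term_edges, canon):
    """Some single sentence must contain all dependencies of the term."""
    return any(covers(s, term_edges, canon) for s in sentences)


def abstract_validation(sentences, term_edges, canon):
    """Merge all sentences; shared (normalized) words cross-link their structures."""
    merged = [edge for s in sentences for edge in s]
    return covers(merged, term_edges, canon)


# Term "regulation of cell growth" as toy dependencies.
term_edges = [("regulation", "growth"), ("growth", "cell")]
canon = {"block": "regulation", "cells": "cell"}

# Hand-written approximation of the relevant part of sentence (3).
sentence_3 = [("block", "costimulation"), ("costimulation", "growth"), ("growth", "cells")]
# Invented abstract in which the dependencies are split over two sentences.
split_abstract = [[("regulation", "growth")], [("growth", "cell")]]

print(sentence_validation([sentence_3], term_edges, canon))
print(sentence_validation(split_abstract, term_edges, canon))
print(abstract_validation(split_abstract, term_edges, canon))
```

In the split example no single sentence satisfies the term, but the merged structure does, which corresponds to the discourse-level variation that only abstract validation handles.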
EXPERIMENTAL RESULTS
The implemented system generated 18 964 candidate concepts from 8768 GO concepts in the version of June 2004.2 The system utilized 11 286 automatically induced rules for GO extension. Table 2 shows sample candidate terms generated by our system. The concepts in bold face already exist in the version of June 2004, while the others are newly generated by the system. The current version of November 2005 includes all the newly generated concepts except the concept regulation of imaginal disc morphogenesis. The number in parentheses indicates the number of gene products that are assigned to the corresponding GO concept in the current version.
We compared the generated candidate concepts with the 9692 GO concepts in the version of June 2005, released exactly one year later. Figure 3 shows a diagram with three circles that correspond to the three groups: (1) GO concepts in the old version (i.e. GO in June 2004), (2) GO concepts in the new version (i.e. GO in June 2005) and (3) candidate concepts generated by our system (i.e. Extended GO). The region A in the diagram indicates the area for the candidate concepts that are not included in GO yet. The region B includes the GO concepts that were excluded in the new version. The region C includes the GO concepts that are successfully predicted by our system. The region D indicates the area for the GO concepts that are not predicted by our system. We found that the 55 candidate concepts in the region C were included in the region D (55/1594, 3.5%), which indicate the GO concepts that were newly added to the new version of GO, but were not included in the old version. Note that none of the new concepts in Table 2 that are included in the current version of November 2005 were included in the (new) version of June 2005.
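The comparison among the old version, the new version and the candidate concepts amounts to a few set operations. The Python sketch below uses invented placeholder identifiers (t1 to t6) instead of real GO concepts, and the descriptive set names are assumptions rather than the region labels of Figure 3. Only the final ratio uses numbers reported above.

```python
# Placeholder concept identifiers (not real GO concepts).
old_version = {"t1", "t2", "t3"}
new_version = {"t2", "t3", "t4", "t5"}
candidates = {"t4", "t6"}

newly_added = new_version - old_version   # concepts added during the interval
removed = old_version - new_version       # concepts dropped from the new version
predicted = candidates & newly_added      # candidates confirmed by the new version
pending = candidates - new_version        # candidates not (yet) in GO
missed = newly_added - candidates         # new concepts the system did not propose

print(sorted(predicted), sorted(pending), sorted(missed), sorted(removed))

# Ratio reported in the text: 55 predicted concepts out of 1594.
print(f"{55 / 1594:.1%}")
```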
In addition to the comparison of the extended GO with the one-year newer version, we also extended the version of each month with the system and compared the extended GO with the one-month newer version.3 Notice that the total number of candidate concepts that are gradually incorporated into the one-month newer versions (88 concepts) is larger than the number of candidate concepts that are included in the one-year newer version (55 concepts).
We evaluated the validation step of the system with 55 candidate concepts in the region C.4 We constructed a test corpus of 448 MEDLINE abstracts, where at most 10 abstracts were randomly selected for each candidate concept among the abstracts that were retrieved with the candidate concept via PubMed (http://www.ncbi.nlm.nih.gov/PubMed).
In particular, we evaluated the two proposed validation methods with the test corpus. We found that the sentence validation method correctly recognized 69 occurrences of the candidate concepts and incorrectly recognized 27 occurrences (69/96, 71.9% precision) from 123 sentences, each of which includes all the component words of a candidate term. The method gave rise to 15 false negatives (69/84, 82.1% recall). When we set as a baseline the method that regards all 123 sentences as evidence for the concepts, the precision of the sentence validation method is slightly higher than that of the baseline (84/123, 68.3% precision). Many of the incorrect results of the sentence validation method are due to incorrect analyses by the dependency parser. The rest are due to a limitation of the method, which does not consider semantic relations among component words. Consider the following example of a false positive:
- (6) Clinical manifestations and pathophysiological mechanisms of diabetic angiopathy can be traced back to the development of endothelial cell dysfunction with alterations in the eNOS/NO system production or availability as the primum movens in its natural history. (PMID:15156413)
We found that the abstract validation method correctly recognized 94 occurrences and incorrectly recognized 56 occurrences (94/150, 62.7% precision) from the test corpus of 448 abstracts. The method also gave rise to 60 false negatives (94/154, 61.0% recall). When we set as a baseline the method that treats each of the abstracts as evidence for the corresponding candidate concept, the precision of the abstract validation method is much higher than that of the baseline (154/448, 34.4% precision). Thus, we find that the presented methods are usable in terms of both precision and recall. The abstract validation method gives rise to false negatives due to the lack of deep discourse analysis. The sentences from an example abstract in (7) implicitly represent the GO concept negative regulation of cellular defense response, whose component words are commonly related to the protein BI-1, a salient factor in the discourse structure of the abstract, whereas the component words do not show syntactic dependencies. We leave this problem for future work.
- (7) We found differential expression of BI-1 in response to Bgh in susceptible and resistant plants. Chemical induction of resistance to Bgh by soil drench treatment with 2,6-dichloroisonicotinic acid led to down-regulation of the expression level of BI-1. ... We suggest that BI-1 is a regulator of cellular defense in barley sufficient to substitute for MLO function in accessibility to fungal parasites. (PMID:12704231)
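The precision and recall figures reported for the two validation methods and their baselines follow directly from the stated counts. The short Python check below recomputes them; the helper names and the dictionary layout are choices made for this illustration only.

```python
def precision(true_pos, false_pos):
    """Fraction of recognized occurrences that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of actual occurrences that are recognized."""
    return true_pos / (true_pos + false_neg)


# (true positives, false positives, false negatives) as reported in the text.
counts = {
    "sentence validation": (69, 27, 15),
    "abstract validation": (94, 56, 60),
}
for name, (tp, fp, fn) in counts.items():
    print(f"{name}: precision {precision(tp, fp):.1%}, recall {recall(tp, fn):.1%}")

# Baselines accept every retrieved sentence or abstract as evidence.
sentence_baseline = 84 / 123   # occurrences / sentences with all component words
abstract_baseline = 154 / 448  # occurrences / abstracts in the test corpus
print(f"baselines: {sentence_baseline:.1%} and {abstract_baseline:.1%}")
```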
40% due to the lack of relevant MEDLINE abstracts, and may successfully recognize occurrences of 30% candidate concepts.
CONCLUSION
We have described our system that generates candidate GO concepts from existing GO concepts by utilizing relations between the existing concepts and validates the candidate concepts by recognizing them in the biomedical literature. The system should be helpful for GO developers and curators to speed up their process for GO extension. In particular, they can extend GO with the system in an interactive way of consulting new candidate concepts that are proposed by the system whenever they add new GO concepts.
We can also utilize the present system to address the need for balancing the number of gene products assigned to each GO concept. For example, though the concept imaginal disc morphogenesis is assigned 549 gene products, its subconcepts in the version of June 2004 are assigned only 35 gene products in total (Table 2). This heavy assignment of gene products to a single GO concept indicates that the concept is a sure target for further subcategorization. While the current version of November 2005 provides nine more subconcepts (i.e. the eight subconcepts in Table 2 and another subconcept histoblast morphogenesis), the system successfully predicts eight of the new subconcepts as shown in Table 2. By recursively applying the system to the concepts with heavy assignments of gene products, we will be able to further balance the gene assignment to GO concepts.
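One way to operationalize the notion of a heavy assignment is a simple ratio test between the gene products assigned to a concept and those assigned to its subconcepts. The Python sketch below is only an illustration: the function name and the threshold factor of 5 are arbitrary assumptions, and only the counts 549 and 35 come from the text.

```python
def needs_subcategorization(own_count, subconcept_counts, factor=5.0):
    """Flag a concept whose own assignments far exceed those of its subconcepts.

    The threshold `factor` is a hypothetical tuning parameter.
    """
    return own_count > factor * sum(subconcept_counts)


# imaginal disc morphogenesis: 549 gene products versus 35 for its
# subconcepts in total (version of June 2004).
print(needs_subcategorization(549, [35]))
```

Concepts flagged in this way could then be passed to the extension step again, in line with the recursive application suggested above.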
Acknowledgments
This work was supported by MOST/KOSEF through AITrc. We thank Il Park for insightful discussion. We also thank the anonymous reviewers for helpful comments.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Alfonso Valencia
1 Appendix A of Supplementary Materials shows the growth rate of each subontology.
2 Readers may download the whole set of the GO that is automatically extended from the version of June 2004 at the homepage http://autogo.biopathway.org
3 Appendix B shows the results of the comparisons with a one-month interval.
4 Readers may see the evaluation results on the test corpus at the homepage http://autogo.biopathway.org
Received on April 18, 2005; revised on January 12, 2006; accepted on January 15, 2006
REFERENCES
Apweiler, R., et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res., 32, D115-D119.
Blake, J., et al. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res., 31, 193-195.
Blaschke, C. and Valencia, A. (2002) Automatic ontology construction from the literature. Proceedings of the International Conference on Genome Informatics (GIW), pp. 201-213, Tokyo, Japan.
Chiang, J. and Yu, H. (2003) MeKE: discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics, 19, 1417-1422.
Christie, K., et al. (2004) Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res., 32, Database issue, D311-D314.
Cimiano, P. and Staab, S. (2005) Learning concept hierarchies from text with a guided hierarchical clustering algorithm. Proceedings of the Workshop on Learning and Extending Lexical Ontologies by using Machine Learning Methods at ICML 2005, Bonn, Germany.
Couto, F., et al. (2005) Finding genomic ontology terms in text using evidence content. BMC Bioinformatics, 6, Suppl.1, S21.
Ehrler, F., et al. (2005) Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot. BMC Bioinformatics, 6, Suppl.1, S23.
Hirschman, L., et al. (2005) Overview of BioCreAtIvE: critical assessment of information extraction in biology. BMC Bioinformatics, 6, Suppl.1, S1.
Jacquemin, C. (1999) Syntagmatic and paradigmatic representations of term variation. Proceedings of the ACL, pp. 341-348.
Kim, J. and Park, J. (2004) Annotation of gene products in the literature with Gene Ontology terms using syntactic dependencies. Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), pp. 528-534.
Koike, A., et al. (2005) Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics, 21, 1227-1236.
Krallinger, M., et al. (2005) A sentence sliding window approach to extract protein annotations from biomedical articles. BMC Bioinformatics, 6, Suppl.1, S19.
Raychaudhuri, S., et al. (2002) Associating gene with Gene Ontology codes using a maximum entropy analysis of biomedical literature. Genome Res., 12, 203-214.
Rhee, S., et al. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res., 31, 224-228.
Steedman, M. (2000) The syntactic process. The MIT Press, Massachusetts, USA.
The FlyBase Consortium. (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res., 31, 172-175.
The Gene Ontology Consortium. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258-D261.
Witschel, H. (2005) Using decision trees and text mining techniques for extending taxonomies. Proceedings of the Workshop on Learning and Extending Lexical Ontologies by using Machine Learning Methods at ICML 2005, Bonn, Germany.