Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(7):1227-1236; doi:10.1093/bioinformatics/bti084
Automatic extraction of gene/protein biological functions from biomedical text
1Department of Computational Biology, Graduate School of Frontier Science, The University of Tokyo Kiban-3A1(CB01) 5-1-5, Kashiwanoha Kashiwa, Chiba 277-8561, Japan
2Central Research Laboratory, Hitachi Ltd. 1-280 Higashi-koigakubo, Kokubunji City, Tokyo 185-8601, Japan
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Since automatic functional annotations of genes are quite useful for interpreting large amounts of high-throughput data efficiently, the demand for automatic extraction of information related to gene functions from text has been increasing.
Results: We have developed a method for automatically extracting the biological process functions of genes/proteins/families based on Gene Ontology (GO) from text using a shallow parser and sentence structure analysis techniques. When the gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the gene/protein/family name dictionaries developed by our group. To achieve wide recognition of the gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques. A preliminary experiment demonstrated that our method has an estimated recall of 54–64% with a precision of 91–94% for actually described functions in abstracts. When applied to PUBMED, it extracted over 190 000 gene–GO relationships and 150 000 family–GO relationships for major eukaryotes.
Availability: The extracted gene functions are available at http://prime.ontology.ims.u-tokyo.ac.jp
Contact: akoike{at}hgc.jp
| INTRODUCTION |
|---|
With the development of high-throughput methods such as the yeast two-hybrid method, mass spectrometry and genome sequencing, an enormous amount of experimental results covering various genes can be quickly obtained. Although all relevant results reported so far should be considered when interpreting experimental data, retrieving them from PUBMED abstracts and/or full papers and studying them is an overwhelming task for a single researcher.
A promising approach to overcoming this problem is the use of natural language processing (NLP) to automatically extract and mine the information. The advantages of NLP in the biomedical field have been demonstrated for gene/protein name recognition (Fukuda et al., 1998; Collier et al., 2000; Tanabe and Wilbur, 2002), protein–protein interactions (Friedman et al., 2001; Koike et al., 2003) and general event extraction (Yakushiji et al., 2001; Rindflesch et al., 2000; Humphrey et al., 2000). In addition, to clarify the definitions of classes (concepts) and to formalize the relationships between the classes that have been generated by the rapid advances in the biomedical field, efforts have been made to construct ontologies manually (Ashburner et al., 2000; disease ontology, http://diseaseontology.sourceforge.net/; IMGT ontology, http://imgt.cines.fr/textes/IMGTindxn/ontology.html) and automatically (Blaschke and Valencia, 2002). Gene Ontology (GO) (Ashburner et al., 2000), the most widely used ontology, consists of biological process, molecular function and cellular component ontologies. Several preliminary studies have been made on automatically annotating genes, proteins and families with the corresponding GO-ID, which is assigned to each defined term (class), using only abstracts and/or sequence information (Schug et al., 2002; Xie et al., 2002; Raychaudhuri et al., 2002; Nenadic et al., 2003).
To evaluate information extraction (IE) systems on common tasks such as gene name recognition and GO-based gene function annotation, to clarify common problems, and to accelerate IE progress in the biomedical field, the KDD cup (http://www.biostat.wisc.edu/~craven/kddcup/), TREC (http://trec.nist.gov/) and BioCreAtIvE (http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html) have been held. The domain-specific problems of biomedical text mining have become apparent, and various techniques for solving them have been proposed.
We have developed a method for automatically assigning the GO-ID of a biological process to each gene and protein using natural language techniques. It uses shallow parsing and sentence structure analysis to extract the ACTOR and OBJECT relationships, so detailed gene functional annotations are possible, at least in theory.
Gene Ontology vocabularies are controlled ones, so some of them do not frequently appear in the abstracts. Terms (words and multi-word terms) representing similar or related meanings of GO terms are gathered semi-automatically using co-occurrence and collocation similarity of GO terms to enable recognition of the functional terms. Furthermore, rule-based term generation including morphological and syntactic term variations is used to complement the semi-automatic term-gathering methods mentioned above.
This paper is organized as follows. In the following section, we introduce related work and compare our method with related ones. The functional terminology generation method and the gene–function relationship extraction method are explained in Systems and Methods. The precision and recall rate of our extraction system are presented in Results and Discussion. We also discuss the causes of errors. We conclude with a summary.
Related work
The extraction of relationships between classes has been well studied (Yakushiji et al., 2001; Rindflesch et al., 2000; Humphrey et al., 2000). For example, Yakushiji et al. (2001) demonstrated biological event extraction using a full parser. Rindflesch et al. (2000) developed an information extraction (general extraction) system for drugs and genes relevant to cancer using a shallow parser and UMLS (http://www.nlm.nih.gov/research/umls/). For the identification of GO-IDs, methods combining sequence information with text information (Xie et al., 2002) and methods using only text information (Raychaudhuri et al., 2002; Nenadic et al., 2003; Kim and Park, 2004) have been proposed.
Raychaudhuri et al. (2002) demonstrated that abstracts can be classified into major GO-ID classes using words in the abstracts and machine learning, such as the maximum entropy, Naive Bayes or nearest-neighbor method. Nenadic et al. (2003) classified genes and proteins into major GO-ID groups using words (ignoring collocations) appearing in abstracts and a support vector machine. Unfortunately, the receiver operating characteristic (ROC) curve of some class IDs, which did not have a sufficient number of abstracts, was quite low. Since the annotated corpus for detailed GO-IDs is not yet sufficiently large, the IDs used in both methods are limited to high-level classes, such as signal transduction. Although biologists would usually like to know the evidence for an automatic extraction result at the abstract or sentence level, neither method clearly shows it. There have been preliminary reports on sentence-level annotation using a gene–function relationship extraction method based on syntactic dependency (Kim and Park, 2004) and using one based on sentence similarity using Naive Bayes (Chiang and Yu, 2003). In BioCreAtIvE, which uses full texts, a sliding window approach was used to capture gene–function descriptions spanning multiple sentences (Krallinger and Padron, 2004), and query expansions and derivationally related term expansions were used to achieve wide recognition of gene ontology terms (Krymolowski et al., 2004). Since the information extraction of gene–function relationships is quite difficult, further efforts are required.
Functional annotation is difficult because there are a variety of functional term expressions, and some functions cannot be correctly extracted without considering the ACTOR–OBJECT relationship (a simple co-occurrence of gene and functional terms becomes an error). In our approach, functional terms are gathered using various methods in order to address the first problem as much as possible, and sentence structure analysis is used to address the second problem.
| SYSTEMS AND METHODS |
|---|
The overview of our method for automatically extracting the gene function is shown in Figure 1. The following sections describe the method for gathering/generating the terms for each GO-ID and the method for extracting the information and assigning the GO-ID using the gathered terms.
Augmentation of functional terms
In principle, the terms for the biological process of GO are used to assign the GO-ID to the gene. However, the number of these terms is insufficient for automatic extraction (at least at the sentence level) in terms of the recall. The terms are controlled vocabularies, and some are not frequently used in the abstracts. Furthermore, each GO class can be expressed using various terms instead of a defined GO term (e.g. GO0006915: apoptosis can be replaced with apoptotic process in most cases). We therefore gather synonymous, similar and related terms semi-automatically, based on:
- related terms having a high co-occurrence score with GO terms,
- similar terms having similar collocations with GO terms,
- enzyme name extraction by pattern matching,
- rule-based generation of syntactic/semantic variations and
- verb–technical term combination variations.
Related terms having high co-occurrence score with GO terms
The terms that co-occur with statistically significant frequency in the abstracts are extracted as follows. Assume we have a large-scale text database, D. For a GO function term F, the co-occurrence score of an arbitrary term T with F can be measured using various formulae, among which the simplest is the ratio between the density of T in the texts containing F and the density of T in the whole database D. Although there are more sophisticated formulae, they are difficult to use when comparing the co-occurrence scores of terms whose frequencies differ greatly. Therefore, we first classify all candidate terms (i.e. terms appearing at least once in the texts containing F) into several frequency classes and take the ones with the highest score from each class.
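The density-ratio score and the per-class selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each abstract is modeled as a set of terms, density is taken to be the fraction of abstracts containing T, the frequency classes are assumed to be orders of magnitude of document frequency, and both function names are hypothetical.

```python
import math
from collections import defaultdict

def cooccurrence_scores(docs, go_term):
    """Score each candidate T by density(T | docs containing F) / density(T | all docs)."""
    with_f = [d for d in docs if go_term in d]
    if not with_f:
        return {}
    # Candidates: terms appearing at least once in the texts containing F.
    candidates = set().union(*with_f) - {go_term}
    scores = {}
    for t in candidates:
        density_f = sum(t in d for d in with_f) / len(with_f)
        density_all = sum(t in d for d in docs) / len(docs)
        scores[t] = density_f / density_all
    return scores

def top_per_frequency_class(docs, scores, k=1):
    """Assumed classing: group by order of magnitude of document frequency, keep top k."""
    classes = defaultdict(list)
    for t, s in scores.items():
        df = sum(t in d for d in docs)
        classes[int(math.log10(df))].append((s, t))
    return {c: [t for _, t in sorted(v, reverse=True)[:k]] for c, v in classes.items()}
```

Comparing scores only within a frequency class avoids ranking a rare term against a very common one, which is the difficulty the text describes.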
The summarized candidate terms calculated for each organism and each frequency class are presented as HTML pages, and terms with a meaning similar to that of the query term are selected by biologists (PhD holders or PhD students).
Similarity of terms having similar collocations with GO terms
As mentioned above, the similarity of terms is measured by the similarity of their collocations. For each term T, its profiling text is defined as the set of all collocations of T (paired with their frequencies) in the database. As for the types of collocations, we adopted simpler ones such as np(noun phrase)-vp(verb phrase), vp-np, vp-prep(preposition)-np, np-vp-np, np-vp-prep-np and np-prep-np. The search for similar terms is then done by applying a similar text search technique, the vector space model (Salton et al., 1975).
The procedure for making the profiling text of a term is shown in Figure 2. After shallow parsing of the whole text, all collocation patterns (see above) are extracted. The expansion process and the sorting and indexing processes are then applied to obtain the indexing of all terms by their collocations.
The similarity is defined by the following equations, which are known as SMART (Singhal et al., 1996):
[Equation (1): similarity between query term q and term X]
[Equation (2): weight of collocation ci]
[Equation (3): weight of significance of collocation ci with respect to term X]
In Equation (1), the slope constant is set to 0.2, and ci is the i-th collocation of query term q. The weight of each collocation ci is defined by Equation (2), where df(ci) is the number of terms whose profile texts contain ci and N is the total number of terms. The weight of significance of each collocation ci with respect to term X is given by Equation (3), where tf(ci|X) is the frequency of collocation ci in the profile of X, and tf(.|X) is the average of the frequencies over the collocations making up the profile of X. The summarized candidate terms calculated for each organism and each frequency class are presented as HTML pages, and terms with a meaning similar to that of the query term are selected by the same biologists (PhD holders or PhD students).
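Because the exact equations are not reproduced here, the following is only a sketch of a SMART-style similarity built from the quantities the text names (df, N, tf, the average tf and a slope of 0.2). The specific functional forms are assumptions borrowed from the usual pivoted-normalization weighting: log(N/df) for the collocation weight, (1 + log tf)/(1 + log average tf) for the significance weight, and (1 − slope) × pivot + slope × profile size for the normalizer. All function names are hypothetical.

```python
import math

SLOPE = 0.2  # slope constant stated in the text

def collocation_weight(c, profiles):
    """Assumed form of Equation (2): log(N / df(c))."""
    n = len(profiles)
    df = sum(1 for p in profiles.values() if c in p)
    return math.log(n / df) if df else 0.0

def significance(c, profile):
    """Assumed form of Equation (3): (1 + log tf(c|X)) / (1 + log mean tf(.|X))."""
    tf = profile.get(c, 0)
    if tf == 0:
        return 0.0
    mean_tf = sum(profile.values()) / len(profile)
    return (1 + math.log(tf)) / (1 + math.log(mean_tf))

def similarity(query, target, profiles):
    """Assumed form of Equation (1): weighted overlap with pivoted length normalization."""
    q, x = profiles[query], profiles[target]
    pivot = sum(len(p) for p in profiles.values()) / len(profiles)
    norm = (1 - SLOPE) * pivot + SLOPE * len(x)
    total = sum(
        collocation_weight(c, profiles) * significance(c, q) * significance(c, x)
        for c in q
    )
    return total / norm
```

Here `profiles` maps each term to its profile, a dictionary from collocation pattern to frequency; terms that share no collocation with the query score zero.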
Enzyme name extraction by pattern matching
Most functions, including metabolism, catabolism and synthesis, are expressed using an enzyme name. To compensate for the weakness of the vocabularies extracted using the two methods described above, enzyme names ending with -ase are also extracted from the abstracts corresponding to one year. For example, for the GTP metabolism function, GTP cyclohydrolase, GTP hydrolase, GTPase and GTP guanylyltransferase are extracted as enzyme names related to GTP metabolism. However, some enzyme terms that end in -ase are not related to these functions. For example, the function of permease belongs to transport. These unrelated terms are removed from the collected vocabularies semi-automatically.
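A pattern-matching step of this kind might look like the sketch below. The regular expression, the `enzyme_names` helper and the exclusion list are illustrative assumptions; the paper does not give its actual patterns, and its removal of unrelated terms is semi-automatic rather than a fixed list.

```python
import re

def enzyme_names(text, substrate, exclude=("permease",)):
    """Collect '<substrate> ...ase' and '<substrate>ase' names, dropping excluded suffixes."""
    pattern = re.compile(rf"\b{re.escape(substrate)}(?:\s+[A-Za-z-]*ase|ase)\b")
    found = {m.group(0) for m in pattern.finditer(text)}
    return sorted(n for n in found if not n.lower().endswith(tuple(exclude)))

text = ("GTP cyclohydrolase and GTPase were assayed; "
        "GTP permease was not. GTP binding increased.")
print(enzyme_names(text, "GTP"))
```

The exclusion step mirrors the permease example: the name matches the -ase pattern but belongs to transport, so it is filtered out.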
Rule-based generation of syntactic/semantic variations
Syntactic variations such as folding of protein for protein folding are automatically generated. Furthermore, semantically similar/related terms (metabolism → metabolic, metastasis, metamorphosis, reducer, reduction) and derivationally related terms (apoptosis → apoptotic) of a GO term, or of a single word making up a GO term, are gathered using UMLS (for derivationally related terms), WordNet (http://www.cogsci.princeton.edu/~wn/) (for both) and expert knowledge (for both). Errors are generated in some automatic conversions. For example, the terms transport and exchange are used similarly in ion transport/exchange, but not in nuclear transport. Accordingly, conservative conversion terms are provided. Functional term variations are generated using these similar/related terms. When the same term is automatically generated for multiple GO-IDs, the superclass ID (higher concept class ID) is used. Hyponym terms (lower class terms, e.g. phosphatidylinositol is a hyponym of phospholipid) are also gathered from the MeSH terms (http://www.nlm.nih.gov/mesh/meshhome.html).
Verb–technical term combination variations
Some functions such as regulation, transport and synthesis are expressed frequently by the combination of a verb and technical terms. A predefined verb is combined with one or more technical terms. For example, GO0006846: acetate transport is assigned to ACTOR when the verb is transport, locate, localize, translocate, import or export and acetate is included in OBJECT. Furthermore, some functions can be determined based on the combination of a verb and an OBJECT or based only on the verb. GO0004672: protein kinase activity is assigned when the verb is phosphorylate and the ACTOR and OBJECT include a protein name. If the OBJECT does not include a protein name, the ACTOR may be a kinase (for compounds). When the verb is palmitoylate, GO0018318: protein amino acid palmitoylation is assigned to ACTOR without investigating terms in the OBJECT. These verb–technical term combination variations are semi-automatically produced.
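The three examples above can be written as a small rule table. This is a sketch only: the `assign_go` function and its boolean arguments are hypothetical, and the real system derives the ACTOR and OBJECT contents from parsing rather than receiving them as flags.

```python
TRANSPORT_VERBS = {"transport", "locate", "localize", "translocate", "import", "export"}

def assign_go(verb, object_terms, actor_has_protein=False, object_has_protein=False):
    """Return the GO labels assigned to ACTOR for one ACTOR-verb-OBJECT triple."""
    assigned = []
    # Verb plus technical term in OBJECT.
    if verb in TRANSPORT_VERBS and "acetate" in object_terms:
        assigned.append("GO0006846: acetate transport")
    # Verb plus type of ACTOR and OBJECT.
    if verb == "phosphorylate" and actor_has_protein and object_has_protein:
        assigned.append("GO0004672: protein kinase activity")
    # Verb alone.
    if verb == "palmitoylate":
        assigned.append("GO0018318: protein amino acid palmitoylation")
    return assigned

print(assign_go("import", {"acetate"}))
```

Keeping the verb list explicit per GO-ID makes it easy to tighten individual rules when a verb turns out to be too permissive.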
By applying the first two methods to about 190 major GO terms, we gathered about 3000 terms. Of these, less than 30% were commonly extracted using method 1 (co-occurrence) and method 2 (collocations). That is, these methods compensate for each other's weaknesses. By using all five methods, we gathered about 240 000 terms. (There were about 10 000 original GO terms.)
| Extraction of relations between genes and gene functions |
|---|
The biological function of each gene was annotated using the following procedure, which is illustrated in Figure 1. The example sentence is shown in Figure 3. The steps are as follows.
Step 1. Recognition of gene/protein/family names and GO functional terms
The gene name recognition method is described elsewhere (Koike and Takagi, 2004). Briefly, gene name recognition is carried out using the GENA gene name dictionary (http://gena.ontology.ims.u-tokyo.ac.jp/search/servlet/gena) and family name dictionary (http://marine.ims.u-tokyo.ac.jp:8080/Dict/family), which were constructed based on major database entries. In our system, a protein name that does not specify the gene locus is treated as a family name. For example, since 14-3-3 does not specify the gene locus (14-3-3 alpha, 14-3-3 beta, etc.), it is registered as a protein family name. The variations in gene names were generated based on these dictionaries and were quickly searched against abstracts using a specially devised trie with many heuristics, such as replacing special characters with spaces, searching inside and outside parentheses separately [e.g. mitogen-activated protein kinase (MAPK) 1 → mitogen-activated protein kinase 1 + MAPK1], and expanding continuous expressions (e.g. GATA-4/5/6 → GATA4, GATA5, GATA6). After gene/protein/family name recognition, ambiguities in gene names, especially in abbreviated names [e.g. TAK1 is the abbreviated synonym for MAP3K7 (mitogen-activated protein kinase kinase kinase 7) and NR2C2 (nuclear receptor subfamily 2, group C, member 2)], were resolved using full-name/abbreviation pair search and keyword search. Finally, the existence of multiple expressions for the same gene was checked [e.g. multiple-name expression HAP1 (CYP1) in Saccharomyces cerevisiae: HAP1 is the gene name of YLR256W and YPL101w, but the second name CYP1 specifies this gene as YLR256W]. In our method, precision and recall were over 90% for the major eukaryotes (Koike and Takagi, 2004). The recognition of functional terms was also done quickly over all abstracts using a trie that considers trivial term variation (replacement of special characters with a space).
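Two of the name-variation heuristics mentioned in this step, expanding continuous expressions and treating the text inside and outside parentheses separately, can be illustrated as below. The regular expressions and function names are assumptions made for this sketch; the actual system applies such heuristics inside its trie search.

```python
import re

def expand_series(name):
    """Expand a continuous expression such as 'GATA-4/5/6' into individual names."""
    m = re.fullmatch(r"([A-Za-z]+)-?(\d+(?:/\d+)+)", name)
    if not m:
        return [name]
    return [m.group(1) + number for number in m.group(2).split("/")]

def split_parenthetical(name):
    """Split 'full name (ABBR) suffix' into the full form and the abbreviated form."""
    m = re.fullmatch(r"(.+?)\s*\(([^)]+)\)\s*(\S*)", name)
    if not m:
        return [name]
    full, abbreviation, suffix = m.groups()
    return [f"{full} {suffix}".strip(), f"{abbreviation}{suffix}"]

print(expand_series("GATA-4/5/6"))
print(split_parenthetical("mitogen-activated protein kinase (MAPK) 1"))
```

Names that match neither pattern are returned unchanged, so the helpers can be applied to every dictionary entry without special-casing.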
Step 2. Shallow parsing, noun phrase bracketing and sentence structure analysis
Shallow parsing was done for sentences with gene name IDs using FDG-Lite (http://www.connexor.com/). After noun phrase bracketing using dependency/syntactic tags and morphological tags, parentheses, coordinate clauses, subordinate clauses, etc. were analyzed using various standard rules.
FDG-Lite, developed by Voutilainen et al. at the University of Helsinki, gives the base form, dependency/syntactic tags and morphological tags. When a determiner, adverbial and adjective modifiers, coordinating conjunction, participle, noun and pronoun are contiguous, they are regarded as a noun phrase. Boundary recognition of noun phrases including a coordinating conjunction and comma requires the use of certain devices. The number of coordinating conjunctions before the target coordinating conjunction, whether or not a past_participle_modifier is located after the target coordinating conjunction, whether or not the verb is before or after the target coordinating conjunction, and whether or not the target coordinating conjunction is in a subordinate phrase or adverbial phrase beginning with an interrogative are checked for the boundary of the noun phrase including coordinating conjunctions and comma.
In principle, the noun phrase preceding the predicate verb is regarded as the subject, and the noun phrase or prepositional phrase immediately following the predicate verb is regarded as the object. Certain rules are used for complicated sentence structures, such as coordinate-conjunction and insertion-phrase structures. For example (ignoring adverb phrases and prepositions for simplicity):
- NP1 verb1 NP2 coordinating_conjunction verb2 NP3
The subject of verb2 is NP1
- NP1, Verb1-ing NP2, Verb2 NP3
The subject of verb1 and verb2 is NP1.
- NP1, NP2 verb1 NP3, verb2 NP4
NP2–NP3 is an insertion phrase; the subject of verb2 is NP1.
- NP1 verb1 (predefined verb, such as belong, consist, encode) NP2 [relative pronoun] verb2 NP3
The subjects of verb2 are NP1 and NP2.
- NP1 verb1 to-infinitive verb2 NP2
The subject of verb2 is NP1.
In noun phrases including a modifier_of_noun:past_participle and participle, the subject and object inside the phrase are also extracted.
- NP1 verb-ing NP2 → NP1 verb NP2
- NP1 verb-en NP2 → NP2 verb NP1
Simple anaphora (coreference of a term or phrase with its antecedent) resolution was also tried. When a pronoun appeared after a relative pronoun, the previously appearing gene name was assigned after checking for singular/plural consistency. For example, in the sentence "In S.cerevisiae, OAC is in the inner mitochondrial membranes, and deletion of its gene greatly reduces transport of oxaloacetate sulfate", our program resolves "its gene" to OAC.
Step 3. ACTOR–OBJECT relationship extraction
The gene–function relationships are extracted when they are expressed in ACTOR–OBJECT relationships with predefined verbs or in modification relationships. Here, ACTOR (agent) means the doer of action and OBJECT means the receiver of action (a higher concept than the object of a subject–object pair). Basically, the relationship is extracted only when ACTOR is a gene name and OBJECT is a gene function. For some verbs, such as require, the reverse relation is extracted. We use these terms because relationships between ACTOR/OBJECT and gene name/function are not affected by passive or active voice, whereas subject–object relationships are (in most cases, the subject is the protein and the object is its function in active voice, while the opposite holds true in passive voice).
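The voice independence of ACTOR and OBJECT can be made concrete with a small sketch. The `actor_object` function, the `passive` flag and the placeholder names protein-A and protein-B are hypothetical; in the real system the voice comes from the parser's morphological tags, and the set of reversed verbs is predefined by the authors (only require is named in the text).

```python
REVERSED_VERBS = {"require"}  # only this example is given in the text

def actor_object(subject, verb, obj, passive=False):
    """Map a subject-verb-object triple to an (ACTOR, OBJECT) pair."""
    # In passive voice the grammatical subject is the receiver of the action.
    actor, receiver = (obj, subject) if passive else (subject, obj)
    # For verbs such as "require" the reverse relation is extracted.
    if verb in REVERSED_VERBS:
        actor, receiver = receiver, actor
    return actor, receiver

print(actor_object("protein-A", "regulate", "apoptosis"))
print(actor_object("apoptosis", "regulate", "protein-A", passive=True))
print(actor_object("DNA repair", "require", "protein-B"))
```

The first two calls return the same pair, which is why the extraction rules are stated in terms of ACTOR and OBJECT rather than subject and object.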
The extraction patterns are roughly summarized in Table 1. In each sentence, only the gene function extracted using the corresponding pattern is highlighted. The kinds of verbs were predefined. As shown in Table 1, the ACTOR and OBJECT extraction patterns were not limited to subject–object relationships. The gene and its function can be expressed in a modification relationship, subject–complement relationship, subject–adverb relationship and so on.
Step 4. GO-ID assignment to genes (type-1 extraction)
After extraction of the gene–function relationship, whether it is negative or affirmative and whether it is a contingent fact (including investigate, test, examine, study, design and predicate) or not are checked. (The negative and contingent facts are also stored in the database PRIME with marks. However, in the following discussion, these relationships are not used.) For the verb–technical term combinations (as described in Augmentation of functional terms), the verb is confirmed to be the predefined one. For some terms, it is difficult to determine from one sentence whether the assigned function is appropriate. For example, GO0006350: transcription is defined by the Gene Ontology Consortium as "the synthesis of either RNA on a template of DNA or DNA on a template of RNA". In many contexts, however, the ACTOR of transcription is simply a protein activator. Accordingly, the GO-ID is accepted only when at least one keyword such as zinc-finger, Pol_I, Pol_II, Pol_III or TFIIB appears in the same abstract. Finally, the gene-ID and GO-ID relationship is output.
Step 5. Keyword search in object/complement (type-2 extraction)
In the example sentences shown in Figure 3, a complete GO term is expressed in each sentence. However, if the sentence includes an expression such as chromosome III segregation, the same ID cannot be assigned. While the resolution of collocation variants has been well studied in NLP (Jacquemin and Royaute, 1994), it is still a challenging task. Here, we tried a simple keyword search. The score for each word making up a functional term (TermScore[i], the score of the i-th word) was defined as 1/[1 + log(frequency + 1)], where frequency is the frequency of appearance of the word in abstracts over 2 years. The sum of these scores over all words of each collocation (SumScore) was calculated. The score for each collocation given the keywords found (CollScore) was defined as
CollScore = Σ_{j ∈ given keywords} TermScore[j] / SumScore. When the top CollScore was over 0.75, the corresponding GO-ID was accepted. The threshold of 0.75 was determined by using about 100 learning abstract sets.
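The keyword score can be computed as in the following sketch. The natural logarithm is an assumption (the text does not state the base), the word frequencies in the example are invented for illustration, and the function names are hypothetical.

```python
import math

THRESHOLD = 0.75  # acceptance threshold stated in the text

def term_score(frequency):
    """TermScore = 1 / (1 + log(frequency + 1)); frequent words score lower."""
    return 1.0 / (1.0 + math.log(frequency + 1))

def coll_score(go_term_words, found_words, freq):
    """CollScore: summed TermScore of the matched words divided by SumScore."""
    scores = {w: term_score(freq.get(w, 0)) for w in go_term_words}
    sum_score = sum(scores.values())
    matched = sum(s for w, s in scores.items() if w in found_words)
    return matched / sum_score

# Invented frequencies, for illustration only.
freq = {"chromosome": 12000, "segregation": 900}
phrase_words = {"chromosome", "III", "segregation"}
score = coll_score(["chromosome", "segregation"], phrase_words, freq)
print(score, score > THRESHOLD)
```

Because rare words receive higher TermScores, matching the rare words of a GO term contributes more to CollScore than matching its common ones.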
Step 6. GO-ID assignment to genes (type-2 extraction)
Step 6 is the same as Step 4, but for type-2 extractions.
| RESULTS AND DISCUSSION |
|---|
Evaluation method
To evaluate the performance of our function extraction, we used the same abstracts used for GO-term annotation in the SGD database (http://www.yeastgenome.org/) as an S.cerevisiae test set and those used for GO annotation (GOA; Camon et al., 2003) as a Homo sapiens test set. These annotated data include GO evidence codes. Since they include GO-IDs assigned based on sequence similarity, we used only the abstracts with evidence code IDA: inferred from direct assay. Furthermore, since the SGD and GOA annotations were done using full papers and the biological functions are not described in some abstracts, we used abstracts that included the corresponding gene names. In total, we used 510 abstracts (726 gene–function relationships) for S.cerevisiae and 202 abstracts (226 gene–function relationships) for H.sapiens. The recall [=true_positive/(true_positive+false_negative)] was calculated using these abstracts and annotated relationships. Whether each gene–function relationship could be extracted from the corresponding abstract was used as the evaluation metric. Since not all relationships described in each paper were extracted in SGD and GOA (probably because the annotators' primary queries to PUBMED are gene names, so information on other genes is not necessarily extracted), two kinds of precision [=true_positive/(true_positive+false_positive)] were calculated. One was calculated using the SGD/GOA annotation as the gold standard (type-1 and type-2). The other was calculated based on 100 randomly selected gene–function relationships extracted using each method (type-1 and type-2) for S.cerevisiae and H.sapiens. In this calculation, whether the assigned GO-ID was appropriate or not was determined using the same criteria, criteria-1, -2 or -3. In the following section, the precision and recall for each method and the causes of the false-positive and false-negative errors are discussed.
Precision and recall
Precision and recall for type-1 extractions
Tables 2 and 3 show the results of the type-1 extractions. Since the GO has a hierarchical structure, a superclass or subclass ID was assigned in some cases. The definitions of criteria-1, -2 and -3 are given in the comments to Tables 2 and 3. Although only abstract information is used for gene/protein function extraction in this method, some GOA/SGD annotations are not described in the abstracts but only in the body text. Therefore, we investigated whether each of 100 randomly selected GOA/SGD annotations was written in an in-depth class description (criteria-1), in a higher class description (criteria-2, -3), or neither. The numbers in parentheses in Table 2 are the recall values estimated using these fractions. For example, for S.cerevisiae, 61% of the annotations were written in an in-depth class description (complete match with SGD annotation) in the abstracts. Therefore, 18.2/0.61 = 29.8% is the estimated recall.
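The estimated recall is the raw recall divided by the fraction of annotations that are actually described in the abstracts. The short sketch below reproduces the S.cerevisiae figure from the text; the function names are hypothetical, and the precision and recall helpers simply restate the definitions given in the evaluation method.

```python
def precision(true_positive, false_positive):
    return true_positive / (true_positive + false_positive)

def recall(true_positive, false_negative):
    return true_positive / (true_positive + false_negative)

def estimated_recall(raw_recall_pct, fraction_described):
    """Correct the raw recall for annotations not described in the abstract."""
    return raw_recall_pct / fraction_described

# S.cerevisiae, criteria-1: raw recall 18.2%, 61% of annotations described in abstracts.
print(round(estimated_recall(18.2, 0.61), 1))  # 29.8
```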
As shown in Table 2, the recall rate was low. About 5% of our results were written in a lower (more detailed) class description than the SGD/GOA annotations for both organisms. The superclass ID annotations produced by our method arose because adverbs were ignored and the hyponym vocabularies were insufficient. For example, in the sentence "Inhibition of angiogenesis by recombinant human platelet factor-4 and related peptides", GO0001525: angiogenesis instead of GO0016525: negative regulation of angiogenesis was assigned by our method. Classifying the adverbs, verbs and adjectives should enable more detailed annotations. When only the co-occurrence of GO-ID and gene-ID in the same sentence was used for the annotation, the recall rates (criteria-3) for S.cerevisiae and H.sapiens were 63 and 55%, respectively. However, the precisions were lower than 50% because the ACTOR–OBJECT relationship was not considered. For example, GO0008152: metabolism was erroneously assigned to protein-B in the sentence "protein-A is involved in the metabolism of protein B". In this case, the precision may differ greatly among the vocabularies prepared for each GO-ID.
The causes of the low recall rate even for criteria-3 are as follows (in order of frequency).
Incomplete vocabularies. In many cases, hyponyms were used in the abstract. For example, "in response to phorbol ester" is used in one abstract to represent the function response to organic substance. The knowledge that phorbol ester is a hyponym of organic substance is required. Although the MeSH terms described above and the UMLS hierarchy were used in some cases, the resolution of a broad class is difficult.
Function not described by a pattern in Table 1. The function expressions varied. For example, "We purified Tip1p from a glucanase extract of yeast cell walls and analyzed the sugar chain involved in the cell wall linkage" describes the function of GO0007047: cell wall organization and biogenesis. Although purify was registered as a predefined verb, the pattern purify NP (gene-name) from NP (apparatus name) was not provided for the organization process. Furthermore, some relationships are difficult to extract. For example, "The main physiological roles of Odc1p and Odc2p are probably to supply 2-oxoadipate and 2-oxoglutarate from the mitochondrial matrix to the cytosol where they are used in the biosynthesis of lysine and glutamate, respectively, and in lysine catabolism" implicitly indicates the functions of Odc1 and Odc2 as GO0006839: mitochondrial transport. Although these patterns and predefined verbs can be added, the task is never ending.
Function not written in one sentence. In some cases, the function is written over multiple sentences, and a pronoun is used in a subsequent sentence. Although some trials using multiple sentences have been reported (Krallinger and Padron, 2004), resolving this situation without degrading precision appears difficult.
Parser errors and structure errors. There were only a few false negatives due to parser and structure errors, so few sentence structures had to be analyzed. The false negatives were due to incorrect recognition of the start point of prepositional phrases, incorrect recognition of modifier_of_noun: past_participle instead of main_verb: past_tense, and unresolved anaphora.
In Table 3, the numbers in parentheses were calculated using SGD and GOA annotations as the gold standard, while the numbers before the parentheses represent the precision based on manual checking of 100 randomly selected gene–function relationships. Although the deep class annotation (criteria-1) seems to be difficult, the precision for criteria-3 seems to be sufficient for practical use. The causes of the errors are as follows (in descending order of importance):
Gray zone errors. Many errors occurred in the semantically gray zone. For example, for the sentence "IGIF has been found to enhance the production of interferon-gamma (IFN-gamma) and granulocyte/macrophage colony-stimulating factor (GM-CSF) while inhibiting the production of IL-10 in concanavalin A (Con A)-stimulated PBMC", the assignment of the function GO0042091: IL-10 biosynthesis to colony-stimulating factor (GM-CSF) is in the gray zone (the extraction pattern is adverb phrase in Table 1). This sentence implicitly indicates the relation but does not state the function explicitly. While these errors can be eliminated by limiting the number of extraction relationship patterns, doing so would reduce the recall.
Loose verb conditions. The limitation on verbs in the verb–technical term combinations is quite loose in the catabolism, synthesis, biosynthesis, organization, biogenesis and metabolism combinations. This causes false positives. For example, for "Saccharomyces cerevisiae possess two Escherichia coli endonuclease III homologs, NTG1 and NTG2, whose gene products function in the base excision repair pathway and initiate removal of a variety of oxidized pyrimidines from DNA", GO0006221: pyrimidine biosynthesis and GO0006281: DNA repair were assigned to NTG1 and NTG2. The former is a false positive. An additional screening device is needed for such terms.
Parser errors and sentence structure analysis errors. There were only a few parser and sentence structure analysis errors in our small test sets. Some errors were observed in the modification relationships in "NP preposition NP" structures and in the recognition of long names.
Gene name recognition errors. Some names are shared by multiple genes. Although we tried to resolve this ambiguity by abbreviation–full name matching and keyword searching, there were still some failures, as described elsewhere (Koike and Takagi, 2004).
Precision and recall for type-2 extractions
To compensate for the limited variety of expressions for biological functions, key word matching was also tried in place of complete term matching. The results are summarized in Tables 4 and 5. Compared to the rates for the type-1 extractions, the recall rate was slightly higher, and the precision was slightly lower. The slight increase in the recall rate is because the keyword search was done within the noun phrase, ignoring the adverb. For example, for "Swi5 and Ace2 are cell cycle-regulated transcription factors that activate expression of early G(1)-specific genes in Saccharomyces cerevisiae", our program extracted the relations (Swi5 and Ace2; be; cell cycle-regulated transcription factors) and (Swi5 and Ace2; be-activate; expression of early G(1)-specific genes). A key word search was done in each noun phrase, and only the hypernym GO0007049: cell cycle was assigned, although the ID GO0000114: G1-specific transcription in mitotic cell cycle could be assigned to the Swi5 and Ace2 genes by using all the key words. To raise the recall rate, the score threshold was lowered to 0.75. However, some false positives are attributable to this lower threshold. For example, GO0015919: peroxisomal membrane transport was assigned on the basis of only the keywords "peroxisomal membrane". In this case, assigning the function without considering "transport", an appropriate verb or other nouns caused the false positive.
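The effect of the lowered threshold can be illustrated with a minimal sketch. It assumes a weighted keyword-overlap score, in which each keyword of a GO term carries a weight and the score is the matched weight divided by the total weight. The function name `keyword_score` and the weights are hypothetical and are chosen only for illustration; the scoring function actually used by the system is not reproduced here.

```python
def keyword_score(go_keywords, phrase_words, weights):
    """Weighted fraction of a GO term's keywords found in a noun phrase."""
    total = sum(weights[k] for k in go_keywords)
    matched = sum(weights[k] for k in go_keywords if k in phrase_words)
    return matched / total if total else 0.0


# Hypothetical weights (rarer words weigh more); not taken from the paper.
weights = {"peroxisomal": 3.0, "membrane": 1.5, "transport": 1.0}
go_keywords = ["peroxisomal", "membrane", "transport"]

# Noun phrase that mentions the membrane but not any transport.
phrase_words = {"the", "peroxisomal", "membrane"}

score = keyword_score(go_keywords, phrase_words, weights)
assigned_at_075 = score >= 0.75  # lowered threshold: the term is assigned
assigned_at_090 = score >= 0.90  # stricter threshold: the term is rejected
```

Under these assumed weights, two heavily weighted keywords are enough to pass the 0.75 threshold even though "transport" never appears, which is the kind of false positive described above.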
Function extraction for each organism
We applied our method to all abstracts indexed with each of the following MeSH terms: homo sapiens, mice, rats, drosophila melanogaster, caenorhabditis elegans and saccharomyces cerevisiae. The results are summarized in Table 6. We also extracted the family name–function relationships. Many of the extracted relationships (>80%) were not yet registered in major databases such as LocusLink, RGD, GDB, SGD, Flybase and WormBase. These results are searchable at http://prime.ontology.ims.u-tokyo.ac.jp
When all abstracts were used, the recall rate of gene–function relationships in the previous test set of S.cerevisiae was 32.4% for criteria-1 and 69.2% for criteria-3. Those of H.sapiens were 31.0% for criteria-1 and 70.8% for criteria-3. Since the precision and recall rate at the level of each gene–function relationship (fact level) should differ from those at the abstract level, the precision was recalculated at the fact level and is summarized in Table 7 (family name–function relationships are not included). Here, "fact level" refers to whether the extracted gene–function relationship is correct or not. When multiple evidential sentences are extracted for one gene–function relationship from multiple abstracts, the relationship is regarded as correct, i.e. a fact, if at least one sentence is correct. The difference in precision among organisms in Table 7 is mainly due to the difference in precision of gene/protein name recognition for each organism. The precision at the gene–function relationship level in Table 7 is slightly lower than that at the abstract level in Tables 4 and 5. Erroneously extracted gene–function relationships were supported by only one or two evidential sentences. The precision could therefore be increased to some extent by discarding the gene–function relationships with few evidential sentences.
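The fact-level evaluation and the proposed evidence-count filter can be sketched as follows. This is a minimal illustration, not the evaluation code used for Table 7: the function name `fact_level_precision`, the `min_sentences` parameter and the toy correctness flags are all assumptions made for the example.

```python
from collections import defaultdict


def fact_level_precision(evidence, min_sentences=1):
    """Return (precision, number of facts kept).

    evidence: iterable of (gene, go_id, sentence_is_correct) tuples,
    one per evidential sentence. A gene-function relationship counts
    as correct if at least one of its sentences is correct.
    """
    by_fact = defaultdict(list)
    for gene, go_id, is_correct in evidence:
        by_fact[(gene, go_id)].append(is_correct)
    # Discard relationships supported by too few evidential sentences.
    kept = [flags for flags in by_fact.values() if len(flags) >= min_sentences]
    if not kept:
        return 0.0, 0
    correct = sum(1 for flags in kept if any(flags))
    return correct / len(kept), len(kept)


# Toy data with hypothetical correctness flags, for illustration only.
evidence = [
    ("SWI5", "GO0007049", True),
    ("SWI5", "GO0007049", False),
    ("SWI5", "GO0007049", True),
    ("NTG1", "GO0006281", True),
    ("NTG1", "GO0006281", True),
    ("NTG1", "GO0006221", False),  # single-sentence false positive
    ("ACE2", "GO0007049", True),   # single-sentence true positive
]

precision_all, facts_all = fact_level_precision(evidence)
precision_min2, facts_min2 = fact_level_precision(evidence, min_sentences=2)
```

In this toy data, requiring two evidential sentences removes the single-sentence false positive but also drops a correct single-sentence relationship, so the gain in precision comes at a cost in recall.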
| CONCLUSION |
|---|
We have developed an information extraction system that uses natural language processing techniques to assign GO IDs to each gene/protein/family found in abstracts. In this system, each sentence is shallowly parsed, and the ACTOR–OBJECT relationships are extracted using rule-based sentence structure analysis. When gene names and their functional terms are described in ACTOR–OBJECT relationships with predefined verb or modification relationships, the corresponding GO-IDs are assigned to the gene/protein/family. The gene/protein/family names are quickly recognized by the devised trie, which is constructed based on the GENA gene name dictionary and the family name dictionary so that ambiguous names that do not specify a unique gene can also be extracted. For wide recognition of the gene/protein functions, the functional terms are semi-automatically gathered based on GO using co-occurrence in the same abstract and the collocation similarities of the terms. The terms related to a GO term are mainly gathered using the first method, and the similar-meaning terms are mainly gathered using the second method. Additional hyponyms are gathered using the MeSH hierarchy, and semantic/syntactic variations of the gathered terms are generated using rule-based methods.
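Dictionary lookup with a trie can be sketched as below. This is a generic token-level trie with longest-match lookup, offered only as an illustration under stated assumptions: the helper names `build_trie` and `find_names`, the whitespace tokenization and the three-entry dictionary are hypothetical, and the structure of the trie devised for the GENA and family name dictionaries is not described in this section.

```python
END = object()  # sentinel key marking the end of a dictionary name


def build_trie(names):
    """Build a token-level trie from a list of dictionary names."""
    root = {}
    for name in names:
        node = root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node[END] = name
    return root


def find_names(tokens, trie):
    """Return (start, end, name) for the longest match at each position."""
    hits, i = [], 0
    while i < len(tokens):
        node, j, last = trie, i, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                last = (i, j, node[END])
        if last:
            hits.append(last)
            i = last[1]  # continue after the matched name
        else:
            i += 1
    return hits


# Hypothetical three-entry dictionary and toy sentence.
trie = build_trie(["Swi5", "Ace2", "endonuclease III"])
tokens = "Swi5 and Ace2 are homologs of endonuclease III".split()
hits = find_names(tokens, trie)
```

Each lookup walks the trie once from the current token, so the cost per position depends on the length of the longest candidate name rather than on the size of the dictionary.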
In a preliminary experiment, our system had a recall rate of about 42–49% [criteria-3 in Table 2 and Table 4], with 91–94% precision [criteria-3 in Table 3 and Table 5] for both S.cerevisiae and H.sapiens at the abstract level. Considering only the functions actually described in the test set abstracts, the recall with NLP was even higher [54–64%: criteria-3 in Table 2 and Table 5]. The precision of our method is higher than that of simple co-occurrence of a gene and a functional term in a sentence (<50%), since the ACTOR and OBJECT relationships are considered. Further, when all NCBI abstracts are used, the recall rate increases to about 70%, and the precision drops to about 86–87% for both organisms at the fact level (gene–function relationship level). Although this evaluation allowed superclass identification (instead of detailed class identification), the annotated GO class level seems to be sufficiently useful. Many of the false negatives and superclass recognitions (instead of detailed class recognitions) were due to a lack of biological function terms. Some terms are difficult to find using an automatic process, while others are detectable with our term-finding system, which uses co-occurrence and collocation similarities. Expanding the number of biological functional terms in our system should increase the recall.
Application of this method to the abstracts indexed with each major eukaryote MeSH term resulted in the extraction of over 190 000 non-redundant gene/protein–GO-ID relationships and 150 000 family name–GO-ID relationships for S.cerevisiae, C.elegans, D.melanogaster, M.musculus, R.norvegicus and H.sapiens. Many biological functions that had not been extracted by a major database or consortium were extracted. The results are open to the public in the PRIME database (http://prime.ontology.ims.u-tokyo.ac.jp:8081/).
| Acknowledgments |
|---|
We thank the reviewers for their helpful suggestions and references. This work is supported in part by a grant-in-aid for scientific research on the priority area "genome information science" from the Japanese Ministry of Education, Culture, Sports, Science and Technology.
Received on April 16, 2004; revised on September 21, 2004; accepted on October 5, 2004
| REFERENCES |
|---|
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Blaschke, C. and Valencia, A. (2002) Automatic ontology construction from the literature. Genome Inform., 13, 201–213.
Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann, W., Kersey, P., Mulder, N., Oinn, T., Maslen, J., Cox, A., Apweiler, R. (2003) The gene ontology annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res., 13, 662–672.
Chiang, J.-H. and Yu, H.-C. (2003) MeKE: discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics, 19, 1417–1422.
Collier, N., Nobata, C., Tsujii, J. (2000) Comparison between Tagged Corpora for the Named Entity Task. Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany, pp. 201–207.
Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A. (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17 (Suppl. 1), S74–S82.
Fukuda, K., Tsunoda, T., Tamura, A., Takagi, T. (1998) Toward information extraction: identifying protein names from biological papers. Proceedings of the Pacific Symposium on Biocomputing, pp. 705–716.
Humphrey, K., Demetriou, G., Gaizauskas, R. (2000) Two applications of information extraction to biological science journal articles. Proceedings of the Pacific Symposium on Biocomputing, Enzyme Interact. Protein Struct., Hawaii, USA, pp. 505–516.
Jacquemin, C. and Royaute, J. (1994) Retrieving terms and their variants in a lexicalised unification-based framework. Proceedings of SIGIR, pp. 132–141.
Kim, J.-J. and Park, J.C. (2004) Annotation of gene products in the literature with gene ontology terms using syntactic dependencies. Lect. Notes Artifi. Intell., (in press).
Koike, A. and Takagi, T. (2004) Proceedings of HLT/NAACL BioLINK Workshop, pp. 9–16.
Koike, A., Kobayashi, Y., Takagi, T. (2003) Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res., 13, 1231–1243.
Krallinger, M. and Padron, M.M. (2004) Prediction of GO annotation by combining entity specific sentence sliding window profiles. Proceedings of BioCreAtIvE, Granada, Spain.
Krymolowski, Y., Alex, B., Leidner, J.L. (2004) BioCreative Task 2.1: The Edinburgh–Stanford System. Proceedings of BioCreAtIvE.
Nenadic, G., Rice, S., Spasic, I., Ananiadou, S., Stapley, B. (2003) Selecting text features for gene name classification: from documents to terms. Proceedings of the ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, pp. 121–128.
Raychaudhuri, S., Chang, J., Sutphin, P., Altman, R. (2002) Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res., 12, 203–214.
Rindflesch, T.C., Tanabe, L., Weinstein, J.N., Hunter, L. (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. Proceedings of Pacific Symposium on Bioinformatics, Hawaii, USA, pp. 514–525.
Salton, G., Wong, A., Yang, C.S. (1975) A vector space model for automatic indexing. Commun. ACM, 18, 613–620.
Schug, J., Diskin, S., Mazzarelli, J., Brunk, B.P., Stoeckert, C.J., Jr. (2002) Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res., 12, 648–655.
Singhal, A., Buckley, C., Mitra, M. (1996) Pivoted document length normalization. Proceedings of ACM SIGIR'96, Zurich, Switzerland, pp. 21–29.
Tanabe, L. and Wilbur, W.J. (2002) Tagging gene and protein names in biomedical text. Bioinformatics, 18, 1124–1132.
Yakushiji, A., Tateishi, Y., Miyano, Y., Tsujii, J. (2001) Event extraction from biological papers using a full parser. Proceedings of Pacific Symposium on Bioinformatics, Hawaii, USA, pp. 408–419.
Xie, H., Wasserman, A., Levine, Z., Novik, A., Grebinskiy, V., Shoshan, A., Mintz, L. (2002) Large-scale protein annotation through gene ontology. Genome Res., 12, 785–794.