Bioinformatics Advance Access originally published online on October 28, 2004
Bioinformatics 2005 21(6):829-831; doi:10.1093/bioinformatics/bti106
GOChase: correcting errors from Gene Ontology-based annotations for gene products
1 Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul 110-799, Korea
2 Human Genome Research Institute, Seoul National University College of Medicine, Seoul 110-799, Korea
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: The Gene Ontology (GO) is a controlled biological vocabulary that provides three structured networks of terms to describe biological processes, cellular components and molecular functions. Many databases of gene products are annotated using the GO vocabularies. We found that some GO-updating operations are not easily traceable by the current biological databases and GO browsers. Consequently, numerous annotation errors arise and are propagated throughout biological databases and GO-based high-level analyses. GOChase is a set of web-based utilities to detect and correct the errors in GO-based annotations.
Availability: http://www.snubi.org/software/GOChase/
Contact: juhan{at}snu.ac.kr
| 1 INTRODUCTION |
|---|
The Gene Ontology (GO) provides a structured, controlled biological vocabulary for describing genes and gene products in terms of their associated biological processes, cellular components and molecular functions (Ashburner et al., 2000). As more and more biological databases use GO terms to annotate their gene products, and as more high-level methods for analyzing GO annotations are developed (Dennis et al., 2003; Doniger et al., 2003; Zeeberg et al., 2003), it is essential to provide a mechanism that protects GO-based annotations against inconsistencies, errors and error propagation.
The structural foundation of GO is formally a directed acyclic graph (DAG) wherein the terms are equivalent to nodes and the relationships to the edges of the graph (Aho et al., 1983). The GO consortium provides DAG-Edit for editing GO. Monthly reports (http://www.geneontology.org/MonthlyReports/) are generated by a set of Perl scripts to describe what has happened to the ontologies each month. They report six different types of change that may have happened to a term: new terms, new obsoletions, term name changes, new definitions, new term merges and term movements.
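As a minimal sketch of this structure (not code from the GO consortium or from DAG-Edit), a small ontology can be stored as a mapping from each term to its direct parents, and the edges can be followed upward to collect all ancestors of a term. The term identifiers and the `PARENTS` table below are invented placeholders, not real GO IDs.

```python
# Toy ontology: each term maps to the set of its direct parents.
# All identifiers are hypothetical placeholders, not real GO IDs.
PARENTS = {
    "T:root": set(),
    "T:metabolism": {"T:root"},
    "T:biosynthesis": {"T:metabolism"},
    "T:lipid_metabolism": {"T:metabolism"},
    # A DAG, unlike a tree, allows a term to have more than one parent.
    "T:lipid_biosynthesis": {"T:biosynthesis", "T:lipid_metabolism"},
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def is_acyclic(parents=PARENTS):
    """The graph is acyclic if no term appears among its own ancestors."""
    return all(term not in ancestors(term, parents) for term in parents)
```

Because a term may have several parents, a term movement can add or remove one parent edge without affecting the others, which is one reason such changes are harder to track than in a strict tree.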
We found that two of these operations, new obsoletions and new term merges, are not easily traceable by the current biological databases and GO browsers and hence cause errors in GO-based annotations. These errors may have already created systematic errors in biological databases, GO browsers and GO-based high-level data analyses. Table 1 shows the number of gene products annotated to invalid GO terms (i.e. merged and obsolete) in various databases. It seems evident that the errors are widespread. For a fair comparison, we tested each database version against the latest GO version that could have been used when that database version was annotated.
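The kind of check behind such a count can be sketched as follows. This is an illustration only, not the GOChase implementation: the function name, the tuple layout of the annotations and every identifier in the sample data are assumptions made for the example.

```python
def count_invalid_annotations(annotations, obsolete_ids, merged_ids):
    """Count distinct gene products annotated to obsolete or merged terms.

    annotations: iterable of (gene_product, go_id) pairs.
    obsolete_ids, merged_ids: sets of GO IDs that are no longer valid.
    """
    flagged = {"obsolete": set(), "merged": set()}
    for gene_product, go_id in annotations:
        if go_id in obsolete_ids:
            flagged["obsolete"].add(gene_product)
        elif go_id in merged_ids:
            flagged["merged"].add(gene_product)
    return {kind: len(products) for kind, products in flagged.items()}


# Hypothetical sample data; the GO IDs are made up for illustration.
SAMPLE_ANNOTATIONS = [
    ("geneA", "GO:9900001"),
    ("geneA", "GO:9900002"),
    ("geneB", "GO:9900002"),
    ("geneC", "GO:9900003"),
    ("geneB", "GO:9900004"),
]
SAMPLE_OBSOLETE = {"GO:9900001"}
SAMPLE_MERGED = {"GO:9900002", "GO:9900004"}

if __name__ == "__main__":
    print(count_invalid_annotations(SAMPLE_ANNOTATIONS, SAMPLE_OBSOLETE, SAMPLE_MERGED))
```

The sets of obsolete and merged IDs must be taken from the GO version against which the database is being tested, otherwise terms invalidated later would be counted as errors.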
| 2 PROGRAM OVERVIEW |
|---|
Different databases and methods use different GO versions. Without an error-proof mechanism, it is non-trivial to correct the widespread errors and error propagations. Although powerful ontology-management tools are available (Klein and Kiryakov, 2002; Noy and Musen, 2002), these general-purpose tools use heuristic algorithms that do not guarantee 100% exact matches. For the purpose of illustration, we applied PromptDiff (Noy and Musen, 2002) to compare the January and February 2004 versions of GO. PromptDiff handled most types of GO-updating operation with better than 95% accuracy. It missed one new term (87 out of 88 detected) and one new term merge (3 out of 4), produced three false positives for term name change (239 calls for 236 true positives) and matched all new obsoletions (60 out of 60). For the 201 term movement operations, however, only 24 out of the 246 PromptDiff predictions were correct.
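The figures quoted above can be turned into rates with a few lines of arithmetic. The sketch below uses only the counts stated in the text; where the text gives no number of calls or no number of true operations, the value is left as `None` rather than guessed. The names `COUNTS`, `ratio` and `summarize` are ours.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None if either count is not reported."""
    if numerator is None or denominator is None:
        return None
    return numerator / denominator


# operation -> (correct detections, PromptDiff calls, true operations)
# None marks a count that the text does not state.
COUNTS = {
    "new term": (87, None, 88),
    "new term merge": (3, None, 4),
    "term name change": (236, 239, None),
    "new obsoletion": (60, None, 60),
    "term movement": (24, 246, 201),
}


def summarize(counts=COUNTS):
    """Map each operation to (precision, recall); unknown rates stay None."""
    return {
        op: (ratio(correct, calls), ratio(correct, truth))
        for op, (correct, calls, truth) in counts.items()
    }


def fmt(rate):
    """Format a rate as a percentage, or 'n/a' when it is unknown."""
    return "n/a" if rate is None else f"{rate:.1%}"


if __name__ == "__main__":
    for op, (precision, recall) in summarize().items():
        print(f"{op:18s} precision={fmt(precision):>6s} recall={fmt(recall):>6s}")
```

For term movement this gives a precision of 24/246 (about 9.8%) and a recall of 24/201 (about 11.9%), compared with a recall of 87/88 (about 98.9%) for new terms.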
In contrast, the monthly report generated by the Perl scripts captures all GO-update operations applied to the previous version. Therefore, if we integrate the GO-update information from all monthly reports in sequence, the result can serve as the gold standard for GO versioning information. GOChase is a set of web-based utilities available at http://www.snubi.org/software/GOChase/ to detect and correct possible errors in GO-based annotations. GOChase integrates all monthly reports with major biological databases containing GO annotations (Table 1) and parses them into relational tables, which are then integrated into the GO DB schema (http://www.godatabase.org/dev/sql/doc/godb-sql-doc.html).
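The idea of integrating the reports in sequence can be sketched as follows. This is not the GOChase parser or its relational schema: we assume that each monthly report has already been parsed into `(go_id, change_type, detail)` tuples, and the record layout, function name and sample identifiers are hypothetical.

```python
from collections import defaultdict


def build_history(monthly_reports):
    """Replay reports in chronological order and index each operation by GO ID.

    monthly_reports: iterable of (month, operations), where month is a sortable
    'YYYY-MM' string and operations is a list of (go_id, change_type, detail).
    """
    history = defaultdict(list)
    for month, operations in sorted(monthly_reports, key=lambda report: report[0]):
        for go_id, change_type, detail in operations:
            history[go_id].append((month, change_type, detail))
    return dict(history)


# Hypothetical parsed reports, deliberately given out of order.
SAMPLE_REPORTS = [
    ("2004-02", [("GO:9900001", "new obsoletion", None)]),
    ("2004-01", [
        ("GO:9900001", "term name change", "old name -> new name"),
        ("GO:9900002", "new term merge", "GO:9900003"),
    ]),
]

if __name__ == "__main__":
    for go_id, events in build_history(SAMPLE_REPORTS).items():
        print(go_id, events)
```

Sorting by month before replaying matters: the meaning of an operation such as a merge depends on the state left by all earlier reports.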
GOChase provides four web-based interfaces. (1) GOChase-History resolves the whole evolution history of a GO ID. For example, the GO term GO:0006489 (dolichyl-diphosphate biosynthesis) has swung back and forth among seven GO terms (i.e. metabolism, catabolism, biosynthesis, lipid metabolism, protein biosynthesis, protein metabolism and protein modification) through 16 GO operations in six updates between March 2001 and August 2003. (2) GOChase-Correct highlights a merged term and redirects it to the target term into which it has been merged. For a discarded (or obsolete) term, the GO consortium provides suggested alternative terms, chosen by a curator, in the comments field of the obsolete term (http://www.geneontology.org/ontology/gene_ontology.obo). As of May 2004, there were 805 suggested alternative terms for 871 obsolete terms. For an obsolete term, GOChase recommends the nearest non-discarded parent term as well as the alternative terms whenever available. Databases that create GO annotations may find this feature useful for fixing broken hyperlinks to merged and obsolete terms. (3) A whole database such as LocusLink can be submitted to GOChase in a flat-file format; the annotation errors are then reported together with the GOChase corrections. (4) Given a GO ID, GOChase resolves all gene products annotated with that ID across all the databases in Table 1. Likewise, one can resolve the GO annotations for each gene product.
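The correction logic described under (2) can be sketched roughly as below. This is our illustration under stated assumptions, not the GOChase code: we assume a `merged_into` map from each merged ID to its target, a set of obsolete IDs, a map of curator-suggested alternatives, and a `parents` map that still records the last known parents of obsolete terms. All function names and identifiers are hypothetical.

```python
def resolve_merged(go_id, merged_into):
    """Follow merge redirects until reaching a term that was not merged further."""
    seen = set()
    while go_id in merged_into:
        if go_id in seen:
            raise ValueError(f"merge cycle detected at {go_id}")
        seen.add(go_id)
        go_id = merged_into[go_id]
    return go_id


def nearest_valid_parents(go_id, parents, obsolete):
    """Search upward level by level for the closest non-obsolete ancestors."""
    frontier = set(parents.get(go_id, ()))
    visited = set()
    while frontier:
        valid = {term for term in frontier if term not in obsolete}
        if valid:
            return valid
        visited |= frontier
        frontier = {p for term in frontier for p in parents.get(term, ())} - visited
    return set()


def correct(go_id, merged_into, obsolete, alternatives, parents):
    """Classify a GO ID as valid, merged or obsolete and suggest a correction."""
    target = resolve_merged(go_id, merged_into)
    if target in obsolete:
        return {
            "status": "obsolete",
            "alternatives": alternatives.get(target, []),
            "nearest_parents": sorted(nearest_valid_parents(target, parents, obsolete)),
        }
    if target != go_id:
        return {"status": "merged", "target": target}
    return {"status": "valid", "target": go_id}


# Hypothetical sample data; the identifiers are placeholders, not real GO IDs.
MERGED_INTO = {"X:1": "X:2", "X:2": "X:3"}
OBSOLETE = {"X:5", "X:6"}
ALTERNATIVES = {"X:5": ["X:7"]}
PARENTS = {"X:5": ["X:6"], "X:6": ["X:8"], "X:8": []}
```

With this sample data, `correct("X:1", MERGED_INTO, OBSOLETE, ALTERNATIVES, PARENTS)` follows two successive merges to `X:3`, while `X:5` is reported as obsolete with the suggested alternative `X:7` and the nearest non-obsolete ancestor `X:8`.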
The annotation errors, i.e. annotations to merged and obsolete GO terms, may exist in databases simply because of a time lag, as many databases update their annotations only periodically. We learned, however, that certain GO-update processes should be carefully traced to prevent error propagation. An error-conscious mechanism can help GO-based high-level analysis tools, such as those that cluster microarray data using GO annotations. Functionalities such as showing the evolution history and redirecting to the correct target term may benefit GO browsers. When a database containing GO annotations is updated, it should be checked for inconsistencies and errors against the latest version of GO, a task with which GOChase can help. Otherwise, the errors may be propagated to secondary users.
| Acknowledgments |
|---|
This study was supported by a grant from Korea Health 21 R&D Project, Ministry of Health & Welfare, Republic of Korea (0405-BC0206040004).
Received on April 11, 2004; revised on July 10, 2004; accepted on September 14, 2004
| REFERENCES |
|---|
Aho, A.V., Hopcroft, J.E., Ullman, J.D. (1983) Directed graphs. Data Structures and Algorithms, Reading, MA: Addison-Wesley, pp. 219-221.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25-29.
Dennis, G., Jr, Sherman, B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., Lempicki, R.A. (2003) DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol., 4, R60.
Doniger, S.W., Salomonis, N., Dahlquist, K.D., Vranizan, K., Lawlor, S.C., Conklin, B.R. (2003) MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol., 4, R7.
Klein, M., Fensel, D., Kiryakov, D., Ognyanov, D. (2002) Ontology versioning and change detection on the web. Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02), Sigüenza, Spain.
Noy, N.F. and Musen, M.A. (2002) PromptDiff: a fixed point algorithm for comparing ontology versions. Proceedings of the 18th National Conference on Artificial Intelligence (AAAI-2002), Edmonton, Canada.
Zeeberg, B.R., Feng, W., Wang, G., Wang, M.D., Fojo, A.T., Sunshine, M., Narasimhan, S., Kane, D.W., Reinhold, W.C., Lababidi, S., et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol., 4, R28.

