Bioinformatics Advance Access originally published online on September 29, 2005
Bioinformatics 2005 21(23):4199-4200; doi:10.1093/bioinformatics/bti695
Do you do text?
1Centro Nacional de Biotecnología, CNB-CSIC Cantoblanco, Madrid, Spain
2The MITRE Corporation Bedford, MA, USA
3EBI-EMBL Hinxton Campus, UK
*To whom correspondence should be addressed.
Retrieving information from text has become an important area in bioinformatics, and not too surprisingly, this journal has published more than 30 papers on this topic since the first article published by the journal in 1998. In addition, ISCB (International Society for Computational Biology) (www.iscb.org) has organized special sessions in the ISMB conferences (Intelligent Systems for Molecular Biology, see www.iscb.org/ismb2005) and a specialized interest group (www.pdg.cnb.uam.es/BioLink/) for the last five years. In parallel, major computer science conferences in related areas have begun to include sessions on biology, e.g. the TREC (Text REtrieval Conference) Genomics track (medir.ohsu.edu/genomics), ICML (International Conference on Machine Learning) and a series of workshops organized in association with the ACL (Association for Computational Linguistics) and Human Language Technology meetings.
The requirements of the text mining community are similar to those in other areas of bioinformatics:
- Availability of high quality input information
- A set of objective metrics for the comparison of different methods
- Involvement of biologists, to keep the focus on developing applications that are suitable for end users and biological databases.
Exactly the same issues have been discussed, and partially solved, in the field of protein structure prediction, in part thanks to the organization of CASP (Critical Assessment of techniques for protein Structure Prediction) (predictioncenter.org) during the last 10 years.
With similar evaluation goals in mind, we organized BioCreAtIvE (critical assessment of information extraction in biology), focusing on two tasks. The first dealt with extraction and normalization of gene or protein names from text for three model organism databases (fly, mouse, yeast). The second task addressed issues of extracting functional annotations from text. Overall, 27 groups participated in the assessment (see www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html and www.mitre.org/public/biocreative/).
The results and the assessment were discussed in a meeting sponsored by EMBO (European Molecular Biology Organization), in Granada, Spain (www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/). The results for gene/protein name extraction showed that at least four groups provided systems that were able to extract gene names from sentences of MEDLINE abstracts at over 80% balanced precision and recall. For the subtask of recognizing the relation between names and normalized database identifiers, the results ranged from a maximum of 92% balanced precision and recall for yeast to 79% for mouse. These results indicate that this technology may now be mature enough to be used in production environments (e.g. document retrieval). However, the results for gene/protein names lag behind those obtained for identifying persons and locations for online news (90–95%). The identification of the many other entities of interest in biology (chemical compounds, tissues, diseases, species and others) will involve additional challenges.
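To make the reported figures concrete, the following minimal sketch shows how precision, recall and their balanced combination can be computed for a name-normalization run. It assumes that "balanced precision and recall" refers to the F-measure with equal weight on both terms (the harmonic mean); the function name, document labels and gene identifiers are hypothetical and are not taken from the BioCreAtIvE evaluation software or datasets.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted (document, gene identifier) pairs against a gold standard.

    Returns (precision, recall, F1), where F1 is the harmonic mean of
    precision and recall, i.e. the equally weighted F-measure.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard annotations: which gene identifiers
# are mentioned in which abstract.
gold = {
    ("doc1", "GENE_A"),
    ("doc1", "GENE_B"),
    ("doc2", "GENE_C"),
    ("doc2", "GENE_D"),
}

# Hypothetical system output: three correct pairs and two spurious ones.
predicted = {
    ("doc1", "GENE_A"),
    ("doc1", "GENE_B"),
    ("doc2", "GENE_C"),
    ("doc2", "GENE_X"),
    ("doc2", "GENE_Y"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case the system finds three of the four gold pairs but also reports two spurious ones, so the balanced score falls between the two component scores and is pulled towards the lower of them. A single figure of this kind therefore rewards systems that do well on both measures rather than on only one.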
For the functional annotation task, systems were asked to identify a segment of text as evidence for a GO (Gene Ontology) annotation for a given protein in full text articles. In this case, participants were not given training examples of identified text segments. Annotations and text evidence were reviewed by expert annotators from the GO annotation team (www.ebi.ac.uk/GOA/) for validity. When both the protein name and the GO annotation were given, several systems provided correct evidence for the GO predictions 25–30% of the time. The average performances were lower in a subtask in which the GO codes were not given. Interestingly, two systems provided a higher rate of correct predictions by focusing on high confidence cases. These results indicate that the retrieval of functional information is a challenging problem. We believe, however, that this first BioCreAtIvE assessment has laid the foundation for rapid progress in this area by providing an infrastructure, particularly training and test datasets, which will encourage researchers to test their systems against these datasets. A technical description of the results can be found at www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/handout/, and the full collection of methods and evaluation papers has been recently published (BMC Bioinformatics, 2005; 6 Suppl. 1). Indeed, BioCreAtIvE also delivered the associated collection of annotated data provided by the organizers and the corresponding evaluations of results from the participating groups (www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/results/). This collection will complement other datasets such as the GENIA corpus (http://www-tsujii.is.s.u-tokyo.ac.jp) as a valuable resource for training and testing methods.
Our main conclusions from the meeting are that:
- A number of groups achieved similar levels of performance, using a variety of technologies. These ranged from those based on natural language processing and linguistic analysis to machine learning and computational biology approaches used in protein structure prediction, gene finding and the like.
- An unbiased assessment, based on clear standards and objective evaluation, has provided a more realistic view of the state of the art than has been available to date from more limited evaluations reported in the literature. Setting up the assessment required a considerable effort in the preparation of the datasets, and the evaluation of the results required a considerable effort by human experts. In spite of this, the size and the quality of the available datasets are the main limitation of BioCreAtIvE and other assessment initiatives.
Indeed, a number of groups representing what could be called traditional bioinformatics are experiencing considerable success in the field and we encourage you to consider exploring this new field: have you tried your best in text mining?
Data and additional information are available at www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/results
Acknowledgments
The contributions of Alex Morgan, John Wilbur, Lorrie Tanabe and Vivian Lee were essential for the organization and evaluation of BioCreAtIvE. Essential, as well, was the participation of the 27 groups, and their input to the organization, evaluation and discussion of the results. The datasets for the first task were provided by NCBI (National Center for Biotechnology Information) and MITRE and the datasets for the second by EBI-EMBL (European Bioinformatics Institute–European Molecular Biology Laboratory). The MITRE contributions to BioCreAtIvE were supported in part by NSF (grant EIA-0326404), and those of CNB-CSIC and EBI were supported by the European Commission (grants TEMBLOR QLRT-2001-00015 and ORIEL IST-2001-32688).
Received on May 4, 2005; revised on September 22, 2005; accepted on September 24, 2005