Bioinformatics Advance Access originally published online on June 7, 2005
Bioinformatics 2005 21(16):3450-3451; doi:10.1093/bioinformatics/bti528

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

MineBlast: a literature presentation service supporting protein annotation by data mining of BLAST results

Guido Dieterich *, Uwe Kärst , Jürgen Wehland and Lothar Jänsch

Division of Cell Biology, German Research Centre for Biotechnology (GBF) Mascheroder Weg 1, D-38124 Braunschweig, Germany

*To whom correspondence should be addressed.


    Abstract

Summary: MineBlast is a web service for literature search and presentation based on data-mining results received from UniProt. Users can submit a simple list of protein sequences via a web-based interface. MineBlast performs a BLASTP search in UniProt to identify names and synonyms based on homologous proteins and subsequently queries PubMed, using combined search terms in order to find and present relevant literature.

Availability: http://leger2.gbf.de/cgi-bin/MineBlast.pl

Contact: gdi{at}gbf.de


    1 INTRODUCTION
Comparative genome analyses are an essential part of the process of assigning functions to open reading frames. But even in Escherichia coli, arguably the best studied of all prokaryotic organisms, only 70% of the genes have known functions. To overcome this limitation, functional studies are carried out continuously in many different organisms; they characterize previously unknown gene functions or provide additional functional insights, which are archived first in literature databases and only later in sequence databases such as UniProt (Bairoch et al., 2005). Since comparative genome analyses are mainly based on sequence homology evaluated by tools such as BLAST (Altschul et al., 1997), annotations from completed genome projects may quickly become outdated. Apart from the information stored in UniProt, the most recent information on a gene is usually found in the biomedical literature covered by PubMed (Putnam, 1998).

To retrieve functional annotations and to ensure that the literature relevant to a list of genes is covered as completely as possible, we implemented a two-step query in MineBlast to search for PubMed abstracts. In the first step, we perform a BLASTP query against UniProt to find homologous proteins and extract their gene names and synonyms (since gene names overlap between different species), as well as the literature references within the entries. In the second step, we query PubMed with a combined term of gene names and synonyms and process the literature references found, transforming the data into a clearly laid-out presentation that aids individual researchers in the manual and cost-intensive process of literature evaluation.


    2 IMPLEMENTATION AND FEATURES
MineBlast is accessed by entering FASTA-formatted protein sequences and setting optional parameters (e.g. a publication-year range, a BLAST E-value cut-off or additional search terms). A Perl script using BioPerl modules (Stajich et al., 2002) processes the query, and the user receives an e-mail when the job is finished.
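
As an illustration of the input step, the following Python sketch splits a FASTA-formatted submission into (header, sequence) pairs. It is not the MineBlast code, which is a Perl script using BioPerl; the function name `parse_fasta` and the example records are assumptions made for illustration only.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-sequence submission.
example = """>query1 hypothetical protein
MKKLLS
TAGV
>query2
MSTN
"""
records = parse_fasta(example)
```

Each record can then be submitted to BLASTP individually, so that every query sequence gets its own result file.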

The script performs a BLASTP search against UniProt, filtering hits by an E-value cut-off or a minimum bit score. The UniProt entries of all homologous genes are parsed to extract annotations, gene names and synonyms. Biomedical literature references mentioned in an entry are retrieved with the efetch utility from the NCBI.
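
The filtering and name-extraction steps might look roughly like the Python sketch below. It assumes that BLAST hits are already available as dictionaries with `evalue` and `bits` keys, and that gene names are stored on `GN` lines of the UniProt flat-file format as `Name=...;` and `Synonyms=...;` fields. The function names, default thresholds, hit identifiers and the example entry are all hypothetical, and the sketch applies both cut-offs when both are given, which may differ from what MineBlast does.

```python
import re


def filter_hits(hits, max_evalue=1e-10, min_bits=None):
    """Keep hits with E-value <= max_evalue and, if set, bit score >= min_bits."""
    kept = []
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue
        if min_bits is not None and hit["bits"] < min_bits:
            continue
        kept.append(hit)
    return kept


def gene_names_from_entry(entry_text):
    """Collect Name= and Synonyms= values from the GN lines of a flat-file entry."""
    gn = " ".join(
        line[5:] for line in entry_text.splitlines() if line.startswith("GN   ")
    )
    names = []
    for key in ("Name", "Synonyms"):
        for match in re.finditer(r"\b" + key + r"=([^;]+);", gn):
            for name in match.group(1).split(","):
                name = name.strip()
                if name and name not in names:
                    names.append(name)
    return names


# Hypothetical hits and a hypothetical UniProt entry fragment.
hits = [
    {"id": "P1", "evalue": 1e-30, "bits": 120.0},
    {"id": "P2", "evalue": 0.5, "bits": 30.0},
    {"id": "P3", "evalue": 1e-12, "bits": 45.0},
]
entry = """ID   EXAMPLE_ENTRY
GN   Name=abcA; Synonyms=xyzB, xyzC;
GN   OrderedLocusNames=lmo0000;
"""
kept_ids = [hit["id"] for hit in filter_hits(hits, max_evalue=1e-10, min_bits=50)]
names = gene_names_from_entry(entry)
```

Locus tags such as the `OrderedLocusNames` field are deliberately skipped here, since they rarely appear in abstracts; whether MineBlast uses them is not stated.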

Gene names and synonyms are used to perform a keyword-based query using the PubMed interface to MEDLINE.
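
One plausible way to build such a combined search term is sketched below in Python. The text does not state how names and synonyms are combined, so joining them with OR and appending additional search terms with AND is an assumption, as are the function name `build_pubmed_term` and the example gene names.

```python
def build_pubmed_term(names, extra_terms=()):
    """OR together quoted gene names, then AND each additional search term."""
    unique = []
    seen = set()
    for name in names:
        key = name.lower()  # treat names differing only in case as duplicates
        if key not in seen:
            seen.add(key)
            unique.append(name)
    if not unique:
        raise ValueError("at least one gene name is required")
    term = " OR ".join('"%s"' % name for name in unique)
    if len(unique) > 1:
        term = "(" + term + ")"
    for extra in extra_terms:
        term += " AND " + extra
    return term


# Hypothetical gene name, a case variant of it, and one synonym.
term = build_pubmed_term(["abcA", "ABCA", "xyzB"], extra_terms=["listeria"])
# term is now: ("abcA" OR "xyzB") AND listeria
```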

The resulting file representation is enhanced by the use of popup information boxes. These boxes contain additional information extracted from the annotations of the corresponding UniProt entry. MineBlast also provides direct links to query InterPro (Mulder et al., 2005), STRING (von Mering et al., 2005) and to access the BLAST result file for the query sequence.

Finally, MineBlast recognizes interaction types, using 12 terms, and putative gene names in the abstract. Gene names and synonyms are extracted initially from all UniProt entries.
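
A simple form of this recognition step is whole-word matching, sketched below in Python. The 12 interaction terms used by MineBlast are not listed in the text, so `INTERACTION_TERMS` is a hypothetical stand-in, and the function name, the case-insensitive matching and the example abstract are likewise assumptions.

```python
import re

# Hypothetical stand-in list; the 12 terms used by MineBlast are not given.
INTERACTION_TERMS = ("binds", "activates", "inhibits", "phosphorylates")


def tag_abstract(abstract, gene_names, terms=INTERACTION_TERMS):
    """Return the gene names and interaction terms found as whole words."""

    def found(words):
        hits = []
        for word in words:
            pattern = r"\b" + re.escape(word) + r"\b"
            if re.search(pattern, abstract, re.IGNORECASE):
                hits.append(word)
        return hits

    return {"genes": found(gene_names), "interactions": found(terms)}


# Hypothetical abstract sentence and gene names.
abstract = "We show that AbcA binds XyzB and inhibits its activity."
result = tag_abstract(abstract, ["abcA", "xyzB", "defC"])
```

Matching case-insensitively catches protein-style spellings such as "AbcA" for the gene "abcA", at the cost of more false positives for short names that coincide with ordinary words or abbreviations.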


    3 RESULTS
We tested MineBlast with 74 selected protein sequences that were originally annotated without functional assignment and are thought to be involved in the virulence of Listeria monocytogenes EGD-e, a human pathogen (Glaser et al., 2001).

For 50% (37 proteins) of all submitted sequences, the MineBlast report reveals additional functional information that can be extracted from UniProt and PubMed. For 22 proteins, a highly reliable functional annotation was retrieved from homologues and should be taken into account for the planned reannotation of the listerial genomes. In five of these cases, precise gene names can now be added to the corresponding database entries.

Whereas information about the subcellular localization and functionally relevant domains was mainly obtained directly from UniProt (MineBlast, Step 1), the PubMed search, which considers automatically extracted synonyms and names, revealed that eight homologues were already the subject of individual studies and are cited either in the title or in the abstract of the retrieved literature (MineBlast, Step 2). The result table can be found at http://leger2.gbf.de/Table1_MineBlast.html


    4 DISCUSSION AND CONCLUSION
We have used BLAST to find entries in UniProt related by sequence similarity. The retrieved gene names and synonyms allow a broad query in PubMed, extracting the most recent state of biological information.

We have demonstrated with a sample analysis that MineBlast provides valuable information: it revealed additional functional information for half of the 74 test sequences. The additional functions assigned by MineBlast have to be considered in new and ongoing projects on L. monocytogenes, a model organism for infection research.

Different approaches exist to address the problem of text mining in biology. The major interest of these existing systems is the identification of relevant biological entities (genes, proteins, etc.) in text, thus enabling both the extraction of relationships between entities and pathway discovery from the literature (Becker et al., 2003; Jenssen et al., 2001; Mika and Rost, 2004; Perez-Iratxeta et al., 2003; Tanabe et al., 1999). MineBlast presents information that supports the functional annotation of genes. This is a further step towards the automation of the whole process and the standardization of genome projects, and it can also assist in the generation of experimental designs.

Specifying an Entrez Date range over a fixed period makes it possible to search only for the newest literature in PubMed. Restricting a query to a specific genus often also improves accuracy. For example, querying PubMed for ‘fmtc’ as a gene name in the Staphylococcus genus (query term: ‘fmtc AND staphylococcus’) avoids hits in which ‘fmtc’ is an abbreviation for ‘familial medullary thyroid carcinoma’.
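
The genus restriction and the Entrez Date range can be expressed as parameters of an NCBI E-utilities `esearch` request, as in the Python sketch below. The sketch only builds the request URL and does not contact the NCBI. The function name `esearch_url`, the `retmax` value and the example dates are assumptions, and it is not stated in the text that MineBlast issues its queries in exactly this form.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(gene, genus=None, mindate=None, maxdate=None, retmax=20):
    """Build an esearch URL for PubMed, optionally limited by genus and date."""
    term = gene if genus is None else "%s AND %s" % (gene, genus)
    params = {"db": "pubmed", "term": term, "retmax": str(retmax)}
    if mindate and maxdate:
        # datetype=edat selects the Entrez Date; dates use YYYY/MM/DD.
        params.update({"datetype": "edat", "mindate": mindate, "maxdate": maxdate})
    return ESEARCH + "?" + urlencode(params)


# The 'fmtc' example from the text, with an illustrative date range.
url = esearch_url(
    "fmtc", genus="staphylococcus", mindate="2004/01/01", maxdate="2005/04/05"
)
query = parse_qs(urlparse(url).query)  # decoded parameters, for inspection
```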

Since MineBlast includes data from UniProt as well as from PubMed, it can also help to shorten the delay between the first published evidence of a protein function and its assignment to homologous proteins from other organisms. This is of great interest both to researchers continuously involved in genome annotation and to those who have to keep knowledge bases up to date. Moreover, MineBlast supports the fast and reliable interpretation of results from proteome and transcriptome studies, which very often suffer from incomplete or ‘hidden’ functional descriptions.


    Acknowledgments
 
We are very grateful to Victor Wray for critical proof-reading of the manuscript. This work was funded by the German Bundesministerium für Bildung und Forschung (BMBF) ‘Verbundvorhaben: Intergenomics–Bioinformatische Modellierung der Wechselwirkung von Genomen’ (031U110A/031U210A).

Conflict of Interest: none declared.

Received on April 5, 2005; revised on June 3, 2005; accepted on June 5, 2005

    REFERENCES

    Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Bairoch, A., et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.

    Becker, K.G., et al. (2003) PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics, 4, 61.

    Glaser, P., et al. (2001) Comparative genomics of Listeria species. Science, 294, 849–852.

    Jenssen, T.K., et al. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet., 28, 21–28.

    Mika, S. and Rost, B. (2004) NLProt: extracting protein names and sequences from papers. Nucleic Acids Res., 32, W634–W637.

    Mulder, N.J., et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res., 33, D201–D205.

    Perez-Iratxeta, C., et al. (2003) Update on XplorMed: a web server for exploring scientific literature. Nucleic Acids Res., 31, 3866–3868.

    Putnam, N.C. (1998) Searching MEDLINE free on the Internet using the National Library of Medicine's PubMed. Clin. Excell. Nurse Pract., 2, 314–316.

    Stajich, J.E., et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res., 12, 1611–1618.

    Tanabe, L., et al. (1999) MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques, 27, 1210–1217.

    von Mering, C., et al. (2005) STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res., 33, D433–D437.





