Bioinformatics Advance Access originally published online on September 13, 2005
Bioinformatics 2005 21(21):4046-4053; doi:10.1093/bioinformatics/bti657
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

Markov model recognition and classification of DNA/protein sequences within large text databases

Jonathan D. Wren 1,*, William H. Hildebrand 2, Sreedevi Chandrasekaran 2 and Ulrich Melcher 3

1 Advanced Center for Genome Technology, Stephenson Research and Technology Center, Department of Botany and Microbiology, The University of Oklahoma, 101 David L. Boren Blvd., Rm 2025, Norman, OK 73019, USA
2 Department of Microbiology and Immunology, The University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
3 Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK, USA

*To whom correspondence should be addressed.

Motivation: Short sequence patterns frequently define regions of biological interest (binding sites, immune epitopes, primers, etc.), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe herein a system to accurately identify and classify sequence patterns from within large corpora using an n-gram Markov model (MM).
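
As a rough illustration of how an n-gram Markov model can separate sequence-like tokens from ordinary words, the sketch below trains one character-level model per class and assigns a token to the class whose model gives it the highest average log-likelihood. It is not the authors' MM routine: the class names, the toy training strings, the model order, the add-one smoothing and the fixed alphabet size are all assumptions made for this example.

```python
import math
from collections import defaultdict

# Assumption: tokens are purely alphabetic, so every model smooths over the
# same 27 possible next symbols (A-Z plus the end marker "$"). Sharing one
# alphabet size keeps scores comparable across models.
ALPHABET_SIZE = 27


class NGramMarkovModel:
    """Character-level Markov model of a given order with add-one smoothing."""

    def __init__(self, order=2):
        self.order = order
        # counts[context][next_symbol] -> number of times seen in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def _pad(self, text):
        # "^" marks the start of a token, "$" marks its end.
        return "^" * self.order + text.upper() + "$"

    def train(self, examples):
        for text in examples:
            padded = self._pad(text)
            for i in range(self.order, len(padded)):
                context = padded[i - self.order:i]
                self.counts[context][padded[i]] += 1

    def avg_log_likelihood(self, text):
        """Mean log-probability per symbol, so long tokens are not penalized."""
        padded = self._pad(text)
        total = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.counts.get(context, {})
            numerator = seen.get(padded[i], 0) + 1
            denominator = sum(seen.values()) + ALPHABET_SIZE
            total += math.log(numerator / denominator)
        return total / (len(padded) - self.order)


def classify(token, models):
    """Return the label of the model that scores the token highest."""
    scores = {label: m.avg_log_likelihood(token) for label, m in models.items()}
    return max(scores, key=scores.get)


# Hypothetical toy training data; a real system would train on large corpora.
training = {
    "nucleotide": ["ATGCGTACGTTAGC", "GGATCCGAATTCAAGCTT",
                   "TTGACAGCTAGCTCAGTCC", "CAGTCAGTCGATCGATGCA"],
    "english": ["binding", "primer", "sequence", "epitope",
                "protein", "analysis", "results", "database"],
}

models = {}
for label, examples in training.items():
    model = NGramMarkovModel(order=2)
    model.train(examples)
    models[label] = model

for token in ["GATTACAGATTACA", "protease"]:
    print(token, "->", classify(token, models))
```

With training sets this small the margins between classes are narrow; the point is only the mechanism of comparing per-class likelihoods.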

Results: As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter) was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million Medline abstracts and 9000 full-text articles from Journal of Virology. Performance was benchmarked by comparing the results with Journal of Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98 ± 2% precision/84% recall for primer identification and classification and 67 ± 6% precision/85% recall for peptide epitopes. We also find a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation.
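
For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows how precision and recall can be computed when extracted sequences are compared against a curated reference set. The function name and the example sequences are hypothetical, and the abstract does not describe how the authors derived their own estimates or the ± intervals, so this shows only the standard definitions.

```python
def precision_recall(extracted, curated):
    """Compare extracted items against a curated reference set.

    precision = fraction of extracted items that are in the reference set
    recall    = fraction of reference items that were extracted
    """
    extracted, curated = set(extracted), set(curated)
    true_positives = extracted & curated
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical data: four strings pulled from text, five in the curated set.
extracted_primers = {"ATGCGT", "GGATCC", "TTGACA", "NOTDNA"}
curated_primers = {"ATGCGT", "GGATCC", "TTGACA", "CAGTCA", "GAATTC"}

precision, recall = precision_recall(extracted_primers, curated_primers)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that exact set membership is a simplification: entries that a curated database has not yet recorded would be counted as false positives here even if they are genuine sequences.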

Availability: MM routine and datasets are available upon request.

Contact: Jonathan.Wren{at}OU.edu


Received on March 1, 2005; revised on May 23, 2005; accepted on September 1, 2005



This article has been cited by other articles:


S. Van Vooren, B. Thienpont, B. Menten, F. Speleman, B. D. Moor, J. Vermeesch, and Y. Moreau
Mapping biomedical concepts onto the human genome by mining literature on chromosomal aberrations
Nucleic Acids Res., April 3, 2007; 35(8): 2533 - 2543.


