Bioinformatics Advance Access published online on September 13, 2005
Bioinformatics, doi:10.1093/bioinformatics/bti657
1 Advanced Center for Genome Technology, Stephenson Research and Technology Center, Department of Botany and Microbiology, The University of Oklahoma, 101 David L. Boren Blvd. Rm 2025, Norman, Oklahoma 73019
* To whom correspondence should be addressed.
Motivation: Short sequence patterns frequently define regions of biological interest (e.g. binding sites, immune epitopes and primers), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe a system to accurately identify and classify sequence patterns within large corpora using an n-gram Markov model (MM).
Results: As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures, such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter), was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million MEDLINE abstracts and 9,000 full-text articles from J Virology. Performance was benchmarked by comparing the results with J Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98% ± 2% precision and 84% recall for primer identification and classification, and 67% ± 6% precision and 85% recall for peptide epitopes. We also found a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation.
Availability: The MM routine and datasets are available upon request.
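To illustrate the general idea of classifying text tokens with an n-gram Markov model, the following is a minimal sketch and not the authors' routine. The abstract does not state the model order, the smoothing scheme or the training data, so the bigram order (`n=2`), the add-one smoothing, the class labels and the toy training strings are all assumptions made for illustration. The names `train`, `log_likelihood`, `classify`, `TRAINING` and `MODELS` are hypothetical.

```python
import math
import string
from collections import Counter, defaultdict

ALPHABET = string.ascii_uppercase  # assumed shared symbol set for all classes
START = "^"                        # padding symbol marking the start of a token


def train(examples, n=2):
    """Count n-grams: map each (n-1)-character context to a Counter of next characters."""
    counts = defaultdict(Counter)
    for token in examples:
        padded = START * (n - 1) + token.upper()
        for i in range(n - 1, len(padded)):
            counts[padded[i - n + 1:i]][padded[i]] += 1
    return dict(counts)


def log_likelihood(token, counts, n=2):
    """Log-probability of a token under one model, using add-one smoothing (an assumption)."""
    vocab = len(ALPHABET) + 1  # one extra slot for any out-of-alphabet character
    padded = START * (n - 1) + token.upper()
    total = 0.0
    for i in range(n - 1, len(padded)):
        seen = counts.get(padded[i - n + 1:i], Counter())
        total += math.log((seen[padded[i]] + 1) / (sum(seen.values()) + vocab))
    return total


def classify(token, models, n=2):
    """Return the label of the model that assigns the token the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(token, models[label], n))


# Toy training strings chosen for illustration only; not the authors' data.
TRAINING = {
    "dna": ["ATGCGTACGTTAGC", "GGATCCGAATTCAAGCTT", "TTGACAGCTAGCTCAGTCCT"],
    "peptide": ["SIINFEKL", "GILGFVFTL", "NLVPMVATV", "KRWIILGLNK"],
    "text": ["PROTEIN", "BINDING", "SEQUENCE", "PRIMER", "ANALYSIS", "RESULTS"],
}
MODELS = {label: train(examples) for label, examples in TRAINING.items()}

if __name__ == "__main__":
    for token in ["ACGTTGCA", "GILGFVFTV", "BINDING"]:
        print(token, classify(token, MODELS))
```

Each class gets its own transition counts, and a token is assigned to whichever class model makes it most probable. Because every model scores the same token over the same assumed alphabet, the log-likelihoods are directly comparable without length normalization. In this toy setting, nucleotide strings are the easiest to separate because they use only four symbols, while 1-letter peptide strings share most of their alphabet with ordinary words.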
Received March 1, 2005
Revised May 23, 2005
Accepted September 1, 2005
Markov model recognition and classification of DNA/protein sequences within large text databases
2 Department of Microbiology and Immunology, The University of Oklahoma Health Sciences Center, Oklahoma
3 Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, Oklahoma
Jonathan D. Wren, E-mail: Jonathan.Wren{at}OU.edu
