Bioinformatics Advance Access originally published online on August 25, 2005
Bioinformatics 2005 21(22):4107-4115; doi:10.1093/bioinformatics/bti629

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

Calibrating E-values for hidden Markov models using reverse-sequence null models

Kevin Karplus 1,*, Rachel Karchin 2, George Shackelford 1 and Richard Hughey 1

1Department of Biomolecular Engineering, University of California Santa Cruz, CA 95064, USA
2Department of Biopharmaceutical Sciences, University of California San Francisco, CA, USA

*To whom correspondence should be addressed.

Motivation: Hidden Markov models (HMMs) calculate the probability that a sequence was generated by a given model. Log-odds scoring provides a context for evaluating this probability, by considering it in relation to a null hypothesis. We have found that using a reverse-sequence null model effectively removes biases owing to sequence length and composition and reduces the number of false positives in a database search.
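The reverse-sequence null idea can be illustrated with a minimal sketch. It assumes a toy first-order Markov chain over a two-letter alphabet as a stand-in for a profile HMM; the names `INIT`, `TRANS`, `log_prob` and `reverse_null_score` are hypothetical and are not part of SAM. The score is the log-probability of the sequence minus the log-probability of the same sequence reversed. Because the reversed sequence has the same length and composition, those contributions cancel.

```python
import math

# Toy direction-sensitive model (an assumption for illustration only):
# a first-order Markov chain over the alphabet {H, P}.
INIT = {"H": 0.5, "P": 0.5}
TRANS = {"H": {"H": 0.2, "P": 0.8}, "P": {"H": 0.4, "P": 0.6}}

def log_prob(seq, init=INIT, trans=TRANS):
    """Natural-log probability of a non-empty sequence under the toy chain."""
    lp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp

def reverse_null_score(seq, init=INIT, trans=TRANS):
    """Log-odds of seq against the null hypothesis 'seq read backwards'."""
    return log_prob(seq, init, trans) - log_prob(seq[::-1], init, trans)
```

Two properties follow directly from the definition: a palindromic sequence scores exactly zero, and reversing a sequence negates its score, so the score distribution over a database closed under reversal is symmetric about zero.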

Any scoring system is an arbitrary measure of the quality of database matches. Significance estimates of scores are essential, because they eliminate model- and method-dependent scaling factors, and because they quantify the importance of each match. Accurate computation of the significance of reverse-sequence null model scores presents a problem, because the scores do not fit the extreme-value (Gumbel) distribution commonly used to estimate HMM scores' significance.
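As a point of reference for the Gumbel-based significance estimates mentioned above, here is a minimal sketch of turning a raw score into an E-value. It assumes the convention that higher scores are better and the maximum-type Gumbel CDF F(s) = exp(-exp(-lam * (s - mu))); the function names and the parameters `mu`, `lam` and `db_size` are illustrative, not taken from SAM.

```python
import math

def gumbel_pvalue(score, mu, lam):
    """P(S >= score) under a maximum-type Gumbel with location mu, rate lam."""
    z = -lam * (score - mu)
    if z > 700.0:
        # exp(z) would overflow; the tail probability is 1 to machine precision.
        return 1.0
    # 1 - exp(-exp(z)), computed with expm1 to stay accurate for tiny p-values.
    return -math.expm1(-math.exp(z))

def e_value(score, mu, lam, db_size):
    """Expected number of database sequences scoring at least this well by chance."""
    return db_size * gumbel_pvalue(score, mu, lam)
```

The E-value is simply the p-value scaled by the number of sequences searched, which is what makes it comparable across models and scoring methods.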

Results: To get a better estimate of the significance of reverse-sequence null model scores, we derive a theoretical distribution based on the assumption of a Gumbel distribution for raw HMM scores and compare estimates based on this and other distribution families. We derive estimation methods for the parameters of the distributions based on maximum likelihood and on moment matching (least-squares fit for Student's t-distribution).
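To make the moment-matching idea concrete, the sketch below fits the plain Gumbel family, which is the abstract's starting assumption for raw HMM scores. It does not reproduce the paper's estimators for the reverse-null distributions. It relies on two standard facts about the maximum-type Gumbel distribution: the mean is mu + gamma / lam (gamma being the Euler-Mascheroni constant) and the variance is pi^2 / (6 * lam^2). The function name `fit_gumbel_moments` is hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel_moments(scores):
    """Estimate Gumbel (mu, lam) by matching the sample mean and variance.

    Requires at least two distinct scores so that the variance is positive.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n  # population variance
    lam = math.pi / math.sqrt(6.0 * var)            # from var = pi^2 / (6 lam^2)
    mu = mean - EULER_GAMMA / lam                   # from mean = mu + gamma / lam
    return mu, lam
```

A maximum-likelihood fit of the same family has no closed form and needs an iterative solver, which is one practical reason moment matching is attractive as a first estimate.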

We evaluate the modeled score distributions on two criteria: how well they fit the tail of the observed distribution for data not used in the fitting, and how the improved E-values affect our HMM-based fold-recognition methods.

The theoretical distribution provides some improvement in fitting the tail and in providing fewer false positives in the fold-recognition test. An ad hoc distribution based on assuming a stretched exponential tail does an even better job. The use of Student's t to model the distribution fits well in the middle of the distribution, but provides too heavy a tail. The moment-matching methods fit the tails better than maximum-likelihood methods.

Availability: Information on obtaining the SAM program suite (free for academic use), as well as a server interface, is available at http://www.soe.ucsc.edu/research/compbio/sam.html and the open-source random sequence generator with varying compositional biases is available at http://www.soe.ucsc.edu/research/compbio/gen_sequence

Contact: karplus{at}soe.ucsc.edu


Received on February 11, 2004; revised on August 10, 2005; accepted on August 12, 2005





