Bioinformatics Advance Access originally published online on August 16, 2005
Bioinformatics 2005 21(20):3824-3831; doi:10.1093/bioinformatics/bti627
Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap
1Department of Bioengineering, University of California Berkeley, CA 94720, USA
2Department of Plant and Microbial Biology, University of California Berkeley, CA 94720, USA
*To whom correspondence should be addressed at Department of Plant and Microbial Biology, 111 Koshland Hall #3102, University of California, Berkeley, CA 94720-3102, USA
Motivation: Protein sequence comparison methods are routinely used to infer the intricate network of evolutionary relationships found within the rapidly growing library of protein sequences, and thereby to predict the structure and function of uncharacterized proteins. In the present study, we detail an improved statistical benchmark of pairwise protein sequence comparison algorithms. We use bootstrap resampling techniques to determine standard statistical errors and to estimate the confidence of our conclusions. We show that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased. Consequently, the standard bootstrap underpredicts average performance when used in the context of evaluating sequence comparison methods. We have developed, as an alternative, an unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap.
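The operational similarity between the two resampling schemes can be illustrated with a minimal sketch. This is not the authors' implementation, and it does not reproduce their benchmark: the per-item `scores` list, the function names, and the use of a simple mean as the statistic are all assumptions made for illustration. The standard bootstrap resamples items with replacement, so some items receive a count of zero in a given replicate, whereas the Bayesian bootstrap assigns every item a continuous positive weight drawn from a flat Dirichlet distribution.

```python
import random


def standard_bootstrap_mean(scores, rng):
    """One replicate of the standard bootstrap: resample n items
    with replacement and average them."""
    n = len(scores)
    sample = [scores[rng.randrange(n)] for _ in range(n)]
    return sum(sample) / n


def bayesian_bootstrap_mean(scores, rng):
    """One replicate of the Bayesian bootstrap: draw Dirichlet(1, ..., 1)
    weights and compute the weighted mean."""
    # Independent Exponential(1) draws, once normalized by their sum,
    # are distributed as Dirichlet(1, ..., 1).
    raw = [rng.expovariate(1.0) for _ in scores]
    total = sum(raw)
    return sum(w * s for w, s in zip(raw, scores)) / total


def replicate(statistic, scores, n_replicates=1000, seed=0):
    """Collect n_replicates values of a resampled statistic."""
    rng = random.Random(seed)
    return [statistic(scores, rng) for _ in range(n_replicates)]


def standard_error(replicates):
    """Sample standard deviation of the replicates, used as the
    bootstrap estimate of the standard error."""
    m = sum(replicates) / len(replicates)
    var = sum((r - m) ** 2 for r in replicates) / (len(replicates) - 1)
    return var ** 0.5


if __name__ == "__main__":
    # Hypothetical per-item performance scores, for illustration only.
    scores = [0.12, 0.40, 0.33, 0.75, 0.08, 0.51, 0.29, 0.66]
    for name, stat in [("standard", standard_bootstrap_mean),
                       ("Bayesian", bayesian_bootstrap_mean)]:
        reps = replicate(stat, scores)
        print(name, "mean:", sum(reps) / len(reps),
              "SE:", standard_error(reps))
```

In a real evaluation the statistic would be the benchmark's performance measure recomputed under each set of counts or weights, rather than a plain mean of scores.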
Results: We apply our analysis to the comparative study of amino acid substitution matrix families and find that using modern matrices results in a small, but statistically significant improvement in remote homology detection compared with the classic PAM and BLOSUM matrices.
Availability: The sequence sets and code for performing these analyses are available from http://compbio.berkeley.edu/.
Contact: brenner{at}compbio.berkeley.edu
Received on April 14, 2005; revised on July 16, 2005; accepted on August 11, 2005