Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology
Baskin Center for Computer Engineering and Information Sciences, Applied Sciences Building, University of California at Santa Cruz, Santa Cruz, CA 95064, USA
1The Sanger Centre, Hinxton Hall, Hinxton, Cambs CB10 1RQ, UK
2Life Sciences Division (Mail Stop 29100), Lawrence Berkeley Laboratory, University of California, Berkeley, CA 94720, USA
1To whom correspondence should be addressed. E-mail: kimmen{at}cse.ucsc.edu
We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. This paper corrects the previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
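To make the idea of combining a Dirichlet mixture with observed counts concrete, the following is a minimal sketch in Python. The abstract does not state the formula, so this assumes the standard posterior-mean estimate for a Dirichlet mixture prior under a multinomial likelihood: each component j gets a posterior weight proportional to q_j * B(alpha_j + n) / B(alpha_j), and the estimate is the weighted average of (n_i + alpha_ji) / (|n| + |alpha_j|). The function names, the three-letter toy alphabet, and the mixture parameters are all hypothetical and chosen only for illustration; a real mixture would cover 20 amino acids with parameters fitted to a database.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mean(counts, weights, alphas):
    """Estimate symbol probabilities from counts under a Dirichlet mixture prior.

    counts  -- observed counts n_i for one alignment column
    weights -- mixture coefficients q_j (assumed to sum to 1)
    alphas  -- one Dirichlet parameter vector alpha_j per component
    Returns (probabilities, component posterior weights).
    """
    # Unnormalised log posterior of each component given the counts.
    log_post = []
    for q, alpha in zip(weights, alphas):
        shifted = [a + n for a, n in zip(alpha, counts)]
        log_post.append(math.log(q) + log_beta(shifted) - log_beta(alpha))

    # Normalise in log space to avoid underflow for large counts.
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    z = sum(unnorm)
    resp = [u / z for u in unnorm]

    # Weighted average of the per-component posterior means.
    total = sum(counts)
    probs = [0.0] * len(counts)
    for r, alpha in zip(resp, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += r * (n + a) / denom
    return probs, resp


# Hypothetical two-component mixture over a toy three-letter alphabet.
toy_weights = [0.5, 0.5]
toy_alphas = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]]
toy_probs, toy_resp = posterior_mean([3, 0, 0], toy_weights, toy_alphas)
```

In this toy case the counts favor the first symbol, so the first component dominates the posterior, yet the unobserved symbols still receive nonzero probability. That smoothing is what lets a model built from few sequences assign reasonable scores to remote relatives.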











