Bioinformatics Vol. 18 no. 12 2002
Pages 1641-1649
© 2002 Oxford University Press
Modeling the percolation of annotation errors in a database of protein sequences


1 Medical Research Council Biostatistics Unit, Cambridge
2 Computational Genomics Group, The European Bioinformatics Institute,
EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
3 Statistics Unit, Public Health Laboratory Service, London, UK
Received on April 5, 2002; revised on May 30, 2002; accepted on June 6, 2002
Public sequence databases contain information on the sequence, structure and function of proteins. Genome sequencing projects have led to a rapid increase in protein sequence information, but reliable, experimentally verified information on protein function lags far behind. To address this deficit, functional annotation in protein databases is often inferred by sequence similarity to homologous, annotated proteins, with the attendant possibility of error. The functional annotation of these homologous proteins may itself have been acquired through sequence similarity to yet other proteins, and it is generally not possible to determine how the functional annotation of any given protein has been acquired. Thus the possibility of chains of misannotation arises, a process we term error percolation. With some simple assumptions, we develop a dynamical probabilistic model for these misannotation chains. Exploring the consequences of the model for annotation quality shows that this iterative approach leads to a systematic deterioration of database quality.
Contact: WRG: wally.gilks{at}mrc-bsu.cam.ac.uk; BA and CAO: audit{at}ebi.ac.uk; ouzounis{at}ebi.ac.uk
* To whom correspondence should be addressed.
Both these authors contributed equally to this work.
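The error-percolation process described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' model, whose equations are not given on this page. It assumes a hypothetical setup in which each new protein copies its annotation from a uniformly chosen existing entry, and a transfer from a correct source fails with a fixed probability `transfer_error`. The names `simulate_percolation`, `n_seed`, `n_new` and `transfer_error` are illustrative assumptions.

```python
import random


def simulate_percolation(n_seed, n_new, transfer_error, rng=None):
    """Toy simulation: fraction of erroneous annotations after each addition.

    Assumptions (not taken from the paper): sources are chosen uniformly at
    random, seed entries are all correct, and an error is never repaired.
    """
    rng = rng or random.Random(0)
    # True marks a correct annotation; the seed entries stand in for
    # experimentally verified proteins.
    correct = [True] * n_seed
    n_wrong = 0
    error_fraction = []
    for _ in range(n_new):
        # Pick an existing entry as the annotation source.
        source_ok = rng.choice(correct)
        # A wrong source always yields a wrong copy; a correct source
        # yields a wrong copy with probability transfer_error.
        new_ok = source_ok and rng.random() >= transfer_error
        correct.append(new_ok)
        n_wrong += not new_ok
        error_fraction.append(n_wrong / len(correct))
    return error_fraction


if __name__ == "__main__":
    trace = simulate_percolation(n_seed=100, n_new=10_000, transfer_error=0.05)
    print(f"error fraction after 100 additions:   {trace[99]:.3f}")
    print(f"error fraction after 10000 additions: {trace[-1]:.3f}")
```

Under these assumptions the error fraction tends to rise as the database grows, because a wrong entry can only propagate further wrong entries while a correct entry can still produce a wrong copy. This mirrors the qualitative claim of the abstract, not its quantitative results.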