Bioinformatics Advance Access originally published online on January 22, 2004
Bioinformatics 20(4) © Oxford University Press 2004; all rights reserved.
Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics
1 Laboratoire de Physiologie Cellulaire Végétale, Département Réponse et Dynamique Cellulaire, UMR 5168 CNRS-CEA-INRA-Université J. Fourier, CEA Grenoble, 17 rue des Martyrs, F-38054, Grenoble cedex 09, France, 2 Gene-IT, 147 avenue Paul Doumer, F-92500 Rueil-Malmaison, France, 3 Laboratoire de Bioinformatique, Génomique et Modélisation, Département de Biologie Joliot Curie, CEA Saclay, F-91191 Gif sur Yvette Cedex, France and 4 Service de Développements pour la Bioinformatique Sud-Est, CEA Grenoble, 17 rue des Martyrs, F-38054, Grenoble cedex 09, France
Received on May 23, 2003; revised on July 18, 2003; accepted on August 4, 2003
Advance Access Publication January 22, 2004
Motivation: Different automatic methods of sequence alignment are routinely used as a starting point for homology searches and function inference. Confidence in an alignment probability is one of the major fundamentals of massive automatic genome-scale pairwise comparisons, for clustering of putative orthologs and paralogs, sequenced genome annotation or multiple-genomic tree constructions. Extreme value distributions based on the Karlin-Altschul model, usually advised for large-scale comparisons, are not always valid, particularly when comparing non-biased with nucleotide-biased genomes (such as that of Plasmodium falciparum). Z-value estimates based on Monte Carlo techniques can be calculated experimentally for any alignment output, whatever the method used. Empirically, a Z-value higher than 8 is considered reasonable to assess that an alignment score is significant, but this arbitrary figure has never been theoretically justified.
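As an illustration of the Monte Carlo estimate mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes a toy Smith-Waterman scoring scheme (match, mismatch and linear gap values chosen arbitrarily), shuffles one of the two sequences, and reports how many standard deviations the observed score lies above the mean of the shuffled scores. The function names `local_alignment_score` and `z_value`, the number of shuffles and the example sequence are all hypothetical.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (toy scoring scheme, assumed)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=200, seed=0):
    """Estimate a Z-value by shuffling sequence b (Monte Carlo sketch)."""
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps the composition of b, destroys its order
        shuffled_scores.append(local_alignment_score(a, "".join(letters)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.stdev(shuffled_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mu) / sigma


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
    print(z_value(seq, seq))
```

Because shuffling preserves the composition of the sequence, a strongly biased composition is reflected in the shuffled scores themselves; this sketch makes no claim about which shuffling scheme or substitution matrix the paper used.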
Results: In this paper, we used the Bienaymé-Chebyshev inequality to demonstrate a theorem of the upper limit of an alignment score probability (or P-value). This theorem implies that a computed Z-value is a statistical test and a single-linkage clustering criterion, and that 1/Z-value² is an upper limit to the probability of an alignment score, whatever the actual probability law is. Therefore, this study provides the missing theoretical link between a Z-value cut-off used for an automatic clustering of putative orthologs and/or paralogs, and the corresponding statistical risk in such genome-scale comparisons (using non-biased or biased genomes).
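The bound stated above can be sketched in one line from the Bienaymé-Chebyshev inequality. The notation is an assumption made here for illustration, not taken from the abstract: S is the random alignment score obtained against shuffled sequences, with mean μ and standard deviation σ, and s is the observed score.

```latex
Z = \frac{s - \mu}{\sigma}, \qquad
P(S \ge s) \;\le\; P\bigl(|S - \mu| \ge Z\sigma\bigr) \;\le\; \frac{1}{Z^{2}}
\qquad (Z > 0).
```

Under this reading, the empirical cut-off of Z = 8 would correspond to a P-value of at most 1/64, roughly 0.016, without any assumption on the shape of the score distribution.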
Contact: emarechal{at}cea.fr
* To whom correspondence should be addressed.