Bioinformatics, Vol 14, 164-173, Copyright © 1998 by Oxford University Press
J Gracy and P Argos
MOTIVATION: Genome sequencing projects require the periodic application of
analysis tools that can classify and multiply align related protein
sequence domains. Full automation of this task requires an efficient
integration of similarity and alignment techniques. RESULTS: We have
developed a fully automated process that classifies entire protein sequence
databases, resulting in alignment of the homologous sequences. The
successive steps of the procedure are based on compositional and local
sequence similarity searches followed by multiple sequence alignments.
Global similarities are detected from the pairwise comparison of amino acid
and dipeptide compositions of each protein. After the elimination of all
but one sequence from each detected cluster of closely related proteins,
the remaining sequences are compiled in a suffix tree which is
self-compared to detect local sequence similarities. Sets of proteins which
share similar sequence segments are then weighted according to their
closeness and multiply aligned using a fast hierarchical dynamic
programming algorithm. Computational strategies were devised to minimize
computer processing time and memory space requirements. The accuracy of the
sequence classifications has been evaluated for 12 462 primary structures
distributed over 341 known families. The percentage of sequences with
missed or incorrect family assignments was 6.8% on the test set. This low
error level is only twice that of the manually constructed PROSITE database
(3.4%) and is substantially better than that found for the automatically
built PRODOM database (34.9%). AVAILABILITY: The resulting database,
called DOMO, is available through the database search routine SRS at the Infobiogen
(http://www.infobiogen.fr/srs5/), EBI (http://srs.ebi.ac.uk:5000/) and EMBL
(http://www.embl-heidelberg.de/srs5/) World Wide Web sites. CONTACT:
gracy@infobiogen.fr
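
The first step of the procedure, the pairwise comparison of amino acid and dipeptide compositions followed by the elimination of all but one sequence per cluster, might be sketched as follows. This is only an illustration: the abstract does not state which distance measure or cutoff is used, so the Euclidean distance, the `threshold` value, the greedy selection, and all function names below are assumptions rather than the authors' method.

```python
from collections import Counter
from math import sqrt


def composition(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def distance(p, q):
    """Euclidean distance between two sparse frequency vectors (assumed measure)."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def composition_distance(a, b):
    """Sum of the amino acid (k=1) and dipeptide (k=2) composition distances."""
    return (distance(composition(a, 1), composition(b, 1))
            + distance(composition(a, 2), composition(b, 2)))


def representatives(seqs, threshold=0.1):
    """Greedy pass: keep a sequence only if no kept sequence lies within threshold.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    kept = []
    for s in seqs:
        if all(composition_distance(s, r) > threshold for r in kept):
            kept.append(s)
    return kept
```

Under these assumptions, `representatives(["MKVLAAGIV", "MKVLAAGIV", "WWWWPPPP"])` keeps one copy of the duplicated sequence plus the compositionally distinct one. Only the retained representatives would then go on to the local similarity search.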
ARTICLES
Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment
European Molecular Biology Laboratory, Heidelberg, Germany.



