Bioinformatics, Vol 15, 480-500, Copyright © 1999 by Oxford University Press
A Elofsson and EL Sonnhammer
MOTIVATION: Protein families can be defined based on structure or sequence
similarity. We wanted to compare two protein family databases, one based on
structural and one on sequence similarity, to investigate to what extent
they overlap, the similarity in definition of corresponding families, and
to create a list of large protein families with unknown structure as a
resource for structural genomics. We also wanted to increase the
sensitivity of fold assignment by exploiting protein family HMMs.

RESULTS:
We compared Pfam, a protein family database based on sequence similarity,
to Scop, which is based on structural similarity. We found that 70% of the
Scop families exist in Pfam while 57% of the Pfam families exist in Scop.
Most families that occur in both databases correspond well to each other,
but in some cases their definitions differ. Such cases highlight situations in
which structure and sequence approaches diverge significantly. The
comparison enabled us to compile a list of the largest families that do not
occur in Scop; these are suitable targets for structure prediction and
determination, and may be useful to guide projects in structural genomics.
Notably, 13 of the 20 largest protein families without a
known structure are likely transmembrane proteins. We also exploited Pfam
to increase the sensitivity of detecting homologs of proteins with known
structure, by comparing query sequences to Pfam HMMs that correspond to
Scop families. For SWISSPROT+TREMBL, this yielded an increase in fold
assignment from 31% to 42% compared to using FASTA only. This method
assigned a structure to 22% of the proteins in Saccharomyces cerevisiae,
24% in Escherichia coli, and 16% in Methanococcus jannaschii.
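
The overlap figures above are asymmetric (70% of Scop families found in Pfam, but 57% of Pfam families found in Scop) because each percentage is taken over a different set of families. The sketch below illustrates that kind of bidirectional calculation on toy data. It is not the authors' procedure: the matching criterion (two families correspond if they share at least one protein), the function name `family_overlap`, and all family and protein identifiers are assumptions made for illustration only.

```python
def family_overlap(a_families, b_families):
    """Return the fraction of families in a_families that share at least
    one protein with any family in b_families.

    Both arguments map a family identifier to a set of protein identifiers.
    The "shares at least one protein" criterion is an assumed stand-in for
    whatever correspondence rule a real comparison would use.
    """
    if not a_families:
        return 0.0

    # Collect every protein that belongs to some family in the other database.
    b_proteins = set()
    for proteins in b_families.values():
        b_proteins.update(proteins)

    # A family counts as covered if any of its proteins appears on the other side.
    covered = [
        family
        for family, proteins in a_families.items()
        if set(proteins) & b_proteins
    ]
    return len(covered) / len(a_families)


# Hypothetical toy data; identifiers do not refer to real Scop or Pfam entries.
structure_families = {"S1": {"p1", "p2"}, "S2": {"p3"}, "S3": {"p9"}}
sequence_families = {"P1": {"p1"}, "P2": {"p2", "p3"}, "P3": {"p4"}, "P4": {"p5"}}

structure_in_sequence = family_overlap(structure_families, sequence_families)
sequence_in_structure = family_overlap(sequence_families, structure_families)

print(f"structure families found in sequence database: {structure_in_sequence:.0%}")
print(f"sequence families found in structure database: {sequence_in_structure:.0%}")
```

With this toy data, two of the three structure-based families and two of the four sequence-based families are covered, so the two directions give different fractions even though the shared proteins are the same.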
A comparison of sequence and structure protein domain families as a basis for structural genomics
Department of Biochemistry, Stockholm University, 106 91 Stockholm, Sweden. arne@biokemi.su.se





