Bioinformatics, Vol 14, 423-429, Copyright © 1998 by Oxford University Press
L Holm and C Sander
MOTIVATION: To maximize the chances of biological discovery, homology
searching must use an up-to-date collection of sequences. However, the
available sequence databases are growing rapidly and are partially
redundant in content. This leads to increasing strain on CPU resources and
decreasing density of first-hand annotation.

RESULTS: These problems are
addressed by clustering closely similar sequences to yield a covering of
sequence space by a representative subset of sequences. No pair of
sequences in the representative set has >90% mutual sequence identity.
The representative set is derived by an exhaustive search for close
similarities in the sequence database in which the need for explicit
sequence alignment is significantly reduced by applying deca- and
pentapeptide composition filters. The algorithm was applied to the union of
the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB
databases. The all-against-all comparison required to generate a
representative set at 90% sequence identity was accomplished in 2 days CPU
time, and the removal of fragments and close similarities yielded a size
reduction of 46%, from 260 000 unique sequences to 140 000 representative
sequences. The practical implications are (i) faster homology searches
using, for example, Fasta or Blast, and (ii) unified annotation for all
sequences clustered around a representative. As tens of thousands of
sequence searches are performed daily world-wide, appropriate use of the
non-redundant database can lead to major savings in computer resources,
without loss of efficacy.

AVAILABILITY: A regularly updated non-redundant
protein sequence database (nrdb90), a server for homology searches against
nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are
available for academic use from http://www.embl-ebi.ac.uk/~holm/nrdb90.

CONTACT: holm@embl-ebi.ac.uk
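
The clustering idea described under RESULTS can be illustrated with a minimal Python sketch. It is not the nrdb90.pl implementation: the greedy longest-first order, the shared-pentapeptide check standing in for the composition filters, and the ungapped identity score standing in for sequence alignment are all assumptions made for illustration, and the function names (`kmers`, `best_ungapped_identity`, `select_representatives`) and the toy sequences are hypothetical.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_ungapped_identity(short: str, long: str) -> float:
    """Best fraction of identical residues over all ungapped placements of
    `short` inside `long`. A crude stand-in for a real alignment (assumption)."""
    if not short or len(short) > len(long):
        return 0.0
    best = 0
    for offset in range(len(long) - len(short) + 1):
        # zip stops at the end of `short`, so only the overlap is compared
        matches = sum(a == b for a, b in zip(short, long[offset:]))
        best = max(best, matches)
    return best / len(short)


def select_representatives(
    seqs: Dict[str, str],
    threshold: float = 0.90,  # identity cutoff, as in the 90% representative set
    k: int = 5,               # pentapeptides; the decapeptide filter is omitted here
    min_shared: int = 1,      # hypothetical filter setting
) -> Dict[str, List[str]]:
    """Greedy clustering sketch. Returns {representative_id: [member_ids]}."""
    clusters: Dict[str, List[str]] = {}
    rep_kmers: Dict[str, Set[str]] = {}
    # Visit longest sequences first so that fragments join full-length entries.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        seq = seqs[sid]
        seq_kmers = kmers(seq, k)
        for rep in clusters:
            # Cheap filter: skip the costly comparison when too few k-mers are shared.
            if len(seq_kmers & rep_kmers[rep]) < min_shared:
                continue
            if best_ungapped_identity(seq, seqs[rep]) > threshold:
                clusters[rep].append(sid)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters[sid] = []
            rep_kmers[sid] = seq_kmers
    return clusters


# Toy input (made-up sequences): "frag" is an exact fragment of "full".
toy = {
    "full": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "frag": "AKQRQISFVKSHFSRQ",
    "other": "GGGGSSSSPPPPLLLLWWWW",
}
clusters = select_representatives(toy)
print(clusters)
```

In this toy run the fragment is absorbed by the full-length entry, while the unrelated sequence shares no pentapeptide with it and is kept as a second representative without any identity computation. The sketch compares each sequence against every existing representative, so it does not reflect how the published method reaches database scale.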
Removing near-neighbour redundancy from large protein sequence collections
EMBL-EBI, Cambridge CB10 1SD, UK.