Bioinformatics, Vol 14, 423-429, Copyright © 1998 by Oxford University Press


ARTICLES

Removing near-neighbour redundancy from large protein sequence collections

L Holm and C Sander
EMBL-EBI, Cambridge CB10 1SD, UK.

MOTIVATION: To maximize the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of first-hand annotation.

RESULTS: These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, Fasta or Blast, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily world-wide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy.

AVAILABILITY: A regularly updated non-redundant protein sequence database (nrdb90), a server for homology searches against nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are available for academic use from http://www.embl-ebi.ac.uk/~holm/nrdb90.

CONTACT: holm@embl-ebi.ac.uk
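The general idea summarized in the abstract (a cheap peptide-composition filter followed by an explicit comparison only for surviving pairs) can be illustrated with a minimal Python sketch. This is not the authors' nrdb90.pl: the function names, the greedy longest-first ordering, the use of `difflib.SequenceMatcher` as a stand-in for sequence alignment, and the filter bound (each mismatch destroys at most k overlapping k-mers, which strictly holds only for substitutions) are all assumptions made for illustration. Only the pentapeptide filter is shown; the decapeptide filter is omitted.

```python
from collections import Counter
from difflib import SequenceMatcher


def kmer_profile(seq, k):
    """Multiset of overlapping k-mers in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def matched_residues(query, rep):
    """Residues of query placed in matching blocks; a crude stand-in for alignment."""
    matcher = SequenceMatcher(None, query, rep, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def representatives(seqs, threshold_pct=90, k=5):
    """Greedy sketch: keep a sequence only if no kept sequence covers it above the threshold.

    Identity is measured over the shorter (query) sequence, so fragments of a
    longer representative are removed as well. Hypothetical helper, not nrdb90.pl.
    """
    reps = []  # (sequence, k-mer profile) pairs accepted so far
    # dict.fromkeys drops exact duplicates but keeps input order, so ties are deterministic.
    for seq in sorted(dict.fromkeys(seqs), key=len, reverse=True):
        profile = kmer_profile(seq, k)
        allowed_mismatches = len(seq) * (100 - threshold_pct) // 100
        # Assumed bound: each mismatch can destroy at most k of the query's k-mers.
        min_shared = (len(seq) - k + 1) - k * allowed_mismatches
        redundant = False
        for rep, rep_profile in reps:
            shared = sum((profile & rep_profile).values())
            if min_shared > 0 and shared < min_shared:
                continue  # filter: too few shared k-mers, skip the costly comparison
            if matched_residues(seq, rep) * 100 > threshold_pct * len(seq):
                redundant = True
                break
        if not redundant:
            reps.append((seq, profile))
    return [rep for rep, _ in reps]


# Toy input: a full-length sequence, a fragment of it, a one-residue variant,
# and an unrelated sequence (all made up for this example).
full = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
fragment = full[5:30]
variant = full[:20] + "K" + full[21:]
unrelated = "WWPCNGHDYTWMCCPNGHDYWMWCPGNHDTYWCM"
reps = representatives([full, fragment, variant, unrelated, full])
print(reps)
```

On this toy input the fragment and the single-substitution variant are absorbed by the full-length sequence, while the unrelated sequence is rejected by the k-mer filter before any comparison is made and is kept as its own representative. The loop over accepted representatives is quadratic in the worst case; a database-scale implementation would need an index over peptides rather than pairwise profile intersection.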


