Bioinformatics Vol. 18 no. 1 2002
Pages 77-82
© 2002 Oxford University Press
Tolerating some redundancy significantly speeds up clustering of large protein databases
The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA
Received on May 1, 2001; revised on July 6, 2001; accepted on July 20, 2001
Motivation: Sequence clustering replaces groups of similar
sequences in a database with single representatives. Clustering
large protein databases like the NCBI Non-Redundant database (NR)
using even the best currently available clustering algorithms is
very time-consuming and only practical at relatively high sequence
identity thresholds. Our previous program, CD-HI, clustered NR at
90% identity in 1 h and at 75% identity in 1 day on a 1 GHz Linux PC
(Li et al., Bioinformatics, 17, 282, 2001); however, even faster
clustering is needed because protein databases are growing rapidly
and many applications require lower attainable thresholds.
Results: For our previous algorithm (CD-HI), we employed
short-word filters to speed up the clustering. In this paper, we
show that tolerating some redundancy makes for more efficient use of
these short-word filters and increases the program's speed 100
times. Our new program implements this technique and clusters NR at
70% identity within 2 h, and at 50% identity in 5 days.
Although some redundancy is present after clustering, our new
program's results differ from our previous program's by less than
0.4%.
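The idea of a short-word filter can be illustrated with a minimal Python sketch. It is not the CD-HI implementation, and it rests on several assumptions that the abstract does not state: a greedy, longest-first clustering loop; an ungapped identity measure standing in for a real alignment; and the counting argument that each mismatch destroys at most k overlapping words. All function names and the `extra_required` parameter are hypothetical. With `extra_required=0` the filter never rejects a pair that would pass the identity check; raising it skips more comparisons but can leave some redundant sequences unmerged, which is one way to read the trade-off described above.

```python
import math
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping length-k word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a_counts, b_counts):
    """Number of words two sequences share, counted with multiplicity."""
    return sum((a_counts & b_counts).values())


def rigorous_bound(length, k, threshold):
    """Fewest words a sequence of this length must share with a
    representative to reach the threshold, assuming each mismatch
    destroys at most k overlapping words (assumption, see lead-in)."""
    max_mismatches = length - math.ceil(threshold * length - 1e-9)
    return max(0, length - k + 1 - max_mismatches * k)


def ungapped_identity(short, long_):
    """Best fraction of matching positions over all ungapped offsets.
    A simplified stand-in for alignment-based identity."""
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def cluster(seqs, threshold, k=2, extra_required=0):
    """Greedy clustering of non-empty sequences, longest first.
    extra_required > 0 makes the word filter stricter than the
    rigorous bound (hypothetical knob for tolerating redundancy)."""
    reps = []       # (representative sequence, its word counts)
    clusters = {}   # representative -> list of member sequences
    for seq in sorted(seqs, key=len, reverse=True):
        counts = word_counts(seq, k)
        need = rigorous_bound(len(seq), k, threshold) + extra_required
        for rep, rep_counts in reps:
            if shared_words(counts, rep_counts) < need:
                continue  # filtered out: skip the costly comparison
            if ungapped_identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append((seq, counts))
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    demo = ["MKVLAAGIVG", "MKVLAAGIVA", "WWWWWWWWWW", "MKVLAAG"]
    for rep, members in cluster(demo, 0.9).items():
        print(rep, members)
```

In this toy input the all-tryptophan sequence shares no two-letter words with the first representative, so the filter rejects it without computing an identity at all. Calling `cluster(demo, 0.9, extra_required=2)` makes the filter strict enough that the near-duplicates are no longer merged, showing how a more aggressive filter leaves residual redundancy.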
Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi
Contact: liwz{at}burnham-inst.org; adam{at}burnham-inst.org
* To whom correspondence should be addressed.