
Bioinformatics Vol. 18 no. 1 2002
Pages 77-82
© 2002 Oxford University Press

Tolerating some redundancy significantly speeds up clustering of large protein databases

Weizhong Li, Lukasz Jaroszewski and Adam Godzik *

The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA

Received on May 1, 2001; revised on July 6, 2001; accepted on July 20, 2001

Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ~1 h and at 75% identity in ~1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower identity thresholds.

Results: Our previous algorithm (CD-HI) employed short-word filters to speed up clustering. In this paper, we show that tolerating some redundancy makes more efficient use of these short-word filters and increases the program's speed 100-fold. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ~5 days. Although some redundancy remains after clustering, the results of the new program differ from those of the previous program by less than 0.4%.
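
For readers unfamiliar with short-word filtering, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes an ungapped comparison in which each mismatch can destroy at most k overlapping words, so two sequences that are at least `identity` identical over `length` residues must share at least (length - k + 1) - k * mismatches words of length k. Pairs that share fewer words can be rejected without computing an alignment. The function names, the use of the shorter sequence as the reference length, and the exact bound are assumptions made for illustration; how the paper relaxes the filter to tolerate redundancy is not described in this abstract and is not modeled here.

```python
import math
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a, b, k):
    """Number of k-words common to a and b, counted with multiplicity."""
    common = word_counts(a, k) & word_counts(b, k)  # multiset intersection
    return sum(common.values())


def min_shared_words(length, identity, k):
    """Lower bound on shared k-words for an ungapped match (assumed model).

    At least ceil(identity * length) residues match, so there are at most
    length - ceil(identity * length) mismatches, and each mismatch removes
    at most k of the length - k + 1 overlapping words.
    """
    matches = math.ceil(identity * length - 1e-9)  # guard against float noise
    mismatches = length - matches
    return max(0, length - k + 1 - k * mismatches)


def passes_filter(a, b, identity, k):
    """Return False when the pair cannot reach the identity threshold.

    Assumption: identity is measured over the shorter sequence.
    """
    length = min(len(a), len(b))
    return shared_words(a, b, k) >= min_shared_words(length, identity, k)


if __name__ == "__main__":
    rep = "ACDEFGHIKLMNPQRSTVWY"
    near = "ACDEFGHIKLMNPQRSTVWA"   # one substitution at the end
    far = "MMMMMMMMMMMMMMMMMMMM"
    print(passes_filter(rep, near, 0.90, 2))  # candidate, would be aligned
    print(passes_filter(rep, far, 0.90, 2))   # rejected without alignment
```

Only pairs that pass the filter would go on to a full alignment, which is where the time saving comes from. Under the same assumed bound, longer words give a weaker guarantee at low identity thresholds (the bound reaches zero sooner), which is one reason low-threshold clustering is harder to filter strictly.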

Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz{at}burnham-inst.org; adam{at}burnham-inst.org

* To whom correspondence should be addressed.


