Bioinformatics Advance Access originally published online on May 26, 2006
Bioinformatics 2006 22(13):1658-1659; doi:10.1093/bioinformatics/btl158
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li* and Adam Godzik

Burnham Institute for Medical Research, La Jolla, CA 92037, USA

*To whom correspondence should be addressed.

Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283; Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools such as BLAST.
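
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two tasks it names: clustering one dataset, and comparing two datasets. It assumes a greedy, longest-first scheme in the spirit of the earlier cd-hit papers and replaces real alignment with a toy ungapped identity score. The function names (identity, greedy_cluster, compare_two_sets) and the example sequences are hypothetical and are not part of the cd-hit programs.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy identity (an assumption, not cd-hit's alignment): fraction of
    matching positions over the shorter sequence, compared without gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Process sequences longest first. Each sequence joins the first
    representative it matches at >= threshold; otherwise it becomes
    the representative of a new cluster."""
    clusters: Dict[str, List[str]] = {}
    # Sort by decreasing length; break ties by name so the result is deterministic.
    for name in sorted(seqs, key=lambda k: (-len(seqs[k]), k)):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


def compare_two_sets(
    db1: Dict[str, str], db2: Dict[str, str], threshold: float
) -> Dict[str, List[str]]:
    """For each sequence in db1, list the db2 sequences that match it
    at >= threshold (the two-dataset comparison task)."""
    return {
        ref: [q for q in db2 if identity(db1[ref], db2[q]) >= threshold]
        for ref in db1
    }


# Hypothetical example data.
proteins = {
    "p1": "MKTAYIAKQRQISFVK",
    "p2": "MKTAYIAKQRQISF",
    "p3": "GSHMLEDPVAGTW",
}
clusters = greedy_cluster(proteins, threshold=0.9)

nucl_db1 = {"a": "ACGTACGTAC"}
nucl_db2 = {"x": "ACGTACGTAT", "y": "TTTTTTTTTT"}
matches = compare_two_sets(nucl_db1, nucl_db2, threshold=0.8)
```

The same two functions serve protein and nucleotide input here only because the toy score is alphabet-agnostic. The all-against-representatives loop is quadratic in the worst case and would not scale to the millions of sequences the abstract mentions.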

Availability: http://cd-hit.org

Contact: liwz{at}sdsc.edu


Received on March 23, 2006; revised on April 20, 2006; accepted on April 20, 2006



