Bioinformatics Advance Access published online on May 26, 2006

Bioinformatics, doi:10.1093/bioinformatics/btl158
All versions of this article: 22/13/1658 (most recent); btl158v1

© 2006 The Author(s)
Received March 23, 2006
Revised April 20, 2006
Accepted April 20, 2006

Applications note

cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li 1 * and Adam Godzik 1

1 Burnham Institute for Medical Research, La Jolla, CA 92037, USA

* To whom correspondence should be addressed.
Weizhong Li, E-mail: liwz{at}sdsc.edu


Abstract

Motivation: In 2001 and 2002, we published two papers (Bioinformatics 17: 282-283, Bioinformatics 18: 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the underlying algorithm is not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein data sets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide data sets. All of these programs can handle huge data sets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools such as BLAST.
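
To make the two modes concrete, the toy Python sketch below contrasts clustering one data set with comparing two data sets. It is an illustration only and not the cd-hit implementation: the function names, the word length `k`, the threshold and the k-mer-based `similarity` score are all assumptions made for this example. The score is a crude stand-in for sequence identity, and the greedy, longest-first strategy follows our reading of the earlier cd-hit papers cited above.

```python
def kmers(seq, k=3):
    """Return the set of overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy stand-in for sequence identity (assumption, not cd-hit's measure):
    shared k-mers divided by the k-mer count of the sequence with fewer k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    shorter = min(len(ka), len(kb))
    if shorter == 0:
        return 0.0
    return len(ka & kb) / shorter


def greedy_cluster(seqs, threshold=0.9, k=3):
    """Cluster one data set greedily: process sequences longest first and
    attach each to the first representative it matches, else start a new cluster."""
    clusters = []  # list of (representative, members)
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if similarity(rep, s, k) >= threshold:
                members.append(s)
                break
        else:
            # No existing representative is similar enough.
            clusters.append((s, [s]))
    return clusters


def compare_2d(db1, db2, threshold=0.9, k=3):
    """Compare two data sets: for each db2 sequence, report the first
    db1 sequence it matches, if any."""
    matches = {}
    for query in db2:
        for ref in db1:
            if similarity(ref, query, k) >= threshold:
                matches[query] = ref
                break
    return matches
```

For example, `greedy_cluster` applied to a sequence, a truncated copy of it and an unrelated sequence would group the first two and leave the third on its own, while `compare_2d` would report only the truncated copy as matching the first data set. The same two functions work on nucleotide strings, which mirrors the relationship between the protein and the `-est` programs only in spirit.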

Availability: http://cd-hit.org.


Associate Editor: Golan Yona

