Bioinformatics Advance Access published online on May 26, 2006
Bioinformatics, doi:10.1093/bioinformatics/btl158
1 Burnham Institute for Medical Research, La Jolla, CA 92037, USA
* To whom correspondence should be addressed.
Motivation: In 2001 and 2002, we published two papers (Bioinformatics 17: 282-283, Bioinformatics 18: 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein data sets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide data sets. All of these programs can handle huge data sets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools such as BLAST.
Availability: http://cd-hit.org.
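The abstract does not describe the algorithm itself. As a rough illustration only, the sketch below assumes the greedy incremental, longest-first scheme described in the earlier cd-hit papers, and it substitutes a toy ungapped identity score for the real short-word filter and alignment. The function names (`toy_identity`, `greedy_cluster`, `compare_two_sets`) and the example sequences are hypothetical and are not part of the cd-hit package.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Best ungapped match fraction of the shorter sequence slid along the longer.

    Toy stand-in for a real alignment; NOT how cd-hit scores identity.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        # zip stops at the end of `short`, so only the overlap is compared
        matches = sum(x == y for x, y in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Longest-first greedy clustering (single data set, as in cd-hit / cd-hit-est).

    Each sequence joins the first representative it matches at or above
    `threshold`; otherwise it becomes a new representative.
    """
    clusters: Dict[str, List[str]] = {}
    # Longest first; ties broken by name so the result is deterministic.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if toy_identity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


def compare_two_sets(db1: Dict[str, str], db2: Dict[str, str],
                     threshold: float) -> Dict[str, List[str]]:
    """Two-set comparison (in the spirit of cd-hit-2d / cd-hit-est-2d).

    For each sequence in db2, list the db1 sequences at or above `threshold`.
    """
    return {
        n2: [n1 for n1, s1 in db1.items() if toy_identity(s2, s1) >= threshold]
        for n2, s2 in db2.items()
    }


if __name__ == "__main__":
    proteins = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQ", "c": "GGGGGGGG"}
    print(greedy_cluster(proteins, 0.9))
    print(compare_two_sets({"a": "MKTAYIAKQR"},
                           {"x": "MKTAYIAKQR", "y": "CCCC"}, 0.9))
```

The same two functions work unchanged on nucleotide strings, which mirrors the abstract's point that one algorithm serves both the protein and the DNA/RNA programs. The exhaustive pairwise scoring used here is exactly the cost that the real programs are designed to avoid.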
Received March 23, 2006
Revised April 20, 2006
Accepted April 20, 2006
Applications note
cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Weizhong Li 1,* and Adam Godzik 1
Weizhong Li, E-mail: liwz{at}sdsc.edu
Abstract
Associate Editor: Golan Yona