Bioinformatics Advance Access originally published online on May 26, 2006
Bioinformatics 2006 22(13):1658-1659; doi:10.1093/bioinformatics/btl158
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Burnham Institute for Medical Research La Jolla, CA 92037, USA
*To whom correspondence should be addressed.
Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283; Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST.
Availability: http://cd-hit.org
Contact: liwz{at}sdsc.edu
Received on March 23, 2006; revised on April 20, 2006; accepted on April 20, 2006