Bioinformatics Advance Access originally published online on May 26, 2006
Bioinformatics 2006 22(13):1658-1659; doi:10.1093/bioinformatics/btl158
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Burnham Institute for Medical Research La Jolla, CA 92037, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282283, Bioinformatics, 18, 7782) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein sequences clustering, here we present several new programs using the same algorithm including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST.
Availability: http://cd-hit.org
Contact: liwz{at}sdsc.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
In recent years, the amount of biological sequence data has been growing explosively, which has imposed growing difficulties on analyzing them. The complexity of many sequence analyses is of the order of n2, where n is the number of sequences to be considered. One such example is protein sequence clustering, which groups similar proteins into clusters based on their sequence similarities. To address this computational challenging problem, we developed a novel method and published a program, cd-hit (Li et al., 2001, 2002a), which can efficiently handle huge databases. For example, it takes only a few hours to cluster the NCBI-nr with
3 million proteins on a single high-end workstation. Since its release, cd-hit has been used by many groups, including Uniprot (Apweiler et al., 2004) and PDB (Bourne et al., 2004), in various research fields. In our group, we applied it to generate non-redundant protein datasets to reduce the database search efforts and to improve the homology detection sensitivity (Li et al., 2002a).
The algorithm behind cd-hit is short word filtering, which can determine that the similarity between two sequences is below a certain value without performing an actual sequence alignment. This algorithm is not limited to protein sequence clustering; it can also be applied to many other analyses that involve a large amount of sequence comparisons.
Here, we present several new programs based on cd-hit algorithm including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA database and cd-hit-est-2d compares two nucleotide datasets. The common advantages of these programs are ultrahigh speed and the ability to handle huge databases.
| 2 ALGORITHMS |
|---|
|
|
|---|
2.1 Short word filtering
The details of the algorithm for short word filtering were described in our earlier papers (Li et al., 2001, 2002a). In short, the minimum number of identical short substrings, called words, such as dipeptides, tripeptides and so on, shared by two proteins is a function of their sequence similarity. We calculated this function by analytical and large-scale statistical analyses. Therefore, we can effectively estimate that the similarity of two sequences is below a certain threshold by simple word counting and without an actual sequence alignment. For nucleotide sequences, we can also obtain such a short word requirement by a similar combination of analytical and statistical analyses.
We implemented this idea using an index table. For instance, the total number of possible pentapeptides is only 215 (each position has 21 possibilities, 20 amino acids plus X), and such an index table requires only 4 million entries, which just matches the RAM size of current computers. Index tables maximize the speed of short word counting. Details regarding how to choose an appropriate short word are documented in the cd-hit user's guide.
2.2 Cd-hit and cd-hit-est clustering
The original cd-hit program clusters a protein database, and its variant, cd-hit-est, clusters a DNA/RNA database. For eukaryotic genes, long introns can cause long gaps in sequence alignments, which significantly reduces the efficiency of short word filtering. So, practically, cd-hit-est can be applied only for non-intron-containing sequences, such as ESTs.
The clustering algorithm in both cd-hit and cd-hit-est is a greedy incremental clustering algorithm. Briefly, sequences are first sorted in order of decreasing length. The longest sequence becomes the representative of the first cluster. Then, each remaining sequence is compared with the representatives of existing clusters. If the similarity with any representative is above a given threshold, it is grouped into that cluster. Otherwise, a new cluster is defined with that sequence as the representative. For each sequence comparison, short word filtering is applied to the sequences to confirm whether the similarity is below the clustering threshold. If this cannot be confirmed, an actual sequence alignment is performed.
2.3 Cd-hit-2d and cd-hit-est-2d
Program cd-hit-2d compares two protein databases and identifies similar sequences between them above a certain threshold. Cd-hit-est-2d works for two DNA/RNA databases. For the same reason that we mentioned earlier, cd-hit-est-2d is a practical choice only for non-intron-containing sequences.
Given two databases, db1 and db2, cd-hit-2d or cd-hit-est-2d works in a straightforward way. Sequences in db1 are first sorted in order of decreasing length. Then, each sequence in db2 is compared with db1 from the top (the longest one), and if the similarity to any one in db1 is above the threshold, this sequence is attached to the matched one in db1. At the end of comparing, the program reports matches between db1 and db2 and also outputs a list of proteins in db2 that is not similar to any sequence in db1.
| 3 RESULTS |
|---|
|
|
|---|
The cd-hit package was written in C++ and was tested on Linux systems. It is distributed as an open source package and can be run on almost all systems that support C++ with little or no modification.
Some example runs are listed in Table 1. All these examples were performed on a Linux workstation with dual 3.0 GHz Xeon processors and 4 GB RAM. The programs used only one processor. For example, cd-hit took <8 h to cluster the NCBI-nr with more than 3.2 million proteins at 90% sequence identity level. Cd-hit-est-2d took a similar amount of time to identify the matches above 95% identity in both strands between human ESTs with
6 million sequences and
30 thousand human mRNAs.
|
Many options and functions were implemented for the users to control the clustering or comparing process. For example, a useful function is the incremental clustering that offers not only a higher speed but also a stable clustering structure for regularly updated databases. Full set of options are described in the documentation for the program.
In addition to the programs described above, the package contains several utility tools. Some tools help analyze, sort and format the clustering results. One script runs clustering in parallel mode by distributing jobs on a computer cluster (details can be found in the user's guide). The cd-hit package will be under regular maintenance and further development, which will focus on the efficiency at low sequence similarity thresholds. We are also open to adding new functionalities as suggested by users.
| Acknowledgments |
|---|
Funding to pay the Open Access publication charges was provided by the institutional funds of Burnham Institute for Medical Research.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Golan Yona
Received on March 23, 2006; revised on April 20, 2006; accepted on April 20, 2006
| REFERENCES |
|---|
|
|
|---|
Apweiler, R., et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res, . 32, (Database issue) D115D119
Bourne, P.E., et al. (2004) The distribution and query systems of the RCSB Protein Data Bank. Nucleic Acids Res, . 32, (Database issue) D223D225
Li, W., et al. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282283
Li, W., et al. (2002a) Sequence clustering strategies improve remote homology recognitions while reducing search times. Bioinformatics, 15, 643649.
Li, W., et al. (2002b) Tolerating some redundancy significantly speeds up clustering of large protein databases. Protein Eng, . 18, 7782.
This article has been cited by other articles:
![]() |
S. Lata and G.P.S. Raghava Prediction and classification of chemokines and their receptors Protein Eng. Des. Sel., July 1, 2009; 22(7): 441 - 444. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Bernsel, H. Viklund, A. Hennerdal, and A. Elofsson TOPCONS: consensus prediction of membrane protein topology Nucleic Acids Res., July 1, 2009; 37(suppl_2): W465 - W468. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Pavelka, E. Chovancova, and J. Damborsky HotSpot Wizard: a web server for identification of hot spots in protein engineering Nucleic Acids Res., July 1, 2009; 37(suppl_2): W376 - W383. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Mochida, T. Yoshida, T. Sakurai, Y. Ogihara, and K. Shinozaki TriFLDB: A Database of Clustered Full-Length Coding Sequences from Triticeae with Applications to Comparative Grass Genomics Plant Physiology, July 1, 2009; 150(3): 1135 - 1146. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Hernandez, M. J. Mate, P. C. Sanchez-Diaz, A. Romero, F. Rojo, and J. L. Martinez Structural and Functional Analysis of SmeT, the Repressor of the Stenotrophomonas maltophilia Multidrug Efflux Pump SmeDEF J. Biol. Chem., May 22, 2009; 284(21): 14428 - 14438. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Andreopoulos, A. An, X. Wang, and M. Schroeder A roadmap of clustering algorithms: finding a match for a biomedical application Brief Bioinform, May 1, 2009; 10(3): 297 - 314. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Childs, Z. Nikoloski, P. May, and D. Walther Identification and classification of ncRNA molecules using graph properties Nucleic Acids Res., May 1, 2009; 37(9): e66 - e66. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Lo, Y.-Y. Chiu, E. A. Rodland, P.-C. Lyu, T.-Y. Sung, and W.-L. Hsu Predicting helix-helix interactions from residue contacts in membrane proteins Bioinformatics, April 15, 2009; 25(8): 996 - 1003. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Okazaki, M. Shimojima, Y. Sawada, K. Toyooka, T. Narisawa, K. Mochida, H. Tanaka, F. Matsuda, A. Hirai, M. Y. Hirai, et al. A Chloroplastic UDP-Glucose Pyrophosphorylase from Arabidopsis Is the Committed Enzyme for the First Step of Sulfolipid Biosynthesis PLANT CELL, March 1, 2009; 21(3): 892 - 909. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Gunther, J. von Eichborn, P. May, and R. Preissner JAIL: a structure-based interface library for macromolecules Nucleic Acids Res., January 1, 2009; 37(suppl_1): D338 - D341. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Gerlach, E. V. Kriventseva, N. Rahman, C. E. Vejnar, and E. M. Zdobnov miROrtho: computational survey of microRNA genes Nucleic Acids Res., January 1, 2009; 37(suppl_1): D111 - D117. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. Goldenberg, E. Erez, G. Nimrod, and N. Ben-Tal The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures Nucleic Acids Res., January 1, 2009; 37(suppl_1): D323 - D327. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. Letunic, T. Doerks, and P. Bork SMART 6: recent updates and new developments Nucleic Acids Res., January 1, 2009; 37(suppl_1): D229 - D232. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. H. Saier Jr, M. R. Yen, K. Noto, D. G. Tamang, and C. Elkan The Transporter Classification Database: recent advances Nucleic Acids Res., January 1, 2009; 37(suppl_1): D274 - D278. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. J. Bordner Predicting small ligand binding sites in proteins using backbone structure Bioinformatics, December 15, 2008; 24(24): 2865 - 2871. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Holm, S. Kaariainen, P. Rosenstrom, and A. Schenkel Searching protein structure databases with DaliLite v.3 Bioinformatics, December 1, 2008; 24(23): 2780 - 2781. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Ren, L. Wen, X. Gao, C. Jin, Y. Xue, and X. Yao CSS-Palm 2.0: an updated software for palmitoylation sites prediction Protein Eng. Des. Sel., November 1, 2008; 21(11): 639 - 644. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Zhang, H. Zhang, K. Chen, S. Shen, J. Ruan, and L. Kurgan Accurate sequence-based prediction of catalytic residues Bioinformatics, October 15, 2008; 24(20): 2329 - 2338. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. L. Miller, L. J. Jensen, F. Diella, C. Jorgensen, M. Tinti, L. Li, M. Hsiung, S. A. Parker, J. Bordeaux, T. Sicheritz-Ponten, et al. Linear Motif Atlas for Phosphorylation-Dependent Signaling Sci. Signal., September 2, 2008; 1(35): ra2 - ra2. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Watanabe, K. Mochida, T. Kato, S. Tabata, N. Yoshimoto, M. Noji, and K. Saito Comparative Genomics and Reverse Genetics Analysis Reveal Indispensable Functions of the Serine Acetyltransferase Gene Family in Arabidopsis PLANT CELL, September 1, 2008; 20(9): 2484 - 2496. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. R. Tabita, T. E Hanson, S. Satagopan, B. H Witte, and N. E Kreel Phylogenetic and evolutionary relationships of RubisCO and the RubisCO-like proteins and the functional lessons provided by diverse molecular forms Phil Trans R Soc B, August 27, 2008; 363(1504): 2629 - 2640. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. P. Brown Efficient functional clustering of protein sequences using the Dirichlet process Bioinformatics, August 15, 2008; 24(16): 1765 - 1771. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Capriotti and M. A. Marti-Renom RNA structure alignment by a unit-vector approach Bioinformatics, August 15, 2008; 24(16): i112 - i118. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. S. Miller and D. Eisenberg Using inferred residue contacts to distinguish between correct and incorrect protein models Bioinformatics, July 15, 2008; 24(14): 1575 - 1582. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Gao and J. Skolnick DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions Nucleic Acids Res., July 1, 2008; 36(12): 3978 - 3992. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Bernsel, H. Viklund, J. Falk, E. Lindahl, G. von Heijne, and A. Elofsson Prediction of membrane-protein topology from first principles PNAS, May 20, 2008; 105(20): 7177 - 7181. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Park and V. Helms Prediction of the translocon-mediated membrane insertion free energies of protein sequences Bioinformatics, May 15, 2008; 24(10): 1271 - 1277. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. May, S. Wienkoop, S. Kempa, B. Usadel, N. Christian, J. Rupprecht, J. Weiss, L. Recuenco-Munoz, O. Ebenhoh, W. Weckwerth, et al. Metabolomics- and Proteomics-Assisted Genome Annotation and Analysis of the Draft Metabolic Network of Chlamydomonas reinhardtii Genetics, May 1, 2008; 179(1): 157 - 166. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y.-R. Tang, Z.-Y. Sheng, Y.-Z. Chen, and Z. Zhang An improved prediction of catalytic residues in enzyme structures Protein Eng. Des. Sel., May 1, 2008; 21(5): 295 - 302. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Xu, J. Wu, J. Xiao, Y. Tan, Q. Bao, F. Zhao, and X. Li PlasmoGF: an integrated system for comparative genomics and phylogenetic analysis of Plasmodium gene families Bioinformatics, May 1, 2008; 24(9): 1217 - 1220. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. G. Artieri, W. Haerty, B. P. Gupta, and R. S. Singh Sexual Selection and Maintenance of Sex: Evidence from Comparisons of Rates of Genomic Accumulation of Mutations and Divergence of Sex-Related Genes in Sexual and Hermaphroditic Species of Caenorhabditis Mol. Biol. Evol., May 1, 2008; 25(5): 972 - 979. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Palacios, J. Druce, L. Du, T. Tran, C. Birch, T. Briese, S. Conlan, P.-L. Quan, J. Hui, J. Marshall, et al. A New Arenavirus in a Cluster of Fatal Transplant-Associated Diseases N. Engl. J. Med., March 6, 2008; 358(10): 991 - 998. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Rubinstein and A. Fiser Predicting disulfide bond connectivity in proteins by correlated mutations analysis Bioinformatics, February 15, 2008; 24(4): 498 - 504. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. J. Jabado, Y. Liu, S. Conlan, P. L. Quan, H. Hegyi, Y. Lussier, T. Briese, G. Palacios, and W. I. Lipkin Comprehensive viral oligonucleotide probe design using conserved protein regions Nucleic Acids Res., January 17, 2008; 36(1): e3 - e3. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. V. Kriventseva, N. Rahman, O. Espinosa, and E. M. Zdobnov OrthoDB: the hierarchical catalog of eukaryotic orthologs Nucleic Acids Res., January 11, 2008; 36(suppl_1): D271 - D275. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. M. R. Davila, P. N. Mendes, G. Wagner, D. A. Tschoeke, R. R. C. Cuadrat, F. Liberman, L. Matos, T. Satake, K. A. C. S. Ocana, O. Triana, et al. ProtozoaDB: dynamic visualization and exploration of protozoan genomes Nucleic Acids Res., January 11, 2008; 36(suppl_1): D547 - D552. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Yeats, J. Lees, A. Reid, P. Kellam, N. Martin, X. Liu, and C. Orengo Gene3D: comprehensive structural and functional annotation of genomes Nucleic Acids Res., January 11, 2008; 36(suppl_1): D414 - D418. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Moslavac, K. Nicolaisen, O. Mirus, F. Al Dehni, R. Pernil, E. Flores, I. Maldener, and E. Schleiff A TolC-Like Protein Is Required for Heterocyst Development in Anabaena sp. Strain PCC 7120 J. Bacteriol., November 1, 2007; 189(21): 7887 - 7895. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Chen and L. Kurgan PFRES: protein fold classification by using evolutionary information and predicted secondary structure Bioinformatics, November 1, 2007; 23(21): 2843 - 2850. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Islinger, G. H. Luers, K. W. Li, M. Loos, and A. Volkl Rat Liver Peroxisomes after Fibrate Treatment: A SURVEY USING QUANTITATIVE MASS SPECTROMETRY J. Biol. Chem., August 10, 2007; 282(32): 23055 - 23069. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. A. Innis siteFiNDER|3D: a web-based tool for predicting the location of functional sites in proteins Nucleic Acids Res., July 13, 2007; 35(suppl_2): W489 - W494. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Lopez, A. Valencia, and M. L. Tress firestar--prediction of functionally important residues using structural templates and alignment reliability Nucleic Acids Res., July 13, 2007; 35(suppl_2): W573 - W577. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C. H. Wu UniRef: comprehensive and non-redundant UniProt reference clusters Bioinformatics, May 15, 2007; 23(10): 1282 - 1288. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Feng and E. R.M. Tillier A fast and flexible approach to oligonucleotide probe design for genomes and gene families Bioinformatics, May 15, 2007; 23(10): 1195 - 1202. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. NG Kwang Loong and S. K. Mishra Unique folding of precursor microRNAs: Quantitative evidence and implications for de novo identification RNA, February 1, 2007; 13(2): 170 - 187. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Lopez, A. Valencia, and M. Tress FireDB--a database of functionally important residues from proteins of known structure Nucleic Acids Res., January 12, 2007; 35(suppl_1): D219 - D223. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














