Bioinformatics Vol. 17 no. 3 2001
Pages 282-283
© 2001 Oxford University Press
Applications Note
Clustering of highly homologous sequences to reduce the size of large protein databases
1 San Diego Supercomputer Center, La Jolla, CA 92093, USA
2 The Burnham Institute, La Jolla, CA 92037, USA
Received on October 4, 2000; revised on November 1, 2000; accepted on November 6, 2000
Summary: We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.
Availability: The program is available from http://bioinformatics.burnham-inst.org/cd-hi
Contact: liwz{at}sdsc.edu or adam{at}burnham-inst.org
* To whom correspondence should be addressed.
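
The summary does not spell out the clustering procedure, so the following is only an illustrative sketch of one common way to reduce a database to representative sequences at a chosen identity level: a greedy pass that visits sequences from longest to shortest and assigns each to the first representative it matches. The function names `identity` and `cluster`, the 0.9 default threshold, and the use of Python's `difflib` as a crude stand-in for a real alignment-based identity are all assumptions made for this example, not the program's actual implementation.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude stand-in for alignment identity (assumption for this sketch):
    residues in common matching blocks divided by the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy clustering keyed by representative name.

    Sequences are visited longest first. Each one joins the first existing
    representative whose identity meets the threshold; otherwise it becomes
    the representative of a new cluster.
    """
    clusters: dict[str, list[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    example = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSR",
        "c": "GGGGGWWWWWPPPPPCCCCC",
    }
    for rep, members in cluster(example).items():
        print(rep, members)
```

The keys of the returned dictionary play the role of the reduced, representatives-only database described in the summary. A sketch like this compares every sequence against every representative, so it would be far too slow for hundreds of thousands of sequences; it is meant only to make the idea of identity-threshold clustering concrete.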