Bioinformatics Advance Access originally published online on March 22, 2007
Bioinformatics 2007 23(10):1282-1288; doi:10.1093/bioinformatics/btm098
UniRef: comprehensive and non-redundant UniProt reference clusters
Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, USA
*To whom correspondence should be addressed.
Abstract
Motivation: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences.
Results: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis.
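As an illustration of the UniRef100 idea described above (merging identical sequences and subfragments into a single entry), the following is a minimal sketch and not the UniRef production procedure. The function names `collapse_identical_and_subfragments` and `size_reduction`, the toy accessions and the naive quadratic substring scan are all assumptions made for this example; clustering at 90 or 50% identity would need an alignment-based method that is not shown here.

```python
from typing import Dict, List


def collapse_identical_and_subfragments(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Toy UniRef100-style merge (hypothetical helper, not the real pipeline).

    Each sequence that is identical to, or a contiguous subfragment of,
    an already chosen representative is added to that representative's cluster.
    """
    # Process longest sequences first so a representative is always seen
    # before its fragments; ties are broken by accession for determinism.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters: Dict[str, List[str]] = {}
    for acc, seq in ordered:
        for rep in clusters:
            if seq in seqs[rep]:  # identical or contained subfragment
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]  # no container found: new representative
    return clusters


def size_reduction(n_source: int, n_clusters: int) -> float:
    """Percentage reduction in entry count relative to the source set."""
    return 100.0 * (1 - n_clusters / n_source)


# Invented toy data: P2 is identical to P1, P3 is a subfragment of P1.
toy = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQR",
    "P3": "TAYIAK",
    "P4": "MSDNELLQ",
}
clusters = collapse_identical_and_subfragments(toy)
reduction = size_reduction(len(toy), len(clusters))
```

With this toy input, four source sequences collapse into two clusters, so the entry count drops by half. The percentages reported in the abstract come from the real source sequence set and cannot be reproduced from this sketch.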
Availability: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref
Contact: bes23@georgetown.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: Alex Bateman
Received on January 25, 2007; revised on March 2, 2007; accepted on March 7, 2007



