CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases
Centro di studio sui Mitocondri e Metabolismo Energetico, CNR
1 Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
2 To whom correspondence should be addressed. E-mail: graziano{at}ava.-bu.cnr.it
A key concept in comparing sequence collections is redundancy. Sequence collections free from redundancy are very useful both for performing statistical analyses and for accelerating extensive database searches on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Statistical analysis of such biased data carries a high risk of assigning high significance to non-significant patterns. To carry out unbiased statistical analysis, as well as more efficient database searching, it is thus necessary to analyse sequence data that have been purged of redundancy. Given that an unambiguous definition of redundancy is impracticable for biological sequence data, the present program uses a quantitative description of redundancy based on a measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlap with a longer sequence in the database greater than a threshold fixed by the user.
In this paper we present a new algorithm based on an "approximate string matching" procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and to generate automatically nucleotide sequence collections free from redundancies.
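The idea of purging a collection by thresholded similarity to a longer sequence can be illustrated with a minimal Python sketch. It is not the CLEANUP implementation, whose details are not given here. It assumes a plain edit-distance form of approximate string matching (the shorter sequence is aligned against the best-matching substring of the longer one), it collapses the overlap criterion into the requirement that the whole shorter sequence be aligned, and the names `best_match_distance`, `similarity` and `remove_redundant` are hypothetical.

```python
def best_match_distance(pattern, text):
    """Smallest edit distance between `pattern` and any substring of `text`.

    Dynamic programming over the text, one column at a time. Row 0 is
    always 0 so that a match may start at any position in `text`.
    """
    prev = list(range(len(pattern) + 1))  # column for the empty text prefix
    best = prev[-1]
    for ch in text:
        curr = [0]
        for i, p in enumerate(pattern, 1):
            cost = 0 if p == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in text
                            curr[i - 1] + 1))    # extra character in pattern
        best = min(best, curr[-1])
        prev = curr
    return best


def similarity(short, long_):
    """Fraction of `short` that aligns to `long_` (assumed measure)."""
    if not short:
        return 1.0
    return 1.0 - best_match_distance(short, long_) / len(short)


def remove_redundant(sequences, threshold=0.9):
    """Keep a sequence only if no already kept, longer (or equally long)
    sequence matches it with similarity >= `threshold` (user-defined)."""
    kept = []
    # Longest first, so each sequence is compared only with longer ones.
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(similarity(seq, k) >= threshold for k in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    collection = [
        "ACGTACGTTAGCCGATAGGC",  # longest entry
        "GTTAGCCGAT",            # exact fragment of the longest entry
        "GTTAGCTGAT",            # same fragment with one substitution
        "TTTTTTTTTT",            # unrelated sequence
    ]
    print(remove_redundant(collection, threshold=0.85))
```

This sketch compares every sequence with every retained sequence, so its cost grows quadratically with the size of the collection; a program intended for whole databases would need a faster comparison step.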
Received on June 5, 1995; revised on October 2, 1995; accepted on October 2, 1995





