© Oxford University Press

CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases

Giorgio Grillo, Marcella Attimonelli 1, Sabino Liuni and Graziano Pesole 1,2

Centro di studio sui Mitocondri e Metabolismo Energetico, CNR
1 Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy

2 To whom correspondence should be addressed. E-mail:graziano{at}ava.-bu.cnr.it

A key concept in comparing sequence collections is the issue of redundancy. The production of sequence collections free from redundancy is undoubtedly very useful, both in performing statistical analyses and in accelerating extensive database searching on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data carries a high risk of assigning high significance to non-significant patterns. In order to carry out unbiased statistical analysis as well as more efficient database searching, it is thus necessary to analyse sequence data that have been purged of redundancy. Given that an unambiguous definition of redundancy is impracticable for biological sequence data, in the present program a quantitative description of redundancy will be used, based on the measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlap with a longer sequence in the database greater than a threshold fixed by the user.

In this paper we present a new algorithm based on an "approximate string matching" procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and to generate automatically nucleotide sequence collections free from redundancies.
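The redundancy criterion stated above (similarity and overlap with a longer sequence both exceeding user-fixed thresholds) can be illustrated with a minimal Python sketch. This is not the CLEANUP algorithm: the abstract does not specify its approximate string matching procedure, so the standard library's `difflib.SequenceMatcher` is swapped in as a stand-in. The function names, the exact definitions of "similarity" and "overlap", the default thresholds of 0.9, and the greedy choice of comparing each sequence only against longer sequences already kept are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity_and_overlap(short, long_):
    """Return (similarity, overlap) of `short` against `long_`.

    Assumed definitions (not taken from the paper):
      overlap    = fraction of `short` spanned by the matched region
      similarity = matched characters / length of that spanned region
    """
    matcher = SequenceMatcher(None, short, long_, autojunk=False)
    # The final matching block is a zero-length sentinel; drop it.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0.0, 0.0
    matched = sum(b.size for b in blocks)
    span = blocks[-1].a + blocks[-1].size - blocks[0].a
    return matched / span, span / len(short)


def remove_redundant(seqs, min_similarity=0.9, min_overlap=0.9):
    """Return the names of the sequences kept, longest first.

    `seqs` maps a name to a nucleotide string. A sequence is dropped
    when both measures against a longer, already kept sequence are
    greater than the thresholds (a greedy simplification).
    """
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    kept = []
    for name, seq in ordered:
        redundant = False
        for _, kept_seq in kept:
            sim, ov = similarity_and_overlap(seq, kept_seq)
            if sim > min_similarity and ov > min_overlap:
                redundant = True
                break
        if not redundant:
            kept.append((name, seq))
    return [name for name, _ in kept]


if __name__ == "__main__":
    example = {
        "a": "ACGTACGTACGTACGTACGT",
        "b": "ACGTACGTACGT",        # fully contained in "a"
        "c": "TTTTGGGGCCCCAAAA",    # unrelated to "a"
    }
    print(remove_redundant(example))
```

The all-against-all comparison in this sketch is quadratic in the number of sequences and is meant only to make the threshold logic concrete; it says nothing about how the published program achieves its speed.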


Received on June 5, 1995; revised on October 2, 1995; accepted on October 2, 1995
