Skip Navigation


Bioinformatics Advance Access originally published online on June 16, 2005
Bioinformatics 2005 21(16):3429-3430; doi:10.1093/bioinformatics/bti548
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3429    most recent
bti548v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Smith, M.
Right arrow Articles by Ouzounis, C. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Smith, M.
Right arrow Articles by Ouzounis, C. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

MagicMatch—cross-referencing sequence identifiers across databases

Mike Smith {dagger}, Victor Kunin {ddagger}, Leon Goldovsky , Anton J. Enright § and Christos A. Ouzounis *

Computational Genomics Group, The European Bioinformatics Institute EMBL Cambridge Outstation, Cambridge CB10 1SD, UK

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 IMPLEMENTATION
 LIMITATIONS
 DISCUSSION
 AVAILABILITY
 REFERENCES
 

Motivation: At present, mapping of sequence identifiers across databases is a daunting, time-consuming and computationally expensive process, usually achieved by sequence similarity searches with strict threshold values.

Summary: We present a rapid and efficient method to map sequence identifiers across databases. The method uses the MD5 checksum algorithm for message integrity to generate sequence fingerprints and uses these fingerprints as hash strings to map sequences across databases. The program, called MagicMatch, is able to cross-link any of the major sequence databases within a few seconds on a modest desktop computer.

Availability: MagicMatch is available at the following URL (http://cgg.ebi.ac.uk/services/magicmatch/), including an interactive service for major databases and binary downloads for widely used platforms.

Contact: ouzounis{at}ebi.ac.uk


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 IMPLEMENTATION
 LIMITATIONS
 DISCUSSION
 AVAILABILITY
 REFERENCES
 
One of the most labour-intensive yet inevitable tasks in bioinformatics is mapping sequence identifiers across various databases. Numerous databases were designed for specific user requirements and are based on diverse schemas and entry accession keys (usually in the form of identifiers). Moreover, sequence identifiers may be altered between release versions of the same database and often the previous identifiers cannot be easily traced. A few resources—either public, such as NCBI's Entrez (Wheeler et al., 2005), or commercial, such as EMBL's SRS (Etzold and Argos, 1993), were developed to facilitate linking operations across databases. All these involve heavy and sustained resource utilization, extensive computation and do not readily allow rapid mapping of user-specified databases.

At present, the linking of user-specified databases is usually achieved by performing sequence similarity searches that identify which sequence entries are identical or highly similar to each other. Each sequence in a ‘query’ database is searched against other ‘reference’ databases using sequence database search methods such as FASTA (Pearson, 1990) or BLAST (Altschul et al., 1997), which can detect proteins of identical length and primary structure. The sequence identifiers of the query and reference entries are subsequently linked. This method is particularly inefficient owing to high computational cost yet is indispensable and, therefore, utilized by many groups working in bioinformatics.

Here we report an extremely rapid mechanism to map accessions among any sequence databases, which is based on the MD5 checksum algorithm for message integrity (Rivest, 1992). The MD5 algorithm takes as input a message of arbitrary length and produces a 128-bit fingerprint of the message, usually encoded as a 32-character hexadecimal number. The basic premise is that it is virtually infeasible to produce two messages having the same fingerprint or, conversely, produce a message that can potentially generate an prespecified fingerprint (Abzug, 1992 http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html). MD5 has been extensively used in data integrity checking to ensure error-free transmission across information channels, and is more reliable than other similar methods (Rivest, 1992).

In the context of bioinformatics, the identity of the sequence (and thus its resulting fingerprint) can be entirely encoded by its primary structure, i.e. the sequence of amino acid residues. Two identical sequences will then result in identical MD5 checksum fingerprints. In fact, checksums (e.g. the 64-bit CRC checksum) have been used in bioinformatics to ensure that the sequence information in a database is not modified. Also, 32-bit checksums have been used to generate non-redundant databases (Holm and Sander, 1998), yet these programs are not widely available—one such example being the nrdb program (W. Gish, unpublished). Other clustering programs have been used to identify redundancies, including CD-HIT (Li et al., 2001) and BlastClust (unpublished, available at http://www.ncbi.nlm.nih.gov/BLAST/docs/blastclust.html)— available from the current BLAST distribution (Altschul et al., 1997) (see below, for a comparison of performance). These, however, have not been developed with cross-referencing in mind: they are able to identify highly similar sequences which might be either excluded from a database or included in groupings for further analysis. The advantage of checksum algorithms, based on raw sequence information, is that they are able to generate unique identifiers based on the checksum fingerprints. Further, despite the fact that the possibility of a duplicate fingerprint occurring randomly cannot be excluded, the probability of this happening is miniscule. It is not possible to estimate these probabilities readily (http://www.woodmann.com/crackz/Tutorials/Md5info.htm).

Using MD5 checksums, speed is achieved by avoiding any pairwise search procedure, which is replaced by a hashing mechanism. The hashing procedure is realized by the representation of protein sequences as a 128-bit MD5 fingerprint, obtained from the MD5 algorithm (Rivest, 1992).


    IMPLEMENTATION
 TOP
 Abstract
 INTRODUCTION
 IMPLEMENTATION
 LIMITATIONS
 DISCUSSION
 AVAILABILITY
 REFERENCES
 
We have implemented the procedure of sequence fingerprinting, hashing and comparison of databases in a C++ program, named MagicMatch. MagicMatch receives as an input a FASTA-formatted sequence database and creates identifier-to-fingerprint tables, and/or uses these hash tables to cross-link sequence identifiers across databases. The execution of MagicMatch is extremely rapid, and in our tests on a Intel Xeon 2.8 GHz computer with 2 GB of memory under the Linux operating system, calculating cross-referencing the entire COGENT database (Janssen et al., 2003), containing 684 736 protein sequence entries from 178 complete genomes, to the complete Uniprot database (Bairoch et al., 2005), with 1.52 million sequences, takes 41 s, including the computation of fingerprints and their subsequent matching.

To compare performance with two other (clustering) programs, namely CD-HIT (Li et al., 2001) and BlastClust (Altschul et al., 1997), we obtained two identical sets of 1000 randomly selected sequences and identified duplicates through the three programs: BlastClust takes 68.75 s, CD-HIT takes 2.44 s, while MagicMatch takes 0.04 s (>1000-fold and 61-fold speeded-up, respectively).


    LIMITATIONS
 TOP
 Abstract
 INTRODUCTION
 IMPLEMENTATION
 LIMITATIONS
 DISCUSSION
 AVAILABILITY
 REFERENCES
 
MagicMatch is not a sequence comparison procedure and is designed solely to recognize identical sequences. It links all identical sequences across databases irrespective of sequence origin or species/strain information. It will not match sequences that are at least one character different, and thus it is unsuitable for linking sequence fragments or processed and truncated proteins to their full-length precursors. However, since these are not identical sequences they should not be marked as ‘identical’ across databases. Other procedures based on exact matching fragments by overlapping sequences are available.


    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 IMPLEMENTATION
 LIMITATIONS
 DISCUSSION
 AVAILABILITY
 REFERENCES
 
We encourage all public sequence databases to use MagicMatch to calculate MD5 checksums for all their entries. This practice will allow an instant and consistent linking between all publicly available resources and totally eliminate the problem of database cross-referencing based on sequence content. Also, allowing a database search with sequence fingerprints can be a faster way of accessing database entries when the identifier of a protein sequence is unknown. The method is widely applicable to other database entries used in bioinformatics research, including DNA sequences and protein structures, given the general nature of the MD5 checksum algorithm for message integrity (Rivest, 1992).


    AVAILABILITY
 TOP
 Abstract
 INTRODUCTION
 IMPLEMENTATION
 LIMITATIONS
 DISCUSSION
 AVAILABILITY
 REFERENCES
 
MagicMatch binary downloads for Windows, Linux, Apple OS X and Solaris are available from http://cgg.ebi.ac.uk/services/magicmatch/


    Acknowledgments
 
We thank members of the Computational Genomics Group for discussions and the anonymous reviewers for comments.

Conflict of Interest: none declared.


    Footnotes
 
{dagger}Present address: Mullard Space Science Laboratory, Holmbury St Mary, Dorking, Surrey RH5 6NT, UK Back

{ddagger}Present address: Microbial Ecology Program, DoE Joint Genome Institute, 2800 Mitchell Drive, Building 400-404, Walnut Creek, CA 94598, USA Back

§Present address: Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK Back

Received on February 8, 2005; revised on June 13, 2005; accepted on June 15, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 IMPLEMENTATION
 LIMITATIONS
 DISCUSSION
 AVAILABILITY
 REFERENCES
 

    Abzug, M.T. (1992) MD5 Homepage (unofficial).

    Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402[Abstract/Free Full Text].

    Bairoch, A., et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159[Abstract/Free Full Text].

    Etzold, T. and Argos, P. (1993) SRS—an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci., 9, 49–57[Abstract/Free Full Text].

    Holm, L. and Sander, C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429[Abstract/Free Full Text].

    Janssen, P., et al. (2003) COmplete GENome Tracking (COGENT): a flexible data environment for computational genomics. Bioinformatics, 19, 1451–1452[Abstract/Free Full Text].

    Li, W., et al. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283[Abstract/Free Full Text].

    Pearson, W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol., 183, 63–98[Web of Science][Medline].

    Rivest, R. (1992) RFC 1321: the MD5 message-digest algorithm. Internet Engineering Task Force.

    Wheeler, D.L., et al. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 33, D39–D45[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
M. D. McDowall, M. S. Scott, and G. J. Barton
PIPs: human protein-protein interaction prediction database
Nucleic Acids Res., January 1, 2009; 37(suppl_1): D651 - D656.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
P. D. Karp, C. A. Ouzounis, C. Moore-Kochlacs, L. Goldovsky, P. Kaipa, D. Ahren, S. Tsoka, N. Darzentas, V. Kunin, and N. Lopez-Bigas
Expansion of the BioCyc collection of pathway/genome databases to 160 genomes
Nucleic Acids Res., October 24, 2005; 33(19): 6083 - 6089.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
L. Goldovsky, P. Janssen, D. Ahren, B. Audit, I. Cases, N. Darzentas, A. J. Enright, N. Lopez-Bigas, J. M. Peregrin-Alvarez, M. Smith, et al.
CoGenT++: an extensive and extensible data environment for computational genomics
Bioinformatics, October 1, 2005; 21(19): 3806 - 3810.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3429    most recent
bti548v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Smith, M.
Right arrow Articles by Ouzounis, C. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Smith, M.
Right arrow Articles by Ouzounis, C. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?