Bioinformatics Vol. 19 no. 4 2003
Pages 513-523
© 2003 Oxford University Press
Alignment-free sequence comparison: a review
1 Department of Biometry & Epidemiology,
Medical University of South Carolina, 135 Cannon Street, Suite 303,
PO Box 250835, Charleston, SC 29425, USA
2 Biomathematics Group, ITQB, Universidad Nova Lisboa,
PO Box 127, 2780-156 Oeiras, Portugal
Received on July 15, 2002; revised on September 27, 2002; accepted on October 6, 2002
Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed.
Results: The overwhelming majority of work on alignment-free sequence comparison has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence into fixed-length word segments. The first category is based on the statistics of word frequency, on distances in the Cartesian space spanned by the frequency vectors, and on the information content of the frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as pre-selection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment.
Availability: Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available at http://bioinformatics.musc.edu/resources.html
Contact: almeidaj@musc.edu; svinga@itqb.unl.pt
* To whom correspondence should be addressed.
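
As a rough illustration of the two categories described in the Results, the following Python sketch computes a Euclidean distance between word-frequency vectors and a compression-based distance. It is not the authors' MATLAB code: all function names, the word length k = 3, and the use of zlib as a crude stand-in for Kolmogorov complexity are assumptions made only for this example.

```python
import zlib
from collections import Counter
from itertools import product
from math import sqrt


def word_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequency of every length-k word, counted with overlap.

    Words containing symbols outside `alphabet` are counted in the total
    but have no slot in the returned vector (an assumption of this sketch).
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than the word length k")
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts[w] / total for w in words]


def euclidean_distance(seq_a, seq_b, k=3):
    """Euclidean distance between the two word-frequency vectors."""
    fa = word_frequencies(seq_a, k)
    fb = word_frequencies(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def compression_distance(seq_a, seq_b):
    """Normalized compression distance, with zlib-compressed length
    used as an approximation of Kolmogorov complexity (hypothetical choice)."""
    ca = len(zlib.compress(seq_a.encode()))
    cb = len(zlib.compress(seq_b.encode()))
    cab = len(zlib.compress((seq_a + seq_b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)


# Toy sequences: two blocks, then the same blocks in swapped order.
block_a = "ACGTTGCA" * 5
block_b = "GGATCCAT" * 5
original = block_a + block_b
shuffled = block_b + block_a   # contiguity lost, word content retained
unrelated = "A" * 80

print("original vs shuffled :", euclidean_distance(original, shuffled))
print("original vs unrelated:", euclidean_distance(original, unrelated))
print("compression distance :", compression_distance(original, shuffled))
```

In this toy case the shuffled sequence differs from the original only in the words that span the block junction, so its word-frequency distance stays small even though the block order is reversed. On real sequences the choice of word length and of compressor would need more care than this sketch gives it.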