Bioinformatics Vol. 18 no. 1 2002
Pages 100-108
© 2002 Oxford University Press
Integrated gene and species phylogenies from unaligned whole genome protein sequences
Department of Life Sciences, Indiana State University, Terre Haute, IN 47809, USA
Received on April 2, 2001; revised on July 6, 2001; accepted on July 20, 2001
Motivation: Most molecular phylogenies are based on sequence alignments. Consequently, they fail to account for modes of sequence evolution that involve frequent insertions or deletions. Here we present a method for generating accurate gene and species phylogenies from whole genome sequence that makes use of short character string matches not placed within explicit alignments. In this work, the singular value decomposition of a sparse tetrapeptide frequency matrix is used to represent the proteins of organisms uniquely and precisely as vectors in a high-dimensional space. Vectors of this kind can be used to calculate pairwise distance values based on the angle separating the vectors, and the resulting distance values can be used to generate phylogenetic trees. Protein trees so derived can be examined directly for homologous sequences. Alternatively, vectors defining each of the proteins within an organism can be summed to provide a vector representation of the organism, which is then used to generate species trees.
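The pipeline described above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a dense NumPy SVD rather than a sparse solver, scales each protein's coordinates by the singular values, and takes one minus the cosine of the angle between vectors as the distance. The function names, the `rank` parameter, and the exact distance formula are all hypothetical choices made for the example.

```python
import numpy as np


def tetrapeptide_counts(sequences, k=4):
    """Count overlapping k-mers; returns a (peptides x proteins) matrix.

    Assumes every sequence has at least k residues.
    """
    index = {}      # k-mer -> row number, assigned in order of first appearance
    columns = []    # one {row: count} dict per protein
    for seq in sequences:
        counts = {}
        for i in range(len(seq) - k + 1):
            row = index.setdefault(seq[i:i + k], len(index))
            counts[row] = counts.get(row, 0) + 1
        columns.append(counts)
    matrix = np.zeros((len(index), len(sequences)))
    for j, counts in enumerate(columns):
        for row, n in counts.items():
            matrix[row, j] = n
    return matrix


def protein_vectors(matrix, rank):
    """Represent each protein by its leading `rank` SVD coordinates.

    Scaling by the singular values is an assumption of this sketch.
    """
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:rank].T * s[:rank]   # one row per protein


def cosine_distance_matrix(vectors):
    """Pairwise distance defined here as 1 - cos(angle between vectors)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)


def species_vectors(vectors, owners):
    """Sum the protein vectors belonging to each organism."""
    species = sorted(set(owners))
    summed = np.array([
        vectors[np.array([o == sp for o in owners])].sum(axis=0)
        for sp in species
    ])
    return species, summed
```

Calling `cosine_distance_matrix(protein_vectors(tetrapeptide_counts(seqs), rank))` gives a protein-by-protein distance matrix. Applying the same distance function to the output of `species_vectors` gives the organism-level matrix from which a species tree would be built.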
Results: Using a large mitochondrial genome dataset, we have produced species trees that are largely in agreement with previously published trees derived from the same datasets by different methods. These trees also agree well with currently accepted phylogenetic theory. In principle, our method could be used to compare much larger bacterial or nuclear genomes in full molecular detail, ultimately allowing accurate gene and species relationships to be derived from a comprehensive comparison of complete genomes. Unlike alignment-based phylogenetic methods, this approach should leave sequences that diverge by insertion or deletion recognizably similar.
Availability: Both the program used to convert properly formatted sequence files into sparse n-gram matrices (aacode3) and the program used to generate PHYLIP compatible pairwise distance matrices from the Singular Value Decomposition (SVD) output (cosdist) are available at http://mama.indstate.edu/user/stuart. The SVD package is available at http://www.netlib.org/svdpack/index.html, and the PHYLIP package is available at http://evolution.genetics.washington.edu/phylip.html.
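For readers unfamiliar with the distance-matrix input that PHYLIP programs read, the sketch below writes a square matrix in that layout: a first line giving the number of taxa, then one row per taxon with the name in the first ten characters followed by its distances. It is an illustrative stand-in, not the cosdist program; the function name and the six-decimal formatting are assumptions.

```python
def format_phylip_square(names, distances):
    """Return a square distance matrix as PHYLIP-style text.

    Names are truncated or padded to exactly 10 characters, as the
    classic PHYLIP format expects.
    """
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, distances):
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file and passed to a PHYLIP distance-based tree program.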
Contact: G-Stuart{at}indstate.edu
* To whom correspondence should be addressed.