Bioinformatics Vol. 18 no. 1 2002
Pages 100-108
© 2002 Oxford University Press

Integrated gene and species phylogenies from unaligned whole genome protein sequences

Gary W. Stuart *, Karen Moffett and Steve Baker

Department of Life Sciences, Indiana State University, Terre Haute, IN 47809, USA

Received on April 2, 2001; revised on July 6, 2001; accepted on July 20, 2001

Motivation: Most molecular phylogenies are based on sequence alignments. Consequently, they fail to account for modes of sequence evolution that involve frequent insertions or deletions. Here we present a method for generating accurate gene and species phylogenies from whole genome sequence that makes use of short character string matches not placed within explicit alignments. In this work, the singular value decomposition of a sparse tetrapeptide frequency matrix is used to represent the proteins of organisms uniquely and precisely as vectors in a high-dimensional space. Vectors of this kind can be used to calculate pairwise distance values based on the angle separating the vectors, and the resulting distance values can be used to generate phylogenetic trees. Protein trees so derived can be examined directly for homologous sequences. Alternatively, vectors defining each of the proteins within an organism can be summed to provide a vector representation of the organism, which is then used to generate species trees.
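The pipeline described above (tetrapeptide counts, singular value decomposition, angle-based distances, and summed protein vectors per organism) can be illustrated with a minimal sketch. This is not the authors' aacode3 or cosdist code. It assumes NumPy, a dense matrix in place of the sparse one, made-up toy sequences and species labels, a hypothetical choice of retained rank, and `1 - cos(angle)` as the distance; the abstract only says the distance is based on the angle between vectors.

```python
import numpy as np

def kmer_count_matrix(seqs, k=4):
    """Peptide-by-protein count matrix of overlapping k-mers (dense here;
    the paper uses a sparse matrix)."""
    vocab = {}
    entries = []
    for j, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            row = vocab.setdefault(seq[i:i + k], len(vocab))
            entries.append((row, j))
    A = np.zeros((len(vocab), len(seqs)))
    for row, j in entries:
        A[row, j] += 1.0
    return A

def protein_vectors(A, rank):
    """One row per protein in the reduced space spanned by the top
    `rank` singular triplets (rank is an assumed tuning parameter)."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = min(rank, len(s))
    return Vt[:r].T * s[:r]

def species_vectors(P, owners):
    """Sum the protein vectors belonging to each organism."""
    names = sorted(set(owners))
    owners = np.array(owners)
    S = np.array([P[owners == n].sum(axis=0) for n in names])
    return names, S

def cosine_distance_matrix(V):
    """Pairwise 1 - cos(angle); assumes no row of V is all zeros."""
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    return 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)

# Toy data (invented for illustration): two proteins per species.
records = [
    ("sp_A", "MKVLAAGIVGLLLA"), ("sp_A", "PQRSTWYHHEDCNN"),
    ("sp_B", "MKVLAAGIVGALLA"), ("sp_B", "PQRSTWYHHEDCNQ"),
    ("sp_C", "WWCCHHMMFFYYPP"), ("sp_C", "DEDEKRKRNQNQST"),
]
owners = [name for name, _ in records]
A = kmer_count_matrix([seq for _, seq in records], k=4)
P = protein_vectors(A, rank=4)
names, S = species_vectors(P, owners)
D = cosine_distance_matrix(S)   # species-by-species distance matrix
```

In this toy set, sp_A and sp_B share most of their tetrapeptides while sp_C shares none with either, so sp_A should come out closer to sp_B than to sp_C. Applying `cosine_distance_matrix` to `P` instead of `S` would give the protein-level distances from which gene trees are built.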

Results: Using a large mitochondrial genome dataset, we have produced species trees that are largely in agreement with previously published trees based on the analysis of identical datasets using different methods. These trees also agree well with currently accepted phylogenetic theory. In principle, our method could be used to compare much larger bacterial or nuclear genomes in full molecular detail, ultimately allowing accurate gene and species relationships to be derived from a comprehensive comparison of complete genomes. Unlike phylogenetic methods based on alignments, this approach would tend to keep sequences that evolve by relative insertion or deletion recognizably similar.

Availability: Both the program used to convert properly formatted sequence files into sparse n-gram matrices (aacode3) and the program used to generate PHYLIP-compatible pairwise distance matrices from the Singular Value Decomposition (SVD) output (cosdist) are available at http://mama.indstate.edu/user/stuart. The SVD package is available at http://www.netlib.org/svdpack/index.html, and the PHYLIP package is available at http://evolution.genetics.washington.edu/phylip.html.
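As a rough illustration of what a PHYLIP-compatible pairwise distance matrix can look like, the sketch below writes a square matrix in the classic PHYLIP layout: the number of taxa on the first line, then one row per taxon with the name padded to 10 characters. The function name `format_phylip`, the number formatting, and the example values are assumptions made for illustration; the exact output of cosdist is not described here.

```python
def format_phylip(names, dist):
    """Return a square distance matrix as PHYLIP-style text.

    names: list of taxon labels (truncated or padded to 10 characters)
    dist:  list of rows, one per taxon, same order as names
    """
    if len(names) != len(dist):
        raise ValueError("need one row of distances per name")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        label = name[:10].ljust(10)
        values = " ".join(f"{d:8.5f}" for d in row)
        lines.append(f"{label} {values}")
    return "\n".join(lines) + "\n"

# Hypothetical distances, invented for this example only.
example_names = ["sp_A", "sp_B", "sp_C"]
example_dist = [
    [0.0, 0.227, 1.0],
    [0.227, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
phylip_text = format_phylip(example_names, example_dist)
```

Text in this form could be saved to a file and passed to a distance-based tree-building program in the PHYLIP package.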

Contact: G-Stuart@indstate.edu

* To whom correspondence should be addressed.






