

Bioinformatics Advance Access originally published online on June 18, 2008
Bioinformatics 2008 24(17):1949-1950; doi:10.1093/bioinformatics/btn313

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models

Art F. Y. Poon 1,*, Fraser I. Lewis 2, Simon D. W. Frost 1 and Sergei L. Kosakovsky Pond 1

1 Division of Comparative Pathology and Medicine, Department of Pathology, University of California, San Diego, CA 92103, USA and 2 Epidemiology Research Unit, Scottish Agricultural College, Inverness, Scotland, IV2 4JZ, UK

*To whom correspondence should be addressed.


    ABSTRACT

Spidermonkey is a new component of the Datamonkey suite of phylogenetic tools that provides methods for detecting coevolving sites from a multiple alignment of homologous nucleotide or amino acid sequences. It reconstructs the substitution history of the alignment by maximum likelihood-based phylogenetic methods, and then analyzes the joint distribution of substitution events using Bayesian graphical models to identify significant associations among sites.

Availability: Spidermonkey is publicly available both as a web application at http://www.data-monkey.org and as a stand-alone component of the phylogenetic software package HyPhy, which is freely distributed on the web (http://www.hyphy.org) as precompiled binaries and open source.

Contact: afpoon@ucsd.edu


    1 INTRODUCTION
Detection of coevolving residues in a protein by the comparative analysis of homologous gene sequences is an important source of evidence for the functional and/or structural characterization of proteins. Similarly, comparative analysis of non-coding nucleotide sequences can reveal secondary structure, e.g. stem-loops in ribosomal RNAs. By failing to address the evolutionary nature of sequence variation, however, such methods are susceptible to spurious associations between sites due to identity by descent (Felsenstein, 1985). Additionally, pairwise association tests cannot capture higher order interactions and do not provide a means for compiling the ‘big picture’ from a list of significant pairs. Spidermonkey provides an easy-to-use web interface to a framework for detecting coevolving sites from coding and non-coding nucleotide or protein sequences, which combines phylogenetic and machine learning techniques to address these issues (Poon et al., 2007).


    2 METHODS
The history of substitution events is inferred from an alignment using standard phylogenetic methods. If a tree is not uploaded with the alignment, then one is estimated using the neighbor-joining method (Saitou and Nei, 1987). A substitution model corresponding to the user-defined data type (nucleotide/codon/protein) is fitted to these data by maximum likelihood and the inferred ancestral sequences are used to map substitution events to branches in the tree (Kosakovsky Pond and Frost, 2005c). Replicate sets of ancestral sequences can be resampled from the posterior probability distribution and analyzed in parallel. For codon data, only non-synonymous substitutions are retained for further analysis. Invariant sites are automatically excluded in all cases. Correlated patterns of substitutions in the tree imply coevolution among sites. The joint distribution of substitutions in the tree is encoded as a binary state matrix, in which each row corresponds to a unique branch and each column to a site in the alignment, and is analyzed using Bayesian graphical models (BGMs).
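The encoding described above can be illustrated with a minimal Python sketch. It is not the HyPhy implementation: the function name substitution_matrix, the parent_of dictionary and the toy sequences are hypothetical, and the sketch assumes that ancestral sequences have already been reconstructed for every internal node. It ignores the codon-specific step of retaining only non-synonymous substitutions.

```python
from typing import Dict, List, Tuple


def substitution_matrix(
    parent_of: Dict[str, str],
    sequences: Dict[str, str],
) -> Tuple[List[str], List[int], List[List[int]]]:
    """Return (branches, variable_sites, matrix).

    matrix[i][j] is 1 if the state at variable_sites[j] differs between
    node branches[i] and its parent, i.e. a substitution is mapped to the
    branch leading to that node, and 0 otherwise.
    """
    # One branch per non-root node, named after its child node.
    branches = sorted(parent_of)
    length = len(next(iter(sequences.values())))
    # Exclude invariant sites: the same state at every node of the tree.
    variable_sites = [
        s for s in range(length)
        if len({seq[s] for seq in sequences.values()}) > 1
    ]
    matrix = [
        [int(sequences[child][s] != sequences[parent_of[child]][s])
         for s in variable_sites]
        for child in branches
    ]
    return branches, variable_sites, matrix


# Toy tree ((A,B)N1,C)root with hypothetical reconstructed ancestors.
parent_of = {"N1": "root", "A": "N1", "B": "N1", "C": "root"}
sequences = {
    "root": "ACGT",
    "N1": "ACGA",
    "A": "TCGA",
    "B": "ACGA",
    "C": "ACCT",
}
branches, variable_sites, matrix = substitution_matrix(parent_of, sequences)
for name, row in zip(branches, matrix):
    print(name, row)
```

In this toy example the second alignment column is invariant and is dropped, so the matrix has four rows (branches) and three columns (sites).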

A BGM is a compact representation of a joint probability distribution in which each node represents a distinct random variable (Pearl, 1988). An edge originating from ‘parent’ node P and terminating in ‘child’ node C postulates a conditional dependence between the corresponding sites, i.e. C is ‘influenced’ by P. We use the order-MCMC algorithm (Friedman and Koller, 2003) to infer the configuration of edges in the graph that best explains the data. Due to limited computing resources, we restrict BGM analyses on Spidermonkey to 150 sequences and 1000 nodes if k=1 or 75 nodes if k=2, where k is the maximum number of parents per node. Spidermonkey executes a single MCMC run with a burn-in period of 10^4 steps followed by 10^5 steps, sampled at regular intervals of 10^3 steps. We have found these default settings to provide sufficient conditions for convergence and sampling.
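To make the role of the parent limit k concrete, the following Python sketch scores candidate parent sets for a single child node and picks the best one. It is an illustration under stated assumptions, not the order-MCMC algorithm itself: it performs an exhaustive search for one node, and it uses the K2 marginal likelihood for binary variables as the scoring metric, which is our assumption because the text above does not name one. The function names k2_log_score and best_parent_set and the toy data are hypothetical.

```python
from itertools import combinations
from math import lgamma
from typing import Dict, List, Sequence, Tuple


def k2_log_score(
    data: Sequence[Sequence[int]], child: int, parents: Tuple[int, ...]
) -> float:
    """Log K2 marginal likelihood of a binary child given its parents."""
    # Count child states (0/1) within each observed parent configuration.
    counts: Dict[Tuple[int, ...], List[int]] = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        counts.setdefault(key, [0, 0])[row[child]] += 1
    score = 0.0
    for n0, n1 in counts.values():
        # K2 term for r = 2 states: (r-1)! / (N + r - 1)! * n0! * n1!
        score += lgamma(2) - lgamma(n0 + n1 + 2) + lgamma(n0 + 1) + lgamma(n1 + 1)
    return score


def best_parent_set(
    data: Sequence[Sequence[int]], child: int, max_parents: int
) -> Tuple[int, ...]:
    """Exhaustively search parent sets of size 0..max_parents (the limit k)."""
    others = [s for s in range(len(data[0])) if s != child]
    candidates: List[Tuple[int, ...]] = [()]
    for size in range(1, max_parents + 1):
        candidates.extend(combinations(others, size))
    return max(candidates, key=lambda ps: k2_log_score(data, child, ps))


# Toy branch-by-site matrix: sites 0 and 1 always change together,
# site 2 changes independently of both.
data = [
    [1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1],
    [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0],
]
print(best_parent_set(data, child=1, max_parents=1))
print(best_parent_set(data, child=2, max_parents=1))
```

The number of candidate parent sets per node grows combinatorially with k, which is one way to see why a much smaller node limit applies when k=2 than when k=1.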


Fig. 1. Flowchart diagram of the Spidermonkey pipeline. Abbreviations: SLAC=single likelihood ancestor counting; FEL=fixed effects likelihood; IFEL=internal FEL; REL=random effects likelihood (Kosakovsky Pond and Frost, 2005c); PARRIS=a partitioning approach for robust inference of selection (Scheffler et al., 2006); GA-Branch=genetic algorithm for detecting branch-specific selection (Kosakovsky Pond and Frost, 2005b).

 

    3 IMPLEMENTATION
A web interface was constructed using custom Perl CGI and HyPhy batch language scripts (Kosakovsky Pond et al., 2005). It was tested on the web browsers Safari, Firefox, Konqueror and Internet Explorer, and on the computing platforms Mac OS X, Red Hat Linux, Windows XP Professional for 32- and 64-bit architectures and Windows 2003 Server. Presently, Spidermonkey is hosted on a Linux cluster comprising 20 quad-processor computing nodes. Its functionality is also available as a prepackaged analysis in HyPhy, which can be downloaded and run on local machines. Preprocessing of uploaded alignments (supporting NEXUS, PHYLIP, MEGA and FASTA formats), estimation of tree topology, MPI-enabled model selection, and nucleotide and codon model fitting are handled using modified pre-existing scripts in the Datamonkey system (Kosakovsky Pond and Frost, 2005a; Fig. 1). The alignment, tree and analysis results are cached on our server for up to 96 h and can be retrieved from a temporary webpage with a randomized identifier.

The inferred distribution of substitutions in the tree is transferred to the Spidermonkey BGM scripts (Fig. 1). The subset of sites to be analyzed as a BGM can be arbitrary or determined by a user-defined threshold in the following statistics on substitutions per site: (1) raw count; (2) percentage of branches affected or (3) information entropy. The analysis reports edges with marginal posterior probabilities exceeding a default cutoff of 0.5, which may be reset to a user-defined value. A visualization of the graph (Gansner and North, 2000) can be exported in PNG, Postscript or PDF formats.
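The site filters and the edge cutoff described above can be sketched in Python as follows. This is a hedged illustration rather than the Spidermonkey code: the function names are hypothetical, and the entropy statistic is assumed here to be the binary Shannon entropy (in bits) of the fraction of branches carrying a substitution at a site, which the text does not define explicitly.

```python
from math import log2
from typing import Dict, List, Sequence, Tuple


def site_statistics(matrix: Sequence[Sequence[int]]) -> List[Dict[str, float]]:
    """Per-site statistics from a binary branch-by-site matrix."""
    n_branches = len(matrix)
    stats = []
    for column in zip(*matrix):
        count = sum(column)
        fraction = count / n_branches
        if 0 < fraction < 1:
            # Assumed definition: binary entropy of the substitution fraction.
            entropy = -(fraction * log2(fraction)
                        + (1 - fraction) * log2(1 - fraction))
        else:
            entropy = 0.0
        stats.append(
            {"count": count, "percent": 100.0 * fraction, "entropy": entropy}
        )
    return stats


def select_sites(
    matrix: Sequence[Sequence[int]], statistic: str, threshold: float
) -> List[int]:
    """Indices of sites whose chosen statistic meets the threshold."""
    return [
        j for j, s in enumerate(site_statistics(matrix))
        if s[statistic] >= threshold
    ]


def report_edges(
    edge_posteriors: Dict[Tuple[int, int], float], cutoff: float = 0.5
) -> List[Tuple[int, int]]:
    """Edges whose marginal posterior probability exceeds the cutoff."""
    return sorted(e for e, p in edge_posteriors.items() if p > cutoff)


# Toy matrix: 4 branches by 3 sites.
toy_matrix = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
print(select_sites(toy_matrix, "count", 2))
# Hypothetical edge posteriors keyed by (parent site, child site).
print(report_edges({(0, 1): 0.93, (1, 2): 0.5, (0, 2): 0.12}))
```

The cutoff comparison is strict here because the text speaks of probabilities "exceeding" the cutoff; an edge with posterior exactly 0.5 is therefore not reported in this sketch.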


    4 DISCUSSION
Rapid algorithms that use phylogenetic methods to detect coevolving sites from sequence data are a critical resource for the accurate exploratory analysis of biological variation. Spidermonkey is a key new component of our Datamonkey suite of bioinformatic tools, providing intuitive web access to cutting-edge methods for detecting coevolving sites.


    ACKNOWLEDGEMENTS
We thank Selene Zarate and León Martínez-Castilla for their assistance in beta-testing.

Funding: This work was supported by grants AI43638, AI47745 and AI57167 from the National Institutes of Health, and by University of California San Diego Centers for AIDS Research / National Institute of Allergy and Infectious Disease (NIAID) developmental awards AI36214 to S.D.W.F. and S.L.K.P. F.I.L. received financial support from the Wellcome Trust.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on April 15, 2008; revised on June 6, 2008; accepted on June 15, 2008

    REFERENCES

    Felsenstein J. Phylogenies and the comparative method. Am. Nat. (1985) 125:1–15.

    Friedman N, Koller D. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Mach. Learn. (2003) 50:95–125.

    Gansner E, North S. An open graph visualization system and its applications to software engineering. In: Software: Practice and Experience. (2000) Chichester, New York: John Wiley & Sons, Ltd.

    Kosakovsky Pond SL, Frost SDW. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics (2005a) 21:2531–2533.

    Kosakovsky Pond SL, Frost SDW. A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol. Biol. Evol. (2005b) 22:478–485.

    Kosakovsky Pond SL, Frost SDW. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol. Biol. Evol. (2005c) 22:1208–1222.

    Kosakovsky Pond SL, et al. HyPhy: hypothesis testing using phylogenies. Bioinformatics (2005) 21:676–679.

    Pearl J. Probabilistic reasoning in intelligent systems: networks of plausible inference. (1988) San Mateo, CA: Morgan Kaufmann Publishers. 552.

    Poon AFY, et al. An evolutionary-network model reveals stratified interactions in the V3 loop of the HIV-1 envelope. PLoS Comput. Biol. (2007) 3:e231.

    Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. (1987) 4:406–425.

    Scheffler K, et al. Robust inference of positive selection from recombining coding sequences. Bioinformatics (2006) 22:2493–2499.





