Bioinformatics Advance Access originally published online on December 4, 2006
Bioinformatics 2007 23(4):507-508; doi:10.1093/bioinformatics/btl613
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

cBrother: relaxing parental tree assumptions for Bayesian recombination detection

Fang Fang 1, Jing Ding 4, Vladimir N. Minin 5, Marc A. Suchard 5,6 and Karin S. Dorman 1,2,3,*

1 Bioinformatics and Computational Biology Program, Iowa State University Ames, IA 50011, USA
2 Department of Statistics, Iowa State University Ames, IA 50011, USA
3 Department of Genetics, Development and Cell Biology, Iowa State University Ames, IA 50011, USA
4 Ohio State University Medical Center Columbus, OH, 43220, USA
5 Department of Biomathematics, David Geffen School of Medicine at UCLA Los Angeles, CA 90095, USA
6 Department of Human Genetics, David Geffen School of Medicine at UCLA Los Angeles, CA 90095, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Bayesian multiple change-point models accurately detect recombination in molecular sequence data. Previous Java-based implementations assume a fixed topology for the representative parental data. cBrother is a novel C language implementation that capitalizes on reduced computational time to relax the fixed tree assumption. We show that cBrother is 19 times faster than its predecessor and that the fixed tree assumption can influence estimates of recombination in a medically relevant dataset.

Availability: cBrother can be freely downloaded from http://www.biomath.org/dormanks/ and can be compiled on Linux, Macintosh and Windows operating systems. Online documentation and a tutorial are also available at the site.

Contact: kdorman@iastate.edu


    INTRODUCTION
The past 20 years have yielded myriad methods for detecting rare recombination events among divergent molecular sequences. The most common methods are phylogeny-based, inferring recombination by identifying discordant phylogenetic relationships along the sequences. The Bayesian multiple change-point (MCP) model is one such approach that simultaneously locates crossover-points (COPs) and identifies possible parental genotypes while assessing statistical support for recombination (Suchard et al., 2002). The Java package DualBrothers implements recombination detection via the Bayesian MCP (Minin et al., 2005).

To dramatically reduce the topology space and computational complexity, MCP models generally assume that a fixed and known topology relates all parental genotype sequences. Unfortunately, the fixed tree assumption fails when recombination among genotypes is possible, such as in HIV (Paraskevis et al., 2003). Even when genotype relationships are stable, only a single recombinant can be analyzed, and extensive topological uncertainty within genotypes has prohibited the inclusion of multiple representative sequences per genotype. We implement a novel version of the MCP model, in C for native compilation, that relaxes the fixed parental tree assumption and uses improved likelihood calculations to substantially reduce computational run-time. cBrother both runs faster and eliminates some current restrictions of MCP models.


    SOFTWARE DESCRIPTION
As input, cBrother takes an alignment of N + Q DNA/RNA sequences in the Phylip format and a command file. The first N sequences are representatives for P possible parental genotypes and the last Q sequences are putative recombinant sequences. Users specify the underlying evolutionary model, priors for model parameters and Markov chain Monte Carlo (MCMC) conditions in the command file. Restarting previous chains via check-pointing is also now possible and is a useful tool for achieving MCMC convergence and crash recovery.
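
The input layout described above can be illustrated with a minimal sketch that reads an alignment and separates the parental representatives from the putative recombinants. It assumes a relaxed, sequential Phylip file (one whitespace-separated name and sequence per line); the function names `read_relaxed_phylip` and `split_parents_and_queries` are hypothetical and are not part of cBrother, whose own parser may accept other Phylip variants.

```python
def read_relaxed_phylip(text):
    """Parse a sequential, relaxed Phylip alignment (assumed layout).

    First non-blank line: '<number of sequences> <number of sites>'.
    Each following line: '<name> <sequence>'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split()[:2])
    records = []
    for ln in lines[1:1 + n_taxa]:
        name, seq = ln.split(None, 1)
        seq = "".join(seq.split())  # drop spaces inside the sequence
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        records.append((name, seq))
    if len(records) != n_taxa:
        raise ValueError("fewer sequences than the header announces")
    return records


def split_parents_and_queries(records, n_queries):
    """Return (first N parental representatives, last Q putative recombinants)."""
    if not 0 < n_queries < len(records):
        raise ValueError("n_queries must be between 1 and len(records) - 1")
    return records[:-n_queries], records[-n_queries:]
```

With an alignment of N + Q = 4 sequences and Q = 1, the second function returns the first three records as parental representatives and the last record as the query.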

The user can invoke the usual fixed parental tree assumption, specify only a fixed genotype tree or avoid all fixed tree assumptions using the command file option parent_tree. Setting parent_tree to a pre-estimated topology τ_N with N terminal nodes specifies a fixed topology relating all N representative sequences. Specifying instead a topology τ_P with only P terminal nodes fixes just the genotype relationships. Now the set of parental trees {τ_n} consists of all possible N-taxa trees where representative sequences from the same genotype form monophyletic clades, but the branching order within genotypes varies. When the parent_tree option is set to ‘none’, the set of parental trees is similarly constructed, except the relationship among genotypes is no longer constrained. In all cases, the complete topology space includes all topologies produced by attaching the Q putative recombinants anywhere in tree τ_N or all trees in {τ_n}.
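
To give a sense of how strongly a fixed parental tree shrinks the topology space, the following sketch counts topologies under the assumption that trees are unrooted and strictly bifurcating (an assumption made here for illustration; the text above does not state it). Under that assumption a tree with n leaves has 2n - 3 branches, so a query can attach in 2n - 3 places, and there are (2n - 5)!! distinct topologies on n leaves. The function names are hypothetical.

```python
def count_unrooted_topologies(n_leaves):
    """Number of unrooted bifurcating topologies on n leaves: (2n - 5)!!."""
    total = 1
    for k in range(3, 2 * n_leaves - 4, 2):
        total *= k
    return total


def count_attachments(n_parents, n_queries=1):
    """Ways to attach queries one after another to a fixed parental tree.

    Each attachment picks one of the 2n - 3 branches of the current tree,
    and every attached query adds one leaf.
    """
    total = 1
    for i in range(n_queries):
        total *= 2 * (n_parents + i) - 3
    return total


# Fixed tree on 8 parental sequences plus one query: 13 candidate topologies.
print(count_attachments(8))
# No constraint at all on 8 + 1 sequences: 135135 candidate topologies.
print(count_unrooted_topologies(9))
```

Constraining only the genotype tree falls between these two extremes, because branching order within each genotype clade is free while the clades themselves are held together.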

Experience with DualBrothers demonstrates that >90% of computational time is spent on likelihood calculations. Any small improvement in these calculations saves tremendous run-time. Current MCP models employ evolutionary models in which tree branch lengths are integrated out analytically. Exploiting this integration, cBrother computes and caches the finite-time transition probability matrix only once per likelihood calculation. Previous samplers recomputed this matrix along each branch and for each site of the sequence alignment.
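
The caching idea can be sketched as follows. This is not cBrother's code: it assumes a Jukes-Cantor substitution model and an exponential prior with mean `mean_branch_length` on each branch length, chosen only because the integrated transition probabilities then have a simple closed form, E[exp(-4t/3)] = 1 / (1 + 4μ/3). Because the branch length is integrated out, one matrix serves every branch and every site, so it can be built once per likelihood evaluation. A two-sequence likelihood keeps the example short.

```python
import math

BASES = "ACGT"


def integrated_jc69_matrix(mean_branch_length):
    """Jukes-Cantor transition matrix averaged over an exponential branch prior."""
    decay = 1.0 / (1.0 + 4.0 * mean_branch_length / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def pairwise_log_likelihood(seq_a, seq_b, mean_branch_length):
    """Log-likelihood of two aligned sequences joined by one integrated branch."""
    # Built once per likelihood call and reused for every site below,
    # rather than being recomputed inside the loop.
    matrix = integrated_jc69_matrix(mean_branch_length)
    index = {base: i for i, base in enumerate(BASES)}
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        total += math.log(0.25 * matrix[index[a]][index[b]])
    return total
```

In a full tree likelihood the same cached matrix would be passed to the per-branch, per-site recursion instead of being rebuilt there.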


    SPEED-UP
We compare the run-time of cBrother to its predecessor while testing for recombination in HIV sequence L11793 (GenBank). The 1480 bp alignment contains the putative recombinant and eight representative parental sequences. For comparison purposes, we employ both samplers to draw inference under identical models with the fixed parental tree assumption and default transition kernel options. We generate MCMC chains with 51 000 steps and discard the first 5000 steps as burn-in. Standard diagnostics suggest adequate convergence and mixing under these conditions. cBrother takes 56 s ± 2.7 s (mean ± standard deviation, based on 10 independent runs) to simulate its chain, while DualBrothers takes 17 min 53 s ± 39.8 s. These results indicate that, through better caching and native compilation, cBrother is ~19 times faster than DualBrothers. Better caching alone accounts for 15% of the improvement.
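
The reported speed-up follows directly from the two mean run-times. The short calculation below only restates that arithmetic using the figures quoted above; the helper names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min s' timing to seconds."""
    return 60 * minutes + seconds


def speedup(slow_seconds, fast_seconds):
    """Ratio of the slower run-time to the faster one."""
    return slow_seconds / fast_seconds


dualbrothers = to_seconds(17, 53)   # 1073 s
cbrother = 56                       # s
retained_steps = 51_000 - 5_000     # steps kept after burn-in

print(f"speed-up: {speedup(dualbrothers, cbrother):.1f}x")
print(f"retained steps: {retained_steps}")
```

The ratio is 1073 / 56, or roughly 19.2, which matches the ~19-fold figure in the text.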


    FIXED PARENTAL TREE IMPACT
HIV sequence U88823 (GenBank) is a putative genotype A1/C recombinant virus isolated from a Rwandan patient (Gao et al., 1998), but the evolutionary relationship between genotypes A1 and C varies along the genome (Anderson et al., 2000). To examine the impact of relaxing the parental tree, we consider a full-length alignment of U88823 with the consensus sequences of A1, C and three other randomly chosen genotypes. We run two independent chains under each model and check-point incrementally until stringent convergence is achieved. The final MCMC chains contain 30 000 000 steps when estimating the genotype tree and 10 000 000 when assuming a fixed genotype tree. The extra samples needed to estimate genotype trees reduce, but do not eliminate, the speed advantage of cBrother.
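
The workflow of extending chains in increments until they agree can be sketched as below. The text does not say which convergence diagnostic was used; this sketch substitutes the Gelman-Rubin potential scale reduction factor on a scalar summary of each chain (for example the log-likelihood trace) as one common choice. The `extend` callback stands in for restarting a sampler from its check-point and is hypothetical, as are the function names and the threshold.

```python
import math


def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for equal-length chains of a scalar summary."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    if within == 0.0:
        return 1.0 if between == 0.0 else math.inf
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


def run_until_converged(chains, extend, threshold=1.01, max_rounds=100):
    """Extend every chain by one increment per round until R-hat < threshold.

    `extend(chain)` must append new samples in place, as a restart from a
    check-point would. Returns (rounds used, last R-hat).
    """
    r_hat = math.inf
    for rounds in range(1, max_rounds + 1):
        for chain in chains:
            extend(chain)
        r_hat = potential_scale_reduction(chains)
        if r_hat < threshold:
            return rounds, r_hat
    return max_rounds, r_hat
```

A stricter threshold or additional diagnostics on COP counts and topology indicators would be natural refinements for a sampler over discrete structures.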

Both models confirm that isolate U88823 is an A1/C recombinant with very high posterior probability >0.999. Figure 1 reports the genotype assignment to each region of U88823 along with estimated median COP locations and their posterior support. Here, COPs indicate locations where the query's nearest neighbor changes. All COPs are well supported (posterior support >0.95) under both models, but COP locations do not perfectly align. To quantify the difference between location estimates for the two models, we reconstruct posterior conditional location distributions for each MCMC run. These conditional distributions describe COP locations among those posterior samples that have a matching COP within a liberal range of the specified medians. The two pairs of distributions generated under the same model (fixed or relaxed parental tree) are not significantly different (P-value > 0.05 by Wilcoxon Mann–Whitney test of medians). However, distributions across models differ (P ≪ 0.001), indicating that relaxing the fixed parental tree assumption can lead to significantly altered estimates. In particular, conditional distributions for the second COP (see Fig. 1) are strikingly different. Accurate estimates of COP locations are necessary to understand the effects of primary and secondary sequence characteristics on promoting recombination (Galetto et al., 2004; Moumen et al., 2003). Since the difference in medians is almost twice the length of the sequence bound to reverse transcriptase when recombination occurs, an uncertainty this large could impact downstream analyses.
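
The comparison of conditional location distributions can be sketched in two steps: keep, from each posterior draw, the COP that falls within a window around the reported median, then compare two such samples with a Wilcoxon Mann-Whitney test. The window width, the data layout (one list of COP positions per posterior draw) and the use of a normal approximation with tie correction are assumptions made for illustration; the original analysis may differ in each respect.

```python
import math


def conditional_locations(draws, median, window):
    """From each posterior draw, keep the COP nearest the median if within window."""
    kept = []
    for cops in draws:
        near = [c for c in cops if abs(c - median) <= window]
        if near:
            kept.append(min(near, key=lambda c: abs(c - median)))
    return kept


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, p-value).
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        size = j - i
        tie_term += size ** 3 - size
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0.0:
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2.0))
```

Applied to two runs under the same model, a large p-value would be consistent with the runs sampling the same location distribution; applied across models, a very small p-value would reflect the kind of shift reported for the second COP.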


Fig. 1 Estimated recombinant structure for isolate U88823 under a fixed and relaxed parental tree. We report inferred genotypes, median COP locations and their posterior support in brackets. Inference at the second COP is significantly altered, as shown by the location distributions obtained using a fixed (white) or relaxed (gray) tree.

 

    CONCLUSION
cBrother's improved speed, check-pointing and ability to handle topological variation permit the analysis of larger or more complex datasets with improved accuracy. With growing numbers of recombinant sequences available, cBrother's ability to analyze multiple recombinants will also prove useful for illuminating recombinant origins.


    Acknowledgments
 
This work was supported by NIH grant GM068955.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on August 17, 2006; revised on November 27, 2006; accepted on November 28, 2006

    REFERENCES

    Anderson, J.P., et al. (2000) Testing the hypothesis of a recombinant origin of human immunodeficiency virus type 1 subtype E. J. Virol., 74, 10752–10765.

    Galetto, R., et al. (2004) The structure of HIV-1 genomic RNA in the gp120 gene determines a recombination hot spot in vivo. J. Biol. Chem., 279, 36625–36632.

    Gao, F., et al. (1998) A comprehensive panel of near-full-length clones and reference sequences for non-subtype B isolates of human immunodeficiency virus type 1. J. Virol., 72, 5680–5698.

    Minin, V.N., et al. (2005) Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics, 21, 3034–3042.

    Moumen, A., et al. (2003) Evidence for a mechanism of recombination during reverse transcription dependent on the structure of the acceptor RNA. J. Biol. Chem., 278, 15973–15978.

    Paraskevis, D., et al. (2003) Analysis of the evolutionary relationships of HIV-1 and SIVcpz sequences using Bayesian inference: implications for the origin of HIV-1. Mol. Biol. Evol., 20, 1986–1996.

    Suchard, M.A., et al. (2002) Oh brother, where art thou? A Bayes factor test for recombination with uncertain heritage. Syst. Biol., 51, 715–728.

