Bioinformatics Advance Access originally published online on January 18, 2005
Bioinformatics 2005 21(9):2106-2107; doi:10.1093/bioinformatics/bti274
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mtreemix: a software package for learning and using mixture models of mutagenetic trees
1Max Planck Institute for Informatics Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
2Institute of Virology, University of Cologne Fürst-Pückler-Strasse 56, 50935 Köln, Germany
3Center of Advanced European Studies and Research Ludwig-Erhard-Allee 2, 53175 Bonn, Germany
4Max Planck Institute of Molecular Plant Physiology Am Mühlenberg 1, 14476 Golm, Germany
*To whom correspondence should be addressed at Department of Mathematics, University of California, 898 Evans Hall, Berkeley, CA 94720, USA.
| Abstract |
|---|
|
|
|---|
Summary: Mixture models of mutagenetic trees constitute a class of probabilistic models for describing evolutionary processes that are characterized by the accumulation of permanent genetic changes. They have been applied to model the accumulation of chromosomal gains and losses in tumor development and the development of drug resistance-associated mutations in the HIV genome.
Mtreemix is a software package for estimating mutagenetic trees mixture models from observed cross-sectional data and for using these models for predictions. We provide programs for model fitting, model selection, simulation, likelihood computation and waiting time estimation.
Availability: Mtreemix, including source code, documentation, sample data files and precompiled Solaris and Linux binaries, is freely available for non-commercial users at http://mtreemix.bioinf.mpi-sb.mpg.de/
Contact: niko{at}math.berkeley.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Several evolutionary processes are adequately described as the accumulation of non-reversible genetic changes. For example, the progression of tumor development of colorectal cancer can be regarded as the accumulation of chromosomal gains or losses (Vogelstein et al., 1988). These alterations can be detected experimentally by comparative genomic hybridization. Similarly, the development of meningioma has been described as a clonal evolutionary process starting from the set of complete chromosomes and characterized by subsequent chromosomal gains and losses (Zang, 2001).
Amino acid substitutions may also be modeled as permanent under certain conditions. For example, human immunodeficiency virus (HIV) is exposed to strong selective pressure under antiviral drug therapy. In the rapidly evolving viral population advantageous mutations that confer resistance to the drugs are fixed and virtually never lost under continuous drug pressure (Shafer et al., 2000). Thus, the development of drug resistance in the HIV genome can be regarded as the accumulation of resistance-conferring mutations.
More generally, we consider clonal evolutionary processes on a finite set of events. An event can be a genetic alteration, such as the loss of a chromosome arm or an amino acid substitution, but could also be any phenotypic change. We assume that the occurrence of events is non-reversible in the considered time period. Such cumulative evolutionary processes have been modeled by weighted branchings (Desper et al., 1999). These directed tree models define a probability distribution on the set of all patterns of events. Branchings provide an intuitive model of directed dependencies between events and their time of occurrence. Model estimation relies on cross-sectional rather than longitudinal (time series) data. The single tree model (also known as oncogenetic tree) has been extended to mixtures of trees (so-called mutagenetic trees mixture models) in order to capture more complex evolutionary scenarios (Beerenwinkel et al., 2004).
Mtreemix is a software package for analyzing observed event occurrences by means of mutagenetic trees mixture models. We provide efficient C code for a variety of tasks, including model fitting, model selection and comparison, likelihood computation, simulation and computation of the waiting time distribution for all patterns of events (Table 1).
|
| 2 PROGRAMS |
|---|
|
|
|---|
Observed patterns of events are encoded in a N x
matrix, where each row corresponds to an observation and each column to an event. Matrix entries 0 and 1 denote the absence and the presence of an event, respectively. Missing data is encoded by 1. The program fit takes the data matrix and the number K of tree components as input and computes a K-mutagenetic trees mixture model. Estimation of a single tree is based on solving an instance of the maximum weight branching problem by a combinatorial algorithm. The mixture model is fitted iteratively by an EM algorithm that in the E-step, assigns the data to the tree components and estimates the missing data and in the M-step, fits the trees on the respective subsets (Beerenwinkel et al., 2004). The resulting graphical model is written out in the DOT network description language for visualization by the graphviz package (http://www.graphviz.org) (Gansner and North, 1999) and is saved for further analyses and predictions. The likelihood of a pattern in a single tree can be computed by a breadth-first search. The program loglike computes the log-likelihood in a mixture model as the weighted sum of the single tree values. The program select estimates the out-of-sample likelihood for different values of K by cross-validation runs on the data matrix. The results can be used for choosing the optimal number of trees K (model selection). Model variance can be analyzed with the bootstrap program by fitting models on resampled data. Confidence intervals of model parameters and edge counts in the bootstraped trees are reported.
Distr computes the distribution encoded by the model, i.e. the probability of all 2
patterns on
events. The program compare provides a measure of goodness of fit by comparing these density estimates between different models.
Timing is introduced into the models by assuming independent Poisson processes on the edges of the trees. The program time computes the Poisson rates either for a fixed time point at which the samples have been observed or for an exponentially distributed sampling time. Furthermore, by simulating the waiting process along the edges of the trees the distributions of waiting times for all patterns can be determined empirically. To draw samples from a model with and without their waiting times the procedures wait and sim, respectively, can be used.
Finally, the program transprob allows for computing the transition probability between any two patterns.
| 3 IMPLEMENTATION |
|---|
|
|
|---|
All programs are implemented in C on the basis of the LEDA package for efficient data types and algorithms (Mehlhorn and Näher, 1999). Tree reconstruction is based on an implementation of Edmonds' branching algorithm (Edmonds, 1967) that solves the maximum weight branching problem in
time. Most operations, including simulation, computation of likelihoods and transition probabilities, are based on variations of tree traversals in the single trees. Since these operations are linear in the number of trees K, their time complexity is
. The EM algorithm for model fitting takes
time in each step of the iteration. Program options are invoked via command line switches. For example,
- fit -f filestem -K 3
| 4 CONCLUSION |
|---|
|
|
|---|
Mtreemix provides efficient C programs for learning mutagenetic trees mixture models from cross-sectional data. The package allows for estimating the expected time to occurrence of specific patterns of events associated, for example, with disease progression or poor treatment outcome.
| Acknowledgments |
|---|
N.B. acknowledges funding by DFG under grant No. HO 1582/1-3.
Received on August 18, 2004; revised on December 21, 2004; accepted on January 11, 2005
| REFERENCES |
|---|
|
|
|---|
Beerenwinkel, N., Rahnenführer, J., Däumer, M., Hoffmann, D., Kaiser, R., Selbig, J., Lengauer, T. (2004) Learning multiple evolutionary pathways from cross-sectional data. Proceedings of the 8th Annual International Conference on Research in Computational Molecular BiologySan Diego, CA , pp. 3644.
Desper, R., Jiang, F., Kallioniemi, O.-P., Moch, H., Papadimitriou, C.H., Schäffer, A.A. (1999) Inferring tree models for oncogenesis from comparative genome hybridization data. J. Comp. Biol., 6, 3751.
Edmonds, J. (1967) Optimum branchings. J. Res. Natl Bur. Stand., 71B, 233240.
Gansner, E.R. and North, S.C. (1999) An open graph visualization system and its applications to software engineering. Softw. Pract. Exp., 30, 12031233.
Mehlhorn, K. and Näher, S. The LEDA Platform for Combinatorial and Geometric Computing, (1999) , Cambridge, UK Cambridge University Press.
Shafer, R.W., Kantor, R., Gonzales, M.J. (2000) The genetic basis of HIV-1 resistance to reverse transcriptase and protease inhibitors. AIDS Rev., 2, 211228.
Vogelstein, B., Fearon, E., Hamilton, S. (1988) Genetic alterations during colorectal-tumor development. N. Engl. J. Med., 319, 525532[Abstract].
Zang, K.D. (2001) Meningioma: a cytogenetic model of a complex benign human tumor, including data on 394 karyotyped cases. Cytogenet. Cell Genet., 93, 207220[CrossRef][ISI][Medline].
This article has been cited by other articles:
![]() |
J. Bogojeska, A. Alexa, A. Altmann, T. Lengauer, and J. Rahnenfuhrer Rtreemix: an R package for estimating evolutionary pathways and genetic progression scores Bioinformatics, October 15, 2008; 24(20): 2391 - 2392. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Ahlenstiel, K. Roomp, M. Daumer, J. Nattermann, M. Vogel, J. K. Rockstroh, N. Beerenwinkel, R. Kaiser, H.-D. Nischalke, T. Sauerbruch, et al. Selective Pressures of HLA Genotypes and Antiviral Therapy on Human Immunodeficiency Virus Type 1 Sequence Mutation at a Population Level Clin. Vaccine Immunol., October 1, 2007; 14(10): 1266 - 1273. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Beerenwinkel and M. Drton A mutagenetic tree hidden Markov model for longitudinal clonal HIV sequence data Biostat., January 1, 2007; 8(1): 53 - 71. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Beerenwinkel, T. Sing, T. Lengauer, J. Rahnenfuhrer, K. Roomp, I. Savenkov, R. Fischer, D. Hoffmann, J. Selbig, K. Korn, et al. Computational methods for the design of effective therapies against drug resistant HIV strains Bioinformatics, November 1, 2005; 21(21): 3943 - 3950. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Rahnenfuhrer, N. Beerenwinkel, W. A. Schulz, C. Hartmann, A. von Deimling, B. Wullich, and T. Lengauer Estimating cancer survival and clinical outcome based on genetic tumor progression scores Bioinformatics, May 15, 2005; 21(10): 2438 - 2446. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


