Bioinformatics Advance Access originally published online on January 31, 2007
Bioinformatics 2007 23(7):898-900; doi:10.1093/bioinformatics/btm027
COPYCAT: cophylogenetic analysis tool
1Center for Bioinformatics (ZBIT), Sand 14, Tübingen and 2Organismic Botany/Mycology, Auf der Morgenstelle 1, Tübingen, University of Tübingen, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: We have developed the software COPYCAT, which provides easy and fast access to cophylogenetic analyses. It incorporates a wrapper for the program PARAFIT, which conducts a statistical test for the presence of congruence between host and parasite phylogenies. COPYCAT offers various features, such as the creation of customized host–parasite association data and the computation of phylogenetic host/parasite trees based on the NCBI taxonomy.
Availability: COPYCAT and its manual are freely available at http://www-ab.informatik.uni-tuebingen.de/software/copycat.
Contact: auch{at}informatik.uni-tuebingen.de
Supplementary information: Results of the real-world example can be found at http://www-ab.informatik.uni-tuebingen.de/software/copycat or Bioinformatics online.
| 1 INTRODUCTION |
|---|
A core question in the evolutionary biology of mutualists and parasites is whether their phylogenies are congruent with the phylogenies of their hosts (Page, 2002). Only a handful of programs are currently available to conduct statistical tests for the presence of a cophylogenetic structure, among them the topology-based programs TreeMap and TreeFitter (see review in Stevens, 2004). Extensive simulations have shown that the tests implemented in PARAFIT, both for overall phylogenetic congruence and for the significance of individual associations, are statistically well-behaved and have acceptable error rates (Legendre et al., 2002).
PARAFIT's command line interface is straightforward, but makes the program somewhat hard to use, especially for inexperienced users. Further, it accepts numeric input only and requires additional software such as DISTPCOA (Legendre and Anderson, 1998) to compute eigenvalues, and often also a third program to compute patristic distances from phylogenetic trees. To overcome these limitations and to present this valuable program to a broader audience, we have developed an easy-to-use GUI frontend application.
In a typical usage scenario, the user has to deliver three matrices to PARAFIT: a host-parasite association matrix as well as a host and a parasite distance matrix, usually derived from a phylogenetic tree. Frequently, specific marker sequences such as 16S-rRNA or ITS are used to derive a phylogeny for host and parasite species. However, this limits a study to species for which there is a common marker gene available. By using COPYCAT, the distance matrices can be derived from the NCBI taxonomy (NCBI, 2006).
| 2 THE PROGRAM AND ITS FEATURES |
|---|
2.1 Pre-processing of input data
In the first step, the user has to provide a text file in which each host–parasite association is represented by a line containing the species names (or taxon IDs) of the host and parasite, separated by a tab symbol. Pre-existing data sets can easily be converted to this format.
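The input format described above can be sketched in a few lines of Python. This is only an illustration, not COPYCAT's own code: the function names `read_associations` and `association_matrix` are hypothetical, the two example associations are chosen for illustration, and the orientation of the binary matrix (hosts as rows, parasites as columns) is an assumption.

```python
import csv
import io

def read_associations(handle):
    """Read tab-separated host/parasite pairs, one association per line."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or incomplete lines rather than guessing their meaning.
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs

def association_matrix(pairs):
    """Build a binary matrix (assumed layout: hosts as rows, parasites as columns)."""
    hosts = sorted({h for h, _ in pairs})
    parasites = sorted({p for _, p in pairs})
    row = {h: i for i, h in enumerate(hosts)}
    col = {p: j for j, p in enumerate(parasites)}
    matrix = [[0] * len(parasites) for _ in hosts]
    for h, p in pairs:
        matrix[row[h]][col[p]] = 1
    return hosts, parasites, matrix

# Illustrative input: host name, tab, parasite name.
example = io.StringIO("Zea mays\tUstilago maydis\nHordeum vulgare\tUstilago hordei\n")
pairs = read_associations(example)
hosts, parasites, matrix = association_matrix(pairs)
```

Reading from a file works the same way, for example by passing an open file handle instead of the `io.StringIO` object.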
If the user chooses to derive host or associate distances from the NCBI taxonomy, the program tries to determine the appropriate taxon ID by matching the given entries with the NCBI data. Unassigned taxa (due to misspellings or missing taxonomic entries) are highlighted, so the user can correct them in place or specify a taxon ID manually. The most recent NCBI taxonomy data can be downloaded directly from the NCBI ftp server (NCBI, 2006) by selecting the corresponding menu entry. In further filter dialogs, the user can include or exclude certain systematic divisions as defined by NCBI, or perform a taxonomic reduction to genera or families.
Often a taxonomy is not as well resolved as, e.g. a molecular phylogenetic tree. Since this may also affect the resolution of the PARAFIT analyses, COPYCAT allows the user to compute resolution and cladistic information content (Thorley and Page, 2000) of the constructed trees and then to decide whether to run the cophylogenetic test.
An over-representation of some species within the association data (e.g. widespread parasites) may have a great influence on the statistical test for global significance. To detect those species, COPYCAT implements a model based on the broken stick distribution, which reflects the relative abundance of species within a random population of a given size (Legendre and Legendre, 1998). Over-represented species could then be excluded from further analysis.
2.2 Configuration and execution of PARAFIT
The user can either supply his/her own phylogenetic tree in Newick format, or his/her own distance matrices in extended PHYLIP format (e.g. inferred from a sequence alignment), or instruct the application to derive a tree from the NCBI taxonomy database. The latter is done as follows: each taxon is assigned its path to the root taxon of the taxonomy. The distance between two taxa is then inferred by counting all taxa that are included in exactly one of the two paths.
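Counting the taxa that lie on exactly one of two root paths amounts to taking the size of the symmetric difference of the two path sets. A minimal sketch, assuming the taxonomy is available as a child-to-parent mapping and that each path includes the taxon itself (both assumptions; the toy taxonomy and function names are illustrative and far shallower than the real NCBI data):

```python
def path_to_root(taxon, parent):
    """Return the set of taxa on the path from `taxon` up to the root."""
    path = set()
    while taxon is not None:
        path.add(taxon)
        taxon = parent.get(taxon)  # the root maps to None, ending the loop
    return path

def taxonomy_distance(a, b, parent):
    """Number of taxa lying on exactly one of the two root paths."""
    return len(path_to_root(a, parent) ^ path_to_root(b, parent))

# Toy child -> parent mapping (hypothetical, heavily simplified).
parent = {
    "root": None,
    "Poales": "root",
    "Poaceae": "Poales",
    "Cyperaceae": "Poales",
    "Zea": "Poaceae",
    "Hordeum": "Poaceae",
    "Carex": "Cyperaceae",
}

d_same_family = taxonomy_distance("Zea", "Hordeum", parent)   # 2
d_same_order = taxonomy_distance("Zea", "Carex", parent)      # 4
```

Taxa that share more of their lineage get smaller distances: the two genera in the same family differ only in themselves, while genera in different families also differ in the family nodes.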
If user-supplied distance matrices include more taxa than the association matrix, COPYCAT offers the option to reduce the distance matrices accordingly. In the next step, COPYCAT invokes DISTPCOA to compute eigenvectors from the distance matrices, which PARAFIT requires as input. Several methods for the correction of negative eigenvalues are available. The user can then start PARAFIT with a specified number of permutations. Alternatively, COPYCAT can generate a zip file containing all files needed to run the analysis on a remote computer with more CPU power or memory.
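Reducing a distance matrix to the taxa that occur in the association data is a simple row and column selection. The following sketch assumes the matrix is held as a list of rows with a parallel list of labels; the representation and the function name `reduce_distance_matrix` are hypothetical and not taken from COPYCAT.

```python
def reduce_distance_matrix(labels, matrix, keep):
    """Restrict a square distance matrix to the taxa named in `keep`.

    The original label order is preserved for the retained taxa.
    """
    wanted = set(keep)
    idx = [i for i, name in enumerate(labels) if name in wanted]
    new_labels = [labels[i] for i in idx]
    new_matrix = [[matrix[i][j] for j in idx] for i in idx]
    return new_labels, new_matrix

labels = ["A", "B", "C"]
matrix = [[0, 2, 4],
          [2, 0, 4],
          [4, 4, 0]]
# Only taxa "A" and "C" appear in the association data.
reduced_labels, reduced = reduce_distance_matrix(labels, matrix, {"A", "C"})
```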
2.3 Visualization and evaluation of the PARAFIT results
PARAFIT's output references taxa by a number assigned in their order of appearance in the input file, which makes the results difficult to interpret, especially for large data sets. COPYCAT provides an easy way to reassign the original taxon labels and to display the association list together with PARAFIT's significance values. Significant associations are highlighted according to a significance threshold that can be adjusted by the user.
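The relabeling step can be pictured as a lookup from taxon numbers back to names. The sketch below does not parse PARAFIT's actual output file; it assumes the per-link results have already been read into hypothetical `(host_number, parasite_number, p_value)` tuples with 1-based numbers, and it treats a link as significant when its P-value is at or below the threshold (also an assumption).

```python
def relabel_links(links, hosts, parasites, alpha=0.05):
    """Attach taxon names and a significance flag to numbered links.

    `links` holds (host_number, parasite_number, p_value) tuples, where the
    numbers are 1-based positions in `hosts` and `parasites` (assumed layout).
    """
    rows = []
    for host_no, parasite_no, p_value in links:
        rows.append((hosts[host_no - 1],
                     parasites[parasite_no - 1],
                     p_value,
                     p_value <= alpha))
    return rows

hosts = ["H1", "H2"]
parasites = ["P1", "P2"]
links = [(1, 2, 0.001), (2, 1, 0.2)]
table = relabel_links(links, hosts, parasites)
```

Changing `alpha` corresponds to adjusting the significance threshold, so the same results can be re-flagged without rerunning the analysis.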
| 3 A REAL-WORLD EXAMPLE |
|---|
3.1 Analysis of a large smut-fungus data set
To illustrate the power of COPYCAT, we analyzed the host plant index for European smut fungi as presented by Vanky (1994, 2005), including host synonyms from Palese and Moser (1997). Recent years have seen considerable progress in smut fungi taxonomy, and the taxonomy of genera and higher ranks is now based on molecular and ultrastructural data (see references in Bauer et al., 2001). The danger of circular reasoning in cophylogenetic analyses due to artificial taxa defined by host relationships is therefore negligible. However, cophylogenetic analysis of smut fungi has so far been restricted to single smut fungi genera and to molecular phylogenies that each consider only a few species (Begerow et al., 2004).
Including synonyms, our data set contained 1947 different fungus-plant associations. Using the NCBI taxonomy release of November 21, 2006, COPYCAT identified 645 associations, representing a total of 140 smut fungi and 437 host plants. Resolution was 0.19 for the parasite tree and 0.23 for the host tree. COPYCAT invoked DISTPCOA to compute eigenvectors (discarding those with negative eigenvalues) and then PARAFIT with 999 permutations. Running time was 4 days on a 2 GHz AMD Opteron with 4 GB RAM.
3.2 Results
The global test indicates a highly significant cophylogenetic structure (P = 0.001). The COPYCAT results for the individual host–parasite links and a summary based on the smut fungi genera are included in the supplementary material; major taxonomic host groups are indicated. Using a significance threshold of P = 0.05, PARAFIT recovers a total of 86 non-significant and 559 significant associations. Parasite genera are rather uniform with respect to their significance values, probably because they are mostly restricted to certain host families. A general pattern is that associations of parasites of certain monocots (Poales) are non-significant if these parasites belong to a clade mainly composed of parasites of other host plants, and vice versa. For instance, the associations of the Poales parasites within Microbotryales, Bauerago and Ustilentyloma, appear as non-significant, as do those of the dicot parasites within Ustilaginales, Melanopsichium and Thecaphora. We conclude that the restriction of smut fungi genera to certain host taxa, together with the partial restriction of smut orders to Poales or non-Poales hosts, is responsible for the highly significant overall cophylogenetic pattern observed with these data.
| 4 REQUIREMENTS |
|---|
COPYCAT is a stand-alone Java application using SWT as its graphics engine. Versions for Windows, Linux (GTK) and Mac OS are freely available, including the required SWT files and a graphical installer. The application requires a pre-installed Java 1.5 (or newer) runtime environment and at least 512 MB of free RAM.
| 5 CONCLUSIONS |
|---|
Even though a more complete sampling would be necessary to draw final conclusions about the deep cophylogeny of smut fungi, our example demonstrates that COPYCAT greatly simplifies the use of PARAFIT, including the preparation of input data, the use of taxonomic data, the analysis of even sizeable lists of hosts and associates, and the display of the results. With this easy-to-use approach, the valuable PARAFIT application has now been made available to a much broader audience in the field of cophylogenetic studies.
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Pierre Legendre for helpful comments and for providing the valuable programs PARAFIT and DISTPCOA to the community, as well as Stefan R. Henz for his help in supplying a functional Mac OS port. Financial support provided by the Deutsche Forschungsgemeinschaft for A.F.A. and M.G. is gratefully acknowledged.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Received on November 28, 2006; revised on January 19, 2007; accepted on January 23, 2007
| REFERENCES |
|---|
Bauer R, et al. Ustilaginomycetes. In: The Mycota VII, Systematics and Evolution, Part B.—McLaughlin DJ, et al, eds. (2001) Berlin: Springer-Verlag. 57–84.
Begerow D, et al. About the evolution of smut fungi on their hosts. In: Frontiers in Basidiomycete Mycology.—Agerer R, et al, eds. (2004) München: IHW Press. 81–98.
Legendre P, Anderson MJ. Program distpcoa. (1998) Département de sciences biologiques, Université de Montréal. 10.
Legendre P, et al. A statistical test for host-parasite coevolution. Systematic Biology (2002) 51:217–234.
Legendre P, Legendre L. Numerical Ecology. (1998) 2nd edn. Amsterdam: Elsevier. 244–410.
NCBI. Taxonomy database. (2006) ftp://ftp.ncbi.nih.gov/pub/taxonomy.
Page RDM. Tangled Trees. Phylogeny, Cospeciation and Coevolution, Chapter Introduction. (2002) Chicago and London: The University of Chicago Press. 1–21.
Palese R, Moser DM. Synonymie-Index der Schweizer Flora und der angrenzenden Gebiete. (1997) Distributed by Zentrum des Datenverbundnetzes der Schweizer Flora.
Stevens J. Computational aspects of host-parasite phylogenies. Briefings in Bioinformatics (2004) 5:339–349.
Thorley JL, Page RD. RadCon: phylogenetic tree comparison and consensus. Bioinformatics (2000) 16:486–487.
Vanky K. European Smut Fungi. (1994) Stuttgart/Jena/New York: Gustav Fischer Verlag.
Vanky K. European Smut Fungi (Ustilaginomycetes p.p. and Microbotryales) according to recent nomenclature. Mycologia Balcanica (2005) 2:169–177.