

Bioinformatics Advance Access originally published online on January 19, 2007
Bioinformatics 2007 23(7):815-824; doi:10.1093/bioinformatics/btm015

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Assessment of phylogenomic and orthology approaches for phylogenetic inference

B. E. Dutilh 1,*, V. van Noort 1, R. T. J. M. van der Heijden 1, T. Boekhout 2, B. Snel 3 and M. A. Huynen 1

1 Center for Molecular and Biomolecular Informatics/Nijmegen Center for Molecular Life Sciences, Radboud University Nijmegen Medical Center, P.O. Box 9101, 6500 HB, Nijmegen, The Netherlands, 2 Centraalbureau voor Schimmelcultures, Uppsalalaan 8, 3584 CT, Utrecht, The Netherlands and 3 Bioinformatics Group, Department of Biology, Utrecht University, Padualaan 8, 3584 CH, Utrecht, The Netherlands

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: Phylogenomics integrates the vast amount of phylogenetic information contained in complete genome sequences, and is rapidly becoming the standard for reliably inferring species phylogenies. There are, however, fundamental differences between the ways in which phylogenomic approaches like gene content, superalignment, superdistance and supertree integrate the phylogenetic information from separate orthologous groups. Furthermore, they all depend on the method by which the orthologous groups are initially determined. Here, we systematically compare these four phylogenomic approaches, in parallel with three approaches for large-scale orthology determination: pairwise orthology, cluster orthology and tree-based orthology.

Results: Including various phylogenetic methods, we apply a total of 54 fully automated phylogenomic procedures to the fungi, the eukaryotic clade with the largest number of sequenced genomes, for which we retrieved a gold-standard phylogeny from the literature. Phylogenomic trees based on gene content show, relative to the other methods, a bias in the tree topology that parallels convergence in lifestyle among the species compared, indicating convergence in gene content.

Conclusions: Complete genomes are no guarantee for good or even consistent phylogenies. However, the large amounts of data in genomes enable us to carefully select the data most suitable for phylogenomic inference. In terms of performance, the superalignment approach, combined with restrictive orthology, is the most successful in recovering a fungal phylogeny that agrees with current taxonomic views, and allows us to obtain a high-resolution phylogeny. We provide solid support for what has grown to be a common practice in phylogenomics during its advance in recent years.

Contact: dutilh{at}cmbi.ru.nl

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Phylogenomics, i.e. using entire genomes to infer a species tree, has become the de facto standard for reconstructing reliable phylogenies (Ciccarelli et al., 2006; Daubin et al., 2002). Whereas phylogenetic trees, i.e. based on single gene families, may show conflict (Teichmann and Mitchison, 1999) due to a variety of causes, phylogenomic trees have held the promise that they can average out these anomalies by the sheer power of genome-scale data. As it is based on the maximum genetic information, a phylogenomic tree should be the best reflection of the evolutionary history of the species, assuming this history is tree-like (Doolittle, 1999; Ge et al., 2005). Although there are discordant processes at the level of gene repertoires, such as horizontal gene transfer (Doolittle, 1999) or differences in the rates of evolution and gene loss between paralogs in different species (Daubin et al., 2003), these have been shown to add noise rather than a directional bias (Dutilh et al., 2004). However, this does not mean that phylogenomics is the end of all conflicts in species trees (Jeffroy et al., 2006): there are many ways to integrate the information from the different gene families to form a single species phylogeny.

1.1 Phylogenomics
In taxonomy, the term ‘phylogenomics’ indicates the construction of a phylogeny on the basis of complete genome data. We can consider this type of phylogenomics as parallel phylogenetics over all gene families, combined with a synthesis step. This step from phylogenetics to phylogenomics integrates the phylogenetic information from the different gene families to form a single species phylogeny, and can be taken at successive levels in the process. As a guideline, we classify phylogenomic methods by the level where the step from phylogenetics to phylogenomics is made (Fig. 1). Here, we compare these four qualitatively different phylogenomic approaches.


 
Fig. 1. Making phylogenomic trees. Before starting tree inference, OGs are defined (top row). Phylogenomics follows the steps of phylogenetics, from multiple alignment through distance, likelihood or parsimony to the reconstruction of a phylogeny. Integrating separate phylogenetics for each gene family (gray boxes) to phylogenomics (white boxes) can be done at every one of these steps. This defines the phylogenomic approach: gene content (after OG definition), superalignment (after multiple alignment), superdistance (after distance calculation) or supertree (after reconstruction of gene family trees). The phylogenomic trees we reconstructed are listed at the bottom; the number between square brackets indicates the number of target nodes that the tree recovered correctly.

 
For sequence-based phylogenomic methods, the first step is to make multiple alignments for every orthologous group (OG) (Delsuc et al., 2005). In the superalignment approach, the phylogenetic information is then combined by concatenating the multiple alignments to form a superalignment. Subsequently, conventional phylogenetic inference methods can be used to transform the alignment into a phylogeny. Superdistance trees continue the path of phylogenetics by first calculating distance matrices for all gene families. The phylogenomic distance between two species is then defined as the average distance between all the shared gene families (Kunin et al., 2005). Finally, the supertree approach (Bininda-Emonds, 2004; Daubin et al., 2002) takes the step from phylogenetics to phylogenomics at the very end. After phylogenetic trees have been composed for all gene families, an integration step combines the multiple gene family trees to form a single phylogenomic tree.
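The superdistance definition above (the average distance over all shared gene families) can be illustrated with a minimal sketch. It assumes that each OG supplies a dictionary of pairwise distances keyed by species pair; the function name `superdistance` and this data layout are hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict


def superdistance(og_distances):
    """Average the per-OG distances for every species pair.

    og_distances: list of dicts, one per orthologous group, mapping
    frozenset({species_a, species_b}) to a distance. A pair that is
    missing from a dict is treated as not sharing that OG.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for distances in og_distances:
        for pair, d in distances.items():
            totals[pair] += d
            counts[pair] += 1
    # Only pairs that share at least one OG receive a superdistance.
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Toy input (illustrative values only): two OGs, three species.
og1 = {frozenset({"Sce", "Spa"}): 0.10, frozenset({"Sce", "Cal"}): 0.50}
og2 = {frozenset({"Sce", "Spa"}): 0.30}
result = superdistance([og1, og2])
```

The resulting matrix can then be passed to any distance-based tree reconstruction method.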

Of the methods based on whole-genome features (Delsuc et al., 2005), we only consider gene content here, as gene order in the fungi evolves too fast to retain a phylogenetic signal (Huynen et al., 2001). Gene content takes the step from phylogenetics to phylogenomics right after the definition of the OGs (Fig. 1). Species are regarded as ‘bags of genes’, and sequence information is only used to determine the OGs. To infer a phylogenomic tree from gene content data, a binary character matrix indicating the presence or absence of the OGs in all species can be treated in the same way as a multiple sequence alignment.
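As an illustration of the 'bags of genes' view, the following minimal sketch builds a binary presence-absence matrix from OG membership. The input layout (a dictionary from OG identifier to the set of species that contain it) and the function name are assumptions made for this example only.

```python
def presence_absence_matrix(og_members, species):
    """Return one binary row per species, with one column per OG.

    og_members: dict mapping an OG identifier to the set of species
    in which that OG is present (hypothetical input layout).
    """
    og_ids = sorted(og_members)  # fixed column order
    return {
        sp: [1 if sp in og_members[og] else 0 for og in og_ids]
        for sp in species
    }


# Toy input (illustrative only).
og_members = {
    "OG1": {"Sce", "Cal", "Spo"},
    "OG2": {"Sce", "Cal"},
    "OG3": {"Spo"},
}
species = ["Sce", "Cal", "Spo"]
matrix = presence_absence_matrix(og_members, species)
```

Each row plays the role of one aligned 'sequence' of 0/1 characters, so the matrix can be handed to parsimony or distance methods in the same way as a multiple sequence alignment.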

1.2 Orthology
The initial step in every phylogenomic approach is to determine which genes are to be compared between species (top row in Fig. 1). We compare the performance of three types of orthology definitions: pairwise orthology, cluster orthology, and tree-based orthology. The first two methods use sequence similarity scores to define orthologous groups of genes. Pairwise orthology is defined between only two species [e.g. bidirectional best hits or Inparanoid (Remm et al., 2001)], and cluster orthology [e.g. Clusters of Orthologous Groups (Tatusov et al., 1997)] is the natural extension of pairwise orthology to more than two species. Tree-based orthology comes closest to the original phylogenetic definition of orthology (Fitch, 1970). Rather than using only the sequence similarity scores, it analyses a phylogenetic tree of a homologous group of genes to obtain orthologous relations (van der Heijden et al., in press). Note that although tree-based orthology is an ideal approach to determine orthology at scalable levels of resolution, it needs to be operationalized: OGs have to be determined from the trees separately for each pair of species. The superalignment and supertree approaches, which consider a large set of species simultaneously, cannot deal with pairwise orthology or operationalized tree-based orthology (see ‘Methods’ and Supplementary Material).
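The simplest form of pairwise orthology mentioned above, bidirectional best hits, can be sketched as follows. This is a minimal illustration under the assumption that similarity scores are already available as nested dictionaries; it is not the Inparanoid algorithm, which additionally handles in-paralogs, and all names are hypothetical.

```python
def bidirectional_best_hits(scores_ab, scores_ba):
    """Return gene pairs (a, b) that are each other's best hit.

    scores_ab: dict gene_a -> {gene_b: score} for species A searched
    against species B; scores_ba is the reverse search. Ties are
    broken arbitrarily by max(), a simplification.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy similarity scores (illustrative only).
scores_ab = {"a1": {"b1": 200, "b2": 50}, "a2": {"b1": 180}}
scores_ba = {"b1": {"a1": 210, "a2": 170}, "b2": {"a1": 40}}
pairs = bidirectional_best_hits(scores_ab, scores_ba)
```

Here `a2` has `b1` as its best hit, but `b1` prefers `a1`, so only the pair (`a1`, `b1`) is reported. This shows why a purely pairwise definition can miss genes that a multi-species approach might still assign.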

1.3 Fungal phylogeny
To compare the performance of phylogenomic approaches, some kind of gold-standard phylogeny is imperative. We chose here to benchmark the phylogenomic methods using a phylogeny of real species. The alternative, to work with simulated evolutionary data (Hillis et al., 1994), would require the simulation of the evolution of complete genomes for which we lack the models and parameters. Prima facie, an approach that uses a known phylogeny appears to exclude the possibility for any improvements. However, due to ambiguities in the literature, our gold-standard phylogeny is not completely resolved. We expect that properly derived complete genome phylogenies will allow a higher resolution, both for the species analyzed here and for other (partly) unresolved clades in future analyses.

The fungi are the eukaryotic clade with the most sequenced genomes. Saccharomyces cerevisiae has been a model organism for decades, and in this era of comparative genomics, much work has focused on sequencing the genomes of more or less closely related species (Cliften et al., 2003; Dujon et al., 2004; Kellis et al., 2003). In total, 26 completely sequenced fungal genomes were available in public databases at the start of this study (September 2005): 22 Ascomycota, 3 Basidiomycota and the Microsporidium Encephalitozoon cuniculi (see Fig. 2 and Table 1). We included E.cuniculi as an outgroup because this was the most closely related complete genome to the fungi (Thomarat et al., 2004; Vivares et al., 2002), and Rhizopus oryzae was not available yet.


 
Fig. 2. Target phylogeny. Labeled nodes are supported by literature. Unresolved issues are indicated by multifurcating nodes (bold lines). The numbers at every node indicate the number of the trees in each of the phylogenomic approaches that recovered this node correctly. See Tables 1 and 2 in Supplementary Material for references that support this tree.

 


 
Table 1. The organisms included in this research

 
The fungal kingdom has been extensively studied by phylogeneticists. Traditional phenotypic methods [e.g. reviewed in (Guarro et al., 1999)], molecular phylogenetic analyses based on rRNA (Fell et al., 2000; Lopandic et al., 2005; Lutzoni et al., 2004; Scorzetti et al., 2002; Tehler et al., 2003) or small numbers of other proteins (Diezmann et al., 2004; James et al., 2006; Kouvelis et al., 2004; Kurtzman 2003), as well as some large-scale studies (Jeffroy et al., 2006; Kuramae et al., 2006; Robbertse et al., 2006; Rokas et al., 2003; Thomarat et al., 2004) have helped resolve many of the phylogenetic relationships in the fungal kingdom. Based on the available literature (Berbee et al., 2000; Delsuc et al., 2005; Diezmann et al., 2004; Jeffroy et al., 2006; Jones et al., 2004; Kouvelis et al., 2004; Kuramae et al., 2006; Kurtzman 2003; Lopandic et al., 2005; Lutzoni et al., 2004; Medina, 2005; Prillinger et al., 2002; Robbertse et al., 2006; Tehler et al., 2003; Thomarat et al., 2004), we composed a true fungal phylogeny (Fig. 2) that we use as a benchmark.

1.4 This study
Here, we compare the four phylogenomic and the three orthology approaches presented above (Fig. 1) in parallel, assessing their ability to infer the 19 target nodes derived from the literature. As many different methods and algorithms exist for most of these approaches, we include several implementations in order to buffer our findings from possible biases in the individual methods. Thus, we compose a total of 54 phylogenomic trees of the 26 complete fungal genomes, using completely automated methods.


    2 METHODS
 
2.1 Orthology
Sequences were downloaded from the respective fungal sequencing projects (see Table 1). We compare the performance of three types of orthology definitions: pairwise orthology, cluster orthology and tree-based orthology. Using Inparanoid (Remm et al., 2001), we detected 1 025 849 pairwise ‘InparanOGs’. For cluster orthology, we used a method based on COG (Tatusov et al., 1997), yielding 8044 triangle-based ‘triOGs’ and 10 754 pair-based ‘duOGs’. For specific purposes (Supplementary Material), we composed subsets of OGs without paralogs (8722 unambiguous duOGs and 6488 unambiguous triOGs) and OGs that occur exactly once in every species (64 pan-duOGs and 59 pan-triOGs). To compose a tree-based orthology, phylogenetic trees were analyzed with LOFT (van der Heijden et al., in press). LOFT does not impose a species tree on the data, but assigns orthology relations based on the species overlap between the branches of a phylogenetic tree. Because a tree-based orthology yields levels of orthology, it needs to be operationalized between species pairs. We identified 858 622 distance tree-duOGs, 820 007 distance tree-triOGs, 856 363 likelihood tree-duOGs and 822 570 likelihood tree-triOGs. Further details about the orthology approaches can be found in the Supplementary Material. Orthology predictions are available at www.cmbi.ru.nl/~dutilh/phylogenomics.
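The idea of assigning orthology from species overlap between branches can be illustrated with a small sketch. It labels an internal node as a duplication when the species sets of its two child branches overlap, and as a speciation otherwise. This is one common reading of a species-overlap rule and is only an assumption about the principle; LOFT's actual assignment of orthology levels is not reproduced here. The tree encoding (nested 2-tuples with `"Species|gene"` leaves) is hypothetical.

```python
def species_of(leaf):
    """Extract the species label from a 'Species|gene' leaf name."""
    return leaf.split("|")[0]


def classify_nodes(tree, events=None):
    """Label internal nodes of a binary gene tree in post-order.

    Returns (species_set, events), where events is a list containing
    'duplication' or 'speciation' for each internal node.
    """
    if events is None:
        events = []
    if isinstance(tree, str):  # leaf
        return {species_of(tree)}, events
    left, right = tree
    left_species, _ = classify_nodes(left, events)
    right_species, _ = classify_nodes(right, events)
    # Overlapping species sets imply the same species occurs on both sides.
    kind = "duplication" if left_species & right_species else "speciation"
    events.append(kind)
    return left_species | right_species, events


# Toy gene tree (illustrative only): two paralogous subtrees plus Cal.
tree = ((("Sce|g1", "Spa|g1"), ("Sce|g2", "Spa|g2")), "Cal|g")
all_species, events = classify_nodes(tree)
```

No species tree is needed as input, which matches the statement above that a species tree is not imposed on the data.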

2.2 Phylogenomics
Phylogenomic trees based on gene content were calculated from presence–absence profiles using either distance (Dutilh et al., 2004; Korbel et al., 2002) or parsimony (Farris, 1977; Felsenstein, 1989). In the distance approach, we corrected for genome size, because distantly related species with large genomes may share more genes than more closely related species with small genomes (Supplementary Material). For the superalignment approach, Muscle multiple alignments (Edgar, 2004) of either unambiguous cluster OGs or pan-OGs were concatenated to form a superalignment. Unambiguous OGs that are absent from certain species were coded with question marks, and form gaps in the alignment (Philippe et al., 2004). In some superalignment trees, we analyzed the effect of selecting unambiguously aligned amino acids by using GBlocks (Castresana, 2000). We used either distance or maximum likelihood approaches to reconstruct the superalignment trees. The superdistance trees were calculated from superdistance matrices, based on the average distance over all OGs that are shared between the two species. We analyzed the effect of correcting for rapidly evolving OGs by using SDM* (Criscuolo et al., 2006). Supertrees were composed from distance or maximum likelihood gene family trees. To integrate the different phylogenetic trees into a phylogenomic supertree, we used either the majority rule from Consense (Felsenstein, 1989) or CLANN (Creevey and McInerney, 2005). For further details, see the Supplementary Material; all the trees are available at www.cmbi.ru.nl/~dutilh/phylogenomics.
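The concatenation step with question marks for absent OGs can be sketched as follows. This is a minimal illustration that assumes each alignment is a dictionary from species to an aligned sequence of equal length; the function name and data layout are hypothetical and do not describe the scripts actually used.

```python
def concatenate_alignments(alignments, species):
    """Concatenate per-OG alignments into one superalignment.

    alignments: list of dicts mapping species -> aligned sequence.
    A species missing from an OG is padded with '?' characters.
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, "?" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Toy alignments (illustrative only); the second OG is absent from Cal.
aln1 = {"Sce": "MK-L", "Cal": "MKAL", "Spo": "MR-L"}
aln2 = {"Sce": "GGT", "Spo": "GAT"}
superalignment = concatenate_alignments([aln1, aln2], ["Sce", "Cal", "Spo"])
```

All rows of the result have the same length, so the superalignment can be passed to conventional distance or maximum likelihood programs.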

2.3 Scoring the reconstructed trees
To score the reconstructed phylogenomic trees, we use the target phylogeny in Fig. 2. A phylogeny receives one point for each of the resolved partitions that is correctly retrieved, so a maximum of 19 points can be obtained. Note that, for example, the node ‘Yli primitive in Hemiascomycetes’ refers to the (Ago, Cal, Cgl, Dha, Kla, Kwa, Sba, Sca, Sce, Skl, Sku, Smi, Spa) branch (see Fig. 2). This means that this node can contribute a point for a certain tree, even if the Hemiascomycetes are not monophyletic in that tree, for example, if Y.lipolytica clusters with Sch.pombe. In that case, however, the tree will not receive a point for the ‘Hemiascomycetes’ node.
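The scoring rule can be sketched as below: one point for every target partition that also occurs in the reconstructed tree. As a simplification, the sketch treats trees as rooted and compares clades (leaf sets) rather than unrooted bipartitions; the nested-tuple tree encoding and the function names are assumptions for illustration.

```python
def rooted_clades(tree):
    """Return (leaf_set, clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):  # leaf
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = rooted_clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def score(tree, target_clades):
    """Count how many target clades are present in the tree."""
    _, found = rooted_clades(tree)
    return sum(1 for clade in target_clades if frozenset(clade) in found)


# Toy tree and targets (illustrative only, not the 19 target nodes).
tree = ((("Sce", "Spa"), "Smi"), ("Ago", "Kla"))
targets = [{"Sce", "Spa"}, {"Sce", "Spa", "Smi"}, {"Ago", "Kla"}, {"Sce", "Smi"}]
points = score(tree, targets)
```

Because each target node is defined purely by its set of leaves, a node can earn its point regardless of whether an enclosing clade is recovered, as in the 'Yli primitive in Hemiascomycetes' example above.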


    3 RESULTS
 
We present a systematic comparison of two important factors in phylogenomic inference: the orthology approach and the level of integration of phylogenetic information to a genomic scale. We use various approaches, each with several different implementations, such as the inclusive pair-based or the more restrictive triangle-based cluster OGs; and distance, maximum likelihood or parsimony for the reconstruction of the tree (Fig. 1 and Supplementary Material). Thus, we automatically construct 54 phylogenies from the available genome data of 26 fungi. To assess the performance of the phylogenomic methods, we compare the nodes in the reconstructed trees to the 19 resolved nodes of a partly unresolved gold-standard phylogeny based on extensive literature research (Fig. 2 and Supplementary Material). All of the canonical phylogenomic methods that we tested perform remarkably well at reconstructing the known fungal phylogeny. The phylogenomic trees in the three sequence-based approaches (superalignment, superdistance and supertree) recovered at least 16 out of the 19 target nodes. This constitutes a major distinction with the gene content trees, which performed considerably worse: even the best methods recovered no more than 13 nodes. All the phylogenomic trees can be found in the Supplementary Material.

3.1 Collapsing recent duplications to gain data
We included two types of cluster orthology: the inclusive pair-based ‘duOGs’, and the more restrictive triangle-based ‘triOGs’ (see ‘Methods’). A subset of these cluster OGs are the unambiguous OGs, which occur no more than once in every species. Even more constrained are the pan-orthologs, which are both unambiguous and universal, occurring exactly once in every species. We detected 8722 unambiguous duOGs, 6488 unambiguous triOGs, 64 pan-duOGs and 59 pan-triOGs in the fungi. This result depends on collapsing the recent duplications, as identified from the phylogenies by LOFT (van der Heijden et al., in press), before selecting the unambiguous OGs from the cluster OGs (see Supplementary Material). Without collapsing recent duplications, we retrieved no more than 4421 unambiguous duOGs, 4887 unambiguous triOGs, 13 pan-duOGs and 13 pan-triOGs. This difference (an average of 42%) illustrates the need to filter out species-specific gene expansions and systematic errors, such as the diploid genome assembly of Can. albicans (Jones et al., 2004), to increase the number of genes that can be considered.
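The definitions of unambiguous OGs and pan-orthologs, and the effect of collapsing recent duplications first, can be illustrated with a short sketch. It assumes an OG is a dictionary from species to a list of genes, and that the species with recent (species-specific) duplications have already been identified elsewhere; how LOFT identifies them is not modeled, and all names here are hypothetical.

```python
def collapse_recent_duplications(og, recent):
    """Keep one representative gene for species flagged in `recent`.

    og: dict species -> list of genes in this OG.
    recent: set of species whose extra copies are assumed to stem
    from species-specific duplications (hypothetical input).
    """
    return {sp: genes[:1] if sp in recent else list(genes) for sp, genes in og.items()}


def is_unambiguous(og):
    """True if the OG occurs no more than once in every species."""
    return all(len(genes) <= 1 for genes in og.values())


def is_pan(og, species):
    """True if the OG occurs exactly once in every listed species."""
    return all(len(og.get(sp, [])) == 1 for sp in species)


# Toy OG (illustrative only): Cal carries two recent copies.
species = ["Sce", "Cal", "Spo"]
og = {"Sce": ["s1"], "Cal": ["c1a", "c1b"], "Spo": ["p1"]}
collapsed = collapse_recent_duplications(og, recent={"Cal"})
```

Before collapsing, this OG fails both filters; afterwards it qualifies as a pan-ortholog, which is the mechanism by which collapsing enlarges the usable OG sets.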

3.2 Orthology approaches
An orthology definition that considers a recent last common ancestor will have a higher resolution than one that considers a more ancient common ancestor. Thus, pairwise orthology and tree-based orthology should, in principle, obtain a higher resolution than cluster orthology, which includes in a single OG all gene duplications since the last common ancestor of all the species compared. However, pairwise orthology incorporates information from only two species, and may miss genes that cluster orthology and tree-based orthology can identify. We expected tree-based orthology, which includes sequence information from many different species, while allowing a high-resolution view where necessary, to combine the advantages of pairwise and cluster orthologies. However, although the orthology definition does turn out to be an important factor in the quality of a phylogenomic tree, the highest-scoring trees were based on either unambiguous cluster OGs (duOGs and triOGs) or pan-triOGs, rather than tree-based OGs.

It is striking that although there is a large overlap between the 64 pan-duOGs and 59 pan-triOGs (56 OGs are identical), the pan-triOGs give better trees in both the superalignment and the supertree approach. However, the choice for one of these orthology definitions is no guarantee for a good phylogeny. Both the unambiguous cluster OGs and the pan-triOGs also produced relatively low-scoring trees in every phylogenomic approach (Fig. 1).

3.3 Superalignment trees and supertrees can recover all target nodes
Superalignment can be considered the most successful phylogenomic approach: 4 of the 14 superalignment trees correctly infer all 19 target nodes (see Fig. 1). The most difficult to recover as a monophyletic group are the Ascomycota (although not for the trees constructed with maximum likelihood) and the (Mgr, Ncr) node (Fig. 2). In those superalignment trees that did not group M.grisea with N.crassa, neither of these species was preferentially found at the root of the Sordariomycetes.

Selecting the unambiguously aligned positions of the superalignment using GBlocks (Castresana, 2000) made it computationally possible to include more unambiguous OGs (Supplementary Material), which led the unambiguous duOGs to match the results of the unambiguous triOGs (Fig. 1). However, the decrease in the number of aligned positions that GBlocks brought about in the pan-triOGs resulted in a suboptimal tree (Fig. 1). It appears that it is not simply the selection of unambiguously aligned positions, but rather the increase in the amount of high-quality data that leads to a better phylogeny. To further test this, we composed Consense supertrees from an increasing number of phylogenetic distance trees of the most restrictive OG set, the 59 pan-triOGs. Interestingly, no two single gene trees were identical, and none was identical to the target: on average, they recover only 11.5 nodes. Yet, when we combine at least 30–40 phylogenetic trees to a supertree, we already recover the external gold-standard (Fig. 3 in the Supplementary Material).
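The way many individually imperfect gene trees can combine into a good supertree can be illustrated with a majority-rule sketch: a clade is kept if it occurs in more than half of the input trees. This is a simplified stand-in for the majority rule, not the Consense program itself; it assumes rooted trees over the same complete species set (as is the case for pan-OGs), encoded as nested tuples, and the function names are hypothetical.

```python
from collections import Counter


def rooted_clades(tree):
    """Return (leaf_set, clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):  # leaf
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = rooted_clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def majority_rule_clades(trees):
    """Return the clades present in more than half of the trees."""
    counts = Counter()
    for tree in trees:
        counts.update(rooted_clades(tree)[1])
    return {clade for clade, n in counts.items() if n > len(trees) / 2}


# Three toy gene trees (illustrative only); no two are identical.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "B"), "D"), "C")
t3 = ((("A", "C"), "B"), "D")
consensus = majority_rule_clades([t1, t2, t3])
```

In this toy case only `t1` contains both majority clades (A, B) and (A, B, C), yet the consensus recovers them together, because the conflicting clades of `t2` and `t3` each occur only once.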


 
Fig. 3. Phylogenomic trees. (a) One of the two highest scoring fungal topologies. This topology was recovered by four superalignment trees and one supertree: a ML tree based on a superalignment of pan-triOGs, a ML tree based on a GBlocks-filtered superalignment of unambiguous duOGs (present in >24 species, 132 409 positions; this is the tree displayed, only bootstrap values <100% are indicated) or triOGs (present in >24 species), a distance tree based on a superalignment of pan-triOGs and a Consense supertree based on phylogenetic distance trees of pan-triOGs. (b) Gene content tree. Bio-NJ distance tree based on the InparanOG gene content distance between two species (see ‘Methods’ and Supplementary Material). Like the other gene content trees, this tree indicates convergence in gene content of species with similar lifestyles.

 
Three of the 12 phylogenomic trees inferred using the supertree approach correctly recovered all 19 target nodes. The Consense supertree based on phylogenetic distance trees from pan-triOGs is identical to the four highest-scoring superalignment trees (Fig. 3a), but differs slightly from the equally high-scoring Clann supertrees based on phylogenetic maximum likelihood trees from both duOGs and triOGs (Supplementary Material). This is possible because of the unresolved nodes in the target phylogeny. Note that superdistance and gene content trees never retrieve all 19 target nodes.

3.4 Gene content trees have a phenotypic bias
Compared to the other phylogenomic methods, the gene content trees perform relatively poorly at recovering the required target nodes: on average, they only recover 10.38 nodes. Several numbers stand out in Fig. 2. While almost all the other trees group the Hymenomycetes, (Sce, Smi, Spa) and (Ago, Kla, Kwa, Skl) together, none of the gene content trees recover these nodes. The distance-based gene content trees also fail to retrieve Ascomycota as a monophyletic group, although this proves to be a problem for most superdistance trees as well. Interestingly, we find that part of the explanation for these biases can be found in the lifestyle of the fungi (Fig. 3b). Although Sch.pombe shares relatively many genes with the Basidiomycota (Fig. 2 in Supplementary Material), and might thus be expected to cluster at the root of the Ascomycota, the main dichotomy we find within the gene content tree of the fungi is between the yeasts on the one hand, and the filamentous fungi on the other. The dimorphic fungi, Sch.pombe, Y.lipolytica and in some cases Can.albicans as well, are more or less placed in between these two branches. The filamentous P.chrysosporium is drawn closer to the filamentous Euascomycetes within the Basidiomycota, breaking up the Hymenomycetes, and leaving the dimorphic Cry.neoformans and U.maydis as the more derived Basidiomycota in most trees. The filamentous Ash.gossypii stays close to its relatives, K.lactis and K.waltii, but the (Ago, Kla, Kwa, Skl) branch is never intact in the gene content trees: Sac.kluyveri is often at the root of this cluster. This may be a remnant genome size effect, as Sac.kluyveri is a very incompletely sequenced genome. To investigate the effect of the small outgroup E.cuniculi on the position of Sac.kluyveri, we removed E.cuniculi from the data set and recomposed the Bio-NJ distance tree based on the InparanOG gene content distance (Fig. 3b). The position of Sac.kluyveri did not alter (not shown).

This strong phenotypic effect does not explain the inability of gene content to reproduce the target branching order in the Saccharomyces sensu stricto branch. In part, this may be explained by the fact that the genome sequences of Sac.bayanus, Sac.kudriavzevii and Sac.mikatae only covered 85–95% (Cliften et al., 2003). Another issue that may specifically hinder the correct inference of the Saccharomyces sensu stricto branching order is differential gene loss following the complete genome duplication or alloploid genome fusion in these species (Langkjaer et al., 2003; Scannell et al., 2006; Wolfe and Shields, 1997). Due to the large number of redundant genes that resulted from this event, and the differential processes of gene loss that followed in the descendant lineages, a patchwork of overlapping gene repertoires will have been the result. Although such gene losses should not be in conflict with the evolutionary signal, they may be part of the reason that the gene content approaches were confounded, resulting in the deviations from the target phylogeny within the Saccharomyces sensu stricto clade.

3.5 Suggestions for the unresolved nodes in the fungal taxonomy
The target nodes we selected from the literature were recovered in most of our phylogenomic trees (Fig. 2). This high recovery rate supports our perhaps subjective gold-standard phylogeny. In addition, we were faced with three nodes that remained ambiguous in our review of the literature (Supplementary Material): the internal resolution of the (Ago, Kla, Kwa, Skl) partition, the most primitive clade in Euascomycetes and the most primitive clade in Ascomycota (bold lines in Fig. 2). In Table 2, we have scored the support for each of the possible branching orders in these unresolved nodes over the four phylogenomic approaches. Based on our phylogenomic data, we can draw some cautious conclusions about the issues that have so far remained unresolved in the fungal phylogeny.



 
Table 2. Support among the trees in each of the phylogenomic approaches for the different possible branchings in the unresolved nodes of the fungal taxonomy

 
In virtually all phylogenomic trees reconstructed in the current research, Ash.gossypii and K.lactis are sister species in the (Ago, Kla, Kwa, Skl) branch. In fact, the literature references that reject this hypothesis do so with low support (Diezmann et al., 2004; Kurtzman, 2003), while the references that support it present well-supported nodes (Jeffroy et al., 2006; Kuramae et al., 2006; Tehler et al., 2003). All the phylogenomic approaches support a clustering of K.waltii and Sac.kluyveri, except for the gene content trees. This suggests that the correct phylogeny is ((Ago, Kla), (Kwa, Skl)), as we also found in the high-scoring phylogenomic tree in Fig. 3a.

Our phylogenomic trees are also quite consistent regarding which clade should be placed at an ancestral position in the Euascomycetes (blue bold line in Fig. 2). Except for two of the superdistance trees, all sequence-based trees agree that Sta.nodorum groups with the Eurotiomycetes, and the Sordariomycetes are ancestral (Table 2). This is largely supported by the literature (Lopandic et al., 2005; Robbertse et al., 2006; Tehler et al., 2003), while the only contradictory references contain other Pleosporales or Dothideomycetes, but not the species Sta.nodorum itself. Strikingly, the Sta.nodorum node is the only ill-supported node in a recent analysis of Ascomycota (Robbertse et al., 2006).

The third unresolved issue, namely which of the three Ascomycotal clades is the most primitive (black bold line in Fig. 2), is less clear-cut than the two above. The initial hypothesis was that Sch.pombe would be the first to branch off the Ascomycotal lineage (hence the name Archiascomycetes), which is also supported by most, but not all, literature references (Supplementary Material). In all but two of the gene content trees, the Euascomycetes are the most primitive Ascomycota, even though Sch.pombe clearly shares more genes with the Basidiomycota than do the other Ascomycota (Fig. 2 in Supplementary Material). Conversely, the superalignment trees confidently place the Archiascomycetes in this position, and the superdistance trees and supertrees are inconclusive. As the superalignment trees have correctly recovered most of the other nodes as well, we conclude that their placement of the Archiascomycetes as the most primitively branching ascomycotic clade is the most reliable. Thus, the topology depicted in Fig. 3a is our final suggestion for the fungal phylogeny.


    4 CONCLUDING REMARKS
 
We have systematically compared four phylogenomic approaches in parallel with three orthology definitions that define OGs at different levels of resolution. Using various algorithms and tree-building methods, we composed a total of 54 fully automated phylogenomic trees. The main dichotomy in the topologies of the reconstructed trees is between trees reconstructed using a sequence-based method and trees reconstructed using gene content data (Fig. 4). The phylogenomic trees that best reproduced the target phylogeny can be found among the superalignment trees and the supertrees, using either unambiguous cluster OGs or pan-triOGs. However, although these approaches can yield trees that are completely consistent with the current opinions about the fungal phylogeny, they are not a guarantee for a successful phylogenomic tree. For example, the CLANN supertrees based on pan-duOGs still only retrieved 16 of the 19 target nodes.


Fig. 4. Similarity between the phylogenomic trees composed in this research, ordered based on (a) the phylogenomic approach and (b) the orthology approach. As superalignment trees and supertrees cannot use pairwise or tree-based orthology, these approaches are excluded from (b). The small numbers in the matrices are the number of partitions shared between each pair of trees. These numbers are color coded: green (maximum 23) indicates many shared partitions and red indicates few shared partitions in the tree. The large numbers are the average number of shared partitions between all trees in the four main phylogenomic approaches.
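The tree similarity reported in Fig. 4 is a count of partitions shared between two trees. The following is a minimal sketch of such a count, not the code used in this study. It assumes that trees are written as nested Python tuples of species names and that only non-trivial partitions (at least two species on each side) are counted; the function names and the five-leaf example trees are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def partitions(tree):
    """Collect the non-trivial partitions (splits) of a tree, ignoring the root."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to give each split one canonical form
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) >= 2 and len(other) >= 2:
            # Store the side that does not contain the anchor leaf.
            splits.add(other if anchor in clade else clade)
        return clade

    visit(tree)
    return splits


def shared_partitions(tree_a, tree_b):
    """Count the partitions present in both trees (same species required)."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same species")
    return len(partitions(tree_a) & partitions(tree_b))


# Hypothetical five-species example: the trees disagree on one of two splits.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(shared_partitions(tree_a, tree_b))
```

Because each split is stored as the side that excludes a fixed leaf, the count does not depend on where either tree is rooted. A fully resolved unrooted tree of n species has n − 3 non-trivial partitions, which is the most that two such trees can share.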

 
Gene content trees recover relatively few of the target nodes. This is at least partly due to convergence in the gene repertoires of fungi with comparable phenotypes: the evolutionary and phenotypic signals are combined in one tree (Snel et al., 1999). For example, we observe that the filamentous Euascomycetes and P.chrysosporium are drawn closer together, breaking the generally accepted topology of both Ascomycota and Basidiomycota (Fig. 2). While prokaryotes from different lineages have previously been shown to assume convergent gene repertoires in comparable ecological niches (Zomorodipour and Andersson, 1999), this is the first time (to our knowledge) that a parallel between convergence in gene content and in phenotype has been shown in eukaryotes, to the extent that it affects gene content phylogeny.

This research strongly supports the fungal phylogeny as displayed in Fig. 3a. The node that was recovered by the fewest phylogenomic trees is the basal position of Archiascomycetes, represented by Sch.pombe here, within Ascomycota. All other nodes are supported by many of the trees (see Fig. 2 and Table 2). Although most of these branches are supported by recent literature (Table 1 in Supplementary Material), this research helped provide support for those cases that were inconclusive (Table 2 and Table 2 in Supplementary Material). What is striking in our phylogenetic findings is that several of the fungal groups presented in the GenBank Taxonomy Database (Wheeler et al., 2002) should actually be adjusted. For example, Candida, Kluyveromyces, Saccharomyces and the Saccharomycetaceae are still listed as clades, while their members should be regrouped (see also Diezmann et al., 2004; Kurtzman, 1998, 2003; Lopandic et al., 2005; Prillinger et al., 2002; Tehler et al., 2003).

Our phylogenomic trees of the fungi reproduced many of the clades in accordance with the current taxonomic views. At least for the fungi, we confirm a number of standard practices in the current phylogenomics field, albeit with small differences relative to the less well-established approaches such as supertrees. A recent superalignment tree (Ciccarelli et al., 2006) has been criticized as being a ‘tree of one percent’ of the genome (Dagan and Martin, 2006). In the current study, we show that methods that are restrictive in selecting genes often create a phylogeny that is close to the gold standard. Apparently, this selection procedure is necessary to filter out the noise caused by evolutionary processes like gene duplication and gene loss, even in the absence of horizontal transfer (Andersson, 2005). Complete genomes allow us to do this automatically and still retain enough genes to construct a reliable phylogeny. Our results indicate that a (1) maximum likelihood (2) superalignment tree based on (3) selected well-aligned positions of (4) unambiguous cluster OGs, automatically derived at the level of resolution most suitable for the group of species considered, will yield a respectable tree. Maximum likelihood (1), because we find that distance trees may have trouble with the outgroup we used in this study; superalignment (2), because, on average, this phylogenomic approach recovers the most target nodes; unambiguously aligned positions (3), because this enables the inclusion of more high-quality data; and, finally, unambiguous cluster OGs derived at the level of the taxon of interest (4), because this ensures the highest resolution possible.
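As an illustration of step (2), a superalignment is obtained by concatenating the per-OG alignments into one row per species. The sketch below is not the pipeline used in this study. It assumes that each OG alignment is a dictionary from species name to aligned sequence, and that a species absent from an OG is padded with gap characters; the padding rule, the function name and the example sequences are assumptions made for illustration.

```python
def build_superalignment(alignments, species):
    """Concatenate per-OG alignments into one aligned row per species."""
    rows = {name: [] for name in species}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for name in species:
            # Assumption: a species missing from this OG is filled with gaps.
            rows[name].append(alignment.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in rows.items()}


# Hypothetical toy input: two OG alignments over three species.
og_alignments = [
    {"sp1": "MK-L", "sp2": "MKAL", "sp3": "MRAL"},
    {"sp1": "GT", "sp3": "GS"},
]
superalignment = build_superalignment(og_alignments, ["sp1", "sp2", "sp3"])
for name, row in superalignment.items():
    print(name, row)
```

The selection of well-aligned positions in step (3) and the maximum likelihood tree building in step (1) would be applied to the concatenated rows by separate tools and are not shown here.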


    ACKNOWLEDGEMENTS
Funding to pay the Open Access publication charges was provided by the Kluyver Centre for Genomics of Industrial Fermentation, which is supported by the Netherlands Genomics Initiative (NGI).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on October 30, 2006; revised on January 15, 2007; accepted on January 15, 2007

    REFERENCES

    Broad Institute of MIT and Harvard (http://www.broad.mit.edu).

    DOE Joint Genome Institute (http://www.jgi.doe.gov).

    Andersson JO. Lateral gene transfer in eukaryotes. Cell Mol. Life Sci. (2005) 62: 1182–1197.

    Berbee ML, et al. Ribosomal DNA and resolution of branching order among the ascomycota: how many nucleotides are enough? Mol. Phylogenet. Evol. (2000) 17: 337–344.

    Bininda-Emonds ORP. The evolution of supertrees. Trends Ecol. Evol. (2004) 19: 315–322.

    Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. (2000) 17: 540–552.

    Ciccarelli FD, et al. Toward automatic reconstruction of a highly resolved tree of life. Science (2006) 311: 1283–1287.

    Cliften P, et al. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science (2003) 301: 71–76.

    Creevey CJ, McInerney JO. Clann: investigating phylogenetic information through supertree analyses. Bioinformatics (2005) 21: 390–392.

    Criscuolo A, et al. SDM: a fast distance-based approach for (super)tree building in phylogenomics. Syst. Biol. (2006) in press.

    Dagan T, Martin W. The tree of one percent. Genome Biol. (2006) 7: 118.

    Daubin V, et al. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. (2002) 12: 1080–1090.

    Daubin V, et al. Phylogenetics and the cohesion of bacterial genomes. Science (2003) 301: 829–832.

    Dean RA, et al. The genome sequence of the rice blast fungus Magnaporthe grisea. Nature (2005) 434: 980–986.

    Delsuc F, et al. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. (2005) 6: 361–375.

    Dietrich FS, et al. The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science (2004) 304: 304–307.

    Diezmann S, et al. Phylogeny and evolution of medical species of Candida and related taxa: a multigenic analysis. J. Clin. Microbiol. (2004) 42: 5624–5635.

    Doolittle WF. Phylogenetic classification and the universal tree. Science (1999) 284: 2124–2129.

    Dujon B, et al. Genome evolution in yeasts. Nature (2004) 430: 35–44.

    Dutilh BE, et al. The consistent phylogenetic signal in genome trees revealed by reducing the impact of noise. J. Mol. Evol. (2004) 58: 527–539.

    Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004) 32: 1792–1797.

    Farris RJ. Phylogenetic analysis under Dollo's law. Syst. Zool. (1977) 26: 77–88.

    Fell JW, et al. Biodiversity and systematics of basidiomycetous yeasts as determined by large-subunit rDNA D1/D2 domain sequence analysis. Int. J. Syst. Evol. Microbiol. (2000) 50 (Pt 3): 1351–1371.

    Felsenstein J. PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics (1989) 5: 164–166.

    Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. (1970) 19: 99–113.

    Galagan JE, et al. The genome sequence of the filamentous fungus Neurospora crassa. Nature (2003) 422: 859–868.

    Galagan JE, et al. Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature (2005) 438: 1105–1115.

    Ge F, et al. The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol. (2005) 3: e316.

    Goffeau A, et al. Life with 6000 genes. Science (1996) 274: 546–567.

    Guarro J, et al. Developments in fungal taxonomy. Clin. Microbiol. Rev. (1999) 12: 454–500.

    Hillis DM, et al. Application and accuracy of molecular phylogenies. Science (1994) 264: 671–677.

    Huynen MA, et al. Inversions and the dynamics of eukaryotic gene order. Trends Genet. (2001) 17: 304–306.

    James TY, et al. Reconstructing the early evolution of Fungi using a six-gene phylogeny. Nature (2006) 443: 818–822.

    Jeffroy O, et al. Phylogenomics: the beginning of incongruence? Trends Genet. (2006) 22: 225–231.

    Jones T, et al. The diploid genome sequence of Candida albicans. Proc. Natl. Acad. Sci. USA (2004) 101: 7329–7334.

    Kamper J, et al. Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature (2006) 444: 97–101.

    Katinka MD, et al. Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature (2001) 414: 450–453.

    Kellis M, et al. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature (2004) 428: 617–624.

    Kellis M, et al. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature (2003) 423: 241–254.

    Korbel JO, et al. SHOT: a web server for the construction of genome phylogenies. Trends Genet. (2002) 18: 158–162.

    Kouvelis VN, et al. The analysis of the complete mitochondrial genome of Lecanicillium muscarium (synonym Verticillium lecanii) suggests a minimum common gene organization in mtDNAs of Sordariomycetes: phylogenetic implications. Fungal Genet. Biol. (2004) 41: 930–940.

    Kunin V, et al. The net of life: reconstructing the microbial phylogenetic network. Genome Res. (2005) 15: 954–959.

    Kuramae E, et al. Phylogenomics reveal a robust fungal tree of life. FEMS Yeast Res. (2006) 6: 1213–1220.

    Kurtzman CP. Discussion of teleomorphic and anamorphic ascomycetous yeasts and a key to genera. In: The Yeasts, A Taxonomic Study. Kurtzman CP, Fell JW, eds. (1998) Elsevier, Amsterdam, The Netherlands. 111–121.

    Kurtzman CP. Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and Zygotorulaspora. FEMS Yeast Res. (2003) 4: 233–245.

    Langkjaer RB, et al. Yeast genome duplication was followed by asynchronous differentiation of duplicated genes. Nature (2003) 421: 848–852.

    Loftus BJ, et al. The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. Science (2005) 307: 1321–1324.

    Lopandic K, et al. Estimation of Phylogenetic relationships within the Ascomycota on the basis of 18S rDNA sequences and chemotaxonomy. Mycol. Progress (2005) 4: 205–214.

    Lutzoni F, et al. Assembling the fungal tree of life: Progress, classification and evolution of subcellular traits. Am. J. Bot. (2004) 91: 1446–1480.

    Martinez D, et al. Genome sequence of the lignocellulose degrading fungus Phanerochaete chrysosporium strain RP78. Nat. Biotechnol. (2004) 22: 695–700.

    Medina M. Genomes, phylogeny, and evolutionary systems biology. Proc. Natl. Acad. Sci. USA (2005) 102 (Suppl 1): 6630–6635.

    Nierman WC, et al. Genomic sequence of the pathogenic and allergenic filamentous fungus Aspergillus fumigatus. Nature (2005) 438: 1151–1156.

    Philippe H, et al. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol. Biol. Evol. (2004) 21: 1740–1752.

    Prillinger H, et al. Phylogeny and systematics of the fungi with special reference to the Ascomycota and Basidiomycota. Fungal Allergy and Pathogenicity (2002) 81: 207–295.

    Remm M, et al. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. (2001) 314: 1041–1052.

    Robbertse B, et al. A phylogenomic analysis of the Ascomycota. Fungal Genet. Biol. (2006) 43: 715–725.

    Rokas A, et al. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature (2003) 425: 798–804.

    Scannell DR, et al. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature (2006) 440: 341–345.

    Scorzetti G, et al. Systematics of basidiomycetous yeasts: a comparison of large subunit D1/D2 and internal transcribed spacer rDNA regions. FEMS Yeast Res. (2002) 2: 495–517.

    Snel B, et al. Genome phylogeny based on gene content. Nat. Genet. (1999) 21: 108–110.

    Tatusov RL, et al. A genomic perspective on protein families. Science (1997) 278: 631–637.

    Tehler A, et al. The full-length phylogenetic tree from 1551 ribosomal sequences of chitinous fungi, Fungi. Mycol. Res. (2003) 107: 901–916.

    Teichmann SA, Mitchison G. Is there a phylogenetic signal in prokaryote proteins? J. Mol. Evol. (1999) 49: 98–107.

    Thomarat F, et al. Phylogenetic analysis of the complete genome sequence of Encephalitozoon cuniculi supports the fungal origin of microsporidia and reveals a high frequency of fast-evolving genes. J. Mol. Evol. (2004) 59: 780–791.

    van der Heijden RTJM, et al. Orthology prediction at scalable resolution through automated analysis of phylogenetic trees. BMC Bioinformatics (in press).

    Vivares CP, et al. Functional and evolutionary analysis of a eukaryotic parasitic genome. Curr. Opin. Microbiol. (2002) 5: 499–505.

    Wheeler DL, et al. Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res. (2002) 30: 13–16.

    Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature (1997) 387: 708–713.

    Wood V, et al. The genome sequence of Schizosaccharomyces pombe. Nature (2002) 415: 871–880.

    Zomorodipour A, Andersson SG. Obligate intracellular parasites: Rickettsia prowazekii and Chlamydia trachomatis. FEBS Lett. (1999) 452: 11–15.





