Bioinformatics Advance Access originally published online on September 1, 2008
Bioinformatics 2008 24(21):2539-2541; doi:10.1093/bioinformatics/btn466
Bosque: integrated phylogenetic analysis software
Departamento de Oceanografía, Universidad de Concepción, Casilla 160-C, Concepción, Chile
*To whom correspondence should be addressed.
ABSTRACT
Summary: Phylogenetic analyses today involve handling computer files in different formats and, often, several computer programs. Although some widely used applications have integrated important functionalities for such analyses, they still work with local resources only: input/output files (which users have to manage) and local computing (users sometimes have to leave programs running on their desktop computers for extended periods of time). To address these problems we have developed Bosque, a multi-platform client–server software that performs standard phylogenetic tasks either locally or remotely on servers, and integrates the results in a local relational database. Bosque performs sequence alignments and supports graphical visualization and editing of trees, thus providing a powerful environment that integrates all the steps of phylogenetic analyses.
Availability: http://bosque.udec.cl
Contact: sram{at}profc.udec.cl
1 INTRODUCTION
Modern phylogenetic analyses use molecular sequence data (nucleotides and amino acids) to infer the phylogeny of organisms, genes or proteins. Hence, every analysis begins with the collection of a set of sequences of interest, followed by their alignment, which then becomes the input data for the different phylogenetic methods that ultimately produce a phylogenetic tree. Considering this basic pipeline, we have developed Bosque, a graphical program that integrates all the steps of a phylogenetic analysis described above. At the core of its design is the concept of the Tree-Project, which is a group consisting of: (i) a set of sequences, (ii) an alignment and (iii) a set of trees derived from this alignment. The functionalities of Bosque are accordingly divided into three parts: sequence management, alignment creation and editing, and tree reconstruction and editing. All these features are part of a single, easy-to-use graphical program that runs on the main desktop platforms available today: Linux, MacOS X and Windows.
2 DESCRIPTION
The functionalities of Bosque may be divided into three groups: those for sequence management, those for alignment construction and those for tree editing. In addition, a client–server functionality provides an easy way to share data among users working in a common environment (same data, same server access).
All the data are stored in a local relational database (implemented on SQLite¹), whose format is transparent to the user, so it is not necessary to take care of files and formats once the data have been imported into the application. SQLite also has an open format, so the data can be manipulated by external programs, although this is not recommended.
Tree-Projects have the advantage of consolidating a set of data (sequences, alignment and trees), previously spread across several files (for example: fasta or nexus files, alignment files, possibly in fasta format as well, and newick files), into a single element.
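As an illustration of how a Tree-Project could be held in a single SQLite file, the following is a minimal sketch using Python's standard `sqlite3` module. The table names, column names and example rows are hypothetical and chosen only for this example; the actual layout of the Bosque database is not described here.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the actual layout of the Bosque database.
SCHEMA = """
CREATE TABLE tree_project (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL          -- e.g. 'dna', 'protein', 'dna_coding'
);
CREATE TABLE sequence (
    id         INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES tree_project(id),
    accession  TEXT,
    organism   TEXT,
    residues   TEXT NOT NULL,   -- raw, unaligned sequence
    aligned    TEXT             -- this sequence's row of the alignment, if any
);
CREATE TABLE tree (
    id         INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES tree_project(id),
    method     TEXT,
    newick     TEXT NOT NULL
);
"""


def create_project_db(path=":memory:"):
    """Open (or create) a database and install the illustrative schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def project_summary(conn, project_id):
    """Return (number of sequences, number of trees) for one project."""
    n_seqs = conn.execute(
        "SELECT COUNT(*) FROM sequence WHERE project_id = ?", (project_id,)
    ).fetchone()[0]
    n_trees = conn.execute(
        "SELECT COUNT(*) FROM tree WHERE project_id = ?", (project_id,)
    ).fetchone()[0]
    return n_seqs, n_trees


# Placeholder data: the accessions and organism names are made up.
conn = create_project_db()
conn.execute("INSERT INTO tree_project (id, name, kind) VALUES (1, 'demo', 'dna')")
conn.executemany(
    "INSERT INTO sequence (project_id, accession, organism, residues) "
    "VALUES (?, ?, ?, ?)",
    [(1, "X1", "Organism one", "ACGT"), (1, "X2", "Organism two", "ACGA")],
)
conn.execute(
    "INSERT INTO tree (project_id, method, newick) VALUES (1, 'example', '(X1,X2);')"
)
print(project_summary(conn, 1))
```

Keeping sequences, alignment rows and trees keyed by one project identifier is what allows the whole project to be treated as a single element rather than as a collection of files.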
2.1 Functionalities for sequences
An important feature of Bosque is the ability to import sequence data from different sources:
- Local files: fasta and GenBank files containing sequence data.
- GenBank-Entrez: by specifying search patterns that are used to query the GenBank server via HTTP.
- Blast: by specifying a sequence query to public Blast servers, reading the results onto a built-in sequence browser.
When the two latter methods are used, the user automatically loads valuable information in addition to the mere sequence data, namely: accession numbers, complete organism name, lineage, complete GenBank information, etc.
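For the local-file case, reading a fasta file amounts to splitting the text at header lines that start with `>`. The sketch below is a minimal reader written for illustration; the function name `parse_fasta` and the example records are assumptions and do not correspond to code in Bosque.

```python
def parse_fasta(text):
    """Parse fasta-formatted text into a list of (header, sequence) pairs.

    Illustrative reader only: it joins wrapped sequence lines and ignores
    blank lines, but performs no validation of the residues.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example input.
example = """>seq1 first example
ACGT
ACGT
>seq2 second example
GGCC
"""
records = parse_fasta(example)
print(records)
```

A plain fasta header carries only free text, which is why sequences imported through Entrez or Blast queries arrive with richer annotation than those read from local fasta files.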
The sequence editor allows the editing of the main fields of any sequence, i.e. definition, sequence data, accession number, organism name, lineage, etc. While modifying this information is not so common, the translator of DNA sequences is a useful tool for analyzing these sequences in the amino acid space.² With this editor it is possible to specify that a particular sequence is protein-coding, which activates a translation tab on the window. A translation of the sequence can then be prepared using any of the following methods: (i) if the sequence has been downloaded from GenBank (either by Entrez query or Blast query), by using the proposed translations provided in the GenBank file, which are already in the local database, (ii) by selecting a protein sequence to serve as a template for the translation and (iii) by selecting a codon start position to begin a user-controlled translation.
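Method (iii) can be sketched as follows. This is a minimal example under stated assumptions: it uses the standard genetic code only, the function name `translate` and its `codon_start` parameter (1-based, as in GenBank feature tables) are chosen for this illustration, and codons containing ambiguous bases are written as `X`.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order
# for the first, second and third codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, codon_start=1):
    """Translate a nucleotide sequence starting at a 1-based codon start.

    Stop codons are written as '*', unknown or ambiguous codons as 'X',
    and trailing bases that do not complete a codon are ignored.
    """
    if codon_start < 1:
        raise ValueError("codon_start is 1-based and must be >= 1")
    seq = dna.upper().replace("U", "T")
    start = codon_start - 1
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))                  # reading frame starts at base 1
print(translate("GATGGCCTAA", codon_start=2))  # same frame, shifted by one base
```

Other genetic codes (for example mitochondrial ones) would require a different amino acid string; that choice is left out of this sketch.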
2.2 Functionalities for alignments
To compute the sequence alignments, Bosque makes use of the following programs: Muscle (Edgar, 2004) (local and remote) and Mafft (Katoh et al., 2002) (server-side only). These programs are used for the alignment of either nucleotide or amino acid sequences. However, when the Tree-Project is of the protein-coding DNA type, there is a third option, which is to align the nucleotide sequences in the amino acid space.
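Aligning nucleotide sequences in the amino acid space is commonly done by translating each sequence, aligning the proteins, and then threading the original codons back onto the protein alignment so that every gap becomes a three-base gap. The text does not specify how Bosque implements this step, so the following is only a sketch of that common approach; the function name `thread_codons` is hypothetical.

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Map an unaligned coding sequence onto its aligned translation.

    Each residue in `aligned_protein` consumes the next codon of `dna`;
    each gap character is expanded to three gap characters.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if n_residues > len(codons):
        raise ValueError("the DNA is too short for the aligned protein")
    next_codon = iter(codons)
    out = []
    for ch in aligned_protein:
        out.append(gap * 3 if ch == gap else next(next_codon))
    return "".join(out)


# A protein alignment row 'M-A' and its coding sequence 'ATGGCC'.
print(thread_codons("M-A", "ATGGCC"))
```

Because gaps are always inserted in multiples of three, the reading frame of every sequence is preserved in the resulting nucleotide alignment.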
In the alignment editor, the user can replace, shift or delete the characters (nucleotide bases or amino acid residues) of the alignment, in order to reach a result that better represents the biological evolutionary processes that underlie a phylogenetic hypothesis. To improve the quality of the alignments, we have included the program Gblocks (Talavera and Castresana, 2007), which automatically eliminates poorly aligned sites.
Within this same window, a similarity table can be obtained once the alignment has been constructed. This tool is particularly useful for finding operational taxonomic units (OTUs) among the sequences. Bosque can group the sequences into OTUs automatically from the similarity table.
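The idea of deriving OTUs from a similarity table can be sketched as below. The definition of similarity (identity over columns where neither sequence has a gap), the single-linkage grouping and the 0.97 threshold are all assumptions made for this illustration; the text does not state which definition or clustering rule Bosque uses.

```python
def identity(a, b, gap="-"):
    """Fraction of identical positions among columns without gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def similarity_table(alignment):
    """Return {(name_a, name_b): identity} for all pairs of sequences."""
    names = list(alignment)
    return {(p, q): identity(alignment[p], alignment[q])
            for p in names for q in names}


def group_otus(alignment, threshold=0.97):
    """Single-linkage grouping: two sequences share an OTU if they are
    connected by a chain of pairs whose identity is >= threshold."""
    names = list(alignment)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if identity(alignment[p], alignment[q]) >= threshold:
                parent[find(q)] = find(p)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Made-up aligned sequences.
alignment = {"a": "ACGTACGTAC", "b": "ACGTACGTAC", "c": "TTTTACGTAC"}
print(similarity_table(alignment)[("a", "c")])
print(group_otus(alignment))
```

Other linkage rules (for example requiring every pair within an OTU to exceed the threshold) would generally give smaller groups than the single-linkage rule used here.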
From the alignment, it is also possible to select a sequence and perform a Blast query directly against the NCBI-servers.
2.3 Functionalities for trees
This is the most important part of Bosque, since a phylogenetic tree is normally a required product of any phylogenetic analysis. Within the same Tree-Project, Bosque allows the manipulation of many trees, because users sometimes want to construct different trees using different methods and compare them to select the best according to expert knowledge. Moreover, in recent years, researchers have developed statistical techniques to compare different maximum likelihood models. By selecting the model that best explains the evolution of the sequences in the analysis,³ it is possible to choose, among the tested models, the best-fit model that most closely approximates the real tree. For this purpose, we have integrated into Bosque the program jModelTest (Posada, 2008), which works for nucleotide sequences.
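The information criteria mentioned above have simple closed forms: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where ln L is the maximized log-likelihood, k the number of free parameters and n the sample size (here taken to be the number of alignment sites, which is one common choice). The sketch below ranks candidate models by either criterion. The model labels, log-likelihoods and parameter counts are invented for illustration, and the code is not taken from jModelTest.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(fits, n_sites, criterion="aic"):
    """Return the name of the model with the lowest criterion value.

    `fits` maps a model name to (log_likelihood, n_params).
    """
    def score(item):
        _, (lnl, k) = item
        return aic(lnl, k) if criterion == "aic" else bic(lnl, k, n_sites)

    return min(fits.items(), key=score)[0]


# Invented values: labels, log-likelihoods and parameter counts are
# placeholders, not results from any real data set.
fits = {
    "model_A": (-1050.0, 0),
    "model_B": (-1020.0, 4),
    "model_C": (-1019.0, 8),
}
print(best_model(fits, n_sites=500, criterion="aic"))
print(best_model(fits, n_sites=500, criterion="bic"))
```

In this example the richest model improves the likelihood only marginally, so both criteria penalize its extra parameters and prefer the intermediate model.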
For the phylogenetic tree reconstruction, we have included well-known programs publicly available to the scientific community. To this date, we have selected and included: PhyML (Guindon and Gascuel, 2003), Phylip (Felsenstein, 2005) and TreePuzzle (Schmidt et al., 2002).
In the future, other programs can be included in Bosque, provided that the corresponding authors grant their permission. This design gives Bosque considerable flexibility and versatility.
In the tree editor, it is possible to move, remove, collapse, expand and rearrange leaves and branches. The appearance (fonts and colors) is completely configurable for all the names on the tree plot. Support values can also be added to any branch within a tree. Currently, Bosque can plot the tree as a typical rectangular tree and as an unrooted circular tree. Finally, the tree editor can export the graphics as an SVG⁴ file that can be opened with any vector graphics software, such as Inkscape (freely available for Linux, MacOS X and Windows) or Adobe Illustrator (commercially available for MacOS X and Windows), and then used to generate an image with an appropriate resolution for scientific publication.
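Tree editing operations of this kind act on a tree structure such as the one encoded in a newick file. As an illustration, the sketch below parses a simple newick string into nested dictionaries and lists its leaves. It is an assumption-laden minimal parser: it handles names, branch lengths and internal node labels (often used to carry support values), but not quoted labels or bracketed comments, and it is not Bosque's own parser.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with the keys
    'name', 'length' and 'children'."""
    s = text.strip()
    if not s.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while s[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        # Optional label: a leaf name, or an internal label such as support.
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        name = s[start:pos]
        # Optional branch length after ':'.
        length = None
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if s[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Return the leaf names of a parsed tree, in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


# Made-up tree with a support value of 90 on the (A,B) clade.
tree = parse_newick("((A:0.1,B:0.2)90:0.05,C:0.3);")
print(leaf_names(tree))
```

With such a structure, rearranging branches corresponds to reordering a node's `children` list, and collapsing a clade corresponds to drawing an internal node without descending into its children.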
2.4 Network functionalities
Given the networking possibilities available nowadays, Bosque not only consists of a user graphical program (that can be run natively on Windows, MacOS X and Unix/Linux) but also includes a server (Bosque server), which complements the Bosque program with the following features:
- Remote execution of phylogenetic programs on dedicated servers. The Bosque server maintains a relational database (implemented on MySQL) in which users, jobs and phylogenetic resources are managed. Once a client is connected to a server, it can launch programs remotely and leave them running until they finish. If the server is busy, the job is placed in a queue of jobs awaiting execution.
- Interaction with remote users. Within Bosque, users can interact through a chat window and, more importantly, by sharing phylogenetic resources. These resources include Sequences and Tree-Projects (which in turn include, as described above, sequences, an alignment and multiple trees). Users share their resources by uploading sequences or Tree-Projects to the Bosque server. Other users connected to the same server can then download those resources. This functionality is particularly useful when multiple users are working with the same phylogenetic dataset.
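The queuing behaviour described for remote execution can be sketched as follows. This is a simplified in-memory model, assuming first-in, first-out ordering and a fixed number of execution slots; the class name `JobQueue` and its methods are hypothetical, and the real Bosque server keeps its jobs in a relational database rather than in memory.

```python
from collections import deque


class JobQueue:
    """Toy model of a server that runs a limited number of jobs at once
    and queues the rest in arrival order (FIFO is an assumption)."""

    def __init__(self, max_running=1):
        self.max_running = max_running
        self.running = []
        self.waiting = deque()

    def submit(self, job_id):
        """Start the job if a slot is free, otherwise queue it."""
        if len(self.running) < self.max_running:
            self.running.append(job_id)
            return "running"
        self.waiting.append(job_id)
        return "queued"

    def finish(self, job_id):
        """Mark a job as finished and start the next waiting job, if any.

        Returns the identifier of the job that was started, or None.
        """
        self.running.remove(job_id)
        if self.waiting:
            next_job = self.waiting.popleft()
            self.running.append(next_job)
            return next_job
        return None


queue = JobQueue(max_running=1)
first = queue.submit("alignment-job")
second = queue.submit("tree-job")
started = queue.finish("alignment-job")
print(first, second, started)
```

Because submitted jobs are tracked on the server side, a client can disconnect after submission and collect the results later.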
By allowing remote use of computing resources, Bosque lets users make better use of their computational infrastructure: complex computations (which may take a long time, depending on the volume of the data) can be assigned to servers, so that desktop computers are not tied up in resource- and time-consuming batch executions. The sharing of resources through the Bosque server is important for promoting collaboration among researchers, who are normally unfamiliar with the overwhelming number of file formats and computer programs used in the different stages of conventional phylogenetic analyses.
3 FINAL DISCUSSION
It is important to note that, at the core of the design of Bosque, we have used two old and simple yet powerful ideas from computer science, namely networked computing and software reuse. The former has already been described above, while the latter is reflected in the reuse of existing computer programs to perform the two most important processes in phylogenetic analyses: multiple sequence alignment and tree reconstruction. This is an important issue, since cutting-edge developments in molecular phylogenetics, and in sequence analysis in general, are normally accompanied by software implementing the novel methods, which in turn is normally written as console programs without elaborate, easy-to-use graphical interfaces. The reason for this lies in the fact that, at the stage of development of new methods or techniques in science, the researcher is interested in the fundamental functionality of the software (back-end) rather than in its ease of use (front-end). Therefore, a researcher who is constantly applying new methods to analyze molecular data needs to spend time installing and learning complicated computer programs and, in some cases, pipelining a set of them. In many scientific centers, researchers have a bioinformatics group that is devoted to these tasks, but not all centers have access to such technical support. More importantly, phylogenetic analysis is frequently an iterative process, and thus it is desirable that the researcher be involved in all stages of the analysis.
As stated recently by Kumar and Dudley (2007), the majority of the scientists at the forefront of experimental research are not bioinformaticians and are therefore not accustomed to console-interface programs. Bosque should thus help researchers working on molecular phylogenetics to use their computing resources efficiently, and to run programs that are otherwise not easy to use from within a modern desktop graphical user interface.
ACKNOWLEDGEMENTS
We thank Edwin Rodríguez for help with the development of the graphical tree plotting and editing, and our colleagues of the microbial oceanography group at our lab (PROFC) for testing every version of this software, providing important suggestions for improving the capabilities of this application. We are also grateful to David Posada for important suggestions and help on the integration of jModelTest into Bosque and to Heather Bouman for useful comments on the article.
Funding: Chilean National Commission for Scientific and Technological Research through the PBCT (grant RED12); FONDAP (grant 15010007) programs; the Millennium Scientific Initiative (grant EBMA P04/007).
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Alex Bateman
¹ SQLite is a small C library that implements a self-contained, embeddable, zero-configuration SQL database engine. http://www.sqlite.org.
² When they are protein-coding DNA sequences, of course.
³ According to likelihood ratio tests, AIC (Akaike information criterion) or BIC (Bayesian information criterion).
⁴ SVG stands for Scalable Vector Graphics; it is an open file format for describing 2D vector graphics.
Received on March 4, 2008; revised on August 21, 2008; accepted on August 28, 2008
REFERENCES
Edgar RC. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004) 32:1792–1797.
Felsenstein J. Phylip (Phylogeny Inference Package) version 3.6. Distributed by the author. (2005) Department of Genome Sciences, University of Washington, Seattle.
Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol (2003) 52:696–704.
Katoh K, et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res (2002) 30:3059–3066.
Kumar S, Dudley J. Bioinformatics software for biologists in the genomics era. Bioinformatics (2007) 23:1713–1717.
Posada D. jModelTest: phylogenetic model averaging. Mol. Biol. Evol (2008) 25:1253–1256.
Schmidt HA, et al. Tree-Puzzle: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics (2002) 18:502–504.
Talavera G, Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol (2007) 56:564–577.