Bioinformatics Advance Access originally published online on September 24, 2007
Bioinformatics 2007 23(24):3384-3385; doi:10.1093/bioinformatics/btm438
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TreeSnatcher: coding trees from images
1Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, Dr-Bohr-Gasse 9/6, A-1030 Vienna, 2University of Vienna, 3Medical University of Vienna and 4University of Veterinary Medicine Vienna, Vienna, Austria
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: TreeSnatcher is a GUI-driven JAVA application for the semi-automatic recognition of multifurcating phylogenetic trees in pixel images. The program accepts an image file as input and analyzes the topology and the metrics of a tree depicted. The analysis is carried out in a multiple-stage process using algorithms from image analysis. In the end, TreeSnatcher produces a Newick tree code that represents the tree structure optionally including branch lengths. TreeSnatcher can process trees with 100 leaves or more in a few seconds.
Availability: TreeSnatcher was developed in JAVA under Mac OS X and is executable on UNIX/Linux, Windows and Mac OS X systems. The application and its documentation can be freely downloaded from http://www.cibiv.at/software/treesnatcher.
Contact: arndt.von.haeseler{at}univie.ac.at
| 1 INTRODUCTION |
|---|
|
|
|---|
Figures of phylogenetic trees are widely used in publications to illustrate the result of an evolutionary analysis. However, as one cannot effortlessly extract a machine-readable representation, i.e. a Newick expression, of the phylogeny from such images, those are not suited for subsequent reanalysis or easy compilation of tree topologies. Therefore, a computer-readable representation of a published tree has either to be built completely by hand or by using special applications. The program TreeThief (Rambaut, 2000), presumably the first that meets this demand, allows the user to trace a given tree by clicking on each of its nodes in turn. However, TreeThief gets quite tedious if the number of taxa represented in the tree is large.
Here, we present a conceptually advanced program, TreeSnatcher. It identifies the topology of a tree (e.g. a figure from a publication) and its branch lengths almost automatically with only minor user interaction. The application features a sophisticated graphical user interface that is based on the JAVA Swing API.
| 2 METHODS |
|---|
|
|
|---|
2.1 Prerequisites
The current version of TreeSnatcher opens tree-image files in the formats PNG, JPG/JPEG or GIF. The PDF format is currently not supported. The user has to provide an image that (a) shows a tree and (b) complies with the following requirements:
- The image has a light and homogeneous background that is clearly separated from a dark foreground.
- The tree along with species names and branch captions constitute the foreground. Lettering and branches must not overlap.
- The branches consist of evenly thin, plain, not necessarily straight continuous lines.
- The inner nodes are branching points between lines and have no circles, rectangles, etc. inscribed.
- The outer nodes do not feature leaves: the edges end at a well-defined location.
If the image does not meet those requirements, the tree topology will not be identified correctly. An example of an admissible tree-image is displayed in Figure 1a.
|
2.2 WORKFLOW
Working with TreeSnatcher takes place along a mandatory succession of program stages. The program checks whether the user has performed the formal task(s) specified at any given stage and permits advancing to the next stage. On the other hand, the user controls all image manipulations and recognition tasks performed by the program. Figure 1b displays an early stage of the tree encoding production chain.
The entire workflow is as follows:
- The program converts the specified RGB image into a grayscale image (c.f. ITU-R Recommendation BT.709-3, 1998). As color information is not needed for the tree recognition, all RGB values get expressed as shades of gray.
- The user trims and cuts the image at will. That is to say, the user can select sub-trees or a subset of taxa from the converted image.
- The user corrects the image by drawing and erasing pixels. If a line of known length is present in the image, i.e. a scale, the user can mark it and specify a unit length in a dialog box to scale the tree.
- The program first separates all foreground shades from the background shades in order to extract the tree from the background. Then, it thins out the foreground to prepare the detection of tree nodes and edges. The user marks a location on an edge and instructs the program to color the foreground reachable from this location.
- The program carries out the detection of nodes and leaves based on the previously colored part of the image. The user rejects or adds tree nodes by drawing and erasing pixels. Then TreeSnatcher repeats recognition step 5 on demand.
- The program displays the inner nodes and edges of the tree. The user can provide individual edge lengths in a dialog box by clicking on the nodes that are uniquely assigned to an edge. This overrides a previous specification of a unit length for the whole tree.
- The user clicks on each leaf node in turn in order to type in the corresponding species name in a dialog box.
- The user either clicks on a tree node to designate it as the root of the tree, or clicks on a position on an edge to mark the root.
- The program constructs and displays the Newick expression that represents the tree structure. If the user has provided scaling information for the tree, the Newick string contains branch lengths.
Figure 1c displays the Newick code for the tree in Figure 1a. A detailed description of the thinning (skeletonizing) algorithm used in step 4 is given in Zhang and Suen (1984). In steps 4 and 5, several variants on the flood-filling technique are used (Burger and Burge, 2005).
| 3 PERFORMANCE |
|---|
|
|
|---|
We carried out some authentic tree recognition tasks with TreeSnatcher in which we analyzed images of different size and resolution on a PowerPC equipped Apple iBook 800 MHz with 640 MB main memory. The JAVA SDK 1.5 Virtual Machine running on top of Mac OS X consumes a fairly high amount of heap memory and reserves a huge portion of virtual memory, leaving a comparatively small amount of physical memory for TreeSnatcher. The performance decrease had to be compensated by swapping out images to disk.
The fourth and the fifth steps in the workflow are the most time consuming: The program performs binarizing and skeletonizing of a 300 x 400 RGB grayscale image in 2 s and of a 1400 x 1200 image in 8 s. For the first image representing a tree with 20 species the node and edge detection is done within 2 s. For the second image representing a tree with 100 species the latter is done in 12 s. The response time will be further improved by algorithmic means.
| 4 CONCLUSION |
|---|
|
|
|---|
We presented a program that automates the generation of a machine-readable representation of phylogenetic trees contained in images. TreeSnatcher uses established methods from image analysis to preprocess the source image and to detect the tree structure. The user supervises the semi-automatic recognition process on every program stage and makes corrections to the image where necessary. This novel approach capitalizes on the cooperation between user and program and works well in practice. Furthermore, it accelerates inputting huge topologies from scanned images. This suggests that TreeSnatcher is well fitted to support the research on complex phylogenies. TreeSnatcher will be helpful if one wants to combine phylogenies based on a collection of already published trees that are not electronically available (c.f. Bininda-Emonds, 2004).
| ACKNOWLEDGEMENT |
|---|
|
|
|---|
We thank the Vienna Science and Technology Fund (WWTF) for generous financial support.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Received on July 3, 2007; revised on August 15, 2007; accepted on August 17, 2007
| REFERENCES |
|---|
|
|
|---|
Bininda-Edmonds ORP. Phylogenetic supertrees: combining information to reveal the tree of life. Comput. Biol. Ser (2004) 4:1–550.
Burger W, Burge MJ. Digitale Bildverarbeitung (2005) Heidelberg, Berlin: Springer Verlag. 196–200.
International Telecommunications Union: ITU-R Recommendation BT.709-3 (1998) Basic Parameter Values for the HDTV Standard for the Studio and for International Programme Exchange.
Rambaut A. TreeThief: a tool for manual phylogenetic tree entry. (2000) http://evolve.zoo.ox.ac.uk/software/TreeThief/main.html.
Zhang TY, Suen CY. A fast parallel algorithm for thinning digital patterns. Image Process. Comput. Vis (1984) 27:236–239.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
