Skip Navigation


Bioinformatics Advance Access originally published online on November 21, 2006
Bioinformatics 2007 23(3):372-374; doi:10.1093/bioinformatics/btl592
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/3/372    most recent
btl592v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Google Scholar
Right arrow Articles by Katoh, K.
Right arrow Articles by Toh, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Katoh, K.
Right arrow Articles by Toh, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences

Kazutaka Katoh 1,* and Hiroyuki Toh 2

1 Digital Medicine Initiative, Kyushu University Fukuoka 812-8582, Japan
2 Medical Institute of Bioregulation, Kyushu University Fukuoka 812-8582, Japan

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATION
 REFERENCES
 

Motivation: To construct a multiple sequence alignment (MSA) of a large number (>~10 000) of sequences, the calculation of a guide tree with a complexity of O(N2) to O(N3), where N is the number of sequences, is the most time-consuming process.

Results: To overcome this limitation, we have developed an approximate algorithm, PartTree, to construct a guide tree with an average time complexity of O(N log N). The new MSA method with the PartTree algorithm can align ~60 000 sequences in several minutes on a standard desktop computer. The loss of accuracy in MSA caused by this approximation was estimated to be several percent in benchmark tests using Pfam.

Availability: The present algorithm has been implemented in the MAFFT sequence alignment package (http://align.bmr.kyushu-u.ac.jp/mafft/software/).

Contact: katoh{at}bioreg.kyushu-u.ac.jp

Supplementary information: Supplementary information is available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATION
 REFERENCES
 
Most multiple sequence alignment (MSA) programs use a guide tree. An MSA is computed along with the tree using a group-to-group alignment algorithm. When a large number of sequences are aligned, the construction of guide tree is the time- and space-limiting process. A distance matrix is usually calculated before tree building and it requires an O(N2) memory space, where N is the number of sequences. As for time complexity, MAFFT (Katoh et al., 2002, 2005) uses an O(N3) algorithm for constructing a variant of UPGMA guide tree. MUSCLE (Edgar, 2004a,b) uses a more efficient O(N2) algorithm. In a context where a large number of sequences are being routinely determined, the scalability of MSA methods is getting important. For instance, a Pfam (Finn et al., 2006) alignment of ABC transporter consists of ~30 000 sequences and Ribosomal Database Project II release 9 (Cole et al., 2005) contains over 200 000 SSU rRNA sequences. Here we describe a simple divisive clustering algorithm, PartTree, to construct a rough tree from a set of a large number (more than ~10 000) of unaligned sequences, with an average time complexity of O(N log N) and a space complexity of O(N).


    2 ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATION
 REFERENCES
 
Let Ni, j represent the number of sequences belonging to group j at recursive depth i (i ≥ 1). At the initial cycle (i = 1), j = 1 and N1,1 = N. Otherwise (i > 1), 1 ≤ j and Formula . The sequences are classified into n groups at each cycle, where n is a parameter given by user.

  1. The longest sequence among the Ni, j sequences is selected.
  2. The similarities between the longest sequence and the remaining Ni, j 1 sequences are calculated.
  3. From the Ni, j sequences, n sequences are picked up as ‘seeds’. They include (a) the longest sequence, (b) the sequence with the lowest similarity and (c) randomly selected n – 2 sequences.
  4. The similarities among the n seeds are computed. If two seeds are highly similar to each other, shorter one is excluded. The number of the remaining seeds is denoted as n'.
  5. An UPGMA tree is built among the n' sequences. If n' ≥ Ni, j, then the tree is returned to the parent cycle and no further child cycle is carried out.
  6. The similarities between the n' seeds and the remaining N n' sequences are calculated. Each of the remaining sequences is classified into either of n' groups, according to the similarity. The number of sequences in group j is denoted as Ni+1, j, and each group is subjected to the child cycle with depth i + 1.
  7. The subtrees returned from the n' child cycles are combined into a single new tree along with the UPGMA tree calculated in Step 5. The new tree is returned to the parent cycle.

The number of sequences belonging to group j at depth i is estimated as Ni, j ~ Ni–1, j/n ~ N/ni on average, and the cycle is recursively repeated until N/nI < n, where I is the maximum depth. Thus I is proportional to log N on average. At depth i, O(N) sequence comparisons are performed. The overall number of sequence comparisons is therefore proportional to N log N. The time complexity of the entire procedure depends on that for computing the similarities at Steps 2, 4 and 6. This algorithm does not require a standard distance matrix with N2 elements. Instead, a partial distance matrix, with nNi, j elements, is used at each cycle and is freed before calling the child cycles.


    3 APPLICATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATION
 REFERENCES
 
The aforementioned algorithm has been implemented as the PartTree option of an MSA package MAFFT 6.0. See Figure 1B for the command-line usage. In Steps 2, 4 and 6, we use a rapid method to compute a similarity based on the number of shared 6mers (Higgins and Sharp, 1988; Jones et al., 1992; Katoh et al., 2002), with a length-dependent correction introduced in MAFFT v6 (see the MAFFT page for details). This algorithm requires O(L) steps at every comparison. Thus, the time complexity of the overall procedure is O(LN log N). We can use more accurate but time-consuming distance measures, such as FASTA (Pearson and Lipman, 1988), instead of the 6mer distance. This strategy is also implemented in MAFFT, as the FastaPartTree option, which requires FASTA v3.4 installed.


Figure 1
View larger version (78K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Speed (A) and accuracy (B and C) comparisons of the MAFFT-PartTree and existing progressive methods. In (A), CPU time T required by each method is plotted as a function of the number of sequences N. The relationship was fitted with a regression curve T = aNb. All methods were run on the RedHat Enterprise Linux WS4 on a dual 3.6 GHz Xeon with 4 GB of RAM. (B) Shows the average overlap scores (%) of each method. The symbols ‘*’ and ‘**’ represent significantly worse results than the method with the highest score at the 0.05 and 0.01 levels, respectively, by the Wilcoxon test. PartTree is virtually equivalent to UPGMA when N < n (shown in parentheses). (C) Shows the differences in accuracy score on individual alignments from NW-NS-1 to each of MUSCLE-fastest and PartTree (n = 50 and n = 1000) as a function of the number of sequences.

 
Two-round progressive technique (Katoh et al., 2002) can be combined with the present method, in which the guide tree is re-calculated from the initial MSA using PartTree and then an MSA is re-constructed. This method is referred to as PartTree-2.

The performances of the present methods were evaluated using a part (1197 entries) of the Pfam 20.0 database (Finn et al., 2006) and the full-length set of BAliBASE (Thompson et al., 2005). The following progressive MSA programs were compared (see Fig. 1B for detailed list): MUSCLE 3.6 (Edgar, 2004a,b), ClustalW 1.83 (Thompson et al., 1994), Kalign 2.01 (Lassmann and Sonnhammer, 2005), and MAFFT v5 and v6. MAFFT v5 uses a UPGMA algorithm with a time complexity of O(N3), whereas v6 adopted a faster UPGMA algorithm proposed in MUSCLE (Edgar, 2004b). See the mafft page for other differences between v5 and v6. Slower methods were applied to only smaller subsets (with 500–1000 or 500–10 000 sequences) of Pfam. See Figure 1A for the comparison of CPU time. Two-round methods are not shown in Figure 1A but approximately two times slower than the corresponding one-round methods.

Assuming all the Pfam alignments are correct, the accuracy of MSA methods were evaluated with overlap score (Lassmann and Sonnhammer, 2005) between a Pfam alignment and the result of each MSA method (Fig. 1B). The loss of accuracy of an alignment by introducing the present approximation gradually increases with N and was estimated to be ~3% when N ~ 10 000 and n = 50 (Fig. 1C). Note that all the progressive methods shown here are much less accurate than more elaborate methods, such as TCoffee [84.6 for overall BAliBASE; Notredame et al. (2000)], ProbCons [86.5; Do et al. (2005)] and MAFFT-L-INS-i (87.1). As for the topology of guide tree, the loss of accuracy from rigorous UPGMA was estimated to be 10% when N ~ 2000 and n = 50. See the Supplemental information for detailed discussion on the accuracy of tree topology.

The FastaPartTree option slightly improves the alignment accuracy in comparison with PartTree with 6mer distance, as shown in Figure 1B, because of more accurate guide tree. The Wu–Manber algorithm used in Kalign (Lassmann and Sonnhammer, 2005) might be worth considering as another distance measure. The two-round progressive method is also a practical solution to improve the accuracy of guide tree and alignment, at the cost of roughly doubled CPU time.


    Acknowledgments
 
The authors thank Robert C. Edgar for comments on his O(N2) UPGMA algorighm. The authors also thank Ken Jones, Shiraz Shah and Wataru Nemoto for providing comments and examples. This work was supported by Grant-in-Aid for ‘Comparative Genomics’ from MEXT of Japan and by JST-PDBj. Funding to pay the Open Access publication charges for this article was provided by Digital Medicine Initiative, Kyushu University.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

Received on August 23, 2006; revised on October 30, 2006; accepted on November 17, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATION
 REFERENCES
 

    Cole, J.R., et al. (2005) The ribosomal database project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res, . 33, D294–D296[Abstract/Free Full Text].

    Do, C.B., et al. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res, . 15, 330–340[Abstract/Free Full Text].

    Edgar, R.C. (2004a) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, . 32, 1792–1797[Abstract/Free Full Text].

    Edgar, R.C. (2004b) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113[CrossRef][Medline].

    Finn, R.D., et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res, . 34, D247–D251[Abstract/Free Full Text].

    Higgins, D.G. and Sharp, P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244[CrossRef][ISI][Medline].

    Jones, D.T., et al. (1992) The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci, . 8, 275–282[Abstract/Free Full Text].

    Katoh, K., et al. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, . 30, 3059–3066[Abstract/Free Full Text].

    Katoh, K., et al. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, . 33, 511–518[Abstract/Free Full Text].

    Lassmann, T. and Sonnhammer, E.L. (2005) Kalign—an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 298[CrossRef][Medline].

    Notredame, C., et al. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol, . 302, 205–217[CrossRef][ISI][Medline].

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448[Abstract/Free Full Text].

    Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, . 22, 4673–4680[Abstract/Free Full Text].

    Thompson, J.D., et al. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
K. Katoh and H. Toh
Recent developments in the MAFFT multiple sequence alignment program
Brief Bioinform, July 1, 2008; 9(4): 286 - 298.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/3/372    most recent
btl592v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Google Scholar
Right arrow Articles by Katoh, K.
Right arrow Articles by Toh, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Katoh, K.
Right arrow Articles by Toh, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?