Skip Navigation


Bioinformatics Advance Access originally published online on November 22, 2006
Bioinformatics 2007 23(2):162-168; doi:10.1093/bioinformatics/btl590
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/2/162    most recent
btl590v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhang, X.
Right arrow Articles by Kahveci, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhang, X.
Right arrow Articles by Kahveci, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

QOMA: quasi-optimal multiple alignment of protein sequences

Xu Zhang and Tamer Kahveci *

Computer and Information Science and Engineering, University of Florida Gainesville, FL 32611, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 

Motivation: We consider the problem of multiple alignment of protein sequences with the goal of achieving a large SP (Sum-of-Pairs) score.

Results: We introduce a new graph-based method. We name our method QOMA (Quasi-Optimal Multiple Alignment). QOMA starts with an initial alignment. It represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment through local optimizations within a window that moves greedily on the alignment. QOMA uses two parameters to permit flexibility in time/accuracy trade off: (1) The size of the window for local optimization. (2) The sparsity of the K-partite graph. Unlike traditional progressive methods, QOMA is independent of the order of sequences. The experimental results on BAliBASE benchmarks show that QOMA produces higher SP score than the existing tools including ClustalW, Probcons, Muscle, T-Coffee and DCA. The difference is more significant for distant proteins.

Availability: The software is available from the authors upon request.

Contact: tamer{at}cise.ufl.edu

Supplementary information: Supplementary material is available at Bioinformatics online.


    1 MOTIVATION
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 
Alignment of multiple (i.e. three or more) protein sequences is one of the most fundamental problems in computational biology. Multiple sequence alignment is widely used in many applications, such as protein structure prediction (Jones, 1999), phylogenetic analysis (Phillips et al., 2000), identification of conserved motifs and domains (Thompson et al., 1999) and protein classification (Grundy, 1997).

The quality of a multiple alignment is usually measured in terms of its SP (Sum-of-Pairs) score (Lee et al., 2002; Lipman et al., 1989; Walle et al., 2004; Stoye et al., 1997). The SP Score of an alignment, A, of sequences P1, P2, ... , PK is computed by adding the alignment scores of all of its induced pairwise alignments. It can be expressed as Formula, where K is the number of sequences, Pi is the sequence indexed by i and Score(Pi, Pj) is the score of the alignment of Pi and Pj induced by A. For simplicity, we assume fixed penalty for each indel during the computation of Score (Pi, Pj). Extension to affine gap is trivial.

Variations of the SP score are often used to reflect the biological relevance of the multiple alignment. Assigning different weights to sequences depending on their degree of similarity (Thompson et al., 1994) or different weights to different positions of the sequences depending on the structural and functional annotations (Zhang and Kahveci, 2006) are two examples to such variations. ClustalW (Thompson et al., 1994), e.g. optimizes Formula, where Wi and Wj are weights assigned to sequences Pi and Pj, respectively. For simplicity, we use the SP score in this paper. Extension to weighted SP score is trivial.

Finding the multiple sequence alignment that maximizes the SP score is an NP-complete problem (Wang and Jiangs 1994). A variety of heuristics have been developed. Progressive alignment methods form an important class of the heuristic methods. Such methods progressively align pairs of profiles in a certain order and produce a new profile until a single profile is left. A profile is either a sequence or the alignment of a set of sequences. Progressive methods, however, have an important shortcoming. The order that the profiles are chosen for alignment affects the quality of the alignment significantly. The optimal alignment may be different than all possible alignments obtained by considering all possible orderings of sequences (Zhang and Kahveci, 2006). A method, which can balance the running time and the alignment accuracy is seriously in demand.

In this paper, we consider the problem of maximizing the SP score of the alignment of multiple protein sequences. We develop a graph-based method named QOMA (Quasi-Optimal Multiple Alignment). QOMA starts by constructing an initial multiple alignment with the help of optimal pairwise alignments. It then builds a graph corresponding to the initial alignment. It iteratively places a window on this graph and improves the SP score of the initial alignment by optimizing the alignment inside the window. The location of the window is selected greedily as the one that has a chance of improving the SP score by the largest amount. QOMA uses two parameters to permit flexibility in time/accuracy trade off: (1) The window size. (2) The sparsity of the K-partite graph. The experimental results show that QOMA finds alignments with better SP score compared to existing tools including ClustalW, Probcons, Muscle, T-Coffee and DCA at similar running time. The improvement is more significant for distant proteins. The running time of QOMA is larger than that of the competing methods for large window sizes and for large number of sequences.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 introduces the basic algorithm in detail. Section 4 discusses the improved algorithm. Section 5 presents experimental results and analysis. Section 6 concludes the paper and points out possible future directions.


    2 RELATED WORK
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 
Given a table of scores for matches and mismatches between all amino acids and penalties for insertions or deletions, the optimal alignment of two sequences can be determined by dynamic programming (DP). The time and space complexity of this methods is O(N2) (Needleman and Wunsch, 1970; Smith and Waterman, 1981; Gotoh, 1982), where N is the length of each sequence. This algorithm can be extended to align K sequences, but requires O(2K NK K2) time (Lipman et al., 1989; Gupta et al., 1995). MSA (Lipman et al., 1989) is a representative program using this algorithm. However, MSA suffers from computation expenses. Variety of heuristic algorithms have been developed to overcome this difficulty (Thompson et al., 1994). These heuristic methods can be classified into four groups: progressive, iterative, anchor-based and probabilistic.

Progressive methods find multiple alignment by iteratively picking two sequences from this set and replacing them with their alignment (i.e. consensus sequence) until all sequences are aligned into a single consensus sequence. Thus, progressive methods guarantee that never more than two sequences are simultaneously aligned. This approach is sufficiently fast to allow alignments of virtually any size. ClustalW (Thompson et al., 1994), T-coffee (Notredame et al., 2000), Treealign (Hein, 1989), POA (Lee et al., 2002) and MAFFT (Katoh et al., 2002) can be grouped into this class (Sze et al., 2005). Iterative methods start with an initial alignment. They then repeatedly refine this alignment through a series of iterations until no more improvements can be made. Muscle (Edgar, 2004) can be grouped into this class as well as the progressive method class since it uses a progressive alignment at each iteration. Anchor-based methods use local motifs (short common subsequences) as anchors. Later, the unaligned regions between consecutive anchors are aligned using other techniques. DIALIGN (Morgenstern et al., 1998), Align-m (Walle et al., 2004), L-align (Huang and Miller, 1991), Mavid (Bray and Pachter, 2004), and PRRP (Gotoh, 1996) belong to this class. Probabilistic methods pre-compute the substitution probabilities by analyzing known multiple alignments. They use these probabilities to maximize the substitution probabilities for a given set of sequences. Probcons (Do et al., 2004), and Hmmt (Eddy, 1995) can be grouped into this class. The common shortcomings of the methods above are: (1) The alignments of progressive methods depend on the order of sequences. (2) They do not provide flexible quality/time trade off.

Improving the alignment quality of an initial alignment has been traditionally done manually (e.g. through programs like MaM and WebMaM (Alkan et al., 2005)). There are a few methods, which aim to optimize the SP score by running DP alignment on all sequences simultaneously. MSA is the representative in this class (Gupta et al., 1996). DCA extends MSA by utilizing ‘divide-and-conquer’ strategy (Stoye et al., 1997). Unlike progressive methods, DCA divides the sequences recursively until they are shorter than a given threshold. It then uses MSA to find the optimal solutions for the smaller problems. The performance of DCA depends on how it divides the sequences. DCA uses a cut strategy that minimizes additional costs (Stoye, 1998) and does not guarantee to find optimal solution. It uses the longest sequence in the input sequences as reference to select the cut positions. This makes it order dependent, as there is no justification why this selection (or any other selection) optimizes the SP score of the alignment. On the contrary, our method is order independent.


    3 ALGORITHM: COMPLETE K-PARTITE GRAPH
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 
In this section, we introduce the basic QOMA algorithm for aligning K protein sequences. QOMA works in two steps: (1) It constructs an initial alignment and a K-partite graph corresponding to this alignment. (2) It iteratively places a window on the sequences and replaces the window with its optimal alignment. We call this the complete K-partite graph algorithm since a letter of a protein can be aligned with any letter of the other proteins within the same window. For simplicity, we use fixed penalty for each indel. Extension of the developed method to affine gap is trivial. Next, we describe these two steps in detail.

3.1 Constructing initial alignment
The purpose of constructing an initial alignment is to roughly identify the position of each node in final alignment. It is important to find this initial alignment quickly in order to minimize initialization overhead.

The initial alignment can be constructed in many ways. One possibility is to use an existing tool, such as ClustalW. This strategy, however, has the shortcoming that the initial alignment depends on another tool, which may be order-dependent. This makes QOMA partially order-dependent. We propose to construct the initial alignment from pairwise optimal alignments of sequences. In this strategy, first, sequence pairs are optimally aligned using DP (Gotoh, 1982). An edge is added between two nodes if the nodes are matched in this alignment. A weight is assigned to each edge as the substitution score of the two residues that constitute that edge. The substitution score is obtained from the underlying scoring matrix, such as BLOSUM62 (Henikoff and Henikoff, 1992). The weight of each node is defined as the sum of the weights of the edges that have that node on one end. A node set is then defined by selecting one node from the head of each sequence. The node which has the highest weight is selected from this set. This node is aligned with the nodes adjacent to it. Thus, the letters aligned at the end of this step constitute one column of the initial multiple alignment. The node set is then updated as the nodes immediately after the nodes in current set in each sequence. This process is repeated and columns are found until all the nodes are consumed. The alignment is obtained by concatenating all the columns. Gaps are inserted between nodes if needed. Unlike progressive tools, this strategy is order-independent. An example for initial alignment construction is shown in Supplementary Figure A1.

The time complexity of both of these strategies are O(K2N2) since pairwise comparisons dominate the running time. However, latter approach is faster. This is because it runs dynamic programming only once for each sequence pair. On the other hand, the former one performs two set of pairwise alignments. One to find a guide tree and another to align sequences progressively according to the guide tree.

3.2 Improving the SP score via local optimizations
After constructing the initial alignment, the nodes are placed roughly in their correct positions (or in a close by position) in the alignment. Next, the alignment is iteratively improved. At each iteration, a short window is placed on the existing alignment. The subsequences contained in this window are then replaced by their optimal alignments (see Supplementary Figure A2). DP algorithm (Gotoh, 1982) is used to find the optimal alignment. This is feasible since the cost of aligning a window is much less than that of the entire sequences.

This algorithm requires solving two problems. First, where should the window be placed? Second, when should the iterations stop? One obvious solution is to slide a window from left to right (or right to left) shifting by some predefined amount {Delta} at each iteration. In this case, the iterations will end once the window reaches the right end (or the left end) of the alignment (see Supplementary, Figure A2). This solution, however, has two problems. First, it is not clear which direction the window should be slid. Second, a window is optimized even if it is already a good alignment, resulting in a waste of computation time. We propose another solution. We compute an upper bound of the improvement of the SP score for every possible window position as follows. Let Xi denote the upper bound to the SP score for the window starting at position i in the alignment. This number can be computed as the sum of the scores of all pairwise optimal alignments of the subsequences in this window. Let Yi denote the current SP score of that window. The upper bound is computed as Xi Yi. We propose to greedily select the window that has the largest upper bound at each iteration. In order to ensure that this solution does not optimize more windows than the first one (i.e., sliding windows), we do not select a window position that is within {Delta}/2 positions to a previously optimized window. The iterations stop when all the remaining windows have an upper bound of zero or they are within {Delta}/2 positions of a previously optimized window. In our experiments, the second solution was slightly better. The second solution converged to the final result much faster than the first one. (data not shown.)

The time complexity of QOMA is Formula. This is because there are Formula positions for window. A dynamic programming solution is computed for each such window. The cost of each dynamic programming solution is O(2KWK K2). This algorithm is much faster than the optimal dynamic programming when W is much smaller than N. The space complexity is O(WK + KN). This is because dynamic programming for a window requires O(WK) space, and only one window is maintained at a time. Also O(KN) space is needed to store the sequences and the alignment. Note that the edges of the complete K-partite graph are not stored at this step as we already know that the graph is complete.

3.3 QOMA and optimality
This section analyzes QOMA. Let P1, P2, ... , PK be the protein sequences to be aligned. Let A* be an optimal alignment of P1, P2, ... , PK. Let S* denote the SP score of A*. Let A be an alignment of P1, P2, ... , PK. Let SP(A) be the SP score of A. We define the error induced by A as error (A) = S* SP(A). This expression, however, is not computable for finding S* is NP-complete. Instead, we compute the error of A as Formula, where Formula is an upper bound to S*. Hence, {varepsilon}(A) ≥ error(A). Here, Formula is computed as the sum of the scores of all optimal pairwise alignments of P1, P2, ... , PK. Let QOMA (A, W) be the alignment obtained by QOMA starting from initial alignment A using a window of size W. We define the percentage of improvement provided by QOMA over A using a window size of W as

Formula 1(1)

Our first lemma shows that QOMA always results in an alignment at least as good as the initial alignment (The proof is shown in the Supplementary material).

LEMMA 1
improve(A,W) ≥ 0, {forall}A, W. {square}

Corollary 1 follows from Lemma 1.

COROLLARY 1
S* = SP(QOMA(A*, W )), {forall}W.{square}

Corollary 1 implies that QOMA alters an initial alignment A only if A is not optimal. Next lemma discusses the impact of the window size on QOMA (See Supplementary material for proof).

LEMMA 2
SP (QOMA(A,W)) ≤ SP(QOMA (A,2W)).

Lemma 2 indicates that as W increases, the SP score of the resulting alignment increases. In other words, as W increases, SP(QOMA(A,W)) converges to S*.


    4 IMPROVED ALGORITHM: SPARSE GRAPH
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 
QOMA converges to optimal alignment as the window size (W) grows. However, this happens at the expense of exponential time complexity. In Section 3.1 we computed the time complexity of QOMA using complete K-partite graph as Formula 1 for proteins P1, P2, ... , PK. In this section, we reduce the time complexity of QOMA by sacrificing accuracy through use of sparse K-partite graph. The goal is to run QOMA within a limited time budget while using a larger window size.

The factor 2K in the complexity is incurred because each cell of the DP matrix is computed by considering 2K – 1 conditions (i.e. 2K – 1 neighboring cells). This is because there are 2K – 1 possible non-empty subsets of K residues. Each subset, here corresponds to a set of residues that align together, and thus to a neighboring cell. We propose to reduce the complexity by reducing the number of residues that can be aligned together. We do this by keeping only the edges between node pairs with high-possibility of matching.

Choosing the promising edges is crucial for the quality of the resulting alignment. We propose to store an edge between two nodes only if their corresponding letters are aligned in the optimal pairwise alignment. Thus we produce at most K–1 edges per node since each node is aligned with at most one node from each of the K – 1 sequences. We also introduce a deviation parameter d, where d is a non-negative integer. Let p[i] and q[ j] be the nodes corresponding to protein sequences p and q at positions i and j in the initial graph, respectively. We draw an edge between p[i] and q[ j] only if one of the following two conditions holds in the optimal pairwise alignment of p and q: (1) exist{delta}, |{delta}| ≤ d, such that p[i] is aligned with q[ j + {delta}]; (2) exist{delta}, |{delta}| ≤ d, such that q[ j] is aligned with p[i + {delta}]. In other words, we draw an edge between two nodes if their positions differ by at most d in the optimal alignment of p and q. For example, in Figure 1, p[2] aligns with q[2]. Therefore, we draw an edge from p[2] to q[1] and q[3] as well as q[2] since q[1] and q[3] are within d-neighborhood of (d = 1) of q[2].


Figure 1
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Sparse K-partite graph for two sequences for d = 0 and d = 1.

 
DP is modified for sparse K-partite graph as follows: each cell, [x1, x2, ... , xK] in the K-dimensional DP matrix corresponds to nodes P1[x1], P2[x2], ... , PK[xK]. Here Pi[ j] stands for the node at position j in sequence i. The set contains one node from each sequence, and can be either a residue or a gap. Thus, each cell defines a subgraph induced by its node set. For example, during the alignment of the sequences that have the K-partite graph as shown in Figure 2(a), the cell [3, 4, 4] corresponds to nodes P1[3], P2[4] and P3[4]. Figure 2(b) shows the induced subgraph of cell [3, 4, 4].


Figure 2
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 (a) A sparse K-partite graph for three sequences from a window of size 4 (b) The induced subgraph for the cell [3,4,4] of the dynamic programming matrix for the K-partite graph in (a).

 
The induced subgraph for each cell yields a set of connected components. The sparse graph algorithm uses connected components to improve the running time of DP as follows. During the computation of the value of a DP matrix cell, we allow two nodes to align only if they belong to the same connected component of the induced subgraph of that cell. For example, for cell [3, 4, 4], P2[4] and P3[4] can be aligned together, but P1[3] can not be aligned with P2[4] or P3[4] (see Fig. 2b). A connected component with n nodes produce 2n – 1 non-empty subsets. Thus, for a given cell, if there are t connected components and the tth component has nt nodes, then the cost of that cell becomes Formula 1. This is a significant improvement as the cost of a single cell is 2n1 +n2 +...+nt–1 using the complete K-partite graph. For example, in Figure 2, the cost for cell [3, 4, 4] drops from 23 – 1 = 7 to (21–1) +(22 – 1) = 4.

The connected components of an induced subgraph can be found in O(K2) time (i.e. the size of the induced subgraph) by traversing the induced subgraph once. Thus, the total time complexity of the sparse K-partite graph approach is Formula 1. The space complexity of using the sparse K-partite graph is O(WK + KN + N(K–1)K(2d + 1)/2). The first term denotes the space for the dynamic programming alignment within a window. The second term denotes the number of letters. The last term denotes the number of edges. The space complexity for the last two terms can be reduced by storing only the subgraph inside the window.


    5 EXPERIMENTAL EVALUATION
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 
Experimental setup: We used BAliBASE benchmarks (Thompson et al., 1999) reference 1 from version 1 (www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/) and references 1, 2, ... , 8 from version 3 (www-bio3d-igbmc.u-strasbg.fr/BAliBASE/) for evaluation of our method. We use V1 and V3 to denote BAliBASE versions 1 and 3, respectively. We use R1 to R8 to denote reference 1–8. For example, we use V3-R4 to represent the reference 4 dataset from version 3. We split the V1-R1 dataset into three datasets (V1-R1-low, V1-R1-medium and V1-R1-high) according to the similarity of the sequences in the benchmarks as denoted in BAliBASE (low- and high-similarities). Similarly, V3-R1 is split into two datasets V3-R1-low and V3-R1-high containing low and high-similarity benchmarks. The number of sequences in the benchmarks in version 3 were usually too large for QOMA and DCA. Therefore, we created 1000 benchmarks from each reference by randomly selecting five sequences from the existing benchmarks. Thus, each of the benchmarks from version 3 contains five sequences.

We evaluated the SP score and the running time in our experiments. We do not report the BAliBASE scores since the purpose of QOMA is to maximize the SP score.

We implemented the complete and the sparse K-partite QOMA algorithms as discussed in the paper, using standard C. We used BLOSUM62 as a measure of similarity between amino acids. We used gap cost = –4 to penalize each indel. We used {Delta} = W/2 in our experiments since we achieved best quality per time with this value. We also downloaded ClustalW, Probcons, Muscle, T-coffee and DCA for comparison. We did not compare QOMA with our previous work HSA (Zhang and Kahveci, 2006) since HSA needs secondary structure information of proteins for alignment. To ensure a fair comparison, we ran ClustalW, Muscle, T-coffee, DCA and QOMA using the same parameters (gap cost = –4, similarity matrix = BLOSUM62). This was not possible for Probcons. We also ran all the competing methods using their default parameters.

We ran all our experiments on Intel Pentium 4, with 2.6 G Hz speed, and 512 MB memory. The operating system was Windows 2000.

Quality evaluation: We first evaluate the quality of QOMA. Table 1 shows the average SP score of QOMA using two strategies for constructing initial alignment and four values of W. Strategy 1 obtains the initial alignments from ClustalW. Strategy 2 obtains the initial alignments from the algorithm provided in Section 3.1. The table also shows the upper bound for the SP score and the SP score of ClustalW for comparison. The best results are shown in bold for each test. QOMA achieves higher SP score compared to ClustalW on average for all window sizes and for all datasets. The SP score of QOMA consistently increases as W increases. These results are justified by Lemmas 1 and 1. The SP score of Strategy 2 is usually higher than that of Strategy 1 for almost all cases of low- and medium-similarity. Both strategies are almost identical for highly similar sequences. There is a loose correlation between the initial SP score and the final SP score of QOMA. Higher initial SP scores usually imply higher SP scores of the end result. There are however exceptions especially for highly similar sequences. In the rest of the experiments, we use Strategy 2 to construct the initial alignments by default.


View this table:
[in this window]
[in a new window]

 
Tabel 1 The average SP scores of QOMA using complete K-partite graph with {Delta} = W/2 on BAliBASE benchmarks and upper bound score (Formula 1)

 
Table 2 shows us the SP scores of five existing tools, and QOMA on all the datasets when the competing tools are run using the same parameters as QOMA and using their default parameters. QOMA has higher SP score than all the tools compared for all the datasets. DCA always has second best score since it also targets on maximizing the SP score of alignments. The difference between the SP scores of QOMA and the other tools are more significant for low- and medium- similarity sequences. This is an important achievement because the alignment of such sequences are usually harder than highly similar sequences.


View this table:
[in this window]
[in a new window]

 
Table 2 The average SP scores of QOMA (using complete K-partite graph with W = 16) and five other tools on BAliBASE benchmarks

 
Table 3 shows the average percentage of improvement of QOMA over alignments of ClustalW using the improvement formula as given in Section 3.3, the dataset is V1-R1. As the window size increases, the increase in improvement percentage reduces. This indicates that QOMA converges to the optimal score at reasonably small window sizes. In other words, using window size larger than 16 will not improve the SP score significantly.


View this table:
[in this window]
[in a new window]

 
Table 3 The improvement (see Formula 1 in Section 3.3) of QOMA (using complete K-partite graph) over ClustalW on the V1-R1 dataset

 
Table 4 shows the average and the SD of the error incurred for each window due to using the sparse K-partite graph for QOMA. The average error decreases as d increases. Most of the reduction in error is obtained when d increases from 0 to 1. For example, for W = 16, the error reduction is 1.408 and 0.318 as d increases from 0 to 1 and from 1 to 2, respectively. This implies that the average improvement in the SP score degrades quickly for d > 1. The 95% error interval also significantly shrinks as d increases to 1. These results suggest that the sparse K-partite graph algorithm is effective even for small values of d.


View this table:
[in this window]
[in a new window]

 
Table 4 The average (µ), SD ({sigma}) of the error, S*SP, for a window using sparse version of QOMA on the V1-R1 dataset

 
Figure 3 shows the average SP scores of the resulting alignments using sparse K-partite graph for different values of d and using complete K-partite graph on the V1-R1 dataset. The complete K-partite graph algorithm produces the best SP scores. However, the SP scores of results from the sparse K-partite graph algorithm are very close to that of the complete K-partite graph algorithm. The quality of the sparse K-partite graph algorithm improves significantly when d increases from 0 to 1. The improvement is less when d increases from 1 to 2. This implies when d becomes larger than 1, it has small impact on the quality of alignment.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 The SP scores of QOMA alignments using the complete and the sparse K-partite graphs for different values of d and W on the V1-R1 dataset.

 
Performance evaluation: Our second experiment set evaluates the running time of QOMA. Table 5 lists the running time of QOMA for the complete and the sparse K-partite graph algorithms for varying values of W. Experimental results show that QOMA runs faster for small W. The sparse K-partite graph algorithm is faster than the complete K-partite graph algorithm for all values of d for large W. The running time of QOMA increases as d increases. The results in this table agree with the time complexity we computed in Sections 3.3 and 4. Referring to Tables 1, 3 and 4, we conclude that when the window size is small, QOMA runs fast and has high SP score. As the window size increases, its performance drops but the SP score improves further. Thus, it is better to increase the window size and use sparse K-partite graph strategy to obtain high scoring results quickly. In the light of Figure 3, and Tables 4 and 5 we recommend d = 1 using sparse K-partite graph strategy. This is because most of the improvement in the SP score is obtained when d = 1. For d > 1, the score improves slightly while the running time increases at the same or higher rate.


View this table:
[in this window]
[in a new window]

 
Table 5 The running time of QOMA (in seconds) using complete K-partite graph and spare graph for different value of d and W on the V1-R1 dataset

 

    6 CONCLUSION AND FUTURE WORK
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 
We considered the problem of multiple alignment with high-SP score for protein sequences. We developed a graph-based algorithm called QOMA. QOMA first constructs an initial alignment of multiple sequences. In order to create this initial alignment, we developed a method based on the optimal alignment between all pairs of sequences. QOMA represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment by iteratively placing a window on it and optimizing the alignment within this window. QOMA provides two parameters to permit flexibility in time and accuracy: (1) The window size. (2) The sparsity of the K-partite graph. Unlike traditional tools, QOMA is independent of the order of sequences.

In our experiments QOMA achieved significantly higher SP scores than existing tools, including ClustalW, Probcons, Muscle, T-Coffee and DCA. The quality improvement increased with the window size. QOMA had slightly better SP score using the complete K-partite graph algorithm than the sparse K-partite graph algorithm. However, the sparse K-partite graph algorithm was much faster when large window size is used. We found a good balance point between quality of result and running time at d = 1 using sparse K-partite graph.

Currently, we are working on extending the QOMA algorithm to align large number of sequences.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on August 8, 2006; revised on November 15, 2006; accepted on November 16, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 MOTIVATION
 2 RELATED WORK
 3 ALGORITHM: COMPLETE K-PARTITE...
 4 IMPROVED ALGORITHM: SPARSE...
 5 EXPERIMENTAL EVALUATION
 6 CONCLUSION AND FUTURE...
 REFERENCES
 

    Alkan, C., et al. (2005) Manipulating multiple sequence alignments via MaM and WebMaM. Nucleic Acids Res, . 33, W295–W298[Abstract/Free Full Text].

    Bray, N. and Pachter, L. (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome Res, . 14, 693–699[Abstract/Free Full Text].

    Do, C. B., et al. (2005) PROBCONS: Probabilistic consistency-based multiple sequence alignment. Genome Res, . 15, 333–340.

    Eddy, S.R. (1995) Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 3, 114–120.

    Edgar, R. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, . 32, 1792–1797[Abstract/Free Full Text].

    Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol, . 162, 705–708[CrossRef][Web of Science][Medline].

    Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol, . 264, 823–838[CrossRef][Web of Science][Medline].

    Grundy, W.N. (1998) Family-based homology detection via pairwise sequence comparison. Proceedings of the Second Annual International Conference on Computational Molecular Biology (RECOMB'98), New York, NY ACM Press, pp. 94–100.

    Gupta, S., et al. (1996) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Comput. Biol, ..

    Gupta, S.K., et al. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Comput. Biol, . 2, 459–462[Medline].

    Hein, J. (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol. Biol. Evol, . 6, 649–668[Abstract].

    Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919[Abstract/Free Full Text].

    Huang, X. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math, . 12, 337–357[CrossRef].

    Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol, . 292, 195–202[CrossRef][Web of Science][Medline].

    Katoh, K., et al. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, . 30, 3059–3066[Abstract/Free Full Text].

    Lee, C., et al. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464[Abstract/Free Full Text].

    Lipman, D., et al. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415[Abstract/Free Full Text].

    Morgenstern, B., et al. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14, 290–294[Abstract/Free Full Text].

    Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol, . 48, 443–453[CrossRef][Web of Science][Medline].

    Notredame, C., et al. (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol, . 302, 205–217[CrossRef][Web of Science][Medline].

    Phillips, A., et al. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogene. Evol, . 16, 317–330.

    Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195–197[CrossRef][Web of Science][Medline].

    Stoye, J. (1998) Multiple sequence alignment with the divide-and-conquer method. Gene, 211, GC45–GC56[CrossRef][Web of Science][Medline].

    Stoye, J., et al. (1997) DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci, . 13, 625–626[Abstract/Free Full Text].

    Sze, S.H., et al. (2005) In Miyano, S., Mesirov, J.P., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M.S. (Eds.). A polynomial time solvable formulation multiple sequence alignment. Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB), Lecture Notes in Computer Science Springer, pp. 204–216.

    Thompson, J., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, . 22, 4673–4680[Abstract/Free Full Text].

    Thompson, J., et al. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res, . 27, 2682–2690[Abstract/Free Full Text].

    Van Walle, I., et al. (2004) Align-m—a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, 20, 1428–1435[Abstract/Free Full Text].

    Wang, L. and Jiang, T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol, . 1, 337–348[Medline].

    Zhang, X. and Kahveci, T. (2006) In Altman, R.B., Murray, T., Klein, T.E., Dunker, A.K., Hunter, L. (Eds.). A new approach for alignment of multiple proteins. Proceedings of the Pacific Symposium on Biocomputing (PSB) World Scientific, pp. 339–350.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
K. Katoh and H. Toh
Recent developments in the MAFFT multiple sequence alignment program
Brief Bioinform, July 1, 2008; 9(4): 286 - 298.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/2/162    most recent
btl590v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhang, X.
Right arrow Articles by Kahveci, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhang, X.
Right arrow Articles by Kahveci, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?