Skip Navigation


Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(1):20-30; doi:10.1093/bioinformatics/bth468
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/1/20    most recent
bth468v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lu, C. L.
Right arrow Articles by Huang, Y. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lu, C. L.
Right arrow Articles by Huang, Y. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.

A memory-efficient algorithm for multiple sequence alignment with constraints

Chin Lung Lu * and Yen Pin Huang

Department of Biological Science and Technology, National Chiao Tung University Hsinchu 300, Taiwan, Republic of China

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PROBLEM FORMULATION
 3 ALGORITHM
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 

Motivation: Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in O({gamma}n 2) memory, where {gamma} is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences~only.

Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only O({alpha}n) space, where {alpha} is the sum of the lengths of constraints and usually {alpha} << n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints.

Availability: http://genome.life.nctu.edu.tw/MUSICME

Contact: cllu{at}mail.nctu.edu.tw


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PROBLEM FORMULATION
 3 ALGORITHM
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
Multiple sequence alignment (MSA) is one of the fundamental problems in computational molecular biology that have been studied extensively, because it is a useful tool in the phylogenetic analyses among various organisms, the identification of conserved motifs and domains in a group of related proteins, the secondary and tertiary structure prediction of a protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002). Moreover, MSA is one of the most challenging problems in computational molecular biology because it has been shown to be NP-complete under the consideration of sum-of-pairs scoring criteria (Kececioglu, 1993; Wang and Jiang, 1994; Bonizzoni and Vedova, 2001), which means that it seems to be hard to design an efficient algorithm for finding the mathematically optimal alignment. Hence, some approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna et al., 1997; Li et al., 2000) and heuristic methods (Feng and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and Sharpe, 1988; Thompson et al., 1994) were introduced to overcome this problem.

Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists regarding the structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment (Tang et al., 2003). (Tang et al. 2003) first designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment (CMSA) tool based on the progressive approach, where each constraint considered by Tang et al. is a single residue/nucleotide only. Their proposed algorithm for the two sequences runs in O({gamma}n 4) time and consumes O(n 4) space, where {gamma} is the number of constrained residues and n is the maximum lengths of the sequences. Later, this result was improved independently by two groups of researchers to O({gamma}n 2) time and O({gamma}n 2) space using the same approach of dynamic programming (Yu, 2003; Chin et al., 2003). In fact, each constraint requested to be aligned together can represent a conserved site of a protein/DNA/RNA family and each conserved site may consist of a short segment of residues/nucleotides, instead of a single residue/nucleotide. In other words, the constraint specified by the biologists can be a fragment of several residues/nucleotides. For some applications, biologists may further expect that some mismatches are allowed among the residues/nucleotides of the columns requested to be aligned. Hence, (Tsai et al. 2004) studied such a kind of the constrained sequence alignment and designed an algorithm of O({gamma}n 2) time and O({gamma}n 2) space for two sequences. The improvements and extension above greatly increase the performances and practical usage of the CMSA tools developed using the progressive approach. However, the requirement of O({gamma}n 2) memory still limits the existing CMSA tools to align a set of short sequences, at most several hundreds of residues/nucleotides. To align large genomic sequences of at least several thousands of residues/nucleotides, there is a need to design a memory-efficient algorithm for the constrained pairwise sequence alignment (CPSA) problem, which is the key limiting factor relating to the applicable extent of the progressive CMSA tools. Hence, in this paper, we adopt the so-called divide-and-conquer approach to design a memory-efficient algorithm for solving the CPSA problem, which runs in O({gamma}n 2) time, but consumes only O({alpha}n) space, where {alpha} is the sum of the lengths of constraints and usually {alpha} << n in practical applications. Based on this algorithm, we have finally developed a memory-efficient CMSA tool using the progressive approach. Note that applying the divide-and-conquer approach to memory-efficiently align two or more sequences without any constraints has been studied extensively (Myers and Miller, 1988; Chao et al., 1994; Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998). In contrast to the progressive approach used here, the divide-and-conquer algorithms proposed by Stoye et al. (Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998) considered the input sequences simultaneously and heuristically compute the good, but not necessarily optimal, dividing positions so that the resulting total MSA is close to an optimal MSA of the original sequences. In fact, many other CMSAs have been proposed from various perspectives, even using different approaches (Schuler et al., 1991; Depiereux and Feytmans, 1992; Taylor, 1994; Myers et al., 1996; Notredame et al., 2000; Thompson et al., 2000; Sammeth et al., 2003). Of these various CMSAs, it is worth mentioning that (Myers et al., 1996) obtained their CMSA by performing progressive multiple alignment under position-based constraints that are given by users; (Sammeth et al. 2003) got their CMSA by performing simultaneous multiple alignment under segment-based constraints (as same as we studied here) that are pre-computed via a local segmented-based algorithm (Morgenstern, 1999). We refer the reader to their papers for details.


    2 PROBLEM FORMULATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PROBLEM FORMULATION
 3 ALGORITHM
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
Let O = {S 1, S 2, ..., S {chi}} be the set of {chi} sequences over the alphabet {Sigma}. Then an MSA of O is a rectangular matrix consisting of {chi} rows of characters of {Sigma} {cup} {–} such that no column consists entirely of dashes and removing dashes from row i leaves S i for any 1 ≤ i ≤ {chi}. The sum-of-pairs score (SP score) of an MSA is defined to be the sum of the scores of all columns, where the score of each column is the sum of the scores of all distinct pairs of characters in the column. In practice, the score of the pair of two dashes is usually set to zero. Then the problem of finding an MSA of O with the optimal SP score is the so-called sum-of-pairs MSA problem (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002).

Let {delta}(T 1, T 2) denote the Hamming distance between two subsequences T 1 and T 2 of equal length, which is equal to the number of mismatched pairs in the alignment of T 1 and T 2 without any gap. Given an alignment L of O, a band is defined as a block of consecutive columns in L (i.e. a submatrix of L). For any band L' of L, let subset(S i , L') denote the subsequence of S i whose residues/nucleotides are all in the band L', where 1 ≤ i ≤ {chi}. A subsequence T = t 1 t 2 ... t {lambda} is said to appear in L if L contains a band L' of {lambda} columns, say {pi}1, {pi}2, ..., {pi}{lambda}, such that the characters of column {pi} j , 1 ≤ j ≤ {lambda}, are all equal to t j , or equivalently, subseq(S i , L') = T for each 1 ≤ i ≤ {chi}. If {delta}[subseq(S i , L'), T] ≤ {lambda} x {epsilon} for a given error ratio 0 ≤ {epsilon} < 1 [i.e. some mismatches are allowed between subseq(S i , L') and T], then T is said to approximately appear in L. From the biological viewpoint, T can be considered as the consensus among the subsequences in L' and hence T is also called as an induced consensus by the band L'. For any two subsequences T 1 and T 2, T 1 {precedes} T 2 is used to denote that T 1 (approximately) appears strictly before T 2 in L (i.e. their corresponding bands do not overlap). Let {Omega} = (C 1, C 2, ..., C {gamma}) be an ordered set of {gamma} constraints (i.e. subsequences), each with length of {lambda} i , where 1 ≤ i ≤ {gamma}. Then the CMSA of O with respect to {Omega} is defined as an alignment L of O in which all the constraints of {Omega} approximately appear in the order C 1 {precedes} C 2 {precedes} ··· {precedes} C {gamma} such that {delta}(subseq(Si , L' j ), C j ) ≤ {lambda} j x {epsilon} for all 1 ≤ i ≤ {chi} and 1 ≤ j ≤ {gamma}, where L' j is the band of L whose induced consensus is C j . Given a set O of {chi} sequences along with an ordered set {Omega} of {gamma} constraints and an error ratio {epsilon}, the so-called CMSA problem is to find a CMSA w.r.t. {Omega} with the optimal SP score. When the number of sequences in O is restricted to two (i.e. {chi} = 2), the CMSA problem is called as the CPSA problem.


    3 ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PROBLEM FORMULATION
 3 ALGORITHM
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
In this section, we shall first design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a 1 a 2 ... a m and B = b 1 b 2 ... b n , a given ordered set {Omega} = (C 1, C 2, ..., C {gamma}) of {gamma} constraints, each with length of {lambda} i , 1 ≤ i ≤ {gamma}, and a given error threshold {epsilon}. After that, we shall use it as the kernel to heuristically solve the CMSA problem.

For any sequence T, let pref(T, l) [respectively, suff(T, l)] phase don't change denote the prefix (respectively, suffix) of T with length l. For any two characters a, b {Sigma}, let {sigma}(a, b) denote the score of aligning a with b. The gap penalty adopted here is the so-called affine gap penalty that penalizes a gap of length l with wo + l x we , where wo > 0 is the gap-open penalty and we > 0 is the gap-extension penalty. For convenience, let A i = pref(A, i) = a 1 a 2 ... a i , B j = pref(B, j) = b 1 b 2 ... b j and {Omega} k = (C 1, C 2, ..., C k ), where 1 ≤ i ≤ m, 1 ≤ j ≤ n and 1 ≤ k ≤ {gamma}. Let M k (i, j) denote the score of an optimal constrained alignment of A i and B j w.r.t. {Omega} k . Clearly, M{gamma}(m, n) is the score of an optimal constrained alignment of A and B w.r.t. {Omega}. An alignment L is called as a semi-constrained alignment of A i and B j w.r.t. {Omega} k if it is a constrained alignment of A i and B j w.r.t. {Omega} k–1 and also ends (or begins) with a band whose induced consensus is equal to a prefix of C k (or a suffix of C 1). N k (i, j, h) is defined to be the score of an optimal semi-constrained alignment of A i and B j w.r.t. {Omega} k that ends with an induced consensus equal to pref(C k , h). Let [respectively, ] be the maximum scores of all constrained alignments of A i and B j w.r.t. {Omega} k that end with a deletion pair (a i , –) [respectively, an insertion pair (–, b j )]. By definition, it is not hard to derive the recurrence of M k (i, j), 1 ≤ i ≤ m and 1 ≤ j ≤ n, as follows. If k = 0, then . If 1 ≤ k ≤ {gamma}, then . Clearly, , if {delta}(suff(A i , {lambda} k ), C k ) ≤ {lambda} k x {epsilon} and {delta}(suff(B j , {lambda} k ), C k ) ≤ {lambda} k x {epsilon}; otherwise, N k (i, j, {lambda} k ) = –{infty}. To simply describe the computation of and , we introduce another notation , which is defined to be the maximum score of all constrained alignments of A i and B j w.r.t. {Omega} k that end with a substitution pair (a i , b j ). Let denote the alignment of A i and B j with score that ends with a deletion pair (a i , –). Let L' be the portion of before the last aligned pair (a i , –). Then there are three possibilities when we consider the last aligned pair of L'.

Case 1: The last aligned pair of L' is a substitution pair. Then the score of L' is and (a i , –) is charged by a gap-open penalty and a gap-extension penalty in . Hence, .

Case 2: The last aligned pair of L' is a deletion pair. Then the score of L' is and (a i , –) is charged by only one gap-extension penalty in . Hence, .

Case 3: The last aligned pair of L' is an insertion pair. Then the score of L' is and (a i , –) is charged by a gap-open penalty and a gap-extension penalty in . Hence, .

In summary, . However, by including an extra into the right-hand side of the above recurrence, we can reformulate the above recurrence as . Similar to the discussion above, the recurrence of can be derived as .

According to the recurrences above, we designed an algorithm to compute M{gamma}(m, n) and its corresponding constrained alignment using the technique of dynamic programming as follows. For convenience, we depicted the recurrences of matrices M k , , and N k for all 0 ≤ k ≤ {gamma} by a three-dimensional (3D) grid graph G, which consists of (m + 1) x (n + 1) x ({gamma} + 1) entries and each entry (i, j, k) consists of four nodes M k , , and N k corresponding to M k , , and N k (i, j, {lambda} k ), respectively. Figure 1 shows the relationship of four adjacent entries (i, j, k), (i – 1, j, k), (i, j 1, k) and (i – 1, j – 1, k) of G for each fixed k.



View larger version (29K):
[in this window]
[in a new window]
 
Fig. 1 The schematic diagram of four adjacent entries of G, where entry (i, j, k) consists of four nodes M k , , and N k corresponding to M k (i, j), , and N k (i, j, {lambda} k ), respectively.

 
Note that there is a directed edge, which is not shown in Figure 1, with weight from the M k–1 node of the entry (i{lambda} k , j{lambda} k , k – 1) to the N k node of the entry (i, j, k). Then each path from M0(0, 0) node of entry (0, 0, 0) to M{gamma}(m, n) node of entry (m, n, {gamma}) corresponds to a constrained alignment of A and B w.r.t. {Omega}. As a result, an optimal constrained alignment of A and B can be obtained by backtracking a shortest path from M{gamma}(m, n) to M0(0, 0) in G. It is not hard to see that the algorithm costs both computer time and memory in the order of O({gamma}mn). We call the above algorithm based on the dynamic programming approach as CPSA-DP algorithm.

Hirschberg (1975) had developed a linear-space algorithm for solving the longest common subsequence problem based on the divide-and-conquer technique. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences (Myers and Miller, 1988; Chao et al., 1994). In this paper, we generalize the Hirschberg's algorithm so that it is capable of dealing with the CPSA. As compared with others, our generalization is more complicated because the grid graph G dealt here is 3D, instead of 2D, and the input sequences are accompanied with several constraints that need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position (i mid, j mid, k mid) on an optimal path from M0(0, 0) to M{gamma}(m, n) in G so that we were able to divide the constrained alignment problem into two smaller constrained alignment problems; then these smaller constrained alignment problems are continued to be divided in the same manner, and finally the optimal constrained alignment is obtained completely by merging the series of the calculated mid-points (Fig. 2).



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 2 Schematic diagram of divide-and-conquer approach: two light gray areas are the reduced subproblems after middle position (i mid, j mid, k mid) is determined, each of which will be further divided into two subproblems of dark gray areas.

 
Before describing our algorithm, some notation must be introduced as follows. Let and denote the suffixes a i+1 a i+2 ... a m and b j + 1 b j + 2... b n of A and B, respectively, for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Let denote the ordered subset (C k + 1, C k + 2, ..., C {gamma}) for 1 ≤ k ≤ {gamma}. Define to be the score of an optimal constrained alignment of and w.r.t. , and define ( and , respectively) to be the maximum score of all constrained alignments of and w.r.t. that begin with a substitution [deletion and insertion, respectively] pair (a i + 1, b j + 1) [(a i + 1, –) and (–, b j + 1), respectively]. Let {Omega} k (h) = [C 1, C 2, ..., C k – 1, pref(C k , h)] and , where 1 ≤ h ≤ {lambda} k . Let denote the score of an optimal semi-constrained alignment of and w.r.t. that begins with a band whose induced consensus is equal to suff(C k , {lambda} k h). Note that the recurrences for computing matrices , , , and can be developed similarly as those for computing M k , , , and N k , respectively. Clearly,. If {delta}[suff(A i , {lambda} k ), C k ] ≤ {lambda} k x {epsilon} and {delta}[suff(B j , {lambda} k ), C k ] ≤ {lambda} k x {epsilon}, then we can reformulate the recurrence of N k as follows: N k (i, j, 1) = M k – 1(i – 1, j – 1) + {sigma}(a i , b j ) and N k (i, j, h) = N k (i 1, j – 1, h – 1) + {sigma}(a i , b j ) for each 1 < h ≤ {lambda} k .

Next, we describe our divide-and-conquer algorithm, termed as CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t. {Omega} as follows. The key point is to determine the middle position (i mid, j mid, k mid) of the optimal path in G to divide the problem into two subproblems, each of which is recursively divided into two smaller subproblems using the same way. Given an alignment L, we use score(L) to denote the score of L. Let L{gamma}(A, B) be an optimal constrained alignments of A and B w.r.t. {Omega} and clearly score[L{gamma}(A, B)] = M{gamma}(m, n). Let . Then, we partition L{gamma}(A, B) into two parts by cutting it at the position immediately after and we let denote the part containing and denote the remaining part, where denotes the last character in from B, and k mid denotes the largest index so that (approximately) appears in Then there are two possibilities when we consider the last aligned pair of .

Case 1: The last aligned pair of is a substitution pair [i.e. , ]. In this case, we have . If (, ) is not a constrained column in L{gamma}(A, B), then is an optimal constrained alignment of A i mid and B j mid w.r.t. ending with a substitution pair (, ), and is an optimal constrained alignment of and w.r.t. . Hence, . If (, ) is a constrained column in , then is an optimal semi-constrained alignment of and w.r.t. ending with a band L' whose induced consensus is equal to . If , then is an optimal semi-constrained alignment of and w.r.t. beginning with a band whose induced consensus is equal to . Moreover, the induced consensus of the merge of L' and have to be equal to . In this case, we have . If , then is an optimal constrained alignment of and w.r.t. , and hence .

Case 2: The last aligned pair of is a deletion pair [i.e. (,–)]. If the first aligned pair in is not a deletion pair, then . If the first aligned pair in is a deletion pair, then . We need to compensate it by adding wo because the open penalty of the gap containing and in L{gamma}(A, B) is charged twice by and .

In summary, the recurrence of M{gamma}(m, n) is derived as follows:

When is added to the right-hand side, the above recurrence is not changed, but can be reformulated as follows:


In other words, j mid, k mid and h mid are the indices j, k and h, where 1 ≤ j ≤ n, 0 ≤ k ≤ {gamma} and 1 ≤ h < {lambda} k , such that the following maximal value is the maximum.


Now, we show how to use O({alpha}n), instead of O({gamma}mn), memory to determine j mid, k mid and h mid, where {alpha} = {sum}1 ≤ k ≤ {gamma} {lambda} k and {alpha} ≤ min{m, n} intrinsically. In fact, a single matrix E of size ({gamma} + 1) x (n + 1) with each entry E(k, j) of {lambda} k + 4 space is enough to compute M k (i mid, j), , and N k (i mid, j, h), for 1 ≤ j ≤ n, 0 ≤ k ≤ {gamma} and 1 ≤ h ≤ {lambda} k . When reaching the entry (i, j, k) of 3D grid graph G, we use entry E(k, j) of E to hold the most recently computed values of M k (i, j), , and N k (i, j, h), which clearly needs a total of {lambda} k + 4 space. Note that the old values in entry E(k, j) will be moved into an extra entry, termed as V k whose space is equal to E(k, j), before they are overwritten by their newly computed values. Before moving the old values in E(k, j) into V k ; however, we need to first move M k (i – 1, j 1) in V k into a space, named as v k, k + 1, where 1 ≤ i ≤ m. The mechanism above will enable us to compute N k (i, j,1), which needs to refer to M k – 1(i – 1, j – 1) that is kept in v k – 1, k ; compute N k (i, j, h) for each 2 ≤ h ≤ {lambda} k , which needs to refer to N k (i – 1, j – 1, h – 1) that is kept in V k ; compute , which needs to refer M k (i – 1, j – 1) that is kept in V k ; and finally we were able to compute M k (i, j). Figure 3 shows the grid locations of E(k – 1), E(k) and the values in V k – 1 and V k when we reach the entry (i, j, k) of G for the computation, where E(k) denotes the k-th row of E. Hence, the total needed space for computing and storing all M k (i mid, j), , and N k (i mid, j, h) is the sum of the space of matrix E, the space of all V k and the space of all v k, k + 1, where 1 ≤ j ≤ n, 0 ≤ k ≤ {gamma} and 1 ≤ h ≤ {lambda} k , which is equal to O({alpha}n). Similarly, the required matrix, denoted by E, for computing all , , and still needs O({alpha}n) space. Hence, the determination of j mid, k mid and h mid can be performed in O({alpha}n) space. The details of CPSA-DC algorithm are described as follows. Note that the program code of BestScoreRev is similar to that of BestScore and hence is omitted here. In the codes, the variable E(M k (i mid, j)) is used to denote the value of M k (i mid, j) in E(k, j) and others are analogous. The global variables , H B (j, k, h) = {delta}(suff(B j , h), pref(C k , h)), and are computed in Algorithm BestScore so that they can be used directly in Algorithm CPSA-DC.



View larger version (10K):
[in this window]
[in a new window]
 
Fig. 3 The grid locations of E(k – 1), E(k) and the values in V k–1 and V k when the entry (i, j, k) of G, marked with ‘?’, is reached for the computation.

 
Algorithm CPSA-DC(i start, i end, j start, j end, k start, k end)Input: Sequences and with constraints )1: if (i start > i end) or (j start > j end) then

Align the nonempty sequence with spaces;

else

BestScore(i start,i mid,j start,j end,k start,k end);

BestScoreRev(i mid + 1,i end,j start,j end,k start,k end);

end if

2: max = –{infty};

for j = j start – 1 to j end do

for k = k start – 1 to k end do

if then

j mid = j; k mid = k; type = case 1;

end if

if then

;

j mid = j; k mid = k; type = case 2;

end if

if then

;

j mid = j; k mid = k; type = case 3;

end if

if k ≥ 1 then

for h = 1 to {lambda} k – 1 do

if and

then

if

max then

;

j mid = j; k mid = k; h mid = h; type = case 4;

end if

end if

end for

if and then

if

max then

;

j mid = j; k mid = k; h mid = h; type = case 5;

end if

end if

end if

end for

end for

3: if type = case 1 then

CPSA-DC(i start,i mid – 1,j start,j mid,k start,k mid);

Align with a space;

CPSA-DC(i mid + 1,i end,j mid + 1,j end,k mid+1,k end);

end if

if type = case 2 then

CPSA-DC(i start,i mid – 1,j start,j mid,k start,k mid);

Align with two spaces;

CPSA-DC(i mid + 2,i end,j mid+1,j end,k mid+1,k end);

end if

if type = case 3 then

CPSA-DC(i start,i mid – 1,j start,j mid – 1,k start,k mid);

Align with ;

CPSA-DC(i mid+1,i end,j mid+1,j end,k mid+1,k end);

end if

if type = case 4 then

CPSA-DC(i start,i midh mid,j start,j mid h mid,k start, k mid – 1);

Align with ;

CPSA-DC(i mid+{lambda}kh mid+1,i end, j mid+{lambda}kh mid + 1,j end,k mid+1,k end);

end if

if type = case 5 then

CPSA-DC(i start,i mid{lambda} k ,j start, j mid{lambda} k ,k start, k mid – 1);

Align with ;

CPSA-DC(i mid+1,i end,j mid+1,j end,k mid+1,k end);

end if

Algorithm BestScore(i start, i end, j start, j end, k start, k end)

Input: Sequences and

with constraints ()

1: /* Reindex */

m = i starti end + 1; n = j startj end + 1;

{gamma} = k startk end + 1;

2: /* Initialization */

for j = 0 to n do

for k = 0 to {gamma} do

;

if (j = 0) or (k > 0) then ;

else ;

if (j = 0) and (k = 0) then E(M k (i mid,j)) = 0;

else E(M k (i mid,j)) = –{infty};

if k ≥ 1 then

for h = 1 to {lambda} k do E(N k (i mid,j,h)) = –{infty};

end if

end for

end for 3: /* Computation */

for i = 1 to m do

for k = 0 to {gamma} do /* For the case of j = 0 */

Vk (M k (i mid, 0)) = E(M k (i mid,0));

if k ≥ 1 then

for h = 1 to {lambda} k do Vk (N k (i mid,0,h))

= E(N k (i mid,0,h)));

end if

;

;

end for

for j = 1 to n do /* For the case of j > 0 */

for k = 0 to {gamma} do

temp k (M k (i mid,j)) = E(M k (i mid,j));

if k ≥ 1 then

for h = 1 to {lambda} k do temp k (N k (i mid,j,h))

= E(N k (i mid,j,h));

end if

;

;

;

if k ≥ 1 then

for h = 1 to {lambda} k do

if h = 1 then

;

else

;

end if

end for

end if

;

v k,k+1 = V k (M k (i mid,j));

V k(M k (i mid,j)) = temp k (M k (i mid,j));

if k ≥ 1 then

for h = 1 to {lambda} k do

V k (N(i mid,j,h)) = temp k (N k (i mid,j,h));

end for

end if

if i = m and k ≥ 1 then

for h = 1 to {lambda} k do

if h = 1 then

if j = 1 and then

H A (k,h) = 1; else H A (k,h) = 0;

if then

H B (j,k,h) = 1; else H B (j,k,h) = 0;

else

if j = 1 and then

H A (k,h) = H A (k,h – 1) + 1;

if then H B (j,k,h)

=H B (j,k,h – 1)+ 1;

end if

end for

end if

end for

end for

end for

Now, we analyze the time-complexity of our CPSA-DC algorithm for solving the CPSA. As shown in Figure 2, after determining the middle position (i mid, j mid, k mid) of the optimal path in G, we can divide the original problem into two subproblems, each of which further can be recursively divided into two smaller subproblems using the same way. Note that regardless of where the optimal path passes through (i mid, j mid, k mid), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of entries in G. It is not hard to see that the time-complexity of determining the middle position of each subproblem at each recursive stage is proportional to the size of the subproblem. Let {Psi} denote the size of the original problem (i.e. {Psi} = {gamma}m n). Then the total time-complexity of our CPSA-DC algorithm is equal to , which is twice as high as the CPSA-DP algorithm. Using the CPSA-DC algorithm as a kernel, we were able to design a memory-efficient algorithm, termed CMSA-DC, for progressively aligning multiple input sequences into a CMSA according to the branching order of a guide tree. The above progressive method we adopted was proposed by (Tang et al. 2003). Owing to space limitation, we refer the reader to their paper for the details of its implementation.


    4 EXPERIMENTAL RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PROBLEM FORMULATION
 3 ALGORITHM
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
We use Java language to implement the CMSA-DC algorithm as a web server, called as MuSiC-ME (Memory-Efficient tool for Multiple Sequence Alignment with Constraints). The input of the MuSiC-ME system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each with a fragment of residue/nucleotide that (approximately) appears in all input sequences. The output of MuSiC-ME is a CMSA in which the fragments of the input sequences whose residues/nucleotides exhibit a given degree of similarity to a constraint are aligned together. For its biological applications, we refer the reader to other related papers (Tang et al., 2003; Tsai et al., 2004).

In the following, we evaluate our memory-efficient MuSiC-ME system and compare its running time and memory to the original MuSiC system (Tsai et al., 2004), whose kernel CPSA algorithm was implemented by the dynamic programming approach. We chose five families of protein/RNA sequences as our testing datasets, each of which has been shown to contain an ordered series of conserved motifs related to the structures/functionalities/consensuses of the family (McClure et al., 1994; Chin et al., 2003; Tang et al., 2003; Tsai et al., 2004): (1) the aspartic acid protease family (Protease), (2) the hemoglobins family (Globin), (3) the ribonuclease family (RNase), (4) the kinase family (Kinase) and (5) the 3'- untranslated region of the coronaviruses (CoV-3'-UTR). From each family, we have selected a representative set of sequences and adopted the ordered series of conserved motifs as the constraints. Table 1 lists the information of the tested families and their constraints. All tests were run with default parameters on IBM PC with 1.26 GHz processor and 512 MB RAM under Linux system. Table 2 lists the CPU time and memory usage of our experiments using MuSiC and MuSiC-ME. It shows that the memory usage of MuSiC-ME is much smaller than that of MuSiC for large-scale sequences, and the CPU time required by MuSiC-ME is smaller than that required by MuSiC for short sequences, since we have simplified the recurrences of the dynamic programming here.


View this table:
[in this window]
[in a new window]
 
Table 2 The information of the tested families and their constraints

 

View this table:
[in this window]
[in a new window]
 
Table 2 The comparison of CPU time and memory usage between MuSiC and MuSiC-ME

 
It is worth mentioning that in MuSiC-ME system, the letters representing the constraints are not just the individual residues/nucleotides, but also the IUPAC (International Union of Pure and Applied Chemistry) codes. For example, nucleotides