Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(1):20-30; doi:10.1093/bioinformatics/bth468
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.
A memory-efficient algorithm for multiple sequence alignment with constraints
Department of Biological Science and Technology, National Chiao Tung University Hsinchu 300, Taiwan, Republic of China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in
(
n 2) memory, where
is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences~only.
Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only
(
n) space, where
is the sum of the lengths of constraints and usually
<< n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints.
Availability: http://genome.life.nctu.edu.tw/MUSICME
Contact: cllu{at}mail.nctu.edu.tw
| 1 INTRODUCTION |
|---|
|
|
|---|
Multiple sequence alignment (MSA) is one of the fundamental problems in computational molecular biology that have been studied extensively, because it is a useful tool in the phylogenetic analyses among various organisms, the identification of conserved motifs and domains in a group of related proteins, the secondary and tertiary structure prediction of a protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002). Moreover, MSA is one of the most challenging problems in computational molecular biology because it has been shown to be NP-complete under the consideration of sum-of-pairs scoring criteria (Kececioglu, 1993; Wang and Jiang, 1994; Bonizzoni and Vedova, 2001), which means that it seems to be hard to design an efficient algorithm for finding the mathematically optimal alignment. Hence, some approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna et al., 1997; Li et al., 2000) and heuristic methods (Feng and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and Sharpe, 1988; Thompson et al., 1994) were introduced to overcome this problem.
Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists regarding the structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment (Tang et al., 2003). (Tang et al. 2003) first designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment (CMSA) tool based on the progressive approach, where each constraint considered by Tang et al. is a single residue/nucleotide only. Their proposed algorithm for the two sequences runs in
(
n 4) time and consumes
(n 4) space, where
is the number of constrained residues and n is the maximum lengths of the sequences. Later, this result was improved independently by two groups of researchers to
(
n 2) time and
(
n 2) space using the same approach of dynamic programming (Yu, 2003; Chin et al., 2003). In fact, each constraint requested to be aligned together can represent a conserved site of a protein/DNA/RNA family and each conserved site may consist of a short segment of residues/nucleotides, instead of a single residue/nucleotide. In other words, the constraint specified by the biologists can be a fragment of several residues/nucleotides. For some applications, biologists may further expect that some mismatches are allowed among the residues/nucleotides of the columns requested to be aligned. Hence, (Tsai et al. 2004) studied such a kind of the constrained sequence alignment and designed an algorithm of
(
n 2) time and
(
n 2) space for two sequences. The improvements and extension above greatly increase the performances and practical usage of the CMSA tools developed using the progressive approach. However, the requirement of
(
n 2) memory still limits the existing CMSA tools to align a set of short sequences, at most several hundreds of residues/nucleotides. To align large genomic sequences of at least several thousands of residues/nucleotides, there is a need to design a memory-efficient algorithm for the constrained pairwise sequence alignment (CPSA) problem, which is the key limiting factor relating to the applicable extent of the progressive CMSA tools. Hence, in this paper, we adopt the so-called divide-and-conquer approach to design a memory-efficient algorithm for solving the CPSA problem, which runs in
(
n 2) time, but consumes only
(
n) space, where
is the sum of the lengths of constraints and usually
<< n in practical applications. Based on this algorithm, we have finally developed a memory-efficient CMSA tool using the progressive approach. Note that applying the divide-and-conquer approach to memory-efficiently align two or more sequences without any constraints has been studied extensively (Myers and Miller, 1988; Chao et al., 1994; Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998). In contrast to the progressive approach used here, the divide-and-conquer algorithms proposed by Stoye et al. (Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998) considered the input sequences simultaneously and heuristically compute the good, but not necessarily optimal, dividing positions so that the resulting total MSA is close to an optimal MSA of the original sequences. In fact, many other CMSAs have been proposed from various perspectives, even using different approaches (Schuler et al., 1991; Depiereux and Feytmans, 1992; Taylor, 1994; Myers et al., 1996; Notredame et al., 2000; Thompson et al., 2000; Sammeth et al., 2003). Of these various CMSAs, it is worth mentioning that (Myers et al., 1996) obtained their CMSA by performing progressive multiple alignment under position-based constraints that are given by users; (Sammeth et al. 2003) got their CMSA by performing simultaneous multiple alignment under segment-based constraints (as same as we studied here) that are pre-computed via a local segmented-based algorithm (Morgenstern, 1999). We refer the reader to their papers for details.
| 2 PROBLEM FORMULATION |
|---|
|
|
|---|
Let
= {S 1, S 2, ..., S
} be the set of
sequences over the alphabet
. Then an MSA of
is a rectangular matrix consisting of
rows of characters of
{} such that no column consists entirely of dashes and removing dashes from row i leaves S i for any 1
i
. The sum-of-pairs score (SP score) of an MSA is defined to be the sum of the scores of all columns, where the score of each column is the sum of the scores of all distinct pairs of characters in the column. In practice, the score of the pair of two dashes is usually set to zero. Then the problem of finding an MSA of
with the optimal SP score is the so-called sum-of-pairs MSA problem (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002).
Let
(T 1, T 2) denote the Hamming distance between two subsequences T 1 and T 2 of equal length, which is equal to the number of mismatched pairs in the alignment of T 1 and T 2 without any gap. Given an alignment
of
, a band is defined as a block of consecutive columns in
(i.e. a submatrix of
). For any band
' of
, let subset(S i ,
') denote the subsequence of S i whose residues/nucleotides are all in the band
', where 1
i
. A subsequence T = t 1 t 2 ... t
is said to appear in
if
contains a band
' of
columns, say
1,
2, ..., 
, such that the characters of column
j , 1
j
, are all equal to t j , or equivalently, subseq(S i ,
') = T for each 1
i
. If
[subseq(S i ,
'), T]
x
for a given error ratio 0
< 1 [i.e. some mismatches are allowed between subseq(S i ,
') and T], then T is said to approximately appear in
. From the biological viewpoint, T can be considered as the consensus among the subsequences in
' and hence T is also called as an induced consensus by the band
'. For any two subsequences T 1 and T 2, T 1
T 2 is used to denote that T 1 (approximately) appears strictly before T 2 in
(i.e. their corresponding bands do not overlap). Let
= (C 1, C 2, ..., C
) be an ordered set of
constraints (i.e. subsequences), each
with length of
i , where 1
i
. Then the CMSA of
with respect to
is defined as an alignment
of
in which all the constraints of
approximately appear in the order C 1
C 2
···
C
such that
(subseq(Si ,
' j ), C j )
j x
for all 1
i
and 1
j
, where
' j is the band of
whose induced consensus is C j . Given a set
of
sequences along with an ordered set
of
constraints and an error ratio
, the so-called CMSA problem is to find a CMSA w.r.t.
with the optimal SP score. When the number of sequences in
is restricted to two (i.e.
= 2), the CMSA problem is called as the CPSA problem.
| 3 ALGORITHM |
|---|
|
|
|---|
In this section, we shall first design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a 1 a 2 ... a m and B = b 1 b 2 ... b n , a given ordered set
= (C 1, C 2, ..., C
) of
constraints, each
with length of
i , 1
i
, and a given error threshold
. After that, we shall use it as the kernel to heuristically solve the CMSA problem.
For any sequence T, let pref(T, l) [respectively, suff(T, l)] phase don't change denote the prefix (respectively, suffix) of T with length l. For any two characters a, b
, let
(a, b) denote the score of aligning a with b. The gap penalty adopted here is the so-called affine gap penalty that penalizes a gap of length l with wo + l x we , where wo > 0 is the gap-open penalty and we > 0 is the gap-extension penalty. For convenience, let A i = pref(A, i) = a 1 a 2 ... a i , B j = pref(B, j) = b 1 b 2 ... b j and
k = (C 1, C 2, ..., C k ), where 1
i
m, 1
j
n and 1
k
. Let
k (i, j) denote the score of an optimal constrained alignment of A i and B j w.r.t.
k . Clearly, 
(m, n) is the score of an optimal constrained alignment of A and B w.r.t.
. An alignment
is called as a semi-constrained alignment of A i and B j w.r.t.
k if it is a constrained alignment of A i and B j w.r.t.
k1 and also ends (or begins) with a band whose induced consensus is equal to a prefix of C k (or a suffix of C 1).
k (i, j, h) is defined to be the score of an optimal semi-constrained alignment of A i and B j w.r.t.
k that ends with an induced consensus equal to pref(C k , h). Let
[respectively,
] be the maximum scores of all constrained alignments of A i and B j w.r.t.
k that end with a deletion pair (a i , ) [respectively, an insertion pair (, b j )]. By definition, it is not hard to derive the recurrence of
k (i, j), 1
i
m and 1
j
n, as follows. If k = 0, then
. If 1
k
, then
. Clearly,
, if
(suff(A i ,
k ), C k )
k x
and
(suff(B j ,
k ), C k )
k x
; otherwise,
k (i, j,
k ) =
. To simply describe the computation of
and
, we introduce another notation
, which is defined to be the maximum score of all constrained alignments of A i and B j w.r.t.
k that end with a substitution pair (a i , b j ). Let
denote the alignment of A i and B j with score
that ends with a deletion pair (a i , ). Let
' be the portion of
before the last aligned pair (a i , ). Then there are three possibilities when we consider the last aligned pair of
'.
Case 1: The last aligned pair of
' is a substitution pair. Then the score of
' is
and (a i , ) is charged by a gap-open penalty and a gap-extension penalty in
. Hence,
.
Case 2: The last aligned pair of
' is a deletion pair. Then the score of
' is
and (a i , ) is charged by only one gap-extension penalty in
. Hence,
.
Case 3: The last aligned pair of
' is an insertion pair. Then the score of
' is
and (a i , ) is charged by a gap-open penalty and a gap-extension penalty in
. Hence,
.
In summary,
. However, by including an extra
into the right-hand side of the above recurrence, we can reformulate the above recurrence as
. Similar to the discussion above, the recurrence of
can be derived as
.
According to the recurrences above, we designed an algorithm to compute 
(m, n) and its corresponding constrained alignment using the technique of dynamic programming as follows. For convenience, we depicted the recurrences of matrices
k ,
,
and
k for all 0
k
by a three-dimensional (3D) grid graph
, which consists of (m + 1) x (n + 1) x (
+ 1) entries and each entry (i, j, k) consists of four nodes
k ,
,
and
k corresponding to
k ,
,
and
k (i, j,
k ), respectively. Figure 1 shows the relationship of four adjacent entries (i, j, k), (i 1, j, k), (i, j 1, k) and (i 1, j 1, k) of
for each fixed k.
|
Note that there is a directed edge, which is not shown in Figure 1, with weight
from the
k1 node of the entry (i
k , j
k , k 1) to the
k node of the entry (i, j, k). Then each path from
0(0, 0) node of entry (0, 0, 0) to 
(m, n) node of entry (m, n,
) corresponds to a constrained alignment of A and B w.r.t.
. As a result, an optimal constrained alignment of A and B can be obtained by backtracking a shortest path from 
(m, n) to
0(0, 0) in
. It is not hard to see that the algorithm costs both computer time and memory in the order of
(
mn). We call the above algorithm based on the dynamic programming approach as CPSA-DP algorithm.
Hirschberg (1975) had developed a linear-space algorithm for solving the longest common subsequence problem based on the divide-and-conquer technique. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences (Myers and Miller, 1988; Chao et al., 1994). In this paper, we generalize the Hirschberg's algorithm so that it is capable of dealing with the CPSA. As compared with others, our generalization is more complicated because the grid graph
dealt here is 3D, instead of 2D, and the input sequences are accompanied with several constraints that need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position (i mid, j mid, k mid) on an optimal path from
0(0, 0) to 
(m, n) in
so that we were able to divide the constrained alignment problem into two smaller constrained alignment problems; then these smaller constrained alignment problems are continued to be divided in the same manner, and finally the optimal constrained alignment is obtained completely by merging the series of the calculated mid-points (Fig. 2).
|
Before describing our algorithm, some notation must be introduced as follows. Let
and
denote the suffixes a i+1 a i+2 ... a m and b j + 1 b j + 2... b n of A and B, respectively, for 1
i
m and 1
j
n. Let
denote the ordered subset (C k + 1, C k + 2, ..., C
) for 1
k
. Define
to be the score of an optimal constrained alignment of
and
w.r.t.
, and define
(
and
, respectively) to be the maximum score of all constrained alignments of
and
w.r.t.
that begin with a substitution [deletion and insertion, respectively] pair (a i + 1, b j + 1) [(a i + 1, ) and (, b j + 1), respectively]. Let
k (h) = [C 1, C 2, ..., C k 1, pref(C k , h)] and
, where 1
h
k . Let
denote the score of an optimal semi-constrained alignment
of
and
w.r.t.
that begins with a band whose induced consensus is equal to suff(C k ,
k h). Note that the recurrences for computing matrices
,
,
,
and
can be developed similarly as those for computing
k ,
,
,
and
k , respectively. Clearly,
. If
[suff(A i ,
k ), C k ]
k x
and
[suff(B j ,
k ), C k ]
k x
, then we can reformulate the recurrence of
k as follows:
k (i, j, 1) =
k 1(i 1, j 1) +
(a i , b j ) and
k (i, j, h) =
k (i 1, j 1, h 1) +
(a i , b j ) for each 1 < h
k .
Next, we describe our divide-and-conquer algorithm, termed as CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t.
as follows. The key point is to determine the middle position (i mid, j mid, k mid) of the optimal path in
to divide the problem into two subproblems, each of which is recursively divided into two smaller subproblems using the same way. Given an alignment
, we use score(
) to denote the score of
. Let 
(A, B) be an optimal constrained alignments of A and B w.r.t.
and clearly score[
(A, B)] = 
(m, n). Let
. Then, we partition 
(A, B) into two parts by cutting it at the position immediately after
and we let
denote the part containing
and
denote the remaining part, where
denotes the last character in
from B, and k mid denotes the largest index so that
(approximately) appears in
Then there are two possibilities when we consider the last aligned pair of
.
Case 1: The last aligned pair of
is a substitution pair [i.e.
,
]. In this case, we have
. If (
,
) is not a constrained column in 
(A, B), then
is an optimal constrained alignment of A i mid and B j mid w.r.t.
ending with a substitution pair (
,
), and
is an optimal constrained alignment of
and
w.r.t.
. Hence,
. If (
,
) is a constrained column in
, then
is an optimal semi-constrained alignment of
and
w.r.t.
ending with a band
' whose induced consensus is equal to
. If
, then
is an optimal semi-constrained alignment of
and
w.r.t.
beginning with a band
whose induced consensus is equal to
. Moreover, the induced consensus of the merge of
' and
have to be equal to
. In this case, we have
. If
, then
is an optimal constrained alignment of
and
w.r.t.
, and hence
.
Case 2: The last aligned pair of
is a deletion pair [i.e. (
,)]. If the first aligned pair in
is not a deletion pair, then
. If the first aligned pair in
is a deletion pair, then
. We need to compensate it by adding wo because the open penalty of the gap containing
and
in 
(A, B) is charged twice by
and
.
In summary, the recurrence of 
(m, n) is derived as follows:
![]() |
is added to the right-hand side, the above recurrence is not changed, but can be reformulated as follows:
![]() |
In other words, j mid, k mid and h mid are the indices j, k and h, where 1
j
n, 0
k
and 1
h <
k , such that the following maximal value is the maximum.
![]() |
Now, we show how to use
(
n), instead of
(
mn), memory to determine j mid, k mid and h mid, where
=
1
k
k and
min{m, n} intrinsically. In fact, a single matrix E of size (
+ 1) x (n + 1) with each entry E(k, j) of
k + 4 space is enough to compute
k (i mid, j),
,
and
k (i mid, j, h), for 1
j
n, 0
k
and 1
h
k . When reaching the entry (i, j, k) of 3D grid graph
, we use entry E(k, j) of E to hold the most recently computed values of
k (i, j),
,
and
k (i, j, h), which clearly needs a total of
k + 4 space. Note that the old values in entry E(k, j) will be moved into an extra entry, termed as V k whose space is equal to E(k, j), before they are overwritten by their newly computed values. Before moving the old values in E(k, j) into V k ; however, we need to first move
k (i 1, j 1) in V k into a space, named as v k, k + 1, where 1
i
m. The mechanism above will enable us to compute
k (i, j,1), which needs to refer to
k 1(i 1, j 1) that is kept in v k 1, k ; compute
k (i, j, h) for each 2
h
k , which needs to refer to
k (i 1, j 1, h 1) that is kept in V k ; compute
, which needs to refer
k (i 1, j 1) that is kept in V k ; and finally we were able to compute
k (i, j). Figure 3 shows the grid locations of E(k 1), E(k) and the values in V k 1 and V k when we reach the entry (i, j, k) of
for the computation, where E(k) denotes the k-th row of E. Hence, the total needed space for computing and storing all
k (i mid, j),
,
and
k (i mid, j, h) is the sum of the space of matrix E, the space of all V k and the space of all v k, k + 1, where 1
j
n, 0
k
and 1
h
k , which is equal to
(
n). Similarly, the required matrix, denoted by
, for computing all
,
,
and
still needs
(
n) space. Hence, the determination of j mid, k mid and h mid can be performed in
(
n) space. The details of CPSA-DC algorithm are described as follows. Note that the program code of BestScoreRev is similar to that of BestScore and hence is omitted here. In the codes, the variable E(
k (i mid, j)) is used to denote the value of
k (i mid, j) in E(k, j) and others are analogous. The global variables
,
B (j, k, h) =
(suff(B j , h), pref(C k , h)), and
are computed in Algorithm BestScore so that they can be used directly in Algorithm CPSA-DC.
|
Algorithm CPSA-DC(i start, i end, j start, j end, k start, k end)Input: Sequences
and
with constraints
)1: if (i start > i end) or (j start > j end) then Align the nonempty sequence with spaces;
else
BestScore(i start,i mid,j start,j end,k start,k end);
BestScoreRev(i mid + 1,i end,j start,j end,k start,k end);
end if
2: max =
;
for j = j start 1 to j end do
for k = k start 1 to k end do
if
then
j mid = j; k mid = k; type = case 1;
end if
if
then
;
j mid = j; k mid = k; type = case 2;
end if
if
then
;
j mid = j; k mid = k; type = case 3;
end if
if k
1 then
for h = 1 to
k 1 do
if
and
then
if
max then
;
j mid = j; k mid = k; h mid = h; type = case 4;
end if
end if
end for
if
and
then
if
max then
;
j mid = j; k mid = k; h mid = h; type = case 5;
end if
end if
end if
end for
end for
3: if type = case 1 then
CPSA-DC(i start,i mid 1,j start,j mid,k start,k mid);
Align
with a space;
CPSA-DC(i mid + 1,i end,j mid + 1,j end,k mid+1,k end);
end if
if type = case 2 then
CPSA-DC(i start,i mid 1,j start,j mid,k start,k mid);
Align
with two spaces;
CPSA-DC(i mid + 2,i end,j mid+1,j end,k mid+1,k end);
end if
if type = case 3 then
CPSA-DC(i start,i mid 1,j start,j mid 1,k start,k mid);
Align
with
;
CPSA-DC(i mid+1,i end,j mid+1,j end,k mid+1,k end);
end if
if type = case 4 then
CPSA-DC(i start,i mid h mid,j start,j mid h mid,k start, k mid 1);
Align
with
;
CPSA-DC(i mid+
k h mid+1,i end, j mid+
k h mid + 1,j end,k mid+1,k end);
end if
if type = case 5 then
CPSA-DC(i start,i mid
k ,j start, j mid
k ,k start, k mid 1);
Align
with
;
CPSA-DC(i mid+1,i end,j mid+1,j end,k mid+1,k end);
end if
Algorithm BestScore(i start, i end, j start, j end, k start, k end)
Input: Sequences
and
with constraints (
)
1: /* Reindex */
m = i start i end + 1; n = j start j end + 1;
= k start k end + 1;
2: /* Initialization */
for j = 0 to n do
for k = 0 to
do
;
if (j = 0) or (k > 0) then
;
else
;
if (j = 0) and (k = 0) then E(
k (i mid,j)) = 0;
else E(
k (i mid,j)) =
;
if k
1 then
for h = 1 to
k do E(
k (i mid,j,h)) =
;
end if
end for
end for 3: /* Computation */
for i = 1 to m do
for k = 0 to
do /* For the case of j = 0 */
Vk (
k (i mid, 0)) = E(
k (i mid,0));
if k
1 then
for h = 1 to
k do Vk (
k (i mid,0,h))
= E(
k (i mid,0,h)));
end if
;
;
end for
for j = 1 to n do /* For the case of j > 0 */
for k = 0 to
do
temp k (
k (i mid,j)) = E(
k (i mid,j));
if k
1 then
for h = 1 to
k do temp k (
k (i mid,j,h))
= E(
k (i mid,j,h));
end if
;
;
;
if k
1 then
for h = 1 to
k do
if h = 1 then
;
else
;
end if
end for
end if
;
v k,k+1 = V k (M k (i mid,j));
V k(
k (i mid,j)) = temp k (
k (i mid,j));
if k
1 then
for h = 1 to
k do
V k (
(i mid,j,h)) = temp k (
k (i mid,j,h));
end for
end if
if i = m and k
1 then
for h = 1 to
k do
if h = 1 then
if j = 1 and
then
A (k,h) = 1; else
A (k,h) = 0;
if
then
B (j,k,h) = 1; else
B (j,k,h) = 0;
else
if j = 1 and
then
A (k,h) =
A (k,h 1) + 1;
if
then
B (j,k,h)
=
B (j,k,h 1)+ 1;
end if
end for
end if
end for
end for
end for
Now, we analyze the time-complexity of our CPSA-DC algorithm for solving the CPSA. As shown in Figure 2, after determining the middle position (i mid, j mid, k mid) of the optimal path in
, we can divide the original problem into two subproblems, each of which further can be recursively divided into two smaller subproblems using the same way. Note that regardless of where the optimal path passes through (i mid, j mid, k mid), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of entries in
. It is not hard to see that the time-complexity of determining the middle position of each subproblem at each recursive stage is proportional to the size of the subproblem. Let
denote the size of the original problem (i.e.
=
m n). Then the total time-complexity of our CPSA-DC algorithm is equal to
, which is twice as high as the CPSA-DP algorithm. Using the CPSA-DC algorithm as a kernel, we were able to design a memory-efficient algorithm, termed CMSA-DC, for progressively aligning multiple input sequences into a CMSA according to the branching order of a guide tree. The above progressive method we adopted was proposed by (Tang et al. 2003). Owing to space limitation, we refer the reader to their paper for the details of its implementation.
| 4 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
We use Java language to implement the CMSA-DC algorithm as a web server, called as MuSiC-ME (Memory-Efficient tool for Multiple Sequence Alignment with Constraints). The input of the MuSiC-ME system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each with a fragment of residue/nucleotide that (approximately) appears in all input sequences. The output of MuSiC-ME is a CMSA in which the fragments of the input sequences whose residues/nucleotides exhibit a given degree of similarity to a constraint are aligned together. For its biological applications, we refer the reader to other related papers (Tang et al., 2003; Tsai et al., 2004).
In the following, we evaluate our memory-efficient MuSiC-ME system and compare its running time and memory to the original MuSiC system (Tsai et al., 2004), whose kernel CPSA algorithm was implemented by the dynamic programming approach. We chose five families of protein/RNA sequences as our testing datasets, each of which has been shown to contain an ordered series of conserved motifs related to the structures/functionalities/consensuses of the family (McClure et al., 1994; Chin et al., 2003; Tang et al., 2003; Tsai et al., 2004): (1) the aspartic acid protease family (Protease), (2) the hemoglobins family (Globin), (3) the ribonuclease family (RNase), (4) the kinase family (Kinase) and (5) the 3'- untranslated region of the coronaviruses (CoV-3'-UTR). From each family, we have selected a representative set of sequences and adopted the ordered series of conserved motifs as the constraints. Table 1 lists the information of the tested families and their constraints. All tests were run with default parameters on IBM PC with 1.26 GHz processor and 512 MB RAM under Linux system. Table 2 lists the CPU time and memory usage of our experiments using MuSiC and MuSiC-ME. It shows that the memory usage of MuSiC-ME is much smaller than that of MuSiC for large-scale sequences, and the CPU time required by MuSiC-ME is smaller than that required by MuSiC for short sequences, since we have simplified the recurrences of the dynamic programming here.
|
|
It is worth mentioning that in MuSiC-ME system, the letters representing the constraints are not just the individual residues/nucleotides, but also the IUPAC (International Union of Pure and Applied Chemistry) codes. For example, nucleotides





