Bioinformatics Advance Access originally published online on August 27, 2004
Bioinformatics 2005 21(2):160-170; doi:10.1093/bioinformatics/bth497
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 2 © Oxford University Press 2005; all rights reserved.
A new algorithm for detecting low-complexity regions in protein sequences
Department of Computer Engineering, Kyungpook National University Daegu 702-701, Korea
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Pair-wise alignment of protein sequences and local similarity searches produce many false positives because of compositionally biased regions, also called low-complexity regions (LCRs), of amino acid residues. Masking and filtering such regions significantly improves the reliability of homology searches and, consequently, functional predictions. Most of the available algorithms are based on a statistical approach. We wished to investigate the structural properties of LCRs in biological sequences and develop an algorithm for filtering them.
Results: We present an algorithm for detecting and masking LCRs in protein sequences to improve the quality of database searches. We developed the algorithm based on the complexity analysis of subsequences delimited by a pair of identical, repeating subsequences. Given a protein sequence, the algorithm first computes the suffix tree of the sequence. It then collects repeating subsequences from the tree. Finally, the algorithm iteratively tests whether each subsequence delimited by a pair of repeating subsequences meets a given criteria. Test results with 1000 proteins from 20 families in Pfam show that the repeating subsequences are a good indicator for the low-complexity regions, and the algorithm based on such structural information strongly compete with others.
Availability: http://bioinfo.knu.ac.kr/research/CARD/
Contact: swshin{at}bioinfo.knu.ac.kr
| 1 INTRODUCTION |
|---|
|
|
|---|
Sequence similarity searching is one of the most demanding techniques for utilizing the explosively growing quantity of biological sequence data. Sequence similarities obtained from a database search with a newly sequenced protein as a query, e.g. provides useful information for the detection of homologies and for discovering functional, structural and evolutionary information of the sequence. Searching a genomic database for sequences that are similar to a query cDNA sequence provides valuable information for identifying gene organization, polymorphism, function, etc. [For an excellent summary in this area, see Mount (2000).]
There are two types of similarity search, global and local, which differ in terms of the strategy of the algorithms. Global similarity searches seek to maximize the similarity score along the whole sequences (Needleman and Wunsch, 1970). Local similarity searches seek to maximize the score of some isolated regions (Smith and Waterman, 1981). Local similarity algorithms are usually more meaningful than the global ones because biological sequences include conserved patterns, which have possibly been mutated and/or rearranged in the course of gene evolution. Moreover, often, database searches involve partially sequenced or erroneous sequences, for which global similarity search does not give meaningful information.
Many algorithms for the local similarity searches, such as FASTA (Pearson, 1990) and BLAST (Altschul et al., 1997), among others, have been developed, and their performance has been evaluated. However, one of the difficulties encountered by these algorithms is to reliably distinguish related sequences from unrelated ones, because they produce many false positives from which relevant biological information is difficult to derive. For example, a high similarity score between a pair of unrelated protein sequences can occur because the sequences contain compositionally biased or non-random (so called low-complexity) regions of some amino acid residues (Wootton, 1994). Such a false score may lead to an erroneous prediction due to these regions, which are not genuine homology. [In this paper, we shall use the term low-complexity regions (LCRs) to denote the compositionally biased regions defined in Wootton and Federhen (1993).] The compositional and structural properties of LCRs and their dynamics are not well understood. They may be the results of mutational processes such as replication slippage, unequal crossing-over and biased nucleotide substitution. With no clear biological model available, however, it is challenging to develop an analytic method for identifying LCRs, if any, on a given biological sequence.
It has been shown that by filtering LCRs from protein sequences, the quality of similarity searches can be improved. Several heuristic preprocessing algorithms have been developed for masking a query sequence for LCRs before a database search is performed. Among others are three algorithms, XNU (Claverie and States, 1993), SEG (Wootton and Federhen, 1993) and CAST (Promponas et al., 2000). Given a protein sequence, they detect LCRs and mask them with an undefined residue type X (Altschul et al., 1994). It has been shown that with such masked sequences the quality of similarity searches can be improved.
XNU identifies LCRs in two ways, by internal repeats and intrinsic repeats. Internal repeats are the tandem arrangements of discrete units, and intrinsic repeats are the compositionally biased segments of a small number of distinct amino acids with no clear repeating pattern. These repeats are identified on a dot-plot matrix of self-comparison of the query sequence by scoring the local similarity with a PAM matrix and estimating the statistical significance of the score (Claverie and States, 1993).
SEG detects LCRs based on an information measure of the complexity state vector, which reflects residue composition appearing on a sliding window, with no regard of the patterns or periodicity of sequence repetitiveness. SEG is a two-pass algorithm. In the first pass, the algorithm identifies approximate segments of low complexity using the sliding window. In the second pass, these segments are optimized. If the information measure of an optimized region is lower than a given threshold, then the region is marked as an LCR.
CAST has a different approach to LCRs from the others. LCRs are defined as the regions that score higher than the threshold (cut-off) value in the results of a local similarity search with a homopolypeptide, i.e. a sequence composed of a single amino acid type. The similarity search is done using a dynamic programming algorithm. For a given input sequence, 20 similarity searches are performed iteratively, each against one amino type of homopolymer. In each of the iterations, the algorithm finds a region showing the highest score. If the score exceeds the chosen threshold, the algorithm masks the residues of the type identical to the homopolymer type.
In this paper we introduce a new algorithm, called CARD (complexity analysis with repeat delimit), for detecting LCRs based on repeating subsequences. The algorithm is different from XNU in that it targets only the regions of the sequence that are delimited by a pair of identical subsequences. If these subsequences are positioned in tandem or overlapped, the region containing the two identical subsequences is marked as an LCR. Otherwise, the algorithm, going from left to right, iteratively computes the (Shannon's) information of the repeat concatenated with each segment of the same length as that of the repeat. This iteration continues until either it reaches the right repeating subsequence and masks the subsequence as an LCR, or detects that the computed information is greater than that of the left repeating subsequence. The algorithm uses the suffix tree (Weiner, 1973), which is a powerful data structure for computing all repeating substrings in a given sequence. We carried out an extensive experiment with the four algorithms, XNU, SEG, CAST and CARD, to analyze their performance on a test dataset collected from the protein database Pfam (Bateman et al., 2002). The results show that our simple approach is strongly competitive with others. The results imply that repeating subsequences are a strong indicator for searching LCR.
| 2 SYSTEMS AND METHODS |
|---|
|
|
|---|
The algorithm CARD is written in Visual-C++, and developed on a Dell precision 420 workstation. Two versions are available; one running on Windows 2000 and the other on UNIX/Linux using the GNU C++ compiler.
| 3 PRELIMINARY |
|---|
|
|
|---|
In this paper, we will use the term biological sequence (or simply sequence) and subsequence, respectively, for the generic terms string and substring. Notice that in the field of computer science, the term subsequence is often used not only for a segment of a sequence but also for an ordered collection of (not necessarily contiguous) characters along a sequence. However, in this paper, we will use subsequence exclusively for a segment of a sequence, i.e. substring.
For a sequence s, a suffix of s is a trailing subsequence of s, and a prefix is a leading subsequence of s. For example, y is a suffix and x is a prefix of s, if s = xy. For a sequence s, by |s| we denote the number of residues in s, i.e. the length of s. Repeating subsequence (or simply repeat) in s is a subsequence that appears more than once in s. In a biological sequence, a pair of repeats can occur separated, in tandem or overlapped as shown in Figure 1. For a pair of identical repeats, we call the one located to the left as the left repeat, and for the one located to the right as the right repeat. For a sequence s, by s[i . j] we denote the subsequence of s from i-th position up to j-th position of sequence s. For two subsequences s[i . j] and s[k . l], by s[i . j]s[k . l] we denote the sequence constructed by concatenating the two subsequences.
|
Let
be an alphabet, i.e. a set of residues. We present the following well-known theorem (Harrison, 1978) which shows an important sequence property that will be used in the following section. THEOREM 1.
Let s, x, y and z be arbitrary sequences over
such that s = xy = yz. Then there exists subsequences u and v such that x = uv, z = vu, and y = (uv) p u = u(vu) p , for some integer p
0.
COROLLARY.
Let s, x, y and z be arbitrary sequences over
such that s = xy = yz. If |y|
|x|, then s consists of either a tandem repeats or two overlapping tandem repeats.
PROOF.
If |y| = |x|, then obviously s = yy. String s consists of tandem repeat of y. Suppose |y| > |x|, then clearly, p > 0, because, otherwise, |y| = |u|
|uv| = |x|. It follows that
![]() |
This implies that sequence s consists of overlapping two tandem repeats of (uv)(p + 1) and (vu)(p + 1) as shown in Figure 2.
|
A tree is a connected graph with no cycle. A rooted tree is a tree which has a designated node called root. (In this paper, by tree, we mean a rooted tree.) Labeled tree is a tree with a label assigned on each edge. Leaf nodes are the nodes with no child node. A node that is neither leaf nor root is called internal node. A path is a sequence of edges, and for a given path in a labeled tree, the path label of the path is the concatenated labels found along the path.
A suffix tree for a sequence s is an edge labeled tree, which satisfies the following two conditions:
- For every suffix y of sequence s, there is a unique path from the root to a leaf node with path label y. If y starts at i-th position of s, the leaf node is labeled with i.
- If two suffices y and y' of s have a common prefix, say x, then there is a common path from the root to an internal node with label x.
Figure 3 shows an example of the suffix tree for sequence baaabaaa$. [Notice that the end of sequence marker $ is appended to make the tree satisfy the conditions. For further details, refer to Gusfield (1997)].
|
There are many algorithms available that, given a sequence over a constant alphabet, construct a suffix tree in time linear to the length of the sequence (Gusfield, 1997). In this paper we use Ukkonen's algorithm for constructing suffix trees (Ukkonen, 1995).
The suffix tree of a sequence readily provides all repeating subsequences of the sequence and their locations. Every path label from the root to an internal node corresponds to repeating subsequences, and the labels in the leaf nodes of the subtree under the internal node are the starting positions of the repeats. In Figure 3, we can easily see that baaa is a repeat located at 1 and 5, aaa is a repeat located at 2 and 6, aa is a repeat located at 2, 3, 6 and 7, and so on. We need the following definition:
DEFINITION 1.
Let v be an internal node of a suffix tree for sequence s. Suppose that the length of the path label x from the root to node v is k, and the subtree rooted at v has m
2 leaves labeled with suffix numbers i 1,i 2, ..., i m . Then we call Pv =
k,i 1,i 2, ..., i m
the position list at v.
Position list P v implies that sequence s contain repeats of length k starting at each of the locations i 1, i 2, ..., i m . Notice that all the positions in P v are distinct. For convenience we keep these indices sorted in increasing order. Traversing the suffix tree depth first, we can compute the position list for every internal node. We can easily show that there can be at most n 2 internal nodes in a suffix tree for a sequence of length n.
In this paper, we use Shannon's information as the measure for the complexity of a subsequence. Suppose a sequence s over an alphabet
= {a 1, a 2, ..., a k } of size k (e.g. k = 20 in case of amino acid sequence), with each a i having a fractional composition P(a i ). For a sequence s, Shannon's information measure H(s) is defined as follows, where the logarithm's base is 2 (Shannon, 1951).
![]() |
Notice that if s consists of one symbol, i.e. P(a i ) = 1, for some i (and others are all zero), then H(s) = 0, meaning that string s has the lowest information content. The information is maximized when all P(a i ) are equal.
In this paper, we define LCR as a subsequence that is delimited by a pair of identical repeats and the information of each segment (of length equal to the length of repeat) is not greater than that of the repeat. (In the following section on the algorithm we will further elaborate on this part.) In Sections 5 and 6 we will carry out extensive experiments with a protein test set collected from Pfam in order to find optimal parameters for our algorithm and compare it with other algorithms for detecting LCRs. Our fundamental objective of filtering LCRs from a biological sequence is to mask out LCRs without affecting other conserved regions, called domains, that fold into a three-dimensional structure. (Pfam provide families of proteins with annotated domains.)
To measure the performance of such filtering algorithms we define three measures as follows. For a given sequence s o of length L, let s m be the masked sequence. Let D o and D m be the number of domains in s o and s m , respectively, detected when the masked sequence is queried to Pfam database. Let T X and N X , respectively, be the number of masked residues in the filtered sequence and the number of masked residues outside of the domains of the sequence. We define three ratios; detection ratio (DR), hit ratio (HR) and mark ratio (MR) as follows:
![]() |
| 4 ALGORITHM |
|---|
|
|
|---|
Now we present our algorithm, named CARD. Given a sequence s, the algorithm detects all LCRs in s and masks them in two phases. The algorithm first constructs a suffix tree T for s and for every internal node v, computes the position list P v . In the second phase, the algorithm, using the position lists, iteratively detects LCRs and masks them as follows:
For an internal node v, suppose that P v =
k,i 1,i 2, ..., i m
and let x j0 be the j-th repeat of length k starting at position i j . Let x ji be the i-th segment of length k to the right of x j0 (Fig. 4). For each adjacent pair of position indices i j and i j+1, 1
j < m, in P v , if the repeats appear in tandem or overlapped, then the algorithm masks subsequence s[i j . i j+1 + k 1] as an LCR. Otherwise, the algorithm examines the subsequence s[i j . i j + 1 + k 1] as follows. Let t = H(x j0), i.e. the information in subsequence s[i j . i j+1 + k 1]. The algorithm iteratively computes the information H(x j0 x j1), H(x j0 x j2), H(x j0 x j3), ... and so on, until it either computes an information t ' > t, or reaches the right repeat x (j+1)0 = s[i j+1 . i j+1 + k 1]. In the former case, the algorithm terminates the iteration, and in the latter case subsequence s[i j . i j+1 + k 1] will be masked as an LCR.
|
Now we present a pseudo code for algorithm CARD in Figure 5. Notice that step 6 of the algorithm first checks if the subsequence delimited by a pair of repeats is a tandem repeat. If it is, the algorithm simply masks the subsequence without computing its information. Otherwise, it enters the iteration of steps 9 and 10 to test if the information of the subsequence is not greater than that of the repeat as described above.
|
| 5 SPEEDING UP THE ALGORITHM |
|---|
|
|
|---|
If the repeats are in tandem or overlapped, step 6 of the algorithm simply marks the region spanned by the repeats as an LCR. However, for the other case, when the right and left repeat are separated by a gap, the algorithm, going from left to right, iteratively traces the variation of the information by step 9 through 10 to see if it can reach the right repeat without exceeding the threshold value. Theoretically, it would take too much time to examine all possible pairs of such repeats. There can be O(n 2) possible number of repeating subsequence pairs in a sequence of length n. For each region delimited by a pair of repeats, checking if it is an LCR takes O(n) time. Hence, the algorithm requires O(n 3) time in the worst case. [Recall that our algorithm for constructing a suffix tree takes O(n) time.] However, our experiments show that it is possible to speed up the algorithm using proper bounds on the length of repeats and LCRs.
5.1 Bounds on the length of LCR
For the lower bound of LCR length, we set it to 4, because an LCR length of 3 must be delimited either by repeats of length 1 separated by one residue, or delimited by overlapping repeats of length 2. It follows that the repeat-delimited subsequence of length 3 should be a homopolymer (AAA, e.g.). Such subsequences are abundant in protein domains, which must be avoided. To see how such a short lower bound affects the filtering, we ran CARD with 1000 test sequences and masked LCRs in the sequences. (In the next section, we will describe this test set in more detail.) With the 1000 masked test sequences, searching the Pfam database with the lower bound set to 3 resulted in the domain DR of 94.7%. With the lower bound set to 4, we found that the ratio increased up to 98.9%. This implies that protein domains contain significant amounts of homopolymers of length 3. So we set 4 as our default lower bound of LCR for our algorithm.
To find a proper upper bound, we carried out the same experiments to see how the number of LCRs changes depending on their length. Figure 6 shows these results. The figure shows that for the test set, no LCR of length
18 is detected. We chose 18 as the default upper bound on the LCR length.
|
5.2 Bounds on the length of repeating subsequences
The following theorem shows that with 2 as the lower bound on the length of repeats, CARD can detect all LCRs of length
4, the lower bound on the length of LCRs that we have set above. THEOREM 2.
Let w be a subsequence such that |w|
4. If w is masked as an LCR by algorithm CARD, then w is delimited by a pair of repeats whose length is at least two.
PROOF.
If w consists of more than one type of residues, it should be delimited by a pair of identical repeats of length at least two. Otherwise, if the length of the repeat is one, the threshold of information is zero, and the algorithm will terminate the iteration (steps 9 and 10) in the middle and fail to mask w as an LCR. If subsequence w consists of one type of residue, then w is homopoymer, and hence, consists of two overlapping subsequences of length |w| 1, that will be masked as an LCR by step 6 of the algorithm. Obviously, when |w| = 4, which is the lower bound of LCR length, the algorithm will mask w as an LCR with two overlapping identical repeats of length 3.
To find a proper upper bound on the length of repeating substrings which delimits an LCR, we ran CARD with the same test data that we used for the bounds on the length of LCRs, and examined how the number of masked residues changes, depending on the length of repeats. Figure 7 shows that the number of masked residues leveled off at repeating subsequence of length 8. Therefore, we set our default upper bound on the length delimiting repeats to 8.
|
5.3 Improving the algorithm further with domain property
We found the DR for two families LRR and EGF (87 and 91%, respectively) typically low compared to that (>97%) of the others. This fact implies that the domains of these two families contain many repeats. We examined the test sequences from these families, and found that their domains are short (23 and 35 residues, respectively, while the sequence average is 122), and have many leucine rich repeats, such as lsls, dldl, slsl, ylyl, flfl and lnln. It turned out that CARD had masked these tandem patterns as an LCR, causing the lower DR. Column 2 of Table 1 shows the total number of these six tandem patterns appearing in each family of our test set, and column 3 shows the number of those appearing inside the domains. This table shows that a significant portion (
70%) of the pattern exists in the domains. Hence, we decided not to mark those patterns to improve DR.
|
Adding those tandem patterns as an optional parameter to be excluded from masking, we ran CARD with the same test set. The results showed a significant increase in DR for the two families (for LRR up to 93%, and for EGF up to 92%) without affecting other families. This is because the range of masking inside the domains of the other families is not significant enough to affect the results of Pfam's domain search algorithm. This experiment shows the possibility of improving the algorithm's performance further with a repeat pattern library collected from wide varieties of biological sequences.
| 6 PERFORMANCE TEST AND ANALYSIS |
|---|
|
|
|---|
To evaluate the performance of our algorithm CARD, we ran it with a test set, together with the other three LCR filtering algorithms; SEG, XNU and CAST. Unfortunately, there is no benchmark test data available for evaluating LCR filtering algorithms. We constructed a test set by first selecting the top twenty families (see the first column of Table 1) from the Pfam protein database, which has about 4800 domain families, and then randomly choosing 50 protein sequences from each family, collecting a total of 1000 test sequences. This test set contains full protein sequences as well as protein fragments. Columns 2 and 3 of Table 2 show the number of domains and the proportion (%) of domain residues, respectively. Entries in column 2 show that some proteins have multiple domains and some do not have any. (For example, LRR family contains members having two domains LRR and zf-C2H2.) Column 4 of the table shows the proportion of the residues in LCRs identified by CARD that are delimited by a pair of overlapping or tandem repeats. In parentheses of the last two columns, it shows the standard deviation(s) of each proportion from the average. This column shows that on an average 78% of the LCRs are delimited by tandem or overlapping repeats. This data implies that CARD detects a significant proportion of LCRs in constant time, without going into the time consuming iteration (steps 8 through 10) of the algorithm.
|
With this test set we ran the four filtering algorithms, SEG [with default parameters W = 12, K 2(1) = 2.2 bit and K 2(2) = 2.5 bits], CAST (with threshold score 40), XNU (with PAM120 and probability cut-off = 0.01, max-search-offset = 4 and min-search-offset = 1) and CARD (with LCR length: lower bound = 3, upper bound = 18; repeat length: lower bound = 2, and upper bound = 8). Then the 1000 filtered (i.e. masked with X) sequences by each algorithm were queried to Pfam, and for each family, computed the three ratios that we have defined in Section 2: DR, HR and MR. The results are summarized in Table 3. Figure 8 compares the average ratios from the four algorithms over all test sequences.
|
|
Recall that the objective of filtering is to mask as much non-domain regions as possible without masking domain regions. Masking domain regions may decrease the detection ratio. Notice that by simply not masking any residue we can get the maximum detection ratio, 100%, because all Pfam entries have their domains identified with its own designated search algorithm. However, not masking any residues is giving up the objective of masking, i.e. to reduce high-scoring database queries due to compositionally biased regions. Figure 8 shows that the four algorithms have diverse properties. Overall, CAST and CARD exhibit quite contrasting performance compared with the other two.
Their average detection ratios (CAST, 98.3%; CARD, 98.5%) are higher than those of others (SEG, 96.6%; XNU, 97.0%), while their average masking ratios (CAST, 2.3%; CARD, 3.1%) are lower than those of the others (SEG, 7.2%; XNU, 5.6%). These results imply that SEG and XNU are over masking into domain regions.
In terms of HR, CAST is the best (73%). However, we cannot evaluate the algorithms by HR alone. We should also consider other measures. CAST's MR is the lowest due to the selective filtering procedure. Notice that for families GP120, rvp, COX1, RuBiscos, cytochrome and HCV, even though algorithm CARD's HR is significantly low (and some of them are quite lower than those of other algorithms), its DR is 100%. This is because the domain length in those families is quite large (>70% of the sequence length as shown in column 2), and the masking is not critical enough to affect Pfam's domain detection algorithm. As an example, Figure 9 shows a GP120 family member, human immunodeficiency virus type 1, and its masked version, where italics show the domain region.
|
To examine the statistical quality of the masked sequences by the four algorithms, we randomly chose five sequences from each of top 20 families in pfam DB, totaling 100. We masked these 100 sequences by each of the four algorithms, and using BLASTP at EBI, queried protein database UniProt with these masked sequences. For the query, we used BLASTP's default parameter; BLOSUM62, EXP THR = 10 and normal sensitivity. Table 4 summarizes the responses. Column 2 of the table shows the number of masked proteins, which varies significantly depending on the algorithm. Column 3 shows the number of cases where the top hit of the BLASTP search was the queried sequence itself. It shows that all the proteins masked by the four algorithms were at the top.
|
Column 4 shows the percentile of the top-ranking queried sequences whose P-value is greater than that of the sequences masked by CARD. For example, out of the 42 queried sequences masked by CAST, 60.4% (25 sequences) of them have a P-value greater than those from CARD. (Recall that the higher the P-value, the lower the significance of the similarity score.) This result shows that, in terms of statistical quality of the masked sequences, CARD is the best among the four algorithms. The last column shows the running time of the four algorithms for masking the 100 proteins, with a average length of 420 residues. CARD is very slow. On average, it takes
0.23 s to mask a sequence. The algorithm is implemented in C++. The major data structure of the algorithm is object lists for the suffix tree, which requires extensive dynamic memory management.
6.1 Giving a margin on the threshold information
The algorithm CARD uses the information content of the repeat as a threshold for deciding whether the region delimited by the repeating subsequence is an LCR or not. Using the same test set, we ran CARD to see how its filtering capability changes in terms of the three measures, DR, HR and MR, when the threshold is increased by certain percentage of the default value. Figures 10 12 show the results.
|
|
As expected, as the threshold increases, DR decreases, while MR increases. These figures show a minor change in all the three measures with the threshold increased up to 40%, and then a steep change. In particular, HR shows positive as well as negative effects up to 40%, which implies compositional variations of the subsequences delimited by repeats. We found that families with many short domains (e.g. LRR, WD40, ank, PPR and efhand) are the main source of such variations (see column 2 of Table 1). This experiment shows the possibility of increasing the MR of algorithm CARD by raising the threshold margin up to 40%, while keeping the algorithm's DR comparable to that of SEG and XNU.
| 7 DISCUSSION |
|---|
|
|
|---|
We have presented a new algorithm, named CARD, for detecting LCRs in a given protein sequence based on repeating subsequences. The repeating subsequences are found on the powerful data structure suffix tree constructed with the given sequence. We showed that using the property of repeating subsequences in the biological sequences, it is possible to speed up the algorithm and improve the quality of masking. Our experiments for the performance of the four LCR filtering algorithms with the measures DR, MR and HR show that CARD can compete fairly with the other three algorithms.
As far as we know there are no reports in the literature describing structural characteristics of LCRs in biological sequences. Our analysis with the test data shows that repeating subsequences, which can be efficiently collected from the suffix tree, are a good indicator for detecting LCR.
If we want to filter a large number of sequences and are concerned with the processing time, CARD is not the optimal choice. However, for an interactive application with a few sequences, CARD's processing time of about 0.23 s per protein would not be a serious problem. The current version of CARD has enough room to improve its running time. We are currently revising the algorithm for constructing the suffix tree based on the suffix array representation. This revision will improve the running time. Our next work is to extend our algorithm for filtering DNA sequences.
|
Received on September 30, 2003; revised on May 25, 2004; accepted on August 19, 2004
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F., Boguski, M.S., Gish, W., Wootton, J.C. (1994) Issue in searching molecular sequence database. Nat. Genet., 6, 119129[CrossRef][Web of Science][Medline].
Altschul, S.F., Madden, T.L., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Grpped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 33893402
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiler, L., Eddy, S.R., Sam, G.-J., Howe, K.L., Marshall, M., Sonnhammer, E.L.L. (2002) The Pfam Protein Families Database. Nucleic Acids Res., 30, 276280
Claverie, J.-M. and States, D.J. (1993) Information enhancement methods for large scale sequence analysis. Comput. Chem., 17, 191201[CrossRef].
Gusfield, D. Algorithms on Strings, Trees, and Sequence, (1997) , NY Cambridge University Press, pp. 89107.
Harrison, M.A. Introduction to Formal Language Theory, (1978) , Reading, MA Addison-Wesley Publishing Company, pp. 59.
Mount, D. Bioinformatics, (2000) , AZ CSHL Press, University of Arizona, pp. 53203.
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol., 48, 443453[CrossRef][Web of Science][Medline].
Pearson, W.R. (1990) Rapid and sensitive sequence comparison with FASTAP and FASTA. Methods Enzymol., 183, 6398[Web of Science][Medline].
Promponas, V.J., Enright, A.J., Tsoka, S., Kreil, D.P., Leroy, C., Hamodrakas, S., Sander, C., Ouzounis, C.A. (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics, 16, 915922
Shannon, C.E. (1951) Prediction and entropy of printed English. Bell Syst. Tech. J., 5060.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequence. J. Mol. Biol., 147, 195197[CrossRef][Web of Science][Medline].
Ukkonen, E. (1995) On-line construction of suffix-tree. Algorithmica, 14, 249260[CrossRef].
Weiner, P. (1973) Linear pattern matching algorithms. Proceedings of the 14th IEEE symposium on Switching and Automata Theory, , Washington, DC , pp. 111.
Wootton, J.C. (1994) Sequence with unusual amino acid composition. Curr. Opin. Struct. Biol., 14, 413421.
Wootton, J.C. and Federhen, S. (1993) Statistics of local complexity in amino acid sequence and sequence database. Comput. Chem., 17, 149163.
This article has been cited by other articles:
![]() |
I. B. Kuznetsov ProBias: a web-server for the identification of user-specified types of compositionally biased segments in protein sequences Bioinformatics, July 1, 2008; 24(13): 1534 - 1535. [Abstract] [Full Text] [PDF] |
||||
![]() |
C.-T. Su, C.-Y. Chen, and C.-M. Hsu iPDA: integrated protein disorder analyzer Nucleic Acids Res., July 13, 2007; 35(suppl_2): W465 - W472. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Li and T. Kahveci A Novel algorithm for identifying low-complexity regions in a protein sequence Bioinformatics, December 15, 2006; 22(24): 2980 - 2987. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||















