Skip Navigation


Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(7):912-921; doi:10.1093/bioinformatics/bti076
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/7/912    most recent
bti076v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Itoh, M.
Right arrow Articles by Kanehisa, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Itoh, M.
Right arrow Articles by Kanehisa, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Fast and accurate database homology search using upper bounds of local alignment scores

Masumi Itoh *, Susumu Goto , Tatsuya Akutsu and Minoru Kanehisa

Bioinformatics Center, Institute for Chemical Research, Kyoto University Gokasho, Uji, Kyoto 611-0011, Japan

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 

Motivation: It is widely recognized that homology search and ortholog clustering are very useful for analyzing biological sequences. However, recent growth of sequence database size makes homolog detection difficult, and rapid and accurate methods are required.

Results: We present a novel method for fast and accurate homology detection, assuming that the Smith–Waterman (SW) scores between all similar sequence pairs in a target database are computed and stored. In this method, SW alignment is computed only if the upper bound, which is derived from our novel inequality, is higher than the given threshold. In contrast to other methods such as FASTA and BLAST, this method is guaranteed to find all sequences whose scores against the query are higher than the specified threshold. Results of computational experiments suggest that the method is dozens of times faster than SSEARCH if genome sequence data of closely related species are available.

Availability: The programs for fast homolog detection can be downloaded from ftp://ftp.kuicr.kyoto-u.ac.jp/itoh/

Contact: itoh{at}kuicr.kyoto-u.ac.jp


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 
Ortholog detection by homology search is one of the most important and reliable methods for annotating protein sequences. The Smith–Waterman algorithm (SW algorithm or local alignment algorithm) is known as a sensitive method for detecting similarities between protein sequences, and the SW-score is also used as a standard similarity measure between two sequences (Smith and Waterman, 1981). Although the SW-score and/or the E-value calculated from it is one of the most effective measures of protein homology (Brenner et al., 1998), the SW algorithm is not so efficient. This algorithm requires quadratic time for each comparison of two sequences. Therefore, recent growth of sequence database size makes homology search difficult.

Thus, many other efficient methods have been developed such as FASTA (Lipman and Pearson, 1985) BLAST (Altschul et al., 1990) and their variants (Altschul et al., 1997; Rognes et al., 1998; Zhang et al., 2000). Fast sequence comparison methods such as PatternHunter (Ma et al., 2002) and MUMmer (Delcher et al., 1999) have been developed for the whole-genome sequence comparisons. Fast and accurate methods have also been developed for alignment of expressed sequence tag (EST) sequences against genomes (Ogasawara and Morishita, 2003). Though these are very fast and some of these are very sensitive, none of these has a theoretical guarantee that it will find all sequences whose SW-scores against the query sequence are higher than a given threshold.

On the other hand, some approaches to directly speed up the SW algorithm are already reported (Rognes and Seeberg, 2000; Davidson, 2001) but they still require quadratic time in the worst case. In the field of theoretical computer science, extensive studies have been done on nearest-neighbor search (Muthukrishnan and Sahinalp, 2002) and fast sequence search (Myers, 1986, 1994) based on the edit distance or similar distances. However, we cannot apply these for the SW search since the properties of the SW-score and distances are quite different, as described in Methods.

To estimate the orthologs in protein families, discrimination of the types of similarities such as bidirectional best hit, best hit, reverse best hit and homology is important, and the all-against-all similarities between proteins are necessary for obtaining this information. The homologous relationships in the gene/protein universe can be represented as a network or a graph of genes/proteins, in which nodes represent sequences and edges represent similarities between pairs of the sequences. The types of the similarities are represented as labels or weights for the edge. The sub-graph structure around target proteins and the types of edges are very useful for large-scale detection of the ortholog clusters.

We have to accurately measure the similarities between all pairs of proteins to define the edges in the graph. For detection of distant homologs, there are profile-based methods such as PSI-BLAST (Altschul et al., 1997), and they work successfully using the information from multiple homologous sequences. However, these methods do not work for protein pairs, and therefore the SW algorithms are still needed for accurate ortholog detection.

Based on SW-scores, the KEGG/SSDB has been developed. The KEGG/SSDB is the database of homologous relationships among the whole protein-coding genes and gene clusters of ortholog candidates for the KEGG/GENES database (Kanehisa et al., 2002, 2004. The KEGG/GENES database contains the whole protein set of completely and partially sequenced genomes. The KEGG/SSDB contains SW-scores (the score-cut off is 70) for all proteins stored in the KEGG/GENES database and all information for types of edges. In other words, the KEGG/GENES database contains the nodes of the graph and the KEGG/SSDB contains the edges.

However, enormous computer power is required to assign types of edges for the whole protein set of a newly sequenced genome. Computing SW-scores against a large database takes quite a while, even for only one protein sequence. Moreover, the gene/protein universe has much redundancy and thus includes many similar sequences. This redundancy is increasing because genome sequencing projects on related species are ongoing.

Motivated by the issues introduced above, we consider a new method for fast and accurate homology search under the condition that many similar or redundant sequences and SW-scores among these sequences are stored in a database.

The basic idea of our method is as follows. If a target database includes at least one sequence similar to a query, we may reduce the number of homolog candidates in the database, by using homology information of similar sequences in the target database. For instance, if a query is similar to protein A, the query will not have homology with proteins not homologous to A and we do not need to compute alignments with such non-homologous proteins. This means possible sequence space can be restricted from the condition that the query has homology with protein A. Therefore, we can restrict a range of the SW-score between a query and homolog candidates by using SW alignments and SW-scores between the query and other proteins.

In this method, we derive an inequality of SW-scores among three given sequences to estimate an upper bound by using known scores. For the three sequences (q, t and c), we can estimate the upper bound of sw(q,c) by using sw(t,c) and the SW alignment between q and t [where sw(x,y) denotes the SW-score between sequence x and sequence y] (Fig. 1a). For a special case which covers most BLOSUM matrices (Henikoff and Henikoff, 1992) the inequality is simplified into a quasi metric (Stojmirovic, 2003; Waterman et al., 1976).



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 1 (A) Estimation of an upper bound of the SW-score. If we know sw(t,c) and SW alignment between q and t, we can estimate an upper bound of sw(q,c). (B) Outline of fast and accurate search. At first, we find a template similar to the query with a fast heuristic method like BLAST. Then, upper bounds of SW-scores for homolog candidates are estimated, and irrelevant candidates are screened out. In this case, SW alignments are computed only for remaining candidates whose upper bounds of sw(q,c) are higher than the threshold (300).

 
Our method of using this upper bound estimation is summarized as follows (Fig. 1b). Given a query sequence q and a threshold of the SW-score, we first find a template sequence t having high similarity with q using one of the fast homology search tools such as BLAST. We then estimate an upper bound of the SW-score between q and each sequence c (candidate sequence) in the KEGG/SSDB, by using SW alignment between q and t and the SW-score between t and c. We only compute SW-scores for sequences with upper bounds higher than the threshold. Following this approach, we can significantly reduce search time without losing the sensitivity of the SW algorithm if we get a good upper bound. In the ultimate case of q = t, we only need to list sequences which have SW-scores against t that are higher than the threshold.

We also present the results of computational experiments, using the proposed approach with the KEGG/SSDB. The results suggest that the proposed approach and upper bounds are very useful for fast and accurate homology search via the KEGG/SSDB. Our method enabled high-sensitivity search with only a little more computation time than that of the BLAST search. Though we focus on the KEGG/SSDB in this paper, the method is general enough that it can be applied to other sequence databases if they contain SW-scores among similar sequences. Currently, the KEGG/SSDB is the only known public database containing all-against-all SW-scores. However, some studies have been done on clustering database sequences based on all-against-all SW-scores (Suharnan et al., 1997; Kriventseva et al., 2001, 2003), although these data are not publicly available. Our method may also be applied to homology search against such database sequences. We also discuss other potential applications of the derived inequality and another method for deriving upper bounds of local alignment scores.


    METHOD
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 
Triangle inequality
The triangle inequality is a well-known property for the distances. If there were an appropriate distance measure dist(x,y) on sequences, we would have the following:

which could be used to estimate a lower bound of dist(q,c) using dist(t,q) and dist(t,c) (a lower bound for distance corresponds to an upper bound for score). However, the SW-score is not distance and does not satisfy the triangle inequality. Though one might expect sw(x,z) ≥ sw(x,y) + sw(y,z), this inequality does not hold. Moreover, no explicit and concrete relationship between the distance and the score is known for practical score matrices (Mount, 2001) such as BLOSUM (Henikoff and Henikoff, 1992) or PAM (Dayhoff et al., 1978). Thus, we need to derive another inequality.

Alignment score
Here we briefly review SW alignment. Let be the set of amino acids (i.e. ). For each sequence x over , |x| denotes the length of x and xi denotes the i-th letter of x (i.e. x = x1 x2 ... xn if |x| = n). Then, a global alignment between sequences x and y is obtained by inserting gap symbols (denoted by ‘–’) into or at either end of x and y such that the resulting sequences x' and y' are of the length l, where it is not allowed for any i ≤ l that both and are gap symbols.

Score matrix s(a,b) is a function from to the set of reals, where [we need not consider s(–,–)]. We reasonably assume s(a,–) = s(–,b) = –d for all a,b where d > 0 and s(a,a) > 0 for all a . But, we do not necessarily assume s(x,y)=s(y,x). A local alignment between x and y is a global alignment between and , where and are substrings of x and y, respectively. The score of global alignment (or local alignment) is defined by . An optimal global alignment (resp. an optimal local alignment) is a global alignment (resp. a local alignment) having the largest score.

In the above, we assumed linear gap cost. However, affine gap cost should be used in practice. In the affine gap cost model, cost for the first position in each consecutive gap region is larger than that for other positions. When using affine gap cost, d and e (d > 0, e > 0) denote the cost per first gap position (i.e. opening gap cost) and the cost per other gap positions (i.e. extension gap cost), respectively.

Inequalities for Smith–Waterman scores
Let (t',q') be a local alignment between t and q [we need to distinguish (t',q') from (q',t') only if asymmetric score matrices are used]. We define POS++ (resp. POS+–, POS–+) to be the set of positions i such that and (resp. ,). We define SPOS+– to be the set of positions i such that and hold. SPOS–+ is defined in an analogous way. Thus, SPOS+– {cup} SPOS–+ is the set of gap opening positions. For i SPOS+– {cup} SPOS–+, gi denotes the length of the consecutive gap region starting from the position i. We define POS to be the set of positions of q that are not included in (t',q'). Figure 2 illustrates the definitions of these notations. Then, we obtain the following theorem on the SW-score with affine gap costs (see Appendix 1 for the proof).



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 2 Positions in local alignment. t, template sequence; q, query sequence; POS++, aligned positions between t and q; POS+–, positions where gaps are inserted to q; SPOS+– and SPOS–+, positions of opening-gap (X in figure); POS_–, positions of q not appearing in local alignment.

 
THEOREM 1

It should be noted that the last four terms in the right-hand side depend only on q and t (not on c). Therefore, we need to compute these terms only once, and thus additive time for computing an upper bound per candidate sequence is constant.

As a special case of Theorem 1, we have:

COROLLARY 1
If the score matrix satisfies and for all x,y , then

holds for any sequence q,t or c.

The conditions for a score matrix mean that the highest alignment score for each amino acid is given for aligning the amino acid to itself, and for any pair of amino acids (x,y), the largest difference of scores against an amino acid is attained when the amino acid is identical to x. We call the latter two terms of the right-hand side in Corollary 1 the additive factor. The upper bound of the SW-score between a query and a candidate is the sum of the additive factor and the SW-score between the query and a template. Note that the additive factor depends only on q and t. If a query and a template are identical, the additive factor will be 0 and equality is attained. If a query and a candidate are identical, equality is also attained. However, the additive factor is not necessarily 0.

Waterman et al. (1976) did some pioneering work related to the above inequality. Stojmirovic (2003) independently obtained the same inequality (under a slightly different condition using a considerably different proof technique) recently. But, he did not obtain Theorem 1 or similar results, nor apply it to practical problems (i.e. fast homology search). Spiro and Macura (2004) also independently obtained the same inequality as in Corollary 1 quite recently. They performed computational experiment in order to examine redundancy of a protein database based on the inequality. However, they did not obtain an explicit form of the inequality given in Theorem 1, nor perform computational experiment on homology search using upper bounds of local alignment scores.

Although the condition of Corollary 1 holds for most of BLOSUM score matrices including BLOSUM50 and BLOSUM62, it does not hold for several matrices (Stojmirovic and Pestov, 2003). For example, it does not hold for any practical PAM matrices and several BLOSUM matrices. Thus, Theorem 1 should be employed when using PAM and some other matrices.

Data set
We applied the proposed method to the whole protein-coding gene set of some species in the KEGG/GENES database using the KEGG/SSDB (Kanehisa et al., 2002, 2004) and tested the efficiency and features of the method.

We obtained the whole set of SW-scores for all-against-all sequence comparison from the KEGG/SSDB and other information for whole protein sets of Homo sapiens, Mus musculus, Rattus novergicus, Escherichia coli K12 and Shigella flexneri 301 from the KEGG/GENES database. The KEGG/SSDB stores SW-scores (no less than 70) for all-against-all pairs of whole protein sets in the KEGG/GENES database. The scores are computed by the SSEARCH program in FASTA package (Lipman and Pearson, 1985) using BLOSUM50 matrix, –12/–2 gap penalties (i.e. d = 12, e = 2). In fact, the KEGG/GENES database and the KEGG/SSDB included different kinds of species in our data set (158 species in GENES and 129 species in SSDB). Therefore, we used sequences and SW-scores for 119 strains of 99 species stored in both databases. This data set included five strains of E. coli and only one strain of S. flexneri.

Fast and accurate search for sequences of S. flexneri
To test the efficiency of our method under the condition that the whole set of protein sequences of similar species is known, we implemented a simple program by combining BLAST and SSEARCH programs using the BioRuby (http://www.bioruby.org/). The BioRuby is a suite of open software tools for bioinformatics written in the Ruby programming language.

We randomly picked up 100 sequences from the whole protein set of S. flexneri 301 and used them as queries, where the sequences of S. flexneri 301 were excluded from the database. An outline of our procedure for fast homology search is given in Figure 1b. At first, we find a similar sequence for each query from the KEGG/GENES database by using BLASTP (in the BLAST package), which is employed as a template sequence for the query. Using this template sequence and homology data for the template in the KEGG/SSDB, we screen out sequences whose upper bounds of SW-scores are lower than given threshold {theta}. Then, we compute SW alignments for all remaining candidate sequences using the SSEARCH program. The SSEARCH program is the most widespread implementation of the SW algorithm. In this experiment, we used {theta} = 150 and the default parameters for SSEARCH and BLASTP. The default substitution matrices are BLOSUM50 for SSEARCH and BLOSUM62 for BLASTP. If we use BLOSUM50 instead of BLOSUM62 for BLASTP, we may detect more remote homologs, but computation time will increase because the number of candidate target sequences will increase. However, the difference between BLOSUM matrices for BLASTP will have no effect on our method once a template sequence which is enough similar to the query sequence is found. Thus, we may even use BLOSUM80 for BLASTP. We measured time for our method against the KEGG/GENES database and compared with the naïve SW search using SSEARCH. All programs for our experiments were carried out on PowerMac G5 (2GHz PowerPC G5 x 2), where only one CPU was employed.

Quickly searchable sequences
As mentioned above, we need to compute SW alignments only for candidate sequences in a target database for which upper bounds of SW-scores against the query sequence are higher than the specified threshold {theta}. The SW-scores in the KEGG/SSDB are computed by using BLOSUM50, and the inequality from Corollary 1 can be applied for all scores in the KEGG/SSDB. Precisely, we need to compute SW alignments only against candidates satisfying the inequality

and the other sequences are screened out.

We would like to apply this inequality to fast search against the KEGG/GENES database by the procedure mentioned above. However, if the query sequence does not have enough similar sequences in the KEGG/GENES database, we are not able to apply our method. Additionally, the KEGG/SSDB does not store low SW-scores. Thus, to effectively screen out irrelevant sequences, the query sequence must have at least one template sequence satisfying

where swlowest is the lowest SW-score in the KEGG/SSDB. We call such a query sequence a quickly searchable sequence. It will be necessary to compute SW alignments between the query sequence and all sequences in the KEGG/GENES database if a query is not quickly searchable.

To estimate practical efficiency of the proposed method, we observed the ratio of the quickly searchable sequences to the sequences in each genome. We used swlowest = 70 for bacterial genomes (the lowest score of KEGG/SSDB), and swlowest = 100 for mammalian genomes (because of the limitation of memory space in PowerMac G5).


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 
Results on S. flexneri
Here we present the results on efficiency of the proposed method where randomly picked 100 quickly searchable sequences from S. flexneri 301 were used as queries. Table 1 shows computation time for homology search against the KEGG/GENES database by our proposed method and by the naïve SW search. Our method produced exactly the same search results as the naïve SW search did.


View this table:
[in this window]
[in a new window]
 
Table 1 Comparison with naïve SW search

 
As shown in Table 1, naïve use of SSEARCH took about 500 s against the whole KEGG/GENES database per query in average. In contrast, our proposed method took only 6.3 s. It is ~78 times faster than SSEARCH. In our approach, total search time consisted mostly of BLAST search for a template sequence. Recall that the template search procedure is equivalent to homology search with BLAST against the KEGG/GENES database. This means that our proposed method provides exactly the same high sensitivity as SW search does, with a little additional computation time to BLAST.

Efficiency of the proposed method for respective queries depends mainly on computation time for SW alignment against remaining candidates. The time for this procedure widely varied according to the number of homologs of a template sequence, which is correlated with the number of homologs of a query sequence. If the query is a member of a large protein family, there are a lot of homolog candidates and much computation time will be needed. To put it the other way around, a rate-limiting factor of usual homology search is the computation against the non-homologous protein sequences and our method can cut off the time for most of this wasted computation. For some queries in Table 1, our method was not very efficient despite the small number of candidates. Computation time of the naïve SW search is proportional to the length of a query sequence. On the other hand, our method requires overhead time for loading SW-scores from the database and exchanging intermediate data between programs. The ratio of this overhead becomes larger for shorter query sequences, and thus our method is less efficient for shorter query sequences. However, it should be noted that our method was still 33 times faster than the naïve method even for the worst case (sfl:SF0102) of our experiment.

Results on the ratios of quickly searchable sequences
We showed above that the proposed method is much faster than the naïve method and needs only a little additional time compared to the BLAST search.

To apply our proposed method effectively, query sequences need to be quickly searchable, i.e. query sequences should have at least one template sequence satisfying the condition described in Method. For quickly searchable sequences, our method could reduce wasted computation by referring the SW-scores of similar template sequences. Therefore, the proposed method will be quite useful for all-against-all comparison of the whole protein set of new species and a database of already sequenced species when at least one closely related species is included in the database. The KEGG/GENES database includes multiple strains of several species. In particular, it contained five E. coli strains in this experiment. Orthologous sequences from different strains were quite similar, and thus most of the E. coli sequences were quickly searchable. Furthermore, not only the different strains in a species but also closely related species are frequently sequenced and thus we can speed up all-against-all comparison for such species. For example, S. flexneri 301 is closely related to E. coli, but there are no other strains of S. flexneri in our data set.

To estimate the practical efficiency of the proposed method, we observed the ratio of the number of quickly searchable sequences. In this experiment, we used SW-scores already stored in the KEGG/SSDB to detect template sequences, but similar results would be obtained for any fast heuristic (such as BLAST).

In each case, sequences whose strains were the same as the query’s strains were not used as template sequences. However, there may be the same sequences among different species/strains. In that case, we can make use of these sequences: If we find a template sequence which is the same as the query sequence, we simply list the sequences whose SW-scores against the template sequence are higher than {theta}. Therefore, we also observed the ratio of such sequences (shown by ‘Identical’ in Fig. 3).



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 3 The ratios of quickly searchable sequences. Each bar shows the ratio of the quickly searchable sequences to the whole protein sequences in respective species for various threshold {theta}. The ratio of the sequences having at least one identical sequence in the target database is also shown (denoted as Identical). When closely related species are included in the target database, the query sequence becomes quickly searchable in most cases.

 
The result is given in Figure 3. Each bar shows the ratio (percentage) of the quickly searchable sequences to the whole sequences, at respective thresholds {theta}.

It is seen that a large part of sequences in E. coli K12 and S. flexneri 301 are quickly searchable. Suppose that we are to carry out all-against-all comparison between the whole protein sequences of these species and the sequences stored in the KEGG/GENES database. For quickly searchable sequences, we need to compute SW-scores only against a small part of the KEGG/GENES (<1% on average) and the time for this part is negligible compared with the time for non-quickly searchable sequences. We need to search against the whole KEGG/GENES only for the rest of query sequences (2% for E. coli K12, 15% for S. flexneri 301 at {theta} = 150). Therefore, we can drastically speed up all-against-all comparison (e.g. ~50 times for E. coli and ~6.7 times for S. flexneri 301 at {theta} = 150), since computation time for the quickly searchable sequences accounts for ~1/78 of database sequences (Table 1). Existence of many quickly searchable sequences for E. coli and S. flexneri is explained by the fact that genome sequences of different strains of the same species and/or closely related species of E. coli and S. flexneri are already stored in the KEGG/GENES.

Figure 3 also shows that 85% of the sequences in E. coli K12 have identical template sequences, while only 34% of the sequences of S. flexneri 301 have identical template sequences. However, 85% of the sequences of S. flexneri 301 are quickly searchable at {theta} = 150. This also suggests that the proposed method is useful even if genomes of same species are not sequenced.

Focusing on the mammalian genomes, the ratios of quickly searchable sequences are not so high. However, their genome sequencing projects are still ongoing and the KEGG/GENES includes up to 13 000 genes out of 25 000 or more genes in each mammalian genome. In spite of this situation, 5~15% (at {theta} = 150) of their protein sequences can be screened out. It is expected that the percentage increases significantly if most protein sequences are accurately annotated and are put into the KEGG/GENES and the KEGG/SSDB.

Identities, additive factor and upper bound
Recall that the upper bound of sw(q,c) is calculated in Corollary 1 as the sum of sw(t,c) and sw(q,q)–sw(t, q) and that the latter part is called the additive factor. The additive factor is intrinsically linked to the similarity between the query sequence and the sequences in the database.

Figure 4 shows relationships between the sequence identity and the additive factor observed among sequences in E.coli K12. Horizontal axis and vertical axis show the identity of sequence pairs (t,q) and the average additive factor for the pairs respectively, where four groups of sequence data are examined that are classified according to the lengths of sequences. The additive factor at identity = 100% is not 0, because there are sequences having additional non-aligned regions on their terminals. It is seen that the additive factor and the sequence identity have a negative correlation. This is because, as implied by Corollary 1, the additive factor becomes high when the identity is low.



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 4 Relation between sequence identity and additive factor. The additive factors intrinsically correlate with sequence identities between queries and templates. Horizontal axis shows identities (%) and vertical axis shows additive factors at respective range of lengths of queries.

 
Figure 5 shows a relationship between the observed additive factor and the ratio of screened sequences in the KEGG/GENES database for randomly picked 100 template sequences from E. coli K12. As shown in Figure 5, ratios of screened sequences fall rapidly around thresholds ({theta}) for the SW-score, and most of the sequences in the database can be screened out, if additive factors are smaller than {theta} – 100. In this case, the KEGG/SSDB does not contain SW-scores <70 and the ratios become 0% at {theta} – 70. Therefore, the method is not useful when the additive factor is large. It is seen from Figure 4 that the additive factor will be <50~150 if the identity is >90~95%. Then, it is expected from Figure 5 that the proposed method works well when the threshold of the SW-score is 150~250.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 5 Relation between additive factor and ratio of screened sequences. If a query is a quickly searchable sequence, most of the sequences in a target database can be screened out. Horizontal axis shows value of additive factor and vertical axis shows ratios of screened sequences at respective threshold for SW-scores.

 

    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 
In this paper, we have proposed a novel approach for fast and accurate detection of homologs using pre-computed similarity scores among sequences in a target database. For that purpose, we have introduced an inequality for deriving upper bounds of local alignment scores. The usefulness of the approach was demonstrated using the KEGG/SSDB.

As is well known, the SW-scores and their derivatives, such as E-values, are one of the most sensitive measures for protein homology (Brenner et al., 1998). On the other hand, BLAST is one of the fastest heuristic algorithms for homology search. It is reported that performance of homology search by the SW alignment, especially around the twilight zone, is better than that of BLAST (Brenner et al., 1998).

This property is observed in our experiments. For example, eco:b0004 is a threonine synthase in E. coli K12 and a possible ortholog of this enzyme in Archaeoglubus fulgidus is afu:AF1316. The two sequences are included in the same COG(COG0498) (Tatusov et al., 1997, 2001) and annotated as the same KO(K01733) (Kanehisa et al., 2004). Homology search for the KEGG/GENES database by BLAST using BLOSUM50 missed afu:AF1316 even though we relaxed the threshold for E-value to 100. The E-values are affected by the target database size, but we could not detect afu:AF1316 even if we searched only against the whole set of proteins in A. fulgidus. The SW-score between eco:b0004 and afu:AF1316 was 180, which can be thought similar enough. Furthermore, BLAST missed another two proteins whose SW-scores are >200 and which are annotated as K01733 (species having these proteins are not included in COG). The E-values calculated by SSEARCH for all of them were <10–3. By changing substitution matrix to BLOSUM62, the afu:AF1316 could be detected, but many more ortholog candidates in the same COG and KO would be missed. In contrast, no proteins whose BLAST E-values are ≤10–1 were missed by the SW alignment, where the threshold for SW-scores was 150.

For the detection of orthologs, the graph structures of gene/protein universe and information of label of edges, showing the types of similarities such as bidirectional best hit, are useful. For searching evolutionary related proteins, there are also hidden Markov Model (HMM) profile-based methods such as HMMER (Eddy, 1998) or SAM (Karplus et al., 1998) and PSSM-based methods such as PSI-BLAST (Altschul et al., 1997). These methods successfully work for the prediction of function and/or detection of distant homologs. However, for accurate assignment of the edges and their types in the graph of gene/protein universe, it is important to measure relationships in evolution between pairs of proteins, not relationship among members of families or groups. Furthermore, we have to prepare the definition of protein families for HMM profile-based methods. For PSI-BLAST, we need to iterate homology search to construct the PSSM and we also need to manually refine candidates of homologs at respective steps of iteration to obtain the best result. Thus, these methods are not really suited to the detection of orthologs in a large scale. Therefore, the SW algorithm is useful for sensitive and accurate homology search, and it is still required to speed up SW search.

Our method can be said as the method to assign edges for newly added proteins using information of the existing graph of similarity networks of the gene/protein universe. Experimental results suggest that the proposed method screens out irrelevant sequences quite efficiently and provides accurate results which are the same as those of naïve SW alignment search, with only a little additional time to BLAST, if a query sequence is quickly searchable. For eco:b0004, our method was ~82 times faster than naïve SW search. Most queries will be quickly searchable if sequences in genomes of different strains and/or closely related species are stored in the database. Thus, our methods are quite effective for all-against-all comparison for such redundantly sequenced species.

Furthermore, there is a bias in the analysis of species because of the human interest, so there is a large redundancy for certain species within the newly sequenced genes/proteins. Recently, genome projects for multiple strains of some biologically important species, e.g. mammals, crops and pathogenic bacteria, are ongoing and the sequence redundancy is increasing. Moreover, genome sequences sometimes have minor changes during the process of the genome projects. In such a case, we may use sequences in previous versions as template sequences. These facts suggest that the proposed method will become more useful in the near future.

We can utilize this upper bound estimation for speeding up homology search in another way: clustering of database sequences. If we can properly cluster a database and select a representative sequence from each cluster, we only need to search against representative sequences and need to compute SW alignments for sequences in the clusters whose estimated upper bounds exceed the threshold (Itoh et al., 2004). This method directly corresponds to the reduction of sequence redundancy in a target database. There are some previous works on clustering sequences to improve speed and/or accuracy (Dubey et al., 2004; Li et al., 2001, 2002). But, our inequality may provide a theoretically guaranteed measure to obtain the same result as in the non-clustering case.

Though the proposed method is useful for fast and accurate homology search, it is not directly applicable to the maintenance of the KEGG/SSDB because the KEGG/SSDB contains all-against-all SW-scores that are no less than 70 and thus upper bounds will be always ≥70. But, maintenance of the KEGG/SSDB might be done efficiently if we could classify the sequences into clusters as mentioned above. At that time, we just need to maintain SW-scores among representative sequences. For that purpose, it is interesting to note that Corollary 1 can also be used for deriving a lower bound of sw(q,c) because it can be rewritten as

Using this, we can bound the possible range of SW-scores, which might be useful for clustering. In addition, we can apply our method without pre-computed score database, by using incremental procedure. Therefore, our method would be useful not only for maintenance of the score database but also for clustering sequences in a large database.

On the other hand, the inequality in Corollary 1 can be transformed as

assuming s(x,y) = s(y, x). The left-hand side can be thought as the difference of similarity between q and t and similarity between c and t. When the template sequence t is thought as a reference point in the sequence space, the left-hand side reflects the distance between q and c in the sequence space or the gene/protein universe. Thus, this inequality provides an upper bound of the measure of distance.

Since Corollary 1 is satisfied by any sequences, assuming s(x,y) = s(y, x), we can exchange q and c and obtain the following:

This also shows the same measure of distance between q and c but in the inverse direction. We can find that the provided upper bounds via respective inequalities are different and that the measures become asymmetric, which was also observed by Stojmirovic (2003). This asymmetric property is interesting for clustering. Here, we define two asymmetric distances as follows:

where dista->b denotes the distance from a to b (Figure 6a). It should be noted that we use ‘distance’ as a matter of convenience, though the distance should be symmetric in general. The asymmetricity reflects domain organizations of q and c. When two sequences q and c consist of the same single domain, distq->c and distc->q will be similar small values. However, suppose that c has one extra domain as shown in Figure 6b. Then, distc->q becomes large even if distq->c is small. We might use these distances for clustering of proteins so that domain organizations are well reflected. Asymmetric distance might be a powerful tool to cluster sequences and to explore the gene/protein universe.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 6 (A) Asymmetric distance. The distances are defined as upper bounds of |sw(q,t) – sw(c,t)| (see Discussion). (B) Asymmetric property in comparison of multi-domain proteins. When two proteins consist of the same domain, the SW-score between the proteins is high and the distances in both directions become small. However, when an additional domain is included in c, the distance from c to q becomes large, even if the SW-score is high.

 
The upper bound of sw(q,c) given by Theorem 1 is not necessarily tight for species not having closely related species in the database. Thus, it is reasonable to develop algorithms for getting tighter upper bounds. For example, the following approaches might be useful:
  1. use of results of alignments for both (t,q) and (t,c). In this case, upper bounds should be computed quickly (e.g. constant or linear time);
  2. estimation of the maximum possible value of sw(q,c) without knowing c. For example, it may be useful to compute UB0(t,q,nc,{alpha}) = maxq [sw(q,c) | sw(t,c) = {alpha}, |c|=nc] for each possible score of {alpha} when t and q are given;
  3. estimation of the maximum possible value of sw(q,c) without knowing q. For example, it may be useful to compute UB1(t,c,nq,{alpha}) = maxq [sw(q,c) | sw(t,q)={alpha}, |q|=nq] for each possible score of {alpha} when t and c are given.

Though we have not yet developed effective algorithms for (1) or (2), we have developed an algorithm for (3). It should be noted that (3) is useful only if large pre-processing time is allowed for construction of the database. The developed algorithm and its experimental result are shown in Appendix 2. Though the algorithm is not yet practical, it might be possible to improve the practical computation time using some heuristics. Moreover, the technique might be extended for other problems and it is at least interesting from a computational viewpoint.

Application of the proposed approach is not limited to SW search. It is easy to imagine that if a sequence is similar to one protein family, the sequence cannot be similar to other protein families not having any homologous relation with the family. Thus, the proposed method might be applicable for the family detection based on profiles of protein families such as HMM (Betaman et al., 2004).

In our work, upper bounds were derived by considering the protein sequence space restricted with sequence similarities and alignments. The gene/protein universe will be biased compared with the space of theoretically possible sequences, and the bias comes from functional and/or evolutionary reasons. Therefore, we may obtain tighter upper bounds by analyzing this bias. Though we evaluated the proposed method using SW-scores, evaluation using such statistical values as E-value might be possible since E-value can be calculated as in BLAST from the lengths of two input sequences and its SW-score once the sequence database is fixed.

The whole genome sequences have been determined for >100 species, and genome sequencing projects are still ongoing for more species. Though sequences provided by these projects do not cover all gene/protein sequences, they are considered to be useful to capture a global view of the gene/protein universe, which is also important for understanding functions and evolution of proteins. Defining appropriate measures for comparing protein sequences, especially for multi-domain protein sequences, will be useful to develop methods for exploring the gene/protein universe. The work presented here is a step toward such a development.


    APPENDIX 1
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 
Here we give a proof sketch of Theorem 1.

To prove Theorem 1, we construct a multiple alignment among t,q,c assuming that an optimal local alignment al(q,c) (between q and c) and arbitrary local alignment al(t,q) (between t and q) are given. First, we consider the case where linear gap costs are used and holds in both al(q,c) and al(t,q) (i.e. the whole part of q appears in alignments). The multiple alignment al(t,q,c) is constructed by the following rules, where (tj,qi) al(t,q) means that column (tj,qi) appears in al(t,q):

  • (tj,qi) al(t,q) and (qi,ck) al(q,c)
    {Rightarrow} (tj,qi,ck) al(t,q,c),
  • (tj,qi) al(t,q) and (qi, ‘–’) al(q,c)
    {Rightarrow} (tj,qi,‘–’) al(t,q,c),
  • (‘–’,qi) al(t,q) and (qi,ck) al(q,c)
    {Rightarrow} (‘–’,qi,ck) al(t,q,c),
  • (tj,‘–’) al(t,q)
    {Rightarrow} (tj,‘–’,‘–’) al(t,q,c),
  • (‘–’,ck) al(q,c)
    {Rightarrow} (‘–’,‘–’,ck) al(t,q,c).
We extract a local alignment al(t,c) between t and c from (t',q',c') = al(t,q,c) by concatenating the columns such that or .

We obtain inequalities for each column of al(t,c) in the following way.





Summing up all cases, we have

where POS++ are defined for al(t,q).

Next, we consider the case where some parts of q do not appear in al(t,q) but the whole part of q appears in al(q,c). Let POS be the set of positions of q not appearing in al(t,q). Then, we should subtract from score[al(q,c)] and thus we have:

It is straightforward to see that the other cases for linear gap costs can be covered by this inequality.

To extend this inequality for affine gap costs, it is enough to replace gi · d with d + e (gi –1). Therefore, the theorem follows from sw(t,c) ≥ score[al(t,c)] and sw(q,c) = score[al(q,c)].


    APPENDIX 2
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 
Since it seems difficult to compute the exact value of maxq [sw(q,c) | sw(t,q)={alpha}], we relax the condition of sw(t,q) = {alpha} and consider the following problem, where we assume s(x,y) = s(y,x) for all x,y

PROBLEM 1
Given sequences t and c, alignment score {alpha} and the length of the query sequence nq, compute the maximum score between q and c under the condition that there exists a (not necessarily optimal) local alignment al(t,q) between t and q whose score is {alpha}.

The maximum score in Problem 1 is denoted by UB(t,c,nq,{alpha}) {i.e. UB(t,c,nq,{alpha}) = maxq [sw(q,c) | |q| = nq and [ al(t,q)] (score[al(t,q)] = {alpha})]}. Note that UB(t,c,nq,{alpha}) ≥ UB1(t,c,nq,{alpha}) holds. It is expected that UB(t,c,nq,{alpha}) is close to UB1(t,c,nq,{alpha}) in many practical cases because {alpha} is usually set to a value near to sw(t,t) and the following is expected in general (when varying q): the higher score[al(t,q)] is, the lower sw(q,c) is.

We show the algorithm for the case of global alignment and linear gap cost for the sake of simplicity. The algorithm can be modified for the case of the SW algorithm with affine gap cost without increasing the order of the time complexity.

The algorithm is based on dynamic programming. It uses a four-dimensional array F[i,j,k,p], which is defined as follows:

  • there exists a sequence q1 ... qk,
  • there exists a global alignment between t1 ... ti and q1 ... qk whose score is p (not necessarily optimal),
  • the score of an optimal global alignment between q1 ... qk and c1 ... cj is F[i,j,k,p],
where F[i,j,k,p]=–{infty} if there does not exist q1 ... qk satisfying the second condition. It should be noted that the range of p is the set of possible alignment scores and thus p can take negative values. The range of p is O(n) since the maximum alignment values between sequences of length O(n) is O(n). The core part of the dynamic programming procedure is given below.


It is straightforward to see that UB(t,c,nq,{alpha}) = F[|t|,|c|,nq,{alpha}] holds. Therefore, we can compute UB(t,c,nq,{alpha}) for (nq,{alpha}) in some ranges simultaneously. It is easy to see that this procedure takes O(n4 ) time and O(n4) space. It is also possible to extract q which attains UB(t,c,nq,{alpha}) using a traceback procedure.

THEOREM 2
Suppose that |t|, |c| and nq are O(n). Then, the values of UB(t,c,nq,{alpha}) for all possible (nq,{alpha}) can be computed in O(n4 ) time.

Here we show some supplemental results, which confirm that the DP algorithm gives in some cases tighter upper bounds than Corollary 1 does. A local alignment version of the algorithm was implemented using C-language where linear cost gaps were only considered. The program was run on a PC/LINUX cluster (2.8GHz Xeon CPUs), where only one CPU was employed for each case.

We used several pairs of short protein sequences from the SCOP/ASTRAL databases (Chandonia et al., 2004). For each pair of sequences (t,c), we examined several {alpha} and compared upper bounds given by the DP algorithm and Corollary 1, where nq was set to the maximum of |t| and |c|. BLOSUM50 score matrix was employed along with d=5. Because of the limit of memory space ({approx}2GB), the program could be applied to sequences with lengths <70.

Table A1 summarizes the results, where the number after a PDB code denotes the length of the sequence, and CPU denotes the CPU time (seconds). Cor. and DP denote the upper bounds obtained by Corollary 1 and the DP algorithm, respectively. In each case shown in Table A1, sw(t,q) = {alpha} held.


View this table:
[in this window]
[in a new window]
 
Table A1 Results on DP algorithm

 
It is seen that CPU times are not small even for short sequences. It is also seen that the DP algorithm gives tighter bounds than Corollary 1. It should be noted that the DP algorithm outputs the exact bound if sw(t,q)={alpha} holds. Though we showed interesting cases in Table A1, differences between upper bounds were usually very small or 0. This fact suggests that the inequalities given in Methods are usually tight in the worst case even if we use the constraint assumed in Theorem 2. However, it may be possible to get much tighter upper bounds using other constraints.


    Acknowledgments
 
This work was supported by grants from the Ministry of Education, Culture, Sports, Science and Technology of Japan, the Japan Society for the Promotion of Science and the Japan Science and Technology Corporation. The computational resource was provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.

Received on July 13, 2004; revised on September 23, 2004; accepted on September 24, 2004

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHOD
 RESULTS
 DISCUSSION
 APPENDIX 1
 APPENDIX 2
 REFERENCES
 

    Altschul, S.F., Gish, W., Miller, W., Myers, E., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410[CrossRef][ISI][Medline].

    Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402[Abstract/Free Full Text].

    Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141[Abstract/Free Full Text].

    Brenner, S.E., Chothia, C., Hubbard, T.J.P. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci., 95, 6073–6078[Abstract/Free Full Text].

    Chandonia, J.M., Hon, G., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M., Brenner, S.E. (2004) The ASTRAL compendium in 2004. Nucleic Acids Res., 32, D189–D192[Abstract/Free Full Text].

    Davidson, A. (2001) A fast pruning algorithm for optimal sequence alignment. Proc. 2nd IEEE Int. Symp. Bioinformatics Bioeng., 49–56.

    Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Dayhoff, M.O. (Ed.). Atlas of Protein Sequence and Structure, , Washington, DC National Biomedical Research Foundation Vol. 5, Suppl. 3,, pp. 345–352.

    Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., Salzverg, S.L. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 2369–2376[Abstract/Free Full Text].

    Dubey, A., Hwang, S., Rangel, C. (2004) Clustering protein sequence and structure space with infinite Gaussian mixture models. Proc. Pac. Symp. Biocomp., 9, 399–410.

    Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763[Abstract/Free Full Text].

    Henikoff, S. and Henikoff, J. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci., 89, 10915–10919[Abstract/Free Full Text].

    Itoh, M., Akutsu, T., Kanehisa, M. (2004) Clustering of database sequences for fast homology search using upper bounds on alignment score. Genome Informatics, 15, 93–104.

    Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res., 30, 42–46[Abstract/Free Full Text].

    Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280[Abstract/Free Full Text].

    Karplus, K., Barrett, C., Hughey, R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846–856[Abstract/Free Full Text].

    Kriventseva, E.V., Fleischmann, W., Zdobnov, E.M., Apweiler, R. (2001) CluSTr: a database of clusters of SWISS-PROT + TrEMBL proteins. Nucleic Acids Res., 29, 33–36[Abstract/Free Full Text].

    Kriventseva, E.V., Servant, F., Apweiler, R. (2003) Improvements to CluSTr: the database of SWISS-PROT + TrEMBL protein clusters. Nucleic Acids Res., 31, 388–389[Abstract/Free Full Text].

    Li, W., Jaroszewski, L., Godzik, A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283[Abstract/Free Full Text].

    Li, W., Jaroszewski, L., Godzik, A. (2002) Sequence clustering strategies improver remote homology recognitions while reducing search time. Protein Eng., 15, 643–649[Abstract/Free Full Text].

    Lipman, D.J. and Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. Science, 227, 1435–1441[Abstract/Free Full Text].

    Ma, B., Tromp, J., Li, M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445[Abstract/Free Full Text].

    Matsuda, H., Ishida, T., Hashimoto, A. (1999) Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci., 210, 305–325[CrossRef].

    Mount, D.W. Bioinformatics: Sequence and Genome Analysis, (2001) , Cold Spring Harbor, NY Cold Spring Harbor Laboratory Press.

    Muthukrishnan, S. and Sahinalp, S.C. (2000) Approximate nearest neighbors and sequence comparison with block operations. Proc. 32nd Ann. ACM Symp. Theory Comput., , pp. 416–424.

    Myers, E.W. (1986) An O(ND) difference algorithm and its variations. Algorithmica, 1, 251–266.

    Myers, E.W. (1994) A sublinear algorithm for approximate keyword searching. Algorithmica, 12, 345–374[CrossRef].

    Ogasawara, J. and Morishita, S. (2003) A fast and sensitive algorithm for aligning ESTs to the human genome. J. Bioinfo. Comput. Biol., 1, 363–386[CrossRef].

    Rognes, T. and Seeberg, E. (2000) Six-fold speed-up of Smith–Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16, 699–706[Abstract/Free Full Text].

    Rognes, T. and Seeberg, E. (1998) SALSA: improved protein database searching by a new algorithm for assembly of sequence fragments into gapped alignments. Bioinformatics, 14, 839–845[Abstract/Free Full Text].

    Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197[CrossRef][ISI][Medline].

    Spiro, P.A. and Macura, N. (2004) A Local alignment metric for accelerating biosequence database search. J. Comp. Biol., 11, 61–82[CrossRef].

    Stojmirovic, A. (2003) Quasi-metric spaces with measure. arXiv e-print:math. GN/0311070.

    Stojmirovic, A. and Pestov, V. (2003) Index schemes for similarity search in datasets of short protein fragments. arXvi e-print:cs.DS/0309005.

    Suharnan, S., Itou, T., Matsuda, H., Mori, H. (1997) Clustering molecular sequences with their components. Genome Informatics, 8, 125–134.

    Tatusov, R.L., Koonin, E.V., Lipman, D.J. (1997) A genomic perspective on protein families. Science, 278, 631&#