Bioinformatics Advance Access originally published online on April 10, 2008
Bioinformatics 2008 24(11):1339-1343; doi:10.1093/bioinformatics/btn130
Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches
1The Ohio State Biophysics Program, 2Departments of Biochemistry and Chemistry, Ohio State University, 484 W 12th Av. and 3Department of Physics, Ohio State University, 191 W Woodruff Av., Columbus OH 43210-1117, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The deluge of biological information from different genomic initiatives and the rapid advancement in biotechnologies have made bioinformatics tools an integral part of modern biology. Among the widely used sequence alignment tools, BLAST and PSI-BLAST are arguably the most popular. PSI-BLAST, which uses an iterative profile position specific score matrix (PSSM)-based search strategy, is more sensitive than BLAST in detecting weak homologies, thus making it suitable for remote homolog detection. Many refinements have been made to improve PSI-BLAST, and its computational efficiency and high specificity have been much touted. Nevertheless, corruption of its profile via the incorporation of false positive sequences remains a major challenge.
Results: We have developed a simple and elegant approach to resolve the problem of model corruption in PSI-BLAST searches. We hypothesized that combining results from the first (least-corrupted) profile with results from later (most sensitive) iterations of PSI-BLAST provides a better discriminator for true and false hits. Accordingly, we have derived a formula that utilizes the E-values from these two PSI-BLAST iterations to obtain a figure of merit for rank-ordering the hits. Our verification results based on a gold-standard test set indicate that this figure of merit does indeed delineate true positives from false positives better than PSI-BLAST E-values. Perhaps what is most notable about this strategy is that it is simple and straightforward to implement.
Contact: bundschuh{at}mps.ohio-state.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Sequence alignment is one of the most widely used techniques in computational biology, particularly for functional annotation of non-characterized sequences. The power of this approach is directly dependent on the ability of sequence alignment tools to detect weak homologies.
BLAST (Altschul et al., 1990) is arguably the most frequently used sequence alignment tool. PSI-BLAST (Altschul et al., 1997)—an extended version of BLAST—is similarly popular. It is more effective and sensitive in detecting distant homologs due to its iterative profile-based search strategy (Gotoh, 1996; Thompson et al., 1999). In its first iteration, PSI-BLAST identifies related sequences that meet a specific inclusion threshold and utilizes these sequences to generate a profile or position specific score matrix (PSSM) to be used for the next iterative search. The PSSM is then used to align and score sequences in the database to search for new statistically significant hits. This process is performed iteratively until the model converges or no new statistically significant sequences are found. Previous studies have shown that iterative approaches, such as that used in PSI-BLAST, are more effective in finding additional homologs. However, a major potential problem of this approach is model corruption. With each subsequent iteration, there exists an increased probability of including a non-homologous sequence that meets the threshold for PSSM inclusion, and whose presence can, in turn, lead to the incorporation of other false positives. Such model corruption is, particularly, treacherous if the non-homologous sequence belongs to a large family. A naive approach to overcome this problem is to set a stringent inclusion threshold, but the trade-off is a loss of sensitivity, especially for the more remote homologs, thus diminishing the strength and utility of PSI-BLAST. Hence, the general interest in developing novel strategies to distinguish and eliminate early false positive sequences from true hits in homology detection searches.
Many different approaches have been studied and proposed to improve the discrimination of true and false positives. These methods typically incorporate extrinsic information into the search, such as, (predicted) structural (Bowie et al., 1991, 1996; Gough et al., 2001; Kann et al., 2005; Lee et al., 2007) or literature data (Becker et al., 2003; Kaplan et al., 2003; Shatkay et al., 2007), or use completely different techniques such as SAM (Eddy, 1996; Gribskov et al., 1987; Grundy, 1998; Hughey and Krogh, 1996; Karplus et al., 1998), or a combination of such other techniques (Alam et al., 2004). Despite their better performance in remote homology detection, these approaches are computationally more expensive than PSI-BLAST itself, thus hindering their wide acceptance by the user community.
Herein, we describe a new methodology to improve the statistical assessment of hits returned from a PSI-BLAST search that yields better delineation of true and false positives. The appeal of our approach is its simplicity: it can be integrated easily without incurring significant computational costs to the PSI-BLAST algorithm, while at the same time improving the accuracy of its searches.
The remainder of this article is organized as follows. In Section 2 we first outline the key idea of our approach. Then, we review the basics of BLAST significance assessment, and use these basics to heuristically derive a method to combine outputs from different rounds of PSI-BLAST into one aggregate result. In Section 3, we verify that this method indeed improves search performance by applying it to the gold standard database used to test PSI-BLAST by its own authors (Schaffer et al., 2001). We summarize our findings in Section 4.
| 2 METHODS |
|---|
|
|
|---|
2.1 Key idea
The goal of our approach is to improve on the high sensitivity of PSI-BLAST in terms of remote homolog detection without sacrificing computational efficiency—one of the main reasons for PSI-BLAST's popularity. Our approach is based on a well-known property of PSI-BLAST searches: in its early iterations PSI-BLAST is not very sensitive, but very specific. Upon performing more and more iterations, sensitivity increases while specificity suffers from what is called model corruption. This behavior can be understood as follows: As an iterative tool, PSI-BLAST attempts to build better and better models of the homologs to the original query sequence. The first round of PSI-BLAST, which is essentially BLAST, finds only relatively close homologs. These close homologs are then used to generate a profile for the next iteration that will be able to find more remote homologs. Iterating this process in theory yields progressively better models and thus finds more and more weakly related homologs to the original query. However, in practice, non-homologous false positives are frequently incorporated into the model at some point during this iterative process. Once this happens, the model is corrupted and completely loses its specificity for the true homologs of the original query, resulting in the false identification of large numbers of non-homologous sequences as putative true homologs. For this reason, it is recommended to run at most five to six iterations of PSI-BLAST instead of running PSI-BLAST until no more new sequences are identified (Schaffer et al., 2001). However, even after five or six iterations, it is often common to find that some non-homologous sequences are already incorporated. Thus, there is a trade-off between an increased sensitivity and an increased possibility for model corruption as more iterations are performed.
The key idea of our approach is that by combining the results from different PSI-BLAST iterations, the benefits of low corruption from the early rounds, and the benefits of high sensitivity from the later rounds can both be realized. More precisely, we expect that a true homolog identified in a later iteration should have already revealed its identity as a homolog in the second round (the one using the first profile generated by PSI-BLAST and least likely corrupted). This distant homolog may not have had the level of homology that would have been statistically significant in the second round, but it should at least exhibit some homology. One the other hand, a non-homolog reported in a later round due to model corruption should show a much lower degree of homology in the second round than a true homolog. We thus hypothesize that benchmarking hits from later iterations against hits from iteration two where the model is least corrupted should improve the accuracy of a PSI-BLAST search.
The beauty of our strategy is that it requires almost no additional computational effort—if PSI-BLAST is iterated to, say, the fifth round, the results for the second round will have been calculated anyway. All that is required is to store these results and combine them with the results of the last round. The obvious question that has to be resolved, though, is precisely how to combine the results from different rounds of PSI-BLAST. The remainder of this section will be devoted to providing an explicit formula for combining the E-values reported by PSI-BLAST in the two rounds into a combined figure of merit. The derivation of this formula will be somewhat heuristic in nature. However, an evaluation of this method on a gold standard database presented in the next section will show that our method is indeed very successful in combining the benefits from the different rounds of PSI-BLAST in spite of the somewhat heuristic nature of its derivation.
2.2 Review of BLAST statistics
In order to derive a way to combine the E-values reported by PSI-BLAST for the different rounds into a single figure of merit, it is useful to review how these E-values are actually calculated. If a query sequence with L-amino acids is aligned against a random subject sequence of M-amino acids using the Smith–Waterman (Smith and Waterman, 1981) local alignment algorithm,1 the local alignment score
can be considered a random variable. In the absence of gaps, this random variable has been proven (Karlin and Altschul, 1990, 1993; Karlin and Dembo, 1992) to follow an extreme value distribution
|
| (1) |
and K are two parameters that depend in a known way on the scoring system, the query sequence and the random ensemble used to choose the random subject sequences.
Even in the presence of gaps, heuristic arguments and numerical studies (Altschul and Gish, 1996; Collins et al., 1988; Mott, 1992; Schaffer et al., 2001; Smith et al., 1985; Vingron and Waterman, 1994; Waterman and Vingron, 1994) confirm that the local alignment score is still distributed according to the extreme value distribution Equation (1), albeit with modified values of the two parameters
and K. Thus, the P-value of an alignment score, or the probability of obtaining an alignment score of magnitude S or better in one alignment against a random subject sequence, is given by
|
| (2) |
BLAST and PSI-BLAST report E-values, or the expected number of random hits of score S or higher, instead of P-values. These two quantities are different since the P-value refers to the probability of having at least one hit at the prescribed significance level, while the E-value quantifies the expected number of these hits. The general relationship between these two quantities for random variables with exponential tails is the Poisson formula
|
| (3) |
By comparing with Equation (2) we find
|
| (4) |
It is important to note that this is the E-value for one single sequence comparison, which we indicate by the subscript seq. The convenient property of E-values is that the Bonferroni correction for repeating the alignment for every sequence in a database of N amino acids, and thus roughly N/M sequences, just manifests itself in a multiplication of the expected number of hits for a single sequence by the number of sequences, i.e.
|
| (5) |
It is this latter E-value that BLAST and PSI-BLAST report as the statistical significance of their hits.
2.3 Combining two E-values
In order to combine results from different rounds, we make the assumption that the different rounds of PSI-BLAST are statistically independent. At first glance, making this assumption seems heuristic since the models used in all PSI-BLAST iterations describe the same query sequence and thus are related. It is worth pointing out, however, that for precisely in the scenario we are most worried about, namely model corruption, the assumption of statistical independence between early and late rounds of PSI-BLAST is actually a valid one. As mentioned above, the main justification for this assumption is that it leads to good retrieval performance as shown in the next section using real biological data.
As a means to eliminate false positives resulting from model corruption, we have to accept a hit only if it is significant in the final round of PSI-BLAST as well as in the second round of PSI-BLAST. Then, a false hit occurs only if the same subject sequence is a false hit in the second round and in the final round. Under the assumption of statistical independence, this means that the total probability for a false hit with score S2 or higher in the second round and Sf or higher in the final round is
|
| (6) |
|
| (7) |
|
| (8) |
2.4 Verification protocol
One major consideration in performance evaluation is the accuracy of the assignment of true and false positives to the hits returned from a search. An ideal test set would be one with known (expert-curated) true and false positive relationships of the query sequences to those in the test database, making the assignment unambiguous and straightforward. In light of this consideration and given that our proposed method is an elaboration on PSI-BLAST, it therefore makes sense to use the same dataset used by the PSI-BLAST development team to evaluate their search refinement strategies. Consequently, we have verified our proposed methodology using the Aravind dataset (Aravind and Koonin, 1999) employed by Schaeffer et al. (2001) in their study of different methods to improve PSI-BLAST performance (Schaffer et al., 2001). This Aravind dataset is comprised of 103 queries of single protein domain sequences of the yeast Saccharomyces cerevisiae and a sequence database containing 6406 yeast proteins. We rationalized that using the same dataset should give the most direct and unbiased performance comparison between the two approaches. More importantly, the Aravind dataset contains the true positives to each query sequence and have been expertly curated by Schaffer's coworkers at the NCBI—by conducting PSI-BLAST searches against the yeast database, and analyzing the resultant alignments on a case-by-case basis via expert examination. Sequences in the yeast database that were not annotated as true positives were considered false by default.
For our study, a five-round PSI-BLAST search was conducted for each of the 103 queries against the complete non-redundant (NR) protein database with its 1 986 685 sequences (frozen as of 8/23/04). The choice of five rounds is based on previous empirical analysis by the PSI-BLAST developers that suggested most hits that will be found with an E-value
0.005 (the default threshold that PSI-BLAST uses to include sequences for PSSM construction) are usually found by round five or six (Altschul et al., 1997; Schaffer et al., 2001). The PSSM constructed for iteration two and iteration five were saved as checkpoints. The two checkpoint files per query were then used to search the annotated yeast database to produce a list of yeast hits (up to E-value of 1000) for iteration two and iteration five, respectively. Each hit in iteration two and five was assigned a yes (for true positive) or no (for false positive) according to the true positives for the Aravind test set described previously. To generate a list of common hits to be used for our proposed approach, hits that were found in one iteration but missing in the other were arbitrarily assigned an E-value of 10 000. This list of hits and their corresponding E-values at iteration two and iteration five were combined to obtain the figure of merit (Etot) using the formula Equation (8) in Section 2.3. An awk script that we used to perform the above procedure is available online at http://bioserv.mps.ohio-state.edu/SimpleIsBeautiful.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Evaluation and comparison to PSI-BLAST
To compare the performance of our proposed approach with PSI-BLAST's iterations two and five, all the hits generated from the 103 queries were pooled together and rank ordered by their E-values, or in our case, figures of merit (Etot). Coverage versus error analysis was conducted to examine the number of true positives identified by varying the error levels up to a threshold of 100. The result of this analysis is shown in Figure 1.
|
As shown clearly by the respective position occupied by each curve, our proposed methodology outperforms both iterations two and five of PSI-BLAST. The behavior of the two curves for PSI-BLAST's iterations two and five shows that there is indeed a tradeoff between sensitivity and specificity between different levels of PSI-BLAST searches: higher specificity for iteration two (later occurrence of the first false positive), but higher sensitivity for iteration five (more true positives found). In contrast, the curve corresponding to our method lies farthest right in the plot, encompassing more true positives with minimal increase in false positives.
3.2 Comparison to SAM-T2K
Although widely recognized as the most popular sequence alignment tool, PSI-BLAST is found to be less effective in remote homolog detection than the more computationally costly HMM-based methods (Park et al., 1998). Given the improved performance of our approach over the current PSI-BLAST method, it would be interesting to see how our proposed improvement to the PSI-BLAST algorithm performs when compared to state-of-the-art approaches. Similar to PSI-BLAST in its automated iterative search strategy, the HMM-based SAM-T2K (Hughey and Krogh, 1996; Karplus et al., 1998) has been shown to be superior to PSI-BLAST in distant homology detection in general (Park et al., 1998), thus we have chosen SAM-T2K for additional performance comparison.
We thus applied SAM-T2K to the Aravind test set following the standard procedure as prescribed in the SAM-T2K suite. In brief, for each query sequence in the Aravind test set, we used the script target2k to perform five iterations of the following cycle: (1) search the NR database for homologs, (2) perform a multiple alignment of the homologs to construct an HMM-model and (3) score the sequences in the NR database using this HMM-model. The multiple alignment generated after the final round was then used to create an HMM-model using the w0.5 script. This HMM-model was used to score the 6406 yeast sequences in the Aravind test set using the script hmmscore with the alignment parameter set to local alignment. All other parameters used to conduct the SAM-T2K search were left at their default values.
As shown in Figure 2, whereas our proposed method performs similarly to SAM-T2K, PSI-BLAST falls short on both specificity and sensitivity to either method—finding fewer true homologs for the same number of false positives. The results here reaffirm the general notion that the HMM-based SAM-T2K is a better remote homolog detection tool than PSI-BLAST. Nevertheless, the more sophisticated and complex SAM-T2K is computationally more expensive, making it less popular than PSI-BLAST. The implication of the performance of our combined E-values approach is thus more significant when put into the context of computational efficiency. For minimal additional computation cost (since our method uses outputs already produced by PSI-BLAST), our approach is able to elevate PSI-BLAST's performance to match that of SAM-T2K.
|
3.3 Receiver operating characteristic analysis
In addition to coverage versus error analysis, another common technique to quantify the performance of different approaches is a receiver operating characteristic (ROC) analysis, which measures the rate of true positives against the rate of false positives. The quantity ROC is derived by calculating the area under the curve and adopts a value between 0 (worst) to 1 (best). Although the recommended figure of merit for a comprehensive sequence database search is ROC50 (Gribskov and Robinson, 1996), we followed the lead of the PSI-BLAST developers (Schaffer et al., 2001) in their choice of ROC100 to compare the performance of PSI-BLAST iterations two and five, our combined E-values method using these two PSI-BLAST iterations, and SAM-T2K iteration five. We thus calculated the characteristic ROC100 as
|
| (9) |
|
| 4 DISCUSSION |
|---|
|
|
|---|
We have established a novel strategy to improve the accuracy of PSI-BLAST searches by validating the resultant hits from the final iterations against those obtained in earlier rounds. The verification results are significant given the improved performance shown by our method over PSI-BLAST at minimal computational cost. They demonstrate that our strategy enhances the discriminative capability of PSI-BLAST by maintaining the specificity conferred by the PSSM of iteration two, while at the same time, capitalizing on the advantage gained from iteratively improving the model to identify more true positives in iteration five. One may argue that further enhancement could potentially be achieved by including the E-values from the interim rounds of a PSI-BLAST search, but such an approach would further raise concerns about statistical independence. While for corrupted models at least, the assumption of statistical independence between iterations two and five is reasonable, it becomes invalid if E-values from immediately consecutive iterations are combined. Thus, incorporating E-values from intermediate rounds would eliminate the beauty of the mathematical framework used in this study, and therefore was not pursued for the purpose of this manuscript. Nevertheless, it may be worthwhile to explore this strategy in the future using machine learning techniques to replace the parameter free combination of E-values put forward in this article.
| 5 CONCLUSION |
|---|
|
|
|---|
The problem of model corruption is a major concern to both the developers and users of PSI-BLAST. The reliability of PSI-BLAST's assessment on whether two sequences are related depends heavily on its ability to discriminate true hits from false positives produced from corrupted queries. Many different strategies have been and are being explored to overcome this problem. Our proposed strategy is noteworthy in its simplicity—it utilizes information already output by PSI-BLAST and thus requires minimal changes to the existing algorithm.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
Funding: This work was partially supported by the National Science foundation through grant DBI-0317335 to R.B.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Thomas Lengauer
1 BLAST approximates the Smith–Waterman algorithm with some heuristics but the statistics of large scores is not affected by this heuristics. ![]()
Received on January 5, 2008; revised on February 28, 2008; accepted on April 7, 2008
| REFERENCES |
|---|
|
|
|---|
Alam I, et al. Comparative homology agreement search: an effective combination of homology-search methods. Proc. Natl Acad. Sci USA (2004) 101:13814–13819.
Altschul SF, Gish W. Local alignment statistics. Methods Enzymol (1996) 266:460–480.[Web of Science][Medline]
Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.[CrossRef][Web of Science][Medline]
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Aravind L, Koonin EV. Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol (1999) 287:1023–1040.[CrossRef][Web of Science][Medline]
Becker KG, et al. PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics (2003) 4:61.[CrossRef][Medline]
Bowie JU, et al. A method to identify protein sequences that fold into a known three-dimensional structure. Science (1991) 253:164–253.
Bowie JU, et al. Three-dimensional profiles for measuring compatibility of amino acid sequence with three-dimensional structure. Methods Enzymol (1996) 266:598–616.[Web of Science][Medline]
Collins JF, et al. The significance of protein sequence similarities. Comput. Appl. Biosci (1988) 4:67–71.
Eddy SR. Hidden Markov models. Curr. Opin. Struct. Biol (1996) 6:361–365.[CrossRef][Web of Science][Medline]
Gotoh O. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol (1996) 264:823–838.[CrossRef][Web of Science][Medline]
Gough J, et al. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol (2001) 313:903–919.[CrossRef][Web of Science][Medline]
Gribskov M, et al. Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA (1987) 84:4355–4358.
Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem (1996) 20:25–33.[CrossRef][Web of Science][Medline]
Grundy WN. Homology detection via family pairwise search. J. Comput. Biol (1998) 5:479–491.[Web of Science][Medline]
Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci (1996) 12:95–107.
Kann MG, et al. A structure-based method for protein sequence alignment. Bioinformatics (2005) 21:1451–1456.
Kaplan N, et al. PANDORA: keyword-based analysis of protein sets by integration of annotation sources. Nucleic Acids Res (2003) 31:5617–5626.
Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA (1990) 87:2264–2268.
Karlin S, Altschul SF. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl Acad. Sci. USA (1993) 90:5873–5877.
Karlin S, Dembo A. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Prob (1992) 24:113–140.[CrossRef]
Karplus K, et al. Hidden Markov models for detecting remote protein homologies. Bioinformatics (1998) 14:846–856.
Lee MM, et al. Distant homology detection using a LEngth and STructure-based sequence Alignment Tool (LESTAT). Proteins (2007) 71:1409–1419.[CrossRef][Web of Science]
Mott R. Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol (1992) 54:59–75.[Web of Science]
Park J, et al. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol (1998) 284:1201–1210.[CrossRef][Web of Science][Medline]
Schaffer AA, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res (2001) 29:2994–3005.
Shatkay H, et al. SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics (2007) 23:1410–1417.
Smith TF, Waterman MS. Comparison of biosequences. Adv. Appl. Math (1981) 2:482–489.[CrossRef]
Smith TF, et al. The statistical distribution of nucleic acid similarities. Nucleic Acids Res (1985) 13:645–656.
Thompson JD, et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res (1999) 27:2682–2690.
Vingron M, Waterman MS. Sequence alignment and penalty choice. Review of concepts, case studies and implications. J. Mol. Biol (1994) 235:1–12.[CrossRef][Web of Science][Medline]
Waterman MS, Vingron M. Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl Acad. Sci. USA (1994) 91:4625–4628.
This article has been cited by other articles:
![]() |
M. M. Lee, M. K. Chan, and R. Bundschuh SIB-BLAST: a web server for improved delineation of true and false positives in PSI-BLAST searches Nucleic Acids Res., July 1, 2009; 37(suppl_2): W53 - W56. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. Jung and D. Kim SIMPRO: simple protein homology detection method by using indirect signals Bioinformatics, March 15, 2009; 25(6): 729 - 735. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



