Bioinformatics Advance Access originally published online on August 12, 2008
Bioinformatics 2008 24(19):2172-2176; doi:10.1093/bioinformatics/btn422
High-performance signal peptide prediction based on sequence alignment techniques
Center of Applied Molecular Engineering, University of Salzburg, Jakob-Haringerstraße 5, 5020 Salzburg, Austria
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The accuracy of current signal peptide predictors is outstanding. The most successful predictors are based on neural networks and hidden Markov models, reaching a sensitivity of 99% and an accuracy of 95%. Here, we demonstrate that the popular BLASTP alignment tool can be tuned for signal peptide prediction, reaching the same high level of prediction success. Alignment-based techniques provide additional benefits. In spite of their high success rates, signal peptide predictors yield false predictions. Simple sequences like polyvaline, for example, are predicted as signal peptides. The general architecture of learning systems makes it difficult to trace the cause of such problems. False predictions of this kind can be recognized or avoided altogether by using sequence comparison techniques. Based on these results, we have implemented a public web service called Signal-BLAST. Predictions returned by Signal-BLAST are transparent and easy to analyze.
Availability: Signal-BLAST is available online at http://sigpep.services.came.sbg.ac.at/signalblast.html
Contact: sippl{at}came.sbg.ac.at
| 1 INTRODUCTION |
|---|
Many proteins contain signal peptides for the translocation of proteins through membranes of prokaryotic and eukaryotic cells. The general structure of a signal peptide consists of a positively charged N-terminal region (n-region), followed by a hydrophobic core region (h-region), a C-terminal region (c-region) and a cleavage site. The typical size of signal peptides is in the range of 20–30 residues. Although signal peptides have a common functional role and similar physical properties, there is generally low sequence homology among the members of this peptide family. In fact, as remarked by Ladunga (1999), methods that rely on sequence alignment frequently fail to identify signal peptides. Extensive BLASTP database searches for a set of 2500 signal peptides yielded <33% correct predictions (Ladunga, 1999).
Such results support the view that signal peptide prediction requires sophisticated approaches, like neural networks, that appear to be more sensitive than conventional sequence alignment techniques. The most popular signal peptide predictors are quite complex and are usually built upon learning systems. The most prominent are the artificial neural network-based version of SignalP, SignalP-NN (Nielsen et al., 1997), and the hidden Markov model (HMM)-based version, SignalP-HMM (Nielsen and Krogh, 1998). The quality of these signal peptide predictors is remarkable, reaching a sensitivity of up to 99% and an accuracy of up to 95%. On the other hand, these tools sometimes return predictions that disagree with the general structure of signal peptides. For example, polyvaline does not have the characteristic pattern of signal peptides, but this sequence is nevertheless predicted as a signal peptide by both SignalP-NN and SignalP-HMM.
The reason for this behavior is difficult to analyze, because learning systems have no defined formula or rule that describes a signal peptide. Instead, typical signal peptides and typical non-signal carrying proteins are presented to the algorithm, which then finds a way to reproduce this classification within its allowed parameter ranges. The architecture of these systems makes it difficult to trace the flow of computations, so the cause of incorrect predictions remains obscure. In a biological context, predictors that offer transparency, so that each prediction can be traced to its cause, are highly desirable.
Here, we present a new tool called Signal-BLAST, which provides this transparency with the same level of performance as the most sophisticated learning systems. As the name suggests, Signal-BLAST uses the functionality of the BLAST package (Altschul et al., 1990, 1997) to find similarities between a query sequence and reference data. The reference data are divided into signal peptides and non-signal peptides. A query sequence is predicted to have a signal peptide if the closest match is found within the signal peptide set, and it is predicted to lack a signal peptide if the closest match is found in the non-signal peptide set.
To tune Signal-BLAST and to assess its performance, we use a set of reference data composed of experimentally verified proteins from Uniprot (Boeckmann et al., 2003) Release 11.0. We find that Signal-BLAST has an average sensitivity of up to 98% and an accuracy of 95%. This is the same quality as current artificial neural network or HMM-based predictors.
| 2 MATERIALS AND METHODS |
|---|
The basic idea in using BLASTP for signal peptide prediction is to match a query sequence against two curated databases, one for signal peptides and one for non-signal peptides. The query is then predicted to contain a signal peptide if the most significant hit is found in the signal peptide database rather than the non-signal peptide database. The signal and non-signal peptide databases used here are derived from Uniprot 11.0.
The standard parameters of BLASTP are inappropriate for short sequences like signal peptides. Therefore, to use BLASTP for signal peptide prediction, a suitable set of parameters has to be found. We employed the common receiver operating characteristic (ROC) analysis (Swets, 1996) to maximize the area under the respective ROC curve, thereby optimizing the prediction accuracy for signal peptides (Fawcett, 2005). This analysis revealed that it is necessary to disable filtering of low complexity regions in the query sequence (BLASTP option -F F), as many signal peptides fall into this category. Another boost in performance is achieved by lowering the threshold for extending hits. Due to the high degree of variance in signal peptides, BLASTP performs better if this value is lowered from the original value of 11 down to 9 (-f 9). Finally, the threshold for the e-value is removed to retrieve the full set of alignments. Variations of other parameters do not yield further improvements. For example, neither a lowered word length of 2 nor the use of the PAM matrix (Dayhoff, 1979) for scoring increases the performance further. Also, the position-specific iterated version of BLAST did not surpass the results of BLASTP.
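As a rough illustration, the parameter choices described above could be assembled into a command line for the legacy `blastall` program. This is a sketch under stated assumptions, not the authors' actual invocation: the function name, the file name and the database name are placeholders, and the large `-e` value is an assumption standing in for "no e-value threshold". Only `-F F` and `-f 9` are taken from the text.

```python
from typing import List


def build_blastp_command(query_fasta: str, database: str,
                         evalue_cutoff: float = 1e6) -> List[str]:
    """Assemble a legacy `blastall` command line for short-sequence searches."""
    return [
        "blastall",
        "-p", "blastp",            # protein-protein search
        "-i", query_fasta,         # query file (placeholder path)
        "-d", database,            # formatted target database (placeholder name)
        "-F", "F",                 # disable low-complexity filtering (from the text)
        "-f", "9",                 # hit-extension threshold lowered from 11 (from the text)
        "-e", str(evalue_cutoff),  # assumption: a very large cutoff keeps all alignments
    ]


if __name__ == "__main__":
    # Print the command instead of running it, since BLAST may not be installed.
    print(" ".join(build_blastp_command("query.fasta", "signal_eukaryote")))
```

The command is returned as a list so that it can be passed to `subprocess.run` without shell quoting issues. The search would be run once per target database, for example once against the signal peptide set and once against the non-signal set.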
The results returned by BLASTP using these parameters on the signal peptide datasets are promising, but two interrelated problems remain. First, the predictions are not sensitive enough: while there are almost no false positives, the sensitivity is as low as 0.3 on some protein families. Second, the two rankings obtained from the signal peptide and non-signal peptide databases need to be calibrated against each other. To address these problems, we employ the signal peptide bias $b$, $0 < b < 1$, in the following way. Each alignment has an associated e-value returned by BLASTP. For alignments between the query sequence and the sequences in the signal peptide database the e-value is multiplied by $b$, and for matches to the non-signal peptide database the e-value is multiplied by $1 - b$. Hence, the parameter $b$ controls the rate of true positive signal peptides over the rate of false positives. Suitable values for $b$ are obtained from the ROC analysis. The best prediction is defined as the highest scoring hit from either of the two databases, in contrast to predictors like PPT-DB (Wishart et al., 2008), which do not use a non-signal peptide database.
In a final step, Signal-BLAST verifies whether or not the best prediction qualifies as a signal peptide. For this, at least five residues of the query sequence need to be aligned with the target sequence between the target N-terminus and the target cleavage site. Otherwise the query sequence is reported as a non-signal peptide. Moreover, when the N-terminus of a query sequence starts at the cleavage site, the result is reported as a mature protein whose signal peptide has been cleaved off.
From the analysis of Signal-BLAST performance, we obtain three sets of parameters suitable for the most common application scenarios. The parameter set called SP1 is optimized for accuracy and sensitivity and predicts the highest number of true positives (Table 1), while SP3, which is optimized for accuracy and specificity, predicts the fewest false positives. SP2 offers the highest level of accuracy, balancing sensitivity and specificity.
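The bias-weighted ranking and the final verification step can be sketched as follows. This is an illustration, not the Signal-BLAST implementation. The `Hit` record, its field names and the function names are hypothetical, and the hits are assumed to be already parsed from BLASTP output. The bias is written as `bias` and assumed to lie strictly between 0 and 1, and "highest scoring" is interpreted as the lowest bias-adjusted e-value. The mature-protein case is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    """Parsed BLASTP hit (hypothetical record; field names are illustrative)."""
    target_id: str
    evalue: float
    is_signal_db: bool                 # True if the hit comes from the signal peptide database
    aligned_before_cleavage: int = 0   # query residues aligned between target N-terminus and cleavage site


def adjusted_evalue(hit: Hit, bias: float) -> float:
    """Scale the e-value by `bias` for signal hits and by 1 - bias otherwise."""
    if not 0.0 < bias < 1.0:
        raise ValueError("bias must lie strictly between 0 and 1")
    return hit.evalue * (bias if hit.is_signal_db else 1.0 - bias)


def predict(hits: List[Hit], bias: float, min_aligned: int = 5) -> Optional[str]:
    """Pick the hit with the lowest adjusted e-value, then verify it."""
    if not hits:
        return None  # no alignment at all: null result
    best = min(hits, key=lambda h: adjusted_evalue(h, bias))
    # A signal hit only counts if enough residues align before the cleavage site.
    if best.is_signal_db and best.aligned_before_cleavage >= min_aligned:
        return "signal"
    return "non-signal"


if __name__ == "__main__":
    hits = [Hit("signal_target", 40.0, True, 12), Hit("other_target", 30.0, False)]
    print(predict(hits, bias=0.5))  # adjusted e-values 20 vs 15
    print(predict(hits, bias=0.2))  # adjusted e-values 8 vs 24
```

In this sketch a smaller bias shrinks the e-values of signal peptide hits relative to the others, so lowering it trades specificity for sensitivity.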
In addition to these standard parameter sets, an advanced interface provides detailed control over the four parameters of Signal-BLAST. First, it is possible to set the signal peptide bias directly instead of picking one of the three default values, SP1 to SP3. By default, Signal-BLAST uses the subset of signal peptides found in Uniprot 11.0 that are experimentally verified. With a second option, this set can be expanded to also include predicted signal peptides found in the Uniprot 11.0 database. Activating this option improves the sensitivity of Signal-BLAST at the cost of a significantly higher number of false positives. As a third parameter, a significance threshold can be set, indicating the minimum bit score considered to represent a significant alignment. If no alignment above this threshold is found, a null result is returned. A final option, suited for test purposes (jackknife test), removes the query protein if it is found in the database.
A major advantage of Signal-BLAST is the ease of maintaining the predictor. Learning systems need an elaborate preparation of the training sets and considerable computational resources for the training phase. Maintaining Signal-BLAST is comparatively easy. Building the Uniprot 11.0 dataset for Signal-BLAST is done in four steps. First, the status of the signal peptide for each protein is determined. Three different results are possible: experimentally verified signal peptides (3634 sequences), predicted signal peptides (12 976 sequences) and non-signal carrying proteins (104 476 sequences), as summarized in Table 2. Signal-BLAST is designed to predict only those peptides that are annotated with the SIGNAL keyword in Uniprot. Sequences like signal anchors (also called type II membrane proteins) are reported as non-signal peptides and consequently they are part of the negative dataset. A special case is provided by the so-called uncleavable signal peptides, a set of 24 peptides annotated in Uniprot 11.0 as signal peptide, not cleavable, of which 10 are annotated as experimentally verified. Signal-BLAST reports hits to such sequences as signal peptides and marks them as not cleavable.
For regular prediction, only the first and third datasets are used. In addition, predicted signal peptides may optionally be used. The second step in preparing the data is to extract the cleavage site of each protein that has an annotated signal peptide and save this information. As there is a big difference in the composition of signal peptides in eukaryotic, Gram-negative and Gram-positive organisms, the data are split into different sets, one for each category of organism. This results in a total of nine different databases for Signal-BLAST, one for each category and group of organism (Table 2). The whole update process is fully automated, requiring only a few minutes of CPU time. Since by definition the cleavage site is situated between the signal peptide and the mature protein, signal peptide prediction methods generally use sequences that span the signal peptide and the N-terminal part of the mature protein. For the Signal-BLAST database we use the first 60 residues (including the signal peptide), similar to SignalP, which uses the first 70 residues (Nielsen et al., 1997).
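The partitioning into nine databases can be pictured with a short sketch. The `Protein` record, its string labels and the function name are hypothetical and chosen only for illustration; real Uniprot entries would first have to be parsed and classified. Only the 60-residue prefix length and the three-by-three grouping come from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

PREFIX_LENGTH = 60  # residues kept per protein, as stated in the text


class Protein(NamedTuple):
    """Hypothetical minimal record; real database entries carry far more."""
    accession: str
    sequence: str
    status: str    # "verified", "predicted" or "non-signal"
    organism: str  # "eukaryote", "gram-negative" or "gram-positive"


def build_databases(
    proteins: Iterable[Protein],
) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Group N-terminal fragments by (status, organism group)."""
    databases = defaultdict(list)
    for p in proteins:
        fragment = p.sequence[:PREFIX_LENGTH]  # shorter sequences are kept whole
        databases[(p.status, p.organism)].append((p.accession, fragment))
    return dict(databases)


if __name__ == "__main__":
    demo = [Protein("P1", "M" * 80, "verified", "eukaryote")]
    for key, members in build_databases(demo).items():
        print(key, len(members), len(members[0][1]))
```

With three status categories and three organism groups, the dictionary holds at most nine keys, one per database.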
An example of a Signal-BLAST report is shown in Figure 1. The report provides a list of the 10 top-scoring alignments together with the respective BLASTP parameters. In addition, the BLASTP alignment obtained for the query sequence and the top hit (i.e. the best target found) is shown along with the cleavage site found in the target sequence. We note that Signal-BLAST reports partial alignments obtained from the BLAST program. The query sequence of Figure 1 is GANAB_MOUSE, the alpha subunit of glucosidase II from mouse. The best hit obtained is LV2B_MOUSE (murine immunoglobulin lambda light chain). BLASTP returns a short alignment of 15 residue pairs, of which seven are identical. The identical residue pairs are in close proximity to the cleavage site. The corresponding e-value of +39 is statistically insignificant, which is quite typical for Signal-BLAST since signal peptides produce short alignments with a small number of identities. Nevertheless, the e-value reported by BLASTP is a very useful parameter for proper ranking. The example demonstrates that Signal-BLAST is able to recognize signal peptides in cases where the sequence identity is low and where at the same time the length of the signal peptide (32 residues) is rather unusual (the average signal peptide length in eukaryotes is on the order of 25 residues).
| 3 RESULTS |
|---|
We compare the performance of Signal-BLAST in predicting signal peptides with that of the current best predictor, SignalP v3 (Bendtsen et al., 2004). To measure the quality of the predictors, an unbiased test set has to be derived. This set must not contain any data that were used to train the respective signal peptide predictors, since a good predictor will generally reproduce its training database.
We derive a suitable test dataset from Uniprot 11.0 by extracting the N-terminal 60-residue fragments from all proteins and removing duplicate entries. These fragments serve as the target database against which the performance of the predictors is measured.
The last update of SignalP was reported in 2002. Therefore, removing all sequences annotated before 2002 ensures that the remaining sequences were not used in the training of SignalP. For Signal-BLAST we use a jackknife test, removing the query protein before the prediction is performed.
The remaining proteins are split into three sets of data. The positive test set contains all entries that have an experimentally determined signal peptide. The negative test set is composed of all proteins where the annotation within the first 60 amino acids clearly identifies the protein as a non-signal peptide carrying protein. The third set contains all data that do not fit into either group; this subset is unreliable and is not used in the benchmark.
Finally, the data are split into three groups of organisms, eukaryotic, Gram-negative and Gram-positive bacteria. This leaves 176 eukaryotic, 63 Gram-negative and 22 Gram-positive signal peptides, for a total of 261 signal peptides. The negative test set is substantially larger, containing 19 751 eukaryotic, 6297 Gram-negative and 3269 Gram-positive proteins, totalling 29 317 proteins.
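The construction of the benchmark sets described above can be outlined in code. This is a sketch under assumptions: the `Entry` record, its labels and the function name are hypothetical, duplicates are detected by comparing the 60-residue fragments, and the date filter is applied before duplicate removal, an ordering the text does not specify.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical annotated entry used only for this illustration."""
    accession: str
    sequence: str
    annotation_year: int
    label: str  # "signal", "non-signal" or "unclear"


def build_test_sets(entries: Iterable[Entry], cutoff_year: int = 2002,
                    prefix_length: int = 60) -> Tuple[List[Entry], List[Entry]]:
    """Return (positive, negative) test sets of N-terminal fragments."""
    positives: List[Entry] = []
    negatives: List[Entry] = []
    seen = set()
    for e in entries:
        fragment = e.sequence[:prefix_length]
        # Skip entries annotated before the cutoff and duplicate fragments.
        if e.annotation_year < cutoff_year or fragment in seen:
            continue
        seen.add(fragment)
        if e.label == "signal":
            positives.append(e._replace(sequence=fragment))
        elif e.label == "non-signal":
            negatives.append(e._replace(sequence=fragment))
        # Entries with any other label are unreliable and are dropped.
    return positives, negatives


if __name__ == "__main__":
    demo = [Entry("E1", "M" * 70, 2003, "signal"),
            Entry("E2", "A" * 65, 2004, "non-signal")]
    pos, neg = build_test_sets(demo)
    print(len(pos), len(neg))
```

A further split by organism group would then be applied to each of the two returned lists.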
All tests are performed with the standard parameters of SignalP v3. For Signal-BLAST, we report the results obtained with the three optimized parameter sets, SP1 to SP3. Table 3 shows the results of the eukaryotic data, predicted by the different predefined sets of parameters of Signal-BLAST, along with the two versions of SignalP v3.
As expected from previous reports, SignalP reaches a sensitivity of 0.99. The specificity is 0.9, yielding a total accuracy of 0.95. In contrast, Signal-BLAST SP1 shows a sensitivity of 0.97, a specificity of 0.9 and an accuracy of 0.94. The other sets of parameters perform very similarly, with an accuracy of 0.94 and 0.93, respectively.
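For readers who want to recompute such figures from raw counts, the standard textbook definitions are shown below. This section does not state which definitions the authors used, so the formulas are an assumption. Both the plain accuracy and the mean of sensitivity and specificity are given, because the two differ strongly when the negative set is much larger than the positive set. The counts in the usage example are invented purely for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true signal peptides that are predicted as such."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-signal proteins that are predicted as such."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of sensitivity and specificity; insensitive to class imbalance."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


if __name__ == "__main__":
    # Invented counts, not taken from the paper.
    tp, fn, tn, fp = 9, 1, 80, 20
    print(sensitivity(tp, fn), specificity(tn, fp))
    print(accuracy(tp, tn, fp, fn), balanced_accuracy(tp, tn, fp, fn))
```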
For Gram-negative bacteria (Table 4), the results of SignalP v3-HMM are similar, while SignalP v3-NN exhibits a somewhat lower level of sensitivity. However, the accuracy is still very good at 0.94 for the HMM and 0.92 for the NN version. Signal-BLAST does even better by predicting every single signal peptide. The specificity is also very good with a value of at least 0.94, leading to a total accuracy between 0.96 and 0.99.
Table 5 provides the results for Gram-positive bacteria. Again SignalP v3-HMM shows good performance in its predictions, whereas SignalP v3-NN loses sensitivity. The accuracy is found to be 0.94 for the HMM and 0.91 for SignalP v3-NN. Signal-BLAST has exceptionally good results for Gram-positive bacteria. Independent of the signal bias, which has no influence on the test data, Signal-BLAST offers better sensitivity, specificity and accuracy than SignalP.
Table 6 summarizes the sensitivity, specificity and accuracy for the three groups of organisms. Signal-BLAST SP1, which most closely resembles the performance of SignalP, performs virtually identically to the two learning systems, within a range of 2%.
It is clear that the performance of Signal-BLAST solely depends on sequence similarities that can be detected by the BLASTP program. In a similar way, although less transparently, learning systems depend on the similarities among the sequences in the training set. This information is generalized to statistical models in various ways. It is therefore advisable to characterize the success of the various methods in terms of the sequence similarity found in the sequence databases used for training and prediction. The respective analysis is particularly straightforward for Signal-BLAST in terms of the jackknife test. As already noted, the test consists of 261 individual query sequences. These query sequences are used to scan the complete set of annotated signal (3634 sequences) and non-signal peptides (104 476 sequences) summarized in Table 2, where the length of each query and target sequence is confined to 60 residues.
Figure 2 shows the distribution of the maximum percentage of sequence identity found among the 261 query sequences and the respective 108 109 target sequences. The fraction of true positives on this set obtained as a function of sequence identity is shown in Figure 3. The maximum pairwise similarity between query and target is generally low. In particular, for >50% of the query sequences the closest match among the target proteins has a sequence identity of <30%. The performance of regular BLASTP is generally very good for long sequences, but less reliable for short sequences in the size range used for signal peptide prediction. Nevertheless, when properly tuned, BLASTP performs remarkably well on short signal peptide-containing sequences. It is particularly noteworthy that the ratio of true positives is very high over the whole range of similarities (Fig. 3).
In spite of the astounding performance of current signal peptide predictors, it is possible to trigger false positive predictions for artificial sequences. This is demonstrated by the construction of an artificial sequence where the 19-residue long signal peptide of the sequence CLM1_HUMAN [CMRF35-like molecule 1 (Precursor)] is replaced by a polyvaline of the same length. Polyvaline lacks several essential properties of signal peptides. In particular, the polyvaline sequence contains neither residues with positive charges nor helix-breaking residues, which generally characterize the boundary between the h- and c-region. Although there is no experimental proof that polyvaline cannot act as a signal peptide, it is nevertheless highly unlikely, and we therefore expect that signal peptide predictors will reject polyvaline as a possible signal peptide. Moreover, a BLAST search of polyvaline against the whole sequence database does not retrieve any signal peptides.
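Building such an artificial test sequence is straightforward. The helper below is a hypothetical illustration: the function name is invented, and the example input is a made-up placeholder string, not the actual CLM1_HUMAN sequence. It replaces the first residues of a protein with a homopolymer of the same length.

```python
def replace_signal_peptide(sequence: str, signal_length: int,
                           residue: str = "V") -> str:
    """Replace the N-terminal `signal_length` residues with a homopolymer."""
    if len(residue) != 1:
        raise ValueError("residue must be a single one-letter code")
    if not 0 <= signal_length <= len(sequence):
        raise ValueError("signal_length must fit within the sequence")
    # The total length is unchanged, so the cleavage site position is preserved.
    return residue * signal_length + sequence[signal_length:]


if __name__ == "__main__":
    placeholder = "MKTLLAGS" + "QDE" * 10  # made-up sequence, for illustration only
    print(replace_signal_peptide(placeholder, 8))
```

For the experiment described above, `signal_length` would be 19 and the input would be the precursor sequence taken from the database.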
When the modified protein is presented to SignalP v3-NN, all five scores surpass the cutoff value, indicating that a signal peptide is present, and SignalP v3-HMM predicts n-, h- and c-regions, characteristic of a clear cleavage site (Fig. 4). It is rather surprising that SignalP predicts polyvaline as a signal peptide. One might speculate that a protein similar in sequence to the mature part of CLM1_HUMAN is contained in the training set and is recognized by the neural network. This behavior is frequently encountered in the case of overtrained networks (Astion et al., 1993). The result is perhaps even more surprising for the HMM (SignalP-HMM), since the underlying statistical model is thought to generalize the characteristic properties of signal peptides. However, by construction the artificial polyvaline sequence definitely lacks the n-, h- and c-regions predicted by SignalP-HMM. On the other hand, as expected from the negative result obtained in the polyvaline database search, Signal-BLAST does not yield significant alignments of polyvaline and signal peptides.
| 4 CONCLUSIONS |
|---|
By implementing Signal-BLAST, we demonstrate that it is possible to provide high-quality signal peptide prediction by sequence alignment. Comparing the quality of the results between SignalP and Signal-BLAST, it can be observed that on average the predictors perform similarly with respect to sensitivity, specificity and accuracy. For bacteria, Signal-BLAST proves to be even more reliable than the learning systems. At the same time, the sequence alignments returned provide additional information regarding the biological context and significance of the result. For predictors, it is a particular advantage if the target databases contain all available information on known sequences. Given the simplicity of updating the Signal-BLAST databases, this goal is easy to achieve.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Limsoon Wong
Received on June 6, 2008; revised on August 5, 2008; accepted on August 7, 2008
| REFERENCES |
|---|
Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Astion ML, et al. Overtraining in neural networks that interpret clinical data. Clin. Chem (1993) 39:1998–2004.
Bendtsen JD, et al. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol (2004) 340:783–795.
Boeckmann B, et al. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res (2003) 31:365–370.
Dayhoff MO. Atlas of Protein Sequence and Structure. (1979) 5(Suppl. 3.). Washington, D.C: National Biomedical Research Foundation.
Fawcett T. An introduction to ROC analysis. Pattern Recogn. Lett (2005) 27:861–874.
Ladunga I. PHYSEAN: PHYsical SEquence ANalysis for the identification of protein domains on the basis of physical and chemical properties of amino acids. Bioinformatics (1999) 15:1028–1038.
Nielsen H, et al. A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int. J. Neural Syst (1997) 8:581–599.
Nielsen H, Krogh A. Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Cong. Intell. Syst. Mol. Biol (1998) 6:122–130.
Swets JA. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. (1996) Lawrence Erlbaum Associates, Inc.
Wishart DS, et al. PPT-DB: the protein property prediction and testing database. Nucleic Acids Res (2008) 36(Database issue):D222–D229.