Bioinformatics Advance Access originally published online on February 5, 2008
Bioinformatics 2008 24(6):815-825; doi:10.1093/bioinformatics/btn044
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Prediction of neuropeptide cleavage sites in insects

Bruce R. Southey 1,2, Jonathan V. Sweedler 1 and Sandra L. Rodriguez-Zas 2,*

1Department of Chemistry and 2Department of Animal Sciences, University of Illinois, Urbana, IL, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: The production of neuropeptides from their precursor proteins is the result of a complex series of enzymatic processing steps. Often, the annotation of new neuropeptide genes from sequence information outstrips biochemical assays and so bioinformatics tools can provide rapid information on the most likely peptides produced by a gene. Predicting the final bioactive neuropeptides from precursor proteins requires accurate algorithms to determine which locations in the protein are cleaved.

Results: Predictive models were trained on Apis mellifera and Drosophila melanogaster precursors using binary logistic regression, multi-layer perceptron and k-nearest neighbor models. The final predictive models included specific amino acids at locations relative to the cleavage sites. Correct classification rates ranged from 78 to 100% indicating that the models adequately predicted cleaved and non-cleaved positions across a wide range of neuropeptide families and insect species. The model trained on D.melanogaster data had better generalization properties than the model trained on A. mellifera for the data sets considered. The reliable and consistent performance of the models in the test data sets suggests that the bioinformatics strategies proposed here can accurately predict neuropeptides in insects with sequence information based on neuropeptides with biochemical and sequence information in well-studied species.

Contact: rodrgzzs{at}uiuc.edu

Supplementary information: Sequences and cleavage information are available at Bioinformatics online.


    1 INTRODUCTION
Neuropeptides have diverse biological functions that affect almost every brain system and neuronal network, and influence development and behavior. Released from a neuron upon electrical or chemical stimulation, they modulate the activity of the presynaptic or post-synaptic cell (Kandel et al., 2000). Neuropeptides are difficult to predict from precursor sequence information alone because they are generated from long protein precursors through a complex enzymatic processing system involving cleavage of the precursor and other post-translational modifications. Some neuropeptides are synthesized by relatively few neurons, or only at a particular time point in development, or are present at low physiological concentrations. Experimental confirmation of neuropeptides is often not available, which also contributes to the challenge of determining the final bioactive peptides.

The sequencing of new genomes has provided an unprecedented opportunity to overcome the challenges involved in identifying precursors and detecting neuropeptides. Mass spectrometry has been used successfully in combination with genome sequences to characterize the insect peptidome for neuropeptides in Drosophila melanogaster (Baggerman et al., 2002, 2005; Predel et al., 2004b) and Apis mellifera (Hummon et al., 2006a). For example, using the A.mellifera genomic information, Hummon et al. (2006a) combined three different approaches (mass spectrometric analyses, homology and codon-scanning searches) to identify 36 precursors and subsequently predict more than 200 putative neuropeptide products, with the structure of 100 of these peptides being biochemically confirmed.

Several approaches have been reported to predict neuropeptides cleaved from a precursor protein sequence (Amare et al., 2006; Duckert et al., 2004; Hummon et al., 2003; Southey et al., 2006b). Hummon et al. (2003) developed two models using cleavage data from molluscan (Aplysia californica) neuropeptide precursors that had been obtained using mass spectrometry. These models correctly predicted ~96% of the precursor cleavages in the Aplysia data set. Using an alternative approach, Duckert et al. (2004) developed an artificial neural network trained on 227 viral and eukaryotic proteins containing 235 cleavage sites to predict neuropeptides.

Southey et al. (2006b) proposed a Known Motif model comprising several prevalent motifs associated with neuropeptide precursor cleavage: Xxx-Xxx-Lys-Lys↓, Xxx-Xxx-Lys-Arg↓, Xxx-Xxx-Arg-Arg↓, Arg-Xxx-Xxx-Lys↓ and Arg-Xxx-Xxx-Arg↓, where ↓ denotes cleavage and Xxx denotes any amino acid. They compared the predictive performance of the Known Motif model with the approaches of Hummon et al. (2003) and Duckert et al. (2004) in the RFamide family of neuropeptides, including the invertebrate FMRFamide and the vertebrate NPFFamide, RFRPamide and PrRPamide precursors from a range of species. The Known Motif approach had a higher rate of correct classification of cleavages, higher specificity and higher negative predictive power than the other approaches. The performance of the Known Motif approach was similar to, or better than, the approach of Hummon et al. (2003), and both of these approaches performed better than the approach of Duckert et al. (2004) for the RFamide precursors studied.

Amare et al. (2006) trained a binary logistic regression model on mammalian neuropeptide precursors. Significant differences were found in the processing among vertebrate and molluscan precursors, specifically, in the processing of dibasic sites. Importantly, phyla-specific predictors were reported to provide the most accurate predictions of neuropeptide cleavage.

Currently, however, there are no phyla-specific models to predict cleavage of neuropeptide precursors in insects. Furthermore, no comparisons of different methodologies to predict neuropeptide cleavage, such as logistic regression and artificial neural networks trained on the same data set, have been reported. Different methodologies can have different properties that may be more suitable for different objectives or data sets. The objectives of this study were first, to develop insect-specific logistic regression and artificial neural network models to predict cleavage of neuropeptide precursors using extensive cleavage data sets in this important and well-studied taxon; second, to assess the aptitude of a classification approach yet to be used to predict neuropeptide precursor cleavage, k-nearest neighbor; and third, to compare the performance of these alternative approaches with an approach based on motifs known to be associated with cleavage, using neuropeptide precursors not used to train these approaches. The overarching aim is to provide tools to accurately identify neuropeptides from sequence information, expediting the experimental elucidation of neuropeptides and annotation of neuropeptide genes in insects with sequenced genomes. Using D.melanogaster and A.mellifera genome sequences and published neuropeptide information, the performance of alternative models was evaluated in multiple data sets using complementary indicators of model adequacy.


    2 METHODS
2.1 Data
Experimentally confirmed neuropeptide precursor cleavage data on A.mellifera (Hummon et al., 2006b) and D.melanogaster precursors (Baggerman et al., 2005; Nassel, 2002) were compiled for species-specific or genome training data sets. Only neuropeptide precursors with virtually complete and empirically validated cleavages at basic amino acids were considered in the model training stage to minimize the impact of low-quality data and maximize the signal-to-noise ratio. Precursors with incomplete or no experimental prohormone cleavage data were excluded from the training data sets. Each training data set (Apis and Drosophila) included only one sequence per neuropeptide precursor.

Two additional test (non-training) data sets that excluded precursors from the Apis or Drosophila species were created from the UniProt Knowledgebase database (Bairoch et al., 2005). The first of these data sets, Various, consisted of complete insect precursors from multiple insect species. The second data set, Insulin-like, was generated from the Insulin-like peptide precursors, including bombyxin, due to the high homology of these sequences within and between species. The Insulin-like data set did not contain homologous precursors to the precursor sequences present in the Apis, Drosophila and Various data sets. The Apis, Drosophila, Various and Insulin-like data sets do not include any neuropeptide precursor sequences from the same species.

Apis and Drosophila data sets were used to train models because these data sets included extensive and representative proteome-wide information suitable for general model building. These data sets were created to avoid introducing precursor bias into the training process, as each precursor was represented at most once within species. An additional benefit of this data set specification was the ability to identify potential species-specific cleavage patterns. The Insulin-like and Various data sets contained many similar sequences from either a single species or the same precursor in multiple species, making these data sets more suitable for model testing or evaluation.

For each data set, the following procedure was applied to each complete precursor sequence:

  1. The signal peptide was predicted using SignalP V3 (Bendtsen et al., 2004) and was removed from the sequence.
  2. The remaining sequence was split into overlapping windows of 18 amino acids, with locations or sites denoted P9–P1 and P1'–P9'. Every amino acid was present in the P1 location in the resulting windows. This nomenclature for locations follows Schechter and Berger (1967), where amino acids from the scissile bond of any potential cleavage site to the N-terminus are denoted as P, and amino acids towards the C-terminus (opposite side) are denoted as P'. Each amino acid is given a number denoting the location of the amino acid relative to the cleavage site, so that the cleavage occurs between the P1 and P1' amino acids.
  3. Windows started at the fourth amino acid of the sequence and ended at the fourth to last amino acid. This was based on the minimum number of amino acids before and after the observed cleavage sites in the Apis and Drosophila data sets.
  4. Windows which exceeded the C-terminus or N-terminus of the precursor were completed with Xxx (unspecified amino acid code) to ensure that the window would be included in the analysis, but this code was not used in the training process.
  5. Windows without an Arg or Lys in the P1 location were removed.
  6. When cleavage occurred with multiple sequential basic amino acids, the cleaved site was associated with the most C-terminal basic amino acid.
  7. Windows that could not be cleaved due to cleavage at a nearby location were removed, specifically where a basic amino acid was present in the P1 location, a basic amino acid was present in either the P1' or P4' locations, and there was no basic amino acid located in either the P2 or P4 locations.
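
The windowing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the input is assumed to be a precursor with the signal peptide already removed (step 1), one-letter amino acid codes are used with `X` standing in for Xxx, and step 6 is omitted because it concerns how observed cleavages are labelled rather than how windows are built.

```python
BASIC = {"K", "R"}  # Lys and Arg in one-letter code
FLANK = 9           # nine locations on each side of the scissile bond
PAD = "X"           # stands in for Xxx where a window runs past a terminus

def blocked_by_neighbor(window):
    """Step 7: basic P1, basic P1' or P4', and no basic residue at P2 or P4."""
    p4, p2, p1 = window[5], window[7], window[8]
    p1_prime, p4_prime = window[9], window[12]
    return (p1 in BASIC
            and (p1_prime in BASIC or p4_prime in BASIC)
            and p2 not in BASIC and p4 not in BASIC)

def candidate_windows(mature_seq):
    """Return (P1 position, window) pairs; positions are 1-based.

    Each window is an 18-character string laid out as P9..P1 then P1'..P9'.
    """
    n = len(mature_seq)
    padded = PAD * FLANK + mature_seq + PAD * FLANK   # step 4
    kept = []
    # Step 3: P1 runs from the fourth to the fourth-to-last residue.
    for p1_pos in range(4, n - 2):
        i = p1_pos - 1 + FLANK                        # index of P1 in padded
        window = padded[i - FLANK + 1 : i + FLANK + 1]
        if window[8] not in BASIC:                    # step 5
            continue
        if blocked_by_neighbor(window):               # step 7
            continue
        kept.append((p1_pos, window))
    return kept
```

For the made-up sequence `ACDEFGKRSTVWYACDE`, the Lys at position 7 is dropped by step 7 (Arg follows at P1' and neither P2 nor P4 is basic), while the Arg at position 8 is kept as a candidate site.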

2.2 Predictive methodologies
Binary logistic regression, artificial neural networks and k-nearest neighbor models were trained on the Apis and Drosophila data sets. Each data set was analyzed separately to obtain an Apis model trained only on the Apis data set, and a Drosophila model trained only on the Drosophila data set. Explanatory or input variables were the indicators of presence or absence of every possible combination of amino acid (20 possibilities), and location (18 possibilities) of the amino acid relative to the cleavage sites (i.e. presence or absence of Arg at P2, presence or absence of Lys at P4).
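
As an illustration of these input variables, the sketch below turns one 18-residue window into presence/absence indicators for every amino acid and location combination. The names `encode_window` and `feature_name`, the ordering of the indicators and the one-letter alphabet are assumptions made for the example; the source does not describe how the indicators were laid out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
LOCATIONS = ([f"P{k}" for k in range(9, 0, -1)]
             + [f"P{k}'" for k in range(1, 10)])   # P9..P1, P1'..P9'

def encode_window(window):
    """Return 18 x 20 = 360 indicators (0 or 1), grouped by location."""
    if len(window) != len(LOCATIONS):
        raise ValueError("window must contain 18 residues")
    features = []
    for residue in window:
        # A padding residue such as X matches no amino acid and gives all zeros.
        features.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return features

def feature_name(index):
    """Readable label for an indicator, e.g. 'K@P2' for Lys at P2."""
    location, amino_acid = divmod(index, len(AMINO_ACIDS))
    return f"{AMINO_ACIDS[amino_acid]}@{LOCATIONS[location]}"
```

Each window therefore contributes at most 18 ones, one per location that holds a standard residue.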

A detailed description of logistic regression theory can be found in Agresti (1996). In this study, the logarithmic transformation of the odds of cleavage was described with a linear model including all explanatory variables as main effects. Parsimonious yet accurate logistic regression models were obtained using a combination of backward, forward and stepwise model-selection methods with a 0.1 P-value threshold for inclusion of significant model terms. Subsequently, the model with the fewest incorrect predictions in the respective training data set was selected. Artificial neural networks were implemented as multi-layer perceptrons (Hastie et al., 2001) with a single hidden layer. The number of hidden nodes evaluated ranged from 1 to 500 and all input nodes were connected to all hidden nodes. The range of hidden nodes considered permitted the evaluation of different degrees of non-linear association between the input variables and the probability of cleavage in the training data sets. The k-nearest neighbor model or memory-based reasoning model (Hastie et al., 2001) was implemented using a k-nearest neighbor algorithm with the Euclidean metric. Different values of k between 1 and 100 (in intervals of 5 units until 25 and 25 units thereafter) were evaluated to identify the number of neighbors that optimized the classification of windows between cleaved and non-cleaved status in the training data set. A representation of the binary logistic and multi-layer perceptron models used in this study is provided in Supplementary Material Figure 1.
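
A minimal sketch of k-nearest neighbor classification with the Euclidean metric is given below to make the idea concrete. It is a generic textbook version written for illustration, not the SAS Enterprise Miner implementation used in the study; the function names and the choice to report the fraction of cleaved neighbors as the predicted probability are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_cleavage_probability(train_x, train_y, query, k=5):
    """Fraction of the k nearest training windows that were cleaved.

    train_x: list of feature vectors, train_y: list of 0/1 cleavage labels.
    Ties in distance are broken by training order (sorted() is stable).
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)
```

With 0/1 indicator inputs, the squared Euclidean distance between two windows is simply the number of indicators in which they differ, so the nearest neighbors are the training windows sharing the most residues at matching locations.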

The three approaches to predict neuropeptide precursor cleavage considered in this study were selected because they are commonly used pattern recognition techniques that have complementary strengths, yet have not been used to predict cleavages on the same insect neuropeptide data sets. The logistic regression model can be implemented as a simple multi-layer perceptron with no hidden layer and a logistic output activation function (Ohlsson, 2004). The implementation of the logistic model used in this study only includes explanatory variables in a linear fashion, whereas the hidden layer of the multi-layer perceptron accommodates potential non-linear effects of the amino acid locations on the probability of cleavage. The k-nearest neighbor method is a well-established non-parametric approach to classify samples that does not require parameter estimation but tends to have high computational requirements for large training data sets (Chen et al., 1997). Logistic regression models were implemented in SAS (http://www.sas.com), and multi-layer perceptrons and k-nearest neighbor (or memory-based reasoning) models were implemented in SAS Enterprise Miner (http://www.sas.com).
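
The relationship between the two parametric models can be written out as follows. The notation is introduced here for illustration and is not taken from the source: x_{a,l} is the indicator for amino acid a at location l, p is the probability of cleavage, H is the number of hidden nodes, and g is the hidden-node activation function, which the source does not specify.

```latex
% Logistic regression: log-odds of cleavage are linear in the indicators
\operatorname{logit}(p) = \ln\frac{p}{1-p}
  = \beta_0 + \sum_{l}\sum_{a} \beta_{a,l}\, x_{a,l}

% Multi-layer perceptron with one hidden layer of H nodes and a logistic output
p = \sigma\!\Bigl( v_0 + \sum_{h=1}^{H} v_h\,
      g\Bigl( w_{h,0} + \sum_{l}\sum_{a} w_{h,a,l}\, x_{a,l} \Bigr) \Bigr),
\qquad \sigma(z) = \frac{1}{1+e^{-z}}
```

Removing the hidden layer and feeding the weighted inputs straight into the logistic output recovers the first expression, which is the sense in which logistic regression is a perceptron with no hidden layer.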

2.3 Model evaluation
Models trained on the Apis and Drosophila data sets were applied to the Apis and Drosophila genome-based training data sets and the Insulin-like and Various data sets. Results were categorized into false and true results, based on the assumption that the experimental data offers a correct depiction of precursor processing. Predicted probabilities of cleavage lower than 0.5 were considered non-cleaved (negative) and probabilities equal to or higher than 0.5 were considered cleaved (positive). For each data set, the numbers of predictions of correct cleavage (true-positive result), incorrect cleavage (false-positive result), correct non-cleavage (true-negative result) and incorrect non-cleavage (false-negative result) were obtained by comparing the predicted probability of cleavage for each window with the threshold probability value of 0.5 and with the observed cleavage status. Using this information, the following established measurements of prediction accuracy (Baldi et al., 2000; Southey et al., 2006b) were calculated:
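
The thresholding rule described above can be sketched as follows. The function name and the dictionary keys are hypothetical; only the 0.5 threshold and the rule that a probability equal to the threshold counts as cleaved come from the text.

```python
def confusion_counts(probabilities, observed, threshold=0.5):
    """Tally TP, FP, TN and FN for paired predictions and observations.

    probabilities: predicted probability of cleavage for each window.
    observed: 1 (or True) if the window was experimentally cleaved, else 0.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, cleaved in zip(probabilities, observed):
        predicted_cleaved = p >= threshold   # equal to the threshold is positive
        if predicted_cleaved and cleaved:
            counts["TP"] += 1
        elif predicted_cleaved:
            counts["FP"] += 1
        elif cleaved:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts
```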

  1. Correct classification rate: number of correctly predicted sites divided by the total number of sites.
  2. Sensitivity (one minus false-negative rate): number of true positives divided by the total number of sites cleaved.
  3. Specificity (one minus false-positive rate): number of true negatives divided by the total number of sites not cleaved.
  4. Matthews (1975) correlation coefficient: a correlation coefficient between observed and predicted cleavage.
  5. Positive Precision: number of true-positive results divided by the total number of predicted positives.
  6. Negative Precision: number of true-negative results divided by the total number of predicted negatives.
  7. The area under the receiver-operator characteristic (ROC) curve is a measure of accuracy of the classification of cleaved and non-cleaved sites. A value of one indicates perfect accuracy, values greater than 0.8 are considered excellent approach performance, and values under 0.7 are considered poor approach performance.
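
The seven measurements listed above can be computed from the four tallies and the predicted probabilities as sketched below. The function names are hypothetical, the standard textbook formulas are used, and the area under the ROC curve is computed by pairwise comparison of cleaved and non-cleaved windows, which is one of several equivalent ways to obtain it and is not necessarily the method used in the study. The sketch assumes every denominator is non-zero, apart from the Matthews coefficient, which is set to 0 when undefined.

```python
import math

def performance(tp, fp, tn, fn):
    """Measurements 1 to 6 from true/false positive/negative counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "correct_classification_rate": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),          # share of cleaved sites found
        "specificity": tn / (tn + fp),          # share of non-cleaved sites found
        "matthews_correlation": ((tp * tn - fp * fn) / mcc_denominator
                                 if mcc_denominator else 0.0),
        "positive_precision": tp / (tp + fp),   # predicted cleaved that are cleaved
        "negative_precision": tn / (tn + fn),   # predicted non-cleaved that are not
    }

def roc_auc(probabilities, observed):
    """Measurement 7: chance that a cleaved window outranks a non-cleaved one."""
    cleaved = [p for p, y in zip(probabilities, observed) if y]
    non_cleaved = [p for p, y in zip(probabilities, observed) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in cleaved for q in non_cleaved)
    return wins / (len(cleaved) * len(non_cleaved))
```

As a made-up example, 40 true positives, 10 false positives, 140 true negatives and 10 false negatives give a correct classification rate of 0.90, a sensitivity of 0.80 and a specificity of about 0.93.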

A single representative model from each training data and predictive methodology was selected using the best predictive performance on the non-training data sets and, hence, best generalization ability. The Apis and Drosophila logistic regression models selected had the fewest incorrect predictions across all data sets. The final multi-layer perceptron specification for each data set had the lowest number of hidden nodes that provided the highest accuracy of cleavage prediction in the test data sets. The final k-nearest neighbor specification had the lowest number of neighbors that optimized the classification of windows between cleaved and non-cleaved status in the training and non-training data sets.

The performance of the representative logistic regression, multi-layer perceptron and k-nearest neighbor approaches for each training data set was also compared to the performance of the Known Motif model of Southey et al. (2006a, b). The Known Motif model assigns high probability of cleavage (>0.5) to precursor sequence sites with at least one amino acid motif reported to be associated with cleavage based on a literature review of experimental studies. These motifs include selected combinations or interactions of basic amino acids at locations P1 and P2 or P4. The known motifs were also specified in the full logistic regression models prior to undergoing variable selection. However, the final Drosophila and Apis logistic regression models did not include any motif.
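
A motif-based classifier of this kind can be sketched as a simple lookup. The five motifs below are the ones listed in the Introduction for the Known Motif model; the function name, the boolean return value (instead of a probability) and the 18-character window layout (P9..P1 followed by P1'..P9', so P4, P2 and P1 sit at indices 5, 7 and 8) are assumptions made for the example.

```python
# Index of each location inside an 18-residue window laid out P9..P1, P1'..P9'.
WINDOW_INDEX = {"P4": 5, "P2": 7, "P1": 8}

KNOWN_MOTIFS = [
    {"P2": "K", "P1": "K"},   # Xxx-Xxx-Lys-Lys
    {"P2": "K", "P1": "R"},   # Xxx-Xxx-Lys-Arg
    {"P2": "R", "P1": "R"},   # Xxx-Xxx-Arg-Arg
    {"P4": "R", "P1": "K"},   # Arg-Xxx-Xxx-Lys
    {"P4": "R", "P1": "R"},   # Arg-Xxx-Xxx-Arg
]

def known_motif_cleaved(window):
    """True if the window matches at least one of the listed motifs."""
    return any(
        all(window[WINDOW_INDEX[location]] == residue
            for location, residue in motif.items())
        for motif in KNOWN_MOTIFS
    )
```

Note that Arg at P2 followed by Lys at P1 is not among the listed motifs, so such a window is predicted as non-cleaved unless Arg is also present at P4.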


    3 RESULTS
3.1 Neuropeptide precursor amino acid composition
A total of 16 Apis and 21 Drosophila precursor sequences were considered to surpass the minimum information quality and quantity criteria to be used for training. The Insulin-like data set consisted of 41 precursors from five species including 25 Bombyx mori bombyxin precursors. The Various insect data set consisted of 40 precursors, obtained from 19 neuropeptide families in 20 species. Only nine neuropeptide families occurred in both the Apis and Drosophila data sets. A total of 13 and eight precursor sequences present in the Drosophila and Apis data sets, respectively, had at least one homologous precursor present in the Various data set. The lengths of the precursors and neuropeptides in the Apis and Drosophila data sets are equal to and slightly larger than, respectively, the corresponding sequences in the Various data set (Table 1).


Table 1. Descriptive statistics (number, mean, standard deviation, median) of the number of amino acids in the precursor sequences and components (signal peptide, neuropeptide) and of the number of neuropeptides per prohormone for the two training data sets (Apis and Drosophila) and two test data sets (Insulin-like and Various)

 
3.2 Cleavage by amino acid and location
After applying the filtering rules, cleaved windows occurred in 25 and 26% of the Apis and Drosophila training data sets, respectively. At least 19 amino acids were present in all sequence locations in the training data sets, except for the P1 location, and no amino acid was missing in both training data sets for the same location. The P2 location showed dramatic differences between cleaved and non-cleaved windows. All amino acids were present in the P2 location in non-cleaved windows; however, in the cleaved windows, only three amino acids (Gly, Lys and Arg) occurred often (in 151 of the 159 cleaved windows across both data sets). At P2, 64% of the cleavages occurred with Lys (75 and 82% of cleavages in Apis and Drosophila data sets, respectively), 16% occurred with Arg (36 and 35% of cleavages in Apis and Drosophila data sets, respectively) and 15% occurred with Gly. There was differential cleavage between the two species in the presence of Gly, as only 22% of the P2 Gly windows were cleaved in the Apis data set, but 57% were cleaved in the Drosophila data set. However, at the P3 location, Gly occurred in 12% of windows and 70% of these windows were cleaved in both species. The presence of Gly in the P2 and P3 locations is not surprising because many neuropeptides are amidated at a C-terminal Gly after cleavage at the P1 location and removal of any C-terminal basic amino acids. Other amino acids were present in other locations with frequencies higher than 10% of windows. In particular, Arg in the P5, P9 and P8' locations was present in over 10% of windows, which was indicative of overlapping windows.

The proportion of cleaved sites in the Various and Insulin-like data sets was 29 and 28%, respectively, after applying the filtering rules. Only the Various data set showed similar structure to the training data sets, except that there was a relatively high abundance of cleaved windows. The prevalence of Gly and Arg at the P2' location in the Various data set was mainly due to the Allatostatin neuropeptide family. The Insulin-like data set did not exhibit the same variation of amino acid composition at different locations as the other data sets, demonstrating the selective composition of these two data sets.

3.3 Cleavage motifs and the known motif model
The frequency of cleavage associated with different basic amino acid motifs is given in Table 2. Only 11 out of 468 cleavages were observed across all data sets where no basic amino acid occurred in the P2 or P4 location and when Arg was present in the P1 location. The remaining cleavages were observed with a basic amino acid in the P1 location and at least one basic amino acid in the P2 or the P4 location. Consequently, applying this observation to the Apis, Drosophila and Various data sets, resulted in 34–40% of non-cleaved windows being falsely predicted as cleaved, and 1–3% of cleaved windows being falsely predicted as non-cleaved.


Table 2. Motif occurrence and cleavage frequency within motif across the data sets

 
The Xxx-Xxx-Lys-Arg motif was the most frequently observed, occurring in 18, 16, 22 and 23% of the windows in the Apis, Drosophila, Various and Insulin-like data sets, respectively, and cleaved in 78, 86, 83 and 88% of these windows in the Apis, Drosophila, Various and Insulin-like data sets, respectively. The next most frequent motif was Arg-Xxx-Xxx-Arg, occurring in 9, 7, 4 and 7% of the windows in the Apis, Drosophila, Various and Insulin-like data sets, respectively, and cleaved in 42, 39, 52, 40 and 35% of these windows in the Apis, Drosophila, Various and Insulin-like data sets, respectively. Although the Xxx-Xxx-Lys-Lys only occurred in 4, 3, 2, 2 and 5% of the windows in the Apis, Drosophila, Various and Insulin-like data sets, respectively, this motif was the second most frequently cleaved motif, where 64, 64, 40 and 43% of these windows were cleaved in the Apis, Drosophila, Various and Insulin-like data sets, respectively.

3.4 Predictive performance
The different criteria considered provided comprehensive insights into the strengths of the approaches compared. The different approaches provided a series of predictive models with varying predictive performance. In most cases, there was little difference in prediction accuracy between the top models within each approach. For example, the differences between top models for correct classification rate were generally less than 5%. The single model representative of each training data set and approach combination was selected based on the performance of the model in the training and test data sets. Comparable performance of the predictive models from each approach suggests that alternative model specifications may offer slight improvements in the accuracy of the models to predict cleaved and non-cleaved sites for different data sets.

The performance of the multi-layer perceptrons with different numbers of hidden nodes is provided in the Supplemental Material Table 1. Multi-layer perceptrons with 25 and 50 hidden nodes in the Apis and Drosophila data sets, respectively, had consistently superior predictive performance across the test data sets. However, the apparent optimal number of hidden nodes in the multi-layer perceptron trained in the Apis data set was highly dependent on the test data set evaluated. The selected k-nearest neighbor had 5 and 10 neighbors in the Drosophila and Apis training data sets, respectively, and provided the most accurate predictions among all the k-nearest neighbor specifications evaluated. The model selection strategy resulted in logistic regression models with 20 and nine amino acid-location terms in the Drosophila and Apis data sets, respectively. The nature of the logistic approach allowed a straightforward comparison of model terms. The Drosophila and Apis logistic models only have two terms in common, Lys and Arg at P2.

The multi-layer perceptrons had perfect performance in the training data sets (Table 3). The performance of the logistic regression and k-nearest neighbor models was similar in both training data sets and was comparable to that of the multi-layer perceptron in the Drosophila training data set. However, the logistic regression and k-nearest neighbor models had lower accuracy than the multi-layer perceptrons in the Apis training data set.


Table 3. Model performance statistics of multi-layer perceptrons, logistic models, k-nearest neighbor and Known Motif model trained in either the Apis or Drosophila training data and tested in other data

 
In general, the Drosophila multi-layer perceptron, logistic regression and k-nearest neighbor models had similar performance statistics in all the test data sets (Table 3). Likewise, the Apis logistic regression and k-nearest neighbor models had similar performance statistics in all the test data sets. The Apis multi-layer perceptron had higher performance than the other two approaches for some criteria (sensitivity and negative precision) and lower performance than the other two approaches for the rest of the criteria in the Drosophila and Various test data sets. The Apis multi-layer perceptron had lower accuracy than the other two approaches in the Insulin-like test data set.

Overall, the high correct classification rate and area under each ROC curve across different data sets indicated that the different models had good generalization properties. The high specificity was expected because 80% of the windows were non-cleaved. This was also reflected in the high negative precision, with, on average, 94% of non-cleaved sites correctly predicted as non-cleaved. The average specificity indicated that 83% of cleaved sites were correctly predicted and the positive precision indicated that, on average, 74% of sites were correctly predicted as cleaved.

The Known Motif model had, in general, the highest number of true-positive predictions across all data sets and consequently the highest sensitivity of all approaches across all test data sets. However, the Known Motif model also produced the highest number of false positives, which translated into low positive precision, as 30–50% of the predicted cleavage sites are expected to be false positives. Consequently, the Known Motif model had the weakest performance of all approaches across all data sets for the majority of the criteria, with the exception of sensitivity.

3.5 Prediction patterns
Correct classification rates ranged from 78 to 100%, indicating that the models under consideration adequately predicted cleaved and non-cleaved positions in the genome, Various and Insulin-like data sets. The Drosophila data set appeared to have a more similar structure to the Various data set than did the Apis data set; hence, the better performance of the Drosophila model compared to the Apis model for these predictions.

The three Apis-trained models correctly predicted 90% of the non-cleaved sites and 75% of the cleaved sites on the Apis data and, on average, 79% of the non-cleaved sites and an average of 57% of the cleaved sites on the other data sets. The three Drosophila-trained models correctly predicted 93% of the non-cleaved sites and 89% of the cleaved sites and, on average, 83% of the non-cleaved sites and an average of 65% of the cleaved sites on the other data sets.

Across all data sets, 65% of non-cleaved sites and 52% of cleaved sites were correctly predicted by all models, resulting in 39% of sites being misclassified by at least one model. The majority of the incorrectly predicted cleaved sites (false positives) were associated with the motifs Arg-Xxx-Xxx-Arg (43%) or Xxx-Xxx-Lys-Arg (25%). The majority of incorrectly predicted non-cleaved sites (false negatives) also occurred with Arg-Xxx-Xxx-Arg (23%) and 16% occurred with Arg in P1 but no basic amino acids in either P4 or P2 locations. Most of the incorrect predictions (40%) occurred in the Various data set. Incorrect predictions in the Various and Insulin-like data sets occurred either multiple times within a single precursor (such as the FMRFamide precursor) or a similar incorrect prediction was repeated multiple times due to the same neuropeptide family (such as the PBAN precursor).


    4 DISCUSSION
4.1 Model terms
Four cleavage prediction models were applied to two types of data sets, one based on specific genome-enabled sets of neuropeptides (Drosophila and Apis test data sets), and another based on literature-based sets of neuropeptides (Insulin-like and Various). The genome-specific data sets were unique in that they attempted to cover known neuropeptides in a genome. The scope of these data sets may limit the performance of the genome-specific models when applied to other species, either because species-specific terms may be included in the model, or because such terms may have more impact on the prediction than optimized terms that may be less represented in the genome-specific data sets.

The final logistic regression models included the region between locations P6 and P4'. This model composition is consistent with the binding region observed in the crystal structures of two prohormone convertases, Kex2 (Holyoak et al., 2003) and furin (Henrich et al., 2003). The Drosophila logistic regression model also included the P6 location, which has been identified as a site influencing furin cleavage (Rockwell et al., 2002). This biological support for the model terms suggests that the locations of amino acids included are applicable to other data sets, although these terms may not necessarily be optimal for these data sets. The multi-layer perceptron and k-nearest neighbor predictive functions were not interpreted due to the opaque nature of these techniques (Hastie et al., 2001).

The limited representation of some cleavages in the genome-specific data sets results in the cleavage sites being partially or completely confounded with the training data set sequence information. The Drosophila and Apis training data sets consisted of a total of 621 windows and yet there are 342 possible combinations of amino acid-location that could have been fitted. The relationships between amino acid-locations required 212 and 251 eigenvalues to explain 99% of the variation in Apis and Drosophila data sets, respectively. The training data resulted in multiple models for each species that had similar predictive ability. For example, the difference in the correct classification rate between the top three logistic models in the Apis and Drosophila models was 2 and 5%, respectively.

Various authors (Cameron et al., 2001; Devi, 1991; Rholam et al., 1995; Veenstra, 2000) have proposed empirical rules for predicting cleavage, based on the observed structure or amino acid composition derived from known cleavage sites with a limited number of precursors. One rule that is favorably supported by the genome data sets is that cleavage only occurs in the presence of a basic amino acid in the P2, P4, P6 or P8 locations, in addition to the P1 location. There were 11 windows cleaved only with Arg in the P1 location (no other basic amino acids in nearby locations) across all data sets.

The Known Motif approach predicts cleavage in the presence of the motifs Lys at P2 and Lys at P1, or Lys at P2 and Arg at P1, or Arg at P2 and Arg at P1, or Arg at P4 and Lys at P1, or Arg at P4 and Arg at P1 (Southey et al., 2006a, b). The high sensitivity in the Known Motif model was accompanied with a large number of false-positive sites and low-specificity values. Consistent with the Known Motif model and data structure, the Apis logistic regression model included Lys at P1, Lys at P2, Arg at P2 and Arg at P4. Likewise, the Drosophila logistic regression model included Lys at P2, Arg at P1 and Arg at P2. The estimates of the parameters associated with these model terms were positive, indicating that the probability of cleavage increased with the presence of these amino acid location combinations.
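
The five motifs listed above can be written as a simple lookup. The following is a minimal sketch in Python, not the NeuroPred implementation: the function names, the one-letter residue codes and the convention of addressing a site by the 0-based index of its P1 residue are assumptions made for illustration.

```python
# Motifs as stated in the text: every (location, residue) pair must match.
# P1 is the residue immediately before the candidate bond, P2 the one
# before that, and so on. K = Lys, R = Arg.
KNOWN_MOTIFS = [
    {"P2": "K", "P1": "K"},  # Lys at P2 and Lys at P1
    {"P2": "K", "P1": "R"},  # Lys at P2 and Arg at P1
    {"P2": "R", "P1": "R"},  # Arg at P2 and Arg at P1
    {"P4": "R", "P1": "K"},  # Arg at P4 and Lys at P1
    {"P4": "R", "P1": "R"},  # Arg at P4 and Arg at P1
]


def residue_at(seq, p1_index, location):
    """Residue at location 'P<n>' relative to P1, or None if off the sequence."""
    n = int(location[1:])
    idx = p1_index - (n - 1)
    return seq[idx] if 0 <= idx < len(seq) else None


def known_motif_cleaved(seq, p1_index):
    """True if any known motif is fully present at this candidate site."""
    return any(
        all(residue_at(seq, p1_index, loc) == aa for loc, aa in motif.items())
        for motif in KNOWN_MOTIFS
    )


def predicted_sites(seq):
    """0-based indices of P1 residues where cleavage is predicted."""
    return [i for i in range(len(seq)) if known_motif_cleaved(seq, i)]


# Fragment of the PBAN sequence quoted in the text (Pro-Arg-Leu-Gly-Arg-Arg-Leu).
print(predicted_sites("PRLGRRL"))
```

On this fragment the sketch flags both the position between the two adjacent Arg residues (through Arg at P4) and the position after them (through Arg at P2), which shows how a motif-only rule reaches high sensitivity while also flagging sites that may not be cleaved.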

Other empirical rules previously suggested are not supported by the data used here or were reported with known exceptions. For example, the frequency of cleaved windows with an aliphatic amino acid in the P1' location was 21% across all data sets, although no cleavages were expected to occur (Devi, 1991; Cameron et al., 2001). Devi (1991) and Cameron et al. (2001) also proposed that cleavage would not occur with Cys in the P6–P2' locations. It has been reported that such cleavages occurred with Cys in P1' in Periplaneta americana FMRFamide (Predel et al., 2004a) and Cys was found in all data sets in the P5 location and also in the P4 location of the Apis data set. However, this rule has limited value because Cys was a rare amino acid in both the Apis and Drosophila data sets; only six windows had at least one Cys in the region between P2 and P4. Furthermore, many of the sites correctly predicted as cleaved by the empirical rules were also correctly predicted by the models generated here.

Presently, the most reliable information available to predict neuropeptide precursor cleavage is the amino acid location. The use of amino acid locations as model terms has two advantages. First, these inputs model the association between the biochemical and biophysical properties of the amino acids and the probability of cleavage. Second, the use of individual amino acid predictor variables allows modeling associations with cleavage that deviate from those explained by the amino acid properties. For example, Arg and Lys are both basic amino acids. However, the probability of cleavage differs depending on whether one or the other is present at location P1 or P2, which is consistent with the observation by Henrich et al. (2003) that Arg in the P1 location is required for cleavage by furin. Pilot studies with amino acid property location input variables (instead of the specific amino acids) resulted in lower performance than amino acid location models. These results suggested that individual characteristics of the amino acids play a critical role in neuropeptide precursor cleavage in addition to general physicochemical properties. Likewise, use of secondary or higher order structure information could, in principle, improve the prediction of neuropeptide cleavage. However, using known and predicted secondary structure as input variables together with amino acid information did not improve the accuracy of cleavage predictions. Structural information on neuropeptides is currently too limited to enable accurate modeling of neuropeptide precursor cleavage.

4.2 Comparison of different methodologies
The overall performance of all the models on the independent test data sets was similar with median correct classification rate equal to 86% (ranging from 80 to 91%) and median area under the curve equal to 88% (ranging from 78 to 94%). The high performance of the neuropeptide precursor cleavage models across test data sets is particularly remarkable considering the evolutionary distance between the two organisms used to train the models. It has been estimated that the Genus Apis (Family Apidae, Order Hymenoptera) diverged from the Orders Diptera (i.e. Genus Drosophila) and Lepidoptera (i.e. Genus Bombyx) about 300 million years ago (Honeybee Genome Sequencing Consortium, 2006). Furthermore, the performance level of the models is especially notable in view of the wide range of functions of the neuropeptides resulting from the predicted precursor cleavages.

Indicators of overall model performance are a valuable first pointer for model comparison. However, single model performance values encompass multiple criteria that need to be considered separately when evaluating models and approaches. In particular, single overall performance values do not allow evaluation of the trade-off between sensitivity and specificity that was observed in several cases. A particular example of the complementary properties of the approaches considered is the performance of the Drosophila models on the Various test data set. The Drosophila multi-layer perceptron and logistic regression models had similar correct classification rates in the Various data sets. However, the multi-layer perceptron had 5% higher sensitivity and 1% lower specificity than the logistic regression model. The complementary advantages of both approaches cancelled each other out because neuropeptide precursor sequences have more non-cleaved than cleaved sites and thus, the overall correct classification rate was similar. For the Various test data set, the k-nearest neighbor model had the same sensitivity as the multi-layer perceptron and higher specificity than the logistic regression model, and consequently had the highest correct classification rate of all three approaches.
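
The way class imbalance lets a sensitivity gain and a specificity loss cancel can be checked with a short calculation. The sketch below uses hypothetical counts (100 cleaved and 500 non-cleaved sites, chosen for illustration and not taken from the data sets in this study) for two models that differ by 5% in sensitivity and 1% in specificity.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, correct classification rate)."""
    sensitivity = tp / (tp + fn)   # fraction of cleaved sites predicted cleaved
    specificity = tn / (tn + fp)   # fraction of non-cleaved sites predicted non-cleaved
    correct = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, correct


# Hypothetical counts: 100 cleaved sites, 500 non-cleaved sites.
model_a = rates(tp=70, fn=30, tn=450, fp=50)   # lower sensitivity, higher specificity
model_b = rates(tp=75, fn=25, tn=445, fp=55)   # +5% sensitivity, -1% specificity

for name, (sens, spec, ccr) in (("A", model_a), ("B", model_b)):
    print(f"model {name}: sensitivity={sens:.2f} specificity={spec:.2f} correct={ccr:.4f}")
```

With five times as many non-cleaved as cleaved sites, five extra true positives are offset by five extra false positives, so both hypothetical models classify 520 of 600 sites correctly even though their error profiles differ.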

Multi-layer perceptron and logistic regression models are highly related methodologies such that a simple multi-layer perceptron can be implemented that is equivalent to a logistic regression model (Schumacher et al., 1996; Vach et al., 1996). Multi-layer perceptrons are expected to be superior to linear logistic regression models when the actual relationship cannot be approximated by a linear logistic regression model with polynomial and multiplicative interaction terms (Vach et al., 1996). Differences in the predictions between the multi-layer perceptron and logistic regression models implemented in this study indicate potential complex non-linear relationships or interactions between different amino acids at different locations and the probability of neuropeptide precursor cleavage in these two data sets. Multi-layer perceptron models were superior to logistic regression models in sensitivity in most data sets. This suggests that complex non-linear relationships and/or interactions between different amino acids at different locations are required to obtain high prediction of actual cleaved sites.

The wide range of neuropeptide precursors and divergence between insect species with neuropeptide precursor cleavage information hindered the ability to find a single best model across all data sets and performance criteria. For example, for the vast majority of the independent test data sets, the multi-layer perceptron models had higher sensitivity and lower specificity than the logistic regression models. However, the Apis logistic regression model was clearly superior to the multi-layer perceptron model in both sensitivity and specificity (and consequently correct classification rate) in the Insulin-like test data set. The overall performance of the Apis k-nearest neighbor approach was slightly lower than the logistic regression model due to similar specificity values and lower sensitivity of the k-nearest neighbor. The Apis multi-layer perceptron model had the lowest specificity and sensitivity of all three approaches in the Insulin-like dataset. The overall weak performance of the multi-layer perceptron model was not observed in the Drosophila counterpart or in other data sets. These comparisons suggest that the signals captured by the linear, non-linear, and interaction components of the multi-layer perceptron model in the Apis training data set are not well-represented in the Insulin-like data set.

Reported comparisons of logistic models and artificial neural networks trained by Duckert et al. (2004) for predicting cleavage sites have favored logistic models over artificial neural networks (Amare et al., 2006; Southey et al., 2006b). The difference was attributed to the training data since the training set used by Duckert et al. (2004) spans all Eukaryotes, whereas specific species and classes were studied in the logistic regression approaches (Amare et al., 2006; Hummon et al., 2003; Southey et al., 2006b). The present study supports the hypotheses formulated in these previous studies because the overall performance of the Drosophila multi-layer perceptron model was comparable to the Drosophila logistic regression model in all test data sets. Likewise, the Apis multi-layer perceptron had a similar correct classification rate and area under the curve compared to the logistic regression model in all test data sets with the exception of the Insulin-like data set.

The similar overall performance of all the approaches across test data sets indicates that the three approaches evaluated encompassed a good representation of classification techniques suitable for prediction of cleavage in neuropeptide precursors. In addition, the wider range of performance for specific criteria (e.g. sensitivity, specificity) compared to overall criteria suggests that there is no one single best approach across all the insects and precursors considered. The overall performance of the approaches studied is consistent with the performance of linear, non-linear, sub-space and ensemble classifiers applied to HIV precursor data presented by Nanni and Lumini (2007). These authors compiled a wide range of classifiers that can be applied to classify different biological problems, including precursor cleavage. The meta-predictors proposed by Nanni and Lumini (2007) are unfeasible in this study due to the perfect prediction of the multi-layer perceptron in the training data sets. The consistent overall performance of the three methods evaluated in this study suggests that results from other classification methods will not depart substantially from the results presented in this study. More substantial changes in model training are expected from the collection of more neuropeptide cleavage data that will be facilitated by the models presented here.

Complementary insights into neuropeptide precursor cleavage and non-cleavage patterns can be gained from the consideration of different approaches. For example, researchers that prioritize detection of cleavage (even at the cost of some false positives) may favor multi-layer perceptron or Known Motif approaches that typically have high sensitivity. On the other hand, researchers that prioritize the minimization of false-positive predictions typically associated with higher true-negative rates may favor logistic regression and k-nearest neighbor models. The complementary nature of the approaches considered is particularly relevant to the goal of this study, that is to support effective experimental detection of neuropeptides across insect species and precursor families based on sequence information.

Interactions between model terms can be expected because the Known Motif model involves interactions between the basic amino acids at the P4, P2 and P1 locations. The motifs used in the Known Motif model were included as interaction terms in the full logistic regression model and were subsequently eliminated during the training process. This implies that the inclusion of these motifs was insufficient to improve both sensitivity and specificity given the other model terms. The final logistic regression models only included additive combinations of the amino acid locations involved in the Known Motif model. The high sensitivity of the Known Motif model compared to logistic regression models across test data sets indicates that interactions between model terms represented by the motifs in the Known Motif model are necessary to accurately predict cleavage at some sites. The low specificity of the Known Motif model also indicates that the sole presence of one of the empirically-based motifs does not guarantee cleavage and that other amino acid location combinations have a substantial impact on the probability of cleavage either acting in linear or non-linear fashion.
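
The distinction between additive terms and motif interaction terms can be made concrete with a small sketch of a logistic model. All coefficient values below are placeholders invented for illustration and are not estimates from this study; only the choice of additive terms (Lys at P1, Lys at P2, Arg at P2, Arg at P4, all with positive sign) follows the description of the Apis model in Section 4.1. The function name and the dictionary-based window format are also assumptions.

```python
import math

# PLACEHOLDER coefficients (not fitted values); signs are positive as in the text.
INTERCEPT = -3.0
ADDITIVE = {("P1", "K"): 1.0, ("P2", "K"): 1.0, ("P2", "R"): 1.0, ("P4", "R"): 1.0}

# A hypothetical interaction term: contributes only when every member is present.
INTERACTIONS = {(("P2", "K"), ("P1", "R")): 2.0}


def cleavage_probability(window, additive, interactions=None, intercept=INTERCEPT):
    """Logistic probability for a window given as {location: residue}."""
    score = intercept
    for (loc, aa), beta in additive.items():
        score += beta * (window.get(loc) == aa)          # indicator variable
    for members, beta in (interactions or {}).items():
        score += beta * all(window.get(loc) == aa for loc, aa in members)
    return 1.0 / (1.0 + math.exp(-score))


window = {"P4": "A", "P2": "K", "P1": "R"}               # a Lys-Arg site
p_additive = cleavage_probability(window, ADDITIVE)
p_with_motif = cleavage_probability(window, ADDITIVE, INTERACTIONS)
print(round(p_additive, 4), round(p_with_motif, 4))
```

In the additive-only version each amino acid location shifts the score independently, whereas the interaction term adds weight only when the whole motif is present. This is the kind of term that was offered to the full model and removed during training.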

4.3 Impact of experimental data on the predictive models
False-positive and false-negative neuropeptide precursor cleavage predictions can be attributed to poor model generalization and inadequate training of the data sets. Few studies have reported experimentally verified cleavage sites and associated neuropeptides, thus increasing the probability that some cleavage sites have not been discovered. Obviously, the correct identification of the precursor sequence and associated cleaved and non-cleaved sites will be critical for accurate predictive functions. Many well-conserved and well-known neuropeptide families have been fully studied across multiple species; this information biases the estimates or distorts the model fit when used to verify models trained on other data sets.

The precursors used in this study were selected using published cleavage information such that most or all cleavage sites within the precursor have been identified. While the full precursor sequence may be known, the cleavage sites may be completely or partially unknown, or have been inferred from alignment to better-studied sequences from other species. For example, Hummon et al. (2006b) identified 36 Apis precursors, however only 16 of these precursors were sufficiently characterized to be included in this study. Consequently, the cleavages may be incorrectly annotated. Also, cleaved sites may not always be observed, e.g. Garczynski et al. (2006) reported cleavages in the Drosophila short Neuropeptide F that were unreported by Wegener et al. (2006). Some false-positive results found with some of our models may be true-positive results; for example, two false positives in the Apis logistic model corresponded to a mass-matched peptide in the Apis neuropeptide-1 like precursor that has not been experimentally confirmed (Hummon et al., 2006b). The multi-layer perceptron and k-nearest neighbor methodologies are directly influenced by this incomplete data since these methodologies assume that there are no or very few errors in the data. This assumption results in the perfect prediction of the training data by the multi-layer perceptron models and thus these unreported sites would not be identified by this procedure. In k-nearest neighbor models, the probability of cleavage is determined by the proportion of k-nearest neighbors cleaved such that incorrect prediction will occur if a sufficient number of these neighbors are incorrectly assigned to a cleavage category. The simplicity of the linear logistic regression models considered in this study indirectly reduces the weight of the inaccurate estimates that model terms with errors have in the prediction. 
The influence of unreported cleavage sites has been minimized because of the precursor selection process used in creating the training data sets. The advantage of our approach is that these data sets can be further studied, using our predictions as guides, to discover currently unreported cleavage sites.
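
The sensitivity of the k-nearest neighbor approach to annotation errors can be illustrated with a minimal sketch. The Hamming distance, the value k = 3 and the toy four-residue windows are assumptions chosen for illustration and are not the distance measure, neighborhood size or data used in this study.

```python
def hamming(a, b):
    """Number of positions at which two equal-length windows differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_cleavage_probability(query, training, k=3):
    """Proportion of the k nearest training windows annotated as cleaved.

    training is a list of (window, cleaved) pairs; ties keep list order.
    """
    nearest = sorted(training, key=lambda item: hamming(query, item[0]))[:k]
    return sum(cleaved for _, cleaved in nearest) / k


# Toy training windows (invented), all annotated correctly.
clean = [("GKRS", True), ("AKRS", True), ("GKRT", True),
         ("GAAS", False), ("LLLL", False)]

# Same windows, but two cleaved sites are recorded as non-cleaved,
# as would happen if the cleavage had never been reported.
noisy = [("GKRS", False), ("AKRS", True), ("GKRT", False),
         ("GAAS", False), ("LLLL", False)]

p_clean = knn_cleavage_probability("GKRA", clean)
p_noisy = knn_cleavage_probability("GKRA", noisy)
print(p_clean, round(p_noisy, 3))
```

With correct annotations all three nearest neighbors are cleaved; with two unreported cleavages among them the estimated probability drops to one third, so the same query window would be predicted as non-cleaved at a 0.5 threshold.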

Incorrect predictions in the Various and Insulin-like data sets occurred either multiple times within a single precursor (such as the FMRFamide precursor) or a similar incorrect prediction was repeated multiple times due to the presence of multiple representatives of the same neuropeptide family (such as the PBAN precursor). These incorrect predictions were present in all data sets and were due to incorrect predictions of similar windows rather than a specific influence of precursor or species. The consistent misclassification occurs because of either repetitive sequences within the precursor (such as FMRF-amide), or a high degree of conservation of sequence between precursors of the same family (such as Insulin-like peptides). This situation has the benefit of enhancing model performance when similar windows are correctly predicted, and the drawback of worsening model performance when similar windows are incorrectly predicted. With precursors that have a high degree of homology and consistent cleavage reports, the model performance is higher than with precursors that do not share homology. For example, the bombyxins within the Insulin-like data set had a few incorrect predictions primarily due to the high homology between bombyxin precursors, even though there were no insulin-related peptides in the Apis and Drosophila training data sets.

Another source of ambiguity in precursor processing prediction is the precise location of one or multiple proximal cleavage sites, especially in the presence of tri- and tetra-basic sites. The data sets in this study have assumed that the cleavage occurred at the C-terminal end of these multiple basic sites, unless prior evidence of a peptide with an N-terminal basic amino acid existed. This assumption adversely influences the training process because, regardless of the actual cleavage site, C-terminal and N-terminal basic amino acids can be removed by carboxypeptidases and Arg/Lys aminopeptidases, respectively (Hook, 2006). Thus, after the action of these peptidases, the final peptides would be correctly predicted even though the actual cleavage site was incorrectly predicted.
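
The labeling convention described above for multiple basic sites can be sketched as follows. This is an illustrative reading of the convention, not the procedure used to build the data sets: the function name, the one-letter codes and the use of the 0-based index of the P1 residue are assumptions, and the exception for peptides known to carry an N-terminal basic amino acid is not modeled.

```python
import re


def c_terminal_sites(seq, min_run=3):
    """Assign one cleavage to the C-terminal end of each run of basic residues.

    Returns the 0-based index of the P1 residue (the last Lys or Arg of the
    run) for every run of at least min_run consecutive Lys/Arg residues.
    """
    pattern = re.compile(r"[KR]{%d,}" % min_run)
    return [match.end() - 1 for match in pattern.finditer(seq)]


# A tribasic Lys-Arg-Arg site embedded in an invented sequence.
print(c_terminal_sites("AAKRRGG"))   # the site is placed after the final Arg
print(c_terminal_sites("AAKRGG"))    # a dibasic site is not a tri- or tetra-basic run
```

Under this convention a site such as Lys-Arg-Arg receives a single cleaved label after the final Arg, even when cleavage after the second residue has also been observed.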

The evaluation of model performance in this study also assumes that all enzymes are present to cleave the precursor and all possible neuropeptide cleavage sites will be cleaved. This assumption is necessary because there is limited experimental evidence, especially in insects, to suggest otherwise. An explanation for some false-positive results is that enzymes may be differentially expressed across tissues and conditions and the absence of some experimentally verified neuropeptides is related to the particular sample studied. Cameron et al. (2001) listed 63 cleavage sites from 23 protein precursors that were known to be cleaved by the prohormone convertases PC1/3 and/or PC2. Many sites were either cleaved by both enzymes, or favored by one of the prohormone convertases, illustrating the overlapping function of the prohormone convertases. Cameron et al. (2001) also noted that 12 sites were cleaved only by PC2, and only two by PC1/3. Thus, in mammalian tissues where PC2 is not expressed, the 12 PC2-dependent sites would not be cleaved, and certain peptides, including glucagons, would not be produced. Likewise, Veenstra (2000) indicated that the precursor PBAN may be differentially expressed in Lepidoptera. Wegener et al. (2006) reported differential processing of the D.melanogaster CAPA-precursor where all three CAPA-peptides were detected in the transverse nerves, but only one truncated form (CAPA-3) was detected in the ring gland. As information on the specificity of the individual processing enzymes and tissue location becomes available, these models can be refined to include this information, perhaps even allowing the biochemically-confirmed peptides to be entered so that the models can estimate what enzymes were present during the processing of the precursors to form the observed peptides.

Additional ambiguity in precursor processing occurs when two motifs that would result in different peptides are present near each other. In this situation, cleavage may occur at one motif such that the second cleavage site no longer exists, thereby generating a false-positive result. For example, the A.mellifera allatostatin precursor had a tribasic motif, Lys-Arg-Arg, where cleavage was found as both Lys-Arg↓-Arg and Lys-Arg-Arg↓. Other instances can be found in the P.americana FMRFamide, B.mori PBAN and Aedes aegypti allatotropin precursors. Veenstra (2000) showed that one region of the A.aegypti allatotropin precursor could be cleaved in the presence of any one of three motifs; however, the predominant motif appeared to be Arg-Xxx-Xxx-Arg. Sato et al. (1993) reported that the B.mori PBAN precursor can be cleaved between, and after, the two adjacent Arg sites of the sequence Pro-Arg-Leu-Gly-Arg↓-Arg-Leu-Ser-Glu-Asp-Met to form two PBAN peptides. However, only the cleavage after the two Arg sites has been reported in other insect species (Choi et al., 1998; Duportets et al., 1998; Jacquin-Joly et al., 1998; Ma et al., 1994), even though the Pro-Arg-Leu-Gly-Arg-Arg-Leu sequence is identical in all species. In the P.americana FMRFamide precursor (Predel et al., 2004a), similar cleavages also occurred in the sequence Phe-Ile-Arg-Leu-Gly-Arg↓Arg-Asp-Glu-Glu-Val, and in the sequence Phe-Ile-Arg-Leu-Gly-Lys↓Arg-Ala-Leu-Asp-Gln. In these three cases, the Drosophila model correctly predicted these cleavage sites due to the presence of Gly in the P2 location. The Arg-Xxx-Xxx-Arg motif is a common feature of these examples and is known as the minimal furin motif (Rockwell et al., 2002). Furin is one of the first prohormone convertases active in the prohormone processing pathway (Seidah and Prat, 2002), indicating that this motif would most likely be cleaved by furin. Therefore, any biological ambiguity would most likely be removed before the precursor was exposed to the other proteases present in the processing pathway.

4.4 Genome-enabled prediction
A large number of insect genomes are either undergoing, or are being selected to undergo sequencing. Most of these species currently lack extensive biochemical information, so the proposed processing models will expedite the identification of the most likely neuropeptides present in a range of insect species. This is important as biochemical confirmation of actual peptides present in an organism can occur many years after the availability of the genomic information; the processing models allow directed biological experiments to use a more likely putative peptide complement.

In the present study, the availability of the Apis and Drosophila genomes permitted a genomic division rather than the typical random division into training and testing subsets. It is expected that the genomic division accurately reflects the actual biological situation where all precursors are exposed to the same set of prohormone processing pathways in the same organism. The negative aspect of genome-specific models is that species-specific noise is also incorporated in the models, which can be addressed by comparing different models trained on data from the same or similar species. Training on a random subset of all the data may result in bias towards commonly occurring neuropeptide precursor cleavage sites resulting in poor model generalization. In addition, rare or species-specific cleavage sites may not be uncovered without further experimental data.

The overall performance of the Drosophila and Apis models was very similar. The median (and range of) correct classification rate in the Drosophila and Apis models was 87% (86–90%) and 86% (80–91%), respectively. The median (and range of) area under the curve in the Drosophila and Apis models was 88% (86–93%) and 84% (78–94%), respectively. The lower boundary of performance of the Apis models can be isolated to the poor performance of the Apis multi-layer perceptron model on the Insulin-like data set. As in the comparison of models, there was no single organism that provided the most accurate predictions of neuropeptide precursor cleavage. The Drosophila models had consistent performance across species-specific (Apis), precursor-specific (Insulin-like) and general (Various) test data sets. The Apis models performed similarly to the Drosophila models in species-specific and general test data sets. The lower accuracy of the Apis multi-layer perceptron model was compensated for by the high performance of the Apis logistic regression model in the precursor-specific data set. Thus, with the exception of prediction of cleavage in Insulin-like precursors, all three types of Apis and Drosophila models evaluated had very good overall performance. Apis logistic regression models, Drosophila models and Apis k-nearest neighbor models were preferred, in that order, to predict cleavage in Insulin-like precursors.

The Various dataset encompasses a wide range of neuropeptide precursors across species but excluded all Drosophila and Apis precursors. This dataset offers a better representation of precursor sequences that researchers are more likely to encounter and need to predict cleavage than the precursor-specific Insulin-like data set. In this context, both Drosophila and Apis models are expected to exhibit similar prediction performance as new neuropeptide precursor sequences become available. The current status of neuropeptide precursor research impedes the training of reliable precursor-specific models due to the limited number of precursors with empirically confirmed cleavage information within each precursor family. The Drosophila and Apis models presented in this study will facilitate empirical confirmation of neuropeptide precursors and this data in turn will be used to fine-tune the current models or develop precursor-specific models.

The genomic data is also essential in identifying a more complete set of cleavage products of individual neuropeptide precursors. Without this genomic information, relatively rare cleavage sites would likely remain unpredicted and novel neuropeptides would remain unidentified. The application of the two genome-enabled models helped to discern whether the models were correctly predicting true species-specific cleavage patterns, or detecting data-specific information. These models can easily be incorporated into neuropeptidomic methods (Amare and Sweedler, 2007; Hummon et al., 2006b; Liu et al., 2006; Mirabeau et al., 2007) to accurately predict the most likely set of neuropeptides in the genome. This integrated approach permits identification and characterization of previously unknown peptides that, using existing techniques, may either require exhaustive study and/or remain unidentified.

The multiple cleavage prediction approaches compared in this study had similar performance and no approach clearly out-performed the others across all data sets and evaluation criteria. Prediction of neuropeptide cleavage using multiple approaches is recommended because different experimental objectives may be better supported by different predictive performance criteria. For example, in situations such as high throughput mass spectrometry where false-negative results (failure to predict cleavage) are less desirable than false-positive results, predictions from approaches that minimize false negatives are favored. Conversely, in situations such as targeted experimental verification of neuropeptides, approaches that minimize false-positive predictions and thus make the best use of the resources required are favored. A unique advantage of logistic regression models over other approaches is the straightforward interpretation of the predictive equations. The statistical significance, sign and magnitude of the cleavage predictor effects provide insights into the amino acids and locations that facilitate or hinder cleavage and into the protein–protein interactions involved in precursor cleavage. The NeuroPred website (Southey et al., 2006a) implements the Apis and Drosophila logistic models presented here, as well as our previous non-insect models (Amare et al., 2006; Hummon et al., 2003; Southey et al., 2006b), to predict cleavage sites, calculate model accuracy statistics and peptide mass, including post-translational modifications, resulting from cleavage at the predicted sites.


    ACKNOWLEDGEMENTS
 
This work was supported by the National Institutes of Health; specifically, by the National Institute on Drug Abuse through P30 DA018310 to the UIUC Neuroproteomics Center for Cell to Cell Signaling (http://neuroproteomic.scs.uiuc.edu) and by the National Institute of General Medical Sciences through GM068946 to S.R.Z.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on August 1, 2007; revised on January 24, 2008; accepted on January 24, 2008

    REFERENCES
 

    Agresti A. An Introduction to Categorical Data Analysis. (1996) New York: John Wiley and Sons, Inc.

    Amare A, Sweedler JV. Neuropeptide precursors in Tribolium castaneum. Peptides (2007) 28:1282–1291.

    Amare A, et al. Bridging neuropeptidomics and genomics with bioinformatics: prediction of mammalian neuropeptide prohormone processing. J. Proteome Res (2006) 5:1162–1167.

    Baggerman G, et al. Peptidomic analysis of the larval Drosophila melanogaster central nervous system by two-dimensional capillary liquid chromatography quadrupole time-of-flight mass spectrometry. J. Mass Spectrom (2005) 40:250–260.

    Baggerman G, et al. Peptidomics of the larval Drosophila melanogaster central nervous system. J. Biol. Chem (2002) 277:40368–40374.

    Bairoch A, et al. The universal protein resource (UniProt). Nucl. Acids Res (2005) 33:D154–D159.

    Baldi P, et al. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics (2000) 16:412–424.

    Bendtsen JD, et al. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol (2004) 340:783–795.

    Cameron A, et al. The enzymology of PC1 and PC2. In: The Enzymes Volume XXII. Dalby RE, Sigman DS, eds. (2001) San Diego: Academic Press. 291–332.

    Chen YQ, et al. On neural-network implementations of k-nearest neighbor pattern classifiers. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl (1997) 44:622–629.

    Choi MY, et al. Isolation and identification of the cDNA encoding the pheromone biosynthesis activating neuropeptide and additional neuropeptides in the oriental tobacco budworm, Helicoverpa assulta (Lepidoptera: Noctuidae). Insect Biochem. Mol. Biol (1998) 28:759–766.

    Devi L. Consensus sequence for processing of peptide precursors at monobasic sites. FEBS Lett (1991) 280:189–194.

    Duckert P, et al. Prediction of proprotein convertase cleavage sites. Protein Eng. Des. Sel (2004) 17:107–112.

    Duportets L, et al. The pheromone biosynthesis activating neuropeptide (PBAN) of the black cutworm moth, Agrotis ipsilon: immunohistochemistry, molecular characterization and bioassay of its peptide sequence. Insect Biochem. Mol. Biol (1998) 28:591–599.

    Garczynski SF, et al. Structural studies of Drosophila short neuropeptide F: occurrence and receptor binding activity. Peptides (2006) 27:575–582.

    Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction with 200 Full-color Illustrations. (2001) New York: Springer.

    Henrich S, et al. The crystal structure of the proprotein processing proteinase furin explains its stringent specificity. Nat. Struct. Biol (2003) 10:520–526.

    Holyoak T, et al. 2.4 Å resolution crystal structure of the prototypical hormone-processing protease Kex2 in complex with an Ala-Lys-Arg boronic acid inhibitor. Biochemistry (2003) 42:6709–6718.

    Honeybee Genome Sequencing Consortium. Insights into social insects from the genome of the honeybee Apis mellifera. Nature (2006) 443:931–949.

    Hook VY. Unique neuronal functions of cathepsin L and cathepsin B in secretory vesicles: biosynthesis of peptides in neurotransmission and neurodegenerative disease. Biol.Chem (2006) 387:1429–1439.[CrossRef][Web of Science][Medline]

    Hummon AB, et al. Discovering new invertebrate neuropeptides using mass spectrometry. Mass Spectrom. Rev (2006a) 25:77–98.[CrossRef][Web of Science][Medline]

    Hummon AB, et al. From the genome to the proteome: uncovering peptides in the Apis brain. Science (2006b) 314:647–649.[Abstract/Free Full Text]

    Hummon AB, et al. From precursor to final peptides: a statistical sequence-based approach to predicting prohormone processing. J. Proteome Res (2003) 2:650–656.[CrossRef][Web of Science][Medline]

    Jacquin-Joly E, et al. cDNA cloning and sequence determination of the pheromone biosynthesis activating neuropeptide of Mamestra brassicae: a new member of the PBAN family. Insect Biochem. Mol. Biol (1998) 28:251–258.[CrossRef][Web of Science][Medline]

    Kandel ER, et al. Principles of neural science. (2000) New York: McGraw-Hill, Health Professions Division.

    Liu F, et al. In silico identification of new secretory peptide genes in Drosophila melanogaster. Mol. Cell. Proteomics (2006) 5:510–522.[Abstract/Free Full Text]

    Ma PW, et al. Structural organization of the Helicoverpa zea gene encoding the precursor protein for pheromone biosynthesis-activating neuropeptide and other neuropeptides. Proc. Natl Acad. Sci. USA (1994) 91:6506–6510.[Abstract/Free Full Text]

    Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (1975) 405:442–451.[Medline]

    Mirabeau O, et al. Identification of novel peptide hormones in the human proteome by hidden Markov model screening. Genome Res (2007) 17:320–327.[Abstract/Free Full Text]

    Nanni L, Lumini A. Ensemblator: An ensemble of classifiers for reliable classification of biological data. Pattern Recognit. Lett (2007) 28:622–630.[CrossRef]

    Nassel DR. Neuropeptides in the nervous system of Drosophila and other insects: multiple roles as neuromodulators and neurohormones. Prog. Neurobiol (2002) 68:1–84.[CrossRef][Web of Science][Medline]

    Ohlsson M. WeAidU—a decision support system for myocardial perfusion images using multi-layer perceptron neural networks. Artif. Intell. Med (2004) 30:49–60.[CrossRef][Web of Science][Medline]

    Predel R, et al. Unique accumulation of neuropeptides in an insect: FMRFamide-related peptides in the cockroach, Periplaneta americana. Eur. J. Neurosci (2004a) 20:1499–1513.[CrossRef][Web of Science][Medline]

    Predel R, et al. Peptidomics of CNS-associated neurohemal systems of adult Drosophila melanogaster: a mass spectrometric survey of peptides from individual flies. J. Comp. Neurol (2004b) 474:379–392.[CrossRef][Web of Science][Medline]

    Rholam M, et al. Role of amino acid sequences flanking dibasic cleavage sites in precursor proteolytic processing. The importance of the first residue C-terminal of the cleavage site. Eur. J. Biochem (1995) 227:707–714.[Web of Science][Medline]

    Rockwell NC, et al. Precursor processing by kex2/furin proteases. Chem. Rev (2002) 102:4525–4548.[CrossRef][Web of Science][Medline]

    Sato Y, et al. Precursor polyprotein for multiple neuropeptides secreted from the suboesophageal ganglion of the silkworm Bombyx mori: characterization of the cDNA encoding the diapause hormone precursor and identification of additional peptides. Proc. Natl Acad. Sci. USA (1993) 90:3251–3255.[Abstract/Free Full Text]

    Schechter I, Berger A. On the size of the active site in proteases. I. Papain. Biochem. Biophys. Res. Commun (1967) 27:157–162.[CrossRef][Web of Science][Medline]

    Schumacher M, et al. Neural networks and logistic regression: Part I. Comp. Stat. Data Anal (1996) 21:661–682.[CrossRef]

    Seidah NG, Prat A. Precursor convertases in the secretory pathway, cytosol and extracellular milieu. Essays Biochem (2002) 38:79–94.[Web of Science][Medline]

    Southey BR, et al. NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides. Nucl. Acids Res (2006a) 34:W267–W272.[Abstract/Free Full Text]

    Southey BR, et al. Prediction of neuropeptide prohormone cleavages with application to RFamides. Peptides (2006b) 27:1087–1098.[CrossRef][Web of Science][Medline]

    Vach W, et al. Neural networks and logistic regression: Part II. Comp. Stat. Data Anal (1996) 21:683–701.[CrossRef]

    Veenstra JA. Mono- and dibasic proteolytic cleavage sites in insect neuroendocrine peptide precursors. Arch. Insect Biochem. Physiol (2000) 43:49–63.[CrossRef][Web of Science][Medline]

    Wegener C, et al. Direct mass spectrometric peptide profiling and fragmentation of larval peptide hormone release sites in Drosophila melanogaster reveals tagma-specific peptide expression and differential processing. J. Neurochem (2006) 96:1362–1374.[CrossRef][Web of Science][Medline]

