Bioinformatics Advance Access originally published online on July 15, 2008
Bioinformatics 2008 24(17):1858-1864; doi:10.1093/bioinformatics/btn339
Artificial neural network for prediction of antigenic activity for a major conformational epitope in the hepatitis C virus NS3 protein
1Division of Viral Hepatitis and 2Biotechnology Core Facility, Division of Scientific Resources, Centers for Disease Control and Prevention, 1600 Clifton Road MS A33, Atlanta, GA, 30333, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Insufficient knowledge of general principles for accurate quantitative inference of biological properties from sequences is a major obstacle in the rational design of proteins with predetermined activities. Because of this deficiency, protein engineering frequently relies on computational approaches focused on identifying the quantitative structure–activity relationship (SAR) for each specific task. In the current article, a computational model was developed to define the SAR for a major conformational antigenic epitope of the hepatitis C virus (HCV) non-structural protein 3 (NS3) in order to facilitate the rational design of HCV antigens with improved diagnostically relevant properties.
Results: We present an artificial neural network (ANN) model that connects changes in the antigenic properties and structure of HCV NS3 recombinant proteins representing all 6 HCV genotypes. The ANN performed quantitative predictions of the enzyme immunoassay (EIA) Signal/Cutoff (S/Co) profiles from sequence information alone with 89.8% accuracy. Amino acid positions and physicochemical factors strongly associated with the HCV NS3 antigenic properties were identified. The positions most significantly contributing to the model were mapped on the NS3 3D structure. The location of these positions validates the major associations found by the ANN model between antigenicity and structure of the HCV NS3 proteins.
Availability: Matlab code is available at the following URL address: http://bio-ai.myeweb.net/box_widget.html
Contact: jlara{at}cdc.gov; yek0{at}cdc.gov
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
A rational design of proteins with predetermined properties requires accurate knowledge of the quantitative structure–activity relationship (SAR). Unfortunately, our SAR knowledge is very limited and does not allow for reliable engineering of proteins with desirable activities. Hence, many protein engineering experiments rely on computational strategies focused on one specific task at hand (Schneider and Wrede, 1998; Zhang et al., 2005). Machine learning algorithms are well suited to guiding the design of proteins with novel desirable properties. Machine learning techniques such as support vector machines (Cui et al., 2006), hidden Markov models (Mamitsuka, 1998), evolutionary algorithms (Hohm et al., 2006) and artificial neural networks (ANN) (Zhang et al., 2005) have been successfully applied to explore SARs in proteins and to design protein structures with predetermined properties. In general, the accuracy of machine learning approaches in predicting protein properties ranges between 70% and 90%. More notably, machine learning approaches allow the search for polypeptide candidates to be focused and expanded to a large number of sequences in a cost-effective manner and in less time than is possible with conventional molecular techniques alone, such as DNA shuffling (Fox, 2005).
Hepatitis C virus (HCV) is the major etiologic agent of blood-borne non-A, non-B hepatitis (Choo et al., 1990). The development of affordable HCV diagnostic assays with improved specificity and sensitivity continues to be a major public health challenge. Because HCV genotypes have characteristic geographical distributions, current diagnostic kits do not always perform equally well in all parts of the world (Kanistanon et al., 2002).
One of the strategies adopted for generating immunoreactive forms of HCV antigens for use in diagnostic assays involves characterization of antigenic determinants derived from different HCV strains (Lin et al., 2005). However, probing the entire HCV sequence space for molecules with desired properties is too onerous to be practicable. An alternative approach is to define the SAR between protein structure and antigenicity to facilitate the development of antigens with improved diagnostically relevant properties (Cui et al., 2007).
The HCV NS3 protein contains diagnostically relevant, conformation-dependent immunodominant B cell epitopes (Khudyakov et al., 1995; Lin et al., 2005; Ou-Yang et al., 1999). One of the HCV NS3 conformational antigenic regions could be efficiently modeled with recombinant proteins 103 amino acids long (Khudyakov et al., 2002). The effect of sequence variation on the antigenic properties of this region was studied using a set of 12 recombinant proteins derived from six known HCV genotypes. The study showed that some changes in primary structure can result in significant variation of antigenic properties.
The current article describes an investigation of the structural parameters that quantitatively define the immunoreactivity of this HCV NS3 conformational antigenic region. To our knowledge, this is the first study of quantitative SAR analysis of HCV antigenic epitopes.
| 2 METHODS |
|---|
2.1 Dataset
Twelve HCV NS3 protein variants comprising amino acid positions 331–433 of the HCV NS3 helicase domain (positions 1357–1459 of the HCV polyprotein) were expressed from synthetic genes and tested by enzyme immunoassay (EIA) against a panel of anti-HCV positive sera from patients in diverse geographical settings infected with different HCV genotypes (Khudyakov et al., 2002). Variants were tested against 115 anti-HCV positive serum specimens. Of these, 107 serum samples were included in the training dataset, as eight samples did not bind to any of the synthetic NS3 proteins. The strength of serum reaction to the NS3 variants was measured as EIA Signal/Cutoff (S/Co) values. The S/Co values for positive reactions ranged between 1.00 and 16.95 (see Supplementary Fig. 1). For ANN training, S/Co values were normalized to range from 0 to 1 using the following equation: Normalized value = (Actual value – Minimum value) / (Maximum value – Minimum value). In addition, irrelevant proteins (poly-glycine and poly-alanine) and randomized sequences with the same amino acid composition as the NS3 variants were included in the training set to provide the ANN with negative examples. The steps of the training scheme were as follows: (1) leave one sample out for testing and take the remaining protein variants for training; (2) train the ANN; (3) test the ANN performance with the left-out sample; (4) repeat Steps 1–3 for all NS3 recombinant variants; and (5) evaluate the overall performance of the ANN. The optimal ANN from these model-building scenarios was selected for relative weight analysis.
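The normalization formula and the leave-one-out scheme described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study; the function names `min_max_normalize` and `leave_one_out` are hypothetical, and the training and testing steps themselves are left to the caller.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1] as (value - minimum) / (maximum - minimum)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


def leave_one_out(samples: Sequence) -> List[Tuple[list, object]]:
    """Return one (training_set, held_out_sample) pair per sample.

    Each sample is held out exactly once (Steps 1-4 of the scheme);
    the caller trains on the training set and tests on the held-out item.
    """
    return [
        (list(samples[:i]) + list(samples[i + 1:]), samples[i])
        for i in range(len(samples))
    ]
```

With the S/Co range reported above, `min_max_normalize` maps 1.00 to 0 and 16.95 to 1, and `leave_one_out` applied to 12 variants yields 12 splits of 11 training variants each.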
2.2 Sequence encoding
Encoding schemes for protein sequence representation can be intricate and can greatly affect the performance of the ANN. A common practice is to encode each of the 20 letters corresponding to the 20 amino acids into a numerical scheme. For instance, each letter can be represented by a 20-dimensional binary vector (the 20-bin representation) or by a lower-dimensional vector based on the known physicochemical properties of each amino acid. The latter scheme was used in this study. A total of six physicochemical properties, and several scales representing them, were examined. Herein, we present results from the following encoding schemes: hydrophobicity (Engelman et al., 1986), volume (Schneider and Wrede, 1998), polarity (Schneider and Wrede, 1998), secondary structure propensities (Creighton, 1993), predicted secondary structure (Stultz et al., 1993, 1997; White et al., 1994) and eigenvalues derived by principal component analysis (PCA) (Schneider and Wrede, 1998). The HCV NS3 sequences were aligned and transformed into an n-dimensional input vector for ANN processing: each amino acid in the sequence was represented by a numerical vector of its physicochemical property values. With 103 positions and up to six features per position, sequences were therefore represented by up to 618 input features during training. The EIA S/Co values were used as the measure of antigenicity of the HCV NS3 variants, representing the strength of antigen–antibody binding reactions.
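A minimal Python sketch of property-based encoding follows. It only illustrates how a 103-residue sequence becomes a 309- or 618-dimensional vector; the helper `placeholder_scale` produces made-up numbers and is an assumption of this sketch, not one of the published scales cited above, which would have to be substituted in practice.

```python
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def placeholder_scale(offset: float) -> Dict[str, float]:
    """Hypothetical per-residue scale (made-up values, NOT a published scale)."""
    return {aa: offset + i / 19.0 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(seq: str, scales: Sequence[Dict[str, float]]) -> List[float]:
    """Concatenate, position by position, one value per scale for each residue."""
    features: List[float] = []
    for residue in seq:
        for scale in scales:
            features.append(scale[residue])
    return features


# A 103-residue toy sequence (arbitrary letters, not an NS3 variant).
toy_sequence = "ACDEFGHIKL" * 10 + "MNP"
three_scales = [placeholder_scale(k) for k in range(3)]
six_scales = [placeholder_scale(k) for k in range(6)]
```

Encoding `toy_sequence` with three scales gives 103 × 3 = 309 features, and with six scales 103 × 6 = 618 features, matching the input dimensions mentioned in the text.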
2.3 ANN model
The ANN architecture used in this work is a fully connected feed-forward network consisting of three layers of neurons. In this architecture, every neuron of the hidden and output layers is connected to every neuron of the previous layer by weighted links and is activated by the sigmoid transfer function, f(x) = 1/(1 + e^(−x)); it is considered suitable for prediction problems (Schneider and Wrede, 1998; Su et al., 2005). The number of input units in the input layer was set according to the input vector dimensions, and the number of output units in the output layer was set to 107 (the number of serum samples in the training set). The ANN was trained using the back-propagation with momentum learning algorithm (Rumelhart et al., 1986a). The generalized delta rule (Rumelhart et al., 1986b) was used to update the weights for error minimization. The optimal size of the ANN architecture was determined by testing several ANN topologies for prediction accuracy, from 0 up to 200 hidden neurons. Following this stepwise optimization approach, the final number of units in the hidden layer was set to 159 based on the accuracy of simulations (data not shown). The number of training cycles was set to 1500 epochs. The learning rate was set to 0.1 and the momentum to 0.3. Updating of the network during training epochs was performed in topological order. The weights of the connections were initialized randomly between -0.5 and 0.5. In addition, a term was added to the computation to jog the weights past possible local minima and so help find the global minimum. Several combinations and scales of physicochemical properties were also tested, with each amino acid type represented by three and up to a maximum of six features. Once the architecture was established, the ANN was trained to map a string of real numbers representing amino acid physicochemical properties onto 107 real-valued output neurons corresponding to the EIA S/Co values. ANN simulations were performed with the Stuttgart Neural Network Simulator (SNNS), version 4.2 (http://www-ra.informatik.uni-tuebingen.de/SNNS/).
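To make the training procedure concrete, the following pure-Python sketch implements a three-layer sigmoid network trained by online back-propagation with momentum, using the learning rate (0.1), momentum (0.3) and weight initialization range (-0.5 to 0.5) quoted above. It is an illustration under simplifying assumptions and not the SNNS implementation: the class name `ThreeLayerNet` is hypothetical, the weight-jogging term and the topological update order are not reproduced, and the layer sizes in any practical demonstration would be far smaller than the 159 hidden and 107 output units of the study.

```python
import math
import random
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _random_matrix(rng: random.Random, rows: int, cols: int) -> List[List[float]]:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ThreeLayerNet:
    """Fully connected input -> hidden -> output network with sigmoid units."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)
        # The extra column in each weight matrix holds the bias weight.
        self.w1 = _random_matrix(rng, n_hidden, n_in + 1)
        self.w2 = _random_matrix(rng, n_out, n_hidden + 1)
        # Previous weight changes, needed for the momentum term.
        self.dw1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw2 = [[0.0] * (n_hidden + 1) for _ in range(n_out)]

    def forward(self, x: Sequence[float]):
        xb = list(x) + [1.0]  # append constant bias input
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return xb, hb, y

    def train_step(self, x, target, lr: float = 0.1, momentum: float = 0.3) -> float:
        """One online update; returns the squared error before the update."""
        xb, hb, y = self.forward(x)
        # Output deltas: (target - output) * f'(net), where f' = y * (1 - y).
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, y)]
        # Hidden deltas: output deltas propagated back through w2 (bias excluded).
        d_hid = [
            hb[j] * (1.0 - hb[j])
            * sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
            for j in range(len(hb) - 1)
        ]
        for w, dw, deltas, inputs in (
            (self.w2, self.dw2, d_out, hb),
            (self.w1, self.dw1, d_hid, xb),
        ):
            for k, delta in enumerate(deltas):
                for j, v in enumerate(inputs):
                    # Generalized delta rule with a momentum term.
                    dw[k][j] = lr * delta * v + momentum * dw[k][j]
                    w[k][j] += dw[k][j]
        return sum((t - o) ** 2 for t, o in zip(target, y))
```

A toy usage would create `ThreeLayerNet(n_in=2, n_hidden=3, n_out=1)` and call `train_step` repeatedly over a small set of (input, target) pairs for a fixed number of epochs, monitoring the summed squared error per epoch.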
2.4 Model evaluation
After each training cycle, the predicted output values for a given sequence were evaluated. If the predicted output values correlated with the observed antigenic activity and fell within a set deviation limit (cutoff=±5% for anti-HCV negative samples or a maximum of ±25% for anti-HCV positive samples), the ANN output was considered a correct prediction. Four measures were used to evaluate the ANN model prediction performance: specificity (SP), sensitivity (SN), accuracy (AC) and correlation coefficient (cc):
[Equations (1)–(4), defining SP, SN, AC and cc, are not reproduced in this copy.]
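The four measures named in this section are commonly computed from counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The sketch below uses those common textbook definitions, with the Matthews correlation coefficient standing in for cc. These are assumptions of this sketch: the article's own Equations (1)–(4) may be defined differently, and the function name `confusion_metrics` is hypothetical.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Common definitions (assumed here, not taken from the article)."""
    sn = _ratio(tp, tp + fn)                   # sensitivity
    sp = _ratio(tn, tn + fp)                   # specificity
    ac = _ratio(tp + tn, tp + tn + fp + fn)    # accuracy
    # Matthews correlation coefficient, used here as an assumed form of cc.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = _ratio(tp * tn - fp * fn, denom)
    return {"SN": sn, "SP": sp, "AC": ac, "cc": cc}
```

For example, `confusion_metrics(40, 50, 5, 5)` describes 100 predictions of which 90 are correct, so the accuracy entry is 0.9.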
2.5 Other methods and software programs
Randomized sequences used in the training set were generated by first using the statistical analysis of protein sequences (SAPS) program (http://www.ebi.ac.uk/saps/) (Brendel et al., 1992) to obtain the percentage amino acid composition of each NS3 variant. Random sequences were then generated with the RandSeq program (http://us.expasy.org) using the user-specified composition (in percent) parameter.
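As a rough illustration of a composition-matched negative control, the sketch below shuffles the residues of a sequence. This is a swapped-in technique, not the SAPS/RandSeq workflow described above: shuffling preserves the exact residue counts, whereas generating from percent composition only matches them approximately. The function names are hypothetical.

```python
import random
from collections import Counter


def shuffled_control(seq: str, seed: int = 0) -> str:
    """Return a random permutation of seq with identical residue counts."""
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def same_composition(a: str, b: str) -> bool:
    """True when both sequences contain each residue the same number of times."""
    return Counter(a) == Counter(b)
```

Calling `same_composition(seq, shuffled_control(seq))` returns `True` for any input sequence, which is the property the negative examples rely on.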
For molecular modeling, the A chain of PDB entry 1cu1 was used as a template onto which the exposed relevant positions, as determined by weight analysis, were mapped. NS3 recombinant variant sequences corresponded to positions 331–433 in SCOP domain 3 of the A chain of 1cu1. Positions that reached a relative weight cutoff of 5.0 for hydrophobicity and/or 3.0 for volume and polarity were considered relevant positions. Accessible surface area (ASA) of residues in 1cu1 corresponding to these positions was determined using a fast heuristic algorithm as implemented in the Deepview/Swiss PdbViewer program version 3.7 (Guex and Peitsch, 1997). Residues that met the 25% surface accessibility threshold were considered accessible and mapped to the surface. All molecular model figures were generated using the Deepview/Swiss PdbViewer program.
Prediction of conformational epitopes (CEs) was done by submitting the crystallographic structure of the A chain of 1cu1 to the conformational epitope prediction (CEP) server (http://202.41.70.74:8080/cgi-bin/cep.pl), which is based on the method of Kolaskar and Kulkarni-Kale (Kolaskar and Kulkarni-Kale, 1999; Kulkarni-Kale et al., 2005). Prediction of sequential antigenic determinants was done by submitting the sequence of 1cu1 to the Predicted Antigenic Peptides server (http://bio.dfci.harvard.edu/), which is based on the method of Kolaskar and Tongaonkar (1990), and to the CEP server.
For analysis of the antigenic properties of HCV NS3 proteins from all genotypes, the ANN model was transformed into a C code function using the snns2c program in SNNS and implemented in MATLAB. A total of 138 HCV NS3 protein sequences of different genotypes were randomly collected from GenBank. Sequences corresponding to the 103 amino acid region of the NS3 variants were tested with the trained ANN model to predict their breadth and strength of immunoreactivity.
| 3 RESULTS |
|---|
HCV NS3 sequences (see Supplementary Fig. 2) were initially transformed into 309-dimensional input vectors using three physicochemical property scales: hydrophobicity^a (Engelman et al., 1986), volume (Schneider and Wrede, 1998) and polarity (Schneider and Wrede, 1998). Using these features, a neural network with 159 hidden neurons reproduced the training S/Co values with a relative deviation of 14% and cc=0.8807, while prediction on the left-out sample gave a maximum test data deviation of 28% and cc=0.8624 (data not shown). Further optimization of the ANN internal parameters reproduced S/Co values with a relative deviation from training data of 8%, cc=0.8796, and a maximum test data deviation of 21%, cc=0.8615 (data not shown).
3.1 Combinations of feature representations
The overall averaged performance of the ANN models in the leave-one-out cross-validation (LOOCV) test is shown in Supplementary Table 1. ANN performance was measured as the accuracy of correctly identifying positive and negative reactions of anti-HCV sera. ANN simulations were conducted using different encoding schemes of amino acid representations: (1) combining hydrophobicity, volume and polarity physicochemical scales (3-propertiesA and 3-propertiesB); (2) using secondary structure information (2D propensities and 2D predicted); (3) using the first three or five eigenvector components derived by PCA from a collection of 143 amino acid properties (Schneider and Wrede, 1998) (3-PCA and 5-PCA, respectively); and (4) combining physicochemical scales with secondary structure information (3-propertiesB and 2D propensities, and 3-propertiesB and 2D predicted).
3.1.1 Physicochemical properties
The best overall performance from ANN simulations was observed using the physicochemical scales of normalized hydrophobicity^aa (Engelman et al., 1986), volume and polarity (3-propertiesB). Replacing the non-normalized hydrophobicity^a scale used in the original model (3-propertiesA) with the normalized hydrophobicity^aa scale increased the overall accuracy from 75.5%, cc=0.7729, to 89.8%, cc=0.8114 (compare Rows 1 and 2, Supplementary Table 1). Adding more features to this scheme (Rows 7 and 8, Supplementary Table 1) resulted in a significant decrease in overall performance.
3.1.2 Secondary structure properties
Secondary structure propensities and predicted secondary structure preferences were used for feature representation. Secondary structure propensities were adopted from Creighton (1993). Secondary structure preferences were predicted with the Type 1 model analysis of the protein sequence analysis (PSA) server (http://bmerc-www.bu.edu/) (Stultz et al., 1993; White et al., 1994). The ANN failed to find a significant correlation between these secondary structure patterns and the EIA S/Co values from which to establish an SAR (Rows 3 and 4, Supplementary Table 1).
3.1.3 Physicochemical-derived eigenvalues
The encoding scheme using eigenvalues for each amino acid type derived by PCA from 143 physicochemical scales was adopted from Schneider and Wrede (1998). Two schemes were implemented. First, we used feature input vectors derived from the first three eigenvector components (the 3-PCA scheme; Row 5, Supplementary Table 1), since they should account for 84.0% of the variance in the 143-property scale data. The overall performance of the ANN with this encoding scheme attained about 75% accuracy, cc=0.7623, which is comparable to the performance using the 3-propertiesA scheme. Second, we used feature input vectors derived from all five components (5-PCA; Row 6, Supplementary Table 1), which caused the overall performance of the ANN to deteriorate.
3.1.4 Evaluation of SAR learning
To determine whether the ANN algorithm was learning the underlying SAR regulating the antigenic properties of protein variants, simulations were also carried out with antigenic profiles randomly assigned to protein variants (3-propertiesB(r); Supplementary Table 1), as opposed to having the ANN learn the SAR between the variants' primary structure and their respective antigenic profiles. In contrast to the 3-propertiesB scheme (Row 2, Supplementary Table 1), this manipulation caused the overall accuracy to drop from 89.8% to 46.5%.
3.2 Combinations of feature representations
Input features from the best representation scheme (3-propertiesB; Supplementary Table 1) were removed one at a time to examine their relative significance, i.e. removal of hydrophobicity (2-propertiesB–h), polarity (2-propertiesB–p) or volume (2-propertiesB–v). Removing hydrophobicity (2-propertiesB–h; Row 10, Supplementary Table 1) resulted in an overall decrease of about 17% in accuracy of prediction (cc=0.6645), while removing either polarity (2-propertiesB–p; Row 11, Supplementary Table 1) or volume (2-propertiesB–v; Row 12, Supplementary Table 1) decreased accuracy of prediction by 12% (cc=0.7697) and 8% (cc=0.7489), respectively. In a three-layer fully connected ANN, and given the size of our system, the weighted relationships are too complex and demanding for direct examination. Accordingly, we quantified the relative weights between the input and hidden layers to determine the effect of input features on the ANN. Based on the best ANN model, hydrophobicity was the most heavily weighted input feature (relative weight=46.3), while volume and polarity had similar weighting (relative weights of 28.0 and 25.6, respectively). These results point to hydrophobicity as the most relevant ANN input feature contributing to the prediction of antigenicity.
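The text does not spell out how relative weights were computed. One simple possibility, shown here purely as an assumption, is to sum the absolute input-to-hidden weights belonging to each property and express each sum as a percentage of the total. The function name `relative_weights` and the weight layout are hypothetical.

```python
from typing import Dict, Sequence


def relative_weights(
    w_input_hidden: Sequence[Sequence[float]],
    feature_labels: Sequence[str],
) -> Dict[str, float]:
    """Percentage of total absolute input-to-hidden weight per feature label.

    w_input_hidden[h][i] is the weight from input unit i to hidden unit h;
    feature_labels[i] names the property encoded by input unit i.
    """
    totals: Dict[str, float] = {}
    for row in w_input_hidden:
        for label, weight in zip(feature_labels, row):
            totals[label] = totals.get(label, 0.0) + abs(weight)
    grand_total = sum(totals.values())
    if grand_total == 0.0:
        raise ValueError("all weights are zero")
    return {label: 100.0 * t / grand_total for label, t in totals.items()}
```

For a position-by-position encoding, the label list would repeat the property names once per sequence position, for example `["hydrophobicity", "volume", "polarity"] * 103`. Keying the labels by position instead would give a per-position profile in the same way.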
3.3 Contribution of individual protein positions to antigenic properties
Figure 1 shows the relative weights of the connections between the first and hidden layer of the neural network from each physicochemical attribute along the sequence strand. As expected, hydrophobicity accrued the largest relative weight values (Fig. 1A). The most weighted positions were 343, 377, 381, 383, 384 and 394. The heaviest mapping involved the middle positions of the sequence (positions 375–394). Meanwhile, weights for volume (Fig. 1B) tended to cluster between positions 337–343, 359–365, 379–387, 396–405 and 424–426, of which positions 339, 359, 382 and 396 had the largest weight for this parameter. Relative weights for polarity (Fig. 1C) distributed mainly between positions 333–344, 365–401 and 418–426 of which positions 333 and 375 had greater weights. In some positions, more than one feature was significantly weighted: hydrophobicity + polarity in positions 336, 375, 383 and 390; hydrophobicity + volume in positions 343 and 359; polarity + volume in positions 333, 338, 365, 380, 425 and 426. For all other positions, only one feature was significantly weighted.
Analysis of the variability along sequence positions revealed that the ANN did not give significant weights for any feature at positions 358, 376, 399, 403, 410 and 431, which had >25% variability between variants. This suggests that not all variable amino acid positions contribute or influence immunoreactivity properties of these 12 variants. Meanwhile, approximately half of the weighted positions in the best ANN model involved conserved positions.
3.4 Location of antigenically relevant positions in 3D structure
Positions associated with large relative weights in the ANN were mapped on the HCV NS3 protein structure, PDB: 1cu1 (Yao et al., 1999), to examine the stereochemical relation between these residues and potentially gain insight into their function. Sequence identity between the HCV NS3 helicase 1cu1 sequence and the 12 recombinant NS3 variants studied in this article ranged between 78% and 91%. Since the template sequence has at least 78.0% identity with the sequences of the NS3 variants, good structural agreement between the template and the actual molecular models of the variants can be expected (Rhodes, 2006).
Figure 2A shows the distribution pattern of weighted residues mapped on the protein's surface. A total of 25 positions were mapped to the surface, corresponding to residues that met the 25% threshold for percent accessible surface area (%ASA) in 1cu1. The other antigenically relevant positions identified by the ANN were mapped as buried. The surface-exposed positions, which are discontinuous in the linear sequence (Fig. 1), tend to cluster close together in 3D space. Mapped residues grouped into three major clusters, denoted C1, C2 and C3, on the protein's surface. About 70% of residues in these clusters involved variable positions (Fig. 2B). A fourth small cluster, involving positions 393, 394 and 396, was located at the interface between the NS3 region analyzed and the rest of the 1cu1 molecule (not shown).
To investigate the biological significance or role of the weighted positions corresponding to these clusters of exposed residues, two computer-based algorithms were used: one to predict continuous (linear) antigenic determinants and another to predict discontinuous, i.e. CEs.
Tables 1 and 2 show the predicted continuous epitopes and CEs, respectively. Table 1 lists the antigenic determinants predicted by two different methods for continuous epitopes, namely the CEP server (Kolaskar and Kulkarni-Kale, 1999; Kulkarni-Kale et al., 2005) and the Antigenic Prediction server (Kolaskar and Tongaonkar, 1990), while Table 2 lists the predicted antigenic determinants forming part of CEs according to the CEP server (Kolaskar and Kulkarni-Kale, 1999; Kulkarni-Kale et al., 2005). Comparisons were made between weighted positions in the ANN for the NS3 variants and the antigenic determinant predictions on the sequence of 1cu1 corresponding to the same NS3 region; see Supplementary Tables 2 and 3 for a full description. As observed in Figure 2A and C, there is good agreement between the weighted positions in the ANN corresponding to exposed residues and the antigenic determinants predicted by both methods.
3.5 Antigenic properties of HCV NS3 sequences
The ANN model developed in this study allowed rapid in silico testing of a large number of HCV NS3 protein sequences of different genotypes collected from GenBank. As shown in Figure 3, HCV proteins of different genotypes demonstrated a broad range in predicted breadth of immunoreactivity. HCV NS3 protein variants from genotype 1a/1b were all immunoreactive with <65% of serum specimens. The variants from genotype 2 were all predicted to be broadly immunoreactive (>70% of serum specimens). However, genotype 6 sequences were predicted to have a wide range of breadth of immunoreactivity. Analysis of these 138 proteins failed, however, to identify genotype specificity in their immunoreactivity. For example, genotype 1 HCV proteins were predicted to be immunoreactive with a similar percentage of serum specimens of genotypes 1, 4 and 6. Genotype 2 proteins were always predicted to be among the most broadly immunoreactive with serum specimens of all HCV genotypes. In concert with this observation, S/Co predictions showed that genotype 2 proteins consistently had the highest S/Co values with all serum specimens (data not shown).
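Breadth of immunoreactivity, as used in this section, can be read as the percentage of serum specimens with which a protein is predicted to react. The sketch below makes that reading explicit. It assumes that a predicted S/Co value of at least 1.0 counts as a positive reaction (the Methods report positive reactions starting at 1.00) and that predictions have already been converted back from the normalized 0 to 1 scale; the function name is hypothetical.

```python
from typing import Sequence


def breadth_of_immunoreactivity(
    predicted_s_co: Sequence[float], cutoff: float = 1.0
) -> float:
    """Percentage of serum specimens predicted reactive (S/Co >= cutoff).

    The cutoff of 1.0 is an assumption of this sketch.
    """
    if not predicted_s_co:
        raise ValueError("no predictions supplied")
    reactive = sum(1 for value in predicted_s_co if value >= cutoff)
    return 100.0 * reactive / len(predicted_s_co)
```

Applied to the 107 predicted outputs for one sequence, the function returns a single percentage that can be compared across genotypes, as in the <65% and >70% figures quoted above.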
| 4 DISCUSSION |
|---|
In the current study we have developed an ANN model that is capable of predicting the antigenic properties of HCV NS3 proteins from sequence information alone. ANN-based systems are flexible and adaptive for modeling arbitrary non-linear relationships. For some applications, the ANN can lead to more accurate SAR models than other machine learning techniques (Sutherland et al., 2004) and it has been successfully applied to relate strength of immunoreactivity with peptide structure (Zhang et al., 2005).
The accuracy of an ANN SAR model strongly depends on the number and types of features used for sequence encoding and representation. The poor performance of the ANN, when using the predicted secondary structure (Row 4, Supplementary Table 1), was caused most probably by low accuracy of secondary structure predictions. In fact, preliminary homology modeling of 3D molecular structures of the NS3 variants showed that the secondary structures derived from such models significantly deviate from those predicted by the PSA server (data not shown). In addition, a large number of input units can result in poor generalizations by the ANN, especially in cases where data is sparse. For example, increasing the encoding representation from 3 to 6 features, i.e. from 309 inputs to 618 inputs, dropped the accuracy of predictions to <59% (Rows 7 and 8, Supplementary Table 1).
The high accuracy of the ANN model is likely not due to random statistical correlations, but rather to specific information content in the sequence profiles. The simulations in which EIA profiles were randomly assigned to protein variants resulted in a dramatic decrease in prediction accuracy, to 46.5% (Row 9, Supplementary Table 1). The purpose of this randomization was to test whether the ANN was learning the structure in the patterns of the data, as opposed to learning the structure of random patterns in the data. High prediction accuracy on randomized data would have indicated that the ANN was learning to explain noise rather than specific patterns.
During training, the ANN was taught to map a set of input patterns to a set of corresponding output patterns. In a feed-forward ANN with a hidden layer, such mapping occurs in the hidden layer and is reflected in the weights of the network connectivity. In theory, the greater the weight value in a connection, the greater the importance of the parameter(s) linked to or associated with that connection (Arbib, 2003). Weight analysis revealed some important positions and features contributing to antigenic activity (Fig. 1A, B and C). It was previously observed that substitution Y418F correlated with a noticeable decrease in cross-immunoreactivity of the HCV NS3 proteins (Khudyakov et al., 2002). In concordance with this early observation, weight analysis plots showed that there was indeed a strong association between position 418 and antigenic properties of the HCV NS3 proteins (Fig. 1C), and further indicated that polarity at this position is a factor contributing to the differences in antigenicity between the HCV NS3 variants. Additionally, hydrophobicity at position 343 (Fig. 1A) correlated with a decrease in antigenicity. For instance, the substitutions T343N, T343Q or T343S decreased breadth of immunoreactivity, whereas the substitution T343L, favored antigenic properties.
Comparison of the relative weight plots with sequence heterogeneity shows that many variable positions did not gain relative weights above the cutoffs and, thus, appear not to be relevant for modeling the antigenic properties of the NS3 CE variants. This suggests that, although many sites undergo amino acid substitutions, these changes may leave the conformation essentially unchanged and do not affect antibody binding sites. This observation is consistent with previous reports for other proteins that underwent site-specific or random mutagenesis (Reddy et al., 1998). At the same time, a significant number of invariant positions were assigned high weights (Fig. 1). Two possible explanations for this observation are: (1) the ANN was trained using irrelevant and randomized proteins as non-immunoreactive antigens in addition to the 12 HCV NS3 proteins, which may lead it to recognize the significance of positions that are invariant within these HCV NS3 proteins; and (2) amino acid sequences were encoded as a set of profiles of physicochemical properties, so that invariant positions may contribute to these profiles just as variable positions do.
Nonetheless, interpretation of the weights of the connections between neurons in a three-layer ANN should be done with caution (Swingler, 1996). Even so, the strong agreement in location between surface-exposed positions associated with high relative weights in the ANN model and the predicted antigenic determinants (Table 1 and Fig. 2C) supports the relevance of the positions identified with the ANN model. It is reasonable to suggest that the clusters shown in Figure 2 most probably represent positions forming part of antibody binding sites, whereas positions with a lower probability of exposure on the surface of the protein globule (Fig. 1) are most probably responsible for the proper positioning of residues of the antibody binding sites.
Two important observations can be made from the results of testing the 138 HCV NS3 proteins (Fig. 3). First, NS3 proteins from different HCV genotypes and subgenotypes do not demonstrate similar patterns of immunoreactivity. HCV subgenotypes 1a and 1b were predicted to be rather moderately immunoreactive, whereas proteins of subgenotypes 1c, 2a and 2b were predicted to be the most broadly immunoreactive. Such disparity in the distribution of antigenic properties among HCV subgenotypes suggests significant functional differences between subgenotypes and should be taken into consideration in diagnostic, molecular virological and molecular epidemiological research. Second, none of the HCV NS3 proteins demonstrated genotype- or subtype-specific immunoreactivity with serum specimens. Taken together, these observations suggest that not every NS3 protein, irrespective of the HCV genotype or subtype from which it is derived, is useful for detecting a broad range of antibodies in serum specimens from patients infected with different genotypes. The HCV NS3 proteins derived from subgenotypes 1c, 2a and 2b are among the most suitable targets for assay development.
| 5 CONCLUSION |
|---|
Because protein engineering experiments rely on synthetic genes and on labor-intensive experimental quantification of the biological properties of engineered proteins, the datasets they generate are frequently limited in size. This scarcity of data poses a serious challenge to reliable mathematical modeling of quantitative SAR. In the present study, the ANN model built using antigenic profiles for only 12 HCV NS3 sequence variants is most probably overfitted. However, mapping the weighted positions onto the HCV NS3 3D structure validated the associations found by the ANN model between amino acid physicochemical properties and antigenic activity of the NS3 proteins. Hence, the ANN model described in this article is suitable for guiding a focused rational design of antigenic targets with improved diagnostically relevant properties through a cyclic process of experimentally evaluating predicted antigens and retraining the model for a more accurate representation of quantitative SAR in this specific case of the HCV NS3 conformational antigenic epitope.
| ACKNOWLEDGEMENTS |
|---|
The authors thank Mike Purdy for his Perl script contributions for data processing and sequence transformations. J.L. was supported by the American Society for Microbiology and National Centers for Infectious Diseases (ASM/NCID) Postdoctoral Research Associates Program Fellowship (http://www.asm.org).
Funding: This work was supported by CDC intramural funding.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on January 30, 2008; revised on July 1, 2008; accepted on July 2, 2008
| REFERENCES |
|---|
Alter M.J. Epidemiology of hepatitis C virus infection. World J. Gastroenterol. (2007) 13:2436–2441.
Arbib M.A. The elements of brain theory and neural networks. The Handbook of Brain Theory and Neural Networks.—Arbib M.A., ed. (2003) 2nd ed. Cambridge, Massachusetts: The MIT Press. 3–23.
Brendel V., et al. Methods and algorithms for statistical analysis of protein sequences. Proc. Natl Acad. Sci. USA (1992) 89:2002–2006.
Chen Y., et al. Immunoreactivity of HCV/HBV epitopes displayed in an epitope-presenting system. Mol. Immunol. (2006) 43:436–442.
Choo Q.L., et al. Hepatitis C virus: the major causative agent of viral non-A, non-B hepatitis. Br. Med. Bull. (1990) 46:423–441.
Creighton T.E. Proteins: Structures and Molecular Properties. (1993) 2nd edn. New York: W.H. Freeman and Company.
Cui J., et al. MHC-BPS: MHC-binder prediction server for identifying peptides of flexible lengths from sequence-derived physicochemical properties. Immunogenetics (2006) 58:607–613.
Cui J., et al. Prediction of MHC-binding peptides of flexible lengths from sequence-derived structural and physicochemical properties. Mol. Immunol. (2007) 44:866–877.
Engelman D.M., et al. Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu. Rev. Biophys. Biophys. Chem. (1986) 15:321–353.
Fox R. Directed molecular evolution by machine learning and the influence of nonlinear interactions. J. Theor. Biol. (2005) 234:187–199.
Guex N., Peitsch M.C. SWISS-MODEL and Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis (1997) 18:2714–2723.
Hohm T., et al. A multiobjective evolutionary method for the design of peptidic mimotopes. J. Comput. Biol. (2006) 13:113–125.
Kanistanon D., et al. Hepatitis C virus nonstructural 3 protein: recombinant NS3 protein of the Thai isolates as an antigen in a diagnostic assay. Asian Pac. J. Allergy Immunol. (2002) 20:161–166.
Khudyakov Y.E., et al. Linear B-cell epitopes of the NS3-NS4-NS5 proteins of the hepatitis C virus as modeled with synthetic peptides. Virology (1995) 206:666–672.
Khudyakov Y.E., et al. Impact of sequence heterogeneity on antigenic properties of the Hepatitis C virus (HCV) proteins. In: Proceedings of the 10th International Symposium on Viral Hepatitis and Liver Disease.—Margolis H.S., et al, eds. (2002) London, UK: International Medical Press. 381–385.
Kolaskar A.S., Kulkarni-Kale U. Prediction of three-dimensional structure and mapping of conformational epitopes of envelope glycoprotein of Japanese encephalitis virus. Virology (1999) 261:31–42.
Kolaskar A.S., Tongaonkar P.C. A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett. (1990) 276:172–174.
Kulkarni-Kale U., et al. CEP: a conformational epitope prediction server. Nucleic Acids Res. (2005) 33:W168–W171.
Lin S., et al. Design of novel conformational and genotype-specific antigens for improving sensitivity of immunoassays for hepatitis C virus-specific antibodies. J. Clin. Microbiol. (2005) 43:3917–3924.
Macedo de Oliveira A., et al. Sensitivity of second-generation enzyme immunoassay for detection of hepatitis C virus infection among oncology patients. J. Clin. Virol. (2006) 35:21–25.
Mamitsuka H. Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models. Proteins (1998) 33:460–474.
Ou-Yang P., et al. Characterization of monoclonal antibodies against hepatitis C virus nonstructural protein 3: different antigenic determinants from human B cells. J. Med. Virol. (1999) 57:345–350.
Reddy B.V., et al. Use of propensities of amino acids to the local structural environments to understand effect of substitution mutations on protein stability. Protein Eng. (1998) 11:1137–1145.
Rhodes G. Crystallography Made Crystal Clear: A Guide for Users of Macromolecular Models. (2006) 3rd edn. Burlington, MA: Academic Press.
Rumelhart D.E., et al. Learning representations by back-propagating errors. Nature (1986a) 323:533–536.
Rumelhart D.E., et al. Learning internal representations by error propagation. In: Parallel Distributed Processing.—Rumelhart D.E., McClelland J.L., eds. (1986b) 1. Cambridge: MIT.
Schneider G., Wrede P. Artificial neural networks for computer-based molecular design. Prog. Biophys. Mol. Biol. (1998) 70:175–222.
Stultz C.M., et al. Structural analysis based on state-space modeling. Protein Sci. (1993) 2:305–314.
Stultz C.M., et al. Protein structural biology in biomedical research. In: Predicting Protein Structure with Probabilistic Models.—Allewell N., Woodward C., eds. (1997) Greenwich: JAI Press. 447–506.
Su M., et al. An artificial neural network for predicting the incidence of radiation pneumonitis. Med. Phys. (2005) 32:318–325.
Sutherland J.J., et al. A comparison of methods for modeling quantitative structure-activity relationships. J. Med. Chem. (2004) 47:5541–5554.
Swingler K. Introduction. In: Applying Neural Networks: A Practical Guide. (1996) San Francisco: Morgan Kaufmann Publishers Inc. 3–20.
White J.V., et al. Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Math. Biosci. (1994) 119:35–75.
Xiong X.Y., et al. Expression and immunoreactivity of HCV/HBV epitopes. World J. Gastroenterol. (2005) 11:6440–6444.
Yao N., et al. Molecular views of viral polyprotein processing revealed by the crystal structure of the hepatitis C virus bifunctional protease-helicase. Structure (1999) 7:1353–1363.
Zhang G.L., et al. MULTIPRED: a computational system for prediction of promiscuous HLA binding peptides. Nucleic Acids Res. (2005) 33:W172–W179.