Bioinformatics Advance Access originally published online on January 18, 2007
Bioinformatics 2007 23(6):664-672; doi:10.1093/bioinformatics/btl649
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Deleterious SNP prediction: be mindful of your training data!

Matthew A. Care 1, Chris J. Needham 2, Andrew J. Bulpitt 2 and David R. Westhead 1,*

1Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK and 2School of Computing, University of Leeds, Leeds, LS2 9JT, UK

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: To predict which of the vast number of human single nucleotide polymorphisms (SNPs) are deleterious to gene function or likely to be disease associated is an important problem, and many methods have been reported in the literature. All methods require data sets of mutations classified as ‘deleterious’ or ‘neutral’ for training and/or validation. While different workers have used different data sets there has been no study of which is best. Here, the three most commonly used data sets are analysed. We examine their contents and relate this to classifiers, with the aims of revealing the strengths and pitfalls of each data set, and recommending a best approach for future studies.

Results: The data sets examined are shown to be substantially different in content, particularly with regard to amino acid substitutions, reflecting the different ways in which they are derived. This leads to differences in classifiers and reveals some serious pitfalls of some data sets, making them less than ideal for non-synonymous SNP prediction.

Availability: Software is available on request from the authors.

Contact: d.r.westhead@leeds.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation, accounting for approximately 90% of the DNA polymorphism in humans (Collins et al., 1998). It is estimated that there is a SNP of >1% frequency for every 290 base-pairs (Kruglyak and Nickerson, 2001). Within coding regions there are on average four SNPs per gene with a frequency above 1%. About half of these cause amino acid substitutions: termed non-synonymous SNPs (nsSNPs) (Cargill et al., 1999).

Deleterious SNP prediction tries to ascertain if an nsSNP will affect a protein's function and possibly contribute to genetic disease. Methods in the existing literature have used a large range of structure- and sequence-based attributes to separate deleterious from neutral SNPs (see supplementary Tables 1 and 2 for information). Structural attributes provide more understanding of effect mechanisms, but are not available for all SNPs. Sequence attributes usually identify important residues using information from homologous proteins. With enough homologues (~10), sequence attributes can often compete effectively with structural approaches (Bao and Cui, 2005; Saunders and Baker, 2002; Yue and Moult, 2006).

Efforts to classify SNPs have used these attributes in a variety of prediction methods from sets of empirical rules (Herrgard et al., 2003; Ng and Henikoff, 2001; Ramensky et al., 2002; Sunyaev et al., 2001; Wang and Moult, 2001), probabilistic prediction (Chasman and Adams, 2001) to a variety of machine-learning techniques including decision trees (DT) (Dobson et al., 2006; Krishnan and Westhead, 2003), support vector machines (Bao and Cui, 2005; Krishnan and Westhead, 2003; Yue et al., 2005; Yue and Moult, 2006), neural networks (Ferrer-Costa et al., 2004, 2005), Bayesian networks (Cai et al., 2004; Needham et al., 2006), random forests (Bao and Cui, 2005) and Bayesian multivariate adaptive regression splines (Verzilli et al., 2005). Although these different approaches derive prediction rules in a variety of ways they almost all require a data set of classified mutations for both model building (training) and error rate estimation (validation). For machine-learning methods to generalize well to target data, it is imperative that the right training data is chosen; the training and validation data should be drawn from the same (usually unknown) distribution as the target data.

However, this is not easy to arrange for the problem concerned, and a number of very different data sets have been employed. Some workers have used deleterious and neutral nsSNPs data based on systematic mutation studies on particular proteins (Cai et al., 2004; Chasman and Adams, 2001; Krishnan and Westhead, 2003; Ng and Henikoff, 2001; Verzilli et al., 2005; Wang and Moult, 2001). Others have used annotated disease variants from protein sequence databases as deleterious data, and have generated neutral data sets either from annotated sequence variants not known to be associated with disease (Bao and Cui, 2005), or by using pseudo mutations between orthologous proteins in closely related species (Ferrer-Costa et al., 2002; Ferrer-Costa et al., 2004, 2005; Ramensky et al., 2002; Sunyaev et al., 2001; Yue et al., 2005; Yue and Moult, 2006). These approaches yield data sets that are different in content and character, with different properties when used to train machine-learning methods, and give rise to classifiers with varying error rates.

This article is the first attempt to quantify the aforementioned effects. We begin by comparing the contents of the data sets to what might be expected in the target prediction data (real human SNPs). Then, using a simple decision tree method to produce easily interpretable classifiers, we study the relationships between data set, classifiers and estimated accuracy. Finally, we quantify the transferability of classifiers between data sets, thus quantifying the effect of, for instance, training a method on systematic mutagenesis data and applying it to human SNPs. This leads to detailed understanding of the advantages and potential pitfalls of each data set in training and validating nsSNP prediction methods.


    2 SYSTEMS AND METHODS
2.1 Decision trees
Decision trees (DT) are predictive models, displayed as a top down tree structure. Every node in the tree represents a decision point, where a test is carried out upon an attribute. For every possible outcome of the test there will be a child node, until the final decision node is reached, which branches to a set of leaf nodes giving the final classification. Here, we used the ‘yet another decision tree’ (YaDT) algorithm (Ruggieri, 2004) for constructing trees, using default parameters (confidence cut-off of 0.5; accepting all predictions) and with no optimization.
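The tree structure described above can be illustrated with a minimal sketch. This is not the YaDT algorithm, which learns the tree from data; it only shows how a finished tree is walked from root to leaf. The attribute names and thresholds are hypothetical.

```python
# Hypothetical hand-built tree; attribute names and thresholds are illustrative only.
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # final classification, e.g. "deleterious" or "neutral"


@dataclass
class Node:
    attribute: str                 # attribute tested at this decision point
    threshold: float               # go left if value <= threshold, else right
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree, sample):
    """Walk from the root to a leaf, applying one attribute test per node."""
    while isinstance(tree, Node):
        if sample[tree.attribute] <= tree.threshold:
            tree = tree.left
        else:
            tree = tree.right
    return tree.label


toy_tree = Node(
    "scorecons", 0.8,
    Leaf("neutral"),
    Node("pssm", 0.0, Leaf("deleterious"), Leaf("neutral")),
)
```

Each root-to-leaf path reads as a plain rule, which is why such classifiers are easy to interpret.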

2.2 Evaluation of accuracy
All experiments were carried out multiple times (see Section 2.5) with balanced data sets. Evaluation measures included the overall error [OE = (FP + FN)/(TP + FP + TN + FN)], the false positive rate [FPR = FP/(TN + FP)] and the false negative rate [FNR = FN/(TP + FN)], where TP = true positive, TN = true negative, FP = false positive and FN = false negative.
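As a small worked example, the three measures can be computed directly from the four confusion-matrix counts. The counts used here are made up for illustration.

```python
def error_rates(tp, fp, tn, fn):
    """Return (OE, FPR, FNR) as fractions, following the definitions in Section 2.2."""
    total = tp + fp + tn + fn
    oe = (fp + fn) / total      # overall error
    fpr = fp / (tn + fp)        # false positive rate
    fnr = fn / (tp + fn)        # false negative rate
    return oe, fpr, fnr


# Made-up counts for a balanced validation fold of 200 deleterious + 200 neutral.
oe, fpr, fnr = error_rates(tp=160, fp=30, tn=170, fn=40)
```

With these counts OE = 0.175, FPR = 0.15 and FNR = 0.20. On a balanced data set OE is simply the mean of FPR and FNR.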

2.3 Data sets
Three different types of data sets were used for deleterious SNP prediction, shown in Table 1 (see supplementary Table 3 for information on data set usage in other studies, and Table 4 for detailed information on data sets used here):


Table 1. Data sets. Showing the origin of the data sets along with the names assigned to each

 
2.3.1 Mutagenesis data sets
The mutagenesis data sets consist of systematic unbiased mutations of the T4 lysozyme (Alber et al., 1987; Rennell et al., 1991) and lac repressor (Markiewicz et al., 1994; Suckow et al., 1996) proteins. The subset of mutations used here (from Krishnan and Westhead, 2003) has 1990 mutations for the T4 lysozyme and 3303 for the lac repressor protein. The original mutagenesis experiments classified each mutation into four effect categories which were reduced to a binary classification by Chasman and Adams (2001), yielding data sets with 40 and 38% of deleterious mutations for the lac and lysozyme, respectively.

2.3.2 Swiss-Prot data set
Another type of data set used for deleterious SNP prediction is derived from the Swiss-Prot variant webpage (Yip et al., 2004). Approximately 20% of the human proteins contained in the Swiss-Prot knowledgebase have one or more single amino acid polymorphism (SAP) (Boeckmann et al., 2003). Each SAP is manually annotated in the feature table of the Swiss-Prot variant database with the label ‘disease’ (SAP with disease association), ‘polymorphism’ (SAP with no known disease association) or ‘unclassified’ (SAP which has too little information to classify). Parsing this data gave a total of 12911 disease SAPs on 1055 proteins and 8302 polymorphism SAPs (deemed neutral) on 3388 proteins.

2.3.3 Divergent data set
An alternative source of neutral SAPs is the divergence data set, created by noting the changes between human proteins and their related mammalian orthologs. It is assumed that almost all of the variation fixed between closely related species is non-deleterious. There is a variation in the exact method used to create a divergent data set. Some research groups accept proteins with >90% sequence identity (SI) and >80% coverage allowing all matches per species (Yue and Moult, 2006), whilst others only accept >95% SI over 100% coverage and to avoid paralogs only use the best match per species (Sunyaev et al., 2001).

As is the normal practice, the proteins containing disease SAPs (Swiss-Prot ‘disease’) were used to generate a divergence data set. Each protein was searched against the NCBI non-redundant (NR) database using BLASTP (Altschul et al., 1997). All non-mammalian matches were discarded and the remaining matches processed using two different methods. For both methods each match was aligned with its corresponding disease protein and all amino acid differences were noted along with the SI of the alignment. This resulted in a set of pseudo mutations separated into SI categories from ≥30% to ≥95% SI. Furthermore, one of the methods used all of the mammalian matches (neutralAH) generated by BLASTP whilst the other only used the best match per mammalian species (neutralBH, as Sunyaev et al., 2001) to avoid possible paralogs.
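The step of noting amino acid differences and sequence identity for each aligned match might look like the sketch below. It assumes the pairwise alignment has already been produced, computes identity over gap-free columns only, and records each difference as human residue to ortholog residue; all three choices are assumptions, not details given in the text.

```python
def pseudo_mutations(human_aln, ortholog_aln):
    """Return (sequence identity in %, [(human_position, human_res, ortholog_res), ...]).

    Both inputs are rows of one pairwise alignment (equal length, '-' for gaps).
    Identity is computed over gap-free columns only, which is an assumption.
    """
    if len(human_aln) != len(ortholog_aln):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = 0
    differences = []
    human_pos = 0  # 1-based residue index in the ungapped human sequence
    for h, o in zip(human_aln, ortholog_aln):
        if h != "-":
            human_pos += 1
        if h == "-" or o == "-":
            continue  # gapped columns are not substitutions
        compared += 1
        if h == o:
            matches += 1
        else:
            differences.append((human_pos, h, o))
    identity = 100.0 * matches / compared if compared else 0.0
    return identity, differences


# Toy alignment, not real data.
identity, diffs = pseudo_mutations("MKT-LLV", "MRTALIV")
```

For the toy alignment this yields two pseudo mutations (K2R and L5I) at about 66.7% identity. Pseudo mutations would then be binned by the identity of the alignment they came from.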

2.4 Attributes
To allow for predictions to be made on all available SNPs a set of attributes was selected that could be generated without any requirement for structural information:

  1. Original and mutated amino acid residue identity
  2. Original and mutated amino acid physicochemical class (Hydrophobic, Polar, Charged, Glycine)
  3. Hydrophobicity difference between original and mutated residues
  4. Mass shift upon mutation
  5. Predicted secondary structure at mutation site: (Loop, Helix, Strand)
  6. Predicted solvent accessibility at mutation site: (0 -> 9; buried->exposed)
  7. Scorecons value: sequence conservation score at mutation site: (0->1; not->fully conserved)
  8. Buried charge at mutation site: (Residue is one of K, R, D, E, H and has an accessibility of 0 or 1)
  9. Position specific scoring matrix (PSSM) value for amino acid substitution
  10. Log-odds score of amino acid substitution
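
Most of these attributes depend on external predictors, but attribute 8 is fully specified by the list above and can be written down directly. The sketch below assumes the accessibility value is the 0 to 9 integer from attribute 6; whether the test applies to the original or the mutated residue is not stated, so the function takes a residue generically.

```python
CHARGED = set("KRDEH")  # residues listed for attribute 8


def buried_charge(residue, accessibility):
    """True if a charged residue sits at a predicted buried site (accessibility 0 or 1)."""
    return residue in CHARGED and accessibility in (0, 1)
```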

Attributes 1–8 are the same as those used by Krishnan and Westhead (2003), with the exception that only predicted secondary-structure and solvent accessibility were used and these were generated using the Sable program (Adamczak et al., 2004) rather than PHD (Rost and Sander, 1993).

The attributes were generated as follows: Each protein sequence was submitted to Sable for secondary-structure and solvent accessibility prediction. Sable carries out a PSIBLAST (Altschul et al., 1997) search against the NCBI NR database with 3 iterations. The resultant alignment profile and PSSM were retained for later use. The proteins in the PSIBLAST alignment profile with E-score values <10⁻³ were pulled out of the NR database using fastacmd and then aligned with the human query protein using Muscle (Edgar, 2004). The produced multiple alignment was submitted to Scorecons (Valdar, 2002) to calculate the sequence conservation. The log-odds score was calculated as the log ratio of amino acid substitution probabilities in the neutral and deleterious data sets, respectively.

2.5 Cross-validation and data set randomization
All data sets were sampled to give an equal number of positive and negative examples, as it has been shown that balanced data sets give the best accuracy with decision trees (Dobson et al., 2006).
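Balancing by sampling can be sketched as follows. The function name, the fixed seed and the use of sampling without replacement are assumptions made for illustration.

```python
import random


def balanced_sample(deleterious, neutral, n_per_class, seed=0):
    """Draw n_per_class examples from each class without replacement."""
    rng = random.Random(seed)  # seeded so that a given sample can be reproduced
    return rng.sample(deleterious, n_per_class), rng.sample(neutral, n_per_class)
```

Calling this with ten different seeds would give ten independent balanced samples of the kind used in the experiments described next.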

For the homogeneous cross-validation experiments (training and validation data drawn from the same data set) 4000 SAPs were randomly sampled from each data set 10 times (e.g. 4000 deleterious and 4000 neutral) and used to carry out 10-fold cross-validation. To remove any possible training bias multiple SAPs on a given protein were not split between training and testing sections. In addition, the level of homologous proteins within the training data is too low to cause bias (data not shown).
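One way to keep all SAPs of a protein on the same side of every train/test split is to assign whole proteins to folds. The greedy scheme below is a sketch of that idea, not the authors' procedure, and the (protein_id, mutation) tuple layout is hypothetical.

```python
from collections import defaultdict


def protein_grouped_folds(saps, k=10):
    """Assign each SAP to one of k folds so that a protein never spans two folds.

    `saps` is a list of (protein_id, mutation) tuples; the layout is hypothetical.
    Proteins are placed greedily, largest first, into the currently smallest fold.
    """
    by_protein = defaultdict(list)
    for sap in saps:
        by_protein[sap[0]].append(sap)
    folds = [[] for _ in range(k)]
    for protein in sorted(by_protein, key=lambda p: (-len(by_protein[p]), p)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_protein[protein])
    return folds
```

Each fold in turn serves as the test set while the remaining k - 1 folds are used for training. Fold sizes are only approximately equal because proteins carry different numbers of SAPs.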

Heterogeneous cross-validation involves using one data set for training and another for testing. For some of the experiments part of the training set was from the same data set type as the test set (e.g. train on disease/polymorphism, test on disease/divergent) and, therefore, the data sets had to be split into training and test groups. Thus, 4000 SAPs were randomly sampled 10 times from each data set and then split into training and test parts, as mentioned earlier in the article. This gave training and test data sets of 4000 mutations (2000 deleterious and 2000 neutral).

The exception to the above regards the experiments using the mutagenesis data sets which owing to their limited size had to be sampled differently. These data sets were each initially split into the two classes of mutation (lac: 1325 ‘deleterious’, 1978 ‘neutral’; lysozyme: 762 ‘deleterious’, 1228 ‘neutral’). From these 762 mutations were randomly sampled 10 times from each part (maximum size limited by lysozyme ‘deleterious’). These samples were then used to carry out 10-fold homogeneous cross-validation, with training and testing sizes of 1372 and 152 mutations, respectively. In addition, the lac and lysozyme data sets were merged to make a combined mutagenesis data set containing 3048 mutations per sample.

2.6 Construction of HEAT matrix
A matrix of human expected amino acid transitions (HEAT) was constructed, consisting of the expected rates of amino acid substitutions in human protein coding genes, in the absence of selection. It was constructed in a similar fashion to Vitkup et al. (2003) using a matrix of neighbour-dependent substitution rates (Hess et al., 1994). These rates were generated by aligning ~10 Mb of human gene-pseudogene pairs, resulting in 20 200 pseudo mutations. From this the relative substitution rates (X->Y) were calculated for the four nucleotide bases (X,Y) starting in all possible 3 nucleotide neighbourhoods (*X*), giving a matrix of 96 neighbourhood dependent substitution rates ([12 x 16]/2; 12 possible substitutions in 16 possible 3 base contexts with data aggregated for complementary substitutions) with a 65-fold variation of relative rates.

Here, we used this matrix of relative rates along with calculated average human-codon-usage to calculate the expected rates of all the amino acid substitutions resulting from single nucleotide mutations (SNM). The resulting HEAT matrix, shown in Figure 1a, is based on the average codon usage across all known coding sequences in the human genome and, thus, is more general than those produced by Vitkup et al. (2003), which were created from a smaller sample of genes.
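The calculation can be sketched as a sum over codons and single-nucleotide changes, weighting each change by codon usage and by a context-dependent rate. The code below is an illustration under stated assumptions, not the authors' implementation: the `rate` function signature is hypothetical, the inputs are uniform placeholders rather than the published rates and human codon usage, and neighbouring bases that fall outside the codon are averaged uniformly over the four bases.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_substitutions(codon_usage, rate):
    """Expected relative rate of each amino acid substitution reachable by one
    nucleotide change, in the absence of selection.

    codon_usage: dict codon -> relative frequency.
    rate(five, old, three, new): relative rate of old -> new given its 5' and 3'
        neighbours (hypothetical stand-in for neighbour-dependent rates).
    """
    heat = defaultdict(float)
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        for pos, new in product(range(3), BASES):
            old = codon[pos]
            if new == old:
                continue
            mutant = codon[:pos] + new + codon[pos + 1:]
            new_aa = CODE[mutant]
            if new_aa in ("*", aa):
                continue  # skip nonsense and synonymous changes
            # Neighbours outside the codon are unknown: average over all four bases.
            fives = codon[pos - 1] if pos > 0 else BASES
            threes = codon[pos + 1] if pos < 2 else BASES
            mean_rate = sum(
                rate(f, old, t, new) for f in fives for t in threes
            ) / (len(fives) * len(threes))
            heat[(aa, new_aa)] += codon_usage[codon] * mean_rate
    total = sum(heat.values())
    return {pair: value / total for pair, value in heat.items()}


# Placeholder inputs: uniform codon usage and context-independent rates.
uniform_usage = {codon: 1.0 for codon in CODE}
heat = expected_substitutions(uniform_usage, lambda five, old, three, new: 1.0)
```

Even with uniform placeholders the result is non-uniform, because amino acids differ in how many codons encode them and in which neighbours those codons can reach.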


Fig. 1. Deviation of data sets from expected mutations. (a) human expected amino acid transitions (HEAT); displaying the expected relative rates of amino acid substitutions under no selection pressure (Intensity of greyscale depicting expected rate of amino acid substitution. ‘x’ = substitutions requiring multiple nucleotide mutations; not present in HEAT). (b, c, d) deviation of data sets from expected (HEAT), blue = under-represented, red = over-represented ‘x’ = not present in HEAT. (b) Swiss-Prot annotated ‘disease’. (c) Swiss-Prot annotated ‘polymorphism’. (d) Divergent neutralBH SI90.

 

    3 RESULTS
3.1 Data set comparisons
Single nucleotide mutations (SNM) within codons can give rise to 150 possible amino acid substitutions (see Fig. 1a for relative rates in humans). The remaining 230 amino acid substitutions require multiple nucleotide mutations (MNM) to occur within a codon. Figure 2 shows the percentage of amino acid substitutions in each data set that result from MNM. The two mutagenesis data sets have a very high percentage of MNM (Lac = 57%, Lyso = 59%). The Swiss-Prot data sets, in contrast, have almost no MNM with the disease and polymorphism having only 0.2 and 0.1%, respectively. The divergent ‘neutral’ data sets have 5–40% MNM, depending on the SI threshold, a lower level than the mutagenesis data sets but still far greater than observed in the Swiss-Prot data sets.
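The split into 150 SNM and 230 MNM substitutions can be checked by enumerating the standard genetic code. The sketch below counts ordered (from, to) amino acid pairs and adds a helper for the MNM percentage of a data set; the helper name and the input format are assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snm_substitutions():
    """Ordered (from, to) amino acid pairs reachable by one nucleotide change."""
    pairs = set()
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        for pos, new in product(range(3), BASES):
            mutant = codon[:pos] + new + codon[pos + 1:]
            new_aa = CODE[mutant]
            if new_aa != "*" and new_aa != aa:
                pairs.add((aa, new_aa))
    return pairs


SNM = snm_substitutions()
n_snm = len(SNM)            # 150
n_mnm = 20 * 19 - n_snm     # 230


def percent_mnm(substitutions):
    """Percentage of (from, to) substitutions in a data set that require MNM."""
    mnm = sum(1 for pair in substitutions if pair not in SNM)
    return 100.0 * mnm / len(substitutions)
```

For example, Trp to Cys needs one nucleotide change, whereas Trp to Phe needs at least two and therefore counts as MNM.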


Fig. 2. The percentage of multiple nucleotide mutations present in each data set. Mutagenesis (Lac/Lyso), Swiss-Prot (disease/polymorphism) and divergent (neutralBH/neutralAH; sequence identity cut-off 30%–95%).

 
Even with this rudimentary data set analysis, it becomes apparent that a large percentage of the mutations in the mutagenesis data sets (Lac/Lyso), and a significant proportion in the divergent data sets (neutralAH/BH), are very unlikely to be observed in the short evolutionary distance associated with real human mutations. One possible result of this is that irrelevant rules will be generated by learning methods, with significant effects on prediction accuracy (see later sections).

A more sophisticated method for comparing data sets is to observe their relative content of amino acid substitutions. Here we compare the amino acid substitution rates in each data set using HEAT as a reference distribution. Thus, the Swiss-Prot data set, consisting of human SAPs, would be expected to be similar to this distribution, with any significant differences attributable to natural selection within the human population. The divergent data sets, consisting of pseudo-mutations between man and related mammals, are also likely to be similar, but with the deviation from HEAT attributable to longer evolutionary distances.

HEAT is shown in Figure 1a, displaying the expected relative rates of amino acid substitutions. The amino acids are arranged by similarity, so that substitutions lying close to the leading (top-left-to-bottom-right) diagonal are between chemically similar amino acids. The non-uniform nature of the matrix is due to a variety of factors: the differing number of codons for each amino acid, the codon usage pattern in the human genome and the high rate of mutations caused by CpG deamination in certain codons, notably Arg, in which four of the six codons contain CpG sites.

The HEAT matrix only contains amino acid substitutions resulting from SNM and has no values for the other less likely amino acid substitutions (from MNM; represented by ‘x’). Owing to the number of MNM present in the mutagenesis data sets, they were set aside for this analysis. For the other data sets count matrices were created with the counts of all amino acid substitutions resulting from SNM. Then, to see which substitutions were over/under represented in each data set, the log-odds score was calculated for each amino acid substitution [log (P(datasetSubstitution)/P(HEAT Substitution))]. In addition the Pearson's correlation between each data set and the HEAT matrix was calculated and is given along with an estimate of significance.
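The log-odds score and the correlation can be sketched as below. The natural logarithm is an assumption (the base is not stated), as is the omission of substitutions with zero counts; the numbers in the toy example are made up.

```python
import math


def log_odds(counts, expected):
    """log(P(dataset substitution) / P(HEAT substitution)) per substitution.

    counts: dict substitution -> observed count in a data set (SNM only).
    expected: dict substitution -> expected relative rate (HEAT), summing to 1.
    Substitutions with zero counts are omitted, since log(0) is undefined.
    """
    total = sum(counts.values())
    return {
        s: math.log((c / total) / expected[s]) for s, c in counts.items() if c > 0
    }


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy example with three substitutions; values are illustrative only.
toy_expected = {("R", "H"): 0.50, ("C", "Y"): 0.25, ("A", "V"): 0.25}
toy_counts = {("R", "H"): 50, ("C", "Y"): 40, ("A", "V"): 10}
scores = log_odds(toy_counts, toy_expected)
```

A positive score marks a substitution as over-represented relative to HEAT (Cys to Tyr in the toy data), a negative score as under-represented (Ala to Val), and zero as exactly expected (Arg to His).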

The results for three of the data sets are shown in Figure 1. For the Swiss-Prot disease data set (Fig. 1b; R = 0.81, P < 0.0001) the squares lying close to the leading diagonal, displaying substitutions between chemically similar amino acids, are under-represented whilst those not lying close to this diagonal, substitutions between chemically dissimilar amino acids, are over-represented. For the Swiss-Prot polymorphism data set (Fig. 1c; R = 0.91, P < 0.0001) there is the opposite trend, with the squares close to the leading diagonal over-represented, or near to expected, and the squares not lying close to the leading diagonal under-represented. As expected this data set is more strongly correlated with HEAT than the disease mutations.

In the Swiss-Prot disease data set (Fig. 1b) the most over-represented substitutions are from the amino acids Cys, Gly, Trp, Arg and Tyr; this agrees with the findings of Vitkup et al. (2003). The differences seen in the data sets are mainly governed by the types of substitutions an amino acid can undergo by SNM. In some cases, such as Cys and Trp, the substitutions resulting from SNM are all disfavoured, while for others, such as Gly, the substitutions resulting from SNM are a mixture of favoured, neutral, and disfavoured. Thus, substitutions from Cys and Trp are very likely to be deleterious, not only because these amino acids play important structural roles but also because their likely substitutions are all disfavoured.

Most of the substitutions that are over-represented in the disease data set are under-represented in the polymorphism data set (Fig. 1c) with Cys and Tyr having the strongest divergence from HEAT, suggesting that even the relatively simple attribute of amino acid substitution would separate these data sets to some extent.

The divergent (neutralBH90) data set (Fig. 1d; R = 0.74, P < 0.0001) is similar to the polymorphism, except that the divergence from HEAT is generally greater, owing to longer evolutionary distances, particularly for example with substitutions from Cys and Trp. A notable exception is Arg, which is slightly over-represented in the polymorphism data set and strongly under-represented in the divergent data set. In this case, substitutions over short evolutionary distances are strongly influenced by the coding sequence. As the evolutionary distance increases, selection begins to reflect the constraints imposed by protein structural stability and function, favouring substitutions between amino acids with similar chemical properties (the only over-represented substitution in this latter case is Arg to the related basic amino acid Lys) (Benner et al., 1994).

Overall, the comparison with the HEAT matrix has highlighted differences between the data sets, showing the potential to discriminate deleterious from neutral using only the parameter of amino acid substitution. It has also emphasized significant differences in the distribution of amino acid substitutions present in the Swiss-Prot polymorphism and divergent data sets. The polymorphism data set has a high level of correlation with the HEAT matrix (R = 0.91, P < 0.0001), while the divergent data set's correlation (R = 0.74, P < 0.0001) is actually less than that of Swiss-Prot disease (R = 0.81, P < 0.0001).

This has important consequences for machine-learning methods: training with the divergent data sets as neutral data is likely to yield rules that separate deleterious from neutral accurately. However, there is a danger that the basis of these rules would be simply the differing evolutionary distances for the mutations in the Swiss-Prot disease set compared with the divergent data sets. Such rules may be of little use in distinguishing human disease mutations from neutral mutations occurring on the same evolutionary time scale.

3.2 Decision tree homogeneous cross-validation
The results from homogeneous 10-fold cross-validation are shown in Table 2 (see Systems and Methods for information on accuracy measures). The results are split according to the attributes used for prediction, giving a comparison of accuracy using ‘All’ of the attributes with that from some important individual attributes used independently (‘PSSM only’, position specific scoring matrix value; ‘amino acid substitution only’, amino acid substitution; ‘Scorecons only’, conservation value).


Table 2. Homogeneous cross-validation. Showing the average false positive rate (FPR), false negative rate (FNR) and overall-error (OE) for 10-fold cross-validation trained on ‘All’ attributes, ‘PSSM only’, ‘amino acid substitution only’ and ‘Scorecons only’

 
When ‘All’ attributes are used for prediction the overall-error (OE) ranges from 19.88% to 30.05% across the different data sets, showing that even under homogeneous cross-validation some data sets are far easier to classify than others. This range of accuracy across data sets is greatly influenced by the level of distinction between the ‘deleterious’/‘neutral’ parts of each data set. The comparison made with the HEAT matrix showed that the substitutions in the divergent data sets (neutralAH/BH) deviate further from HEAT than those in the Swiss-Prot polymorphism data set, explaining the greater prediction accuracy when using the former (~20% OE) compared with the latter (28.42% OE) as neutral data. The divergent data sets are also easier to separate from disease due to their MNM (Fig. 2), which are almost completely absent in the Swiss-Prot (polymorphism/disease) data sets. These are effectively ‘easy’ predictions as the DT can correctly classify ~10% of the divergent (SI90) data set on these alone.

‘PSSM only’ encodes position specific evolutionary information and leads to OE ranging from 26.17%–35.93%, with the T4 lysozyme proving the hardest to classify and the diseaseDivergent the easiest. The OE for the ‘amino acid substitution only’ attribute ranges from 26.06%–41.46%, a larger range than for PSSM, yet for the diseaseDivergent data sets the ‘amino acid substitution only’ is the most accurate single attribute. ‘Scorecons only’ is an alternative measure of position specific evolutionary information and gives rise to overall errors in the range 29.27%–36.51%. By contrast with the PSSM, scorecons encodes only conservation while the PSSM contains information on the likelihood of specific amino acid substitutions, yet this only makes a significant difference to the OE in the cases of the diseaseDivergent data sets.

The clear interpretation emerging from these observations, and the previous data set analysis, is that the simplest attribute, amino acid substitution, contains useful information to separate all data sets, and is highly predictive when the divergent data set is used for negative data [diseaseN(A/B)H in Table 2]. Nevertheless this should be viewed with substantial caution, since, as previously noted, the effect may arise less from distinguishing deleterious from neutral mutations than from distinguishing data sets that differ in their content of amino acid substitutions, owing to variations in evolutionary distance and systematic mutation.

Figure 3 shows the effect upon overall accuracy of homogeneous cross-validation (disease/divergent) when changing the minimum SI level for accepting homologs in the divergent data sets (neutralAH/BH). Using ‘All’ attributes produces the highest accuracy, followed by ‘amino acid substitution only’ and then ‘PSSM only’. Again these results are strongly influenced by data set content. As the SI level increases the level of MNM (Fig. 2) decreases, and the divergent data sets share more similarity with the Swiss-Prot disease data set, thus increasing the observed error rate for the ‘amino acid substitution only’ attribute. In addition the ‘amino acid substitution only’ attribute is little affected by the method used to create the divergent data set. In contrast, the ‘PSSM only’ attribute is strongly affected, with a larger variation in the OE for the neutralAH (4.06%) compared with the neutralBH (1.45%).


Fig. 3. Effect of divergent data set's minimum sequence identity on overall error. Displaying homogeneous cross-validation overall error percent for ‘All’ attributes, ‘amino acid substitution only’ and ‘PSSM only’ with increasing sequence identity cut-off; data set disease/divergent.

 
To optimize the generation of divergent data sets, we note that with increasing SI levels the discrepancy between the errors for the two divergent data sets diminishes, but the neutralBH gives consistently lower error rates. With ‘All’ attributes, lower apparent error rates result from data sets at lower SI thresholds. However, this is again misleading. It is unlikely that these data sets give better SNP classification methods; rather, the lower error rates are artefacts caused by different data set contents in terms of amino acid substitutions. The effect is clear with ‘amino acid substitution only’, but when ‘PSSM only’ is used the trend of error rate increasing with SI disappears. Both ‘amino acid substitution only’ and ‘PSSM only’ show a small increase in error between SI values of 90 and 95%, suggesting that if these data sets were to be used to train methods, a 90% cut-off would be preferred. The increase may be caused by the limited data available at very high sequence identity.

3.3 Heterogeneous cross-validation
Heterogeneous cross-validation measures the ability of a classifier trained on one data set to predict on another. Table 3 shows the results for heterogeneous and homogeneous (on the diagonal) cross-validation for a selection of attributes, along with their corresponding average error rates per attribute type. In addition, for each attribute the average deviation from the homogeneous overall-error is shown, indicating how well the homogeneous cross-validation gauges predictive ability on other data sets.
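The average deviation from the homogeneous overall-error can be computed from a train-by-test error matrix. The sketch below takes the mean absolute difference between each heterogeneous OE and the homogeneous OE of the same training set, which is one reasonable reading of the description rather than a confirmed definition. The data set names and numbers are placeholders, not the values in Table 3.

```python
def mean_deviation_from_homogeneous(oe):
    """For each training set, average |heterogeneous OE - homogeneous OE|.

    oe[train][test] holds overall error in percent; oe[d][d] is the homogeneous value.
    """
    result = {}
    for train, row in oe.items():
        others = [abs(row[test] - row[train]) for test in row if test != train]
        result[train] = sum(others) / len(others)
    return result


# Placeholder numbers for illustration only; they are not the values in Table 3.
toy = {
    "setA": {"setA": 20.0, "setB": 30.0, "setC": 40.0},
    "setB": {"setA": 26.0, "setB": 28.0, "setC": 36.0},
    "setC": {"setA": 45.0, "setB": 41.0, "setC": 25.0},
}
deviation = mean_deviation_from_homogeneous(toy)
```

A small deviation indicates that homogeneous cross-validation is a fair guide to performance on other data sets; a large one indicates that the reported accuracy depends heavily on the training data.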


Table 3. Heterogeneous cross-validation. Showing the average homogeneous (on diagonal) and heterogeneous overall-error (OE) for data sets trained and tested on ‘All’ attributes, ‘PSSM only’, ‘amino acid substitution only’ and ‘Scorecons only’

 
First, considering ‘All’ attributes it is clear that error rates in heterogeneous cross validation are generally significantly higher than the corresponding values for the homogeneous case. This would be expected, but is an important effect in a field where training data can be substantially different to the final target data for prediction. An exception is that training on diseasePoly data has a homogeneous OE of 28% but predicts on diseaseNBH90 with OE of 24%. The explanation here is that this latter data set is easier to separate, for reasons previously discussed. Otherwise, transfer of rules derived from one data set results in significantly larger error rates, and perhaps most notably rules learned from the LacLyso data tend to transfer poorly to the other data sets, and vice versa. Rules based on this systematic mutagenesis data may be a poor choice for SNP prediction, and the most likely cause of this is that the amino acid substitution content of the data set, particularly the large level of MNMs, leads to rules of little relevance for human SNPs.

In contrast, rules derived from the other two data sets are more interchangeable, as might be expected since they share the same deleterious (disease) data. As before, differences between these data sets stem from the relationship between the attributes used and basic differences in amino acid substitution content in the neutral data. Notably, using ‘amino acid substitution only’, a homogeneous OE of 26% is obtained for diseaseNBH90, while the same DT rules have an OE almost 12% higher on the diseasePoly data.
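To see why substitution content alone can drive such differences, consider a deliberately crude stand-in for an ‘amino acid substitution only’ classifier: a lookup table that predicts, for each substitution, the majority label seen in training. The sketch below is not the decision tree used in this study, and the substitutions, labels and counts are invented purely for illustration.

```python
from collections import Counter, defaultdict

def fit_substitution_table(examples, default="deleterious"):
    """Learn the majority label for each (wild-type, mutant) substitution."""
    votes = defaultdict(Counter)
    for substitution, label in examples:
        votes[substitution][label] += 1
    table = {sub: counts.most_common(1)[0][0] for sub, counts in votes.items()}
    # Substitutions never seen in training fall back to the default label.
    return lambda substitution: table.get(substitution, default)

def overall_error(predict, examples):
    """Percentage of examples that are misclassified."""
    wrong = sum(1 for sub, label in examples if predict(sub) != label)
    return 100.0 * wrong / len(examples)

# Hypothetical training set: the neutral class is dominated by one
# conservative substitution, as might happen with divergence-derived data.
train = ([(("I", "V"), "neutral")] * 4
         + [(("R", "W"), "deleterious")] * 4
         + [(("I", "V"), "deleterious")] * 1)

# Hypothetical target set: neutral variants also include R>W.
target = ([(("R", "W"), "neutral")] * 2
          + [(("I", "V"), "neutral")] * 2
          + [(("R", "W"), "deleterious")] * 4)

predict = fit_substitution_table(train)
train_error = overall_error(predict, train)    # error on the training content
target_error = overall_error(predict, target)  # error once the content changes
print(train_error, target_error)
```

The table separates the training data well only because the two classes contain different substitutions; once the neutral class of the target data contains substitutions that were labelled deleterious in training, the error rises.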

It is interesting, yet intuitive, that when prediction methods are based purely on the evolutionary attributes they tend to transfer better between the different data sets. Compared with the 12% difference noted above for ‘amino acid substitution only’, with ‘PSSM only’ the error rate rises by only 7% between homogeneous cross-validation (diseaseNBH90) and heterogeneous cross-validation (tested on diseasePoly). Similarly, with ‘Scorecons only’ the OE rises by only 3%. These attribute-dependent effects are also clear in the figures for average deviation from homogeneous OE, showing smaller deviations for the evolutionary attributes. It is also apparent that ‘PSSM only’ is a better predictor in general than ‘Scorecons only’; the latter only encodes conservation while the former contains information about possible neutral residue replacements.


    4 DISCUSSION
It is an appealing idea to use data from gene mutations, known to cause disease or affect protein function, to train machine-learning methods for predictions on observed human nsSNPs. It is nevertheless vitally important to consider the selection of training data very carefully. In this article we have shown that the choice of training data has significant effects on classifiers and estimated error rates.

Our results suggest that the use of mutagenesis data, with a significantly higher content of MNMs than would be expected for nsSNPs, may lead to largely irrelevant rules for SNP predictions. These data nevertheless remain good unbiased data sets for the prediction of the effects of general protein mutations. Equally, the generation of neutral data from pseudo-mutations between orthologous proteins (the divergent data set) produces data sets that can be distinguished from known disease mutations at reasonable error rates, solely on the basis of the amino acid substitutions. But such classifiers are unlikely to perform with the same low error rates in distinguishing human deleterious and neutral SNPs. The rules may have some predictive power for SNPs, but a significant contribution to their apparent homogeneous cross-validation accuracy results from separation of the training data on the basis of content of amino acid substitutions, caused by different evolutionary distances in the deleterious and neutral parts of the training data. One potential way of improving the divergent data sets is to limit the aligned orthologous proteins to primates, thus reducing the evolutionary distance. This results in a neutral data set that has a higher correlation with HEAT (R = 0.81, P < 0.0001) than the mammalian-derived data set (NBH90; R = 0.74, P < 0.0001), yet still much lower than the Swiss-Prot polymorphism data (R = 0.91, P < 0.0001). When combined with Swiss-Prot disease, this new neutral data set produces a homogeneous OE of 23.23%, placing it between diseaseNBH90 (OE 19.88%) and diseasePoly (OE 28.42%). Thus, while this data set is closer to HEAT than NBH90, it is still clearly over a longer evolutionary distance than the Swiss-Prot polymorphism data.
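A comparison of this kind can be sketched as follows, assuming that R is a Pearson correlation computed over per-substitution frequencies; that reading, the function names and the toy profiles below are assumptions made for illustration, not details taken from this study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def profile_correlation(profile_a, profile_b):
    """Correlate two substitution profiles given as {substitution: frequency}.

    Substitutions missing from one profile are treated as frequency 0.
    """
    keys = sorted(set(profile_a) | set(profile_b))
    return pearson_r([profile_a.get(k, 0.0) for k in keys],
                     [profile_b.get(k, 0.0) for k in keys])

# Hypothetical profiles; real ones would cover all observed substitutions.
reference = {"I>V": 0.4, "A>T": 0.3, "G>D": 0.2, "R>W": 0.1}
same_shape = {"I>V": 80, "A>T": 60, "G>D": 40, "R>W": 20}     # raw counts, same shape
different = {"I>V": 0.1, "A>T": 0.2, "G>D": 0.3, "R>W": 0.4}  # reversed ranking

print(profile_correlation(reference, same_shape))
print(profile_correlation(reference, different))
```

Because the coefficient is invariant to scaling, raw counts and normalised frequencies give the same value, so data sets of different sizes can be compared directly.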

Therefore, we suggest that the best training data for human nsSNP predictions are the Swiss-Prot annotated ‘disease’ and ‘polymorphism’ variants of known human proteins. This is not without problems: variants annotated as neutral polymorphisms may have an unknown association with disease. Nevertheless, the differences in Figures 1b and c, and the fact that learning methods can successfully separate the disease and polymorphism classes, suggest that this is unlikely to be the case for the majority of the data.

Equally, it might be suggested that other data sets could be used if appropriate attributes were chosen. Rules based on evolutionary attributes are more transferable between data sets than rules based on amino acid substitutions. However, any good learning method will separate the data sets using the most informative attributes, and it can be difficult to completely remove effects such as those reported here. For instance, the apparently purely physicochemical attributes hydrophobicity and molecular-mass-difference contain information sufficient to identify the amino acid substitution involved. Such effects are even harder to tease out with methods less interpretable than decision trees (e.g. support vector machines or neural networks). The training data is fundamental: it affects all methods, and it is important to get it right first.
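One way to probe how much substitution identity leaks through physicochemical attributes is to count how many of the 380 possible substitutions have a distinct pair of attribute values. The sketch below makes assumptions that are not taken from this study: it uses the Kyte-Doolittle hydropathy scale and approximate average residue masses, and it encodes both attributes as mutant-minus-wild-type differences.

```python
from collections import defaultdict
from itertools import permutations

# Assumption: Kyte-Doolittle hydropathy values (the scale used here is a guess).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

# Assumption: approximate average residue masses in daltons.
MASS = {"G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
        "T": 101.11, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
        "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
        "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21}

def attribute_signature(wild, mutant):
    """(hydropathy difference, mass difference) for one substitution."""
    return (round(KD[mutant] - KD[wild], 2), round(MASS[mutant] - MASS[wild], 2))

# Group all ordered pairs of distinct residues by their attribute signature.
groups = defaultdict(list)
for wild, mutant in permutations(KD, 2):
    groups[attribute_signature(wild, mutant)].append((wild, mutant))

unique = sum(1 for subs in groups.values() if len(subs) == 1)
ambiguous = sum(len(subs) for subs in groups.values() if len(subs) > 1)
print("substitutions with a unique signature:", unique)
print("substitutions sharing a signature:", ambiguous)
```

Under these assumptions a few substitutions do share a signature (N to D and Q to E, for example, have identical differences), so a learner given only these two numbers could still recover the substitution in any case where the signature is unique.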


    5 CONCLUSIONS
We have raised some important issues regarding training data for nsSNP prediction methods and recommended a best data set (Swiss-Prot disease/polymorphism). We believe that the effects described here have affected several studies, including our own, and whatever view is taken on the best data set, it is important that workers in the field be aware of them.


    ACKNOWLEDGEMENTS
We would like to acknowledge the comments of Fyodor Kondrashov and one anonymous reviewer, who helped us improve this manuscript. The work was carried out by M. Care with technical support from C. Needham. Supervision and support were provided by D. Westhead and A. Bulpitt. All authors approved the final manuscript. We thank the BBSRC for funding: a studentship (BBS/S/A/2004/10974) and grant number BBS/B/16585.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Dmitrij Frishman

Received on September 29, 2006; revised on November 22, 2006; accepted on December 18, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS
 4 DISCUSSION
 5 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Adamczak R, et al. Accurate prediction of solvent accessibility using neural networks-based regression. Proteins (2004) 56:753–767.[CrossRef][Web of Science][Medline]

    Alber T, et al. Temperature-sensitive mutations of bacteriophage T4 lysozyme occur at sites with low mobility and low solvent accessibility in the folded protein. Biochemistry (1987) 26:3754–3758.[CrossRef][Medline]

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.[Abstract/Free Full Text]

    Bao L, Cui Y. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics (2005) 21:2185–2190.[Abstract/Free Full Text]

    Benner SA, et al. Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng. (1994) 7:1323–1332.[Abstract/Free Full Text]

    Boeckmann B, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. (2003) 31:365–370.[Abstract/Free Full Text]

    Cai Z, et al. Bayesian approach to discovering pathogenic SNPs in conserved protein domains. Hum. Mutat. (2004) 24:178–184.[CrossRef][Web of Science][Medline]

    Cargill M, et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. (1999) 22:231–238.[CrossRef][Web of Science][Medline]

    Chasman D, Adams RM. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. (2001) 307:683–706.[CrossRef][Web of Science][Medline]

    Collins FS, et al. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. (1998) 8:1229–1231.[Free Full Text]

    Dobson R, et al. Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC Bioinformatics (2006) 7:217.[CrossRef][Medline]

    Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics (2004) 5:113.[CrossRef][Medline]

    Ferrer-Costa C, et al. Use of bioinformatics tools for the annotation of disease-associated mutations in animal models. Proteins (2005) 61:878–887.[CrossRef][Web of Science][Medline]

    Ferrer-Costa C, et al. Sequence-based prediction of pathological mutations. Proteins (2004) 57:811–819.[CrossRef][Web of Science][Medline]

    Ferrer-Costa C, et al. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol. (2002) 315:771–786.[CrossRef][Web of Science][Medline]

    Herrgard S, et al. Prediction of deleterious functional effects of amino acid mutations using a library of structure-based function descriptors. Proteins (2003) 53:806–816.[CrossRef][Web of Science][Medline]

    Hess ST, et al. Wide variations in neighbor-dependent substitution rates. J. Mol. Biol. (1994) 236:1022–1033.[CrossRef][Web of Science][Medline]

    Krishnan VG, Westhead DR. A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics (2003) 19:2199–2209.[Abstract/Free Full Text]

    Kruglyak L, Nickerson DA. Variation is the spice of life. Nat. Genet. (2001) 27:234–236.[CrossRef][Web of Science][Medline]

    Markiewicz P, et al. Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as "spacers" which do not require a specific sequence. J. Mol. Biol. (1994) 240:421–433.[CrossRef][Web of Science][Medline]

    Needham CJ, et al. Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics (2006) 7:405.[CrossRef][Medline]

    Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. (2001) 11:863–874.[Abstract/Free Full Text]

    Ramensky V, et al. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. (2002) 30:3894–3900.[Abstract/Free Full Text]

    Rennell D, et al. Systematic mutation of bacteriophage T4 lysozyme. J. Mol. Biol. (1991) 222:67–88.[CrossRef][Web of Science][Medline]

    Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. (1993) 232:584–599.[CrossRef][Web of Science][Medline]

    Ruggieri S. YaDT: Yet another Decision Tree builder. Proceedings of the 16th International Conference on Tools with Artificial Intelligence. IEEE Press (2004) 0:260–265.

    Saunders CT, Baker D. Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol. (2002) 322:891–901.[CrossRef][Web of Science][Medline]

    Suckow J, et al. Genetic studies of the Lac repressor. XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J. Mol. Biol. (1996) 261:509–523.[CrossRef][Web of Science][Medline]

    Sunyaev S, et al. Prediction of deleterious human alleles. Hum. Mol. Genet. (2001) 10:591–597.[Abstract/Free Full Text]

    Valdar WS. Scoring residue conservation. Proteins (2002) 48:227–241.[CrossRef][Web of Science][Medline]

    Verzilli CJ, et al. A hierarchical Bayesian model for predicting the functional consequences of amino-acid polymorphisms. J. R. Stat. Soc. Ser. C-Appl. Stat. (2005) 54:191–206.[CrossRef]

    Vitkup D, et al. The amino-acid mutational spectrum of human genetic disease. Genome Biol. (2003) 4:R72.[CrossRef][Medline]

    Wang Z, Moult J. SNPs, protein structure, and disease. Hum. Mutat. (2001) 17:263–270.[CrossRef][Web of Science][Medline]

    Yip YL, et al. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum. Mutat. (2004) 23:464–470.[CrossRef][Web of Science][Medline]

    Yue P, et al. Loss of protein structure stability as a major causative factor in monogenic disease. J. Mol. Biol. (2005) 353:459–473.[CrossRef][Web of Science][Medline]

    Yue P, Moult J. Identification and Analysis of Deleterious Human SNPs. J. Mol. Biol. (2006) 356:1263–1274.[CrossRef][Web of Science][Medline]





