Bioinformatics Advance Access originally published online on March 3, 2005
Bioinformatics 2005 21(10):2185-2190; doi:10.1093/bioinformatics/bti365
Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information
Department of Molecular Sciences, Center of Genomics and Bioinformatics, University of Tennessee Health Science Center 858 Madison Avenue, Memphis, TN 38163, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: There has been great expectation that the knowledge of an individual's genotype will provide a basis for assessing susceptibility to diseases and designing individualized therapy. Non-synonymous single nucleotide polymorphisms (nsSNPs) that lead to an amino acid change in the protein product are of particular interest because they account for nearly half of the known genetic variations related to human inherited diseases. To facilitate the identification of disease-associated nsSNPs from a large number of neutral nsSNPs, it is important to develop computational tools to predict the phenotypic effects of nsSNPs.
Results: We prepared a training set based on the variant phenotypic annotation of the Swiss-Prot database and focused our analysis on nsSNPs having homologous 3D structures. Structural environment parameters derived from the 3D homologous structure as well as evolutionary information derived from the multiple sequence alignment were used as predictors. Two machine learning methods, support vector machine and random forest, were trained and evaluated. We compared the performance of our method with that of the SIFT algorithm, which is one of the best predictive methods to date. An unbiased evaluation study shows that for nsSNPs with sufficient evolutionary information (at least 10 homologous sequences), the performance of our method is comparable with that of the SIFT algorithm, while for nsSNPs with insufficient evolutionary information (fewer than 10 homologous sequences), our method outperforms the SIFT algorithm significantly. These findings indicate that incorporating structural information is critical to achieving good prediction accuracy when sufficient evolutionary information is not available.
Availability: The codes and curated dataset are available at http://compbio.utmem.edu/snp/dataset/
Contact: ycui2{at}utmem.edu
Supplementary information: The curated dataset is available at http://compbio.utmem.edu/snp/dataset/
| INTRODUCTION |
|---|
In humans, 90% of sequence variants are differences in single bases of DNA, called single nucleotide polymorphisms (SNPs) (Collins et al., 1998). Among them, non-synonymous SNPs (nsSNPs) that lead to an amino acid change in the protein product are most relevant to human inherited diseases (Stenson et al., 2003). Whereas a large number of nsSNPs may be functionally neutral, others may cause deleterious effects on protein functions and are hence disease associated. Given the vast number of nsSNPs discovered (Irizarry et al., 2000; Fredman et al., 2002), a major challenge is to predict which of them are potentially disease associated. Recent studies have discovered a variety of potential predictors discriminating disease-associated nsSNPs from neutral nsSNPs. Empirical rule-based and machine learning approaches were used to classify these two types of nsSNPs. Empirical rules discriminating disease-associated and neutral nsSNPs were derived based on structural information (Wang and Moult, 2001), evolutionary information (Ng and Henikoff, 2001) or both (Sunyaev et al., 2001). Other recent studies (Chasman and Adams, 2001; Saunders and Baker, 2002; Krishnan and Westhead, 2003) developed classification models automatically learned from the training data. Except for the work of Wang and Moult (2001), all the mentioned studies used some form of position-specific evolutionary information contained in the multiple sequence alignments. The prediction accuracy therefore depends heavily on the existence of a sufficient number of homologous sequences. Saunders and Baker (2002) showed that the prediction accuracy decreased significantly when fewer than 5–10 homologous sequences were available. Incorporating structural information is crucial in such cases (Saunders and Baker, 2002). Here we developed classifiers combining structural and evolutionary information to discriminate disease-associated nsSNPs from neutral nsSNPs.
We prepared a curated training dataset from the UniProt knowledgebase (Apweiler et al., 2004). This dataset consists of natural nsSNPs, in contrast to in vitro mutational data used in previous studies (Chasman and Adams, 2001; Krishnan and Westhead, 2003). The structural environments (Bowie et al., 1991) and substitution properties of nsSNPs were used as predictors. We applied two machine learning methods, support vector machine (SVM) (Vapnik, 1998) and random forest (RF) (Breiman, 2001). We showed that for nsSNPs with insufficient homologous sequences, our method outperformed the SIFT algorithm (Ng and Henikoff, 2003) on account of the incorporated structural information. In the cases where sufficient homologous sequences were available, the performance of our method was comparable with the SIFT algorithm.
| SYSTEMS AND METHODS |
|---|
Dataset
Human nsSNPs were extracted via analysis of the VARIANT field in the corresponding Swiss-Prot entries (Apweiler et al., 2004). nsSNPs annotated as Disease are disease associated, and those annotated as Polymorphism are neutral nsSNPs. Major histocompatibility complex proteins and membrane proteins were excluded. We focused our analysis on nsSNPs with experimentally determined structure or structural homologs. Each nsSNP variant was searched against the ASTRAL database (Chandonia et al., 2004) using the BLASTP program (Altschul et al., 1990) to find representative homologous 3D structures. Hits were retained if they met the following criteria:
- sequence identity to the query sequence was at least 30%, to conserve basic structural characteristics,
- the number of identical amino acids was at least 20,
- gap content was <15% and
- the hit sequence had the same amino acid as the query sequence at the substitution site.
When multiple representative PDB entries passed these filters, the one with the highest sequence identity was chosen. These filters resulted in 532 neutral nsSNPs within 305 genes and 3686 disease-associated nsSNPs within 323 genes. To evaluate the discriminative power of our method on nsSNPs with insufficient evolutionary information, we split all the 4218 nsSNPs into two sets according to the number of homologous sequences. The 4013 nsSNPs with at least 10 homologous sequences were used as training samples (502 neutral and 3511 disease-associated nsSNPs), while the remaining 205 nsSNPs were used as independent test samples (30 neutral and 175 disease-associated nsSNPs). The datasets are available at http://compbio.utmem.edu/snp/dataset/.
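The hit-selection rules above can be sketched in a few lines of Python. This is only an illustration of the stated criteria: the `Hit` record, its field names and the example identifiers are hypothetical and are not taken from the original pipeline or from BLASTP output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One hit against the structure database (hypothetical field names)."""
    structure_id: str
    identity_pct: float    # percent sequence identity to the query
    n_identical: int       # number of identical aligned residues
    gap_pct: float         # percent of the alignment made up of gaps
    residue_at_site: str   # hit residue aligned to the substitution site


def passes_filters(hit: Hit, query_residue: str) -> bool:
    """Apply the four retention criteria listed in the text."""
    return (
        hit.identity_pct >= 30.0
        and hit.n_identical >= 20
        and hit.gap_pct < 15.0
        and hit.residue_at_site == query_residue
    )


def pick_representative(hits: List[Hit], query_residue: str) -> Optional[Hit]:
    """Keep the passing hit with the highest sequence identity, if any."""
    kept = [h for h in hits if passes_filters(h, query_residue)]
    return max(kept, key=lambda h: h.identity_pct) if kept else None


# Made-up hits for a query whose wild-type residue at the site is glycine.
hits = [
    Hit("hit_a", 45.0, 60, 5.0, "G"),
    Hit("hit_b", 80.0, 15, 2.0, "G"),   # too few identical residues
    Hit("hit_c", 52.0, 70, 3.0, "A"),   # different residue at the site
    Hit("hit_d", 48.0, 64, 20.0, "G"),  # too many gaps
]
best = pick_representative(hits, "G")
```

Only `hit_a` satisfies all four criteria here, so it is selected even though other hits have higher identity.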
SIFT score
The SIFT program (Ng and Henikoff, 2003) was used to calculate the SIFT score, a score measuring the tolerance of a substitution based on the mutability of the substitution position. SIFT used PSI-BLAST (Altschul et al., 1997) to search against the EMBL non-redundant protein database (Apweiler et al., 2004) for homologous sequences and construct a multiple sequence alignment. The multiple sequence alignment was converted into a position-specific scoring matrix. Each matrix entry Pij is the probability of amino acid j occurring at position i. The Pij was estimated as a weighted average of the observed frequencies at the position and the Dirichlet pseudocounts (Henikoff and Henikoff, 1996). To reduce multiple contributions from closely related members of a sequence family, the sequences were weighted (Henikoff and Henikoff, 1996). SIFT uses an empirical threshold: substitutions with normalized probabilities <0.05 are predicted as deleterious while others are predicted as tolerated.
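As a rough illustration of the thresholding step only, the sketch below scales one column of position-specific probabilities and applies the 0.05 cutoff. It is not the SIFT implementation: sequence weighting and Dirichlet pseudocounts are omitted, the column values are invented, and scaling by the most probable amino acid in the column is an assumption about how the normalization is done.

```python
from typing import Dict


def normalized_probabilities(column: Dict[str, float]) -> Dict[str, float]:
    """Scale one matrix column so the most probable amino acid scores 1.0.

    Assumption: normalization divides by the column maximum.
    """
    top = max(column.values())
    return {aa: p / top for aa, p in column.items()}


def sift_like_call(column: Dict[str, float], mutant: str,
                   threshold: float = 0.05) -> str:
    """Label a substitution using the empirical threshold from the text."""
    score = normalized_probabilities(column)[mutant]
    return "deleterious" if score < threshold else "tolerated"


# Invented probabilities P_ij for a single position i (they sum to 1).
column = {"L": 0.70, "I": 0.20, "V": 0.08, "D": 0.02}
call_d = sift_like_call(column, "D")  # 0.02 / 0.70 is below 0.05
call_v = sift_like_call(column, "V")  # 0.08 / 0.70 is above 0.05
```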
Predictors
The predictors we used are listed in Table 1. The first three predictors in Table 1 represent the structural environment of a substitution site. The structural environment of each nsSNP was annotated by the ENVIRONMENT program developed by Bowie et al. (1991). The program combined three structural parameters (area buried, fraction polar and secondary structure) to define the structural environment of a site. Briefly, the buried area of a residue was determined by placing imaginary solvent spheres around each atom and calculating the difference between the side-chain area covered by solvent-accessible sample points in a protein site and in a Gly-X-Gly tripeptide. The fraction polar of a residue was calculated as the ratio of the number of sample points covered by polar atoms (or exposed to solvent) to the total number of sample points. By setting empirical cutoffs for these two structural parameters, Bowie et al. (1991) defined six environment classes: B1, B2, B3, P1, P2 and E (Figure 4 of Bowie et al., 1991). Combining the six environment classes with three-state (helix, sheet and coil) secondary structures gave a total of 18 environment classes. The STRIDE program (Frishman and Argos, 1995) was used to assign the secondary structures. In essence, each position in a 3D structure could be assigned to 1 of the 18 environment classes. Importantly, different environment classes had different amino acid preferences, as measured by 3D–1D compatibility scores (Figure 5 of Bowie et al., 1991). To assess the differences between the wild-type (original) and mutated amino acids, we derived a structural environment-specific grouping of the 20 amino acids (Table 2). The grouping of the 20 amino acids was based both on their physicochemical properties and compatibility with the structural environment (Table S1). 
If the wild-type and mutated amino acids fell into the same group, the indicator of change of amino acid group got a value of 0; otherwise, the indicator got a value of 1. The SIFT score was calculated by the SIFT program as described above.
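A minimal sketch of how the 18 environment classes and the group-change indicator could be encoded follows. The class labels such as `E-coil` and, in particular, the amino acid grouping shown are hypothetical placeholders: the actual environment-specific groupings are those of Table 2, which is not reproduced here.

```python
from itertools import product
from typing import Dict, List, Set

BURIAL_POLARITY_CLASSES = ["B1", "B2", "B3", "P1", "P2", "E"]
SECONDARY_STRUCTURES = ["helix", "sheet", "coil"]

# 6 burial/polarity classes x 3 secondary structure states = 18 classes.
ENVIRONMENT_CLASSES = [
    f"{env}-{ss}"
    for env, ss in product(BURIAL_POLARITY_CLASSES, SECONDARY_STRUCTURES)
]

# HYPOTHETICAL grouping for one environment class, for illustration only;
# it does NOT reproduce Table 2 of the paper.
EXAMPLE_GROUPS: Dict[str, List[Set[str]]] = {
    "E-coil": [set("DE"), set("KRH"), set("STNQ"),
               set("AVLIMC"), set("FWY"), set("GP")],
}


def group_change_indicator(env_class: str, wild_type: str, mutant: str,
                           groups: Dict[str, List[Set[str]]]) -> int:
    """Return 0 if both residues share a group in this environment, else 1."""
    for group in groups[env_class]:
        if wild_type in group:
            return 0 if mutant in group else 1
    raise ValueError(f"{wild_type!r} is not in any group for {env_class}")


same_group = group_change_indicator("E-coil", "D", "E", EXAMPLE_GROUPS)
different_group = group_change_indicator("E-coil", "G", "D", EXAMPLE_GROUPS)
```

Keying the grouping by environment class reflects the point made in the text that the same pair of amino acids may count as a change in one environment and not in another.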
Evaluation of classification accuracy
Classification accuracy was evaluated using 10-fold cross-validation. The data were randomly split into 10 equal parts. One was used for testing and the others for training. The procedure was repeated 10 times so that each sample was used exactly once for testing. The results of five independent 10-fold cross-validation experiments were averaged to get a fair evaluation. Since the dataset contains many more disease-associated nsSNPs (positives) than neutral nsSNPs (negatives), we used Matthews' correlation coefficient (MCC) (Matthews, 1985) to evaluate the performance,

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
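The evaluation protocol can be sketched as follows, assuming the standard definition of MCC from confusion-matrix counts. The balanced error rate (BER) used later in the Results is not defined in the surviving text, so the formula below (the mean of the two per-class error rates) is an assumption, and the function names are illustrative.

```python
import math
import random
from typing import List


def kfold_test_folds(n: int, k: int, rng: random.Random) -> List[List[int]]:
    """Shuffle sample indices and deal them into k near-equal test folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_error_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Assumed definition: average of the error rates on each class."""
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


folds = kfold_test_folds(4013, 10, random.Random(1))

# A trivial classifier that calls every test sample disease-associated
# (175 positives, 30 negatives) is often right but carries no information.
accuracy = 175 / 205
trivial_mcc = mcc(tp=175, tn=0, fp=30, fn=0)
trivial_ber = balanced_error_rate(tp=175, tn=0, fp=30, fn=0)
```

The trivial classifier reaches about 85% accuracy on the 205 test samples, yet its MCC is 0 and its BER is 0.5, which is why a class-balance-aware measure is preferable for this dataset.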
Support vector machine
Support vector machine (SVM) (Vapnik, 1998) is a classifier seeking an optimal hyperplane to separate two classes of samples. SVM uses kernel functions to map original data to a feature space of higher dimensions and locate an optimal separating hyperplane there. We used SVM-light, an implementation of the SVM algorithm by Joachims (1999). The performance of SVM is mainly controlled by the kernel function and the regularization parameter C. The kernel function determines the sample distribution in the feature space. The regularization parameter C controls the trade-off between training errors and the hyperplane margin. A larger C value assigns a higher penalty to the training errors. Polynomial kernel functions with powers of 1, 2 or 3 and radial basis kernels (g=0.01, 0.1, 1.0, 5.0 and 10.0) were tested in combination with different C values (0.1, 1.0, 5.0, 10.0 and 50.0) to tune for good performance.
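The parameter search described above amounts to scoring every kernel setting against every C value. The sketch below only enumerates that grid and picks the best entry under a caller-supplied scoring function; it does not call SVM-light, and `evaluate` is a hypothetical stand-in for something like a cross-validated MCC.

```python
from itertools import product
from typing import Callable, Dict, List

DEGREES = [1, 2, 3]                       # polynomial kernel powers
GAMMAS = [0.01, 0.1, 1.0, 5.0, 10.0]      # radial basis kernel g values
C_VALUES = [0.1, 1.0, 5.0, 10.0, 50.0]    # regularization parameter C

Config = Dict[str, object]


def build_grid() -> List[Config]:
    """Every kernel setting paired with every C value."""
    kernels: List[Config] = [{"kernel": "poly", "degree": d} for d in DEGREES]
    kernels += [{"kernel": "rbf", "gamma": g} for g in GAMMAS]
    return [dict(k, C=c) for k, c in product(kernels, C_VALUES)]


def select_best(grid: List[Config],
                evaluate: Callable[[Config], float]) -> Config:
    """Return the configuration with the highest score."""
    return max(grid, key=evaluate)


grid = build_grid()  # (3 polynomial + 5 radial basis) x 5 C values

# Placeholder scorer for demonstration; a real one would train and
# cross-validate a classifier for each configuration.
best = select_best(grid, evaluate=lambda cfg: -float(cfg["C"]))
```

With three polynomial and five radial basis settings crossed with five C values, the grid holds 40 configurations.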
Random forest
Random forest (RF) is a classifier consisting of an ensemble of tree-structured classifiers (Breiman, 2001). RF takes advantage of two powerful machine learning techniques: bagging (Breiman, 1996) and random feature selection. In bagging, each tree is trained on a bootstrap sample of the training data, and predictions are made by majority vote of the trees. When a bootstrap sample of the training data is drawn, about one-third of the cases are left out; these are called out-of-bag (OOB) data. OOB data can be used to get an unbiased estimate of the classification error during the training process. Details of growing (training) an individual tree can be found in Breiman et al. (1984). RF is a further development of bagging. Instead of using all features, RF randomly selects a subset of features as split candidates at each node when growing a tree. Breiman (2001) deduced an upper bound on the generalization error and concluded that RF does not suffer from the overfitting problem. Several recent studies demonstrated the better performance of RF over other machine learning approaches (Wu et al., 2003; Gunther et al., 2003; Svetnik et al., 2003). We used the R language implementation of RF (Svetnik et al., 2003). The number of trees to grow was set to 1000. RF uses a parameter mtry to specify the number of random features to be searched at each tree node. We used cross-validation to determine the best mtry value.
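The three ingredients named above (bootstrap sampling, out-of-bag cases and random feature selection) can be illustrated without any forest library. This is a sketch, not the R implementation used in the study; the feature names are informal labels for the predictors described in the text, and the helper names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def bootstrap_indices(n: int, rng: random.Random) -> List[int]:
    """Draw n sample indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def oob_indices(n: int, in_bag: Sequence[int]) -> List[int]:
    """Indices never drawn into the bootstrap sample (out-of-bag cases)."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]


def candidate_features(features: Sequence[str], mtry: int,
                       rng: random.Random) -> List[str]:
    """Random subset of features considered for the split at one node."""
    return rng.sample(list(features), mtry)


def majority_vote(votes: Sequence[str]) -> str:
    """Most common prediction; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
n = 4013  # size of the training set in this study
in_bag = bootstrap_indices(n, rng)
oob_fraction = len(oob_indices(n, in_bag)) / n

features = ["buried_area", "fraction_polar", "secondary_structure",
            "wild_type_residue", "sift_score", "group_change"]
node_features = candidate_features(features, mtry=2, rng=rng)
prediction = majority_vote(["disease", "neutral", "disease"])
```

For large n the expected out-of-bag fraction is (1 - 1/n)^n, close to e^-1 or about 0.37, which is the "about one-third" quoted in the text.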
| RESULTS |
|---|
Selected predictors and biological implications
Structural and functional constraints are believed to be the underlying mechanisms that determine the phenotypic effect of an nsSNP. We derived several predictors from the literature and our own studies; the predictors used in this work are listed in Table 1. Such constraints are related to the properties of the substitution site, the identity of the wild-type amino acid and the differences between the wild-type and the mutated amino acid. First, the structural environment class definition, originally introduced by Bowie et al. (1991) in 1D representation of protein structure in fold-recognition studies, is a good proxy for structural constraints on the substitution site. Here, we extended its application to the problem of predicting the phenotypic effect of nsSNPs. Bowie et al. (1991) used combinations of three structural parameters (buried area, fraction polar and secondary structure) to define 18 structural environments. Buried area reflects the solvent accessibility constraint, and it is known that disease-associated nsSNPs tend to occur at buried sites (Sunyaev et al., 2000). Fraction polar is an indicator of environmental polarity and reflects the hydrogen bond constraint (Bowie et al., 1991). Disease-associated and neutral nsSNPs also have slightly different secondary structure propensities, with the former tending to occur at β-sheet sites (Sunyaev et al., 2000). Second, the identity of the wild-type amino acid was used as a predictor. Third, two parameters were used to describe the substitution changes. The SIFT score (Ng and Henikoff, 2001) measures the tolerance for a substitution in a multiple sequence alignment and hence incorporates evolutionary information. The indicator of change in the amino acid group is proposed here for the first time. We took both physicochemical properties and compatibility with the structural environment into consideration. Different structural environments have different groupings of amino acids (Tables 2 and S1). 
A substitution leading to a great change in amino acid physicochemical property and/or compatibility with the structural environment tends to be disease associated, while a substitution leading to a minor change tends to be neutral.
Performance of SVM and RF
For the 4013 training samples with sufficient evolutionary information (each had at least 10 homologous sequences), we used cross-validation experiments to evaluate the performance of our method and to compare the results with those of the SIFT algorithm. For the 205 independent test samples with insufficient evolutionary information (each had <10 homologous sequences), the classifiers trained on the training samples made a prediction on each test sample. It was straightforward to compare the prediction accuracy between different methods. Various parameters were tested for SVM and RF classifiers. RF has a built-in measurement of the performance: the OOB prediction error (Breiman, 2001). Hence, cross-validation was not necessary. But for the purpose of comparison with SVM, we still performed cross-validation to determine the best RF parameter mtry. In fact, the OOB error was very similar to the classification error of cross-validation. Among the tested SVM classifiers, the best performance was found using a radial basis kernel with g = 0.1 and C = 10. For RF classifiers, the best performance was found by setting mtry = 2. The cross-validation results of the selected SVM and RF are listed in Table 3 along with the prediction accuracy of the SIFT algorithm. Figure 1 plots the corresponding ROC curves. The result shows that RF outperforms SVM. A possible reason is that the last two predictors in Table 1 are partially correlated, and SVM has difficulty in dealing with correlated predictors. In contrast, RF handles correlated predictors well, because it uses a random feature selection technique. Table 3 and Figure 1A also show that for nsSNPs with sufficient evolutionary information (at least 10 homologous sequences), our method is comparable with the SIFT algorithm. The BER and the MCC of our method are slightly better than those of the SIFT algorithm. 
These findings indicate that, for nsSNPs with sufficient evolutionary information, adding structural information only improves the prediction accuracy slightly. However, for the 205 independent test samples with insufficient evolutionary information, Table 3 and Figure 1B show that the improvement is significant. Therefore, for nsSNPs with insufficient evolutionary information, making use of structural information is critical for predicting the phenotypic effects of the nsSNPs.
Predictive power of the individual predictors
RF has a built-in measurement for the importance of an individual predictor, called mean decrease accuracy. It is calculated by randomly permuting the values of an individual predictor (predictor j) in the OOB cases. For each tree, the number of votes for the correct class in the predictor-j-permuted OOB data was subtracted from the number of votes for the correct class in the untouched OOB data, and the differences were averaged over all trees in the forest. The resulting mean decrease accuracy is a measure of predictor importance with respect to its contribution to the prediction accuracy. Table 4 shows the importance of individual predictors. The SIFT score was the best among all the predictors. This is expected when sufficient evolutionary information exists, because the SIFT algorithm uses the information that the tolerance of a substitution has been naturally sampled during evolution. The discriminating power of buried area and β-sheet was consistent with previous observations (Sunyaev et al., 2000). Interestingly, the discriminating power of the wild-type amino acid was obvious for some amino acids like glycine, cysteine and the charged amino acids. This indicated that the wild-type amino acid was differently distributed over the 18 structural environments between disease-associated and neutral nsSNPs. For example, in the EC structural environment, we found that the wild-type amino acid of disease-associated nsSNPs was much more likely to be glycine than that of neutral nsSNPs (64% versus 27%). Hydrophobic wild-type amino acids, in contrast, had the least discriminating power.
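The permutation procedure described above can be sketched directly from its definition. This follows the wording of the text rather than the internals of any particular random forest package (which may additionally normalize the result); the forest representation as `(tree, oob_indices)` pairs and the toy data are assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = List[float]
Tree = Callable[[Row], int]


def count_correct(tree: Tree, rows: Sequence[Row],
                  labels: Sequence[int]) -> int:
    """Number of rows for which the tree votes for the correct class."""
    return sum(1 for r, y in zip(rows, labels) if tree(r) == y)


def mean_decrease_accuracy(forest: Sequence[Tuple[Tree, Sequence[int]]],
                           X: Sequence[Row], y: Sequence[int],
                           j: int, rng: random.Random) -> float:
    """Average, over trees, of (correct votes on untouched OOB data)
    minus (correct votes after permuting predictor j in the OOB data)."""
    diffs = []
    for tree, oob in forest:
        rows = [X[i] for i in oob]
        labels = [y[i] for i in oob]
        baseline = count_correct(tree, rows, labels)
        shuffled = [r[j] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, shuffled)]
        diffs.append(baseline - count_correct(tree, permuted, labels))
    return sum(diffs) / len(diffs)


# Toy data: the label depends only on predictor 0; predictor 1 is noise.
rng = random.Random(0)
X = [[float(i % 2), rng.random()] for i in range(20)]
y = [int(row[0] > 0.5) for row in X]
stump: Tree = lambda row: int(row[0] > 0.5)   # a one-split "tree"
forest = [(stump, list(range(0, 10))), (stump, list(range(10, 20)))]

importance_signal = mean_decrease_accuracy(forest, X, y, j=0, rng=rng)
importance_noise = mean_decrease_accuracy(forest, X, y, j=1, rng=rng)
```

Because the toy trees never look at predictor 1, permuting it cannot change any vote and its importance is exactly zero, whereas permuting predictor 0 can only lower the number of correct votes.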
| DISCUSSION |
|---|
Discovering relationships between genotypes and phenotypes is the central task of genetic studies. The links between genotype and phenotype of nsSNPs have received plenty of research attention because of their prevalence in genomes and close associations to inherited diseases. With more and more genotype and phenotype data available and with increasing knowledge of the properties of nsSNPs, it is now practical to predict the phenotype of an nsSNP (i.e. whether an nsSNP is disease-associated or neutral) from the genotype in silico. The SIFT server (Ng and Henikoff, 2003) and the PolyPhen server (Ramensky et al., 2002) are the two representatives for this purpose. Instead of learning from data, they determine parameters manually based on the knowledge of a human expert. Several other studies have exploited machine learning approaches to classify disease-associated and neutral nsSNPs (Chasman and Adams, 2001; Saunders and Baker, 2002; Krishnan and Westhead, 2003). Our study is different from the others in that we used natural nsSNPs rather than in vitro mutational data as the training set. Saunders and Baker (2002) also tested their method in natural nsSNPs, but their set contained a rather small number of samples. In vitro mutational data includes only two proteins and might introduce some bias. Previous studies showed that cross-validation accuracy of natural nsSNP data is generally lower than that of in vitro mutational data (Saunders and Baker, 2002), demonstrating that a fair evaluation of performance should use a natural nsSNP dataset.
Good prediction accuracy usually depends on two factors: informative predictors and a superior machine learning approach. We introduced several novel informative predictors in combination with some predictors from the literature to achieve better discriminating power. We found that the structural parameters representing environments of nsSNPs as well as the environment-specific grouping of wild-type and mutated amino acids have considerable discriminating power. Furthermore, two state-of-the-art machine learning methods, RF and SVM, were used to combine the discriminating powers of individual predictors in approximately optimal ways. RF was found to outperform SVM. A possible reason is that RF is superior to SVM in dealing with correlated predictors. The comparison of our method with the frequently used SIFT algorithm revealed that, for nsSNPs with insufficient evolutionary information, incorporating structural information remarkably increased the prediction accuracy.
Our method required 3D structures (or homologous structures) of the nsSNP variants, which limits its application when only sequence information is available. However, it is expected that the structural genomics project (Berman and Westbrook, 2004) will rapidly increase the number of experimentally derived protein structures. Furthermore, genome-wide protein 3D modeling projects (Schwede et al., 2003) and the progress in protein structure prediction (Hardin et al., 2002) will also increase the applicability of our method.
| Acknowledgments |
|---|
We thank Drs James Bowie, Roland Luethy and David Eisenberg for providing the computer program for calculating the structural environments. We thank Drs Pauline Ng and Steven Henikoff for providing access to the SIFT program. We thank Drs Leo Breiman, Andy Liaw and Matthew Wiener for providing access to the Random Forest package and helpful discussions. We thank Dr Thorsten Joachims for providing access to the SVM-light software. We also thank the two anonymous reviewers for their very helpful comments. This work was partly supported by a PhRMA Foundation grant to YC.
Received on October 20, 2004; revised on February 17, 2005; accepted on February 28, 2005
| REFERENCES |
|---|
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Apweiler, R., et al. (2004) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 32, D115–D119.
Bhasin, M. and Raghava, G.P. (2004) ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res., 32, W414–W419.
Berman, H.M. and Westbrook, J.D. (2004) The impact of structural genomics on the protein data bank. Am. J. Pharmacogenomics, 4, 247–252.
Bowie, J.U., et al. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170.
Breiman, L. (1996) Bagging predictors. Mach. Learning, 24, 123–140.
Breiman, L. (2001) Random forest. Technical Report, Stat. Dept. UCB.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C. (1984) Classification and Regression Trees. Chapman and Hall, NY.
Chandonia, J.M., et al. (2004) The ASTRAL compendium in 2004. Nucleic Acids Res., 32, D189–D192.
Chasman, D. and Adams, R.M. (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol., 307, 683–706.
Chen, Y.C., et al. (2004) Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins, 55, 1036–1042.
Collins, F.S., et al. (1998) A DNA polymorphism discovery resource for research on human genetic variation. Genome Res., 8, 1229–1231.
Fredman, D., et al. (2002) HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res., 30, 387–391.
Frishman, D. and Argos, P. (1995) Knowledge-based protein secondary structure assignment. Proteins, 23, 566–579.
Gunther, E.C., et al. (2003) Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc. Natl Acad. Sci. USA, 100, 9608–9613.
Hardin, C., et al. (2002) Ab initio protein structure prediction. Curr. Opin. Struct. Biol., 12, 176–181.
Henikoff, J.G. and Henikoff, S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci., 12, 135–143.
Irizarry, K., et al. (2000) Comprehensive EST analysis of single nucleotide polymorphism across coding regions of the human genome. Nat. Genet., 26, 233–236.
Joachims, T. (1999) Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., Smola, A. (Eds.), Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA.
Krishnan, V.G. and Westhead, D.R. (2003) A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics, 19, 2199–2209.
Matthews, B.W. (1985) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
Ng, P.C. and Henikoff, S. (2001) Predicting deleterious amino acid substitutions. Genome Res., 11, 863–874.
Ng, P.C. and Henikoff, S. (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res., 31, 3812–3814.
Ramensky, V., et al. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res., 30, 3894–3900.
Saunders, C.T. and Baker, D. (2002) Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol., 322, 891–901.
Schwede, T., et al. (2003) SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res., 31, 3381–3385.
Stenson, P.D., et al. (2003) Human gene mutation database (HGMD): 2003 update. Hum. Mutat., 21, 577–581.
Sunyaev, S., et al. (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet., 16, 198–200.
Sunyaev, S., et al. (2001) Prediction of deleterious human alleles. Hum. Mol. Genet., 10, 591–597.
Svetnik, V., et al. (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci., 43, 1947–1958.
Vapnik, V. (1998) Statistical Learning Theory. Wiley, NY.
Wang, Z. and Moult, J. (2001) SNPs, protein structure, and disease. Hum. Mutat., 17, 263–270.
Wu, B., et al. (2003) Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636–1643.
Zhou, X.H., Obuchowski, N., Obuchowski, D. (2002) Statistical Methods in Diagnostic Medicine. Wiley and Sons, NY.