Bioinformatics Advance Access originally published online on September 12, 2007
Bioinformatics 2007 23(21):2918-2925; doi:10.1093/bioinformatics/btm437
Accurate prediction of deleterious protein kinase polymorphisms
1Department of Medicine and Center for Human Genetics and Genomics and 2Scripps Genomic Medicine and Department of Molecular and Experimental Medicine, The Scripps Research Institute, University of California, San Diego, La Jolla, CA 92093, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Contemporary, high-throughput sequencing efforts have identified a rich source of naturally occurring single nucleotide polymorphisms (SNPs), a subset of which occur in the coding region of genes and result in a change in the encoded amino acid sequence (non-synonymous coding SNPs or nsSNPs). It is hypothesized that a subset of these nsSNPs may underlie common human disease. Testing all these polymorphisms for disease association would be time consuming and expensive. Thus, computational methods have been developed to both prioritize candidate nsSNPs and make sense of their likely molecular physiologic impact.
Results: We have developed a method to prioritize nsSNPs and have applied it to the human protein kinase gene family. The results of our analyses provide high quality predictions and outperform available whole genome prediction methods (83% versus 74% prediction accuracy). Our analyses and methods consider both DNA sequence conservation, on which most traditional methods are based, and unique structural and functional features of kinases. We provide a ranked list of common kinase nsSNPs that have a higher probability of impacting human disease based on our analyses.
Contact: nschork{at}scripps.edu
Supplementary information: Supplementary data are available on Bioinformatics online.
| 1 INTRODUCTION |
|---|
Many rare single nucleotide polymorphisms (SNPs) have been identified as contributing to susceptibility to human diseases (Cargill et al., 1999). However, these highly penetrant variations account for only a small proportion of all human diseases. With this in mind, the common disease, common variant hypothesis has been put forward which postulates that common low-penetrance variations, rather than multiple rare high-penetrance variations, are likely to be the main contributors to disease susceptibility (Becker, 2004; Pritchard, 2001; Reich and Lander, 2001). Alternatively, the majority of disease may be caused by a large number of extremely rare mutations. In fact, the allelic heterogeneity of many overtly monogenic, Mendelian disorders suggests that this may indeed be a possibility (Pritchard and Cox, 2002). Obviously, characterizing the genetic basis of disease is important for understanding disease pathogenesis and might be especially important in identifying pharmaceutical targets for relevant treatments as well as providing possible diagnostic and prognostic markers for assessing an individual's susceptibility to disease (Collins and Guttmacher, 2001).
It is estimated that 10 million common SNPs populate the human genome that have an appreciable frequency (i.e. >1%) in the population at large (The International HapMap Consortium, 2003), of which 67 000–200 000 are non-synonymous coding SNPs (nsSNPs) (Cargill et al., 1999; Halushka et al., 1999; Livingston et al., 2004). Irrespective of whether or not a subset of common nsSNPs or a large number of rare nsSNPs are responsible for causing human disease, testing possible associations between these nsSNPs and disease and/or experimentally characterizing the functional effects of these nsSNPs, would be extremely expensive, time consuming and likely suffer from low statistical power to differentiate disease-causing from non-disease causing nsSNPs (Ohashi and Tokunaga, 2002).
One approach to overcome this problem is to computationally prioritize all candidate nsSNPs for their likely impact on disease susceptibility and then test the most probable disease-causing SNPs for association with diseases. In addition, nsSNPs identified as associated with a disease from whole genome association (WGA) studies may benefit from insight into their putative functional significance (Couzin, 2007). A number of methods have been designed for this purpose (Ng and Henikoff, 2006). Many of these prediction schemes exploit only a few characteristics of the SNPs, such as their levels of DNA or amino acid conservation. Others exploit a wider range of characteristics but are limited to characteristics that can be easily generalized to the entire range of proteins found in the human genome, or are restricted in coverage to structurally characterized proteins (Jian et al., 2007). As a result, these methods typically provide either wide coverage (>50%) but high false positive and false negative rates (>20%), or lower false positive and false negative rates but with extremely restricted coverage that requires complete structural characterization of relevant proteins.
In this article, we describe a sequence-based method that exploits information and nsSNP characteristics previously used by other prediction schemes (i.e. conservation, secondary structure, solvent accessibility, etc.), as well as information not used in previous prediction schemes (group membership, domain residence, protein flexibility and five different amino acid metrics). These additional structural features can be readily extracted and applied to any particular protein family. Essentially, we sought to predict disease-causing nsSNPs using either subsets of these characteristics or all of them together with different statistical prediction and analysis tools. To showcase the proposed methodology, we have designed and applied analysis methods in order to predict nsSNPs that cause disease falling within the human protein kinase gene family, a family comprising 22% of the druggable genome (Hopkins and Groom, 2002), and implicated in a wide variety of biological processes and human diseases (Hunter, 1998). The best prediction model we developed outperforms previously described prediction schemes (83% correctly predicted by our method versus <74% correctly predicted by previous methods; significance of the difference, p < 0.0001) and provides high quality predictions for probable disease-associated common nsSNPs in the human protein kinase family. We comment on our analyses and models and consider some important caveats associated with our results in the Discussion section.
| 2 METHODS |
|---|
We compiled an extensive record of nsSNPs in kinases using public domain resources (Cargill et al., 1999; Lander et al., 2001; Sachidanandam et al., 2001; Venter et al., 2001). We developed a number of SNP databases including a natural set of SNPs that included nsSNPs known to cause disease from genetic studies and an experimental set of SNPs that included SNPs found to be deleterious from specific experimental manipulations. The details of the construction of these datasets can be found in Torkamani and Schork (2007). For the creation of the natural set, all disease-causing SNPs (DCs) were taken from published literature compiled in OMIM, KinMutBase and the Human Gene Mutation Database (HGMD). SNPs not known to cause disease (uDCs; i.e. nsSNPs unknown to cause disease) were obtained from dbSNP125 and PupaSNP. The majority of these nsSNPs are common and probably neutral variations within the human genome and are not associated with any overt clinical phenotype. We want to emphasize, however, that the functional effects of many of these SNPs have not been explored in full. For the creation of the experimental set, all DCs were from experimentally generated and functionally characterized mutations found in the SwissProt feature table (nsSNPs affecting protein function are characterized as disease causing) and all uDCs were obtained from dbSNP126. An additional dataset, Swiss-Prot disease/polymorphism, was compiled by collecting polymorphisms found in the SwissProt feature table labeled as polymorphism and disease.
The SNP characteristics used to predict disease-causing status were: (1) kinase group; (2) wild-type amino acid; (3) SNP amino acid; (4) domain; (5) subPSEC score (Thomas and Kejariwal, 2004; Thomas et al., 2003); (6) the change in hydrophobicity, polarity and charge coded as 1, 0 or –1 where 1 is a gain in the respective factor, 0 is no change, and –1 is a loss in the respective factor; (7) the secondary structure coded as coil, helix, or sheet as predicted by the Proteus server (http://129.128.185.184/proteus/index.jsp) (Montgomerie et al., 2006); (8) the solvent accessibility coded as accessible, inaccessible or intermediate, as determined by the Predict Protein server (http://www.predictprotein.org) (Rost et al., 2003); (9) the flexibility WMSA and Union scores as determined by Wiggle (Gu et al., 2006); and (10) the differences in the following characteristics: the five amino acid metrics from Atchley et al. (2005), Kyte–Doolittle Hydropathy (Kyte and Doolittle, 1982), water/octanol partition energy (White and Wimley, 1999) and volume (Harpaz et al., 1994). For mutations falling in the kinase catalytic domain, an additional 11th predictor, whether the mutation falls within the N-terminal or the C-terminal lobe, was used. Additional characteristics, not used as predictors in our model but used to compare the performance of our model to that of others, were the SIFT score (Ng and Henikoff, 2002), PMut score (Ferrer-Costa et al., 2005) and SNPs3D (Yue et al., 2006).
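As a rough illustration of characteristic (6) and part of characteristic (10), the sketch below codes a substitution's change in hydrophobicity, polarity and charge as 1, 0 or -1 and computes the difference in Kyte–Doolittle hydropathy. The residue groupings (`HYDROPHOBIC`, `POLAR`, `CHARGED`) and the function names are assumptions made for this example; the text above does not state which groupings were used.

```python
# Illustrative residue classes (assumed; not taken from the paper).
HYDROPHOBIC = set("AVLIMFWC")
POLAR = set("STNQYHDEKR")
CHARGED = set("DEKR")  # histidine is treated as uncharged in this sketch

# Kyte-Doolittle hydropathy index (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def ternary_change(wild_type, variant, residue_class):
    """Return 1 for a gain, 0 for no change and -1 for a loss of a property."""
    return int(variant in residue_class) - int(wild_type in residue_class)


def encode_substitution(wild_type, variant):
    """Build a small feature dictionary for one amino acid substitution."""
    return {
        "wild_type": wild_type,
        "variant": variant,
        "d_hydrophobicity": ternary_change(wild_type, variant, HYDROPHOBIC),
        "d_polarity": ternary_change(wild_type, variant, POLAR),
        "d_charge": ternary_change(wild_type, variant, CHARGED),
        # Continuous difference: variant value minus wild-type value.
        "d_hydropathy": KYTE_DOOLITTLE[variant] - KYTE_DOOLITTLE[wild_type],
    }
```

Under these assumed groupings, `encode_substitution("D", "Y")` codes the charge change as -1 (a loss of charge) and the polarity change as 0.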
The Support Vector Machine (SVM) used for predictions was the Sequential Minimal Optimization (SMO) implementation in the WEKA data-mining software package (Witten and Frank, 2005). Other classifiers we explored, but ultimately discarded in favor of an SVM, were a neural network (Multilayer Perceptron) and the Decision Table, also from the WEKA software package.
In creating the final prediction model, training of the SVM was performed on the full natural set as well as a subset of the natural set containing only mutations occurring within the kinase catalytic domain. An additional characteristic, the subdomain of the kinase catalytic domain, was considered in the second SVM. These separate SVMs were then applied to the test set and predictions were combined to form the final set of predictions. The threshold probability to declare a mutation as disease causing was determined as the threshold resulting in the highest average F-measure score when both training and testing were carried out upon the natural set; this threshold was maintained for application to all test sets. Areas under the curve and comparison of different ROC curves were determined empirically as described in Lasko et al. (2005).
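The threshold selection described above can be sketched as a simple sweep over candidate thresholds. This is a minimal illustration, not the authors' implementation: it assumes that "average F-measure" means the mean of the F-measures of the disease-causing and non-disease-causing classes, and the function names and the candidate grid of 0.01 to 0.99 are hypothetical.

```python
def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) for one class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_f_measure(labels, probs, threshold):
    """Mean of the per-class F-measures at a given probability threshold.

    labels: 1 = disease causing, 0 = not known to cause disease.
    probs:  predicted probability of being disease causing.
    """
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    f_disease = f_measure(tp, fp, fn)
    # For the negative class the roles of the counts are mirrored.
    f_neutral = f_measure(tn, fn, fp)
    return (f_disease + f_neutral) / 2


def best_threshold(labels, probs, candidates=None):
    """Return the candidate threshold with the highest average F-measure."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid
    return max(candidates, key=lambda t: average_f_measure(labels, probs, t))
```

Once chosen on the training data, the same threshold would be reused unchanged on every test set, as the text describes.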
| 3 RESULTS |
|---|
3.1 Prediction method
The SVM-based statistical classifier used to generate our prediction scheme and model was chosen heuristically by comparison of its performance to other prediction schemes in differentiating disease from non-disease causing variations using two test datasets: (1) a natural set, consisting of naturally occurring kinase polymorphisms; and (2) an experimental set, consisting of induced mutations. Among other statistical classifiers, we compared an SVM, a Neural Network model and a Decision Table (Table 1). Since experimental mutations are selected by experimentalists and do not occur naturally in particular kinase groups, the kinase group characteristic was omitted for experimental mutation predictions. Comparison of the different methods involved consideration of average F-measures, percent correctly predicted, the Matthews correlation coefficient (Petrova and Wu, 2006), and the balanced error rate. Our comparisons suggested that, considering both the experimental and natural datasets, the SVM performed best on average, and, as such, it was chosen to generate our final prediction scheme and model.
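For readers unfamiliar with the comparison measures named above, the following sketch computes the percent correctly predicted, the Matthews correlation coefficient and the balanced error rate from a confusion matrix using their standard definitions. The function name and the counts in the usage note are hypothetical and are not taken from Table 1.

```python
import math


def classification_summary(tp, fp, tn, fn):
    """Summarise a binary confusion matrix with three standard measures."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    # Matthews correlation coefficient: ranges from -1 to 1, 0 is chance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Balanced error rate: mean of the error rates of the two classes.
    miss_rate = fn / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (tn + fp) if (tn + fp) else 0.0
    ber = 0.5 * (miss_rate + false_alarm_rate)

    return {"percent_correct": 100.0 * accuracy, "mcc": mcc, "ber": ber}
```

With made-up counts of 40 true positives, 10 false positives, 40 true negatives and 10 false negatives, this gives 80% correctly predicted, a Matthews correlation coefficient of 0.6 and a balanced error rate of 0.2.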
3.2 Performance and validation of the prediction model
First, the method was applied to the natural set on which it was trained. Figure 1 presents ROC curves derived from analyses of the natural set as the test set. The model performs with a high degree of accuracy (AUC = 0.8925 ± 0.0056, 83% correctly predicted) and performs similarly to predictions made by training on the full natural set alone (the P-value for a test of equality of the two models was 0.56). This comparison did not take into account the different thresholds used for determining disease-causing status, where the percent correctly predicted on the full dataset alone is 81% versus 83%.
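An empirical area under the ROC curve, of the kind reported here, can be computed directly from predicted scores without fitting a curve: it equals the fraction of (disease-causing, non-disease-causing) pairs in which the disease-causing variant receives the higher score, with ties counted as one half. The sketch below illustrates that calculation only; the function name is hypothetical, and the standard errors and the test of equality between curves are not reproduced.

```python
def empirical_auc(labels, scores):
    """Empirical AUC via pairwise comparison of positive and negative scores.

    labels: 1 = disease causing, 0 = not known to cause disease.
    scores: classifier outputs where larger means more likely disease causing.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute an AUC")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of variants, which is adequate for a sketch; a rank-based formulation gives the same value more efficiently on large datasets.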
To demonstrate that our results using the natural set as the test set did not result from overtraining, we performed 10-fold cross-validation (Table 2). As in the case where the full natural set was used for training and testing, the model performs with a high degree of accuracy (81% correctly predicted; AUC = 0.8709 ± 0.0067).
To confirm that the method is learning to differentiate between disease-causing and non-disease-causing nsSNPs, we tested the natural set trained method on the Swiss-Prot dataset (Tables 2 and 3), held as the best dataset for deleterious SNP prediction benchmarking (Care et al., 2007). The results confirm that the model differentiates between disease and non-disease causing nsSNPs (77% correctly predicted; AUC = 0.8714 ± 0.0108).
To demonstrate the general applicability of the model, we also applied it to the experimental set, which contains no nsSNPs found within the natural set. Figure 1 also depicts ROC curves derived from analyses involving the experimental set as the test set. In this case, our method (77% correctly predicted) clearly outperforms an SVM in which predictions for the kinase catalytic domain are not made separately (73% correctly predicted; P-value <0.0001).
To visually present the separation of disease from non-disease causing nsSNPs, we generated a tree diagram based upon the distances of the SNP characteristics used to discriminate disease from non-disease associated nsSNPs (Fig. 2). Distances were calculated as follows: for categorical characteristics, a distance of 0 was assigned for a match or 1 for a mismatch, whereas for continuous variables the distance was taken as the absolute value of the difference between two characteristics divided by the range of the values these characteristics can take on, thus leading to a measure that varies between 0 and 1. These distances were then either unweighted or weighted by the SVM coefficients to generate two different trees. Graphical tree representations were generated by the Unweighted Pair Group with Arithmetic Mean method implemented in MEGA 3.1 (Kumar et al., 2004). While both methods show separation of disease from non-disease causing SNPs, weighting by SVM coefficients results in closer clustering of the characteristics of the disease and non-disease causing SNPs with each other.
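The per-characteristic distance described above can be sketched as follows. Two points are assumptions of this example rather than statements from the text: the per-characteristic distances are combined by summation, and the optional weights are applied as simple multipliers. The function names and feature names are also hypothetical.

```python
def feature_distance(a, b, value_range=None):
    """Distance between two values of one characteristic, between 0 and 1.

    Categorical characteristics (value_range is None): 0 for a match,
    1 for a mismatch. Continuous characteristics: absolute difference
    divided by the range of values the characteristic can take.
    """
    if value_range is None:
        return 0.0 if a == b else 1.0
    low, high = value_range
    if high == low:
        return 0.0
    return abs(a - b) / (high - low)


def snp_distance(snp_a, snp_b, ranges, weights=None):
    """Sum of (optionally weighted) per-characteristic distances.

    ranges:  maps continuous characteristic names to (min, max) tuples;
             names absent from it are treated as categorical.
    weights: optional mapping of characteristic name to a multiplier,
             e.g. the magnitude of an SVM coefficient (assumed usage).
    """
    total = 0.0
    for name in snp_a:
        weight = 1.0 if weights is None else weights[name]
        total += weight * feature_distance(
            snp_a[name], snp_b[name], ranges.get(name)
        )
    return total
```

A matrix of such pairwise distances is the kind of input a UPGMA tree-building routine expects.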
3.3 Comparison to previous methods
The accuracy of our SVM-based prediction scheme and model on the natural, experimental and Swiss-Prot sets was compared to three previous prediction schemes: the SubPSEC method (used in our model); the SIFT method, which is regarded as one of the best methods for functional mutation prediction; and the PMut method, which cites a level of accuracy similar to ours based on a completely different test set. Figure 3, as well as Tables 2 and 3, demonstrates that our SVM-based model and prediction scheme outperforms the SubPSEC, SIFT and PMut methods on all datasets (p < 0.0001 for all comparisons).
Additionally, comparison was made to SNPs3D, a classifier capable of performing predictions based upon solved crystal structures. When comparing the performance of our model versus SNPs3D on a subset of nsSNPs where structural information is available, our model (76% correctly predicted) outperforms SNPs3D (60% correctly predicted) (Table 2). Importantly, 32% of DCs incorrectly classified by SNPs3D as neutral variants were correctly classified by our method.
3.4 Contribution of the characteristics
The different SNP characteristics used as predictors of disease versus non-disease-associated SNPs were evaluated for their individual contributions to the predictions by either removing one set of characteristics from the larger total set of characteristics (Table 4; upper diagonal), or performing predictions with only one set of characteristics (Fig. 4, Table 4; lower diagonal). The characteristics were divided into the following categories: conservation, which comprises the SubPSEC score; amino acid information, which comprises the wild-type and SNP amino acid identity, changes in the five amino acid metrics, and changes in hydropathy, water/octanol partition energy, hydrophobicity, polarity, charge and volume; overall structural similarity, which is the group association; and general structural information, which comprises secondary structure, solvent accessibility, domain residence and flexibility predictions.
Using any single characteristic is significantly less accurate than combining all the different characteristics (Fig. 4, Table 4; p < 0.0001 for all comparisons), and removal of any single characteristic also causes a significant decrease in model accuracy (Table 4). This demonstrates that each characteristic makes a significant positive contribution to the overall performance of the model, though predictability is still obtained with a subset of the parameters. Thus, any predictor of disease that relies upon a single characteristic will fall short of the accuracy obtainable by a combination of characteristics.
3.5 Implementation
In contrast to most methods, which predict 25–30% of human nsSNPs to detrimentally affect protein function, we find that 12% of kinase nsSNPs are predicted to detrimentally affect kinase protein function. Of the top three ranked dbSNP SNPs predicted to cause disease, LRRK2(G2019S) lies in the DFG motif (DYG for LRRK2) and is associated with Parkinson's disease, EGFR(G719) lies in the G-X-G-XX-G motif and has been identified as a mutation in non-small cell lung cancer responsive to gefitinib (Lynch et al., 2004), and PKCh(D487Y) also lies in the DFG motif. Another SNP, ATM(F2827C), which was mistakenly labeled as a non-disease-associated SNP in our dataset, was also detected with a probability of causing disease of 83%. A number of SNPs not conclusively implicated in disease, but for which weak disease associations have been observed, such as rs2234909 in FGFR3 and rs4647902 in FGFR1, both of which have been associated with craniosynostosis, are also predicted to be disease causing. The nsSNPs within the human protein kinase gene family that are not currently known to contribute to a specific disease, but that our analysis predicts are likely to contribute to human disease, are presented in rank order in Supplementary Table 1.
| 4 DISCUSSION |
|---|
The improved performance of our prediction scheme over other methods presented herein likely reflects biases in the distribution of disease-causing mutations within the protein kinase gene family. These biases, at the level of group, domain and amino acid have been detailed previously by Torkamani and Schork (2007). It is quite likely that the weight of characteristics used in determining the functional status of a mutation differs from gene family to gene family. A simple example is mutations in DNA-binding proteins, where mutations of positively charged residues are likely to disrupt binding to negatively charged DNA, and thus be more likely to cause disease than mutations of positively charged residues in other gene families (La et al., 2004). Additionally, when a prediction method is trained and applied to a particular gene family, additional characteristics, such as large-scale structural similarities determined by group or domain membership, can be exploited to improve accuracy. These statistical signals would more than likely be dampened to the level of random noise when the prediction method is trained and applied to the whole genome. This loss of information is especially significant considering that group, as a predictor of large-scale structural similarity, is among the most informative characteristics for functional classification (Table 4). The lack of correlation between experimentally induced mutations within kinase groups and their occurrence in disease, as detailed in Torkamani and Schork (2007), demonstrates that this observation is not an artifact of the training data but reflects a real increased propensity for disease-causing mutations in specific kinase groups. Additionally, the close phylogenetic relationship between RGC, TK and TKL kinases, kinase groups strongly associated with disease, further suggests a relationship between their overall structural or evolutionary similarities and an increased propensity to cause disease. 
Though different protein families may require a different set of informative attributes to perform predictions, our results indicate that expert knowledge can be leveraged to greatly improve prediction accuracy of deleterious protein polymorphisms. The specific predictors used herein may not apply directly to other protein families, and intensive analysis of the unique determinants of disease in each individual protein family will be required to generate enhanced prediction accuracy. Our finding that conservation information alone is not sufficient to differentiate nsSNPs likely to cause disease from those that are not likely to cause disease is consistent with the results of the recent survey of functional genomic elements in the genome by The ENCODE Project Consortium (The ENCODE Project Consortium, 2007). The ENCODE researchers identified a number of regions of the genome that exhibited clear biological activities but were not conserved across species, suggesting a role for lineage-specific variations in mediating particular biological functions. On the other hand, our results suggest that phylogeny, domain or other attributes relevant to overall structural features are powerful predictors for disease-causing status.
In the particular case of human protein kinases, disease-causing mutations tend to be clustered within the highly conserved catalytic core (Hanks and Hunter, 1995). Within this catalytic core, the probability of a disease-causing mutation occurring at a specific amino acid differs from the probability observed on a whole genome scale (Torkamani and Schork, 2007). Thus, in addition to training the method on kinase proteins in general, our method performs separate predictions for mutations occurring outside of the conserved catalytic core and for those occurring within it, further exploiting biases in the distribution of predictive characteristics at the domain level. When predictions are performed using mutations occurring within the kinase catalytic core, an additional structural characteristic, the subdomain of the kinase catalytic core, is also included. Ultimately, we have found that disease-causing mutations tend to cluster within the C-terminal lobe rather than the N-terminal lobe (manuscript in preparation). Similar biases have been observed within structural features of other gene families as well (Lee et al., 2003).
An additional SNP structural characteristic not used previously in other prediction methods, but ranking as one of the more powerful predictors in our model, is the protein flexibility measure, Wiggle (Gu et al., 2006). The importance of this predictor is described in respect to its prediction performance within the kinase catalytic core. The Wiggle measure tends to give large negative scores (inflexible) to residues towards the center of helices. The centers of these helices tend to be enriched with disease-causing mutations, while the edges of the helices tend to be enriched with neutral mutations (data not shown). Additionally, conserved residues and motifs tend to occupy central positions within these helices adding extra emphasis upon these residues as highly conserved and structurally inflexible. The score performs well on mutations occurring outside of the catalytic core as well, suggesting that disease-causing mutations tend to occur at structurally inflexible locations in general, and may be particularly enriched within the centers of secondary structures.
The combined contributions of all the characteristics taken as predictors described above lead to a prediction accuracy that significantly exceeds those of the SubPSEC, SIFT, PMut or SNPs3D methods on both the natural and experimental datasets (Fig. 3, Tables 2 and 3). While methods based on conservation, like SubPSEC and SIFT, are excellent for whole genome predictions, experimentalists interested in a large number of nsSNPs in a particular gene family, e.g. nsSNPs in kinases implicated in cancer samples, can benefit from improved accuracy by including additional predictors designed to target unique determinants of disease-causing status within the gene family of interest. Some of these predictors, such as group membership, derive from real biological tendencies towards disease-causing status; thus, while our method outperforms other methods on the experimental dataset, it performs less well on the experimental dataset than on the natural dataset. Our method also compares favorably to the PMut method, which uses a combination of conservation and structural attributes, and to SNPs3D, which is able to perform predictions based upon solved crystal structures. It is likely that the datasets that PMut and SNPs3D were trained on contained disease associated and neutral mutations whose characteristics vary wildly from those in our kinase mutations dataset. This further demonstrates that non-conservation predictors of disease association vary significantly from protein family to protein family and suggests that caution should be used in applying these methods as general predictors (Care et al., 2007; The ENCODE Project Consortium, 2007). Therefore, while PMut and SNPs3D exhibit excellent performance on the datasets they were trained with and should perform well on protein families represented in their training sets, they do not appear to be well suited for predictions within the protein kinase gene family.
To our knowledge, all available methods for disease SNP prediction, except for PMut, demonstrate <75% correct predictions and estimate that 25–30% of mutations found in dbSNP are deleterious. Our studies indicate, at least for the kinase gene family, that this figure is closer to 10% and likely even lower, since many of the SNPs presented in Supplementary Table 1 are rare, have not been validated or are not strongly predicted to be disease causing. It has been estimated that a limited number of disease susceptibility genes with common variants can explain a major proportion of common diseases in the population (Yang et al., 2005); thus, a much lower proportion of deleterious common SNPs than currently estimated is in agreement with this estimate.
We believe that the predictions presented herein represent a highly accurate analysis of nsSNPs within the human kinase gene family, and present an excellent starting point for the elucidation of common SNPs within this family that may contribute to common diseases. The importance of human protein kinases to nearly every biological process suggests that this gene family is likely to contribute significantly to common disease. It is also likely that our methods will be applicable to characterization of the properties of precancerous somatic mutations, since both inherited disease susceptibility and the DNA changes in somatic cells associated with cancers can result from altered protein function.
An important caveat in not only our analyses but all analyses seeking to differentiate disease causing from non-disease causing polymorphisms is the delineation of the control variations that do not cause disease. It is very likely that our chosen control variations include amongst them variations that do, in fact, contribute to disease, although the role of these variations in mediating disease susceptibility has not been worked out. Although likely true, this fact does not invalidate our analyses for at least two reasons: first, the inclusion of actual disease-causing variations in our control set should, if anything, bias our results towards the null hypothesis of no differences between our defined disease and non-disease causing variations on the basis of conservation and structural characteristics of those variations. Thus, the fact that we could distinguish disease from non-disease causing variations corroborates our use of the variations we chose as controls. Second, if disease-causing variations do exist amongst our control variations, then their influence on disease must be subtle if it has not been revealed yet. As such, our analyses may be best considered as providing results more relevant to the prediction of overt, Mendelian, largely monogenic diseases influenced by highly penetrant variations than to polygenic, multifactorial diseases. As the genetic bases of polygenic, multifactorial diseases are characterized, a reapplication of our ideas and methods would be in order.
| ACKNOWLEDGEMENTS |
|---|
The authors would like to thank Jenny Gu for invaluable discussions regarding protein flexibility and Dr Susan Taylor for advice and encouragement. N.J.S. and his laboratory are supported in part by the following research grants: The National Heart Lung and Blood Institute Family Blood Pressure Program (FBPP; U01 HL064777-06); The National Institute on Aging Longevity Consortium (U19 AG023122-01); The National Institute of Mental Health Consortium on the Genetics of Schizophrenia (COGS; 5 R01 HLMH065571-02); National Institute of Health (R01s: HL074730-02 and HL070137-01) and Scripps Genomic Medicine. A.T. was supported in part by the UCSD Genetics Training Grant for the Biomedical Sciences.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on June 20, 2007; revised on August 2, 2007; accepted on August 19, 2007
| REFERENCES |
|---|
Atchley WR, et al. Solving the protein sequence metric problem. Proc. Natl Acad. Sci. USA (2005) 102:6395–6400.
Becker KG. The common variants/multiple disease hypothesis of common complex genetic disorders. Med. Hypotheses (2004) 62:309–317.
Care MA, et al. Deleterious SNP prediction: be mindful of your training data! Bioinformatics (2007) 23:664–672.
Cargill M, et al. Characterization of single-nucleotide polymorphisms in coding regions of the human genes. Nat. Genet. (1999) 22:231–238.
Collins FS, Guttmacher AE. Genetics moves into the medical mainstream. JAMA (2001) 294:1399–1402.
Couzin J, Kaiser J. Genome-wide association. Closing the net on common disease genes. Science (2007) 316:820–822.
Ferrer-Costa C, et al. PMUT: a web-based tool for the annotation of pathological mutations on proteins. Bioinformatics (2005) 21:3176–3178.
Gu J, et al. Wiggle – predicting functionally flexible regions from primary sequence. PLoS Comput. Biol. (2006) 2:e90.[CrossRef][Medline]
Halushka MK, et al. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. (1999) 22:239–247.[CrossRef][Web of Science][Medline]
Hanks SK, Hunter T. Protein kinases 6. The eukaryotic protein kinase superfamily: kinase (catalytic domain structure and classification). FASEB J. (1995) 9:576–596.[Abstract]
Harpaz Y, et al. Volume changes on protein folding. Structure (1994) 2:641–649.[Medline]
Hopkins AL, Groom CR. The druggable genome. Nat. Rev. Drug Discov. (2002) 1:727–730.[CrossRef][Web of Science][Medline]
Hunter T. Croonian lecture: the phosphorylation of proteins on tyrosine – its role in cell growth and disease. Philos. Trans. R. Soc. Lond. B Biol. Sci. (1998) 353:583–605.
Jiang R, et al. Sequence-based prioritization of nonsynonymous single-nucleotide polymorphisms for the study of disease mutations. Am. J. Hum. Genet. (2007) 81:346–360.
Kumar S, et al. MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief. Bioinformatics (2004) 5:150–163.
Kyte J, Doolittle R. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. (1982) 157:105–132.
La P, et al. Direct binding of DNA by tumor suppressor menin. J. Biol. Chem. (2004) 279:49045–49054.
Lander ES, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
Lasko TA, et al. The use of receiver operating characteristic curves in biomedical informatics. J. Biomed. Inform. (2005) 38:404–415.
Lee A, et al. Distribution analysis of nonsynonymous polymorphisms within the G-protein-coupled receptor gene family. Genomics (2003) 81:245–248.
Livingston RJ, et al. Pattern of sequence variation across 213 environmental response genes. Genome Res. (2004) 14:1821–1831.
Lynch TJ, et al. Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N. Engl. J. Med. (2004) 350:2129–2139.
Montgomerie S, et al. Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics (2006) 7:301.
Ng PC, Henikoff S. Accounting for human polymorphisms predicted to affect protein function. Genome Res. (2002) 12:436–446.
Ng PC, Henikoff S. Predicting the effects of amino acid substitutions on protein function. Annu. Rev. Genomics Hum. Genet. (2006) 7:61–80.
Ohashi J, Tokunaga K. The expected power of genome-wide linkage disequilibrium testing using single nucleotide polymorphism markers for detecting a low-frequency disease variant. Ann. Hum. Genet. (2002) 66:297–306.
Petrova NV, Wu CH. Prediction of catalytic residues using support vector machine with selected protein sequence and structural properties. BMC Bioinformatics (2006) 7:312.
Pritchard JK. Are rare variants responsible for susceptibility to common diseases? Am. J. Hum. Genet. (2001) 69:124–137.
Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant ... or not? Hum. Mol. Genet. (2002) 11:2417–2423.
Reich DE, Lander ES. On the allelic spectrum of human disease. Trends Genet. (2001) 17:502–510.
Rost B, et al. The PredictProtein server. Nucleic Acids Res. (2003) 32:W321–W326.
Sachidanandam R, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature (2001) 409:928–933.
The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (2007) 447:799–816.
The International HapMap Consortium. The international HapMap project. Nature (2003) 426:789–796.
Thomas PD, Kejariwal A. Coding single-nucleotide polymorphisms associated with complex vs. Mendelian disease: evolutionary evidence for differences in molecular effects. Proc. Natl Acad. Sci. USA (2004) 101:15398–15403.
Thomas PD, et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. (2003) 13:2129–2141.
Torkamani A, Schork NJ. Distribution analysis of nonsynonymous polymorphisms within the human kinase gene family. Genomics (2007) 90:49–58.
Venter JC, et al. The sequence of the human genome. Science (2001) 291:1304–1351.
White SH, Wimley WC. Membrane protein folding and stability: physical principles. Ann. Rev. Biophys. Biomol. Struct. (1999) 28:319–365.
Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques (2005) 2nd edn. San Francisco: Morgan Kaufmann.
Yang Q, et al. How many genes underlie the occurrence of common complex diseases in the population? Int. J. Epidemiol. (2005) 34:1129–1137.
Yue P, et al. SNPs3D: candidate gene and SNP selection for association studies. BMC Bioinformatics (2006) 7:166.