Bioinformatics Advance Access originally published online on February 19, 2008
Bioinformatics 2008 24(7):901-907; doi:10.1093/bioinformatics/btn055
ParCrys: a Parzen window density estimation approach to protein crystallization propensity prediction
¹School of Life Sciences Research, University of Dundee, Dow Street, Dundee, DD1 5EH and ²Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
The ability to rank proteins by their likely success in crystallization is useful in current Structural Biology efforts and in particular in high-throughput Structural Genomics initiatives. We present ParCrys, a Parzen Window approach to estimate a protein's propensity to produce diffraction-quality crystals. The Protein Data Bank (PDB) provided training data whilst the databases TargetDB and PepcDB were used to define feature selection data as well as test data independent of feature selection and training. ParCrys outperforms the OB-Score, SECRET and CRYSTALP on the data examined, with accuracy and Matthews correlation coefficient values of 79.1% and 0.582, respectively (74.0% and 0.227, respectively, on data with a real-world ratio of positive:negative examples). ParCrys predictions and associated data are available from www.compbio.dundee.ac.uk/parcrys.
Contact: geoff{at}compbio.dundee.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Knowledge of protein three-dimensional structure is useful for addressing and formulating key scientific questions, as well as stimulating technological innovation (Hol, 2000; Todd et al., 2005). For example, structure-based drug design has informed the process of developing treatments for human diseases such as HIV (Poppe et al., 1997), influenza (von Itzstein et al., 1993), asthma (Schuttelkopf et al., 2006) and cancer (Davies et al., 2002). Structural insights have revealed molecular mechanisms of cellular function (Shapiro and Harris, 2000; Zarembinski et al., 1998), pathogenesis (Diprose et al., 2001; Singh et al., 2006) and genetic disease (Yard et al., 2007). In addition, the structures of proteins from thermophilic organisms suggest a promising basis for the development of industrial biocatalysts (Hol, 2000).
Structural genomics is a global enterprise that aspires to produce a large scale mapping of protein structure space (Burley et al., 1999; Chandonia and Brenner, 2006; Hol, 2000, http://www.isgo.org/home/index.php; Service, 2002; Stevens et al., 2001). However, it is common for only 5% of selected protein targets to result in a high-resolution protein model (http://www.ebi.ac.uk/msd-srv/msdtarget; Service, 2002, 2005; Terwillinger, 2000). Various strategies have been proposed to increase the return from structural genomics efforts, including obtaining one representative structure per protein family and working with multiple orthologues (Brenner, 2000; Chandonia and Brenner, 2005; Hiu and Edwards, 2003; Liu et al., 2004; Savchenko et al., 2003). These approaches imply methods to rank proteins within functionally defined groups according to their likely success in the structural genomics pipeline. As crystallization is a significant bottleneck in structural biology, the relative ease of obtaining diffraction-quality crystals is a key consideration (Biertumpfel et al., 2005; Chayen, 2004; Puesy et al., 2005). For example, to date, the Structural Proteomics in Europe (SPINE) consortium reports 58% success for the progression from Expressed to Soluble and 53% success for the progression from Purified to Crystallized (http://www.ebi.ac.uk/msd-srv/msdtarget). Investigations into the relationship between protein sequence properties and progression through the structural genomics pipeline have suggested features relevant to crystallization propensity prediction (Canaves et al., 2004; Goh et al., 2004). Indeed, there are already methods that aim to predict crystallization propensity. These include SECRET (Smialowski et al., 2006), the OB-Score (Overton and Barton, 2006) and more recently, CRYSTALP (Chen et al., 2007). A limitation of both SECRET and CRYSTALP is that they only accept as input sequences of length between 46 and 200 amino acids. 
While the OB-Score has no such limitation, it only exploits two predictive features to estimate crystallization propensity: isoelectric point and hydrophobicity. In this article we describe a Parzen Window density estimator that incorporates additional features not considered by the OB-Score, whilst permitting prediction over the full range of sequence length. The Parzen Window technique affords a non-parametric statistical approach to estimate a density function from a given sample dataset (Duda and Hart, 1973; Parzen, 1962). We took sequences from the Protein Data Bank (PDB) (Berman et al., 2007) culled at 25% identity, with R factor ≤ 0.3 and resolution ≤ 3.0 Å, downloaded from PISCES (Wang and Dunbrack, 2003), as the model for the Parzen Window density estimate. The propensity of a test protein to progress to the stage of diffraction-quality crystals was estimated according to the density function derived from the PDB sequences; no negative examples are required for this. In independent tests, we found that this new method (ParCrys) outperforms the OB-Score, SECRET and CRYSTALP.
| 2 METHODS |
|---|
2.1 Datasets
Selection of appropriate training and test datasets is critical in the development of a predictor. In this study we developed 13 primary datasets, which were filtered by stringent sequence similarity thresholds in order to minimize bias. The Parzen Window density estimate was always based on the PDB data. Feature selection was conducted using TargetDB datasets. Additional datasets for method evaluation were created to be independent of the PDB and feature selection data. The datasets are summarized in Table 1. Details of dataset creation are as follows.
2.1.1 Training and feature selection datasets
The Parzen Window density estimate was derived from a dataset of 3958 PDB (Berman et al., 2007) structures with a resolution of at least 3.0 Å, maximum R-factor of 0.3 and filtered at 25% sequence identity (PDB3958), obtained from the PISCES server (Wang and Dunbrack, 2003) on August 4, 2006.
For feature selection, positive and negative datasets DIF728 and WS6025, respectively, were taken from TargetDB as described previously (Overton and Barton, 2006). Briefly, DIF728 comprises 728 sequences that gave diffraction-quality crystals, and WS6025 comprises 6025 sequences where work had been stopped before crystals were obtained. The negative feature selection dataset WS728 of the same size as DIF728 was created by randomly selecting 728 sequences from WS6025. The WS728 and DIF728 pair is referred to as FEAT. The WS6025 and DIF728 pair is referred to as FEAT-W, which was used only to define a ParCrys threshold for predicting over real-world distributions.
2.1.2 Independent test datasets
Datasets were constructed to provide a test independent of the training and feature selection data. The test datasets were made to be both non-redundant and not to show similarity to each other or to the FEAT and PDB3958 sets, according to stringent thresholds. This was achieved by filtering with PSIBLAST (Altschul et al., 1997), using HMMER to cluster against Pfam (Eddy, 1998; Finn et al., 2006) and finally Z-score clustering with AMPS (Barton and Sternberg, 1987) as explained below.
In order to provide a search database for PSIBLAST, the training and feature selection datasets (PDB3958, DIF728 and WS6025) were combined with UniRef50 (Apweiler et al., 2004), low-complexity filtered using SEG (Wan and Wootton, 2000) and helixfilt (D. Jones, personal communication) to produce the database DEVEL_U50. The positive test dataset, representing proteins that readily crystallize, was based on TargetDB (Chen et al., 2004) sequences downloaded on April 17, 2007 and with a date stamp later than April 2006; the date filtering aims to avoid sequences from the DIF728 and WS6025 datasets (both downloaded on February 2, 2006). Sequences annotated with Diffraction-quality Crystals and not annotated with In PDB were taken and searched against DEVEL_U50 using PSIBLAST (five iterations, default settings). Sequences were eliminated if they matched a sequence from the PDB3958, DIF728 or WS6025 datasets in any PSIBLAST iteration according to 'similar structure' thresholds (Rost, 1999); the 128 sequences that did not match were termed TDB_FILT. As a redundancy filtering step, TDB_FILT was searched against Pfam version 21.0 using HMMER (default settings). The top-scoring TDB_FILT sequence for each Pfam family matched was taken, as well as the 16 sequences without a Pfam match, leaving a total of 78 sequences (TDB_FAM). For further redundancy filtering, TDB_FAM was clustered by AMPS (Barton and Sternberg, 1987) with a Z-score threshold of 5. This left 72 sequences to form the positive test dataset (T_POS72).
The negative test dataset was based on PepcDB trial sequences, parsed from the XML (http://pepcdb.pdb.org). These sequences ostensibly represent the actual construct sequence used. Sequences were taken if annotated as Status work stopped and with Status History including Cloned but not including an indicator of crystallization (e.g. Crystals). Sequences were excluded if annotated as test target or if the stopDetails included duplicate target found; DNA sequences were also filtered out, leaving 12 070 sequences. These were processed by a protocol similar to that applied in generating T_POS72, resulting in 614 sequences (T_NEG614). In order to safeguard against overlap between the positive and negative test datasets, it was necessary to filter T_NEG614 against T_POS72. This step aims to eliminate sequences where work was stopped due to target deselection (Chandonia et al., 2006), although it may also concomitantly remove examples of work stopped data that are relatively similar to the diffraction-quality crystals data. However, only 4 work stopped sequences were eliminated by this step (see below). To provide a database for this purpose, T_POS72 and UniRef50 were combined, and processed with SEG and helixfilt as for DEVEL_U50 to generate TPOS_U50. T_NEG614 was searched against TPOS_U50 using PSIBLAST (five iterations, default settings) and matches were determined according to similar structure thresholds (Rost, 1999). The 4 sequences from T_NEG614 that matched a T_POS72 sequence were eliminated to produce T_NEG610. In order to derive a negative test set of the same size as the positive test set, 72 sequences were randomly selected from T_NEG610 (T_NEG72). The T_NEG72 and T_POS72 pair is termed TEST. For testing real-world distributions, the T_POS72 and T_NEG610 pair is termed TEST-W.
Additional datasets were required to allow comparison with the SECRET and CRYSTALP methods, because these methods only give predictions for sequences with length between 46 and 200 (Chen et al., 2007; Smialowski et al., 2006). T_POS72 and T_NEG610, respectively, have 43 (T_POS43) and 197 (T_NEG197) sequences with length between 46 and 200. In order to derive a negative test set of the same size as T_POS43, 43 sequences were randomly selected from T_NEG197 (T_NEG43). The T_POS43 and T_NEG43 pair is termed TEST-RL.
Also, WS6025 and DIF728 were used as a basis to provide length-restricted datasets independent of the TEST-RL dataset. DIF728 and WS6025, respectively, have 246 (DIF246) and 3103 (WS3103) sequences with length between 46 and 200. In order to derive a negative test set of the same size as DIF246, 246 sequences were randomly selected from WS3103 (WS246). The DIF246 and WS246 pair is termed FEAT-RL.
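The length restriction and the size-matched random selection of negatives described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the bounds 46 and 200 are treated as inclusive (an assumption, since the text says only "between"), and the random seed is arbitrary rather than anything used in the study.

```python
import random


def length_restricted(seqs, min_len=46, max_len=200):
    """Keep sequences whose length lies in [min_len, max_len] (inclusive bounds assumed)."""
    return [s for s in seqs if min_len <= len(s) <= max_len]


def balanced_negative(positives, negatives, seed=0):
    """Randomly draw, without replacement, as many negatives as there are positives."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    return rng.sample(negatives, len(positives))
```

Applied to DIF728 and WS6025, a filter of this kind followed by a size-matched draw would yield a pair analogous to DIF246 and WS246.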
Additional test datasets were developed from the PepcDB trial sequences. These datasets allow testing on actual construct sequences for both the positive and negative test data, as well as allowing examination of predictive power for the crystallization of soluble proteins. For the positive set, sequences were taken if Status History included Diffraction-quality crystals but not including In PDB and with length
30. These 569 sequences were processed by a protocol similar to that applied in generating T_POS72, resulting in 30 sequences, of which 16 had length between 46 and 200 (T_POS16). The negative test set was based on T_NEG614 identifiers, with trialSequences taken if work had been stopped after status soluble, but before status crystallized; thus 28 sequences were identified, of which 12 had length between 46 and 200 (T_NEG12). To derive a positive test set of the same size as the negative test set, 12 sequences were randomly selected from T_POS16 (T_POS12). The T_POS12 and T_NEG12 pair is termed TEST-SOL. For TEST-SOL no filter was applied to remove potential overlap between the positive and negative data.
2.2 Feature calculation
For each protein sequence, the number of low-complexity regions identified by SEG (Wan and Wootton, 2000), hydrophobicity, isoelectric point (pI) and the standard amino acid frequencies were calculated. Hydrophobicity was calculated as the sum of Goldman-Engelman-Steitz (GES) hydrophobicity values (Engelman et al., 1986) for all residues, divided by the sequence length. Isoelectric point was calculated using the BioPerl (Stajich et al., 2004) pI calculator module with EMBOSS-defined pKa values (Rice et al., 2000).
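Two of these features, the amino acid frequencies and the length-normalized hydrophobicity, can be sketched as follows. The function names are hypothetical, and the hydrophobicity scale is passed in as a parameter because the GES values themselves are not reproduced here; the scale used in the usage note is a toy placeholder, not the GES scale. The pI calculation is omitted because it depends on the pKa set.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(seq):
    """Fraction of the sequence made up by each of the 20 standard amino acids."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def mean_hydrophobicity(seq, scale):
    """Sum of per-residue scale values divided by the sequence length."""
    return sum(scale[aa] for aa in seq) / len(seq)
```

For example, with a toy scale that assigns 1.0 to F, -1.0 to R and 0.0 to everything else, `mean_hydrophobicity("FFRS", scale)` evaluates to 0.25 and `aa_frequencies("FFRS")["F"]` to 0.5.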
2.3 Parzen window probability density function estimate
The Parzen Window density estimator technique (Duda and Hart, 1973; Parzen, 1962) aims to define an unknown probability density p(x) from a set of observations; in this case the observations are provided by the PDB3958 dataset. The Parzen Window density estimator was implemented in C++ using the GNU Scientific Library (Galassi et al., 2006). Let us suppose that the values of the considered observations are drawn from p(x) in a D-dimensional space. The probability P that a vector x belongs to a specific region R can be expressed as:
|
| (1) |
|
| (2) |
|
| (3) |
|
| (4) |
Supposing the region R to be a D-dimensional hypercube centred on the point x, with length of the hypercube sides given by h, V is defined:
|
| (5) |
|
| (6) |
|
| (7) |
|
| (8) |
|
| (9) |
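For orientation, the textbook form of the estimator can be sketched in Python. With a hypercube window the density at x is the number of samples falling inside the cube of side h centred on x, divided by N times the volume h^D; replacing the hard window with a smooth kernel gives a smoother estimate. The isotropic Gaussian kernel used as the default below is an assumption made for illustration, the function name is hypothetical, and this is not the authors' C++ implementation.

```python
import math


def parzen_density(x, samples, h, kernel="gaussian"):
    """Parzen Window estimate of p(x) from a list of D-dimensional samples."""
    d = len(x)
    total = 0.0
    for s in samples:
        # Scaled offset between the query point and this sample.
        u = [(xi - si) / h for xi, si in zip(x, s)]
        if kernel == "hypercube":
            # Counts 1 if the sample lies inside the cube of side h centred on x.
            total += 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0
        else:
            # Isotropic Gaussian kernel of width h (assumed for illustration).
            sq = sum(uj * uj for uj in u)
            total += math.exp(-0.5 * sq) / (2.0 * math.pi) ** (d / 2.0)
    # Normalize by the number of samples and the window volume h^D.
    return total / (len(samples) * h ** d)
```

Because of the 1/h^D factor, small window widths in several dimensions produce very large density values, so scores from such an estimator are only meaningful relative to one another or to a threshold.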
The optimal value of h was first estimated using PDB3958, with 2000 search increments between values hmax and hmin of 0.8 and 0.0001, respectively. The Area under the Receiver-Operator Characteristic (AROC) curve was determined by 10-fold crossvalidation, identifying a value of h equal to 0.0157 with the best PDB3958 AROC. Five additional values of h around this initial h-value were examined to maximize the AROC for the Parzen Window density estimate derived from PDB3958 and predicting over all of the FEAT dataset, finding a final value of h equal to 0.040. To facilitate separation of test data into the classes amenable to crystallization and recalcitrant to crystallization, a probability density cutoff value was determined by optimizing accuracy [Equation (10)] over the FEAT predictions. This cutoff was applied to give non-optimized accuracy and Matthews correlation coefficient [Equation (11)] for predictions over TEST, TEST-RL and TEST-SOL.
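$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (10)$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (11)$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.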
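Accuracy and the Matthews correlation coefficient can be computed from the four confusion-matrix counts as in the minimal sketch below, which assumes the standard definitions of both measures. Returning 0 when the MCC denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero (convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```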
| 3 RESULTS AND DISCUSSION |
|---|
3.1 Selection of features
In order to rank the amino acids by their effectiveness in prediction, the FEAT dataset was analysed using three features: isoelectric point (pI), hydrophobicity and each of the 20 amino-acid frequencies in turn (Table S1, supplementary data). The AROC was calculated for each feature combination. Random predictions will give AROC values of 0.5 whilst a perfect predictor would give an AROC value of 1. From the ranking given in Table S1, amino acid frequencies were successively included as features (Table 2). ParCrys outperforms the OB-Score when using only pI and hydrophobicity as features. This gain is likely due to ParCrys providing a smoother and non-parametric representation of the data, when compared with the OB-Score. The addition of S and C frequencies provides a similar performance gain, with additional features providing small improvements up to the best performing combination of features: pI, hydrophobicity, S, C, G, F, Y, M. A similar process was repeated with FEAT but for pI, hydrophobicity and the number of low-complexity regions predicted by SEG (Wan and Wootton, 2000). However, the AROC values obtained were worse than those obtained without including the SEG information (see Supplementary Material). SEG has previously been used in structural genomics target selection to eliminate targets that are '... likely to be intractable for HT study' (Chandonia et al., 2006), so this pre-filtering of the structural genomics targets may be the underlying reason that the SEG predictions are not found to be informative with the FEAT dataset.
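The AROC used to compare feature combinations can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal pairwise sketch follows; the function name is hypothetical and counting ties as one half is an assumed convention.

```python
def aroc(pos_scores, neg_scores):
    """Area under the ROC curve, computed over all positive/negative score pairs."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half (assumed convention)
    return wins / (len(pos_scores) * len(neg_scores))
```

A scoring scheme that separates the classes perfectly gives 1.0 and one that carries no information gives about 0.5, matching the reference values quoted above. The pairwise loop is quadratic in the dataset size, which is acceptable for sets of a few thousand sequences.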
There is substantial overlap between the features we selected based on FEAT and features found to be informative by Goh et al. (2004) as well as for the SECRET method (Smialowski et al., 2006). Serine frequency and hydrophobicity were found to be important by all three studies. In addition to amino-acid frequencies, the SECRET method involved reduced alphabet representations. The SECRET-selected features (Smialowski et al., 2006) include contributions from all the amino acid features used in ParCrys. The CRYSTALP features (Chen et al., 2007) were based on co-location of amino acid pairs. Features selected for CRYSTALP also include contributions from all the amino acids taken as ParCrys features. Goh et al. identified highest ranked features for various stages of the structural genomics pipeline; all the ParCrys-selected features contribute to this list except for the amino acid frequencies for F and Y. Unlike the Goh et al., CRYSTALP and SECRET approaches, we do not find any charged amino acid frequencies to be informative. This is probably because our amino acid features were examined in combination with pI and hydrophobicity; pI probably serves to represent much of the information present in the individual frequencies of charged amino acids. The sparseness of data increases with the number of selected features, eventually causing prediction performance to degrade (Guyon and Elisseef, 2003) and therefore limiting the number of features that can usefully be included.
ParCrys outperformed the OB-Score over both the FEAT and FEAT-RL datasets, with significantly higher AROC values (two-tailed P-value < 0.01).
3.2 Evaluation and comparison with other methods
In order to classify protein sequences as amenable to crystallization, or recalcitrant to crystallization, it is necessary to determine a threshold for the ParCrys probability density estimate. A threshold value of 3 564 600 was obtained by optimizing accuracy [Equation (10)] over the FEAT dataset Receiver-Operator Characteristic (ROC) space. The ParCrys accuracy estimate was then obtained by applying the threshold derived over FEAT to the independent test sets. A threshold value of 0.809 for the OB-Score method was similarly defined. FEAT is estimated to be structurally independent of the test datasets (see Methods), and so thresholds calculated from FEAT can be applied over the test datasets with little risk of bias. TEST-RL, the restricted length subset of the TEST dataset, was developed for comparison of ParCrys, OB-Score, SECRET and CRYSTALP; this was necessary because only sequences with length between 46 and 200 can be analysed by SECRET and CRYSTALP (Chen et al., 2007; Smialowski et al., 2006). The SECRET predictions for TEST-RL and TEST-SOL were obtained from the web server (http://webclu.bio.wzw.tum.de:8080/secret/). The CRYSTALP predictions for TEST-RL were kindly provided by the CRYSTALP authors. Table 3 summarizes the accuracy and Matthews correlation coefficient data for TEST-RL, TEST-SOL, TEST and TEST-W. ParCrys outperforms the OB-Score, SECRET and CRYSTALP on the data considered, with respective TEST-RL accuracy values of 79.1, 69.8, 58.1 and 46.5%. Fisher's exact test shows ParCrys and SECRET predictions for accuracy values calculated over TEST-RL to be significantly different (two-tailed P-value < 0.005). Figure 1 illustrates ROC data for ParCrys, the OB-Score, SECRET and CRYSTALP. The ParCrys and OB-Score TEST-RL AROC values were significantly different (two-tailed P-value < 0.056). The TEST-RL AROC values for ParCrys and SECRET were significantly different (two-tailed P-value < 10⁻¹¹).
A full ROC curve was not plotted for CRYSTALP, as this method returned a binary classification of sequences without raw scores. Interestingly, SECRET outperformed CRYSTALP, despite a reported 77% accuracy for CRYSTALP in 10-fold crossvalidation over the training dataset (Chen et al., 2007). This highlights the inherent problems in evaluating methods without the use of independent test datasets. Additional testing was conducted with TEST-SOL to investigate ParCrys and SECRET predictive power for the crystallization of soluble proteins. ParCrys and SECRET accuracies over TEST-SOL were 75.0 and 20.8%, respectively. Whilst TEST-SOL is a relatively small dataset, Fisher's exact test finds ParCrys and SECRET predictions for accuracy values calculated over TEST-SOL are highly significantly different (two-tailed P-value < 0.0004). We note that the ParCrys accuracy over TEST-SOL was relatively close to that found over TEST-RL, whilst the TEST-SOL accuracy value for SECRET was much lower than that over TEST-RL.
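The threshold selection described in this section, in which a cutoff is chosen on one dataset by maximizing accuracy and then applied unchanged to independent data, can be sketched as below. The function name is hypothetical, and two details are assumptions: a sequence is predicted amenable when its score is greater than or equal to the threshold, and only observed score values are tried as candidate cutoffs.

```python
def best_accuracy_threshold(pos_scores, neg_scores):
    """Return (threshold, accuracy) for the cutoff that maximizes accuracy.

    A score >= threshold is predicted positive (assumed convention).
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))
    total = len(pos_scores) + len(neg_scores)
    best_t, best_acc = None, -1.0
    for t in candidates:
        tp = sum(s >= t for s in pos_scores)  # positives kept above the cutoff
        tn = sum(s < t for s in neg_scores)   # negatives rejected below it
        acc = (tp + tn) / total
        if acc > best_acc:                    # the first (lowest) maximizer is kept
            best_t, best_acc = t, acc
    return best_t, best_acc
```

Once a threshold has been fixed on the feature selection data, the test-set accuracy is obtained by applying that same value to the held-out scores rather than re-optimizing, which is what keeps the reported test accuracy non-optimized.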
Interestingly, ParCrys had a higher AROC value over the restricted length dataset TEST-RL compared with the non-restricted dataset TEST. Indeed, higher accuracy and Matthews correlation coefficient values were also found for TEST-RL compared with TEST. These observations concur with results presented for the SECRET method (Smialowski et al., 2006). For the data unrestricted by length, the datasets representing sequences where work was stopped before crystals were obtained had smaller maximal length values than the datasets representing sequences with diffraction quality crystals (Table 1). Also, the work stopped data (WS728) had smaller median and mean values of length, compared with the diffraction-quality crystals data (DIF728). Coupled with the observed improvement in ParCrys performance for the restricted length datasets, this suggests that the restriction by length may remove sequences that were successfully crystallized but incorrectly classified by the prediction methods. The diffraction-quality crystals sequences with length >600 (DIF-LONG) had median hydrophobicity of –0.83, similar to the WS728 value (–0.78). However, the DIF728 median hydrophobicity was –0.67, similar to the PDB3958 median hydrophobicity (–0.70). Therefore, the DIF-LONG dataset is relatively hydrophobic with a median hydrophobicity even lower than that of the work stopped before crystals dataset. Application of the Wilcoxon rank sum test shows the PDB3958 hydrophobicity to be significantly different to that of WS728 and DIF-LONG (respective one-tailed P-values 1.43e–10 and 0.016), whereas DIF728 is not significantly different to PDB3958 (one-tailed P-value 0.989). Therefore, the longest sequences with diffraction-quality crystals have a hydrophobicity distribution that is shifted away from that of PDB3958, compared to the hydrophobicity distribution of the overall diffraction-quality crystals dataset (DIF728). 
Moreover, this shift is in the same direction as that observed for the work stopped before crystals dataset (WS728). These observations may be rationalized by the tendency for larger proteins to have a greater proportion of residues in the hydrophobic core. Accordingly, the improved performance on shorter sequences appears to be partly due to incorrect classification of the longer diffraction-quality crystals sequences into the relatively more hydrophobic recalcitrant set.
We find unequal numbers of sequences in our non-redundant 'diffraction-quality crystals' and 'work stopped before crystals' datasets. Therefore, to evaluate ParCrys in the context of the observed distribution of real-world data, we took the full DIF728 and WS6025 sets (FEAT-W) to determine a ParCrys threshold by optimizing Matthews correlation coefficient over the ROC space. Simply optimizing for accuracy is flawed for these data because the negative class represents 89.2% of the data, and so the highest possible threshold value would be selected to always predict into the recalcitrant class. Independent test data were provided by the TEST-W set (see Methods and Table 1) and gave a ParCrys AROC of 0.738. The accuracy and Matthews correlation coefficient values were 74.0% and 0.227, respectively (Table 3), using the ParCrys threshold determined over FEAT-W. This threshold, derived from real-world distributions, defines the 'high-scoring' class of crystallization propensity prediction. Therefore, ParCrys provides a three-state classification: recalcitrant, amenable and high-scoring. We recommend that the high-scoring threshold is applied when the number of recalcitrant sequences is expected to outweigh the number of amenable sequences, for example when selecting amenable sequence(s) from a structurally uncharacterized group of orthologues. The ratio of positive:negative examples was approximately 1:8 in both the FEAT-W and TEST-W datasets. In general, ParCrys predictions are intended to provide guidance in selecting targets and should be interpreted in conjunction with additional knowledge about the sequence(s) of interest. An interesting further study would be to train a predictor to distinguish proteins that differ in only a few amino acids, yet have quite different crystallization characteristics.
Unfortunately, data with which to train such a predictor are currently too sparse, but targeting a predictor at borderline examples of this type is likely to become possible in the future.
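The argument against optimizing accuracy on imbalanced data can be checked with a short calculation. The sketch below scores a trivial classifier that predicts every FEAT-W sequence as recalcitrant, using the dataset sizes given in the text (728 positives, 6025 negatives). The standard definitions of accuracy and MCC are assumed, as is the convention that MCC is 0 when its denominator vanishes.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero (convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


n_pos, n_neg = 728, 6025  # DIF728 and WS6025 sizes from the text

# Always predict "recalcitrant": every negative is right, every positive is wrong.
acc_all_negative = accuracy(tp=0, tn=n_neg, fp=0, fn=n_pos)
mcc_all_negative = mcc(tp=0, tn=n_neg, fp=0, fn=n_pos)

print(round(100 * acc_all_negative, 1))  # 89.2
print(mcc_all_negative)                  # 0.0
```

The trivial classifier reaches 89.2% accuracy while its MCC is 0, which is why MCC is the more informative objective when choosing a threshold for data with a roughly 1:8 class ratio.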
| 4 CONCLUSIONS |
|---|
We present a novel algorithm, ParCrys, for estimating a protein's propensity to progress to the stage of diffraction-quality crystals via current structural biology techniques. We find the feature combination hydrophobicity, isoelectric point, S, G, C, F, Y, M to be optimal. Tests on independent data find that ParCrys outperforms the other publicly available methods, SECRET, CRYSTALP and the OB-Score. Also, in agreement with the authors of SECRET, we find that predictions are more accurate for sequences of length between 46 and 200 amino acids. ParCrys predictions and datasets are available from www.compbio.dundee.ac.uk/parcrys.
| ACKNOWLEDGEMENTS |
|---|
We thank Dr L. Kurgan for providing CRYSTALP predictions and Dr T. Walsh and Dr J. Monk for computational advice.
Funding: UK Biotechnology and Biological Sciences Research Council (BBS/B/14434 to G.J.B.); Scottish Bioinformatics Research Network (HR03021 to G.J.B.); UK Engineering and Physical Sciences Research Council (EP/E052029/1 to M.A.G.).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on June 1, 2007; revised on January 21, 2008; accepted on February 6, 2008
| REFERENCES |
|---|
Altschul S, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Apweiler R, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res (2004) 32:D115–D119.
Barton GJ, Sternberg MJE. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol (1987) 198:327.
Berman H, et al. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res (2007) 35:D301–D303.
Biertumpfel C, et al. Practical implementations for improving the throughput in a manual crystallization setup. J. Appl. Cryst (2005) 38:568–570.
Brenner SE. Target selection for structural genomics. Nat. Struct. Biol (2000) 7:967–969.
Burley S, et al. Structural genomics: beyond the human genome project. Nat. Genet (1999) 23:151–157.
Canaves JM, et al. Protein biophysical properties that correlate with crystallisation success in Thermotoga maritima: maximum clustering strategy for structural genomics. J. Mol. Biol (2004) 344:977–991.
Chandonia JM, Brenner SE. Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins (2005) 58:166–179.
Chandonia J-M, Brenner SE. The impact of structural genomics: expectations and outcomes. Science (2006) 311:347–351.
Chandonia JM, et al. Target selection and deselection at the berkeley structural genomics centre. Proteins (2006) 62:356–370.
Chayen NE. Turning protein crystallisation from an art into a science. Curr. Opin. Struct. Biol (2004) 14:577–583.
Chen K, et al. Prediction of protein crystallization using collocation of amino acid pairs. Biochem. Biophys. Res. Commun (2007) 355:764.
Chen L, et al. TargetDB: a target registration database for structural genomics projects. Bioinformatics (2004) 20:2860–2862.
Davies TG, et al. Structure-based design of a potent purine-based cyclin-dependent kinase inhibitor. Nat. Struct. Mol. Biol (2002) 9:745.
Diprose JM, et al. Translocation portals for the substrates and products of a viral transcription complex: the bluetongue virus core. EMBO J (2001) 20:7229–7239.
Duda R, Hart P. Pattern Classification and Scene Analysis (1973) London: Wiley.
Eddy SR. Profile hidden Markov models. Bioinformatics (1998) 14:755–763.
Engelman DM, et al. Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Ann. Rev. Biophys. Biophys. Chem (1986) 15:321–353.
Finn RD, et al. Pfam: clans, web tools and services. Nucleic Acids Res (2006) 34:D247–D251.
Galassi M, et al. GNU Scientific Library Reference Manual – Revised. (2006) 2nd edn. Network Theory Ltd.
Goh C, et al. Mining the Structural genomics pipeline: identification of protein properties that affect high-throughput experimental analyses. J. Mol. Biol (2004) 336:115–130.
Guyon I, Elisseef A. An introduction to variable and feature selection. J. Mach. Learn. Res (2003) 3:1157–1182.
Hiu R, Edwards E. High-throughput protein crystallisation. J. Struct. Biol (2003) 142:154–161.
Hol W. Structural genomics for science and society. Nat. Struct. Biol (2000) 7:964–966.
Liu J, et al. Automatic target selection for structural genomics on eukaryotes. Proteins (2004) 56:188–200.
Overton IM, Barton GJ. A normalised scale for structural genomics target ranking: the OB-Score. FEBS Lett (2006) 580:4005.
Parzen E. On estimation of a probability density function and mode. Ann. Math. Stat (1962) 33:1065–1076.
Poppe SM, et al. Antiviral activity of the Dihydropyrone PNU-140690, a new nonpeptide human immunodeficiency virus protease inhibitor. Antimicrob. Agents Chemother (1997) 41:1058–1063.
Puesy M, et al. Life in the fast lane for protein crystallization and X-ray crystallography. Prog. Biophys. Mol. Biol (2005) 88:359–386.
R Development Core Team. R: A language and environment for statistical computing. (2004) R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Rice P, et al. EMBOSS: the european molecular biology open software suite. Trends Genet (2000) 16:276–277.
Rost B. Twilight zone of protein sequence alignments. Protein Eng (1999) 12:85–94.
Savchenko A, et al. Strategies for structural proteomics of prokaryotes: quantifying the advantages of studying orthologous proteins and of using both NMR and x-ray crystallography approaches. Proteins (2003) 50:392–399.
Schuttelkopf AW, et al. Screening-based discovery and structural dissection of a novel family 18 chitinase Inhibitor. J. Biol. Chem (2006) 281:27278–27285.
Service R. Tapping DNA for structures produces a trickle. Science (2002) 298:948–950.
Service R. Structural genomics, round 2. Science (2005) 307:1554–1558.
Shapiro L, Harris T. Finding function through structural genomics. Curr. Opin. Biotechnol (2000) 11:31.
Singh SK, et al. Structural basis for duffy recognition by the malaria parasite duffy-binding-like domain. Nature (2006) 439:741.
Smialowski P, et al. Will my protein crystallize? A sequence-based predictor. Proteins: Struct. Funct. Bioinformatics (2006) 62:343–355.
Stajich JE, et al. The bioperl toolkit: perl modules for the life sciences. Genome Res (2004) 12:1611–1618.
Stevens RC, et al. Global efforts in structural genomics. Science (2001) 294:89–92.
Terwillinger TC. Structural genomics in North America. Nat. Struct. Biol (2000) 7:935–939.
Todd AE, et al. Progress of structural genomics initiatives: an analysis of solved target structures. J. Mol. Biol (2005) 348:1235.
von Itzstein M, et al. Rational design of potent sialidase-based inhibitors of influenza virus replication. Nature (1993) 363:418.
Wan H, Wootton JC. A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem (2000) 24:71.
Wang G, Dunbrack R. PISCES: a protein sequence culling server. Bioinformatics (2003) 19:1589–1591.
Yard BA, et al. The structure of serine palmitoyltransferase; gateway to sphingolipid biosynthesis. J. Mol. Biol (2007) 370:870–886.
Zarembinski TI, et al. Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. PNAS (1998) 95:15189–15193.