

Bioinformatics Advance Access originally published online on February 19, 2008
Bioinformatics 2008 24(7):901-907; doi:10.1093/bioinformatics/btn055

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ParCrys: a Parzen window density estimation approach to protein crystallization propensity prediction

Ian M. Overton 1, Gianandrea Padovani 2, Mark A. Girolami 2 and Geoffrey J. Barton 1,*

1 School of Life Sciences Research, University of Dundee, Dow Street, Dundee, DD1 5EH and 2 Department of Computing Science, University of Glasgow, Glasgow, GL12 8QQ, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

The ability to rank proteins by their likely success in crystallization is useful in current Structural Biology efforts and in particular in high-throughput Structural Genomics initiatives. We present ParCrys, a Parzen Window approach to estimate a protein's propensity to produce diffraction-quality crystals. The Protein Data Bank (PDB) provided training data whilst the databases TargetDB and PepcDB were used to define feature selection data as well as test data independent of feature selection and training. ParCrys outperforms the OB-Score, SECRET and CRYSTALP on the data examined, with accuracy and Matthews correlation coefficient values of 79.1% and 0.582, respectively (74.0% and 0.227, respectively, on data with a ‘real-world’ ratio of positive:negative examples). ParCrys predictions and associated data are available from www.compbio.dundee.ac.uk/parcrys.

Contact: geoff{at}compbio.dundee.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Knowledge of protein three-dimensional structure is useful for addressing and formulating key scientific questions, as well as stimulating technological innovation (Hol, 2000; Todd et al., 2005). For example, structure-based drug design has informed the process of developing treatments for human diseases such as HIV (Poppe et al., 1997), influenza (von Itzstein et al., 1993), asthma (Schuttelkopf et al., 2006) and cancer (Davies et al., 2002). Structural insights have revealed molecular mechanisms of cellular function (Shapiro and Harris, 2000; Zarembinski et al., 1998), pathogenesis (Diprose et al., 2001; Singh et al., 2006) and genetic disease (Yard et al., 2007). In addition, the structures of proteins from thermophilic organisms suggest a promising basis for the development of industrial biocatalysts (Hol, 2000).

Structural genomics is a global enterprise that aspires to produce a large-scale mapping of protein structure space (Burley et al., 1999; Chandonia and Brenner, 2006; Hol, 2000; http://www.isgo.org/home/index.php; Service, 2002; Stevens et al., 2001). However, it is common for only 5% of selected protein targets to result in a high-resolution protein model (http://www.ebi.ac.uk/msd-srv/msdtarget; Service, 2002, 2005; Terwilliger, 2000). Various strategies have been proposed to increase the return from structural genomics efforts, including obtaining one representative structure per protein family and working with multiple orthologues (Brenner, 2000; Chandonia and Brenner, 2005; Hui and Edwards, 2003; Liu et al., 2004; Savchenko et al., 2003). These approaches imply methods to rank proteins within functionally defined groups according to their likely success in the structural genomics pipeline. As crystallization is a significant bottleneck in structural biology, the relative ease of obtaining diffraction-quality crystals is a key consideration (Biertumpfel et al., 2005; Chayen, 2004; Pusey et al., 2005). For example, to date, the Structural Proteomics in Europe (SPINE) consortium reports 58% success for the progression from ‘Expressed’ to ‘Soluble’ and 53% success for the progression from ‘Purified’ to ‘Crystallized’ (http://www.ebi.ac.uk/msd-srv/msdtarget). Investigations into the relationship between protein sequence properties and progression through the structural genomics pipeline have suggested features relevant to crystallization propensity prediction (Canaves et al., 2004; Goh et al., 2004). Indeed, there are already methods that aim to predict crystallization propensity. These include SECRET (Smialowski et al., 2006), the OB-Score (Overton and Barton, 2006) and more recently, CRYSTALP (Chen et al., 2007). A limitation of both SECRET and CRYSTALP is that they only accept as input sequences of length between 46 and 200 amino acids.
While the OB-Score has no such limitation, it only exploits two predictive features to estimate crystallization propensity: isoelectric point and hydrophobicity. In this article, we describe a Parzen Window density estimator that incorporates additional features not considered by the OB-Score, whilst permitting prediction over the full range of sequence length. The Parzen Window technique affords a non-parametric statistical approach to estimate a density function from a given sample dataset (Duda and Hart, 1973; Parzen, 1962). We took sequences from the Protein Data Bank (PDB) (Berman et al., 2007) culled at 25% identity, with R factor of ≤0.3 and resolution ≤3.0 Å downloaded from PISCES (Wang and Dunbrack, 2003) as the model for the Parzen Window density estimate. The propensity of a test protein to progress to the stage of diffraction-quality crystals was estimated according to the density function derived from the PDB sequences; no negative examples are required for this. In independent tests, we found that this new method (ParCrys) outperforms the OB-Score, SECRET and CRYSTALP.


    2 METHODS
2.1 Datasets
Selection of appropriate training and test datasets is critical in the development of a predictor. In this study we developed 13 primary datasets, which were filtered by stringent sequence similarity thresholds in order to minimize bias. The Parzen Window density estimate was always based on the PDB data. Feature selection was conducted using TargetDB datasets. Additional datasets for method evaluation were created to be independent of the PDB and feature selection data. The datasets are summarized in Table 1. Details of dataset creation are as follows.


Table 1. Primary datasets

 
2.1.1 Training and feature selection datasets
The Parzen Window density estimate was derived from a dataset of 3958 PDB (Berman et al., 2007) structures with a resolution of at least 3.0 Å, maximum R-factor of 0.3 and filtered at 25% sequence identity (PDB3958), obtained from the PISCES server (Wang and Dunbrack, 2003) on August 4, 2006.

For feature selection, positive and negative datasets DIF728 and WS6025, respectively, were taken from TargetDB as described previously (Overton and Barton, 2006). Briefly, DIF728 comprises 728 sequences that gave diffraction-quality crystals, and WS6025 comprises 6025 sequences where work had been stopped before crystals were obtained. The negative feature selection dataset WS728 of the same size as DIF728 was created by randomly selecting 728 sequences from WS6025. The WS728 and DIF728 pair is referred to as FEAT. The WS6025 and DIF728 pair is referred to as FEAT-W, which was used only to define a ParCrys threshold for predicting over ‘real-world’ distributions.

2.1.2 Independent test datasets
Datasets were constructed to provide a test independent of the training and feature selection data. The test datasets were made to be both non-redundant and not to show similarity to each other or to the FEAT and PDB3958 sets, according to stringent thresholds. This was achieved by filtering with PSIBLAST (Altschul et al., 1997), using HMMER to cluster against Pfam (Eddy, 1998; Finn et al., 2006) and finally Z-score clustering with AMPS (Barton and Sternberg, 1987) as explained below.

In order to provide a search database for PSIBLAST, the training and feature selection datasets (PDB3958, DIF728 and WS6025) were combined with UniRef50 (Apweiler et al., 2004), low-complexity filtered using SEG (Wan and Wootton, 2000) and helixfilt (D. Jones personal communication) to produce the database DEVEL_U50. The positive test dataset, representing proteins that readily crystallize, was based on TargetDB (Chen et al., 2004) sequences downloaded on April 17, 2007 and with a date stamp later than April 2006; the date filtering aims to avoid sequences from the DIF728 and WS6025 datasets (both downloaded on February 2, 2006). Sequences annotated with ‘Diffraction-quality Crystals’ and not annotated with ‘In PDB’ were taken and searched against DEVEL_U50 using PSIBLAST (five iterations, default settings). Sequences were eliminated if they matched a sequence from the PDB3958, DIF728 or WS6025 datasets in any PSIBLAST iteration according to ‘similar structure’ thresholds (Rost, 1999); the 128 sequences that did not match were termed TDB_FILT. As a redundancy filtering step, TDB_FILT was searched against Pfam version 21.0 using HMMER (default settings). The top-scoring TDB_FILT sequence for each Pfam family matched was taken, as well as the 16 sequences without a Pfam match, leaving a total of 78 sequences (TDB_FAM). For further redundancy filtering, TDB_FAM was clustered by AMPS (Barton and Sternberg, 1987) with a Z-score threshold of 5. This left 72 sequences to form the positive test dataset (T_POS72).
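The Pfam redundancy step described above (keep the top-scoring sequence per matched family, plus every sequence with no Pfam match) can be sketched as follows. This is a minimal illustration only: the `Hit` tuple layout, the function name and the toy scores are assumptions for the example, not the authors' pipeline, and it simplifies each sequence to a single best family assignment.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record: (sequence id, Pfam family or None if unmatched, HMMER score).
Hit = Tuple[str, Optional[str], float]


def family_representatives(hits: List[Hit]) -> List[str]:
    """Keep the top-scoring sequence per family and all unmatched sequences."""
    best: Dict[str, Tuple[str, float]] = {}
    keep = set()
    for seq_id, family, score in hits:
        if family is None:
            # Sequences without a Pfam match are always retained.
            keep.add(seq_id)
        elif family not in best or score > best[family][1]:
            best[family] = (seq_id, score)
    keep.update(seq_id for seq_id, _ in best.values())
    return sorted(keep)


# Toy input (illustrative identifiers and scores only).
example_hits = [
    ("seqA", "PF_one", 50.0),
    ("seqB", "PF_one", 80.0),
    ("seqC", "PF_two", 30.0),
    ("seqD", None, 0.0),
]
representatives = family_representatives(example_hits)
```

In the toy input, `seqB` displaces `seqA` as the representative of the first family, while the unmatched `seqD` is kept regardless of score.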

The negative test dataset was based on PepcDB trial sequences, parsed from the XML (http://pepcdb.pdb.org). These sequences ostensibly represent the actual construct sequence used. Sequences were taken if annotated as Status ‘work stopped’ and with Status History including ‘Cloned’ but not including an indicator of crystallization (e.g. ‘Crystals’). Sequences were excluded if annotated as ‘test target’, if the ‘stopDetails’ included ‘duplicate target found’, and DNA sequences were filtered out leaving 12 070 sequences. These were processed by a protocol similar to that applied in generating T_POS72, resulting in 614 sequences (T_NEG614). In order to safeguard against overlap between the positive and negative test datasets, it was necessary to filter T_NEG614 against T_POS72. This step aims to eliminate sequences where work was stopped due to target deselection (Chandonia et al., 2006), although it may also concomitantly remove examples of ‘work stopped’ data that are relatively similar to the ‘diffraction-quality crystals’ data. However, only 4 ‘work stopped’ sequences were eliminated by this step (see below). To provide a database for this purpose, T_POS72 and UniRef50 were combined, and processed with SEG and helixfilt as for DEVEL_U50 to generate TPOS_U50. T_NEG614 was searched against TPOS_U50 using PSIBLAST (five iterations, default settings) and matches determined according to ‘similar structure’ thresholds (Rost, 1999). The 4 sequences from T_NEG614 that matched a T_POS72 sequence were eliminated to produce T_NEG610. In order to derive a negative test set of the same size as the positive test set, 72 sequences were randomly selected from T_NEG610 (T_NEG72). The T_NEG72 and T_POS72 pair is termed TEST. For testing ‘real-world’ distributions, the T_POS72 and T_NEG610 pair is termed TEST-W.

Additional datasets were required to allow comparison with the SECRET and CRYSTALP methods, because these methods only give predictions for sequences with length between 46 and 200 (Chen et al., 2007; Smialowski et al., 2006). T_POS72 and T_NEG610, respectively, have 43 (T_POS43) and 197 (T_NEG197) sequences with length between 46 and 200. In order to derive a negative test set of the same size as T_POS43, 43 sequences were randomly selected from T_NEG197 (T_NEG43). The T_POS43 and T_NEG43 pair is termed TEST-RL.

Also, WS6025 and DIF728 were used as a basis to provide length-restricted datasets independent of the TEST-RL dataset. DIF728 and WS6025, respectively, have 246 (DIF246) and 3103 (WS3103) sequences with length between 46 and 200. In order to derive a negative test set of the same size as DIF246, 246 sequences were randomly selected from WS3103 (WS246). The DIF246 and WS246 pair is termed FEAT-RL.
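The two operations used repeatedly above, restricting a dataset to sequences of length 46 to 200 and randomly drawing a negative set of the same size as the positive set, can be sketched as below. The function names, the inclusive treatment of the length bounds and the fixed random seed are assumptions made for illustration; the source does not state how the random selection was seeded.

```python
import random
from typing import Dict, List


def restrict_length(seqs: Dict[str, str], lo: int = 46, hi: int = 200) -> Dict[str, str]:
    """Keep sequences whose length lies in [lo, hi] (bounds assumed inclusive)."""
    return {name: s for name, s in seqs.items() if lo <= len(s) <= hi}


def balanced_negative(neg_ids: List[str], n_pos: int, seed: int = 0) -> List[str]:
    """Randomly pick n_pos negative identifiers without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(neg_ids, n_pos)


# Toy data: four artificial sequences straddling the length limits.
toy = {"a": "M" * 45, "b": "M" * 46, "c": "M" * 200, "d": "M" * 201}
kept = restrict_length(toy)
subset = balanced_negative(["n1", "n2", "n3", "n4", "n5"], n_pos=2)
```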

Additional test datasets were developed from the PepcDB trial sequences. These datasets allow testing on actual construct sequences for both the positive and negative test data, as well as allowing examination of predictive power for the crystallization of soluble proteins. For the positive set, sequences were taken if Status History included ‘Diffraction-quality crystals’ but did not include ‘In PDB’ and if their length was ≥30. These 569 sequences were processed by a protocol similar to that applied in generating T_POS72, resulting in 30 sequences, of which 16 had length between 46 and 200 (T_POS16). The negative test set was based on T_NEG614 identifiers, with ‘trialSequences’ taken if work had been stopped after ‘status’ ‘soluble’, but before ‘status’ ‘crystallized’; thus 28 sequences were identified, of which 12 had length between 46 and 200 (T_NEG12). To derive a positive test set of the same size as the negative test set, 12 sequences were randomly selected from T_POS16 (T_POS12). The T_POS12 and T_NEG12 pair is termed TEST-SOL. For TEST-SOL no filter was applied to remove potential overlap between the positive and negative data.

2.2 Feature calculation
For each protein sequence, the number of low complexity regions identified by SEG (Wan and Wootton, 2000), hydrophobicity, isoelectric point (pI) and the standard amino acid frequencies were calculated. Hydrophobicity was calculated as the sum of Goldman-Engelman-Steitz (GES) hydrophobicity values (Engelman et al., 1986) for all residues, divided by the sequence length. Isoelectric point was calculated using the Bioperl (Stajich et al., 2004) pI calculator module with EMBOSS-defined pKa values (Rice et al., 2000).
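Two of the features above are simple enough to sketch directly: the amino acid frequencies and the length-normalized hydrophobicity. The sketch below is illustrative only. The hydrophobicity scale is passed in as a dictionary, and `TOY_SCALE` holds placeholder numbers, not the published GES values; the pI and SEG calculations are not reproduced here.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_frequencies(seq: str) -> Dict[str, float]:
    """Fraction of the sequence made up by each standard amino acid."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq.upper())
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def mean_hydrophobicity(seq: str, scale: Dict[str, float]) -> float:
    """Sum of per-residue scale values divided by sequence length.

    Raises KeyError for residues missing from the scale (e.g. 'X').
    """
    if not seq:
        raise ValueError("empty sequence")
    return sum(scale[aa] for aa in seq.upper()) / len(seq)


# Placeholder scale for demonstration only; NOT the GES values.
TOY_SCALE = {"A": 1.0, "S": -1.0}

freqs = aa_frequencies("AAS")
hydro = mean_hydrophobicity("AAS", TOY_SCALE)
```

A feature vector such as the final ParCrys combination would then concatenate pI, hydrophobicity and the selected frequencies (for example `freqs["S"]`), with any rescaling of the features left as a separate decision.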

2.3 Parzen window probability density function estimate
The Parzen Window density estimator technique (Duda and Hart, 1973; Parzen, 1962) aims to estimate an unknown probability density p(x) from a set of observations; in this case, the observations are provided by the PDB3958 dataset. The Parzen Window density estimator was implemented in C++ using the GNU Scientific Library (Galassi et al., 2006). Let us suppose that the observations are drawn from p(x) in a D-dimensional space. The probability P that a vector x belongs to a specific region R can be expressed as:


$$P = \int_{R} p(\mathbf{x}')\,\mathrm{d}\mathbf{x}' \qquad (1)$$
Supposing N observations are drawn from p(x), the points K falling inside the region R can be estimated by:


$$K \simeq NP \qquad (2)$$
Assuming the region R is sufficiently small so that the probability density p(x) can be considered constant within R, then P can be approximated to:


$$P \simeq p(\mathbf{x})\,V \qquad (3)$$
where V is the volume of the region R. Combining Equations 2 and 3 we can write:


$$p(\mathbf{x}) \simeq \frac{K}{NV} \qquad (4)$$
Equation (4) depends on two contradictory assumptions. The region R is assumed to be small enough for the probability density p(x) to be considered constant, yet Equation (2) assumes R is large enough to contain a sample of K points. Nevertheless, the Parzen Window probability density estimator exploits this result to model p(x). Thus, K can be determined given a fixed value of V. As N tends to infinity, the estimate of p(x) tends towards the true value (Duda and Hart, 1973).

Supposing the region R to be a D-dimensional hypercube centred on the point x, with length of the hypercube sides given by h, V is defined:


$$V = h^{D} \qquad (5)$$
We can count K, the number of points falling inside R according to the ‘Window Function’, k(u):


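$$k(\mathbf{u}) = \begin{cases} 1, & |u_i| \le 1/2,\quad i = 1, \ldots, D \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$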
The term |ui| refers to the Euclidean distance in dimension i for the considered data points. Given a data point xn, the quantity k[(x − xn)/h] will be one if xn falls within the hypercube centred on x with side h, and zero otherwise. The value of K can therefore be defined by:


$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right) \qquad (7)$$
Substituting Equation (5) and Equation (7) into Equation (4) defines a new expression for the estimated probability density at x:


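$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^{D}}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right) \qquad (8)$$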
From Equation (6), k(u) is symmetrical, therefore Equation (8) can be reinterpreted as the sum over N cubes centred on the N data points xn. The use of a step function creates the presence of artificial discontinuities in the probability density function estimate, so to obtain a smoother density model we chose a Gaussian kernel, which gives rise to the following density model:


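$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^{2})^{D/2}} \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^{2}}{2h^{2}}\right) \qquad (9)$$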
where h now represents the SD of the Gaussian components. Thus, the final density function estimate is obtained by adding up the contributions of several Gaussian models placed over the whole data set and then normalizing. Therefore, the value of h is crucial to correctly describing the PDB3958 data. The model becomes too sensitive to noise when h is too small, whilst information loss occurs when h is too large.
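The two estimators discussed above, the hypercube window of Equation (8) and the Gaussian kernel of Equation (9), can be sketched in a few lines. This is a plain-Python illustration under the standard textbook forms of those equations; it is not the authors' C++ implementation, and the function names and toy data are assumptions.

```python
import math
from typing import Sequence

Point = Sequence[float]


def parzen_hypercube(x: Point, data: Sequence[Point], h: float) -> float:
    """Hypercube-window estimate: K / (N * h^D), with K counted by the step window."""
    d = len(x)
    inside = sum(
        1 for xn in data
        if all(abs(a - b) / h <= 0.5 for a, b in zip(x, xn))
    )
    return inside / (len(data) * h ** d)


def parzen_gaussian(x: Point, data: Sequence[Point], h: float) -> float:
    """Gaussian-kernel estimate: mean of isotropic Gaussians with SD h centred on the data."""
    d = len(x)
    norm = (2.0 * math.pi * h * h) ** (d / 2.0)
    total = 0.0
    for xn in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xn))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return total / (len(data) * norm)


# Toy 1-D sample: the step window jumps as x crosses a cube edge,
# whereas the Gaussian estimate varies smoothly.
sample = [[0.0], [1.0]]
step_near = parzen_hypercube([0.24], sample, h=0.5)
step_far = parzen_hypercube([0.26], sample, h=0.5)
smooth_near = parzen_gaussian([0.24], sample, h=0.5)
smooth_far = parzen_gaussian([0.26], sample, h=0.5)
```

With a small h in several dimensions the Gaussian normalizing constant becomes very large, so density values far above one are expected and are not a sign of error.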

The optimal value of h was first estimated using PDB3958, with 2000 search increments between values hmax and hmin of 0.8 and 0.0001, respectively. The Area under the Receiver-Operator Characteristic (AROC) curve was determined by 10-fold crossvalidation, identifying a value of h equal to 0.0157 with the best PDB3958 AROC. Five additional values of h around this initial h-value were examined to maximize the AROC for the Parzen Window density estimate derived from PDB3958 and predicting over all of the FEAT dataset, finding a final value of h equal to 0.040. To facilitate separation of test data into the classes ‘amenable to crystallization’ and ‘recalcitrant to crystallization’ a probability density cutoff value was determined by optimizing accuracy [Equation (10)] over the FEAT predictions. This cutoff was applied to give ‘non-optimized’ accuracy and Matthews correlation coefficient [Equation (11)] for predictions over TEST, TEST-RL and TEST-SOL.


$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (10)$$


$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (11)$$
TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.
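A short sketch of Equations (10) and (11), together with one way of choosing a density cutoff by maximizing accuracy, is given below. It assumes that a score at or above the cutoff is predicted ‘amenable’, that candidate cutoffs are the observed scores, and that ties are resolved in favour of the lowest cutoff; these details and all names are illustrative assumptions rather than the published procedure.

```python
import math
from typing import List, Tuple


def confusion(scores: List[float], labels: List[int], cutoff: float) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN); label 1 = crystallized, score >= cutoff = predicted amenable."""
    tp = tn = fp = fn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= cutoff
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_cutoff(scores: List[float], labels: List[int]) -> float:
    """Observed score that maximizes accuracy (lowest such score on ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: accuracy(*confusion(scores, labels, c)))


# Toy scores: two positives score higher than two negatives.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 1, 0, 0]
cutoff = best_cutoff(toy_scores, toy_labels)
```

The cutoff found on one dataset would then be applied unchanged to held-out data, which is what makes the resulting accuracy ‘non-optimized’.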


    3 RESULTS AND DISCUSSION
3.1 Selection of features
In order to rank the amino acids by their effectiveness in prediction, the FEAT dataset was analysed using three features: isoelectric point (pI), hydrophobicity and each of the 20 amino-acid frequencies in turn (Table S1, supplementary data). The AROC was calculated for each feature combination. Random predictions will give AROC values of 0.5 whilst a perfect predictor would give an AROC value of 1. From the ranking given in Table S1, amino acid frequencies were successively included as features (Table 2). ParCrys outperforms the OB-Score when using only pI and hydrophobicity as features. This gain is likely due to ParCrys providing a smoother and non-parametric representation of the data, when compared with the OB-Score. Adding the ‘S’ and ‘C’ frequencies provides a performance gain of similar size, with additional features providing small improvements up to the best performing combination of features ‘pI, hydrophobicity, S, C, G, F, Y, M’. A similar process was repeated with FEAT but for pI, hydrophobicity and the number of low-complexity regions predicted by SEG (Wan and Wootton, 2000). However, the AROC values obtained were worse than those obtained without including the SEG information (see Supplementary Material). SEG has previously been used in structural genomics target selection to eliminate targets that are ‘... likely to be intractable for HT study’ (Chandonia et al., 2006), so this pre-filtering of the structural genomics targets may be the underlying reason that the SEG predictions are not found to be informative with the FEAT dataset.
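The AROC values used to compare feature combinations can be computed directly from the scores of the positive and negative sets. The sketch below uses the rank-based (Mann-Whitney) formulation, in which the AROC is the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half. This is one standard way to obtain the area and is an assumption about the computation, which the text does not specify.

```python
from typing import Sequence


def aroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Toy cases covering the two reference points mentioned in the text.
perfect = aroc([3.0, 4.0], [1.0, 2.0])       # every positive outscores every negative
uninformative = aroc([1.0, 1.0], [1.0, 1.0])  # scores carry no information
```

The pairwise loop is quadratic in dataset size, which is acceptable for sets of a few thousand sequences; a sort-based version would be preferable for much larger data.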


Table 2. Successive addition of amino acids in combination with isoelectric point (pI) and hydrophobicity

 
There is substantial overlap between the features we selected based on FEAT and features found to be informative by Goh et al. (2004) as well as for the SECRET method (Smialowski et al., 2006). Serine frequency and hydrophobicity were found to be important by all three studies. In addition to amino-acid frequencies, the SECRET method involved reduced alphabet representations. The SECRET selected features (Smialowski et al., 2006) include contributions from all the amino acid features used in ParCrys. The CRYSTALP features (Chen et al., 2007) were based on co-location of amino acid pairs. Features selected for CRYSTALP also include contributions from all the amino acids taken as ParCrys features. Goh et al. identified highest ranked features for various stages of the structural genomics pipeline; all the ParCrys selected features contribute to this list except for the amino acid frequencies for F and Y. Unlike the Goh et al., CRYSTALP and SECRET approaches, we do not find any charged amino acid frequencies to be informative. This is probably because our amino acid features were examined in combination with pI and hydrophobicity; pI probably serves to represent much of the information present in the individual frequencies of charged amino acids. The sparseness of data increases with the number of selected features, eventually causing prediction performance to degrade (Guyon and Elisseeff, 2003) and therefore limiting the number of features that can usefully be included.

ParCrys outperformed the OB-Score over both the FEAT and FEAT-RL datasets, with significantly higher AROC values (two-tailed P-value < 0.01).

3.2 Evaluation and comparison with other methods
In order to classify protein sequences as ‘amenable to crystallization’, or ‘recalcitrant to crystallization’, it is necessary to determine a threshold for the ParCrys probability density estimate. A threshold value of 3 564 600 was obtained by optimizing accuracy [Equation (10)] over the FEAT dataset Receiver-Operator Characteristic (ROC) space. The ParCrys accuracy estimate was then obtained by applying the threshold derived over FEAT to the independent test sets. A threshold value of 0.809 for the OB-Score method was similarly defined. FEAT is estimated to be structurally independent of the test datasets (methods), and so thresholds calculated from FEAT can be applied over the test datasets with little risk of bias. TEST-RL, the restricted length subset of the TEST dataset, was developed for comparison of ParCrys, OB-Score, SECRET and CRYSTALP; this was necessary because only sequences with length between 46 and 200 can be analysed by SECRET and CRYSTALP (Chen et al., 2007; Smialowski et al., 2006). The SECRET predictions for TEST-RL and TEST-SOL were obtained from the web server (http://webclu.bio.wzw.tum.de:8080/secret/). The CRYSTALP predictions for TEST-RL were kindly provided by the CRYSTALP authors. Table 3 summarizes the accuracy and Matthews correlation coefficient data for TEST-RL, TEST-SOL, TEST and TEST-W. ParCrys outperforms the OB-Score, SECRET and CRYSTALP on the data considered, with respective TEST-RL accuracy values of 79.1, 69.8, 58.1 and 46.5%. Fisher's exact test shows ParCrys and SECRET predictions for accuracy values calculated over TEST-RL to be significantly different (two-tailed P-value < 0.005). Figure 1 illustrates ROC data for ParCrys, the OB-Score, SECRET and CRYSTALP. The ParCrys and OB-Score TEST-RL AROC values were significantly different (two-tailed P-value < 0.056). The TEST-RL AROC values for ParCrys and SECRET were significantly different (two-tailed P-value < 10^-11).
A full ROC curve was not plotted for CRYSTALP, as this method returned a binary classification of sequences without raw scores. Interestingly, SECRET outperformed CRYSTALP, despite a reported 77% accuracy for CRYSTALP in 10-fold crossvalidation over the training dataset (Chen et al., 2007). This highlights the inherent problems in evaluating methods without the use of independent test datasets. Additional testing was conducted with TEST-SOL to investigate ParCrys and SECRET predictive power for the crystallization of soluble proteins. ParCrys and SECRET accuracies over TEST-SOL were 75.0 and 20.8%, respectively. Whilst TEST-SOL is a relatively small dataset, Fisher's exact test finds ParCrys and SECRET predictions for accuracy values calculated over TEST-SOL are highly significantly different (two-tailed P-value < 0.0004). We note that the ParCrys accuracy over TEST-SOL was relatively close to that found over TEST-RL, whilst the TEST-SOL accuracy value for SECRET was much lower than that over TEST-RL.
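The Fisher's exact comparisons above can be approximated from the reported figures alone. The sketch below assumes that each test was run on a 2×2 table of correct versus incorrect predictions per method, with counts recovered from the stated accuracies and dataset sizes (for example 79.1% of 86 TEST-RL sequences is 68 correct). That table layout is an assumption; the text does not state how the tables were built, so the resulting P-values are indicative only.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so equally likely tables are not lost to rounding.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Counts inferred from the reported accuracies (assumption, see lead-in):
# TEST-RL, 86 sequences: ParCrys 68 correct / 18 wrong, SECRET 50 / 36.
p_test_rl = fisher_exact_two_tailed(68, 18, 50, 36)
# TEST-SOL, 24 sequences: ParCrys 18 correct / 6 wrong, SECRET 5 / 19.
p_test_sol = fisher_exact_two_tailed(18, 6, 5, 19)
```

Under these assumptions both P-values fall well below 0.05, and they can be compared with the bounds quoted in the text.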


 
Fig. 1. ROC data for ParCrys, the OB-Score, SECRET and CRYSTALP are shown for the independent test dataset TEST-RL. Also ROC curves for ParCrys and the OB-Score are shown for independent test dataset TEST. ParCrys outperforms the other methods over both TEST-RL and TEST, with AROC values of 0.844 and 0.752, respectively. A full ROC curve was not plotted for CRYSTALP, as this method returned a binary classification of sequences without raw scores. This figure was generated in R (R Development Core Team, 2004).

 

Table 3. Accuracy and Matthews correlation coefficient for test datasets

 
Interestingly, ParCrys had a higher AROC value over the restricted length dataset TEST-RL compared with the non-restricted dataset TEST. Indeed, higher accuracy and Matthews correlation coefficient values were also found for TEST-RL compared with TEST. These observations concur with results presented for the SECRET method (Smialowski et al., 2006). For the data unrestricted by length, the datasets representing sequences where ‘work was stopped before crystals were obtained’ had smaller maximal length values than the datasets representing sequences with diffraction quality crystals (Table 1). Also, the ‘work stopped’ data (WS728) had smaller median and mean values of length, compared with the ‘diffraction-quality crystals’ data (DIF728). Coupled with the observed improvement in ParCrys performance for the restricted length datasets, this suggests that the restriction by length may remove sequences that were successfully crystallized but incorrectly classified by the prediction methods. The ‘diffraction-quality crystals’ sequences with length >600 (DIF-LONG) had median hydrophobicity of –0.83, similar to the WS728 value (–0.78). However, the DIF728 median hydrophobicity was –0.67, similar to the PDB3958 median hydrophobicity (–0.70). Therefore, the DIF-LONG dataset is relatively hydrophobic with a median hydrophobicity even lower than that of the ‘work stopped before crystals’ dataset. Application of the Wilcoxon rank sum test shows the PDB3958 hydrophobicity to be significantly different to that of WS728 and DIF-LONG (respective one-tailed P-values 1.43e–10 and 0.016), whereas DIF728 is not significantly different to PDB3958 (one-tailed P-value 0.989). Therefore, the longest sequences with diffraction-quality crystals have a hydrophobicity distribution that is shifted away from that of PDB3958, compared to the hydrophobicity distribution of the overall ‘diffraction-quality crystals’ dataset (DIF728). 
Moreover, this shift is in the same direction as that observed for the ‘work stopped before crystals’ dataset (WS728). These observations may be rationalized by the tendency for larger proteins to have a greater proportion of residues in the hydrophobic core. Accordingly, the improved performance on shorter sequences appears to be partly due to incorrect classification of the longer ‘diffraction-quality crystals’ sequences into the relatively more hydrophobic ‘recalcitrant’ set.

We find unequal numbers of sequences in our non-redundant datasets ‘diffraction-quality crystals’ and ‘work stopped before crystals’. Therefore, to evaluate ParCrys in the context of the observed distribution of ‘real-world’ data, we took the full DIF728 and WS6025 sets (FEAT-W) to determine a ParCrys threshold by optimizing Matthews correlation coefficient over the ROC space. Simply optimizing for accuracy is flawed for these data because the negative class represents 89.2% of the data, and so the highest possible threshold value would be selected to always predict into the ‘recalcitrant’ class. Independent test data were provided by the TEST-W set (see methods and Table 1) and gave a ParCrys AROC of 0.738. The accuracy and Matthews correlation coefficient values were 74.0% and 0.227, respectively (Table 3), using the ParCrys threshold determined over FEAT-W. This threshold, derived from ‘real-world’ distributions, defines the class ‘high-scoring crystallization propensity prediction’. Therefore, ParCrys provides a three state classification: ‘recalcitrant’, ‘amenable’ and ‘high-scoring’. We recommend the ‘high-scoring’ threshold is applied when the number of ‘recalcitrant’ sequences is expected to outweigh the number of ‘amenable’ sequences, for example when selecting ‘amenable’ sequence(s) from a structurally uncharacterized group of orthologues. The ratio of positive:negative examples was approximately 1:8 in both the FEAT-W and TEST-W datasets. In general, ParCrys predictions are intended to provide guidance in selecting targets and should be interpreted in conjunction with additional knowledge about the sequence(s) of interest. An interesting further study would be to train a predictor to distinguish proteins that differ in only a few amino acids, yet have quite different crystallization characteristics. 
Unfortunately, data with which to train such a predictor are currently too sparse, but targeting a predictor at ‘borderline’ examples of this type is likely to become possible in the future.
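The argument against optimizing accuracy on the ‘real-world’ data can be checked with simple arithmetic on the dataset sizes given in the text. The sketch below evaluates a trivial classifier that labels every sequence ‘recalcitrant’; the function name is illustrative, and the convention of reporting a Matthews correlation coefficient of 0 when its denominator vanishes is an assumption.

```python
import math
from typing import Tuple


def all_negative_metrics(n_pos: int, n_neg: int) -> Tuple[float, float]:
    """Accuracy and MCC for a classifier that predicts every example as negative."""
    tp, fp = 0, 0
    fn, tn = n_pos, n_neg
    acc = (tp + tn) / (n_pos + n_neg)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # No positive predictions makes the denominator zero; report 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# FEAT-W: DIF728 (positive) and WS6025 (negative).
feat_w_acc, feat_w_mcc = all_negative_metrics(728, 6025)
# Negatives per positive in FEAT-W and in TEST-W (T_POS72, T_NEG610).
feat_w_ratio = 6025 / 728
test_w_ratio = 610 / 72
```

The trivial classifier reaches roughly 89% accuracy on FEAT-W while carrying no information (MCC of 0), which is why the Matthews correlation coefficient is the more suitable target for threshold selection on such data.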


    4 CONCLUSIONS
We present a novel algorithm, ParCrys, for estimating a protein's propensity to progress to the stage of diffraction-quality crystals via current structural biology techniques. We find the feature combination ‘Hydrophobicity, isoelectric point, S, G, C, F, Y, M’ to be optimal. Tests on independent data show that ParCrys outperforms the other publicly available methods, SECRET, CRYSTALP and the OB-Score. Also, in agreement with the authors of SECRET, we find that predictions are more accurate for sequences of length between 46 and 200 amino acids. ParCrys predictions and datasets are available from www.compbio.dundee.ac.uk/parcrys.


    ACKNOWLEDGEMENTS
 
We thank Dr L. Kurgan for providing CRYSTALP predictions and Dr T. Walsh and Dr J. Monk for computational advice.

Funding: UK Biotechnology and Biological Sciences Research Council (BBS/B/14434 to G.J.B.); Scottish Bioinformatics Research Network (HR03021 to G.J.B.); UK Engineering and Physical Sciences Research Council (EP/E052029/1 to M.A.G.).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on June 1, 2007; revised on January 21, 2008; accepted on February 6, 2008

    REFERENCES
 

    Altschul S, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.

    Apweiler R, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res (2004) 32:D115–D119.

    Barton GJ, Sternberg MJE. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol (1987) 198:327.

    Berman H, et al. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res (2007) 35:D301–D303.

    Biertumpfel C, et al. Practical implementations for improving the throughput in a manual crystallization setup. J. Appl. Cryst (2005) 38:568–570.

    Brenner SE. Target selection for structural genomics. Nat. Struct. Biol (2000) 7:967–969.

    Burley S, et al. Structural genomics: beyond the human genome project. Nat. Genet (1999) 23:151–157.

    Canaves JM, et al. Protein biophysical properties that correlate with crystallisation success in Thermotoga maritima: maximum clustering strategy for structural genomics. J. Mol. Biol (2004) 344:977–991.

    Chandonia JM, Brenner SE. Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins (2005) 58:166–179.

    Chandonia J-M, Brenner SE. The impact of structural genomics: expectations and outcomes. Science (2006) 311:347–351.

    Chandonia JM, et al. Target selection and deselection at the berkeley structural genomics centre. Proteins (2006) 62:356–370.

    Chayen NE. Turning protein crystallisation from an art into a science. Curr. Opin. Struct. Biol (2004) 14:577–583.

    Chen K, et al. Prediction of protein crystallization using collocation of amino acid pairs. Biochem. Biophys. Res. Commun (2007) 355:764.

    Chen L, et al. TargetDB: a target registration database for structural genomics projects. Bioinformatics (2004) 20:2860–2862.

    Davies TG, et al. Structure-based design of a potent purine-based cyclin-dependent kinase inhibitor. Nat. Struct. Mol. Biol (2002) 9:745.

    Diprose JM, et al. Translocation portals for the substrates and products of a viral transcription complex: the bluetongue virus core. EMBO J (2001) 20:7229–7239.

    Duda R, Hart P. Pattern Classification and Scene Analysis (1973) London: Wiley.

    Eddy SR. Profile hidden Markov models. Bioinformatics (1998) 14:755–763.

    Engelman DM, et al. Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Ann. Rev. Biophys. Biophys. Chem (1986) 15:321–353.

    Finn RD, et al. Pfam: clans, web tools and services. Nucleic Acids Res (2006) 34:D247–D251.

    Galassi M, et al. GNU Scientific Library Reference Manual – Revised. (2006) 2nd edn. Network Theory Ltd.

    Goh C, et al. Mining the Structural genomics pipeline: identification of protein properties that affect high-throughput experimental analyses. J. Mol. Biol (2004) 336:115–130.

    Guyon I, Elisseeff A. An introduction to variable and feature selection. J. Mach. Learn. Res (2003) 3:1157–1182.

    Hui R, Edwards A. High-throughput protein crystallisation. J. Struct. Biol (2003) 142:154–161.

    Hol W. Structural genomics for science and society. Nat. Struct. Biol (2000) 7:964–966.

    Liu J, et al. Automatic target selection for structural genomics on eukaryotes. Proteins (2004) 56:188–200.

    Overton IM, Barton GJ. A normalised scale for structural genomics target ranking: the OB-Score. FEBS Lett (2006) 580:4005.

    Parzen E. On estimation of a probability density function and mode. Ann. Math. Stat (1962) 33:1065–1076.

    Poppe SM, et al. Antiviral activity of the Dihydropyrone PNU-140690, a new nonpeptide human immunodeficiency virus protease inhibitor. Antimicrob. Agents Chemother (1997) 41:1058–1063.

    Pusey M, et al. Life in the fast lane for protein crystallization and X-ray crystallography. Prog. Biophys. Mol. Biol (2005) 88:359–386.

    R Development Core Team. R: A language and environment for statistical computing. (2004) R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

    Rice P, et al. EMBOSS: the european molecular biology open software suite. Trends Genet (2000) 16:276–277.

    Rost B. Twilight zone of protein sequence alignments. Protein Eng (1999) 12:85–94.

    Savchenko A, et al. Strategies for structural proteomics of prokaryotes: quantifying the advantages of studying orthologous proteins and of using both NMR and x-ray crystallography approaches. Proteins (2003) 50:392–399.

    Schuttelkopf AW, et al. Screening-based discovery and structural dissection of a novel family 18 chitinase Inhibitor. J. Biol. Chem (2006) 281:27278–27285.

    Service R. Tapping DNA for structures produces a trickle. Science (2002) 298:948–950.

    Service R. Structural genomics, round 2. Science (2005) 307:1554–1558.

    Shapiro L, Harris T. Finding function through structural genomics. Curr. Opin. Biotechnol (2000) 11:31.

    Singh SK, et al. Structural basis for duffy recognition by the malaria parasite duffy-binding-like domain. Nature (2006) 439:741.

    Smialowski P, et al. Will my protein crystallize? A sequence-based predictor. Proteins: Struct. Funct. Bioinformatics (2006) 62:343–355.

    Stajich JE, et al. The bioperl toolkit: perl modules for the life sciences. Genome Res (2002) 12:1611–1618.

    Stevens RC, et al. Global efforts in structural genomics. Science (2001) 294:89–92.

    Terwilliger TC. Structural genomics in North America. Nat. Struct. Biol (2000) 7:935–939.

    Todd AE, et al. Progress of structural genomics initiatives: an analysis of solved target structures. J. Mol. Biol (2005) 348:1235.

    von Itzstein M, et al. Rational design of potent sialidase-based inhibitors of influenza virus replication. Nature (1993) 363:418.

    Wan H, Wootton JC. A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem (2000) 24:71.

    Wang G, Dunbrack R. PISCES: a protein sequence culling server. Bioinformatics (2003) 19:1589–1591.

    Yard BA, et al. The structure of serine palmitoyltransferase; gateway to sphingolipid biosynthesis. J. Mol. Biol (2007) 370:870–886.

    Zarembinski TI, et al. Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. PNAS (1998) 95:15189–15193.





