Bioinformatics Advance Access originally published online on September 27, 2005
Bioinformatics 2005 21(23):4230-4238; doi:10.1093/bioinformatics/bti690

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions{at}oxfordjournals.org

Bioinformatics analyses of circular dichroism protein reference databases

Robert W. Janes

School of Biological and Chemical Sciences, Queen Mary, University of London Mile End Road, E1 4NS, UK


    ABSTRACT

Motivation: Circular dichroism (CD) spectroscopy has become established as a key method for determining the secondary structure contents of proteins which has had a significant impact on molecular biology. Many excellent mathematical protocols have been developed for this purpose and their quality is above question. However, reference database sets of proteins, with CD spectra matched to secondary structure components derived from X-ray structures, provide the key resource for this task. These databases were created many years ago, before most CD spectrophotometers became standardized and before it was commonplace to validate X-ray structures prior to publication. The analyses presented here were undertaken to investigate the overall quality of these reference databases in light of their extensive usage in determining protein secondary structure content from CD spectra.

Results: The analyses show that there are a number of significant problems associated with the CD reference database sets in current use. There are disparities between CD spectra for the same protein collected by different groups. These include differences in magnitudes, peak positions or both. However, many current reference sets are now amalgamations of spectra from these groups, introducing inconsistencies that can lead to inaccuracies in the determination of secondary structure components from the CD spectra. A number of the X-ray structures used fall short on the validation criteria now employed as standard for structure determination. Many have substantial percentages of residues in the disallowed regions of the Ramachandran plot. Hence their calculated secondary structure components, used as a foundation for the reference databases, are likely to be in error. Additionally, the coverage of secondary structure space in the reference datasets is poorly correlated to the secondary structure components found in the Protein Data Bank. A conclusion is that a new reference CD database with cross-correlated, machine-independent CD spectra and validated X-ray structures that cover more secondary structure components, including diverse protein folds, is now needed. However, that reasonably accurate values for the secondary structure content of proteins can be determined from spectra is a testament to CD spectroscopy being a very powerful technique.

Contact: r.w.janes{at}qmul.ac.uk


    1 INTRODUCTION
Circular dichroism (CD) spectroscopy has become an invaluable research technique, used by many labs worldwide, for gaining information about protein structure, dynamics and interactions both with other proteins and with ligands. This is possible because different types of secondary structures give rise to characteristic CD spectra, which differ in their peak positions and intensities, and to a first approximation a spectrum can be considered to arise from the weighted sum of these components. The information content available from CD is wavelength-range dependent and analyses of spectral data can determine the number of independent eigenvectors needed to reconstruct the original spectrum. For data down to wavelengths of ~190 nm this number is between three and four (Hennessey and Johnson, 1981). However, because secondary structure components are not independent of each other (Pancoska et al., 1992), solutions for a greater number of components than there are independent eigenvectors can be found (Hennessey and Johnson, 1981; Wallace and Janes, 2001).

There are a number of methods that have been developed for deconvoluting CD spectra into the calculated secondary structure components present in the protein. These include as examples, linear least-squares (Chen and Yang, 1971; Brahms and Brahms, 1980), parameterized fit (Provencher and Glöckner, 1981), singular-value decomposition (Hennessey and Johnson, 1981), non-linear least-squares (Wallace and Teeters, 1987) and self-consistent variable selection methods (Sreerama and Woody, 1993; Johnson, 1999; Sreerama and Woody, 2000). These methods are based on very sound mathematical approaches. They have been enhanced and refined over the years, and because of this they can yield reasonable results for the calculation of secondary structure content. Many of these methods are to be found in DICHROWEB (Lobley and Wallace, 2001; Lobley et al., 2002; Whitmore and Wallace, 2004), a package designed to aid in the determination of secondary structure content and used world wide. Such is the wide-scale use of CD spectroscopy in research that the creation of the Protein Circular Dichroism Data Bank (PCDDB) has been proposed, which will act as a repository and resource for CD spectra and associated data (Wallace et al., 2005).
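The simplest of the approaches listed above, linear least-squares, treats a spectrum as a weighted sum of component spectra and solves for the weights. The sketch below illustrates only that idea. The Gaussian band shapes, their positions and the function names are hypothetical stand-ins rather than measured reference spectra, and the fit is unconstrained, unlike the published methods.

```python
import math

def gaussian_band(wavelengths, centre, width, height):
    """Synthetic band shape; purely illustrative, not a measured CD spectrum."""
    return [height * math.exp(-((w - centre) / width) ** 2) for w in wavelengths]

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def least_squares_fractions(basis, spectrum):
    """Solve the normal equations (B^T B) f = B^T s for the weights f."""
    k = len(basis)
    btb = [[sum(p * q for p, q in zip(basis[i], basis[j])) for j in range(k)]
           for i in range(k)]
    bts = [sum(p * q for p, q in zip(basis[i], spectrum)) for i in range(k)]
    return solve_linear(btb, bts)

# Hypothetical component spectra over 190-240 nm (assumed shapes).
wavelengths = list(range(190, 241))
basis = [
    gaussian_band(wavelengths, 208, 10, -1.0),
    gaussian_band(wavelengths, 217, 8, -0.5),
    gaussian_band(wavelengths, 198, 6, -0.8),
]
true_fractions = [0.5, 0.3, 0.2]
# Build a noise-free spectrum as the weighted sum of the components.
spectrum = [sum(f * b[i] for f, b in zip(true_fractions, basis))
            for i in range(len(wavelengths))]
fitted = least_squares_fractions(basis, spectrum)
```

With noise-free synthetic data the fitted weights reproduce the input fractions. With real spectra the result depends directly on the quality of the reference set, which is the concern of the analyses presented here.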

Synchrotron radiation circular dichroism (SRCD), first developed in 1980 (Sutherland et al., 1980), has recently become a potentially valuable tool for substantially extending the wavelength range of available data due to the increased photon flux of the source over conventional CD (cCD) machines at the lower wavelength limits. For data collected over the full SRCD range, down to ~160 nm, the information content rises to at least seven or eight eigenvectors. These data may be deconvoluted into as many as 12 different secondary, and perhaps supersecondary, structure types thereby enabling a much more detailed resolution of structural features than has been possible from a cCD source (Wallace and Janes, 2001).

Empirical determination of the secondary structure components from CD spectral data employs reference databases. These are either a combination of CD spectra from a set of proteins with known secondary structure content, obtained from their X-ray crystallography structures, or principal component spectra derived from a set of individual spectra. Examples of these databases are Chang et al. (1978), Bolotina et al. (1980a,b), Brahms and Brahms (1980), Provencher and Glöckner (1981), Compton and Johnson (1986), Pancoska and Keiderling (1991) and Sreerama et al. (2000). These reference datasets were created early in the development of CD spectroscopy as a technique for proteins, and significantly more recent databases are for the most part combinations of older ones, and include no or limited new protein secondary structure types and few, if any, new protein constituents. For the major reference databases available, the lowest wavelength data included are to 178 nm. Of note, for SRCD measurements, while a higher number of resolvable secondary structure types should potentially be determinable and with a greater degree of accuracy than is possible from cCD sources, currently no databases are capable of covering down to the full wavelength range available to this technique.

The work presented here analyses the current CD reference databases for the quality of the CD data and X-ray structures used, for their breadth of secondary structure types covered and their effectiveness at covering fold space.


    2 METHODS
2.1 CD spectra
CD spectral data were obtained from reference databases at the CDPRO (Sreerama et al., 2000) program website (http://lamar.colostate.edu/~sreeram/CDPro/). Additional spectra were obtained from the Brahms and Brahms (1980) reference dataset (provided by Prof. Jon B. Applequist, personal communication) and the Supplementary data (Pancoska and Keiderling, 1995). An SRCD spectrum for γ-crystallin came from Paul Evans and Dr Christine Slingsby. As some original spectra were collected at non-integral wavelengths, an in-house program was used to interpolate these to integral wavelengths for comparison purposes. This did not, however, alter the spectral characteristics in any way (data not shown). Brahms and Brahms spectra were reported in mean residue ellipticity (θ) values and were therefore scaled to match the delta epsilon (Δε) units used in CDPRO by dividing by 3298.
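The two preprocessing steps described above can be sketched as follows. The in-house program is not described in detail, so linear interpolation is an assumption made here for illustration, and the function names are hypothetical. Only the division by 3298 is taken from the text.

```python
import math

def interpolate_to_integral(wavelengths, values):
    """Linearly interpolate a spectrum onto integral wavelengths (nm).

    Assumes at least two points with distinct wavelengths. Linear
    interpolation is an illustrative choice, not the documented method.
    """
    pairs = sorted(zip(wavelengths, values))
    start = math.ceil(pairs[0][0])
    stop = math.floor(pairs[-1][0])
    result = []
    j = 0
    for w in range(start, stop + 1):
        # Advance to the measured interval that contains w.
        while j + 1 < len(pairs) - 1 and pairs[j + 1][0] < w:
            j += 1
        (w0, v0), (w1, v1) = pairs[j], pairs[j + 1]
        t = (w - w0) / (w1 - w0)
        result.append((w, v0 + t * (v1 - v0)))
    return result

def mre_to_delta_epsilon(mean_residue_ellipticity):
    """Convert mean residue ellipticity to delta epsilon by dividing by 3298."""
    return mean_residue_ellipticity / 3298.0

# Hypothetical values measured at half-integral wavelengths.
resampled = interpolate_to_integral([189.5, 190.5, 191.5], [1.0, 3.0, 5.0])
```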

2.2 X-ray structure data
The X-ray structure data in Tables 1 and 2 are derived from Pancoska and Keiderling (1995) (their original Table 1 set of structures) and from the Exp32 reference set of Sreerama et al. (2000). Original Protein Data Bank (PDB) (Bernstein et al., 1977; Berman et al., 2000) files were used for subsequent analyses, even when these had been superseded, as the data within the reference databases are still derived from this original material. Atomic co-ordinates were from PDB files (http://www.pdb.org/) or from the archive site for obsolete structures (http://pdbobs.sdsc.edu/index.cgi).


Table 1 Set of reference proteins used by Pancoska et al., 1995

 

Table 2 Exp32: a set of reference proteins used in SELCON3 and many other amalgamation sets

 
2.3 X-ray structure analyses
The resolution was obtained from the PDB files. Structural fold information was from the CATH (class, architecture, topology and homologous superfamily) protein topology website (Orengo et al., 1997; Pearl et al., 2000, 2005). The DSSP (definition of secondary structure of proteins) program (Kabsch and Sander, 1983) was used (http://bioweb.pasteur.fr/seqanal/interfaces/dssp-simple.html) to assign secondary structure content to these proteins. Percentages of secondary structure, derived from the DSSP output, were determined using an in-house program as were the percentages and numbers of ‘missing’ (undetermined) residues in these structures. These missing residues were checked against the sequence data of the native protein in each case (http://us.expasy.org/srs5/). PROCHECK (Laskowski et al., 1993) was used to derive the percentages of residues in fully, additionally and generously allowed and disallowed regions of the Ramachandran plot.
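The in-house step of turning DSSP output into percentages might look like the sketch below. It assumes the per-residue DSSP codes have already been extracted into a string (H, G, I, E, B, T, S, with a blank for residues given no assignment) and that the full native sequence length is known. The function name and the choice to express structure percentages relative to resolved residues are assumptions, not details given in the text.

```python
from collections import Counter

def dssp_percentages(dssp_codes, sequence_length):
    """Summarise a string of per-residue DSSP codes.

    Returns (percentages by code, number missing, percentage missing).
    Structure percentages are relative to resolved residues (an assumption);
    the missing percentage is relative to the full native sequence.
    """
    resolved = len(dssp_codes)
    if resolved == 0 or sequence_length < resolved:
        raise ValueError("inconsistent residue counts")
    # DSSP leaves the code blank for unassigned residues; label these '-'.
    counts = Counter(code if code.strip() else "-" for code in dssp_codes)
    percentages = {code: 100.0 * n / resolved for code, n in counts.items()}
    missing = sequence_length - resolved
    return percentages, missing, 100.0 * missing / sequence_length

# Hypothetical 12-residue protein with 10 residues resolved in the structure.
percentages, missing, missing_pct = dssp_percentages("HHHHEE  TT", 12)
```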

2.4 Evaluating the correlation coefficient of alpha helix and beta sheet content of all PDB proteins against Exp32
The values for percentage alpha helical against beta sheet content of all proteins in the PDB were binned into ‘ten percentile tranches’ (0–9.99, 10–19.99, etc.), not including any nucleic acid material but leaving in all homologue proteins. The reasoning was that any of these proteins could have their CD spectra recorded and they were therefore eligible for inclusion. The proteins of the Exp32 set were binned in a similar way to enable a direct comparison. To quantify the coverage of ‘secondary structure space’ of the Exp32 set compared with the whole PDB, the standard Pearson r² correlation coefficient was calculated. To create an idealized set of data maximizing the coverage of these secondary structure components for a reference set containing only 32 proteins, each bin of PDB data was reduced by an overall scaling term such that the total of proteins then approximated to 32. Rounding these new values to the nearest integer, and rounding one value of >0.49 manually to one, created the desired idealized 32 protein set.
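A minimal sketch of this binning and correlation procedure is given below. The function names are hypothetical, and correlating the flattened grids over all bins is an assumption, since the text does not state which bins entered the calculation. The single manual rounding adjustment described above is not reproduced.

```python
def bin_counts(proteins, width=10):
    """Count (helix %, sheet %) pairs in a grid of ten-percent bins."""
    n_bins = 100 // width
    grid = [[0] * n_bins for _ in range(n_bins)]
    for helix_pct, sheet_pct in proteins:
        # A value of exactly 100% is placed in the top bin (an assumption).
        row = min(int(helix_pct // width), n_bins - 1)
        col = min(int(sheet_pct // width), n_bins - 1)
        grid[row][col] += 1
    return grid

def flatten(grid):
    """Turn a 2D grid of counts into a flat list, row by row."""
    return [count for row in grid for count in row]

def pearson_r2(x, y):
    """Square of the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant data")
    return (sxy * sxy) / (sxx * syy)

def scale_to_total(grid, target_total):
    """Scale all bins by one factor so the total approximates target_total."""
    total = sum(flatten(grid))
    scale = target_total / total
    # int(x + 0.5) rounds halves upwards, unlike Python's round().
    return [[int(count * scale + 0.5) for count in row] for row in grid]

# Hypothetical data: a large set and a small reference set.
large_set = bin_counts([(5, 5), (15, 25), (45, 10), (45, 12), (100, 0), (62, 3)])
small_set = bin_counts([(45, 11), (60, 5), (12, 28)])
coverage_r2 = pearson_r2(flatten(large_set), flatten(small_set))
```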

2.5 Evaluating the correlation coefficient of fold space
In a manner similar to that for the secondary structure components, the fold space of single-domain proteins was obtained from the CATH database for all proteins in the PDB. These data were correlated with those from the Exp32 set of proteins and a hypothetical set of proteins was also generated to characterize the quality of fold space coverage (in fact comprising 31, as there was one protein unclassified in the Exp32 set).


    3 RESULTS AND DISCUSSION
3.1 Quality of CD spectra in reference databases
The CD spectra used in many of the current reference databases are derived from amalgamations of previously created datasets from different groups, with the aim of broadening the secondary structure types represented by the proteins in these sets. As examples, Exp32 used as a stand-alone set in SELCON3 (Sreerama and Woody, 1993; Sreerama et al., 1999) and as part of many sets in CDPRO (Sreerama et al., 2000, http://lamar.colostate.edu/~sreeram/CDPro/), and analysed here, contains 29 CD spectra from Johnson (a gift as stated in Sreerama et al., 1999) and 3 from Sreerama et al. (1999). The set CDDATA56, currently the largest used in CDPRO containing 56 CD spectra, is a combination from Johnson (Sreerama et al., 1999) (29), Sreerama et al. (1999) (3), Yang et al. (1978) (a gift as stated in Provencher and Glöckner, 1981) (6), Pancoska and Keiderling (1991) (5) and membrane proteins from Park et al. (1992) (13). An inherent problem is that the spectra used were often obtained from individual non-commercial CD machines, or machines modified to collect CD data, with, at the time, limited cross-reference calibration. CD spectra for the same protein from the different sets used in these amalgamated databases have different spectral features in a number of cases, as illustrated in Figures 1 and 2. Figure 1 shows the CD spectra of superoxide dismutase from three original databases (Johnson, 1999; Pancoska and Keiderling, 1995 and Brahms and Brahms, 1980). The spectra are considerably different from each other, and yet are from the same source material. In addition, each of these spectra is equated to secondary structure components from the same PDB file (2sod). Only one of these spectra is now used in the current datasets, but it is unclear why one was chosen over the rest, or which one is the ‘actual’ spectrum of the protein. Figure 2a–c show the spectra from Brahms and Brahms (1980) compared with that from Johnson (1999).
Here the differences are not so pronounced as in Figure 1, being either wavelength shifts in the spectra, magnitude shifts in the peaks, ratio differences between peaks or a combination of these, but they are nevertheless serious when being used as the basis for empirical calculations of secondary structure content for novel proteins. Figure 2d compares a spectrum of γ-crystallin from a database set with an SRCD spectrum of the same protein (Evans et al., 2004). Although the spectra have comparable characteristics there is a 10 nm shift between them, the SRCD spectrum being down-wavelength from that in the database. Whilst CD spectra from single-source reference databases were possibly ‘internally consistent’ when used as an isolated set, when they became components within an amalgamated set, their differences, as exemplified here, created problems with consistency within these combined sets. To ensure consistency, cross-calibration and cross-checking on a diverse range of machines are of vital importance to remove possible machine bias within the data (Miles et al., 2003, 2005).



Fig. 1 Superoxide dismutase spectra of the same protein from three CD reference databases: Johnson (blue), Pancoska and Keiderling (red) and Brahms and Brahms (green). These spectra are equated to the same secondary structure data derived from PDB file 2sod.

 


Fig. 2 (a)–(c) CD spectra of three proteins from current reference database sets in CDPRO (blue) in comparison with the spectra of the same protein from a set now not used (Brahms and Brahms, 1980) (red). The proteins are (a) prealbumin, (b) lactate dehydrogenase and (c) lysozyme. The spectra in (d) are of γ-crystallin from the reference database sets used in CDPRO (blue) and SRCD (red) spectral data recorded by Evans et al. (2004).

 
3.2 Wavelength range of CD reference spectra
CD spectra from different reference databases were collected over different wavelength ranges, as illustrated in Figures 1 and 2. The information content depends directly on the wavelength range: the shorter the range, the less information is available. The data from Johnson (1999), for example, have a wavelength range of 178–240 nm, while those of the CDDATA56 set, which incorporates the Johnson data, cover only 190–240 nm because other contributing groups collected over a shorter range. This reduction in range represents a significant difference in the available information content, decreasing rather than increasing the number of secondary structure components that can be derived from the data (Wallace and Janes, 2001). In addition, none of the current reference sets covers the range obtainable by SRCD, down to ~160 nm, and any new reference database would need to address this current shortcoming.

3.3 Quality of X-ray structures in the reference databases
The majority of the X-ray structures used in current reference databases were taken from the limited number available in the PDB (Bernstein et al., 1977; Berman et al., 2000) in the early 1980s, and some are from earlier. This has an inherent and serious weakness associated with it. Many of these structures were determined long before any systematic refinement protocols, checking and validation programs like PROCHECK (Laskowski et al., 1993) were available. With limited validation, to a lesser or greater extent flaws do exist in a number of these structures which went undetected at that time. However, they are still used either within a stand-alone set or as part of an amalgamation set for determining secondary structure content from a CD spectrum. Data on some of these reference set proteins are presented in Tables 1 and 2 as illustrative examples. The Tables give the PDB code for the given protein at the time of their database inception, CATH database code, resolution of the structure, percentage secondary structure content, percentage residues missing (undetermined) in the structure and percentage residues in the fully, additionally and generously allowed and disallowed regions of the Ramachandran plot, as defined by PROCHECK. Another issue is that it is assumed that protein crystal and solution structures are the same despite the environments being markedly different. Any structural differences that might result from different conditions, e.g. concentrations, salts, pH, etc., could also compound the database inaccuracies in secondary structure determination from CD data.

3.4 Ramachandran plot quality of structures
For the Ramachandran plot, PROCHECK defines a threshold for well-resolved, accurate structures as >90% of their residues being located in the most favoured region, and a structure is flagged should the value fall below this level. If the value is <80% then a double flag is issued to draw attention to potentially more serious problems within the structure. Table 1 is a reference dataset from Pancoska et al. (1995), and some members of this set are now used in amalgamation sets in CDPRO. All 23 (100%) proteins of this set are under the 90% threshold for the most favoured region. Additionally, 13 of the 23 structures (57%) are under the 80% threshold. Five proteins of this set are now used in amalgamated datasets (2cga, 4adh, 1ca2, 2grs and 1rhd), of which 3 (60%) are below the 80% threshold. In Table 2, the Exp32 set, 22 of 32 structures (69%) have less than the 90% threshold for residues in the most favoured region. Of these, 8 (25%) are doubly flagged, as being <80%, substantially less than optimal. Failing to reach these thresholds indicates there may be some degree of error in their determined conformations, which means that secondary structure contents derived from them are also likely to be in error. Many of these structures would not be publishable today given the validation procedures now employed. Figures 3 and 4 illustrate some of the problems associated with the structures comprising the reference databases. Here, only 14.8 and 59.9% of residues, respectively, are located in the most favoured region of the Ramachandran plot. Indeed, in the first case, 24.6% of the residues are in the disallowed region of the plot, indicating a larger number of residues wholly incorrect than correct.
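The flagging rule described above can be expressed as a short sketch. This is not PROCHECK code; the function names are hypothetical, and the treatment of a value of exactly 90% as unflagged is an assumption. The two lowest example values are those quoted for Figures 3 and 4, while the other two are hypothetical.

```python
def ramachandran_flag(most_favoured_pct):
    """Return 0 (no flag), 1 (flag: below 90%) or 2 (double flag: below 80%)."""
    if most_favoured_pct < 80.0:
        return 2
    if most_favoured_pct < 90.0:
        return 1
    return 0  # exactly 90% is treated as passing here (an assumption)

def summarise_flags(most_favoured_values):
    """Count flagged structures; the flagged total includes double-flagged ones."""
    flags = [ramachandran_flag(v) for v in most_favoured_values]
    total = len(flags)
    flagged = sum(1 for f in flags if f >= 1)
    double = sum(1 for f in flags if f == 2)
    return {
        "flagged": flagged,
        "double_flagged": double,
        "flagged_pct": 100.0 * flagged / total,
        "double_flagged_pct": 100.0 * double / total,
    }

# 14.8 and 59.9 are the values quoted in the text; 85.0 and 92.0 are hypothetical.
summary = summarise_flags([14.8, 59.9, 85.0, 92.0])
```

Applied to the counts reported for Exp32, 22 of 32 structures corresponds to 68.75% (reported as 69%) and 8 of 32 to 25%.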



Fig. 3 A Ramachandran plot output, modified from PROCHECK (Laskowski et al., 1993) of ‘Prot1’, a protein used in current reference databases. The areas marked are fully (red), additionally (yellow), generously (fawn) allowed and disallowed (white) regions for amino acid φ/ψ angles. Shown in red lettering are those residues in the generously allowed and disallowed regions.

 


Fig. 4 A Ramachandran plot output (modified from PROCHECK) of ‘Prot2’, another protein in the current reference databases. Refer to Figure 3 for a description of what is depicted.

 
X-ray structures solved at low resolution can potentially have regions where errors arise from an inability to follow the electron density accurately. Some structures with less than optimal resolution are in the reference sets. In Table 2 the five structures with the lowest percentage of amino acids in the fully allowed region are all solved at a resolution of 2.5 Å or worse. Clearly, the more incorrect the conformation, the more incorrect the percentages of secondary structure types derived, and this calls into question their reliability for use in the reference datasets. Only a few of the structures have serious mistakes; nevertheless, inclusion of small errors will introduce a degree of inaccuracy, which in turn will lead to erroneous calculation of secondary structure components for empirical determination from a CD spectrum.

3.5 Coverage of secondary structure space
The numbers of X-ray structures used within the reference databases are limited, especially so when compared with those in the PDB. For their optimal utilization it would be important for the sets accurately to reflect the secondary structures present in the PDB. Figure 5 shows, as an example, a plot of the coverage of alpha helical against beta sheet content for (a) all protein structures in the PDB, (b) the Exp32 set and (c) a theoretical idealized set also containing 32 proteins. The Exp32 reference set does not cover the same secondary structure space as found in the PDB. Correlating the two sets of data, as described in the Methods, gives a value for r² of 0.55. This is significantly lower than it could be and is again reflective of the limits imposed in having minimal numbers of protein structures available at the inception of these databases. By comparison, the r² term is 0.95 for the idealized set, indicating the coverage of secondary structure space is more extensive, even for such a small dataset.



Fig. 5 Plot of alpha helix against beta sheet content for (a) all proteins in the PDB compared with (b) those proteins found in the CD reference set Exp32 used in SELCON3, and as part of most other reference databases, and (c) for an idealized reference dataset also with 32 proteins, with high correlation to the PDB, as described in the text. Beyond the diagonal base line represents possible regions for alpha/beta content.

 
Increasing the number to that in CDDATA56 (not incorporating the membrane proteins, so this becomes a 43 protein set) gains little in the secondary structure coverage, the r² value now becoming 0.62. However, increasing the number of available proteins in the reference set even allows improvement for an idealized set, the r² becoming 0.97. Optimizing this correlation with the PDB is one way of ensuring that for a given number of proteins within a reference dataset, they will be as representative as is possible of the whole PDB.

3.6 Coverage of fold space
The maximum number of structures in the current reference databases used is 56, and this includes 13 membrane protein structures together with their related spectra. Whether this is an advisable move is a matter of debate (Wallace et al., 2003; Sreerama and Woody, 2004). With SRCD sources the increased information content may enable analysing for the fold of a protein; it is therefore interesting to consider how broad-based the coverage of fold space in the current reference databases is. This was not an issue at their inception, where interest lay only in the secondary structure content of the proteins, but it would be an important factor to consider for any new reference databases. Table 2 gives the CATH entry code for the structures in Exp32. Figure 6 shows these data in relation to those for all single-domain proteins in the PDB and to a hypothetical set of proteins with the same number as that in Exp32 (31 here as 1 was unclassified in this set). While some of the more populated CATH classes are well represented in the Exp32 set, others that should be present to ensure a good coverage of fold classes are clearly lacking. The r² correlation coefficient is 0.81 for Exp32 in relation to the entire PDB. In contrast, the hypothetical set of proteins with the same number of components as in Exp32 has an r² of 0.97, demonstrating that a better coverage of fold space is possible.



Fig. 6 Population of CATH fold space for single-domain proteins in the PDB (light blue) in comparison with data for the Exp32 set (blue) and with a hypothetical set of proteins, labelled Hyp32 (red) with the same number of proteins as that of the Exp32 set.

 
In Table 1, there are four structures present that are multi-domain proteins, containing two or more recognized CATH topologies in their structures, and two of these are used in amalgamated sets in CDPRO. Again, this was not an issue at the time of database inception. However, inclusion of multi-domain proteins into future reference database sets, which might be aimed at analysing for fold recognition (Wallace and Janes, 2001), would lead to difficulties in interpretation of such fold classes and so single-domain proteins would be the optimum to be used in such sets.

3.7 Completeness of structures
Of the Exp32 structures in Table 2, 11 (34%) are incomplete, having undetermined regions, possibly owing to inherent structural flexibility. These missing residues are predominantly lost from the N- and C-termini, and two structures have >5% of missing structure (5.43% for 1eri and 10.24% for 2pab) representing 15 and 18 residues, respectively. How missing structural content is accounted for in each of the databases is not always clear, especially because definitions pertinent to secondary structure features were also in their infancy and so totals counted for such features are sometimes not the same as those from current calculations. The main method assumes missing residues can be considered as ‘other’ (previously referred to as ‘random coil’) and thus adds them to that component's total. This is feasible only if the number of missing residues is insufficient to form any type of secondary structure component. Highly flexible regions with proportionately larger numbers of missing residues might prevent determination of other more structured areas containing secondary structure components that would be missed as a result. Another method might be to ignore the missing content and to take the known portion of determined structure as being the total content. Neither of these two methods is valid: one makes unsupported assumptions and the other introduces direct errors, and hence both are unsatisfactory. Indeed, there is no satisfactory answer for dealing with missing residues in protein structures being used for a CD reference dataset other than to use only ‘complete’ proteins, structures whose entire length of chain is resolvable. It should be put into perspective, however, that at the time of inception of the databases many limitations hampered the selection of structures.
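The effect of the two conventions discussed above can be seen in a small sketch. The residue counts are hypothetical and the function names are illustrative. Neither function is taken from any of the reference database programs.

```python
def percentages_missing_as_other(counts, n_missing):
    """Convention 1: add missing residues to the 'other' component."""
    adjusted = dict(counts)
    adjusted["other"] = adjusted.get("other", 0) + n_missing
    total = sum(adjusted.values())
    return {name: 100.0 * n / total for name, n in adjusted.items()}

def percentages_ignoring_missing(counts):
    """Convention 2: treat the resolved residues as the whole protein."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

# Hypothetical 100-residue protein with 10 residues missing from the structure.
resolved_counts = {"helix": 40, "sheet": 30, "other": 20}
as_other = percentages_missing_as_other(resolved_counts, 10)
ignored = percentages_ignoring_missing(resolved_counts)
```

In this hypothetical case the helix content is 40% under the first convention and about 44.4% under the second, so the same structure yields two different reference values depending on an arbitrary choice.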


    4 CONCLUSIONS
Despite the internal errors associated with the CD spectra and X-ray structures found in many reference databases, reasonably accurate values for secondary structure content of proteins can be determined from CD data. This is particularly true for mainly alpha helical proteins, likely due to the lack of variance in the geometry of this secondary structure component in different proteins. Accurate secondary structure determinations for many unknown proteins are also possible because the reference databases contain several of the most popular protein folds. Determination of beta sheet-containing proteins is usually less accurate, due to both the lesser intensity of CD signal from this component relative to that from alpha helices and the greater diversity of topologies of such a component found within proteins. Other less common secondary structural components, such as 3₁₀ and PPII helices, also tend to be less accurately determined because of their limited representation within the reference databases.

CD spectroscopy is used to determine the secondary structure content of proteins, and many excellent mathematical approaches have been developed for this procedure. All rely on reference databases to obtain accurate values for this content, so the databases themselves must contain minimal errors; otherwise this accuracy could be compromised. The analyses presented here show that there are problems with the current CD reference databases, both with the CD spectra and with the X-ray structures used. Some of the CD spectra are potentially erroneous representations of the referenced protein and are restricted in the wavelength range they cover, and many of the X-ray structures are not of the highest quality because of limitations in structure validation procedures. The structures are also limited in the range of secondary structure types they represent and in their coverage of secondary structure space. These analyses suggest that there is an urgent need to create a new, more comprehensive CD reference database containing cross-validated CD spectra, collected and cross-checked on a number of different cCD spectrophotometers and SRCD beamlines to ensure machine independence. The spectra should be for proteins whose X-ray structures give broad coverage both of different secondary structure types and of secondary structure space, and whose quality has been assured by the many available validation programs. Additionally, with the number of SRCD facilities increasing worldwide and with improvements in cCD machine optics, any new reference database should extend to lower wavelength limits to enable analyses of these extra data. In summary, the recommendations for the content of a future reference database are as follows: it should contain ~80 proteins; these should have complete X-ray structures (i.e. those with no missing residues) whose quality has been confirmed by programs such as PROCHECK; a broad range of secondary structure and fold types should be represented; the database should be created with SRCD spectra to achieve low wavelength data (to at least 170 nm), with the CD data matched to the protein organism/sequence of the X-ray structure; and these CD spectra should be fully calibrated and validated. Such a database would enhance the quality and accuracy of secondary structure component determination, ensuring that CD spectroscopy remains a very powerful technique.
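The per-protein recommendations above can be read as a checklist. The following is only a minimal sketch of that checklist in Python, not a description of how any existing reference database was built. The `CandidateEntry` record, its field names and the `failed_criteria` function are hypothetical, and the sketch assumes that "at least 170 nm" means that the spectrum extends down to 170 nm or below.

```python
from dataclasses import dataclass


@dataclass
class CandidateEntry:
    """Hypothetical record for one protein considered for a reference database."""
    name: str
    missing_residues: int             # residues absent from the X-ray model
    passed_validation: bool           # outcome of a PROCHECK-style quality check
    min_wavelength_nm: float          # lowest wavelength covered by the CD spectrum
    sequence_matches_structure: bool  # CD sample matches the crystal organism/sequence
    calibrated: bool                  # spectrum has been calibrated and cross-validated


def failed_criteria(entry, low_wavelength_limit_nm=170.0):
    """Return the recommendations that the entry does not meet (empty list if none)."""
    problems = []
    if entry.missing_residues > 0:
        problems.append("incomplete structure")
    if not entry.passed_validation:
        problems.append("structure not validated")
    # Assumption: the spectrum must reach the limit or go below it.
    if entry.min_wavelength_nm > low_wavelength_limit_nm:
        problems.append("spectrum does not reach low wavelength limit")
    if not entry.sequence_matches_structure:
        problems.append("CD sample does not match structure")
    if not entry.calibrated:
        problems.append("spectrum not calibrated")
    return problems


# Invented values, for illustration only.
example = CandidateEntry("example protein", 0, True, 168.0, True, True)
print(failed_criteria(example))
```

The database-level recommendations (~80 proteins, broad coverage of secondary structure and fold types) apply to the whole set rather than to single entries, so they are deliberately left out of this per-entry check.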


    Acknowledgments
 
I thank Prof. B. A. Wallace for many useful discussions. I thank Prof. Jon B. Applequist for the CD data from some of the database sets, and Paul Evans and Dr Christine Slingsby for the SRCD spectral data for γ-crystallin. I also thank Dr Alison Cuff for provision of the single-domain proteins CATH data. This work was supported by a BBSRC grant (B19312).

Conflict of Interest: none declared.

Received on August 2, 2005; revised on September 21, 2005; accepted on September 23, 2005

    REFERENCES
 

    Berman, H.M., et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

    Bernstein, F.C., et al. (1977) The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535–542.

    Bolotina, I.A., et al. (1980a) Determination of the secondary structure of proteins from the circular-dichroism spectra. 1. Protein reference spectra for alpha structure, beta structure and irregular structure. Mol. Biol., 14, 701–709.

    Bolotina, I.A., et al. (1980b) Determination of the secondary structure of proteins from the circular-dichroism spectra. 2. Consideration of the contribution of beta-bends. Mol. Biol., 14, 709–715.

    Brahms, S. and Brahms, J. (1980) Determination of protein secondary structure in solution by vacuum ultraviolet circular dichroism. J. Mol. Biol., 138, 149–178.

    Chang, T.C., et al. (1978) Circular dichroic analysis of protein conformation: inclusion of beta-turns. Anal. Biochem., 91, 13–31.

    Chen, Y.H. and Yang, J.T. (1971) A new approach to the calculation of secondary structures of globular proteins by optical rotatory dispersion and circular dichroism. Biochem. Biophys. Res. Commun., 44, 1285–1291.

    Compton, L.A. and Johnson, W.C., Jr. (1986) Analysis of protein circular dichroism spectra for secondary structure using a simple matrix multiplication. Anal. Biochem., 155, 155–167.

    Evans, P., et al. (2004) The P23T cataract mutation causes loss of solubility of folded gammaD-crystallin. J. Mol. Biol., 343, 435–444.

    Hennessey, J.P., Jr. and Johnson, W.C., Jr. (1981) Information content in the circular dichroism of proteins. Biochemistry, 20, 1085–1094.

    Johnson, W.C. (1999) Analyzing protein circular dichroism spectra for accurate secondary structures. Proteins, 35, 307–312.

    Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.

    Laskowski, R.A., et al. (1993) PROCHECK—a program to check the stereochemical quality of protein structures. J. Appl. Cryst., 26, 283–291.

    Lobley, A. and Wallace, B.A. (2001) DICHROWEB: a website for the analysis of protein secondary structure from circular dichroism spectra. Biophysical J., 80, 373a.

    Lobley, A., et al. (2002) DICHROWEB: an interactive website for the analysis of protein secondary structure from circular dichroism spectra. Bioinformatics, 18, 211–212.

    Miles, A.J., et al. (2003) Calibration and standardisation of synchrotron radiation circular dichroism and conventional circular dichroism spectrophotometers. Spectroscopy, 17, 653–661.

    Miles, A.J., et al. (2005) Calibration and standardisation of synchrotron radiation and conventional circular dichroism spectrometers. Part 2: Factors affecting magnitude and wavelength. Spectroscopy, 19, 43–51.

    Orengo, C.A., et al. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.

    Pancoska, P. and Keiderling, T.A. (1991) Systematic comparison of statistical-analyses of electronic and vibrational circular dichroism for secondary structure prediction of selected proteins. Biochemistry, 30, 6885–6895.

    Pancoska, P., et al. (1992) Relationships between secondary structure fractions for globular proteins. Neural network analyses of crystallographic datasets. Biochemistry, 31, 10250–10257.

    Pancoska, P., et al. (1995) Comparison of and limits of accuracy for statistical analyses of vibrational and electronic circular dichroism spectra in terms of correlations to and predictions of protein secondary structure. Protein Sci., 4, 1384–1401.

    Pearl, F.M., et al. (2000) Assigning genomic sequences to CATH. Nucleic Acids Res., 28, 277–282.

    Pearl, F., et al. (2005) The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res., 33, D247–D251.

    Provencher, S.W. and Glöckner, J. (1981) Estimation of globular protein secondary structure from circular dichroism. Biochemistry, 20, 33–37.

    Sreerama, N. and Woody, R.W. (1993) A self-consistent method for the analysis of protein secondary structure from circular dichroism. Anal. Biochem., 209, 32–44.

    Sreerama, N. and Woody, R.W. (2000) Estimation of protein secondary structure from circular dichroism spectra: comparison of CONTIN, SELCON and CDSSTR methods with an expanded reference set. Anal. Biochem., 287, 252–260.

    Sreerama, N., et al. (1999) Estimation of the number of helical and strand segments in proteins using CD spectroscopy. Protein Sci., 8, 370–380.

    Sreerama, N., et al. (2000) Estimation of protein secondary structure from circular dichroism spectra: inclusion of denatured proteins with native proteins in the analysis. Anal. Biochem., 287, 243–251.

    Sutherland, J.C., et al. (1980) Versatile spectrometer for experiments using synchrotron radiation at wavelengths greater than 100 nm. Nucl. Instrum. Methods, 172, 195–199.

    Wallace, B.A. and Janes, R.W. (2001) Synchrotron radiation circular dichroism spectroscopy of proteins: secondary structure, fold recognition and structural genomics. Curr. Opin. Chem. Biol., 5, 567–571.

    Wallace, B.A. and Teeters, C.L. (1987) Differential absorption flattening optical effects are significant in the circular-dichroism spectra of large membrane-fragments. Biochemistry, 26, 65–70.

    Wallace, B.A., et al. (2005) Proteins, in press.

    Whitmore, L. and Wallace, B.A. (2004) DICHROWEB: an online server for protein secondary structure analyses from circular dichroism spectroscopic data. Nucleic Acids Res., 32, W668–W673.





