Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(7):988-992; doi:10.1093/bioinformatics/bti082
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

PDBML: the representation of archival macromolecular structure data in XML

John Westbrook 1,*, Nobutoshi Ito 2, Haruki Nakamura 3, Kim Henrick 4 and Helen M. Berman 1

1Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey 610 Taylor Road, Piscataway, NJ 08854, USA
2Protein Data Bank Japan (PDBj), School of Medical Science, Tokyo Medical and Dental University 1-5-45 Yushima, Bunkyo-ku, Tokyo 113-8510, Japan
3Protein Data Bank Japan (PDBj), Institute for Protein Research, Osaka University 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan
4EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD, UK

*To whom correspondence should be addressed.


    Abstract

Summary: The Protein Data Bank (PDB) has recently released versions of the PDB Exchange dictionary and the PDB archival data files in XML format collectively named PDBML. The automated generation of these XML files is driven by the data dictionary infrastructure in use at the PDB. The correspondences between the PDB dictionary and the XML schema metadata are described as well as the XML representations of PDB dictionaries and data files.

Availability: The current software-translated XML schema file is located at http://deposit.pdb.org/pdbML/pdbx-v1.000.xsd, and on the PDB mmCIF resource page at http://deposit.pdb.org/mmcif/. PDBML files are stored on the PDB beta ftp site at ftp://beta.rcsb.org/pub/pdb/uniformity/data/XML.

Contact: jwest{at}rcsb.rutgers.edu

The Protein Data Bank (PDB) (Bernstein et al., 1977; Berman et al., 2000) is the single worldwide repository for macromolecular structure data. For more than 30 years (Bernstein et al., 1977), the PDB has used a column-oriented data format to store archival entries. This format resembles many other data formats constrained by the limitations of paper punch card technology. Examples of the data format are shown in Figure 1.



Fig. 1 Excerpts of records from PDB data files. (a) Structured PDB records describing crystallographic cell constants (CRYST1), transformation matrices between orthogonal and fractional coordinates (ORIGX and SCALE) and the atomic coordinates (ATOM). (b) Unstructured PDB records describing the details of crystallographic refinement used in PDB data files before 1996. (c) Semi-structured PDB records describing crystallographic refinement used in PDB data files after 1996.

 
The representation of coordinate, sequence, secondary structure and citation data in the PDB has remained remarkably stable since the original format definition in 1972. The data records in the PDB format are prefixed with a record tag (e.g. CRYST1, ATOM) followed by individual items of data (Figure 1a). The specifications for the records in this data format are described informally in the Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description (Callaway et al., 1996). The description of the experimental details of structure determination has been encoded largely in the form of remark records. Although these records have some internal structure, their organization has changed over time. For example, in Figure 1b the details of refinement are presented as unstructured text, and in Figure 1c these details are presented as semi-structured remark records.
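As an illustration of what the column-oriented format demands of a reader, the following minimal Python sketch extracts a few items from a single ATOM record by fixed column position. The column ranges are those of the published PDB format description; the function name, the Atom type and the record values are invented for this example and are not taken from any PDB software.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Parse one fixed-column ATOM record (hypothetical helper)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    line = line.ljust(80)  # pad short lines so every slice is in range
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Illustrative record assembled field by field; the values are invented.
example = (
    "ATOM  " + "    1" + " " + " N  " + " " + "MET" + " " + "A"
    + "   1" + " " + "   " + "  27.340" + "  24.430" + "   2.614"
)
atom = parse_atom_record(example)
print(atom)
```

Because the meaning of each value depends only on its column position, any change to a record layout requires a matching change in every parser, which is one motivation for the tagged representations discussed below.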

The growing interest in database development and electronic publication in the late 1980s created a need for a more structured representation of PDB data. In 1990, the International Union of Crystallography (IUCr) commissioned a working group (see http://ndbserver.rutgers.edu/mmcif/background/index.html; Fitzgerald et al., 2004) to develop macromolecular extensions to the data representation used to describe small molecule structures and crystallographic structure determination, called the Crystallographic Information File (CIF) (Hall et al., 1991). The CIF representation had been designed and deployed by the IUCr to support electronic publication of small molecule crystal structures. The efforts of the working group and many community experts led to the development of the macromolecular Crystallographic Information Framework (mmCIF) dictionary. The first version of this data dictionary, released in 1996, contained 1700 data definitions (Bourne et al., 1997). The content of the mmCIF dictionary, a superset of PDB crystallographic content, included detailed definitions describing macromolecular structure and the current state of the macromolecular crystallographic experiment.

In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) assumed the management of the PDB. RCSB adopted the mmCIF data dictionary as the foundation of its data processing and data management infrastructure. The members of the worldwide PDB (wwPDB) (Berman et al., 2003), which include the RCSB, the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI) and the PDBj at Osaka University, have collaborated to extend the mmCIF dictionary to include all of the data managed and distributed by the PDB. These extensions include data definitions describing internal bookkeeping, non-crystallographic structure determination methods (e.g. NMR and cryo-electron microscopy), greater detail in experimental crystallography and the details of protein production. These extensions are collected into the PDB Exchange data dictionary (Westbrook et al., 2004a). This data dictionary provides the foundation for the generation of XML schema (World Wide Web Consortium, 2001a,b,c) and XML data files described in the remainder of this article.


    XML SCHEMA FOR PDB DATA, PDBML
The representation of PDB data in XML builds from the content of the PDB Exchange dictionary, both for assignment of data item names and for defining data organization. Although presented in very different syntaxes, the PDB Exchange and XML representations use the same logical data organization. A side effect of maintaining a logical correspondence with the PDB Exchange representation is that the PDBML files lack the hierarchical structure characteristic of many XML data applications. However, preserving the logical data model of the PDB Exchange dictionary has three important advantages. First, the semantics of PDB data are completely preserved across the two formats. Second, the translation of the PDB Exchange dictionary and PDB Exchange data files to XML is greatly simplified. Third, the straightforward mapping of PDB data to relational database systems is retained.

The correspondences between the metadata attributes used in the PDB Exchange dictionary (Westbrook and Bourne, 2000; Westbrook et al., 2004b) and those of XML schema are summarized in Table 1. The top level of scope in the PDB Exchange dictionary or data file is the data block. The data block encloses complete data dictionaries or data entries. The dictionary data block is mapped to the standard top-level XML schema element, and the data file data block is mapped to a datablock element. The schema and datablock elements provide namespace definitions, linkages to the supporting XML schema definition documents and linkages to the location of the current supporting schema.


Table 1 Summary of the correspondences between PDB Exchange data dictionary and XML schema metadata

 
Category or table definitions in the Exchange dictionary are described as XML complexTypes. The category definition and examples are mapped to XML annotation and documentation elements. The data items within the category are defined as an unordered sequence of XML elements named according to the attribute portion of their dictionary equivalents. The special data items that form the primary key for the category are defined as XML attributes.
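The category-level mapping described above can be sketched in a few lines of Python. This is an illustration of the general shape only, not the RCSB translation software: the function name, the Type suffix, the use of xsd:all for the unordered items and the xsd:string typing of every item are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD)


def q(tag: str) -> str:
    """Return the namespace-qualified name of an XML schema element."""
    return f"{{{XSD}}}{tag}"


def category_to_complex_type(category, description, items, keys):
    """Build a complexType for one dictionary category (hypothetical helper)."""
    ctype = ET.Element(q("complexType"), {"name": f"{category}Type"})
    # The category definition becomes annotation/documentation.
    annotation = ET.SubElement(ctype, q("annotation"))
    ET.SubElement(annotation, q("documentation")).text = description
    # Non-key items become an unordered group of elements.
    group = ET.SubElement(ctype, q("all"))
    for item in items:
        if item not in keys:
            ET.SubElement(group, q("element"),
                          {"name": item, "type": "xsd:string", "minOccurs": "0"})
    # Key items become required XML attributes.
    for key in keys:
        ET.SubElement(ctype, q("attribute"),
                      {"name": key, "type": "xsd:string", "use": "required"})
    return ctype


ctype = category_to_complex_type(
    "entity_poly",
    "Data items describing polymer features.",  # placeholder description
    ["entity_id", "type", "nstd_linkage", "nstd_monomer"],
    ["entity_id"],
)
print(ET.tostring(ctype, encoding="unicode"))
```

Applied to entity_poly, the sketch yields a type whose key item entity_id is an attribute and whose remaining items are child elements, which matches the row layout described for the data files later in this article.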

Individual data items have a definition and an optional set of examples. The item-level definition and examples are mapped to XML annotation and documentation elements. Parent–child relationships between data items in the Exchange dictionary are represented as XML key and keyref elements. All parent data items are identified as named XML keys, and their associated children are identified as named XML keyrefs. Primitive data types in the Exchange dictionary are described as XML simpleTypes. Allowed ranges are represented as restriction elements within simpleTypes. Complicated boundary conditions are represented as unions of simpleTypes containing restriction elements. Controlled vocabularies and allowed values are represented as simpleTypes with restrictions including enumeration elements. Where physical units of measurement are included in a definition in the absence of any other range restrictions, this information is mapped to an XML simpleContent element containing a fixed attribute element representing the measurement units. There are currently no mappings for the item-level dictionary attributes describing item-level interdependency, exclusivity or subcategory membership.
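Two of the item-level mappings above, allowed ranges and controlled vocabularies, can be sketched as follows. The helper names, the type names and the example range and vocabulary are hypothetical and chosen only to show the structure of the resulting simpleTypes; they are not copied from the PDBML schema.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD)


def q(tag: str) -> str:
    """Return the namespace-qualified name of an XML schema element."""
    return f"{{{XSD}}}{tag}"


def ranged_type(name, base, minimum=None, maximum=None):
    """simpleType whose restriction carries inclusive range bounds."""
    stype = ET.Element(q("simpleType"), {"name": name})
    restriction = ET.SubElement(stype, q("restriction"), {"base": base})
    if minimum is not None:
        ET.SubElement(restriction, q("minInclusive"), {"value": str(minimum)})
    if maximum is not None:
        ET.SubElement(restriction, q("maxInclusive"), {"value": str(maximum)})
    return stype


def enumerated_type(name, values):
    """simpleType whose restriction lists the allowed values."""
    stype = ET.Element(q("simpleType"), {"name": name})
    restriction = ET.SubElement(stype, q("restriction"), {"base": "xsd:string"})
    for value in values:
        ET.SubElement(restriction, q("enumeration"), {"value": value})
    return stype


# Illustrative inputs: a fractional quantity and a small vocabulary.
occupancy = ranged_type("occupancyType", "xsd:decimal", 0, 1)
polymer = enumerated_type("polymerType", ["polypeptide(L)", "polydeoxyribonucleotide"])
print(ET.tostring(occupancy, encoding="unicode"))
print(ET.tostring(polymer, encoding="unicode"))
```

A validating XML parser given such types would reject out-of-range or unlisted values, in the same way that the dictionary constrains the corresponding items in a PDB Exchange data file.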

The correspondences between the PDB dictionary and XML schema metadata described in this section make the automatic translation of the PDB dictionary to XML schema possible. The current software-translated XML schema file is located at http://deposit.pdb.org/pdbML/pdbx-v1.000.xsd, and on the PDB mmCIF resource page at http://deposit.pdb.org/mmcif/.


    PDBML DATA FILES
The PDBML data files follow the same logical organization as their PDB Exchange data file counterparts. Figure 2 provides an abbreviated example comparing the presentation of a category describing polymer features in the two syntaxes. In Figure 2a, a single row of the entity_poly data category is illustrated within a data block named EXAMPLE. The corresponding XML representation of this information is shown in Figure 2b. Here the root-level enclosing XML datablock element identifies the namespace and the associated schema files. The entity_poly data category is enclosed by an XML entity_polyCategory element. Each row of the category is defined within an XML entity_poly element where the category key, entity_id, is included as an XML attribute. The remaining data items in the row are represented as XML elements.



Fig. 2 Examples of PDB Exchange data and PDBML data representations. (a) PDB Exchange data file example with a single category describing some of the features of a polymer molecule. (b) The corresponding example of the polymer description in a PDBML data file.
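The organization described for Figure 2b can be read back with any XML parser. The following Python sketch parses a small hand-written fragment that follows that description (a datablock root, an entity_polyCategory element, and entity_poly rows keyed by an entity_id attribute). The fragment is not a reproduction of Figure 2b: the PDBx prefix, the namespace URI, the datablockName attribute and the data values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; real files declare their own.
NS = {"PDBx": "http://deposit.pdb.org/pdbML/pdbx-v1.000.xsd"}

PDBML = """
<PDBx:datablock datablockName="EXAMPLE"
    xmlns:PDBx="http://deposit.pdb.org/pdbML/pdbx-v1.000.xsd">
  <PDBx:entity_polyCategory>
    <PDBx:entity_poly entity_id="1">
      <PDBx:type>polypeptide(L)</PDBx:type>
      <PDBx:nstd_linkage>no</PDBx:nstd_linkage>
    </PDBx:entity_poly>
  </PDBx:entity_polyCategory>
</PDBx:datablock>
"""


def read_category(root, category):
    """Return the rows of one category as a list of dictionaries."""
    rows = []
    for row in root.findall(f"PDBx:{category}Category/PDBx:{category}", NS):
        record = dict(row.attrib)  # key items are stored as attributes
        for child in row:          # remaining items are child elements
            record[child.tag.split("}", 1)[1]] = child.text
        rows.append(record)
    return rows


root = ET.fromstring(PDBML.strip())
rows = read_category(root, "entity_poly")
print(root.get("datablockName"), rows)
```

Each row becomes a flat mapping from item name to value, which reflects the table-like data model preserved from the PDB Exchange dictionary and is the form most easily loaded into a relational table.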

 
The XML organization illustrated in Figure 2b is repeated for each data category in the data file. Because of its size the atom_site category is also represented in an alternative form. Examples of the fully marked-up atom record and the simplified alternative are shown in Figure 3. The alternative representation of the atom_site category in Figure 3b simplifies the fully marked-up style in Figure 3a by presenting the data items within the atom_site category in a white-space delimited string. The current schema fragment describing the alternative atom_site representation is located at http://deposit.pdb.org/pdbML/pdbx-v1.000-alt.xsd.



Fig. 3 Examples of PDBML atom records. (a) Example of a fully marked-up PDBML atom record. The content of this record is equivalent to the content of the PDB Exchange data file. Empty data records are not translated to the XML data file. (b) Example of a simplified PDBML atom record. The information in this record is also equivalent to the PDB Exchange data file; however, it is formatted as a white-space delimited string.
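Reading the simplified atom record amounts to splitting the string on white space and pairing the values with an agreed list of item names. The sketch below shows the idea in Python. The item names are genuine atom_site items, but their number and order here are assumptions for illustration; the actual order is fixed by the alternative schema fragment, and the record values are invented.

```python
# Assumed item order for the example; consult the alternative schema
# fragment for the order used in real files.
FIELDS = (
    "id", "type_symbol", "label_atom_id", "label_comp_id",
    "label_asym_id", "label_seq_id", "Cartn_x", "Cartn_y", "Cartn_z",
)


def parse_alt_atom_record(text, fields=FIELDS):
    """Split one white-space delimited atom record into named items."""
    values = text.split()
    if len(values) != len(fields):
        raise ValueError(
            f"expected {len(fields)} values, found {len(values)}"
        )
    return dict(zip(fields, values))


record = parse_alt_atom_record("1 N N MET A 1 27.340 24.430 2.614")
print(record["label_comp_id"], float(record["Cartn_x"]))
```

The compact form trades self-description for size: the values carry no tags of their own, so a reader must obtain the item order from the schema rather than from the record.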

 
PDBML files are stored on the PDB beta ftp site at ftp://beta.rcsb.org/pub/pdb/uniformity/data/XML. The files are updated during each weekly PDB update. These files are currently under beta test. Comments and data issues related to these files may be reported at http://pdb-forum.rutgers.edu/. Three XML data files are produced from each PDB Exchange data file. One XML file contains the fully marked-up translation of the PDB Exchange data file. A second XML file contains the full PDB Exchange data file content omitting coordinate data. A third XML file contains only the simplified representation of the coordinate data in which each atom record is marked up as a single XML string.


    SUPPORTING SOFTWARE TOOLS
The XML schema and data files described in this article are produced by software translation of the PDB Exchange dictionary and data files, respectively. The software tools that RCSB has developed to automate the production of XML schema and dictionaries can be downloaded from http://deposit.pdb.org/mmcif/MMCIF-XML-UTIL/. The molecular graphics viewer PDBjViewer (Kinoshita and Nakamura, 2004), developed by PDBj, can parse the current PDBML files to display macromolecular structures (http://www.pdbj.org/PDBjViewer/). These tools are available in full source under an Open Source software license. The XML-based Protein Structure Search Service (xPSSS) is a browser with an XPath-SOAP service, based on the PDBML files held in a native XML database at PDBj (http://www.pdbj.org/xpsss/).


    Acknowledgments
 
The authors thank Ms Kaori Kobayashi, Mr Hisashi Sakamoto, Ms Reiko Yamashita, Dr Daron Standley and Dr Arno Paehler at PDBj for their help in developing the schema and validation system for PDBML. The RCSB PDB is operated by Rutgers, The State University of New Jersey; The San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology, three members of the Research Collaboratory for Structural Bioinformatics (RCSB). The work reported in this paper has been supported by grants from the NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB and NINDS. The MSD-EBI is supported by funds from the Wellcome Trust, the European Union (TEMBLOR, NMRQUAL, SPINE, AUTOSTRUCT, and IIMS awards), CCP4, the Biotechnology and Biological Sciences Research Council (UK), the Medical Research Council (UK), and the European Molecular Biology Laboratory. The PDBj is supported by a grant-in-aid from the Institute for Bioinformatics Research and Development of the Japan Science and Technology Agency (BIRD-JST), and the Ministry of Education, Culture, Sports, Science and Technology (MEXT).

Received on April 14, 2004; revised on May 10, 2004; accepted on October 8, 2004

    REFERENCES

    Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

    Berman, H.M., Henrick, K., Nakamura, H. (2003) Announcing the worldwide Protein Data Bank. Nat. Struct. Biol., 10, 980.

    Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F., Jr., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M. (1977) Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535–542.

    Bourne, P.E., Berman, H.M., Watenpaugh, K., Westbrook, J.D., Fitzgerald, P.M.D. (1997) The macromolecular Crystallographic Information File (mmCIF). Meth. Enzymol., 277, 571–590.

    Callaway, J., Cummings, M., Deroski, B., Esposito, P., Forman, A., Langdon, P., Libeson, M., McCarthy, J., Sikora, J., Xue, D., et al. (1996) Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description. Brookhaven National Laboratory.

    Fitzgerald, P.M.D., Westbrook, J.D., Bourne, P.E., McMahon, B., Watenpaugh, K.D., Berman, H.M. (2004) Classification and use of macromolecular data. In Hall, S.R. and McMahon, B. (Eds.). International Tables for Crystallography, vol. G. Kluwer Academic Publishers, Dordrecht (in press).

    Hall, S.R., Allen, A.H., Brown, I.D. (1991) The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Crystallogr. A, 47, 655–685.

    Kinoshita, K. and Nakamura, H. (2004) eF-site and PDBjViewer: database and viewer for protein functional sites. Bioinformatics, 20, 1329–1330.

    Westbrook, J. and Bourne, P.E. (2000) STAR/mmCIF: an extensive ontology for macromolecular structure and beyond. Bioinformatics, 16, 159–168.

    Westbrook, J., Henrick, K., Ulrich, E.L., Berman, H.M. (2004a) The Protein Data Bank exchange dictionary. In Hall, S.R. and McMahon, B. (Eds.). International Tables for Crystallography. Kluwer Academic Publishers, Dordrecht (in press).

    Westbrook, J.D., Berman, H.M., Hall, S.R. (2004b) Specification of a relational Dictionary Definition Language (DDL2). In Hall, S.R. and McMahon, B. (Eds.). International Tables for Crystallography. Kluwer Academic Publishers, Dordrecht (in press).

    World Wide Web Consortium (2001a) XML Schema Part 0: Primer. W3C Recommendation. http://www.w3.org/TR/2001/REC-xmlschema-2000-20010502/.

    World Wide Web Consortium (2001b) XML Schema Part 1: Structures. W3C Recommendation. http://www.w3.org/TR/2001/REC-xmlschema-2001-20010502/.

    World Wide Web Consortium (2001c) XML Schema Part 2: Datatypes. W3C Recommendation. http://www.w3.org/TR/xmlschema-2/.

