Bioinformatics Advance Access originally published online on August 5, 2004
Bioinformatics 2005 21(1):116-121; doi:10.1093/bioinformatics/bth462
Bioinformatics vol. 21 issue 1 © Oxford University Press 2005; all rights reserved.
PIML: the Pathogen Information Markup Language
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, 1880 Pratt Drive, Blacksburg, VA 24061-0477, USA
*To whom correspondence should be addressed.
ABSTRACT
Motivation: A vast amount of information about human, animal and plant pathogens has been acquired, stored and displayed in varied formats through different resources, both electronically and otherwise. However, there is no community standard format for organizing this information or agreement on machine-readable format(s) for data exchange, thereby hampering interoperation efforts across information systems harboring such infectious disease data.
Results: The Pathogen Information Markup Language (PIML) is a free, open, XML-based format for representing pathogen information. XSLT-based visual presentations of valid PIML documents were developed and can be accessed through the PathInfo website or as part of the interoperable web services federation known as ToolBus/PathPort. Currently, detailed PIML documents are available for 21 pathogens deemed of high priority with regard to public health and national biological defense. A dynamic query system allows simple queries as well as comparisons among these pathogens. Efforts continue to engage other groups in supporting PIML and to develop more PIML documents.
Availability: All the PIML-related information is accessible from http://www.vbi.vt.edu/pathport/pathinfo/
Contact: pathinfo@vbi.vt.edu
1 INTRODUCTION
Diverse groups have developed and provided many public information resources related to infectious diseases and their pathogenic agents. Examples include the Centers for Disease Control and Prevention (CDC) (http://www.bt.cdc.gov/Agent/Agentlist.asp), the National Institutes of Health (NIH) (http://sis.nlm.nih.gov/Tox/biologicalwarfare.htm), the Infectious Diseases Society of America (IDSA) (http://www.idsociety.org/bt/toc.htm), the Center for Civilian Biodefense Strategies (CCBS) (http://www.hopkins-biodefense.org) and the World Health Organization (WHO) (Communicable Disease Surveillance and Response, http://www.who.int/csr/disease/en). In addition, various journal and book publications are available electronically, such as Harrison's Online (http://www.accessmedicine.com/amed/public/amed_news/news_article/281.html). Ferguson et al. (2003) provided an excellent review of the content and focus of over 40 of these sites, using a consistent framework to categorize the available information. The sites ultimately chosen for review were shown to contain high-quality pathogen information on topics such as general pathogen overview, laboratory work, infection control, epidemiology and event preparedness. These are excellent resources that could be leveraged further and accessed by more diverse communities. One way to expand their utility would be to create a standard, machine-readable format for data exchange, thereby allowing the resources to interoperate. Many of these electronic resources are currently available in HTML- or PDF-based formats, which are good standards for viewing and navigation but do not enable automated, machine-based data reformatting, transfer, query, standardization or integration.
The eXtensible Markup Language (XML) uses simple text-based markup to describe the structure and semantics of ordered data, thereby allowing for standardized data formatting and interchange at the procedural level (Bray et al., 1998, http://www.w3.org/TR/1998/REC-xml-19980210.html). A Document Type Definition (DTD) or XML Schema describes the structure of the markup elements of an XML document (Achard et al., 2001). The eXtensible Stylesheet Language for Transformations (XSLT) provides a flexible, powerful language for transforming between XML schemas and transforming XML documents into many other formats (e.g. HTML, PDF and MS Word) (Clark, 1999, http://www.w3.org/TR/xslt). XML offers an open framework for defining standard data structures and specifications. XML provides an industry-wide standard for data exchange, built on top of the Internet standard network transport protocols (TCP/IP, HTTP, SMTP, FTP, etc.) making it the meta-language standard of choice for transport mechanism (Achard et al., 2001). XML is also an important part of the web services enabling standards, serving as the meta-language for web services federations such as ToolBus/PathPort (TB/PP) (Eckart and Sobral, 2003). XML usage within bioinformatics is growing rapidly, and biological research groups and communities have created different XML-based biological markup languages. Some prominent examples include the Systems Biology Markup Language (SBML) (Hucka et al., 2003), the Bioinformatic Sequence Markup Language (BSML) (http://www.bsml.org/), the Protein Markup Language (Hanisch et al., 2002), the Biopolymer Markup Language (BioML) (Fenyo, 1999) and the Taxonomic Markup Language (Gilmour, 2000). These markup languages focus on different biological data types.
With the goal of developing a system that allows interoperation of distributed electronic resources providing pathogen information, we developed an XML-based markup language, Pathogen Information Markup Language (PIML). PIML was structured to handle information of interest to a wide variety of users and specify topics including pathogen taxonomy and life cycle, genome sequence information, epidemiology, host-pathogen interaction, disease prevention and treatment, laboratory isolation, and diagnostic methods (Fig. 1). PIML allows for a portable, system-independent, machine-parseable and human-readable representation of general information for any pathogen. To date, 21 pathogens have been described using extensible PIML documents and more are forthcoming. This body of pathogen information can be queried and displayed graphically from a web service (http://staff.vbi.vt.edu/pathport/services/wsdls/piml.wsdl) in TB/PP (Eckart and Sobral, 2003), which provides a custom graphical visualization module for PIML documents, and interoperability with other types of infectious disease data (e.g. genomic). A web-based query and display system was also developed to query the complete pathogen information or a specific topic across multiple pathogens (http://www.vbi.vt.edu/pathport/pathinfo/query.html) (Fig. 2).
2 PIML DESIGN
The PIML DTD and Schema support information on the following topics of one specific pathogen: organism, epidemiology, infected hosts, laboratory work, references and curation (Fig. 1). The PIML DTD and a detailed description of all elements and attributes can be accessed at http://www.vbi.vt.edu/pathport/xml/pathinfo/pathinfo.dtd.html. The corresponding PIML Schema can be downloaded from http://staff.vbi.vt.edu/pathport/pathinfo/piml-docs/piml.xsd. The major PIML elements are summarized below:
Organism. Specifies information related to the pathogen, including taxonomy, life cycle and summarized genome information (Fig. 1). The taxonomy element specifies the pathogen species that can also include associated variants. A National Center for Biotechnology Information (NCBI) taxonomy ID (Wheeler et al., 2003) is also assigned to species and variants via the optional genbank-taxon-id attribute. When available, this external database identifier can be used to access the detailed taxonomy hierarchy through the NCBI taxonomy database (Wheeler et al., 2003). Each species or variant is also required to have an ID unique within the document in order to be referenced elsewhere in the PIML document. The optional lifecycle element defines different stages of the life cycle and the progression between stages. The genome-summary element summarizes the genome information including the chromosomal or viral genome, plasmid sequences and mitochondrial DNA sequences. The GenBank accession number (Wheeler et al., 2003) of a specific genome or sequence is recorded if available using the optional genbank-access-number attribute.
Epidemiology. Specifies epidemic outbreak locations, pathogen transmission mechanisms, environmental reservoirs and information related to intentional release (Fig. 1). In case the pathogen is intentionally released during a bioterrorism or biowarfare event, the release element provides emergency contact information, possible delivery mechanisms, containment methods and a general description of event preparedness.
Host. Specifies infected host information, including host taxonomy, infection mechanisms, prevention measures, disease information and model systems to study the host (Fig. 1). The host element describes the information for a specific host (or hosts) that carries the pathogen. The structure of host taxonomy is the same as the taxonomy under pathogen organism. The infection and prevention elements describe the mechanism(s) by which the pathogen infects the host and the methods used to prevent infection. The disease element contains general information about a specific disease (Fig. 1). The pathogenesis element describes disease development and is separated from infection because a pathogen variant (e.g. an attenuated vaccine strain) may still infect the host but not cause the specified disease. The diagnosis-summary element provides an overall diagnosis description and is based on the analysis of disease symptoms, laboratory detection and additional features described in other PIML elements. The model-system element defines specific model systems used to study the host-pathogen interactions. A specific model host can be infected naturally or artificially.
Labwork. Specifies laboratory biosafety issues, culturing methods and laboratory diagnostic tests (Fig. 1). The biosafety element defines the biological safety issues including the biosafety level defined by the CDC, recommended precautions and proper disposal procedures for contaminated materials. The laboratory methods of culturing the pathogen are defined by the culturing element. The diagnostic-tests element describes the methods for pathogen diagnosis in a biological laboratory. To provide prompt diagnosis of the pathogen, we provide an enumerated time-to-perform attribute that specifies the time needed to perform the specified test.
References. Specifies the references for all curated data. The references can be from journal publications, books, websites, dissertations and theses. The PubMed ID of a specific journal article is recorded using the optional PMID attribute in order to access the PubMed database (Wheeler et al., 2003). The Uniform Resource Locator (URL) is required for a website reference in order to provide a direct link on the Internet.
Curation. Specifies curator name, date, version, note, revision information and contact information.
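The identifier rules summarized above can be illustrated with a minimal Python sketch. It checks that document-local IDs are unique and builds NCBI taxonomy links from the optional genbank-taxon-id attribute. Only the taxonomy element and the genbank-taxon-id attribute are named in the text; the organism wrapper, the species and variant element names, the id attribute name and the sample values are assumptions made for illustration, and the published DTD remains the authoritative structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: apart from <taxonomy> and genbank-taxon-id, the
# element and attribute names here are illustrative assumptions.
PIML_FRAGMENT = """
<organism>
  <taxonomy>
    <species id="sp1" genbank-taxon-id="1392">Bacillus anthracis</species>
    <variant id="var1">Sterne strain</variant>
  </taxonomy>
</organism>
"""

NCBI_TAXONOMY_URL = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={}"


def duplicate_ids(root):
    """Return IDs that occur more than once in the document (ideally none)."""
    counts = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    return sorted(i for i, n in counts.items() if n > 1)


def taxonomy_links(root):
    """Map each document-local ID to an NCBI taxonomy URL, or None without a taxon ID."""
    links = {}
    for el in root.iter():
        local_id = el.get("id")
        if local_id is None:
            continue
        taxon = el.get("genbank-taxon-id")  # optional attribute
        links[local_id] = NCBI_TAXONOMY_URL.format(taxon) if taxon else None
    return links


root = ET.fromstring(PIML_FRAGMENT)
print(duplicate_ids(root))
print(taxonomy_links(root))
```

A validating parser driven by the DTD would enforce ID uniqueness directly if the attribute is declared with type ID; the explicit check here only makes the rule visible.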
Each statement of fact in a PIML document is associated with its corresponding references for the end-user to review the supporting data and more completely assess a topic of specific interest. For example, the shape element in the PIML DTD is defined as:
<!ELEMENT shape (#PCDATA | ref-info)*>
The characteristic shape of Bacillus anthracis cells is described with detailed references in the PIML document:
<shape><ref-info refs="ref6">Vegetative cells are rod-shaped.</ref-info><ref-info refs="ref21">Endospores are oval-shaped.</ref-info></shape>
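As a rough illustration of how software might consume this mixed-content pattern, the following Python sketch pairs each statement in the shape element with its supporting reference IDs. It uses only the standard library; the assumption that several IDs in one refs attribute would be whitespace-separated is ours and is not stated in the text.

```python
import xml.etree.ElementTree as ET

SHAPE_XML = (
    '<shape>'
    '<ref-info refs="ref6">Vegetative cells are rod-shaped.</ref-info>'
    '<ref-info refs="ref21">Endospores are oval-shaped.</ref-info>'
    '</shape>'
)


def statements_with_refs(element):
    """Yield (statement text, list of reference IDs) for each ref-info descendant."""
    for info in element.iter("ref-info"):
        text = "".join(info.itertext()).strip()
        # Assumption: multiple IDs, if present, are whitespace-separated.
        refs = (info.get("refs") or "").split()
        yield text, refs


shape = ET.fromstring(SHAPE_XML)
for text, refs in statements_with_refs(shape):
    print(f"{text} [{', '.join(refs)}]")
```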
The curation information is also tied to the individual elements by the optional curators attribute, which enables the construction of a fine-grained audit trail. PIML supports externally controlled vocabularies and ontologies, such as UMLS (http://www.nlm.nih.gov/research/umls/), GO (http://www.geneontology.org/) and SNOMED (http://www.snomed.org/). Selected PIML elements include the attribute ontology with the format: ontology="UMLS:xxx, GO:xxx, SNOMED:xxx, otherStd:xxx". Anchoring the definition of terms to external standards improves consistency of semantics and allows cross-data-source integration.
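The ontology attribute format lends itself to simple machine parsing. The sketch below is one possible reading of the format given above, not a parser distributed with PIML: it splits the value on commas and then on the first colon of each item, and the function name parse_ontology is hypothetical.

```python
def parse_ontology(value):
    """Split 'UMLS:xxx, GO:yyy' into {'UMLS': ['xxx'], 'GO': ['yyy']}."""
    terms = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so a term that itself contains
        # a colon stays intact.
        source, sep, term = item.partition(":")
        if not sep:
            raise ValueError(f"missing 'source:term' separator in {item!r}")
        terms.setdefault(source.strip(), []).append(term.strip())
    return terms


print(parse_ontology("UMLS:xxx, GO:xxx, SNOMED:xxx, otherStd:xxx"))
```

Grouping terms by source vocabulary makes it straightforward to look up all annotations from one standard when integrating across data sources.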
Images are often useful for the description of biological phenomena and concepts. Images can be referenced from PIML documents via URLs contained in those documents. The image URLs can be from any reliable website around the world. Where applicable, photographic images and diagrams are incorporated in PathInfo (with the requisite permission from the publisher/owner) to more fully illustrate important points.
3 STORING, QUERYING AND DISPLAYING PATHOGEN INFORMATION
Specific pathogen information can be easily queried from a database storing curated PIML documents. Curated XML-based PIML documents can also be transformed to other formats for user-friendly display and are currently available through our PathInfo website (http://www.vbi.vt.edu/pathport/pathinfo). Access to the information is also available via the PIML web service (http://staff.vbi.vt.edu/pathport/services/wsdls/piml.wsdl) for use by other software (e.g. TB/PP).
PathInfo is a part of the Pathogen Portal (PathPort) project. PathPort aims to combine information about pathogens (and their near relatives) as well as data analysis and visualization tools to expedite biological research on high-priority pathogens. The publicly available TB/PP software package is built around ToolBus, a platform- and domain-independent client-side interconnect that provides easy access to distributed web services (Eckart and Sobral, 2003). The PIML web service in TB/PP stores all the PIML documents in an Apache Xindice XML database (http://xml.apache.org/xindice), and displays information for a specific pathogen upon request by using an XSLT script for dynamic transformation of the XML document into HTML. The web-based PathInfo viewer is similar to the TB/PP PathInfo viewer and is available via a web browser at http://www.vbi.vt.edu/pathport/pathinfo. As shown in Figure 2B, the web-based PathInfo viewer contains a table of clickable contents within a separate left frame while the corresponding information is displayed in the upper right window frame. The bottom right frame displays the same detailed information, and particularly provides reference or other internally linked information once a reference or another internal link is clicked on in the upper right window frame. The actual reference website (e.g. a PubMed abstract website) is hot-linked and spawns a new web browser window when clicked. This allows the user to view the supporting reference information while maintaining their place within the PathInfo document (Fig. 2B). The contents in the three window frames are created separately using three different XSLT scripts that parse and transform a PIML document into HTML. A cron job (Petersen, 2000) runs daily in the background to automatically extract the PIML documents, from the same XML Xindice database used by TB/PP, via the TB/PP PIML web service (Fig. 2A). 
When updated versions of the PIML documents are extracted, the three XSLT scripts will update the HTML views. This mechanism ensures that the web-based viewer uses PIML documents no more than 1 day out of sync with TB/PP while not overburdening the website with unnecessary updates.
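The update step described above can be sketched in Python as follows. The text does not say how an updated document is detected, so the content digest used here is an assumption standing in for whatever check the real job performs (the curation version would be another candidate), and the render callback is a placeholder for the three XSLT transformations. The function names and sample documents are hypothetical.

```python
import hashlib


def digest(xml_text):
    """Fingerprint a PIML document so unchanged documents can be skipped."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def sync(fetched, seen, render):
    """Re-render only documents whose content changed since the last run.

    fetched: {pathogen name: PIML text} from the latest extraction
    seen:    {pathogen name: digest} remembered from the previous run (mutated)
    render:  callable invoked for each new or changed document
    Returns the names that were re-rendered.
    """
    updated = []
    for name, xml_text in sorted(fetched.items()):
        d = digest(xml_text)
        if seen.get(name) != d:
            render(name, xml_text)
            seen[name] = d
            updated.append(name)
    return updated


seen = {}
rendered = []


def render(name, xml_text):
    # Placeholder for the XSLT-to-HTML step.
    rendered.append(name)


# First daily run: everything is new. Second run: one document changed.
print(sync({"B. anthracis": "<piml v='1'/>", "B. melitensis": "<piml v='1'/>"}, seen, render))
print(sync({"B. anthracis": "<piml v='2'/>", "B. melitensis": "<piml v='1'/>"}, seen, render))
```

Skipping unchanged documents is what keeps a daily job cheap: the HTML views are regenerated only for pathogens whose PIML document actually differs from the previous extraction.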
The web-based PathInfo query and display system can also be used to query information on a specific topic for a pathogen or compare a topic among multiple pathogens from our XML database (Fig. 2A and C). The query topic is specifically handled by a corresponding XSLT script that parses the requested pathogen PIML document(s) and transforms the query results into HTML (Fig. 2A). Scientific Latin terms inside the dynamically created HTML are italicized and the final result is available for the user to view from their web browser (Fig. 2C).
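A cross-pathogen topic query of this kind might look roughly like the Python sketch below. The real system uses XSLT scripts, so Python is only a stand-in here. The labwork, biosafety and culturing element names come from the text, while the pathogen root element, the nesting shown and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily abbreviated documents: the <pathogen> root and the
# sample text are assumptions for illustration only.
DOCS = {
    "Bacillus anthracis": "<pathogen><labwork><biosafety>Sample text A.</biosafety></labwork></pathogen>",
    "Brucella melitensis": "<pathogen><labwork><culturing>Sample text B.</culturing></labwork></pathogen>",
}


def compare_topic(docs, topic):
    """Return an HTML table with one row per pathogen for the given element name."""
    rows = []
    for name, xml_text in sorted(docs.items()):
        root = ET.fromstring(xml_text)
        hit = root.find(f".//{topic}")
        text = "".join(hit.itertext()).strip() if hit is not None else "not recorded"
        # Species names are italicized, mirroring the PathInfo query output.
        rows.append(f"<tr><td><i>{escape(name)}</i></td><td>{escape(text)}</td></tr>")
    return "<table>" + "".join(rows) + "</table>"


print(compare_topic(DOCS, "biosafety"))
```

Because every document follows the same element names, the same lookup works for any pathogen and any topic, which is the property that makes side-by-side comparison possible.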
4 CURATION OF PIML DOCUMENTS AND FUTURE PLANS
A team of infectious disease biologists reviewed and collected pathogen data in PIML format from multiple sources, but concentrated on published, refereed material whenever possible to assure maximum validity of the information. Additionally, because images are incorporated into PIML documents wherever useful, written permission is acquired from the authoritative source of an image before it can be downloaded and published in our system. Since most of these data sources are not in standardized interchangeable formats, it is not possible to generate PIML documents automatically from existing data sources. Thus, the current curation process is labor intensive. However, we are also migrating to a distributed curation model by adding other Subject Matter Experts to help curate the pathogens for which they have expertise.
To date, we have created PIML documents for 21 pathogens (Table 1), which are disease agents listed as Category A, B and C priority pathogens (e.g. Bacillus anthracis and Brucella melitensis) by the CDC (http://www.bt.cdc.gov/Agent/Agentlist.asp) and the National Institute of Allergy and Infectious Diseases (NIAID) (http://www.niaid.nih.gov/biodefense/bandc_priority.htm). These pathogens are of great interest to people involved in biological event/threat preparedness and response. Eight of these pathogens are also the research targets of the Middle Atlantic Regional Center of Excellence in Biodefense and Emerging Infectious Diseases project (MARCE, http://marce.vbi.vt.edu). In the ensuing year, an estimated 15 pathogens will be added; these new pathogens will focus more on animal and plant pathogens with the potential for agroterrorism.
The HTML or PDF formats that many websites (Ferguson et al., 2003) use to provide pathogen information do not support standardized machine parsing in automated information systems. PIML offers a data exchange format for pathogen information. If PIML is supported and maintained by the infectious disease community, existing electronic pathogen-related resources could be wrapped to allow updated information to be accessed, thereby leveraging community-wide data through interoperability and reducing the effort spent on manual curation and ad hoc data integration. To make the data curation process more efficient, natural language processing and statistical methods can be explored for semi-automating literature acquisition and curation (Marcus, 1995).
The proposed PIML format offers the opportunity for further evolution and community input. To provide a first approximation of a usable standard, we worked with a number of infectious disease experts, acknowledged below, during the development of PIML. However, we recognize that a broader community must be engaged to ensure that the format meets wide needs and evolves accordingly. We have made the PIML XSLT scripts, DTD and XML Schema freely available on the PathInfo website. We plan to organize meetings and workshops specifically to develop PIML into a standard, and to provide technical support to adopters. We are committed to providing an open development process, transforming the project from a primarily VBI-led effort toward a community-led and community-maintained model. We invite all interested organizations, groups and experts to shape the evolution of this pathogen information exchange format.
ACKNOWLEDGMENTS
We are grateful to Herman Formadi, Eric Nordberg and Ronald Kenyon for their feedback and support during the evolution of PIML. Thanks to Balaprasuna Chennupati and Dr Tian Xue for their technical help. We are grateful to our collaborators at the US Army Research, Development & Engineering Command (RDECOM) at Edgewood, especially Dr Jay Valdes and Dr Jennifer Sekowski, for their suggestions and support. We thank Dr Karen Thorn at the National Library of Medicine for providing helpful suggestions on incorporating UMLS ontology in PIML and Dr Jeff Wilcke at the Virginia-Maryland Regional College of Veterinary Medicine for providing knowledge regarding SNOMED. We also thank Dr Brett Tyler and Lachelle Waller at the Virginia Bioinformatics Institute for their feedback on the pathogen curation process after adopting the PIML format. We are grateful to the researchers in the MARCE collaboration (e.g. University of Maryland at Baltimore, Johns Hopkins University, University of Virginia and Uniformed Services University of the Health Sciences) for their interest and feedback. We are also grateful to Dr Susan Baker at Loyola University Chicago Stritch School of Medicine for her suggestions. Contributions from other PathPort scientists, many students and reviewers are also gratefully acknowledged. Development of PIML is part of the PathPort Project supported by Department of Defense grant no. DAAD 13-02-C-0018 to B.W.S.S.
Received on January 29, 2004; revised on July 30, 2004; accepted on July 30, 2004
REFERENCES
Achard, F., Vaysseix, G., Barillot, E. (2001) XML, bioinformatics and data integration. Bioinformatics, 17, 115-125.
Bray, T., Paoli, J., Sperberg-McQueen, C.M. (1998) Extensible Markup Language (XML) 1.0.
Clark, J. (1999) XSL Transformations (XSLT) Version 1.0. W3C Proposed Recommendation 8 October 1999.
Eckart, J.D. and Sobral, B.W. (2003) A life scientist's gateway to distributed data management and computing: the PathPort/ToolBus framework. OMICS, 7, 79-88.
Fenyo, D. (1999) The Biopolymer Markup Language. Bioinformatics, 15, 339-340.
Ferguson, N.E., Steele, L., Crawford, C.Y., Huebner, N.L., Fonseka, J.C., Bonander, J.C., Kuehnert, M.J. (2003) Bioterrorism web site resources for infectious disease clinicians and epidemiologists. Clin. Infect. Dis., 36, 1458-1473.
Gilmour, R. (2000) Taxonomic markup language: applying XML to systematic data. Bioinformatics, 16, 406-407.
Hanisch, D., Zimmer, R., Lengauer, T. (2002) ProML - the protein markup language for specification of protein sequences, structures and families. In Silico Biol., 2, 313-324.
Hucka, M., Finney, A., Sauro, H.M., Bolouri, H., Doyle, J.C., Kitano, H., Arkin, A.P., Bornstein, B.J., Bray, D., Cornish-Bowden, A., et al. (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19, 524-531.
Marcus, M. (1995) New trends in natural language processing: statistical natural language processing. Proc. Natl Acad. Sci. USA, 92, 10052-10059.
Petersen, R. (2000) Linux: The Complete Reference, 4th edn. McGraw-Hill Osborne Media, Emeryville, CA, pp. 718-719.
Wheeler, D.L., Church, D.M., Federhen, S., Lash, A.E., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., Tatusova, T.A., et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28-33.