Bioinformatics Advance Access originally published online on June 9, 2006
Bioinformatics 2006 22(15):1910-1916; doi:10.1093/bioinformatics/btl272

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid

Joel Saltz 1,*, Scott Oster 1, Shannon Hastings 1, Stephen Langella 1, Tahsin Kurc 1, William Sanchez 2, Manav Kher 2, Arumani Manisundaram 3, Krishnakant Shanbhag 4 and Peter Covitz 4

1 Department of Biomedical Informatics, Ohio State University 3184 Graves Hall, 333 West 10th Avenue, Columbus, OH 43210, USA
2 Science Applications International Corporation Annapolis, MD, USA
3 Booz Allen Hamilton, Inc. Rockville, MD, USA
4 National Cancer Institute Center for Bioinformatics Rockville, MD, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 3.1 Managing data types,...
 3.2 Analytical services
 3.3 Data services
 3.4 Service advertisement and...
 3.5 Security
 4 DISCUSSION
 REFERENCES
 

Motivation: The complexity of cancer is prompting researchers to find new ways to synthesize information from diverse data sources and to carry out coordinated research efforts that span multiple institutions. There is a need for standard applications, common data models, and software infrastructure to enable more efficient access to and sharing of distributed computational resources in cancer research. To address this need the National Cancer Institute (NCI) has initiated a national-scale effort, called the cancer Biomedical Informatics Grid (caBIG™), to develop a federation of interoperable research information systems.

Results: At the heart of the caBIG approach to federated interoperability is a Grid middleware infrastructure, called caGrid. In this paper we describe the caGrid framework and its current implementation, caGrid version 0.5. caGrid is a model-driven and service-oriented architecture that synthesizes and extends a number of technologies to provide a standardized framework for the advertising, discovery, and invocation of data and analytical resources. We expect caGrid to greatly facilitate the launch and ongoing management of coordinated cancer research studies involving multiple institutions, to provide the ability to manage and securely share information and analytic resources, and to spur a new generation of research applications that empower researchers to take a more integrative, trans-domain approach to data mining and analysis.

Availability: The caGrid version 0.5 release can be downloaded from https://cabig.nci.nih.gov/workspaces/Architecture/caGrid/. The operational test bed Grid can be accessed through the client included in the release, or through the caGrid-browser web application http://cagrid-browser.nci.nih.gov.

Contact: joel.saltz{at}osumc.edu


    1 INTRODUCTION
 
The need for efficient synthesis of information from multiple sources pervades all steps of biomedical research. Basic and clinical research has increasingly become dependent on advanced information technologies for management, exchange and analysis of diverse biomedical data. Although a wealth of information is collected by the cancer research community, any one given researcher is faced with challenges in discovering, extracting and analyzing the information relevant to his/her research. Datasets hosted at different sites may be represented in a variety of formats. Even datasets corresponding to the same biological entity may have heterogeneous representations with different database schemas, attribute names and data values. Similarly, analysis programs at different sites may have different input and output formats and different interfaces.

Recognizing the need for sharing of data and analysis tools and the lack of advanced software systems to address this need, the National Cancer Institute (NCI) launched the national-scale cancer Biomedical Informatics Grid (caBIG™) program in 2004 (http://cabig.nci.nih.gov). The overarching goal of caBIG is to create a network of cancer centers and research laboratories across the country in order to better leverage their combined strengths and expertise in cancer research. To accomplish this goal, the caBIG community is developing standards, policies, guidelines, common applications, and open-source tools and middleware infrastructure to enable more effective sharing of data and research tools among scientists and organizations in a multi-institutional environment. One of the first accomplishments of the caBIG program was to develop and publish the caBIG Compatibility Guidelines (https://cabig.nci.nih.gov/guidelines_documentation/caBIGCompatGuideRev2_final.pdf). This document lays out requirements for achieving varying degrees of syntactic and semantic interoperability, labeled Bronze, Silver and Gold. The Gold level of interoperability calls for a common framework across the caBIG federation for the representation, advertisement, discovery, and invocation of distributed data and analytic resources. The caGrid software implements the core technology and a suite of services and tools to support this framework.

Before the caGrid development effort was initiated, exploratory work was carried out by the NCI Center for Bioinformatics (NCICB) to evaluate the state of existing technology frameworks and the availability of tools and middleware systems in each framework. The findings from this work have been published as a white paper (Sanchez et al., 2004, https://cabig.nci.nih.gov/guidelines_documentation/caGRIDWhitepaper.pdf). Based on the results of the technology evaluation presented in that white paper, Grid Services technology was chosen as the underlying framework for caGrid. In this paper, we describe the design of the caGrid architecture, how it employs the Grid Services framework and the current implementation of caGrid, referred to as caGrid version 0.5 (https://cabig.nci.nih.gov/workspaces/Architecture/caGrid).


    2 METHODS
 
2.1 Use cases
The design of caGrid has been driven by requirements collected from people collaborating in the various workspaces of the caBIG program. Several key use cases have emerged from these groups: discovery; coordinated research; and integrative data analysis. Effective protection of intellectual property and sensitive information is a critical component in multi-institutional settings. Thus, security is a common requirement in all of these use cases. The common issues include authentication of users from different institutions, common mechanisms for management of user accounts and privileges, and support for resource owners to implement and enforce access control policies.

The current state-of-the-art in Discovery typically involves searching for abstracts or research articles describing work in a given area. The descriptions are provided in unstructured text which in some cases is augmented with auxiliary datasets. The caGrid infrastructure aims to support much more precisely targeted searches that return precisely defined sets of attributes and values. In a scenario in which cancer researchers stored their data in grid accessible databases as described in this paper, a query would return sets of machine readable common data elements and values associated with each study. This information could then be systematically sorted, searched, examined and analyzed.

The Coordinated Research use case stemmed from the recognition that many kinds of cancer related research need to be conducted as coordinated efforts at multiple sites. Clinical trials generally accrue patients in multiple locations and carry out specialized tasks such as data analysis, pathology, proteomics, cytogenetics, radiology and various types of molecular studies at distributed sites. Ambitious basic science efforts are also frequently carried out in a coordinated manner by researchers at multiple sites. A key goal of caGrid infrastructure is to improve and standardize the ways in which these groups manage information and analyses, as well as to help these groups better manage their clinical trials and experiments. This has the potential to permit safer and more effective clinical trials by facilitating ongoing real-time analyses of toxicities and therapeutic endpoints, and through improving accruals via identification of eligible patients. In the basic science area, improved coordination tools will make it easier for Centers, cooperative groups and program projects to carry out coordinated studies that systematically explore experimental models and carry out coordinated analysis of different types of genetic and molecular image information.

The Integrative Data Analysis use case articulates the need to support researchers in developing, sharing and analyzing large databases containing high-throughput molecular and image data, and in correlating their findings with clinical and pathology annotations that may be stored in other systems. This use case informed the design of standardized representations for data services and analytic services, and led to the creation of tools and utilities that make it straightforward to implement such services.

2.2 Enabling technologies and standards
There are an increasing number of applications that employ Grid technologies (Foster and Kesselman, 1999; Atkinson et al., 2002, http://www.cs.man.ac.uk/grid-db/documents.html; Berman et al., 2003; Solomonides et al., 2003; Tweed and Miguet, 2003; Parashar et al., 2004; Grethe et al., 2005). As Grid computing has become more widely applied in application domains, a need has emerged for community accepted standards and a standards-based architecture for the Grid that would facilitate better interoperability among various Grid middleware systems and Grid-enabled applications. As an answer to this need, the Open Grid Services Architecture (OGSA) was proposed (Foster et al., 2002a, b, http://www.globus.org/alliance/publications/papers/ogsa.pdf). The OGSA extends web services to support additional features such as stateful services, dynamic service instantiation, service lifetime management and service notification. It has evolved recently into the Web Services Resource Framework (WSRF) (http://www.globus.org/wsrf). The WSRF not only encapsulates the core concepts underlying the OGSA, but also extends them to create a path for unification of web and Grid service frameworks. The caGrid framework builds on the concept of Grid Services, i.e. distributed resources in the environment are represented as Grid Services and interactions between services and clients are done using Grid Service protocols.

caGrid version 0.5 leverages three Grid middleware systems and their tools: the Globus Toolkit 3.2 (GT 3.2) (www.globus.org); the OGSA Data Access and Integration (OGSA-DAI) toolkit (http://www.ogsadai.org.uk/); and Mobius (http://projectmobius.osu.edu/). GT 3.2 is the most widely used reference implementation of the core OGSA standards. It implements support for creation, deployment and invocation of Grid services. It is used in caGrid as the core Grid middleware system, upon which other caGrid components are built. The support for virtualization of data sources as Grid Data Services is implemented using the OGSA-DAI infrastructure. OGSA-DAI is an implementation of the OGSA Data Access and Integration Standards developed by the Global Grid Forum (http://www.ggf.org). It provides tools and runtime support for development and deployment of data services in the Grid. The Mobius infrastructure provides support for distributed data and metadata management in the framework of a strongly-typed Data Grid supporting XML virtualization of data sources. It consists of a set of services for distributed and coordinated management of data type definitions and on-demand creation, management and federation of databases. In caGrid, Mobius is employed to support Grid-wide management of XML schemas representing the structure of common data types in the caBIG domain.

caGrid customizes and extends Grid technologies to better support the needs of the cancer research community. A primary distinction between basic Grid infrastructures and caGrid is the focus on modeling of data and metadata information associated with resources. caGrid adopts a model-driven, service-oriented architecture approach, with all data being transmitted by grid services as objects that are derived from models expressed in the Unified Modeling Language (UML). All data types are precisely defined and semantically harmonized, and thus common data elements, controlled vocabularies and object-based abstractions play a key role in the architecture. caGrid leverages services and components provided by the NCI common ontologic reference environment (caCORE) (Covitz et al., 2003). The Enterprise Vocabulary Service (EVS) within caCORE provides the description logic concepts and terms used to describe the classes, attributes and instance data in caGrid. caCORE also includes an ISO/IEC 11179 metadata registry, the Cancer Data Standards Repository (caDSR), that serves as the authoritative registry of caGrid data elements and their corresponding semantics. The caCORE Software Development Kit (SDK) provides a convenient mechanism for creating model-driven, semantically interoperable data services that can then be connected to caGrid (Phillips et al., 2006).


    3 RESULTS
 
In this section we describe the implementation of caGrid version 0.5, released in September 2005, with several updates since then. caGrid is designed as a service-oriented architecture (Fig. 1), in which a resource is exposed to the environment as a Grid service with well-defined interfaces. Services and clients interact with each other through Grid communication and service invocation protocols. In addition to data and analytical services, the caGrid infrastructure consists of coordination services that are required by clients and services for Grid-wide functions. Coordination services include services for metadata management; advertisement and discovery; query; and security. We note that coordination services can be replicated and distributed. For example, the distribution of the metadata management services can be done in a style similar to the operation of domain name servers. That is, a hierarchy of services can be established. The security services also are versatile in that an institution can set up the components locally to manage its users and their attributes, or alternatively, an authoritative institution (such as NCI or the lead institution in a cooperative study) can host the security services to collectively manage the users and attributes for a group of researchers.

Data sources are wrapped in caGrid Data Service interfaces; analysis programs are exposed through caGrid Analytical Service interfaces (Fig. 1). Following the metadata and model driven architecture approach of caGrid, each service is required to describe itself using caGrid standard service metadata. caGrid takes advantage of rich semantic information in order to support metadata based discovery of resources in the environment. In addition, caGrid services represent an object-oriented view of resources to the environment and are strongly typed services. That is, any given caGrid Data Service makes the backend datasets and databases, which may be relational tables, available to the environment as a database of objects. An analytical resource, implemented as a caGrid Analytical Service, exposes methods that take objects as input and return objects as output. Client and service APIs are object-oriented. They operate on well-defined and curated data types that are based on common data elements and controlled vocabularies. The definitions and structures of data types are registered and published in the environment via caCORE and Mobius services, thus creating strongly-typed services.

Below we detail the core components and functions of caGrid. Examples are drawn from two of the currently operational grid nodes: RProteomics (http://www.dbsr.duke.edu/research/softwaredev/libraries/r/rproteomics/default.aspx) and caBIO (cancer Bioinformatics Infrastructure Objects) (Covitz et al., 2003). RProteomics, developed at Duke University, is a prototypical caGrid Analytical Service that performs proteomics data analysis using statistical routines such as de-noising, peak calibration and characterization of peaks using the R statistical language. The caBIO system is a prototypical caGrid Data Service that provides a query API that returns biomedical domain data objects. The caGrid reference implementations of these two types of services are available as part of the caGrid version 0.5 release.


    3.1 Managing data types, ontologies and schemas
 
The caBIG compatibility guidelines dictate that caGrid services expose ‘Gold Level’ analytical and data resources to the Grid environment. Gold systems are based upon information models, terminologies, ontologies and common data elements that are harmonized and accepted as standards by the caBIG community. They have strongly typed and object-oriented service interfaces; use service interfaces in the form of Grid services; and employ XML for data exchange. caGrid leverages NCI EVS, the caDSR and the Mobius Global Model Exchange (GME) service for ontology, metadata and schema management, respectively. The caDSR and EVS serve as the authoritative sources of common data elements, vocabularies and ontologies. The GME serves as the authoritative repository of the XML schemas that correspond to the object classes registered in caDSR. The coordinated use of caDSR, EVS and GME is shown in Figure 2.

Common data types used by applications are described in UML and annotated with semantic concept information from EVS. All caGrid information models are annotated with concepts from the same reference ontology in EVS, the NCI Thesaurus. These concepts then form the basis for automated cross-model comparison during registration in the caDSR. In this manner data elements that are equivalent in different models are identified on the basis of having identical ontology concept annotations. These equivalencies are captured during the registration of models in the caDSR, and can be used to join information across independently created caGrid services. The process of annotating and harmonizing domain models, and converting them into caDSR metadata components, is further described in an earlier publication (Phillips et al., 2006).
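
The concept-based matching described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the caDSR implementation: the DataElement class, the model names and the concept codes (C-GENE, C-SYMBOL and so on) are hypothetical placeholders, not real NCI Thesaurus identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """A model attribute annotated with ontology concept codes (hypothetical structure)."""
    model: str
    name: str
    concepts: frozenset  # placeholder codes, not real NCI Thesaurus identifiers


def find_equivalences(elements):
    """Group elements from different models that carry identical concept annotations."""
    by_concepts = defaultdict(list)
    for element in elements:
        by_concepts[element.concepts].append(element)
    # Keep only the groups that span more than one model.
    return [
        group
        for group in by_concepts.values()
        if len({e.model for e in group}) > 1
    ]


elements = [
    DataElement("ModelA", "Gene.symbol", frozenset({"C-GENE", "C-SYMBOL"})),
    DataElement("ModelB", "GeneRecord.geneSymbol", frozenset({"C-GENE", "C-SYMBOL"})),
    DataElement("ModelB", "Sample.id", frozenset({"C-SAMPLE", "C-IDENTIFIER"})),
]

equivalent_groups = find_equivalences(elements)
for group in equivalent_groups:
    print(" == ".join(f"{e.model}:{e.name}" for e in group))
```

Because equivalence is decided purely by the annotation set, two independently built models can be joined on Gene.symbol and GeneRecord.geneSymbol without either model knowing about the other.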

In the Grid, client applications and Grid services are expected to be able to communicate using registered data types in an implementation and platform agnostic manner. To address this requirement, caGrid employs XML to represent registered data types when instances of these data types are exchanged over the Grid. Objects are serialized/de-serialized to/from XML documents when they are transferred between two end points in the caBIG environment. Thus, the descriptions of the data types registered in caDSR are augmented with XML schemas that represent the syntactic structure of each object in XML. An XML schema (XSD) can be used both as a formal description of the structural constraints of an XML document type and to validate instance documents.
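
A rough sketch of the serialize/de-serialize round trip is given below, using only the Python standard library. The Peak class and the namespace string are invented for illustration and are not registered caGrid types. Because the standard library has no XSD validator, the sketch checks only the root element name where a real service would validate against the registered schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Illustrative namespace only; not a registered caGrid schema.
NAMESPACE = "gme://Example.caBIG/1.0/example.model.peak"


@dataclass
class Peak:
    """Hypothetical data type standing in for a registered object class."""
    mass: float
    intensity: float


def to_xml(peak):
    """Serialize a Peak into a namespace-qualified XML document."""
    root = ET.Element(f"{{{NAMESPACE}}}Peak")
    ET.SubElement(root, f"{{{NAMESPACE}}}mass").text = repr(peak.mass)
    ET.SubElement(root, f"{{{NAMESPACE}}}intensity").text = repr(peak.intensity)
    return ET.tostring(root, encoding="unicode")


def from_xml(document):
    """Rebuild a Peak from XML, rejecting documents with an unexpected root element."""
    root = ET.fromstring(document)
    if root.tag != f"{{{NAMESPACE}}}Peak":
        raise ValueError(f"unexpected root element: {root.tag}")
    ns = {"p": NAMESPACE}
    return Peak(
        mass=float(root.findtext("p:mass", namespaces=ns)),
        intensity=float(root.findtext("p:intensity", namespaces=ns)),
    )


original = Peak(mass=1234.5, intensity=0.75)
document = to_xml(original)
restored = from_xml(document)
```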

Support for publication and management of XML schemas is provided by the Mobius GME service (Hastings et al., 2004). GME is a DNS-like data definition registry and exchange service that enables services and clients to publish, retrieve, discover and version XML schemas under namespaces. Namespaces allow creation of authoritative hierarchies on schemas and make it easier to manage them. Management of all schemas registered under one or more namespaces and sub-namespaces can be assigned to a single GME. This GME can further delegate its authority to additional GMEs in a hierarchy using sub-namespaces. As an example, an institution may have one GME as the root authority and each division in the institution can be assigned a sub-namespace. The caGrid 0.5 release uses a single authoritative GME to manage all of the namespaces and schemas. As GME manages schemas by their respective namespaces, a consistent approach to selecting namespaces for data types is necessary. The caGrid 0.5 release uses the following structure for assigning namespaces for caBIG objects in caGrid: gme://<Classification Scheme>.<Context>/<Scheme Version>/<Scheme Item>. Each of the components of the namespace structure is derived from corresponding metadata components in the caDSR. <Classification Scheme> defines the project (or application; e.g. RProteomics, caBIO) within the <Context> (e.g. caBIG). The version of the schema is encoded in the <Scheme Version> section of the namespace. The name or id of the schema is stored in the <Scheme Item> section.

Using this namespace policy in conjunction with the GME, it is programmatically possible to discover the syntax used to transmit registered data types given a concept or data element in the caDSR. For instance, the following is used as the namespace for the object types in the caGrid RProteomics Analytical Service: gme://RProteomics.caBIG/1.0/edu.duke.cabig.rproteomics.model.scanFeatures. Here, the Classification Scheme is RProteomics; the Context is caBIG; the Scheme Version is 1.0; and the Classification Scheme Item is edu.duke.cabig.rproteomics.model.scanFeatures. The latter defines the name of the XML schema itself, which represents the object types. Figure 3 shows an example registration of the Scan Features data model for the RProteomics service. These data types, once registered, can then be used by caGrid application developers during service creation. A graphical representation of the RProteomics featureType element in the Scan Features schema is shown in Supplementary Figure S1.
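
To make the namespace structure concrete, the sketch below splits a namespace string into its four components and rebuilds it. The GmeNamespace class and parse_namespace function are hypothetical helpers written for this illustration and are not part of the GME client API. The sketch also assumes that the Classification Scheme contains no dot, which holds for the examples in the text but is not stated as a rule.

```python
import re
from dataclasses import dataclass

# gme://<Classification Scheme>.<Context>/<Scheme Version>/<Scheme Item>
# Assumption: the classification scheme itself contains no "." character.
_PATTERN = re.compile(
    r"^gme://(?P<scheme>[^./]+)\.(?P<context>[^/]+)/(?P<version>[^/]+)/(?P<item>.+)$"
)


@dataclass(frozen=True)
class GmeNamespace:
    """Hypothetical holder for the four namespace components."""
    classification_scheme: str
    context: str
    scheme_version: str
    scheme_item: str

    def __str__(self):
        return (
            f"gme://{self.classification_scheme}.{self.context}/"
            f"{self.scheme_version}/{self.scheme_item}"
        )


def parse_namespace(text):
    """Split a namespace string into its components, or raise ValueError."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognized namespace: {text!r}")
    return GmeNamespace(
        match["scheme"], match["context"], match["version"], match["item"]
    )


example = "gme://RProteomics.caBIG/1.0/edu.duke.cabig.rproteomics.model.scanFeatures"
namespace = parse_namespace(example)
print(namespace.classification_scheme, namespace.context, namespace.scheme_version)
```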


    3.2 Analytical services
 
There are several requirements that must be met by an analytical resource to be a caGrid Analytical Service. (1) The service should provide object-oriented client APIs for discovery and invocation of the service. (2) It must have strongly typed interfaces; i.e. the methods provided by the service and their signatures should be well defined and registered in the environment. Moreover, the input parameters to each method and the output returned by the method must be objects, whose object model definitions are registered in the caDSR, and whose corresponding XML schemas are registered in the GME service. (3) The service should have descriptive metadata and this metadata should be stored in the caGrid advertisement and discovery infrastructure, the Index Service.

The GT is used as the underlying Grid middleware for creating, registering, discovering and invoking caGrid analytical services. However, the GT provides low-level constructs and APIs that are not trivial for service developers to use. In addition, caGrid services require that their interfaces be strongly typed. In order to address these issues, a caGrid analytical service toolkit, called Introduce, has been developed. The Introduce toolkit abstracts away the complexities of the GT for creating the description of a service and the associated metadata, and for registering the service. Using the Introduce Graphical User Interface (GUI), the developer can create and modify the service interface by adding and removing methods and can define method signatures (i.e. the types of objects the methods take as input and return as output). The toolkit generates the basic classes and skeleton for the service implementation. It also provides tools to look up and properly reference the common data elements specific to the caBIG domain. Introduce uses GME to extract the schemas for the types which are required in the analytical service interface, thus enabling the creation of strongly typed Grid service interfaces. The toolkit also automatically generates the APIs for client programs to use to interact with the service. It allows a developer to make modifications to an existing service, such as adding new methods, removing existing methods or modifying methods in the service interface. With Introduce, the service developer is free to concentrate on the details of implementing domain-specific code, with the toolkit performing most of the work to ensure the service is caGrid-compliant with respect to the way it is registered to the Grid.

Using Introduce, the RProteomics analytical service was developed as follows. The developer describes the method interfaces of the analytical service using the GUI, then searches the GME for the data types of the input and return parameters of each method. The toolkit retrieves the structure of the corresponding objects from the GME using the schema registered under the following namespace: gme://RProteomics.caBIG/1.0/edu.duke.cabig.rproteomics.model.scanFeatures.

The toolkit imports the schema into the Grid interface description of the service so that objects can be created for the types described in the schema. The service description, along with the newly imported data types, is processed by Introduce to create the service skeleton that the developer can then implement and deploy. The service skeleton contains auto-generated code, i.e. the service interface methods as shown in Supplementary Figure S2 (for the RProteomics service). The toolkit also creates client APIs so that client applications can be written to use this service.


    3.3 Data services
 
Just as caGrid Analytical Services provide mechanisms for the caBIG community to share data analysis methods, caGrid Data Services provide the means to share data. The main requirement for a caGrid Data Service is that it should implement an object-oriented virtualization of the backend data source. The data service provider needs to map the elements of the backend datasets and databases into objects, whose models and semantic meanings are described in caDSR and EVS and whose corresponding XML schemas are registered in GME. In this way, data types served by a data source can be identified and interacted with in a standard way by clients. Each data service is also required to register additional metadata (such as the institute of the data provider, the list of objects served by the data service, etc.) in caGrid so that it can be discovered by other services and clients.

caGrid 0.5 uses the OGSA-DAI 5.0 platform (http://www.ogsadai.org.uk) for data service implementation and deployment. OGSA-DAI has three kinds of methods (referred to as activities) that a data source has to implement: query, transformation and delivery. Each caGrid Data Service is expected to provide an implementation of these activities that will present an object-oriented view of backend databases. To create a data service, the data source developer creates the service interfaces using the registered object definitions and corresponding XML schemas. The developer should then implement mechanisms to carry out the mapping between the object structures and the backend datasets or databases. As a reference implementation, caGrid 0.5 provides an activity implementation for caCORE SDK generated data sources. If a data source is developed from scratch, the caCORE SDK can be used to implement the system. In this way, the integration of the data source as a caGrid Data Service does not require any new code. caBIO is an example of a data service that was developed with the caCORE SDK and then registered as a caGrid data service.
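
The division of labor among the three activities can be illustrated with a small, self-contained Python sketch over an in-memory SQLite table. This is not the OGSA-DAI API, which is Java-based. The function names, the Gene class, the table layout and the XML element names are all assumptions made for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical domain object exposed by the data service."""
    symbol: str
    chromosome: str


def query_activity(connection, chromosome):
    """Query: fetch matching rows from the relational backend."""
    cursor = connection.execute(
        "SELECT symbol, chromosome FROM gene WHERE chromosome = ? ORDER BY symbol",
        (chromosome,),
    )
    return cursor.fetchall()


def transformation_activity(rows):
    """Transformation: map relational rows onto domain objects."""
    return [Gene(symbol=symbol, chromosome=chrom) for symbol, chrom in rows]


def delivery_activity(genes):
    """Delivery: serialize the objects to XML for the client."""
    root = ET.Element("GeneList")
    for gene in genes:
        ET.SubElement(root, "Gene", symbol=gene.symbol, chromosome=gene.chromosome)
    return ET.tostring(root, encoding="unicode")


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE gene (symbol TEXT, chromosome TEXT)")
connection.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("TP53", "17"), ("BRCA1", "17"), ("MYC", "8")],
)

genes = transformation_activity(query_activity(connection, "17"))
delivered = delivery_activity(genes)
print(delivered)
```

Keeping the mapping step separate from the query step is what lets the client see only objects, regardless of how the backend stores them.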

Queries supported by the OGSA-DAI query activity in caGrid version 0.5 are based on the notion of objects and object traversals. Objects or the attributes of an object may have one or more associations with other objects. These associations describe semantic relationships among objects and can be viewed as paths that allow one to traverse nested object hierarchies based on domains, concepts and relationships. Objects can be searched for based on their object classes and concept definitions, criteria on their properties, and association paths. These queries are expressed in the caGrid Query Language (CQL), an object oriented query language that is expressed in XML. Every data service in caGrid provides a uniform query interface that provides support for CQL. CQL allows representing data service queries in object oriented terms, i.e. by using objects and attributes of objects. A CQL query reflects the structure of the underlying object model of a data service while abstracting the physical representation of the data. An example CQL query is shown in Supplementary Figure S3.
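
The following Python sketch illustrates the general idea of querying by target class, attribute criteria and association paths. It does not reproduce CQL syntax, which is XML-based and shown in the supplementary material. The Query class, the dictionary-based objects and the field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical object query: target class, attribute criteria, nested associations."""
    target: str
    attributes: dict = field(default_factory=dict)
    associations: dict = field(default_factory=dict)  # association name -> Query


def matches(obj, query):
    """Return True if obj satisfies the query, following association paths recursively."""
    if obj.get("class") != query.target:
        return False
    for name, expected in query.attributes.items():
        if obj.get(name) != expected:
            return False
    for role, sub_query in query.associations.items():
        related = obj.get(role, [])
        if not any(matches(other, sub_query) for other in related):
            return False
    return True


def execute(objects, query):
    """Evaluate the query against an in-memory collection of objects."""
    return [obj for obj in objects if matches(obj, query)]


human = {"class": "Taxon", "scientificName": "Homo sapiens"}
mouse = {"class": "Taxon", "scientificName": "Mus musculus"}
objects = [
    {"class": "Gene", "symbol": "TP53", "taxon": [human]},
    {"class": "Gene", "symbol": "Trp53", "taxon": [mouse]},
    human,
    mouse,
]

# Find Gene objects associated with a Taxon whose scientificName is "Homo sapiens".
query = Query(
    target="Gene",
    associations={"taxon": Query("Taxon", {"scientificName": "Homo sapiens"})},
)
results = execute(objects, query)
print([gene["symbol"] for gene in results])
```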


    3.4 Service advertisement and discovery
 
When a caGrid service is connected to the caBIG environment, the service is required to advertise itself using Service Data Elements (SDEs) in a standardized way so that it can be discovered by other services and clients. The SDE, an OGSA concept, is used to express the state of the service and the metadata describing the service. caGrid defines a common specification for a set of required SDEs for caGrid services. These SDEs are divided into three main categories: Common SDEs describe the basic aspects of all caGrid services; Analytical Service SDEs describe features relevant to Analytical Services; and Data Service SDEs describe features relevant to Data Services. The Common SDEs enable discovery of services based on basic information such as research center name, center address, center phone, the name of the point of contact person and a free-text description of the resource. Data and Analytical Service SDEs store the descriptions of the object types and operations a service makes available to the caBIG environment and facilitate service discovery based on domain object types. The model structures of these three categories of SDEs as well as examples of each are shown in Supplementary Figures S4–S8.

Instances of SDEs for each service in caGrid are registered with the caGrid Index Service to support advertisement and discovery. caGrid leverages the Index Service provided in the Information Services architecture of the GT. The Index Service can be thought of as the ‘yellow pages’ and ‘white pages’ of caGrid. Each caGrid service instance advertises its availability and its service metadata with the Index Service. Clients can query the Index Service, as shown in Figure 4, to locate the services of interest by specifying search criteria over the published SDEs. For instance, caGrid data services register their domain object types with the Index Service as part of their SDEs; a client can then execute a query to find all the data services that serve a particular type of object (e.g. Proteomics objects, Gene objects). Additional common discovery scenarios supported in caGrid include full-text searches over all the service metadata.
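
The registration and discovery pattern described above can be sketched with a small in-memory registry. This is a simplified illustration written in Python, not the Globus Index Service or the caGrid API: the class names, field names and service handles below are hypothetical, and a real deployment would publish the metadata as SDEs and query them over Grid protocols.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceMetadata:
    """A reduced stand-in for the SDEs a service advertises (hypothetical fields)."""
    handle: str                      # grid service handle, here a made-up URL
    center_name: str                 # part of the common metadata
    description: str                 # free-text description of the resource
    object_types: set = field(default_factory=set)  # domain object types served


class IndexService:
    """In-memory registry supporting the two discovery scenarios in the text."""

    def __init__(self):
        self._entries = {}

    def register(self, metadata):
        # Re-registering the same handle replaces the earlier advertisement.
        self._entries[metadata.handle] = metadata

    def find_by_object_type(self, object_type):
        """Return handles of services that serve the given domain object type."""
        return sorted(
            handle for handle, meta in self._entries.items()
            if object_type in meta.object_types
        )

    def find_by_text(self, text):
        """Case-insensitive full-text search over all advertised metadata."""
        needle = text.lower()
        hits = []
        for handle, meta in self._entries.items():
            haystack = " ".join(
                [meta.center_name, meta.description, *sorted(meta.object_types)]
            ).lower()
            if needle in haystack:
                hits.append(handle)
        return sorted(hits)


index = IndexService()
index.register(ServiceMetadata(
    "https://example.org/services/proteins", "Center A",
    "Protein information data service", {"Protein"}))
index.register(ServiceMetadata(
    "https://example.org/services/arrays", "Center B",
    "Microarray experiment data service", {"Gene", "Experiment"}))
index.register(ServiceMetadata(
    "https://example.org/services/genes", "Center C",
    "Gene annotation data service", {"Gene"}))

gene_services = index.find_by_object_type("Gene")
microarray_services = index.find_by_text("microarray")
print(gene_services)
print(microarray_services)
```

Both lookups return service handles only, which mirrors the idea that a client first locates candidate services and then contacts them directly.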

caGrid 0.5 provides a series of high-level APIs and user applications to facilitate service advertisement and service discovery. It also provides a graphical deployment tool that can be used to deploy caGrid services. The graphical deployment tool prompts the service provider for the information necessary to populate the standard metadata formats. Once the necessary information has been gathered, it configures the service to register itself with the Index Service and to support subscription, so that the Index Service is automatically notified of any changes to the values of the SDEs of the service being deployed. A client-side discovery API is provided in caGrid to enable the user to implement discovery as part of a caGrid application without having to know the low-level details of the discovery process. All discovery API methods return a list of grid service handles, and the API provides convenience methods for common discovery scenarios. Service metadata can also be inspected and returned in object model form.
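
The subscription behaviour mentioned above, in which the Index Service is notified whenever a service changes one of its SDE values, follows a publish and subscribe pattern. The Python sketch below illustrates that pattern only; the class and method names are hypothetical and do not correspond to the Globus Toolkit notification interfaces.

```python
class AdvertisedService:
    """A service that pushes its current SDE values to subscribers on every change."""

    def __init__(self, handle, sdes):
        self.handle = handle
        self._sdes = dict(sdes)
        self._subscribers = []

    def subscribe(self, callback):
        # Subscribing doubles as the initial registration: the subscriber
        # immediately receives a snapshot of the current values.
        self._subscribers.append(callback)
        callback(self.handle, dict(self._sdes))

    def set_sde(self, name, value):
        # Any change to an SDE value is propagated to every subscriber.
        self._sdes[name] = value
        for callback in self._subscribers:
            callback(self.handle, dict(self._sdes))


class Index:
    """Keeps the most recently published SDE snapshot for each service handle."""

    def __init__(self):
        self.entries = {}

    def on_update(self, handle, sdes):
        self.entries[handle] = sdes


index = Index()
service = AdvertisedService(
    "https://example.org/services/proteins",      # hypothetical handle
    {"pointOfContact": "A. Researcher"},          # hypothetical SDE
)
service.subscribe(index.on_update)
service.set_sde("pointOfContact", "B. Curator")
print(index.entries[service.handle])
```

Because each notification carries a copy of the values, the index never holds a reference to the internal state of the service, which keeps the two sides decoupled.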


    3.5 Security
 
caGrid 0.5 implements a security architecture that supports five basic requirements that emerged from the use cases: (1) Authentication and Secure Communication; (2) Authorization and Access Control; (3) Single Sign On; (4) User Management; and (5) Delegation. In caGrid version 0.5, these functions are supported by four main components: the Grid Security Infrastructure (GSI), the Authorization Manager, the Grid User Management Service (GUMS) and the Common Attribute Management Service (CAMS). The GSI is a component provided by the GT. It implements the basic mechanisms for secure communication, single sign on, authentication and delegation. Its functionality is based on the use of public key cryptography. The GUMS and CAMS have been developed on top of the GT security components for Grid-wide management of users, user certificates and attributes. The caGrid authorization manager has been extended from the Globus authorization manager framework to support operation-level access control (i.e. a Grid service can make authorization decisions based on the credentials of the requestor and on which methods of the service are being invoked) and the ability to consult external authorization or policy frameworks. The overall security infrastructure is illustrated in Figure 5. A detailed description of the security components and the mechanisms by which they interact with caGrid services will be published elsewhere.
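
Operation-level access control, as described above, means that the decision depends on both the identity of the requestor and the operation being invoked, with the option of deferring to an external policy source. The following Python sketch illustrates that decision logic under simplifying assumptions: the class name, the policy layout, the identities and the operation names are all hypothetical, and credential validation (handled by the GSI in caGrid) is assumed to have happened already.

```python
class AuthorizationManager:
    """Decides whether an authenticated identity may invoke a given operation."""

    def __init__(self, policy, external_check=None):
        # policy maps an operation name to the set of identities allowed to call it.
        self._policy = policy
        # Optional callable(identity, operation) -> bool standing in for an
        # external authorization or policy framework.
        self._external_check = external_check

    def is_permitted(self, identity, operation):
        allowed = self._policy.get(operation)
        if allowed is not None:
            # A local rule exists for this operation, so it decides.
            return identity in allowed
        if self._external_check is not None:
            # No local rule: consult the external framework.
            return self._external_check(identity, operation)
        # Deny by default when nothing grants access.
        return False


def external_policy(identity, operation):
    """Hypothetical external rule: anyone from the example organisation may run 'query'."""
    return operation == "query" and identity.startswith("/O=Example/")


manager = AuthorizationManager(
    policy={"submitData": {"/O=Example/CN=alice"}},
    external_check=external_policy,
)

print(manager.is_permitted("/O=Example/CN=alice", "submitData"))
print(manager.is_permitted("/O=Example/CN=bob", "submitData"))
print(manager.is_permitted("/O=Example/CN=bob", "query"))
```

The same requestor can therefore be allowed to call one method of a service and refused another, which is the distinction between operation-level and service-level authorization.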


    4 DISCUSSION
 
Driven by the requirements of the applications and use cases targeted in caBIG, caGrid represents a major step towards leveraging the combined strength of individual research groups in the cancer research community and towards making more effective use of diverse and distributed biomedical information in cancer research. A key characteristic of the framework is its focus on metadata- and model-driven service development and deployment. This aspect of caGrid is particularly important for supporting syntactic and semantic interoperability across heterogeneous collections of applications and for enabling unified, programmatic access to remote, autonomously controlled data and analytical resources.

caGrid 0.5 was instantiated as a test bed in the configuration depicted in Figure 6. Its initial implementation included services at NCI, Georgetown Lombardi Cancer Center, Duke University Cancer Center and University of Pittsburgh Medical Center Cancer Center as well as nodes at the Ohio State University and NCI for testing the caGrid core infrastructure components. The operational grid offers Data Services for microarray data; protein information from the Protein Information Resource (PIR); pathology data extracted from pathology reports by the cancer Tissue Information Extraction System (caTIES); and bioinformatics data from caBIO (https://cabig.nci.nih.gov/workspaces/Architecture/caGrid/). This test bed will grow as new data and analytical services are developed. Access to caGrid resources is available through the client API provided with the caGrid 0.5 software package, or through either of two prototypical GUI applications for discovery and invocation of caGrid services. The first is a discovery client application provided with the caGrid software package, which allows users to choose the discovery operation they want to use to search for services and to browse the service metadata. Users can access a screen that enables the execution of a query against any of the discovered data services, and when browsing the SDEs of an analytical service the user can view a screen that shows the analytical service's interface. The second is a web application called cagrid-browser, which allows clients with no caGrid installation to discover and interact with the caGrid services through a web browser (http://cagrid-browser.nci.nih.gov).

The design and implementation of caGrid is an on-going project. The current release provides the core functionality required to build, deploy, discover and securely access data and analytical resources in a Grid environment. The caGrid team is now working towards the caGrid version 1.0 release, which will incorporate several extensions and improvements. First, support will be developed for composition, management and execution of Grid-wide data analysis workflows. The current release allows a user to access individual analytical resources and manually construct a workflow. Direct workflow support within the caGrid architecture will enable users to create complex data analysis applications by composing multiple analytical services into a network of data processing operations. Second, the analytical service infrastructure will be extended to support high-performance services running on parallel and distributed machines. This support will enable faster execution of compute-intensive data analysis operations using high-performance clusters at supercomputer centers and institutions. Third, the current implementation of the query support allows a user to submit queries to individual data sources and carry out data subsetting operations. This support will be extended to the execution of federated queries across multiple data sources in the next release. Fourth, the security infrastructure will be extended to support a broader set of functions. These functions include methods for better interfacing with existing institutional security infrastructure, federated identity management, virtual organizations and management of trust relationships (Vimercati and Samarati, 1996; Lin and Daemer, 2005, https://cabig.nci.nih.gov/workspaces/Architecture/Documents/Arch_Workspace/caBIG_Technology_Evaluation_Security_White_Paper_version_0_2.pdf; Langella et al., 2006). An initial implementation of identity federation and authentication has been developed and is presented in Langella et al. (2006). The core infrastructure is also being ported to Globus Toolkit 4.0 (www.globus.org) to leverage the WSRF.


Fig. 1 The caGrid infrastructure with core services.

Fig. 2 The use of caDSR, EVS and GME for creation and management of common data types and exchange of objects conforming to these types.

Fig. 3 Example registration of the RProteomics scan features data model.

Fig. 4 Index service usage.

Fig. 5 caGrid security infrastructure.

Fig. 6 caGrid version 0.5 testbed. The caBIG Grid cloud represents the wide-area networking of resources at NCI, Georgetown, Duke, Pittsburgh and OSU connected via Grid communication protocols.

    Acknowledgments
 
The authors wish to acknowledge the substantial technical contributions of Brian Gilman, Steve Lagou, Nick Encina, Ram Chilukuri, Jijin Yan, Ruowei Wu and Nicole Thompson. Denise Warzel, George Komatsoulis and Frank Hartel provided managerial coordination and support for the work to integrate caCORE and caCORE-SDK generated services with caGrid. Tara Akhavan and Michael Keller provided project and program management support, and Nafis Zebarjadi provided developer and user training support. Gavin Brennan, Wei Lu and Troy Smith provided system administration and hardware configuration support. We would like to thank Patrick McConnell and Salvatore Mungal for help with the RProteomics application example and the RProteomics Reference Implementation at Duke Cancer Center. Cathy Wu, Baris Suzek, Hongzhan Huang, Scott Chung, Hsing-Kuo Hua, Peter McGarvey, Colin Freas, Jess Cannata, Nick Marcou, Jack Yuelin Zhu and Arnie Miles contributed the PIR and caArray Reference Implementations at Georgetown/Lombardi Cancer Center. Rebecca Crowley, Aditya Namlekar, Kevin Mitchell, Girish Chavan and Linda Schmandt provided the caTIES Reference Implementation at University of Pittsburgh Cancer Center. This work was funded by the National Cancer Institute, National Institutes of Health, US Department of Health and Human Services. The development of the Mobius software was funded in part by the Ohio Board of Regents BRTTC #BRTT02-0003 and NIH NIBIB BISTI P20EB000591. Funding to pay the Open Access publication charges was provided by NCI Center for Bioinformatics and the Biomedical Informatics Department, Ohio State University College of Medicine.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alvis Brazma

Received on March 30, 2006; accepted on May 23, 2006

    REFERENCES
 

    Atkinson, M.P., Dialani, V., Guy, L., Narang, I., Paton, N.W., Pearson, D., Storey, T., Watson, P. (2002) Grid Database Access and Integration: Requirements and Functionalities, Technical Document, Global Grid Forum.

    Berman, F., Hey, A.J., Fox, G. (2003) Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons.

    Covitz, P.A., Hartel, F., Schaefer, C., Coronado, S., Fragoso, G., Sahni, H., Gustafson, S., Buetow, K.H. (2003) caCORE: a common infrastructure for cancer informatics. Bioinformatics, 19, 2404–2412.

    Foster, I. and Kesselman, C. (1999) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco.

    Foster, I., Kesselman, C., Nick, J., Tuecke, S. (2002a) Grid services for distributed system integration. Computer, 35, 37–46.

    Foster, I., Kesselman, C., Nick, J., Tuecke, S. (2002b) The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Open Grid Service Infrastructure Working Group Technical Report, Global Grid Forum.

    Grethe, J.S., Baru, C., Gupta, A., James, M., Ludaescher, B., Martone, M.E., Papadopoulos, P.M., Peltier, S.T., Tajasekar, A., Santini, S., Zaslavsky, I.N., Ellisman, M.H. (2005) Biomedical Informatics Research Network: Building a National Collaboratory to Hasten the Derivation of New Understanding and Treatment of Disease. Stud. Health Technol. Inform., 112, 100–109.

    Hastings, S., Langella, S., Oster, S., Saltz, J. (2004) Distributed data management and integration: the Mobius Project. Proceedings of the Global Grid Forum 11 (GGF11) Semantic Grid Applications Workshop, Honolulu, Hawaii, USA, pp. 20–38.

    Langella, S., Oster, S., Hastings, S., Siebenlist, F., Kurc, T., Saltz, J. (2006) Dorian: grid service infrastructure for identity management and federation. Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems, Special Track: Grids for Biomedical Informatics, June 22–23, 2006, Salt Lake City, Utah.

    Lin, K. and Daemer, G. (2005) caBIG Security Technology Evaluation White Paper.

    Parashar, M., Klie, H., Catalyurek, U., Kurc, T., Matossian, V., Saltz, J., Wheeler, M. (2004) Application of grid-enabled technologies for solving optimization problems in data-driven reservoir studies. Proceedings of the International Conference on Computational Science (ICCS 2004), Pt 3, Springer-Verlag, Berlin, Vol. 3038, pp. 805–812.

    Phillips, J., Chilukuri, R., Fragoso, G., Warzel, D., Covitz, P.A. (2006) The caCORE Software Development Kit: Streamlining construction of interoperable biomedical information services. BMC Med. Inform. Decision Making, 6.

    Sanchez, W., Gilman, B., Kher, M., Lagou, S., Covitz, P. (2004) caGRID White Paper.

    Solomonides, A., McClatchey, R., Odeh, M., Brady, M., Mulet-Parada, M., Schottlander, D., Amendolia, S.R. (2003) MammoGrid and eDiamond: grids applications in mammogram analysis. Proceedings of the IADIS International Conference: e-Society 2003, Lisbon, Portugal, pp. 1032–1033.

    Tweed, T. and Miguet, S. (2003) Medical Image Database on the grid: strategies for data distribution. Proceedings of HealthGrid'03, Lyon, France, pp. 152–162.

    Vimercati, S. and Samarati, P. (1996) An authorization model for federated systems. Proceedings of the 4th European Symposium on Research in Computer Security, LNCS 1146, pp. 99–117.

