Bioinformatics Advance Access originally published online on September 5, 2006
Bioinformatics 2006 22(21):2706-2708; doi:10.1093/bioinformatics/btl444
SEBINI: Software Environment for BIological Network Inference
1 Computational Biology and Bioinformatics Group, Pacific Northwest National Laboratory, Richland, WA, USA
2 Oberlin College, Oberlin, OH, USA
3 Case Western Reserve University, Cleveland, OH, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The Software Environment for BIological Network Inference (SEBINI) has been created to provide an interactive environment for the deployment and evaluation of algorithms used to reconstruct the structure of biological regulatory and interaction networks. SEBINI can be used to compare and train network inference methods on artificial networks and simulated gene expression perturbation data. It also allows experimental high-throughput expression data to be analyzed within the same framework, using the suite of (trained) inference methods. SEBINI should therefore be useful to software developers wishing to evaluate, compare, refine or combine inference techniques, and to bioinformaticians analyzing experimental data. SEBINI provides a platform that aids in more accurate reconstruction of biological networks, with less effort, in less time.
Availability: A demonstration website is located at https://www.emsl.pnl.gov/NIT/NIT.html. The Java source code and PostgreSQL database schema are available freely for non-commercial use.
Contact: ronald.taylor{at}pnl.gov
| 1 INTRODUCTION |
|---|
Reconstruction of regulatory and signaling networks is a critical task in systems biology. High-throughput molecular biology experiments are now producing mRNA expression data in quantities large enough for researchers to attempt to reconstruct the structure of gene transcription networks based primarily on state correlation measurements. Protein expression and activation measurements are soon to follow, allowing similar work on protein–protein interaction and signaling networks.
Researchers have previously created artificial networks and simulated expression data to test the specific network inference algorithms they were developing (Ideker et al., 2000; Tamada et al., 2003; Zak et al., 2001). A collection of static datasets for four network topology types, with corresponding synthetic gene expression data, has been made available for download as flat files (Mendes et al., 2003). The ASIAN website lets users infer edges between gene clusters using one particular method, graphical Gaussian modeling (Aburatani et al., 2004). A collection of Java-based Bayesian network structure learning algorithms has also recently been made available (Hartemink et al., 2005, www.cs.duke.edu/~amink/software/banjo/). However, the community still lacks a general analysis environment: a software platform where artificial datasets of different types can be created dynamically and then used to test a growing collection of inference algorithms, and where true experimental data can then be entered for analysis through a standard API to each of the toolkit algorithms. That is, there is currently no interactive environment available to (1) evaluate different algorithms for inference of biological regulatory and signaling network structure using common datasets and (2) easily apply such state-of-the-art algorithms to experimentally generated high-throughput data (van Someren et al., 2002). The Software Environment for BIological Network Inference (SEBINI) project at the Pacific Northwest National Laboratory (PNNL) is designed to fill this gap, as an aid in the reconstruction of the structure, i.e. the wiring diagram, of mRNA and protein networks. SEBINI provides a web-based environment that allows inference algorithms to be compared, trained, refined and then employed on experimental data.
We assume that users of SEBINI seek to directly infer genetic regulatory networks from high-throughput microarray mRNA expression data, and protein interaction and signaling networks from high-throughput quantitative protein data. Briefly, methods using high-throughput data rely on searching for patterns of partial correlation or conditional probability that indicate causal influence (Spirtes et al., 2000). Such patterns found in the high-throughput data, possibly combined with supplemental data on the genes or proteins in the proposed networks or with other information on the organism, are the basis upon which the algorithms in SEBINI's toolkit infer networks. More generally, SEBINI may be useful in inferring the topology of any network where a change in the state of one node can affect the state of other nodes.
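As a hedged illustration of how partial correlation can separate direct from indirect influence, the Python sketch below simulates a three-node chain X → Y → Z. It is not taken from SEBINI (which is written in Java); the noise model, sample size and function names are assumptions chosen for illustration. X and Z are clearly correlated, yet their first-order partial correlation given Y is close to zero, so a method based on partial correlation would not draw a direct X-Z edge.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, given):
    """First-order partial correlation of a and b, controlling for `given`."""
    r_ab = pearson(a, b)
    r_ag = pearson(a, given)
    r_bg = pearson(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1 - r_ag ** 2) * (1 - r_bg ** 2))


# Simulated chain X -> Y -> Z with independent Gaussian noise (illustrative only).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]
z = [yi + rng.gauss(0, 1) for yi in y]

r_xz = pearson(x, z)                  # marginal correlation: clearly non-zero
r_xz_given_y = partial_corr(x, z, y)  # near zero: X affects Z only through Y
print(f"corr(X, Z) = {r_xz:.3f}, partial corr(X, Z | Y) = {r_xz_given_y:.3f}")
```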
| 2 SEBINI ARCHITECTURE |
|---|
SEBINI uses a standard three-tier architecture: (1) a web-based client user interface, (2) an application logic middle tier consisting of a suite of Java servlets and other Java programs (>100 Java classes) and (3) a relational database storing the data required by the middle tier. Inferred networks (as well as the raw data, discretized data and algorithm parameter selections used to generate the networks) are permanently stored in the database for visualization, topological and statistical analysis, and for later export in a human-readable or program-specific format. Inference and discretization (binning) algorithms can be any sort of executable program; a Java handler class is added for each new algorithm to handle communication between the invocation web page, the database and the algorithm. Security is implemented on a project basis, with one owner and possibly multiple users per project.
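The handler-per-algorithm idea can be sketched as follows. This is a hypothetical Python illustration of the pattern, not SEBINI's actual Java API: the names `AlgorithmHandler`, `EqualWidthBinningHandler`, `HANDLERS` and `run_algorithm`, the equal-width binning rule, and the in-memory list standing in for the database are all assumptions.

```python
from abc import ABC, abstractmethod


class AlgorithmHandler(ABC):
    """Hypothetical adapter between the web tier, the database and one algorithm."""

    @abstractmethod
    def run(self, data, params):
        """Run the wrapped algorithm on `data` with user-selected `params`."""


class EqualWidthBinningHandler(AlgorithmHandler):
    """Discretizes each gene's expression values into equal-width bins."""

    def run(self, data, params):
        n_bins = params.get("n_bins", 3)
        binned = {}
        for gene, values in data.items():
            lo, hi = min(values), max(values)
            width = (hi - lo) / n_bins or 1.0  # flat profile: avoid dividing by zero
            # The maximum value would land in bin n_bins, so clamp it to the top bin.
            binned[gene] = [min(int((v - lo) / width), n_bins - 1) for v in values]
        return binned


# Registry: adding a new algorithm means adding one handler entry.
HANDLERS = {"equal_width": EqualWidthBinningHandler()}


def run_algorithm(name, data, params, database):
    """Look up the handler, run it, and store parameters and output together."""
    result = HANDLERS[name].run(data, params)
    database.append({"algorithm": name, "params": dict(params), "result": result})
    return result


database = []  # stand-in for the relational database tier
expression = {"geneA": [0.0, 1.5, 3.0, 6.0], "geneB": [2.0, 2.0, 2.0, 2.0]}
binned = run_algorithm("equal_width", expression, {"n_bins": 3}, database)
print(binned)
```

Storing the parameter selections next to the result mirrors the requirement that every inferred or binned dataset can be traced back to the settings that produced it.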
Major design issues included (1) the interface for user navigation among possibly huge datasets, allowing easy drill-down from a network set to a specific network to a specific node or edge and (2) producing an efficient, understandable mapping from the inferred networks and inferred edges back to the corresponding original expression data. Note that there is a one-to-many relationship from an expression dataset to its binned expression datasets, and another one-to-many relationship from a binned dataset to the inferred networks (and their inferred edges) created by the selected inference algorithms. Records for each of these data types are permanently stored and connected to the appropriate records of the other data types. Among other design decisions, all inter-servlet communication is routed through a CentralControl servlet, for a clear (and reusable) flow of control. Each binning and inference algorithm is invoked in a separate Java thread that performs job posting to the database, thus allowing dynamic monitoring of job progress by the user. Jobs are timed to the millisecond, allowing algorithms to be compared on relative speed versus relative power.
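A minimal sketch of the job-per-thread design is given below, in Python rather than SEBINI's Java. The `jobs` dictionary stands in for the database's job records, and `post`, `run_job` and `toy_inference` are hypothetical names introduced only for this illustration.

```python
import threading
import time

# In-memory stand-in for the database's job table (an assumption for this sketch).
jobs = {}
jobs_lock = threading.Lock()


def post(job_id, **fields):
    """Record job progress so that a monitoring page could poll it."""
    with jobs_lock:
        jobs.setdefault(job_id, {}).update(fields)


def run_job(job_id, algorithm, data):
    """Run one binning or inference algorithm, posting status and elapsed time."""
    post(job_id, status="running")
    start = time.perf_counter()
    try:
        result = algorithm(data)
        status = "finished"
    except Exception as exc:  # report failures to the user instead of losing them
        result, status = repr(exc), "failed"
    elapsed_ms = round((time.perf_counter() - start) * 1000)
    post(job_id, status=status, elapsed_ms=elapsed_ms, result=result)


def toy_inference(data):
    """Placeholder algorithm: pretends to work, then returns one edge."""
    time.sleep(0.05)
    return [("geneA", "geneB")]


worker = threading.Thread(target=run_job, args=(1, toy_inference, None))
worker.start()
worker.join()  # a real web tier would poll the job record instead of blocking
print(jobs[1])
```

Because the elapsed time is stored with the job record, run times of different algorithms on the same dataset can be compared after the fact.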
SEBINI was initially implemented on a Dell desktop running Red Hat Linux, using Java ver. 1.4, PostgreSQL ver. 7.4 and Tomcat 4.1. SEBINI has also been installed on a Windows 2003 Web Server. Machine-specific parameters are stored in an easily changed properties text file. MathWorks' MATLAB is required for some of the inference algorithms.
| 3 SEBINI CAPABILITIES |
|---|
Capabilities of interest to all users include:
- Upload of several types of experimental data for input into selected binning and network inference algorithms.
- Several choices of inference methods from the growing toolkit, currently including algorithms from classical statistics (e.g. Pearson correlation, as a baseline), Bayesian networks (e.g. Hartemink et al., 2005 and Sachs et al., 2005) and information theory (mutual information-based; e.g. Margolin et al., 2006).
- Inference results (inferred networks) that can be permanently stored and further analyzed. For each network, the user can view a summary page, a topological characteristics and statistics page, a graph visualization using Cytoscape (Shannon et al., 2003) invoked via Java Web Start, summary pages for each node and edge showing the raw and binned node states, and job pages that record how the binning and inference tasks proceeded.
- Direct comparison of network inference methods on common synthetic or experimental datasets.
- A planning tool for experiments. How well can one do in reconstructing the edges in a genetic transcriptional network, given a set of gene expression data and a network of a given topology, using a given inference method? SEBINI will allow predictions, using different inference methods, on what can be reconstructed of the topology (regulatory connections) of a network of a given size and complexity.
- Export of inferred network structures as input to other tools (e.g. for dynamical modeling) and export of human-readable reports on the networks, with various topological characteristics noted.
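To make the baseline method in the list above concrete, the following Python sketch infers undirected edges by thresholding pairwise Pearson correlations. The expression values, the gene names and the rule of keeping edges with |r| at or above a fixed cutoff are assumptions for illustration; they are not SEBINI's implementation.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def infer_edges(expression, cutoff):
    """Return undirected edges (gene1, gene2, r) whose |r| meets the cutoff."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= cutoff:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Toy expression profiles across five conditions (made-up numbers).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],  # tracks geneA closely
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],  # mirrors geneA (negative correlation)
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to the others
}
edges = infer_edges(expression, cutoff=0.9)
for g1, g2, r in edges:
    print(g1, g2, r)
```

Correlation alone cannot orient an edge or tell direct from indirect influence, which is why it serves only as a baseline against which the other toolkit methods are compared.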
To support algorithm developers, SEBINI also allows the following:
- Artificial datasets (e.g., topologies, perturbations and node input functions) that can be dynamically created and stored.
- Dynamic, step-wise refinement of inference methods, based on results. Scoring measures (recall, precision, F-measure) are used to score performance against the simulated networks with known structure.
- Well-defined addition of new inference algorithms, binning techniques and import and export methods.
- Supervised or unsupervised training of inference methods, with supervised inference results scored against the known network topologies.
- A guide for the interpretation of the scores produced by an inference technique. SEBINI can produce scoring distributions for a given inference method against known networks. Such distributions could then be used to determine appropriate cutoff scores for determining the existence of a regulatory influence (an edge) to a target gene.
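The scoring measures and the cutoff selection described in the list above can be sketched as follows. This is an illustrative Python example under stated assumptions: edges are represented as directed (source, target) pairs, the edge scores are made up, and `score_network` and `best_cutoff` are hypothetical helper names rather than part of SEBINI.

```python
def score_network(inferred_edges, true_edges):
    """Precision, recall and F-measure of inferred edges against a known network."""
    inferred, true = set(inferred_edges), set(true_edges)
    true_positives = len(inferred & true)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def best_cutoff(edge_scores, true_edges, cutoffs):
    """Pick the cutoff whose thresholded network has the highest F-measure."""
    best = None
    for cutoff in cutoffs:
        kept = [edge for edge, score in edge_scores.items() if score >= cutoff]
        f_measure = score_network(kept, true_edges)[2]
        if best is None or f_measure > best[1]:
            best = (cutoff, f_measure)
    return best


# Known (simulated) directed network and made-up edge scores from some method.
true_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
edge_scores = {
    ("g1", "g2"): 0.95,
    ("g2", "g3"): 0.80,
    ("g1", "g3"): 0.60,  # indirect influence, not a true edge
    ("g3", "g4"): 0.40,
    ("g4", "g1"): 0.10,  # spurious
}

precision, recall, f_measure = score_network(
    [edge for edge, score in edge_scores.items() if score >= 0.5], true_edges
)
cutoff, best_f = best_cutoff(edge_scores, true_edges, [0.9, 0.7, 0.5, 0.3])
print(precision, recall, f_measure, cutoff, best_f)
```

A cutoff chosen this way on simulated networks of known structure could then be carried over, with caution, to experimental data of similar size and noise level.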
| 4 FUTURE WORK |
|---|
We will continue to add to the capabilities of SEBINI: additional inference and binning algorithms, methods of generating simulated datasets and import/export techniques. We are actively seeking algorithm developers as collaborators. We are exploring (1) refining or combining algorithms for improved results, (2) adding genome-specific annotation, (3) various types of post-processing to remove incorrect or indirect edges and (4) the use of SEBINI as a platform for automated fitting of logic functions (e.g. Istrail and Davidson, 2005) and state equations to the state sets tied to the source and target nodes for each inferred edge in SEBINI's database, as an additional step toward dynamic modeling.
| Acknowledgments |
|---|
This work has been supported by the US Department of Energy (DOE) through the Biomolecular Systems Initiative at PNNL, and also through PNNL's William R. Wiley Environmental Molecular Science Laboratory (EMSL) and the EMSL Grand Challenge in Membrane Biology project, via PNNL's Laboratory Directed Research and Development Program. PNNL is operated by Battelle for the DOE under contract DE-AC05-76RL01830. C.T. and M.B. were supported by the DOE Science Undergraduate Laboratory Internship (SULI) program.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alvis Brazma
Received on June 23, 2006; accepted on August 15, 2006
| REFERENCES |
|---|
Aburatani, S., et al. (2004) ASIAN: a website for network inference. Bioinformatics, 20, 2853–2856.
Hartemink, A., et al. (2005) Banjo (Bayesian Network Inference with Java Objects).
Ideker, T.E., et al. (2000) Discovery of regulatory interactions through perturbation: inference and experimental design. Pac. Symp. Biocomput., 5, 305–316.
Istrail, S. and Davidson, E.H. (2005) Logic functions of the genomic cis-regulatory code. Proc. Natl Acad. Sci. USA, 102, 4954–4959.
Margolin, A.A., et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7, Suppl. 1, S1–S7.
Mendes, P., et al. (2003) Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics, 19, Suppl. 2, ii122–ii129.
Sachs, K., et al. (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308, 523–529.
Shannon, P., et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498–2504.
Spirtes, P., Glymour, C. and Scheines, R. (2000) Causation, Prediction, and Search, 2nd edn. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA.
Tamada, Y., et al. (2003) Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics, 19, ii227–ii236.
van Someren, E.P., et al. (2002) Genetic network modeling. Pharmacogenomics, 3, 507–525.
Zak, D., et al. (2001) Simulation studies for the identification of genetic networks from cDNA array and regulatory activity data. Proceedings of the Second International Conference on Systems Biology, California Institute of Technology, CA, USA, pp. 231–238.