Bioinformatics Advance Access originally published online on November 3, 2007
Bioinformatics 2008 24(1):132-134; doi:10.1093/bioinformatics/btm529
Synthetic microarray data generation with RANGE and NEMO
1Biotechnology Computing Research Group, University of Alaska Fairbanks, PO Box 757000 and 2Department of Computer Science, University of Alaska Fairbanks, PO Box 756670, Fairbanks, AK 99775, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: For testing and sensitivity analysis purposes, it is beneficial to have known transcription networks of sufficient size and variability during development of microarray data and network deconvolution algorithms. Description of such networks in a simple language translatable to Systems Biology Markup Language would allow generation of model data for the networks.
Results: Described herein is software (RANGE: RAndom Network GEnerator) to generate large random transcription networks in the NEMO (NEtwork MOtif) language. NEMO is recognized by a grammar for transcription network motifs using lex and yacc to output Systems Biology Markup Language models for either specified or randomized gene input functions. These models of known networks may be input to a biochemical simulator, allowing the generation of synthetic microarray data.
Availability: http://range.sourceforge.net
Contact: jlong{at}alaska.edu
| 1 INTRODUCTION |
|---|
Algorithms that deconvolve biological networks from microarray data are currently an area of active research (Faith et al., 2007; Hu et al., 2005; Margolin et al., 2006; Zhou et al., 2005). Known transcription networks of sufficient size and variability facilitate objective testing and comparison of these algorithms, and can be used for sensitivity analysis. For several years, data of this type have been available from Mendes et al. (2003) as a set of synthetic networks consisting of 100 genes and 200 interactions organized in two ways: as Erdős–Rényi random networks and as scale-free networks. Since then, however, more has been learned about the structure of real biological networks. Alon (2006) summarizes this structure as a series of network motifs and gives examples of how the motifs generalize topologically into larger structures, along with their connection hierarchy.
RANGE generates random transcription networks with up to 16 000 nodes that are constrained by these motifs and connection hierarchies. A network backbone composed of a series of DOR motifs (described below) is constructed iteratively, with other motifs hung from regulated genes in each DOR. A master gene regulates the first transcription factor (TF) in each DOR, and is itself regulated by a set consisting of at least one gene from each DOR. The first TF in each DOR regulates the other TFs in its DOR. The node degrees of the master gene and the first TF in each DOR collectively constitute the fat tail of a power-law distribution, with the master gene having the highest degree. The network is thus scale-free for high-degree nodes, which represent roughly 20% of the node degree range. Node degrees for motifs hung on DOR genes follow an exponential distribution, clamped so that the fat-tail distribution is not altered.
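The clamping of exponentially distributed degrees can be illustrated with a short sketch. This is not RANGE's code: the function name, the mean degree and the cap are hypothetical and chosen only for illustration, since the text does not state the parameters RANGE uses.

```python
import math
import random


def sample_clamped_degrees(n_nodes, mean_degree, max_degree, seed=None):
    """Draw integer node degrees from an exponential distribution,
    clamped to the range [1, max_degree].

    All parameter values are illustrative assumptions, not RANGE defaults.
    """
    rng = random.Random(seed)
    degrees = []
    for _ in range(n_nodes):
        # expovariate takes the rate parameter, i.e. 1 / mean
        raw = rng.expovariate(1.0 / mean_degree)
        # round up so every node has at least one edge, then apply the cap
        degrees.append(min(max_degree, max(1, math.ceil(raw))))
    return degrees


degrees = sample_clamped_degrees(1000, mean_degree=3.0, max_degree=10, seed=1)
```

The cap plays the role of the clamp described above: no sampled motif degree can reach the range reserved for the high-degree backbone nodes.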
The random network is generated in the NEMO language, which is recognized by a yacc grammar (Johnson, 1979) that outputs a model in Systems Biology Markup Language (SBML) (Hucka et al., 2003). The input function for each gene is either a specified function or, by default, a generalized Hill function (Alon, 2006; Likhoshvai and Ratushny, 2007; http://www.cs.unm.edu/~treport/tr/07-02/combinatorial-control-transcription-regulatory-networks.pdf) with randomized parameters and non-linear terms that account for TF interactions. The SBML model file is built with libSBML (Hucka et al., 2003) as the appropriate productions are recognized in the grammar. Once generated, the model file may be input to a biochemical simulator, such as COPASI (Hoops et al., 2006), to generate synthetic microarray data. RANGE includes an R (Ihaka and Gentleman, 1996) script that adds normally distributed noise to data exported from COPASI (Hoops et al., 2006).
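The noise step can be sketched as follows. RANGE ships an R script for this purpose; the Python function below is only an illustration of the idea, and its name, the row-per-gene data layout and the standard deviation are assumptions rather than details taken from RANGE.

```python
import random


def add_normal_noise(series, sd, seed=None):
    """Return a copy of `series` with independent N(0, sd^2) noise added.

    `series` is assumed to be a list of rows, one row of expression
    values per gene; this layout is hypothetical.
    """
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sd) for value in row] for row in series]


clean = [[0.10, 0.20, 0.35], [0.50, 0.45, 0.40]]
noisy = add_normal_noise(clean, sd=0.05, seed=42)
```

Fixing the seed makes a noisy data set reproducible, which is convenient when several deconvolution algorithms are compared on the same synthetic input.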
| 2 METHODS |
|---|
2.1 Transcription network motifs
The transcription network motifs categorized by Alon (2006) include
- Autoregulation—a gene that regulates itself.
- Single Input Module (SIM)—one gene regulating a group of genes.
- Feed-Forward Loop (FFL)—a constrained 3-node motif, some of which generalize topologically into larger structures called multi-output FFLs.
- Dense Overlapping Regulon (DOR)—a densely connected bipartite graph of genes and TFs.
The NEMO language includes constructs for each of these motifs, as well as a gene and its TFs that are not part of any motif.
2.2 The NEMO language
In the NEMO language, the letter G followed by a unique number, e.g. G10, specifies a gene. Proteins are designated likewise with the letter P, where the appended number identifies the gene that encodes the protein. The simplest description of a gene and its TFs is the gene followed by a list of its TFs, e.g. G0(P1+, P2–), where + and – indicate up and down regulation, respectively. An explicit input function may be added by placing the equation within an :F() construct appended to the TF list, e.g. G2(P3+:F(0.6/(1 + power(P3/1.27, 3)))); the contents must be a valid equation string recognized by libSBML (Hucka et al., 2003). Groups of such descriptions that are not explicitly associated with any motif are passed as comma-separated arguments to a GLIST(). If they are part of a DOR, they are passed as arguments to a DOR(). Genes that are part of a SIM or FFL are passed as arguments to a TMLIST(), where the description of the gene and its TFs takes different forms. A network is a comma-separated series of DOR(), GLIST() and TMLIST() constructs enclosed in square brackets ([]). In a valid sentence of the language, each gene may appear only once, and no protein may appear whose gene does not appear.
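To make the explicit input function above concrete, the expression inside F() can be evaluated directly. The sketch below mirrors that expression, which decreases as P3 rises; the second function is the standard textbook increasing Hill form and is included only for comparison. The function and parameter names (beta, k, n) are hypothetical and do not come from NEMO or libSBML.

```python
def hill_decreasing(tf, beta, k, n):
    """beta / (1 + (tf / k) ** n): the form used in the F() example."""
    return beta / (1.0 + (tf / k) ** n)


def hill_increasing(tf, beta, k, n):
    """beta * (tf / k) ** n / (1 + (tf / k) ** n): textbook increasing form."""
    ratio = (tf / k) ** n
    return beta * ratio / (1.0 + ratio)


# Parameters taken from the F() example: beta = 0.6, k = 1.27, n = 3.
rate_without_tf = hill_decreasing(0.0, 0.6, 1.27, 3)    # equals beta
rate_at_threshold = hill_decreasing(1.27, 0.6, 1.27, 3)  # equals beta / 2
```

When the TF concentration equals k, both forms give half of beta, which is why k is usually read as the half-maximal concentration.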
Examples:
- Autoregulation, where G10 down regulates itself: G10(P10–).
- Single Input Module (SIM), where P1 up regulates a group of genes: P1(+G2, G3, G4, G5).
- Feed-Forward Loop (FFL), where P1 down regulates G2, and G3 is down regulated by P2 (from G2) and up regulated by P1: P1(–G2–G3+).
- Multi-output FFL, where P1 down regulates G2, and G3, G4 and G5 are down regulated by P2 (from G2) and up regulated by P1: P1(–G2–(G3,G4,G5)+).
- Dense Overlapping Regulon (DOR): DOR(G1(P1+,P2–), G2(P1–,P2+,P3+), G3(P1+,P2–)).
An example network is given in Figure 1.
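The simplest gene descriptions and the two validity rules (each gene appears once, and every protein has its gene present) can be checked with a few lines of code. The following is a minimal sketch and not the NEMO yacc grammar: it handles only descriptions of the form G0(P1+, P2–) without an :F() part, it assumes the minus sign may be written either as an ASCII hyphen or as the en dash printed here, and the function names are hypothetical.

```python
import re

GENE_RE = re.compile(r"^G(\d+)\((.*)\)$")
TF_RE = re.compile(r"^P(\d+)([+-])$")


def parse_gene(description):
    """Parse 'G0(P1+, P2-)' into (0, [(1, '+'), (2, '-')])."""
    # Normalize the typeset en dash to '-' and drop spaces (assumption).
    text = description.replace("\u2013", "-").replace(" ", "")
    match = GENE_RE.match(text)
    if match is None:
        raise ValueError("not a gene description: " + description)
    tfs = []
    for item in match.group(2).split(","):
        tf = TF_RE.match(item)
        if tf is None:
            raise ValueError("not a TF entry: " + item)
        tfs.append((int(tf.group(1)), tf.group(2)))
    return int(match.group(1)), tfs


def check_network(descriptions):
    """Apply the two validity rules and return {gene: [(protein, sign)]}."""
    parsed = [parse_gene(d) for d in descriptions]
    genes = [gene for gene, _ in parsed]
    if len(genes) != len(set(genes)):
        raise ValueError("a gene appears more than once")
    known = set(genes)
    for gene, tfs in parsed:
        for protein, _ in tfs:
            if protein not in known:
                raise ValueError("P%d appears but G%d does not" % (protein, protein))
    return dict(parsed)


# The DOR example from the list above, written with ASCII minus signs.
network = check_network(["G1(P1+,P2-)", "G2(P1-,P2+,P3+)", "G3(P1+,P2-)"])
```

In the full language, genes also appear as targets inside TMLIST() constructs, so a complete check would have to collect gene identifiers from those forms as well.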
2.3 The grammar and language translation
The grammar is context-free and is specified in BNF as input to yacc. A DOR motif must be a connected bipartite graph consisting of TFs and at least two regulated genes. This condition is evaluated before a string that would otherwise satisfy the grammar is accepted, and a syntax error is raised if the condition fails. At each point in the grammar where a gene and its TFs are recognized, the yacc action invokes libSBML (Hucka et al., 2003) routines to instantiate either the specified input function or a generalized Hill function (Alon, 2006; Likhoshvai and Ratushny, 2007; http://www.cs.unm.edu/~treport/tr/07-02/combinatorial-control-transcription-regulatory-networks.pdf) with randomized parameters for that gene. Upon reaching the start symbol, the entire network is written out to a text file. Genes, proteins, SBML math functions and the terminal symbols +, –, (, :, ), [, ] and F are recognized for the grammar by lex.
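The DOR condition can be expressed as a small graph check. The sketch below is an illustration under stated assumptions, not the check implemented in the grammar: it treats each gene and each TF as separate node types of a bipartite graph, takes a mapping from gene number to the numbers of its TFs as input, and uses a hypothetical function name.

```python
from collections import deque


def is_valid_dor(regulation):
    """Return True if the genes and TFs form a connected bipartite graph
    with at least two regulated genes.

    `regulation` maps a gene id to an iterable of TF (protein) ids.
    """
    if len(regulation) < 2:
        return False
    adjacency = {}
    for gene, tfs in regulation.items():
        for tf in tfs:
            adjacency.setdefault(("G", gene), set()).add(("P", tf))
            adjacency.setdefault(("P", tf), set()).add(("G", gene))
    # A gene without any TF is isolated, so the graph cannot be connected.
    if any(("G", gene) not in adjacency for gene in regulation):
        return False
    # Breadth-first search from an arbitrary node.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


# DOR(G1(P1+,P2-), G2(P1-,P2+,P3+), G3(P1+,P2-)) with regulation signs dropped.
dor_ok = is_valid_dor({1: [1, 2], 2: [1, 2, 3], 3: [1, 2]})
```

Two genes that share no TF, such as {1: [1], 2: [2]}, fail the check because the graph falls into two components.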
2.4 SBML and biochemical simulation
Many biochemical simulators accept SBML (Hucka et al., 2003) serialized in XML (http://www.xml.com) as input. COPASI (Hoops et al., 2006), which is distributed in both GUI and command-line versions, is the simulator of choice for our work.
| 3 RESULTS |
|---|
COPASI (Hoops et al., 2006) successfully simulated a 6000-node RANGE network for 500 s on a Linux workstation with 4 GB of RAM. Figure 2 shows the response of 75 selected genes from a 500-node RANGE network run for 300 s with the default Hill input functions. COPASI output may be exported to a file for further processing.
| 4 SUMMARY |
|---|
The NEMO language describes transcription networks and input functions in a straightforward manner and is readily compiled into SBML (Hucka et al., 2003). RANGE generates large random networks in the language that may be used to generate synthetic microarray data.
| ACKNOWLEDGEMENTS |
|---|
This work is supported in part by Grant Number 5P20RR016466 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), with prior support from 2P20RR016466-040005 and 2P20RR016466-049001, PI Thomas Marr.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Olga Troyanskaya
Received on August 14, 2007; revised on October 12, 2007; accepted on October 14, 2007
| REFERENCES |
|---|
Alon U. (2006) An Introduction to Systems Biology: Design Principles of Biological Circuits. 1st edn. Chapman & Hall/CRC. ISBN-13: 978-1584886426.
Faith JJ, et al. (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol, 5, 1.
Hoops S, et al. (2006) COPASI – a COmplex PAthway SImulator. Bioinformatics, 22, 3067–3074.
Hu H, et al. (2005) Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics (ISMB 2005), 21 (Suppl. 1), 213–221.
Hucka M, et al. (2003) The Systems Biology Markup Language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19, 524–531.
Ihaka R, Gentleman R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat, 5, 299–314.
Johnson SC. (1979) YACC: yet another compiler-compiler. Unix Programmer's Manual, Vol. 2b.
Likhoshvai V, Ratushny A. (2007) Generalized Hill function method for modeling molecular processes. J. Bioinform. Comput. Biol, 5, 521–531.
Margolin AA, et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 (Suppl. 1), S7.
Mendes P, et al. (2003) Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics, 19 (Suppl. 2), 122–129.
Zhou XJ, et al. (2005) Functional annotation and network reconstruction through cross-platform integration of microarray data. Nat. Biotechnol, 23, 238–243.

