Bioinformatics Advance Access originally published online on November 3, 2007
Bioinformatics 2008 24(1):132-134; doi:10.1093/bioinformatics/btm529
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Synthetic microarray data generation with RANGE and NEMO

James Long 1,* and Mitchell Roth 2

1Biotechnology Computing Research Group, University of Alaska Fairbanks, PO Box 757000 and 2Department of Computer Science, University of Alaska Fairbanks, PO Box 756670, Fairbanks, AK 99775, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: For testing and sensitivity analysis purposes, it is beneficial to have known transcription networks of sufficient size and variability during development of microarray data and network deconvolution algorithms. Description of such networks in a simple language translatable to Systems Biology Markup Language would allow generation of model data for the networks.

Results: Described herein is software (RANGE: RAndom Network GEnerator) to generate large random transcription networks in the NEMO (NEtwork MOtif) language. NEMO is recognized by a grammar for transcription network motifs using lex and yacc to output Systems Biology Markup Language models for either specified or randomized gene input functions. These models of known networks may be input to a biochemical simulator, allowing the generation of synthetic microarray data.

Availability: http://range.sourceforge.net

Contact: jlong{at}alaska.edu


    1 INTRODUCTION
 
Algorithms that deconvolve biological networks from microarray data are an area of active research (Faith et al., 2007; Hu et al., 2005; Margolin et al., 2006; Zhou et al., 2005). Known transcription networks of sufficient size and variability facilitate objective testing and comparison of these algorithms, and can be used for sensitivity analysis. For several years, data of this type have been available from Mendes et al. (2003) as a set of synthetic networks consisting of 100 genes and 200 interactions, organized in two ways: as Erdős–Rényi random networks and as networks with scale-free topology. Since then, more has been learned about the structure of real biological networks. Alon (2006) summarizes this work as a series of network motifs and gives examples of how the motifs generalize topologically into larger structures, along with their connection hierarchy.

RANGE generates random transcription networks with up to 16 000 nodes that are constrained by these motifs and connection hierarchies. A network backbone composed of a series of DOR motifs (described below) is constructed iteratively, with other motifs hung from the regulated genes in each DOR. A master gene regulates the first transcription factor (TF) in each DOR, and is itself regulated by a set consisting of at least one gene from each DOR. The first TF in each DOR regulates the other TFs in its DOR. The node degrees of the master gene and the first TF in each DOR collectively constitute the ‘fat-tail’ of a power-law distribution, with the master gene having the highest degree. The network is thus scale-free for high-degree nodes, which represent roughly 20% of the node degree range. Node degrees for motifs hung on DOR genes follow an exponential distribution, clamped so that they do not alter the ‘fat-tail’ distribution.
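
The clamped exponential degree distribution described above can be illustrated with a short Python sketch. This is not RANGE's code: the function names, the mean degree and the rule of capping one below the smallest hub degree are assumptions made only for illustration.

```python
import random


def clamped_exponential_degree(rng, mean_degree, cap):
    """Draw one node degree from an exponential distribution, clamped to [1, cap]."""
    raw = rng.expovariate(1.0 / mean_degree)  # exponential with the requested mean
    return max(1, min(int(round(raw)), cap))


def sample_motif_degrees(n_motifs, mean_degree, smallest_hub_degree, seed=0):
    """Sample degrees for motifs hung on DOR genes (hypothetical helper).

    The cap keeps every sampled degree strictly below the smallest hub degree,
    so the high-degree 'fat tail' is left untouched.
    """
    if smallest_hub_degree < 2:
        raise ValueError("smallest_hub_degree must be at least 2")
    rng = random.Random(seed)
    cap = smallest_hub_degree - 1
    return [clamped_exponential_degree(rng, mean_degree, cap) for _ in range(n_motifs)]


degrees = sample_motif_degrees(n_motifs=1000, mean_degree=3.0, smallest_hub_degree=20)
```

Clamping rather than resampling is the simplest way to respect the cap; it slightly inflates the count of nodes sitting exactly at the cap, which may or may not match what RANGE does.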

The random network is generated in the NEMO language, which is recognized by a grammar written for yacc (Johnson, 1979). The parser outputs a model in Systems Biology Markup Language (SBML) (Hucka et al., 2003). The input function for each gene is either a specified function or, by default, a generalized Hill function (Alon, 2006; Likhoshvai and Ratushny, 2007; http://www.cs.unm.edu/~treport/tr/07-02/combinatorial-control-transcription-regulatory-networks.pdf) with randomized parameters that includes non-linear terms to account for TF interactions. The SBML model file is built with libSBML (Hucka et al., 2003) as the appropriate productions are recognized in the yacc grammar. Once generated, the model file may be input to a biochemical simulator, such as COPASI (Hoops et al., 2006), to generate synthetic microarray data. RANGE includes an R (Ihaka and Gentleman, 1996) script to add normally distributed noise to data exported from COPASI.
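
The noise step can be sketched in a few lines of Python. This is not the R script shipped with RANGE; the function name, the additive noise model and the value of sigma are assumptions chosen for illustration.

```python
import random


def add_gaussian_noise(values, sigma, seed=None):
    """Return a copy of `values` with independent N(0, sigma^2) noise added to each entry."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical expression levels for one gene at four time points.
clean = [0.10, 0.25, 0.40, 0.55]
noisy = add_gaussian_noise(clean, sigma=0.05, seed=1)
```

Fixing the seed makes a noisy data set reproducible, which is useful when the same synthetic data must be given to several deconvolution algorithms.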


    2 METHODS
 
2.1 Transcription network motifs
The transcription network motifs categorized by Alon (2006) include

  • Autoregulation—a gene that regulates itself.
  • Single Input Module (SIM)—one gene regulating a group of genes.
  • Feed-Forward Loop (FFL)—a constrained 3-node motif, some of which generalize topologically into larger structures called multi-output FFLs.
  • Dense Overlapping Regulon (DOR)—a densely connected bipartite graph of genes and TFs.

The NEMO language includes constructs for each of these motifs, as well as a gene and its TFs that are not part of any motif.

2.2 The NEMO language
In the NEMO language, the letter G followed by a unique number, e.g. G10, specifies a gene. Proteins are designated likewise, beginning with P; the appended number identifies the gene from which the protein is expressed. The simplest description of a gene and its TFs is the gene followed by a list of its TFs, e.g. G0(P1+, P2–), where + and – indicate up and down regulation, respectively. An explicit input function may be added by appending an :F() construct to the TF list; its contents must be a valid equation string recognized by libSBML (Hucka et al., 2003), e.g. G2(P3+:F(0.6/(1 + power(P3/1.27, 3)))). Groups of such descriptions that are not explicitly associated with any motif are passed as comma-separated arguments to a GLIST(). If they are part of a DOR, they are passed as arguments to a DOR(). Genes that are part of a SIM or FFL are passed as arguments to a TMLIST(), where the description of the gene and its TFs takes different forms. A network is a comma-separated series of DOR(), GLIST() and TMLIST() constructs enclosed in square brackets ([]). In a valid sentence of the language, each gene may appear only once, and no protein may appear whose gene does not appear.
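
The expression inside the :F() example has the form beta / (1 + (x/K)^n), a single-input Hill function whose value falls as the TF level x rises. The Python sketch below evaluates that form and its increasing counterpart. The function and parameter names are illustrative assumptions, and the generalized Hill function with TF interaction terms that RANGE uses by default is not reproduced here.

```python
def hill_decreasing(tf_level, beta, k, n):
    """beta / (1 + (x/K)^n): output falls from beta toward 0 as the TF level rises."""
    return beta / (1.0 + (tf_level / k) ** n)


def hill_increasing(tf_level, beta, k, n):
    """beta * (x/K)^n / (1 + (x/K)^n): output rises from 0 toward beta."""
    ratio = (tf_level / k) ** n
    return beta * ratio / (1.0 + ratio)


# Parameters taken from the :F() example: 0.6/(1 + power(P3/1.27, 3)).
rate_without_tf = hill_decreasing(0.0, beta=0.6, k=1.27, n=3)
rate_at_k = hill_decreasing(1.27, beta=0.6, k=1.27, n=3)
```

When the TF level equals K, both forms give half of beta, which is a convenient sanity check on any hand-written input function.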

Examples:

  • Autoregulation—G10 down regulates itself: G10(P10–).
  • Single Input Module (SIM)—P1 up regulates a group of genes: P1(+G2, G3, G4, G5).
  • Feed-Forward Loop (FFL)—P1 down regulates G2; G3 is down regulated by P2 (from G2), and up regulated by P1: P1(–G2–G3+).
  • Multi-output FFL—P1 down regulates G2; G3, G4, and G5 are down regulated by P2 (from G2), and up regulated by P1: P1(–G2–(G3,G4,G5)+).
  • Dense Overlapping Regulon (DOR)—G1, G2 and G3 are regulated by overlapping sets of the TFs P1, P2 and P3: DOR(G1(P1+,P2–), G2(P1–,P2+,P3+), G3(P1+,P2–)).

An example network is given in Figure 1.


Fig. 1. [GLIST(G0(P0–)), TMLIST(P0(+G1–G2–), P1(+G3,G4,G5,G6))].
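
The constructs above compose mechanically, so a NEMO sentence can be assembled with a few string helpers. The Python sketch below rebuilds the Figure 1 network. The helper names are hypothetical, the minus sign is written as an ASCII hyphen, and the spacing follows the caption; whether the actual lexer accepts this exact spelling and whitespace is an assumption.

```python
def gene(gene_id, tfs):
    """Gene with its TF list; `tfs` is a list of (protein_id, sign) pairs."""
    tf_list = ",".join(f"P{protein_id}{sign}" for protein_id, sign in tfs)
    return f"G{gene_id}({tf_list})"


def sim(regulator_id, sign, target_ids):
    """Single Input Module: one protein regulating a group of genes."""
    targets = ",".join(f"G{t}" for t in target_ids)
    return f"P{regulator_id}({sign}{targets})"


def ffl(x_id, y_id, z_id, sign_xy, sign_yz, sign_xz):
    """Feed-Forward Loop: Px regulates Gy; Gz is regulated by Py and by Px."""
    return f"P{x_id}({sign_xy}G{y_id}{sign_yz}G{z_id}{sign_xz})"


def glist(*items):
    return "GLIST(" + ", ".join(items) + ")"


def tmlist(*items):
    return "TMLIST(" + ", ".join(items) + ")"


def network(*constructs):
    return "[" + ", ".join(constructs) + "]"


fig1 = network(
    glist(gene(0, [(0, "-")])),                   # G0 down regulates itself
    tmlist(ffl(0, 1, 2, "+", "-", "-"),           # FFL over G0, G1, G2
           sim(1, "+", [3, 4, 5, 6])),            # P1 up regulates G3..G6
)
print(fig1)
```

Building sentences from small functions like these keeps gene numbering in one place, which makes it easier to respect the rule that each gene appears only once.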

 
2.3 The grammar and language translation
The grammar is context-free and is specified in BNF as input to yacc. A DOR motif must be a connected bipartite graph consisting of TFs and at least two regulated genes. This condition is checked before a string that otherwise satisfies the grammar is accepted, and a syntax error is raised if the check fails. At each point in the grammar where a gene and its TFs are recognized, yacc invokes libSBML (Hucka et al., 2003) routines to instantiate either the specified input function or a generalized Hill function (Alon, 2006; Likhoshvai and Ratushny, 2007; http://www.cs.unm.edu/~treport/tr/07-02/combinatorial-control-transcription-regulatory-networks.pdf) for that gene with randomized parameters. Upon reaching the start symbol, the entire network is written out to a text file. Genes, proteins, SBML math functions and the terminal symbols +, –, (, :, ), [, ] and F are recognized for the grammar by lex.
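
The DOR condition can be pictured as a graph check. The Python sketch below is one reading of "connected bipartite graph with at least two regulated genes", not the check implemented in the RANGE grammar; the function name and the dictionary input format are assumptions. Genes and TFs are kept as separate node sets, so the graph is bipartite by construction and only connectivity needs testing.

```python
from collections import deque


def is_valid_dor(regulators_by_gene):
    """Check a DOR given as {gene name: set of TF protein names}.

    Valid here means: at least two regulated genes, every gene has at least
    one TF, and the gene/TF graph forms a single connected component.
    """
    if len(regulators_by_gene) < 2:
        return False
    if any(not tfs for tfs in regulators_by_gene.values()):
        return False

    # Build an undirected adjacency map over ("gene", name) and ("tf", name) nodes.
    adjacency = {}
    for gene_name, tfs in regulators_by_gene.items():
        gene_node = ("gene", gene_name)
        adjacency.setdefault(gene_node, set())
        for tf_name in tfs:
            tf_node = ("tf", tf_name)
            adjacency[gene_node].add(tf_node)
            adjacency.setdefault(tf_node, set()).add(gene_node)

    # Breadth-first search from an arbitrary node; connected if every node is reached.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


# The DOR example from Section 2.2, with regulation signs dropped.
example_dor = {"G1": {"P1", "P2"}, "G2": {"P1", "P2", "P3"}, "G3": {"P1", "P2"}}
example_ok = is_valid_dor(example_dor)
```

A pair of genes with no TF in common, such as G1 regulated only by P1 and G2 only by P2, fails this check because the graph splits into two components.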

2.4 SBML and biochemical simulation
Many biochemical simulators accept SBML (Hucka et al., 2003) serialized in XML (http://www.xml.com) as input. COPASI (Hoops et al., 2006) is the simulator of choice for our work, distributed in both GUI and command line versions.


    3 RESULTS
 
COPASI (Hoops et al., 2006) successfully simulated a 6000-node RANGE network for 500 s on a Linux workstation with 4 GB of RAM. Figure 2 shows the response of 75 selected genes from a 500-node RANGE network run for 300 s with the default Hill input functions. COPASI output may be exported to a file for further processing.


Fig. 2. COPASI output for 75 genes from a 500-node network.

 

    4 SUMMARY
 
The NEMO language describes transcription networks and input functions in a straightforward manner and is readily compiled into SBML (Hucka et al., 2003). RANGE generates large random networks in the language that may be used to generate synthetic microarray data.


    ACKNOWLEDGEMENTS
 
This work is supported in part by Grant Number 5P20RR016466 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), with prior support from 2P20RR016466-040005 and 2P20RR016466-049001, PI Thomas Marr.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Olga Troyanskaya

Received on August 14, 2007; revised on October 12, 2007; accepted on October 14, 2007

    REFERENCES

    Alon U. (2006) An Introduction to Systems Biology: Design Principles of Biological Circuits. 1st edn. Chapman & Hall/CRC. ISBN-13: 978-1584886426.

    Faith JJ, et al. (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol., 5, 1.

    Hoops S, et al. (2006) COPASI – a COmplex PAthway SImulator. Bioinformatics, 22, 3067–3074.

    Hu H, et al. (2005) Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics (ISMB 2005), 21 (Suppl. 1), 213–221.

    Hucka M, et al. (2003) The Systems Biology Markup Language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19, 524–531.

    Ihaka R, Gentleman R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299–314.

    Johnson SC. (1979) YACC: yet another compiler-compiler. Unix Programmer's Manual, Vol. 2b.

    Likhoshvai V, Ratushny A. (2007) Generalized Hill function method for modeling molecular processes. J. Bioinform. Comput. Biol., 5, 521–531.

    Margolin AA, et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 (Suppl. 1), S7.

    Mendes P, et al. (2003) Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics, 19 (Suppl. 2), 122–129.

    Zhou XJ, et al. (2005) Functional annotation and network reconstruction through cross-platform integration of microarray data. Nat. Biotechnol., 23, 238–243.

