Bioinformatics Advance Access originally published online on May 12, 2007
Bioinformatics 2007 23(14):1831-1833; doi:10.1093/bioinformatics/btm252
BioMoby web services to support clustering of co-regulated genes based on similarity of promoter configurations
1 Centre de Regulacio Genomica, 2 Research Group in Biomedical Informatics, Institut Municipal d'Investigació Mèdica and Universitat Pompeu Fabra, Pg. Maritim de la Barceloneta, 08003 Barcelona, Catalonia, Spain and 3 EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Here we present a computational protocol to analyze the promoter regions of a given set of co-expressed genes, and its implementation using web services technologies. The protocol aims to cluster a set of co-regulated genes into subsets of genes showing similar configurations of transcription factor binding sites. All the steps of this protocol have been developed as web services that are compliant with the BioMoby specifications.
Availability: http://genome.imim.es/cgi-bin/moby/GeneClustering_DataSubmission.cgi
Contact: arnaud.kerhornou{at}crg.es
Supplementary information: Supplementary data are available at http://genome.imim.es/webservices/
| 1 INTRODUCTION |
|---|
With the completion of many sequencing projects, a tremendous amount of data, together with a growing number of analysis methods, is being made available to the scientific community through the Web. While these resources are of great help for data retrieval and quick hypothesis verification, they are mostly used through manual execution and cannot be applied to automated tasks. Manual execution is slow and, when repeated many times, error-prone.
In silico experiments, on the other hand, are described in protocols that can be seen as an orchestrated execution of atomic steps. Such computational protocols have commonly been implemented in a scripting language such as Perl. The various steps may be executed on local resources but, increasingly often, they use remote resources.
In this regard, the web services architecture (Web Services Architecture specification document, http://www.w3.org/TR/ws-arch/) has emerged to provide programmatic access to remote resources, thus allowing users to perform in silico experiments through the web in an automated manner (Stevens et al., 2004).
We have applied this technology to develop a pipeline for the characterization of the promoter regions of co-regulated genes. It is generally assumed that genes with similar transcriptional regulatory programs also exhibit similar configurations of transcription factor (TF) binding sites (TFBSs) in their promoter regions upstream of the transcription start site (TSS) (Wray et al., 2003). Because TFBSs are short DNA motifs (8–15 bp), they occur by chance very often in DNA sequences, so that predictions contain a high proportion of false positives. To differentiate false positive predictions from truly functional elements, new methods have been proposed (for a review, see Wasserman et al., 2004). In addition, promoter elements bound by the same TF may not show sequence similarity and, therefore, sequence comparisons between promoter elements of co-expressed genes often fail to reveal the underlying common regulatory domains. To address this limitation, Blanco et al. (2006) introduced TF-map alignments. In these, the TFBSs on each promoter sequence are labeled according to the corresponding TF, and the comparison is performed between the resulting sequences of labels. TF-map alignments have been shown to uncover common regulatory domains that cannot be detected by typical sequence comparisons. Here, we have developed and automated a protocol that clusters a given set of co-regulated genes into subsets of genes showing similar configurations of regulatory elements, as revealed by TF-map alignments.
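The idea of comparing sequences of TF labels rather than nucleotides can be illustrated with a minimal Python sketch. It is not the TF-map alignment algorithm of Blanco et al. (2006); it is an ordinary global alignment (Needleman-Wunsch) applied to label sequences, and the site lists, label names and scoring values below are assumptions chosen only for illustration.

```python
from typing import List, Tuple

# A predicted site is (TF label, start position); all values are invented.
Site = Tuple[str, int]


def to_labels(sites: List[Site]) -> List[str]:
    """Reduce a promoter's predicted sites to its ordered sequence of TF labels."""
    return [label for label, _pos in sorted(sites, key=lambda s: s[1])]


def align_labels(a: List[str], b: List[str], match: int = 2, gap: int = -1) -> int:
    """Score a global alignment of two label sequences.

    Identical labels score `match`; a mismatch or a skipped site costs `gap`.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else gap)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


# Two hypothetical promoters with the same TFs in the same order,
# but at different positions and with one extra site in the second.
promoter_a = [("SP1", 40), ("TBP", 120), ("CEBP", 75)]
promoter_b = [("SP1", 10), ("CEBP", 55), ("NFKB", 90), ("TBP", 160)]

labels_a = to_labels(promoter_a)
labels_b = to_labels(promoter_b)
label_score = align_labels(labels_a, labels_b)
```

In this toy case the two promoters share the label order SP1, CEBP, TBP even though the underlying nucleotide sequences need not resemble each other, which is the situation the label-based comparison is meant to capture.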
| 2 PROTOCOL |
|---|
The protocol is schematized in Figure 1. Given a set of gene identifiers, the first step automatically extracts the upstream sequences of the genes from the Ensembl database (Birney et al., 2006). It is also possible to provide the upstream sequences directly in FASTA format. The second step is the search for putative TFBSs in the sequences. Two public libraries of position weight matrices (PWMs) are available in our pipeline, Jaspar (Vlieghe et al., 2006) and Transfac (v6.4) (Matys et al., 2006). This step is performed using the MatScan software (E. Blanco, unpublished). The third step performs the pairwise alignments of the TFBS maps using the TF-alignment software (Blanco et al., 2006). In the fourth step, the pairwise alignment scores are parsed to generate a score matrix. In the fifth step, the SOTA clustering algorithm (Herrero et al., 2001) is applied to partition the gene space into clusters according to the scores of the alignments of the TFBS maps. Finally, for each gene cluster, the sixth step runs the multiple TF-map alignment software (Blanco et al., 2007) to define a consensus transcriptional regulatory pattern. To facilitate the analysis of the results, a graphical representation of the multiple TF-map alignment is produced using the gff2ps tool (Abril et al., 2000).
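Steps four and five can be illustrated with a minimal Python sketch, assuming the pairwise alignment scores have already been parsed into (gene, gene) to score entries. The gene names, scores and threshold are invented for illustration, and the grouping rule is plain single-linkage thresholding, swapped in here for the SOTA algorithm that the pipeline actually uses.

```python
from typing import Dict, List, Tuple


def build_score_matrix(
    genes: List[str], pair_scores: Dict[Tuple[str, str], float]
) -> List[List[float]]:
    """Turn pairwise alignment scores into a symmetric gene-by-gene matrix."""
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    matrix = [[0.0] * n for _ in range(n)]
    for (a, b), score in pair_scores.items():
        matrix[index[a]][index[b]] = score
        matrix[index[b]][index[a]] = score
    return matrix


def cluster_by_threshold(
    genes: List[str], matrix: List[List[float]], threshold: float
) -> List[List[str]]:
    """Group genes connected by any pair score >= threshold (single linkage).

    This is a stand-in for SOTA, not an implementation of it.
    """
    n = len(genes)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = {}
    for i, gene in enumerate(genes):
        groups.setdefault(find(i), []).append(gene)
    return sorted(groups.values())


# Hypothetical genes and scores, for illustration only.
genes = ["geneA", "geneB", "geneC", "geneD"]
pair_scores = {
    ("geneA", "geneB"): 9.0,
    ("geneA", "geneC"): 1.5,
    ("geneA", "geneD"): 1.0,
    ("geneB", "geneC"): 2.0,
    ("geneB", "geneD"): 0.5,
    ("geneC", "geneD"): 8.0,
}

score_matrix = build_score_matrix(genes, pair_scores)
clusters = cluster_by_threshold(genes, score_matrix, threshold=5.0)
```

With these invented scores, geneA and geneB fall into one group and geneC and geneD into another; each group would then be passed to the multiple TF-map alignment of step six.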
| 3 IMPLEMENTATION |
|---|
Each step of this procedure corresponds to a web service that has been implemented following the BioMoby specifications (Kawas et al., 2006). To facilitate the execution of the procedure, a data submission page has been set up at the following URL: http://genome.imim.es/cgi-bin/moby/GeneClustering_DataSubmission.cgi.
All the web services that compose our procedure have been registered, as synchronous services, in the primary BioMoby registry (http://mobycentral.icapture.ubc.ca), as well as in the one maintained by the Spanish National Bioinformatics Institute (http://www.inab.org/MOWServ), under the authority genome.imim.es. We have also prepared a workflow implementation that allows users to execute the procedure with the stand-alone application Taverna (Hull et al., 2006).
This pipeline was assessed on a set of genes extracted from Thompson et al. (2002), which reports the identification of a new module of co-expressed genes. Although these genes have been characterized as co-expressed, they may not show similar TFBS configurations. Our pipeline aims to cluster them into subgroups, where each group is defined by a co-regulated expression pattern. For example, our method clusters together the two TFs, PDEF and NUCKS. Likewise, SQSTM1 and FLJ0111 are clustered together. In addition, Thompson et al. have shown that PDEF activates the SQSTM1 promoter. Our results suggest that PDEF also activates the FLJ0111 promoter (see Supplementary Material).
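The chaining of one service per step can be pictured with a short Python sketch. It makes no BioMoby or Taverna calls: every function below is a hypothetical local stand-in for a remote service, and the names and placeholder data are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is a name plus a callable; in a real workflow the callable
# would wrap a remote service invocation (not shown here).
Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Feed the output of each step into the next and record the step order."""
    executed: List[str] = []
    for name, call in steps:
        data = call(data)
        executed.append(name)
    return data, executed


def fetch_upstream(gene_ids: List[str]) -> Dict[str, str]:
    """Stand-in for sequence retrieval: returns placeholder sequences."""
    return {gene: "N" * 10 for gene in gene_ids}


def measure_lengths(sequences: Dict[str, str]) -> Dict[str, int]:
    """Stand-in for an analysis step: reports the length of each sequence."""
    return {gene: len(seq) for gene, seq in sequences.items()}


steps: List[Step] = [
    ("fetch_upstream", fetch_upstream),
    ("measure_lengths", measure_lengths),
]
result, executed = run_workflow(steps, ["geneA", "geneB"])
```

Because each step only sees the output of the previous one, a local function can be replaced by a call to a remote resource without changing the rest of the chain.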
| 4 DISCUSSION |
|---|
The search for modules of cis-regulatory elements associated with co-expressed genes is still a challenging task. Blanco et al. (2006) have shown that comparisons of annotations of higher order domains can be more meaningful for characterizing the underlying functionality of sequences than direct comparisons at the sequence level. Based on this method, we have presented here a fully automated implementation of a protocol for the analysis of co-expressed genes. The different steps of the pipeline may be executed on different remote computational resources, but this is completely transparent to the user. The pipeline encompasses many steps that users would otherwise need to perform individually, and therefore ensures the repeatability of this in silico experiment. Furthermore, because BioMoby web services are formally described and their descriptions are published in a central registry, the services can more easily be integrated into other analysis pipelines.
Exposing an algorithm as a web service using the BioMoby framework is fairly straightforward. We believe that the BioMoby framework has matured over the last two years and that, combined with community support, it facilitates both the development of bioinformatics web services and their visibility.
| ACKNOWLEDGEMENTS |
|---|
We thank Miguel Pignatelli for useful discussions during this work, Enrique Blanco for feedback and helpful comments on the manuscript and Òscar Gonzalez for technical support. The work described here has been developed under grants from the Spanish Instituto Nacional de Bioinformática and the Spanish Ministerio de Educación y Ciencia.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on December 22, 2006; revised on April 23, 2007; accepted on May 4, 2007
| REFERENCES |
|---|
Abril JF, et al. (2000) gff2ps: visualizing genomic annotations. Bioinformatics, 16, 743–744.
Birney E, et al. (2006) Ensembl 2006. Nucleic Acids Res., 34, D556–D561.
Blanco E, et al. (2006) Transcription factor map alignment of promoter regions. PLoS Comput. Biol., 2, e49.
Blanco E, et al. (2007) Multiple non-collinear TF-map alignments of promoter regions. BMC Bioinformatics, 8, 138.
Herrero J, et al. (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126–136.
Hull D, et al. (2006) Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 34, W729–W732.
Kawas E, et al. (2006) BioMoby extensions to the Taverna workflow management and enactment software. BMC Bioinformatics, 7, 523.
Matys V, et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res., 34, D108–D110.
Stevens RD, et al. (2004) Exploring Williams-Beuren syndrome using myGrid. Bioinformatics, 20 (Suppl. 1), I303–I310.
Thompson HGR, et al. (2002) Identification and confirmation of a module of coexpressed genes. Genome Res., 12, 1517–1522.
Vlieghe D, et al. (2006) A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res., 34, D95–D97.
Wasserman WW, et al. (2004) Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet., 5, 276–287.
Wray GA, et al. (2003) The evolution of transcriptional regulation in eukaryotes. Mol. Biol. Evol., 20, 1377–1419.
