CROC: finding chromosomal clusters in eukaryotic genomes

Pignatelli, Miguel; Serras, Florenci; Moya, Andrés; Guigó, Roderic; Corominas, Montserrat

doi:10.1093/bioinformatics/btp248

Abstract

Summary: There is increasing evidence that co-expression of genes that cluster along the genome is a common characteristic of eukaryotic transcriptomes. Several algorithms have been used to date to identify this kind of gene organization. Here, we present a web tool called CROC that aims to help in the identification and analysis of genomic gene clusters. The underlying method has been successfully used before to identify chromosomal clusters in different eukaryotic species.

Availability: The web server is freely available to non-commercial users at the following address: http://metagenomics.uv.es/CROC/

Contact: miguel.pignatelli@uv.es

1 INTRODUCTION

It is well documented that high-order organization of genes occurs in the chromosomes of many eukaryotes, such as yeast, worm, fly, mouse and human [see Hurst et al. (2004) for a comprehensive review]. In some cases, these genes appear to be functionally related (Blanco et al., 2008), evolutionarily conserved (de Wit et al., 2008; Poyatos and Hurst, 2007) or even to belong to the same protein–protein interaction network (Teichmann and Veitia, 2004). A recent study determined that most of the transcription factors in Saccharomyces cerevisiae tend to bind targets that are positionally clustered within a specific region on the chromosome (Janga et al., 2008). Different mechanisms such as gene duplication, shared promoter regions, chromatin regulation or shared functional pathway or tissue have been proposed to explain the existence of these clusters (Hurst et al., 2004; Lercher et al., 2003; Oliver et al., 2002).

To date, different statistical models have been used to assess the significance of physical clusters. Recently, a strategy consisting of serial hypergeometric distribution tests in a sliding window along the chromosomes has been successfully used in several studies (Chang et al., 2004; Coppe et al., 2006; Yi et al., 2007). Following this approach, Coppe et al. developed REEF, a Python program that implements this algorithm. Using this software, they were able to identify 44 significant tissue-specific clusters in the human genome (Coppe et al., 2006). Using a similar approach, Yi et al. (2007) found gene clusters that share a common Gene Ontology (GO) annotation term in different genomes.

Here we present CROC, a web-based tool for the identification of chromosomal clustering in a wide variety of eukaryotic genomes. It uses a sliding window approach combined with hypergeometric distribution tests to evaluate the statistical significance of the predicted clusters. The most important parameters of the algorithm are customizable through a simple and powerful web interface. The results are linked to external databases for further analysis. Recently, using this algorithm we were able to identify different conserved clusters governed by chromatin regulators in Drosophila melanogaster (Blanco et al., 2008).

2 IMPLEMENTATION

The core of the CROC application is implemented in Perl. The statistical analysis routines are written in C to maximize efficiency and are inlined directly in the Perl code. All scripts are available for download from the web server page. The web interface relies heavily on JavaScript routines to maximize user friendliness. Communication between the web interface and the application's core is handled via AJAX techniques, with JSON used for information exchange between the two.

2.1 Data input

The program expects a list of genes, for instance the genes up- or down-regulated in a microarray experiment. These are the genes that will be tested for physical clustering. Only IDs valid for the selected genome and version are allowed. For UCSC genomes, for example, both gene abbreviations (e.g. human ‘BCL2’ or ‘LDHAL6B’) and RefSeq IDs (e.g. NM_198576) are accepted. The reference genome and its version also need to be selected. Ready-to-analyze input examples for different organisms are available in the ‘Examples’ box. It is also possible to upload a custom reference.

2.2 Other options

Three other parameters that control the behavior of the algorithm can be adjusted. The first is the window type used in the sliding-window strategy. Two types of windows can be selected: gene-based and DNA length-based. In the first case, the offset between windows is always one gene. In the second, the offset can be set anywhere between 1 kb and the length of the window. The second parameter is the minimum number of input genes that define a cluster. Finally, the method for multiple testing correction can be selected.
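The two window types can be illustrated with a minimal Python sketch. This is not the CROC implementation (which is written in Perl and C); the function names, the gene tuple layout and the rule that a gene belongs to a length-based window when its start coordinate falls inside it are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical gene record: (name, start, end) on a single chromosome.
Gene = Tuple[str, int, int]


def gene_windows(genes: List[Gene], size: int) -> List[List[Gene]]:
    """Gene-based windows: `size` consecutive genes, offset of one gene."""
    ordered = sorted(genes, key=lambda g: g[1])
    if len(ordered) < size:
        return []
    return [ordered[i:i + size] for i in range(len(ordered) - size + 1)]


def length_windows(genes: List[Gene], chrom_len: int, size_bp: int,
                   offset_bp: int) -> List[Tuple[int, int, List[Gene]]]:
    """Length-based windows: fixed span in bp with a user-chosen offset."""
    # The text allows offsets between 1 kb and the window length.
    if not 1000 <= offset_bp <= size_bp:
        raise ValueError("offset must lie between 1 kb and the window length")
    windows = []
    start = 0
    while start < chrom_len:
        end = min(start + size_bp, chrom_len)
        # Assumption: a gene is assigned to a window by its start coordinate.
        inside = [g for g in genes if start <= g[1] < end]
        windows.append((start, end, inside))
        start += offset_bp
    return windows
```

With an offset equal to the window length the length-based windows do not overlap; smaller offsets give overlapping windows, and therefore more tests per chromosome.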

2.3 Algorithm

Once an analysis is submitted, all gene abbreviations in the list are converted to RefSeq identifiers. Transcript information is discarded, so only one transcript per gene is considered. The program then scans each chromosome with the defined sliding window. For each window, the probability of obtaining by chance the observed number of input genes in the window is calculated using a hypergeometric distribution test [Equation (1)].

(1)
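For orientation, the standard hypergeometric form consistent with the symbols defined in the paragraph that follows can be written as below, taking |G| and |A| as the sizes of the sets G and A, k as the number of input genes on the chromosome and n as the number of input genes in the window. Whether CROC evaluates the point probability or the upper tail is an assumption here, so both are shown.

```latex
P(X = n) = \frac{\binom{|A|}{n}\,\binom{|G|-|A|}{k-n}}{\binom{|G|}{k}},
\qquad
P(X \ge n) = \sum_{i=n}^{\min(k,\,|A|)}
  \frac{\binom{|A|}{i}\,\binom{|G|-|A|}{k-i}}{\binom{|G|}{k}}
```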

In this equation, for each window W of the chromosome C, G represents the set of genes in C, A is the set of genes in G that are in W, k is the number of genes in the input list that are in C and n is the number of those k genes that are in W. Screening many windows raises the multiple testing problem. To address it, the program offers different strategies for false discovery rate (FDR) correction of the type I error probability (0.05) (Benjamini and Hochberg, 1995; Benjamini and Liu, 1999). In a final step, consecutive ‘positive’ windows are merged.
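The three steps described above (per-window test, FDR correction, merging of consecutive positive windows) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the CROC code: the upper-tail probability is used, only the Benjamini and Hochberg (1995) step-up procedure is shown, windows are merged purely by adjacency of their indices, and all function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def hypergeom_tail(G: int, A: int, k: int, n: int) -> float:
    """P(X >= n): chance of seeing at least n input genes in a window.

    G: genes on the chromosome, A: genes in the window,
    k: input genes on the chromosome, n: input genes in the window.
    """
    total = comb(G, k)
    upper = min(k, A)
    # math.comb returns 0 when the lower index exceeds the upper one.
    hits = sum(comb(A, i) * comb(G - A, k - i) for i in range(n, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up FDR procedure: flag each window as positive or not."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0  # largest rank whose p-value is below its threshold
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    positive = [False] * m
    for idx in order[:cutoff]:
        positive[idx] = True
    return positive


def merge_positive(flags: List[bool]) -> List[Tuple[int, int]]:
    """Merge runs of consecutive positive windows into (first, last) pairs."""
    clusters = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            clusters.append((start, i - 1))
            start = None
    if start is not None:
        clusters.append((start, len(flags) - 1))
    return clusters
```

For example, on a chromosome with 10 genes of which 4 are in the input list, a window of 3 genes that contains 3 input genes gives `hypergeom_tail(10, 3, 4, 3)`, which is 7/210, about 0.033.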

2.4 Output

The output is divided into three sections. The first one (‘Stats’) shows a brief summary of the analysis performed. The second section (‘Chromosomes’) summarizes the results by chromosome. A graphical representation of each chromosome can be generated by clicking the ‘Plot’ button in the last field of its row. In this plot, gray lines represent all the genes present in the chromosome, red lines represent the input genes and the blue lines over the genes represent clusters. Because these blue lines are typically very short, blue arrows are drawn over them to ensure that all clusters are visible. When one of the available genomes is used, these arrows link to the UCSC/Ensembl genome browser, where clusters are uploaded as a ‘custom track’ for further analysis. The third section (‘Clusters’) gives individual information for each cluster: its chromosomal coordinates, the associated P-value of the statistical analysis, the input genes that form the cluster and a plot representing the cluster itself.
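As an illustration of what a cluster ‘custom track’ can look like, the sketch below writes clusters in BED format, which genome browsers such as UCSC accept for custom tracks. The exact format CROC uploads is not stated in the text, so the track name, description, labels and the function itself are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical cluster record: (chromosome, start, end, label).
# BED coordinates are 0-based and half-open.
Cluster = Tuple[str, int, int, str]


def clusters_to_bed(clusters: Iterable[Cluster],
                    track_name: str = "CROC_clusters") -> str:
    """Render clusters as a BED custom track (track line plus one row each)."""
    lines = [f'track name="{track_name}" description="Predicted gene clusters"']
    for chrom, start, end, label in clusters:
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"
```

Calling `clusters_to_bed([("chr2L", 1000, 5000, "cluster_1")])` returns a two-line track that can be pasted into a browser's custom track form.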

3 CONCLUSION

The tool presented here aims to help in the identification of gene clusters in eukaryotic chromosomes. The input list can represent genes that share any biological feature, such as co-expression, having a common transcription factor in their proximal promoter, belonging to the same GO term or being fast-evolving genes. We have recently demonstrated the usefulness of this algorithm in studying clustered genes governed by chromatin regulators during the development of D. melanogaster (Blanco et al., 2008).

ACKNOWLEDGEMENTS

We thank Ana Pamblanco for help with the web page design and Juanjo Abellán for his helpful comments. MP was supported by a Juan de la Cierva postdoctoral contract from the Ministerio de Educación y Ciencia (MEC), Spain.

Funding: This work was supported by the Ministerio de Educación y Ciencia, Spain [BMC2003-05018, BMC2006-07334].

Conflict of Interest: none declared.

REFERENCES

Blanco E, et al. Conserved chromosomal clustering of genes governed by chromatin regulators in Drosophila, Genome Biol., 2008, vol. 9 pg. R134

Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical and powerful approach to multiple testing, J. R. Stat. Soc. B, 1995, vol. 57 (pg. 289-300)

Benjamini Y, Liu W. A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence, J. Stat. Plann. Inference, 1999, vol. 82 (pg. 163-170)

Chang CF, et al. Calculating the statistical significance of physical clusters of co-regulated genes in the genome: the role of chromatin in domain-wide gene regulation, Nucleic Acids Res., 2004, vol. 32 (pg. 1798-1807)

Coppe A, et al. REEF: searching REgionally Enriched Features in genomes, BMC Bioinformatics, 2006, vol. 453

de Wit E, et al. Global chromatin domain organization of the Drosophila genome, PLoS Genet., 2008, vol. 4 pg. e1000045

Hurst LD, et al. The evolutionary dynamics of eukaryotic gene order, Nat. Rev. Genet., 2004, vol. 5 (pg. 299-310)

Janga SC, et al. Transcriptional regulation constrains the organization of genes on eukaryotic chromosomes, Proc. Natl Acad. Sci. USA, 2008, vol. 105 (pg. 15761-15766)

Lercher MJ, et al. Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes, Genome Res., 2003, vol. 13 (pg. 238-243)

Oliver B, et al. Gene expression neighborhoods, J. Biol., 2002, vol. 1 pg. 4

Poyatos JF, Hurst LD. The determinants of gene order conservation in yeasts, Genome Biol., 2007, vol. 8 pg. R233

Teichmann SA, Veitia RA. Genes encoding subunits of stable complexes are clustered on the yeast chromosomes: an interpretation from a dosage balance perspective, Genetics, 2004, vol. 167 (pg. 2121-2125)

Yi G, et al. Identifying clusters of functionally related genes in genomes, Bioinformatics, 2007, vol. 23 (pg. 1053-1060)

Author notes

Associate Editor: Dmitrij Frishman
