Bioinformatics Advance Access originally published online on May 12, 2007
Bioinformatics 2007 23(14):1866-1867; doi:10.1093/bioinformatics/btm255
DoriC: a database of oriC regions in bacterial genomes
Department of Physics, Tianjin University, Tianjin 300072, China
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Replication origins (oriCs) of the bacterial genomes currently available in GenBank have been predicted using a systematic method that combines Z-curve analysis of nucleotide distribution asymmetry, DnaA box distribution, genes adjacent to candidate oriCs and phylogenetic relationships. These oriCs are organized into a MySQL database, DoriC, which provides extensive information and graphical views of the oriC regions. In addition, users can Blast a query sequence, or even a whole genome, against DoriC to find a homologous oriC region. DoriC will be updated regularly; the latest version is DoriC 1.8, in which the oriCs of 425 genomes (468 chromosomes) are identified.
Availability: DoriC can be accessed from http://tubic.tju.edu.cn/doric/
Contact: ctzhang{at}tju.edu.cn
Supplementary information: Supplementary data are available at http://tubic.tju.edu.cn/doric/supplementary.htm
| 1 INTRODUCTION |
|---|
The initiation of replication is the central event in the bacterial cell cycle. However, oriC regions remain unknown in many of the bacterial genomes sequenced so far. Experimental methods for identifying oriCs in vivo are reliable but time-consuming and labor-intensive. In silico methods for identifying oriCs include GC-skew analysis (Grigoriev, 1998; Lobry, 1996) and the oligomer-skew method (Salzberg et al., 1998; Worning et al., 2006). Sequence analysis has revealed that an oriC region usually contains multiple 9mer consensus elements termed DnaA boxes. Combining three lines of evidence (GC skew, location of the dnaA gene and distribution of DnaA boxes) resulted in better prediction of oriC regions (Mackiewicz et al., 2004).
The Z-curve method is an alternative technique that detects the asymmetrical nucleotide distribution around oriCs. Using the Z-curve method, three oriCs were predicted in the genome of the archaeon Sulfolobus solfataricus (reviewed in Zhang and Zhang, 2005), and the prediction is consistent with recent experimental data (reviewed in Robinson and Bell, 2005). To identify oriCs extensively with high accuracy and reliability, an integrated in silico method to predict oriC regions of bacterial genomes has been developed, based on the Z-curve method, the distribution of DnaA boxes, indicator genes such as dnaA (dnaN, hemE, gidA ... or repC) and phylogenetic relationships. The present work consists of two main parts: identifying oriC regions and setting up the database DoriC.
| 2 METHODS AND RESULTS |
|---|
2.1 The procedure to identify oriC regions
The procedure to identify oriC regions (refer to Supplementary Fig. 1A) is described as follows.
- Extract all intergenic sequences according to the annotation files. Complete bacterial genomes and the related annotation files were downloaded from the NCBI ftp server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). In general, for each intergenic sequence, the number of DnaA boxes differing at no more than one position from the perfect Escherichia coli DnaA box (TTATCCACA) was counted. However, species-specific DnaA boxes were adopted for certain bacteria; in these cases, the species-specific DnaA boxes, either proposed by us or previously identified experimentally, were counted instead. For example, the DnaA box motif TTTTCCACA is universal for genomes in the phylum Cyanobacteria.
- Assign every intergenic sequence an oriC type (types 1–5), according to the number of DnaA boxes within it and its location relative to the dnaA gene or to the minimum of the GC disparity curve. The definitions and characteristics of the different oriC types are summarized in Supplementary Table 1.
- Once the oriC type and related information for every intergenic region are obtained, select one or two intergenic sequences with the highest oriC type priority (type 1 > type 2 > ··· > type 5) as candidate oriC regions.
- Finally, output the location, AT content, length, DnaA box number, type and the sequence of the identified oriC regions.
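The counting and selection steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0-based half-open coordinates and the handling of the reverse strand are assumptions made here, and the oriC type of each region is taken as given because the type definitions are in Supplementary Table 1 rather than in the text.

```python
from typing import List, Tuple

ECOLI_DNAA_BOX = "TTATCCACA"  # perfect E. coli DnaA box, as given in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def intergenic_regions(genes: List[Tuple[int, int]],
                       genome_len: int) -> List[Tuple[int, int]]:
    """Return the gaps between annotated genes as (start, end) pairs.

    Assumption: coordinates are 0-based and half-open, and wrap-around on
    circular chromosomes is ignored in this sketch.
    """
    gaps = []
    cursor = 0
    for start, end in sorted(genes):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # tolerate overlapping genes
    if cursor < genome_len:
        gaps.append((cursor, genome_len))
    return gaps


def count_dnaa_boxes(seq: str, motif: str = ECOLI_DNAA_BOX,
                     max_mismatch: int = 1) -> int:
    """Count 9mer windows within max_mismatch substitutions of the motif.

    Both strands are checked by also comparing against the reverse
    complement of the motif; each window is counted at most once.
    """
    seq = seq.upper()
    targets = (motif, reverse_complement(motif))
    k = len(motif)
    count = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for target in targets:
            mismatches = sum(a != b for a, b in zip(window, target))
            if mismatches <= max_mismatch:
                count += 1
                break
    return count


def pick_candidates(typed_regions, limit: int = 2):
    """Return up to `limit` regions, best oriC type first (type 1 > ... > type 5).

    typed_regions: list of (region, oric_type) pairs; how a type is
    assigned is outside this sketch.
    """
    ranked = sorted(typed_regions, key=lambda pair: pair[1])
    return [region for region, _ in ranked[:limit]]
```

For a species-specific motif, such as the cyanobacterial TTTTCCACA mentioned above, the same function can be called with `motif="TTTTCCACA"`.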
2.2 The procedure to set up DoriC
The procedure to set up DoriC (refer to Supplementary Fig. 1B) is described briefly as follows.
- Extract genome information, such as the organism's name, lineage, chromosome topology and dnaA gene location.
- Calculate the genome size, GC content, coordinates of the GC, AT, RY and MK disparity curves (Zhang and Zhang, 2005) and the precise coordinates of the extremes of the GC disparity curve; search for DnaA boxes and dif-like sequences on both strands; and identify oriC regions according to the procedure described above.
- Output the genome and oriC region information, together with integrated plots for the original and rotated sequences that display the results, such as general genome information, the four disparity curves, the distribution of DnaA boxes and the locations of dnaA genes, dif sites and oriC regions. In each rotated sequence, the coordinate origin is placed at the dif site or at the maximum of the GC disparity curve, so that the sequence begins and ends there.
- Organize the output information and integrated plots by using an open-source database management system, MySQL.
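The disparity curves used in the steps above can be illustrated with a short Python sketch. It assumes the usual Z-curve definitions, in which, over the first n bases, the RY curve is (A+G) − (C+T), the MK curve is (A+C) − (G+T), the AT disparity is A − T and the GC disparity is G − C; the function names and the treatment of ambiguous bases are choices made here for illustration only.

```python
from typing import Dict, List, Tuple


def disparity_curves(seq: str) -> Dict[str, List[int]]:
    """Return cumulative RY, MK, AT and GC disparity curves.

    Element n-1 of each list is the value over the first n bases.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    curves: Dict[str, List[int]] = {"RY": [], "MK": [], "AT": [], "GC": []}
    for base in seq.upper():
        if base in counts:  # ambiguous bases (e.g. N) leave the counts unchanged
            counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        curves["RY"].append((a + g) - (c + t))  # purine vs pyrimidine
        curves["MK"].append((a + c) - (g + t))  # amino vs keto
        curves["AT"].append(a - t)
        curves["GC"].append(g - c)
    return curves


def gc_extremes(seq: str) -> Tuple[int, int]:
    """Return 1-based positions of the first minimum and first maximum
    of the GC disparity curve."""
    gc = disparity_curves(seq)["GC"]
    i_min = min(range(len(gc)), key=gc.__getitem__)
    i_max = max(range(len(gc)), key=gc.__getitem__)
    return i_min + 1, i_max + 1


def rotate_to(seq: str, start: int) -> str:
    """Rotate a circular sequence so that 0-based index `start` comes first."""
    return seq[start:] + seq[:start]
```

In this sketch the minimum of the GC disparity curve is the position the text associates with candidate oriC regions, and `rotate_to` shows one simple way to re-origin a circular sequence at a chosen coordinate, such as the maximum of that curve.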
| 3 DATABASE CONTENT AND WEB INTERFACE |
|---|
DoriC is built on a relational database (MySQL), which allows rapid retrieval of data and makes the resource easy to maintain. In general, one entry corresponds to one genome (chromosome). However, for some genomes (chromosomes) the oriC region is split into two distinct sub-regions by the dnaA gene, resulting in two entries for one genome (chromosome). The database is accessed via a web interface based on PHP scripts, which provides various ways to search for DoriC entries, such as by organism's name, accession number, lineage, oriC type or keyword. Entries can be sorted by organism's name, accession number, genomic GC content or oriC type. In addition, users can Blast a query sequence, or even a whole genome, against DoriC to find a homologous oriC region. DoriC will be updated regularly; the latest version is DoriC 1.8, in which the oriCs of 425 genomes (468 chromosomes) are identified.
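To make the entry model concrete, the following Python sketch stores a few placeholder entries in a relational table and supports the kinds of searching and sorting described above. It is purely illustrative: the table name, column names and rows are hypothetical and are not the actual DoriC schema or data, and SQLite (from the Python standard library) stands in for MySQL so that the example runs without a server.

```python
import sqlite3

# Hypothetical schema; SQLite is used here as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE oric_entry (
           entry_id   INTEGER PRIMARY KEY,
           organism   TEXT NOT NULL,
           accession  TEXT NOT NULL,
           lineage    TEXT,
           gc_content REAL,
           oric_type  INTEGER,
           oric_start INTEGER,
           oric_end   INTEGER
       )"""
)

# Placeholder rows only. Organism A has two entries to mimic an oriC
# region split into two sub-regions by the dnaA gene.
conn.executemany(
    "INSERT INTO oric_entry (organism, accession, lineage, gc_content,"
    " oric_type, oric_start, oric_end) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Placeholder organism A", "ACC_A", "Placeholder lineage", 50.0, 1, 100, 400),
        ("Placeholder organism A", "ACC_A", "Placeholder lineage", 50.0, 1, 1800, 2100),
        ("Placeholder organism B", "ACC_B", "Placeholder lineage", 35.0, 2, 10, 300),
    ],
)

SORTABLE = {"organism", "accession", "gc_content", "oric_type"}


def search_entries(conn, accession=None, oric_type=None, order_by="organism"):
    """Return matching rows as (organism, accession, oric_type, start, end)."""
    if order_by not in SORTABLE:  # whitelist: column names cannot be bound as parameters
        raise ValueError(f"cannot sort by {order_by!r}")
    clauses, params = [], []
    if accession is not None:
        clauses.append("accession = ?")
        params.append(accession)
    if oric_type is not None:
        clauses.append("oric_type = ?")
        params.append(oric_type)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = (
        "SELECT organism, accession, oric_type, oric_start, oric_end"
        f" FROM oric_entry{where} ORDER BY {order_by}, entry_id"
    )
    return conn.execute(sql, params).fetchall()
```

For example, `search_entries(conn, accession="ACC_A")` returns both sub-region entries for the same placeholder chromosome, and `order_by="gc_content"` lists entries by genomic GC content.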
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Dr Ren Zhang for invaluable assistance. Technical support from Dr Hong-Yu Ou and Yan Lin is gratefully acknowledged. The present work was supported in part by NNSF of China (Grant No. 90408028).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on April 5, 2007; revised on April 29, 2007; accepted on May 4, 2007
| REFERENCES |
|---|
Grigoriev A. Analyzing genomes with cumulative skew diagrams. Nucleic Acids Res. (1998) 26:2286–2290.
Lobry JR. A simple vectorial representation of DNA sequences for the detection of replication origins in bacteria. Biochimie (1996) 78:323–326.
Mackiewicz P, et al. Where does bacterial replication start? Rules for predicting the oriC region. Nucleic Acids Res. (2004) 32:3781–3791.
Robinson NP, Bell SD. Origins of DNA replication in the three domains of life. FEBS J. (2005) 272:3757–3766.
Salzberg SL, et al. Skewed oligomers and origins of replication. Gene (1998) 217:57–67.
Worning P, et al. Origin of replication in circular prokaryotic chromosomes. Environ. Microbiol. (2006) 8:353–361.
Zhang R, Zhang CT. Identification of replication origins in archaeal genomes based on the Z-curve method. Archaea (2005) 1:335–346.