Bioinformatics Advance Access originally published online on January 10, 2006
Bioinformatics 2006 22(5):527-531; doi:10.1093/bioinformatics/btk033
IWoCS: analyzing ribosomal intergenic transcribed spacers configuration and taxonomic relationships
Evolutionary Genomics Group and Division de Microbiologia, Universidad Miguel Hernandez Campus de San Juan, Apartado 18, 03550 San Juan de Alicante, Spain
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: The use of 16S-23S Intergenic Transcribed Spacer (ITS) sequences for bacterial typing has recently increased. The presence of conserved regions, such as tRNA genes or boxes, together with hypervariable regions allows intraspecific discrimination of very closely related bacterial strains. On the other hand, this mosaic of variability makes the ITS a difficult sequence to analyze and compare.
Results: Software to study ITSs by a Word Count based System (IWoCS) is proposed. A large dataset of ITS sequences was created (comprising 7355 sequences). A database recording all the occurrences of possible n-mers (tags) describing each ITS sequence was created (with n going from 5 to 13), including 32 061 819 entries. The database allows ITS sequences submitted by users to be analyzed through a web-based interface. The abundance in the database of each n-mer is given in a one-base sliding frame. A dominance plot reflects how common the tags are within different taxonomic levels. The resulting profile identifies highly repeated tags as evolutionarily conserved regions (like tRNAs or boxes) and low-frequency tags as regions specifically associated with taxonomic groups. The study of the dominance and abundance profiles combined with the taxonomy reports provides a novel tool for the use of the ITS in bacterial typing and identification.
Availability: The database is freely accessible at http://egg.umh.es/iwocs/
Contact: gdauria{at}umh.es
| 1 INTRODUCTION |
|---|
The 16S-23S Intergenic Transcribed Spacers (ITSs) are characterized by an evident mosaicism of highly conserved and variable segments, somewhat reminiscent of the rRNA itself but at a much more refined scale of taxonomic relationship (genus-species level), where the ribosomal genes become nearly useless. For that reason the ITS represents an important element in strain typing for environmental (Garcia-Martinez and Rodriguez-Valera, 2000) or clinical purposes (Clementino et al., 2001; Baudart et al., 2000). ITS sequences have multiple functional roles, such as (1) the presence of secondary structure at the beginning and the end of the spacer that pairs with sequences upstream of the 16S rRNA gene and downstream of the 23S rRNA gene, respectively, to allow its excision, (2) the presence of antiterminator boxes to avoid premature termination of transcription (Iteman et al., 2000) and (3) the presence of tRNA genes. This landscape of high-moderate-low conservation makes the ITS a potentially useful model for the study of functional motifs in other spacer regions of prokaryotic genomes (Boyer et al., 2001) as well as a powerful identification marker.
Possibly the main obstacle to the wide use of the ITS for structural or phylogenetic descriptions lies in the difficulties encountered in aligning and comparing these highly variable sequences.
The search for genomic signatures by counting words of di-, tri-, tetra- and n-nucleotides has been applied to analyze non-coding regions (Deschavanne et al., 1999; Wang et al., 2005; Teeling et al., 2004). In Archaea, the presence of repeated oligonucleotides within these regions was shown to be species-specific, representing an important element for evolutionary studies and genome mapping (Fadiel et al., 2003).
In this article, a system for representing the n-oligomer abundance of 16S-23S ribosomal ITSs is presented. A database of n-mer counts from an ITS dataset was compiled. The web interface allows the abundance of each n-mer of a submitted sequence to be studied.
Currently there are no available methods to determine whether a specific sequence is evolutionarily conserved or associated with a particular taxon or group of strains. This can be determined by comparing n-mer abundance in the complete database with statistical dominance (frequency within a given cluster). For this reason, the phylogenetic distribution of taxa containing the reported tag, in terms of dominance, is also provided. Hence those values (abundance and dominance) are relevant to structural and functional ITS analysis and also help to establish taxonomic identity. We have applied the software to three model examples to test the reliability of the system.
| 2 SYSTEM AND METHODS |
|---|
2.1 Database structure
Ad hoc queries were used to retrieve ITS sequences from NCBI (http://www.ncbi.nlm.nih.gov/). The IWoCS database also contains all the sequences from the Ribosomal Internal Spacer Sequence Collection (RISSC, Garcia-Martinez et al., 2001) and the ITS sequences of the Micro-Mar database (Pushker et al., 2005). All the sequences were stored in MySQL (http://www.mysql.com) relational tables. The database contains 7355 ITS sequences. A summary of the sequences present in the database at phylum, class and order level is shown in Table 1. Each sequence was processed with the wordcount program from EMBOSS (Rice et al., 2000) for word lengths from 5 to 13 nt. The results were stored in MySQL tables, producing >32 million entries representing the occurrences of all identified mers. IWoCS runs on a LAMP (Linux + Apache + MySQL + Perl) system, a widely used open source web platform. All the web pages follow the HTML 4.01 standard and use CSS for consistent styling. Table 2 shows the number of sequences present in the database for each n-mer.
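The indexing step described above can be illustrated with a minimal Python sketch. It is not the IWoCS implementation (which uses EMBOSS wordcount and MySQL tables); the function names `iter_mers` and `build_index`, the in-memory dictionary and the toy sequences are all assumptions made for illustration. The sketch records, for every tag, the set of sequences that contain it.

```python
from collections import defaultdict


def iter_mers(seq, n):
    """Yield (position, n-mer) pairs using a one-base sliding frame."""
    seq = seq.upper()
    for i in range(len(seq) - n + 1):
        yield i, seq[i:i + n]


def build_index(sequences, n_min=5, n_max=13):
    """Map each n-mer (tag) to the set of sequence IDs that contain it.

    `sequences` is a dict {sequence_id: nucleotide string}. The returned
    dict is a hypothetical in-memory stand-in for the relational tables.
    """
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for n in range(n_min, n_max + 1):
            for _, mer in iter_mers(seq, n):
                index[mer].add(seq_id)
    return index


# Toy data (hypothetical sequences, not taken from the IWoCS database).
toy = {"s1": "ACGTACGTAC", "s2": "TTACGTACGG", "s3": "GGGGGCCCCC"}
index = build_index(toy, n_min=5, n_max=6)
print(sorted(index["ACGTA"]))  # IDs of the toy sequences containing ACGTA
```

Storing sets of sequence identifiers rather than raw counts is a design choice of this sketch: it makes it straightforward to ask later how many sequences, and which taxa, share a given tag.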
2.2 User interface
The web interface allows one or multiple sequences to be submitted in FASTA format for a first analysis, which searches for tRNAs with the tRNAscan-SE software (Lowe and Eddy, 1997). The retrieved fragments are presented in a graphical view, and separate BLAST (Altschul et al., 1997) searches within the database can be run on them. The BLAST results are displayed in various formats, and the BLAST hit sequences can be downloaded along with their details.
The core of the software is the word count process, performed on the whole submitted sequences or on selected fragments. Before the sequences are processed by wordcount, some parameters have to be fixed: the mer length, the taxonomy level for the dominance calculation and the fragments to analyze. When the taxonomy level is selected, sequences lacking the corresponding phylogenetic information are excluded from the calculation of the dominance/abundance profiles.
The results page shows a global view with bar charts for each submitted sequence. Each chart displays the sequence, nucleotide by nucleotide, on the X-axis, where each base is taken as the first base of the n-mer that follows. On the Y-axis, the abundance (Y1) and the dominance value (Y2) are plotted for each mer. The dominance value ranges from 0 (all taxa are equally present) to 1 (one taxon dominates the whole community) and is calculated for each mer along the submitted sequence by the following equation (Harper, 1999):
[Equation image not available in the source]
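As an illustration of how the two plotted values could be obtained, the sketch below assumes a Simpson-type dominance index, D = sum over taxa of (n_i / n)^2, where n_i is the number of database sequences of taxon i carrying the tag and n is the total number of sequences carrying it. This form is an assumption, chosen because it matches the stated range (it equals 1 when a single taxon carries the tag and approaches 0 as the tag is spread evenly over many taxa). The function names and toy data are hypothetical.

```python
from collections import Counter


def dominance(taxon_counts):
    """Simpson-type dominance D = sum((n_i / n) ** 2) over taxa.

    Assumed form of the index; returns 0.0 when no sequence carries the tag.
    """
    total = sum(taxon_counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in taxon_counts.values())


def profile(query, n, index, taxonomy):
    """Return [(position, mer, abundance, dominance)] along `query`.

    `index` maps tag -> set of sequence IDs; `taxonomy` maps sequence ID ->
    taxon name at the chosen level. Sequences without a taxon at that level
    are left out of both values, as described in the text.
    """
    rows = []
    query = query.upper()
    for i in range(len(query) - n + 1):
        mer = query[i:i + n]
        hits = [s for s in index.get(mer, ()) if s in taxonomy]
        counts = Counter(taxonomy[s] for s in hits)
        rows.append((i, mer, len(hits), dominance(counts)))
    return rows


# Toy index and taxonomy (hypothetical values for illustration only).
toy_index = {"ACGTA": {"s1", "s2", "s3", "s4"}, "CGTAC": {"s1"}}
toy_taxonomy = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
rows = profile("ACGTAC", 5, toy_index, toy_taxonomy)
for row in rows:
    print(row)
```

In this toy case the first tag is shared equally by two taxa and the second is found in a single taxon, so the two positions sit at opposite ends of the specificity scale.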
2.3 Sequences studied to test the software
In the first case study, a single PCR-amplified sequence belonging to the class Alphaproteobacteria from seawater of Yaquina Bay (Oregon; accession number AY033321; Suzuki et al., 2001) was employed to test the sensitivity of the software. In the second, the ITS of Alteromonas macleodii strain ATCC 27126 (AY831613) and a previously published Alteromonas-related ITS sequence (AF408824; Med-SF2, Garcia-Martinez et al., 2002) are compared with a new set of original ITS sequences obtained in our laboratory from environmental samples of surface water (50 m; hereafter I50) and deep water (3000 m; hereafter I3K) of the Ionian (Mediterranean) Sea: AY933016 (I3K-509ITS), AY933025 (I50-284ITS), AY933111 (I3K-543ITS), AY932944 (I3K-242ITS), AY933368 (I3K-554ITS), AY933369 (I50-008ITS) and AY933375 (I50-356ITS) (this article). In the third case study, ITS sequences related to the pathogenic bacterium Pseudomonas syringae and to Pseudomonas fluorescens were analyzed (Milyutina et al., 2004), with accession numbers AY582360 (S1 W-920), AY582361 (S2 W-920), AY582373 (F3 W-27-10), AY582371 (F4 W-27-10), AY582372 (F5 W-27-10), AY582367 (F6 R-915), AY582365 (F7 B-t5) and AY582363 (F8 W-27-10).
| 3 DISCUSSION |
|---|
3.1 First case study (separating functional regions from noise)
By tuning the mer length during the analysis of a sequence, it is possible to observe the effect on specificity. Figure 1 shows the reduction of noise that results from the lower probability of finding longer tags in the database. For this study a sequence containing two tRNAs was used. Highly conserved regions such as the two tRNAs can be identified, while the rest of the ITS region shows no conserved features. With 5 and 6 nt mers, noise is more frequent because short mers are highly repeated. From 11 to 13 nt, only conserved regions are identified. In this study, apart from the two identified tRNAs (alanine and isoleucine), three other high-frequency regions are visible (hf1-1, hf1-2 and hf1-3). The hf1-1 and hf1-3 regions are very abundant and belong to the phyla Proteobacteria and Bacteroidetes. The hf1-2 region represents part of the boxB region described by Milyutina et al. (2004). These boxes become visible using 6 and 7 nt frames but disappear in the 8 nt study (just one base more), which suggests that this 7 nt tag is functionally important. A preliminary screening of the wordcount results from 5 nt to at least 10 nt is suggested in order to find the most informative length for the analysis of the ITS. However, for general purposes, such as box identification or primer design, a frame range between 7 and 10 nt is recommended.
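The suggested preliminary screening over several mer lengths can be sketched as follows. This is a minimal illustration, not the IWoCS code: abundance is taken here to be the number of database sequences containing the tag, the summary statistics (maximum abundance and share of positions with at least one hit) are arbitrary choices, and the sequences are hypothetical.

```python
def abundance_profile(query, database, n):
    """Abundance of each n-mer of `query`: the number of database
    sequences that contain it (one-base sliding frame)."""
    query = query.upper()
    return [
        sum(query[i:i + n] in seq for seq in database)
        for i in range(len(query) - n + 1)
    ]


def screen_lengths(query, database, lengths=range(5, 11)):
    """Summarise each mer length as (max abundance, share of positions hit)."""
    summary = {}
    for n in lengths:
        prof = abundance_profile(query, database, n)
        hit_share = sum(a > 0 for a in prof) / len(prof) if prof else 0.0
        summary[n] = (max(prof, default=0), hit_share)
    return summary


# Toy database and query (hypothetical sequences for illustration only).
database = ["AAACGTACGTTT", "CCACGTACGAAA", "GGGACGTAGGG"]
query = "TTACGTACGTCC"
for n, (peak, share) in screen_lengths(query, database, [5, 8]).items():
    print(n, peak, share)
```

In the toy data the share of positions with any hit falls as the frame grows, which mirrors the loss of background noise at longer mer lengths described above.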
3.2 Second case study (Comparing new with other previously analyzed sequences)
An 8-mer frame study at order level was carried out. No tRNA was identified in this cluster of sequences. The abundance plots show four highly frequent (hf2) regions, two longer conserved islands (ci2) and two regions suggested as probe or primer targets (sp2) (Figure 2). The hf2-1 region was identified only in the two reference strains, ATCC 27126 and Med-SF2, and in I50-356ITS, I50-008ITS and I3K-354ITS. This tag is mainly shared among sequences belonging to the order Alteromonadales. The second highly frequent region, hf2-2 (5'TGCTTTGC3'), is specific to only one sequence (I3K-543ITS). The hf2-3 region was identified only in two very similar sequences, I50-008ITS and I3K-554ITS. This is another very abundant 8 nt region (5'TTTGCACG3'), shared by 502 sequences in the database; the taxonomy report for this tag shows that it is present mainly in Gammaproteobacteria-related sequences. Figure 3 shows the marked changes in hf2-3 abundance on passing from the 6 to the 7 nt study (a decrease in frequency of 78%) and from the 8 to the 9 nt study (97.4% less), with a corresponding increase in taxonomic specificity that suggests a functional importance for the hf2-3 tag. The hf2-4 regions, identified in all but two sequences, were previously described by Garcia-Martinez et al. (2002) as boxA. The long conserved islands ci2-1 and ci2-2 were identified in all the studied sequences, and the effect of a single nucleotide polymorphism (SNP) on the abundance profile is evident in ci2-1 of the ATCC 27126, I3K-509ITS, I50-356ITS and Med-SF2 sequences. In ci2-2 of the I3K-554ITS sequence, the presence of a SNP likewise causes a deep variation in the abundance profile. This island has already been identified (Garcia-Martinez et al., 2002) as a highly conserved 25 nt region. Finally, based on the dominance/abundance observations, two islands (sp1 and sp2) are suggested as primers or probes, being taxonomically highly specific for the order Alteromonadales.
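Two of the quantities discussed in this case study lend themselves to a short worked example. A single substitution changes every n-mer that overlaps it, so up to n consecutive positions of the abundance profile are affected, which is why one SNP can leave a deep notch in a conserved island. The sketch below lists those positions and also computes the relative drop in abundance when the frame is extended by one base. The function names are hypothetical, and the counts in the usage lines are invented so that the drop equals the 78% quoted above; they are not values from the database.

```python
def affected_frames(seq_len, snp_pos, n):
    """Start positions (0-based) of the n-mers overlapping a substitution
    at `snp_pos` in a sequence of length `seq_len`."""
    first = max(0, snp_pos - n + 1)
    last = min(snp_pos, seq_len - n)
    return list(range(first, last + 1))


def percent_decrease(before, after):
    """Relative drop (in %) in abundance when the frame grows by one base."""
    if before <= 0:
        raise ValueError("abundance before extension must be positive")
    return 100.0 * (before - after) / before


# A substitution in the middle of a 25 nt island, studied with 8-mers,
# alters eight consecutive frames.
print(affected_frames(25, 10, 8))
# Hypothetical counts: 1000 sequences share a tag, 220 share its extension.
print(percent_decrease(1000, 220))
```

Near the ends of the sequence fewer frames overlap the substitution, so the notch in the profile is correspondingly narrower.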
3.3 Third case study (ITS structure analysis for typing and identification)
In this case the ITS sequences of some P. syringae and P. fluorescens strains that were extensively analyzed by classical means (Milyutina et al., 2004) were analyzed to check the reliability of IWoCS (Fig. 4). As expected, IWoCS identified BoxB (ci3-1) and BoxA (ci3-2) as described in the original article. In addition, the tags of the selected boxes (A and B) were shown to be highly specific for the genus Pseudomonas.
In the taxonomy reports, as the length of the frame increases, Pseudomonadales, and mostly the fluorescens and syringae species, remain associated with the selected boxes. Using the IWoCS interface it was also possible to identify other highly conserved regions. The hf3-1 box (5'CTCAGCTG3') is widespread in the phylum Proteobacteria. Two other short, highly conserved sequences were identified (hf3-2, 5'GAGCGCACCCC3', and hf3-3, 5'AGGGTGAGGTCGG3') in the F6 R-915 sequence; these two regions are also present in Proteobacteria but are more common in Pseudomonadales. The longest region, hf3-3, was also specific for Proteobacteria and mainly found in the Xanthomonadales and Pseudomonadales. The ci3-3 island was common to all P. fluorescens strains, and the taxonomy report confirms its strict correlation with the fluorescens species present in the IWoCS database.
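A taxonomy report of the kind referred to above can be approximated by counting, per taxon, the database sequences that contain a tag. The sketch below is an assumption about how such a report could be built, not the IWoCS implementation; the function name is hypothetical, the toy sequences are invented, and only the tag 5'CTCAGCTG3' (hf3-1) is taken from the text.

```python
from collections import Counter


def taxonomy_report(tag, sequences, taxonomy):
    """Return [(taxon, count, share)] for the sequences containing `tag`,
    most frequent taxon first. Sequences without a taxon are skipped."""
    tag = tag.upper()
    counts = Counter(
        taxonomy[seq_id]
        for seq_id, seq in sequences.items()
        if seq_id in taxonomy and tag in seq.upper()
    )
    total = sum(counts.values())
    return [(taxon, c, c / total) for taxon, c in counts.most_common()]


# Toy data (hypothetical sequences and order-level assignments).
sequences = {
    "p1": "AACTCAGCTGAA",
    "p2": "GGCTCAGCTGTT",
    "x1": "TTCTCAGCTGCC",
    "b1": "ACGTACGTACGT",
}
taxonomy = {
    "p1": "Pseudomonadales",
    "p2": "Pseudomonadales",
    "x1": "Xanthomonadales",
    "b1": "Bacillales",
}
report = taxonomy_report("CTCAGCTG", sequences, taxonomy)
for taxon, count, share in report:
    print(taxon, count, round(share, 2))
```

Running the same report at increasing frame lengths would show whether a tag stays associated with one group, which is the behaviour described for the Pseudomonas boxes.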
| 4 CONCLUSIONS |
|---|
ITS sequences are now a useful tool for bacterial identification in both ecological (Rocap et al., 2002) and pathogenic (Chang et al., 2005) studies where the discriminatory power of the classical ribosomal 16S gene is not sufficient to separate closely related strains (Daffonchio et al., 2003). These elements, characterized by faster evolution owing to lower selective pressure, allow a more refined analysis at the level of microdiversity. The analysis of ribosomal intergenic spacers carried out by IWoCS allows a deeper study of ITS sequences than common alignment or direct BLAST searches by (1) performing a separate analysis for each fragment (tRNAs or hypervariable regions) of the submitted sequence, which is useful when the user wants to restrict a BLAST search to the hypervariable region and exclude the highly conserved tRNAs; (2) identifying the relevance of each n-mer within the database, giving the possibility to extrapolate and support functional or phylogenetic hypotheses; (3) providing n-mer abundance/dominance profiles, a highly informative graphic display of the ITS sequences that is of great help in comparing several ITSs; and (4) revealing the taxonomic distribution within the database of each tag from 5 to 13 nt. The interface highlights highly frequent tags or longer conserved islands, which helps to define conservation and/or specificity. Highly frequent regions have to be associated with conserved tags. Moreover, IWoCS facilitates the observation of infrequent but highly specific tags, making it a good instrument for probe and primer design for microbial identification purposes. The dominance plot gives an immediate idea of the taxa associated with a given sequence region. All these features make IWoCS an approach different from the pure comparative analysis carried out by BLAST and alignment software.
This new tool is particularly powerful in analyzing sequences base by base, reporting stabilization or uniqueness of each region.
Although the three case studies presented appear to be quite successful, it has to be kept in mind that the results rely on the description of taxa abundance, which is highly dependent on the number of sequences represented in the database. This problem arises from biases in the sequences represented in GenBank. As the number of identified sequences keeps increasing, it will be possible to fill the gap and perform more robust statistical tests.
Interest in the study of non-coding regions increases with the continued release of complete genomes, particularly regarding the regulatory signals affecting transcription/translation processes. IWoCS should be expandable to higher order analyses, widening the landscape of intergenic spacers studied. The next implementations of IWoCS will therefore concern the flanking regions of protein-coding genes, focusing more on secondary structure, which will increase its analytical power.
| Acknowledgments |
|---|
The authors would like to thank Alex Mira for helping with the manuscript and providing constructive comments. This work was funded by GEMINI (QLK3-CT-2002-02056) project of the European Commission.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Dmitrij Frishman
Received on August 19, 2005; revised on December 5, 2005; accepted on December 29, 2005
| REFERENCES |
|---|
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Baudart, J., et al. (2000) Diversity of Salmonella strains isolated from the aquatic environment as determined by serotyping and amplification of the ribosomal DNA spacer regions. Appl. Environ. Microbiol., 66, 1544-1552.
Boyer, S.L., et al. (2001) Is the 16S-23S rRNA internal transcribed spacer region a good tool for use in molecular systematics and population genetics? A case study in cyanobacteria. Mol. Biol. Evol., 18, 1057-1069.
Chang, H.C., et al. (2005) Species-level identification of isolates of the Acinetobacter calcoaceticus-Acinetobacter baumannii complex by sequence analysis of the 16S-23S rRNA gene spacer region. J. Clin. Microbiol., 43, 1632-1639.
Clementino, M.M., et al. (2001) PCR analyses of tRNA intergenic spacer, 16S-23S internal transcribed spacer, and randomly amplified polymorphic DNA reveal inter- and intraspecific relationships of Enterobacter cloacae strains. J. Clin. Microbiol., 39, 3865-3870.
Daffonchio, D., et al. (2003) Nature of polymorphisms in 16S-23S rRNA gene intergenic transcribed spacer fingerprinting of Bacillus and related genera. Appl. Environ. Microbiol., 69, 5128-5137.
Deschavanne, P.J., et al. (1999) Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol., 16, 1391-1399.
Fadiel, A., et al. (2003) Remarkable sequence signatures in archaeal genomes. Archaea, 1, 185-190.
Garcia-Martinez, J. and Rodriguez-Valera, F. (2000) Microdiversity of uncultured marine prokaryotes: the SAR11 cluster and the marine Archaea of Group I. Mol. Ecol., 9, 935-948.
Garcia-Martinez, J., et al. (2001) RISSC: a novel database for ribosomal 16S-23S RNA genes spacer regions. Nucleic Acids Res., 29, 178-180.
Garcia-Martinez, J., et al. (2002) Prevalence and microdiversity of Alteromonas macleodii-like microorganisms in different oceanic regions. Environ. Microbiol., 4, 42-50.
Harper, D.A.T. (1999) Numerical Palaeobiology. John Wiley and Sons, New York.
Iteman, I., et al. (2000) Comparison of conserved structural and regulatory domains within divergent 16S rRNA-23S rRNA spacer sequence of cyanobacteria. Microbiol., 146, 1275-1286.
Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955-964.
Milyutina, I.A., et al. (2004) Intragenomic heterogeneity of the 16S rRNA-23S rRNA internal transcribed spacer among Pseudomonas syringae and Pseudomonas fluorescens strains. FEMS Microbiol. Lett., 239, 17-23.
Pushker, R., et al. (2005) Micro-Mar: a database for dynamic representation of marine microbial biodiversity. BMC Bioinformatics, 6, 222.
Rice, P., et al. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276-277.
Rocap, G., et al. (2002) Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S-23S ribosomal DNA internal transcribed spacer sequences. Appl. Environ. Microbiol., 68, 1180-1191.
Suzuki, M.T., et al. (2001) Phylogenetical analysis of ribosomal RNA operons from uncultivated coastal marine bacterioplankton. Environ. Microbiol., 3, 323-331.
Teeling, H., et al. (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5, 163.
Wang, Y., et al. (2005) The spectrum of genomic signatures: from dinucleotides to chaos game representation. Gene, 346, 173-185.