Bioinformatics Advance Access originally published online on April 26, 2005
Bioinformatics 2005 21(13):3053-3055; doi:10.1093/bioinformatics/bti460

δρ-Web, an online tool to assess composition similarity of individual nucleic acid sequences
1Department of Medical Microbiology, Academic Medical Center 1100 DE Amsterdam, The Netherlands
2Bioinformatics Laboratory, Academic Medical Center 1100 DE Amsterdam, The Netherlands
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: Although whole-genome sequences have been analysed for the presence of anomalous DNA, no dedicated application is currently available to analyse the composition of individual sequence entries, for instance those derived by experimental techniques such as subtractive hybridization. Since genomic dinucleotide frequency values are conserved between related species, a representative genome sequence can often be found to score for anomalous sequence composition for many of these putative horizontally transferred sequences. We developed the application δρ-web, which enables the determination of the differences between the dinucleotide composition of an input sequence and that of a selected genome in a size-dependent manner. A feature allowing batch comparisons is included as well. In addition, δρ-web allows the analysis of the dinucleotide composition of complete genomes. This provides complementary information for the identification of large anomalous gene clusters.
Availability: The application is available through http://deltarho.amc.uva.nl and the software is available from the authors.
Contact: a.vanderende{at}amc.uva.nl
Supplementary information: An online help file with more extensive user guidelines is supplied at http://deltarho.amc.uva.nl
| INTRODUCTION |
|---|
From the data obtained by sequencing many different prokaryotic genomes it has been inferred that horizontal gene transfer (HGT) contributed considerably to the shape and evolution of microbial genomes. Currently, the estimates of percentages of putatively horizontally acquired DNA range from 0.5% in the endocellular symbiont Buchnera sp. APS genome to 25% in the Methanosarcina acetivorans genome, with an average of 14% in 116 prokaryotic genomes (Nakamura et al., 2004).
This genomic patchwork is clearly visible in the number of genomic islands (GIs) detected in microbial genomes (Mantri and Williams, 2004). Initially, acquisition of GIs was linked with gain in virulence in pathogenic bacteria. As more genome sequences of environmental strains become available, an increasing variety of acquired gene clusters providing diverse metabolic capacities is being discovered in these non-pathogenic strains, emphasizing that lateral genetic transfer is not limited to virulence traits (Dobrindt et al., 2004).
GIs can be recognized by their composition: their codon usage and GC-content differ from those of the host genome. It is of note, however, that not all anomalous DNA (aDNA) in a genome is necessarily horizontally transferred. Ribosomal gene clusters, for example, are known to be compositionally dissimilar from the rest of the genome. Conversely, a sequence horizontally acquired from a donor whose genome is compositionally similar to that of the recipient will most probably not be anomalous in composition in the host genome. In addition, sequences which have been horizontally acquired might become less anomalous in composition over time owing to a process called amelioration (Lawrence and Ochman, 1997). Hence, horizontally acquired DNA that has been obtained relatively recently will be more readily identified by its anomalous composition in the recipient's genome. The genomic context of aDNA may also aid in the identification of HGT: it has been indicated previously that the location of GIs between mobile elements, such as phage sequences and insertion sequences, implies a heterologous origin (Blum et al., 1994; Hacker and Kaper, 2000; Karlin, 2001).
Nakamura et al. (2004) provided evidence that transferred genes are biased towards functional categories associated with the cell surface, pathogenicity and DNA binding, although putative horizontally acquired sequences still contain many putative genes with unknown functions. Dobrindt et al. (2004) explain acquisition efficiency mainly in terms of fitness increase. Together, these findings imply that remarkable and diverse capacities are being transferred among micro-organisms, and this has generated great interest in HGT. However, the available databases and applications describing or identifying putative horizontally acquired sequences have focused exclusively on published genome sequences (Garcia-Vallve et al., 2003; Hsiao et al., 2003; Nakamura et al., 2004), and although informative, they do not consider individual sequences, which are still abundant in the public databases.
Besides whole-genome sequencing, alternative techniques are used in vitro to selectively isolate putative horizontally acquired sequences. These include subtractive hybridization, representational difference analysis and adaptor-linked PCR with endonucleases clustered specifically in compositionally atypical regions of the genome (Lisitsyn and Wigler, 1993; Straus and Ausubel, 1990; van Passel et al., 2004). However, no dedicated tool is available to score individual sequences, isolated with these techniques, for their dinucleotide composition dissimilarities compared with a genome sequence, although for many of these putative horizontally transferred sequences a representative genome sequence (i.e. a genomic context) is available.
Our aim was to develop an application to score dinucleotide composition differences of individual sequence entries with a chosen representative host genome sequence.
| METHODS |
|---|
The approach is based on the dinucleotide relative abundance values, or genome signature, $\rho^*$. As published previously by Karlin and Burge (1995), each genome has its own typical dinucleotide frequency values, which are conserved between related species. Although the genome signature was found to be relatively constant in 50 kb windows (Karlin, 2001), smaller windows can be used to identify anomalous sequences (van Passel et al., 2004). This is carried out by calculating, in a size-dependent manner, the dinucleotide relative abundance difference between the input sequence and the selected representative genome sequence. In brief, the dinucleotide relative abundance values, $\rho^*_{XY}$, are defined as the frequency of the dinucleotide XY divided by the product of the background frequencies of the individual nucleotides in the combined sense and reverse complement sequence [$\rho^*_{XY} = f^*_{XY}/(f^*_X f^*_Y)$]. $\delta^*$ is the dinucleotide relative abundance difference given by $\delta^*(f,g) = \frac{1}{16}\sum_{XY} |\rho^*_{XY}(f) - \rho^*_{XY}(g)|$, where $\rho^*_{XY}(f)$ denotes the abundance values calculated for input sequence fragment f and $\rho^*_{XY}(g)$ denotes the abundance values calculated for the closely related genome sequence g.
| APPLICATION |
|---|
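The $\delta^*$ measure described in the Methods can be sketched in Python as follows. This is a minimal illustration, not the δρ-web implementation: the function names `rho_star` and `delta_star` are hypothetical, the averaging over all 16 dinucleotides follows Karlin and Burge (1995), and assigning a relative abundance of 0 to dinucleotides whose constituent nucleotides are absent from a sequence is an assumption made here for simplicity.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rho_star(seq):
    """Return the 16 dinucleotide relative abundances rho*_XY of seq.

    Frequencies are counted over the sequence combined with its
    reverse complement.
    """
    seq = seq.upper()
    # The "N" separator keeps a dinucleotide from spanning the strand join.
    both = seq + "N" + seq.translate(_COMPLEMENT)[::-1]
    mono = Counter(b for b in both if b in BASES)
    di = Counter(
        both[i:i + 2]
        for i in range(len(both) - 1)
        if both[i] in BASES and both[i + 1] in BASES
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no countable dinucleotides")
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # Assumption: rho* is set to 0 when X or Y is absent from the sequence.
        rho[x + y] = observed / expected if expected > 0 else 0.0
    return rho


def delta_star(seq_f, seq_g):
    """Mean absolute difference between the rho* profiles of two sequences."""
    rho_f, rho_g = rho_star(seq_f), rho_star(seq_g)
    return sum(abs(rho_f[k] - rho_g[k]) for k in rho_f) / 16
```

Because both strands are counted, a sequence and its reverse complement yield the same profile, so `delta_star` does not depend on the orientation in which a fragment was sequenced.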
We aimed to develop a tool that compares the dinucleotide composition of an input sequence with the composition of a selectable complete genome sequence, and can also handle batch comparisons. This application, δρ-web, first constructs a collection of genomic fragments identical in length to the input sequence. For all these genomic fragments the $\delta^*$ values are calculated and depicted in an empirical distribution. To avoid statistically irrelevant computations, we recommend a minimum input sequence length of 1000 bp, which allows adequate dinucleotide counts per sequence. Likewise, the input sequence should not exceed 20 000 bp, as longer sequences may not allow a genomic frequency distribution with ample genomic fragments; however, these cut-off sizes should be considered carefully in relation to the size of the genome in question. Next, the $\delta^*$ value of the input sequence is compared with the distribution of $\delta^*$ values of the genomic fragments, which puts the composition of the input sequence in a genomic context. As mentioned previously, because different species contain different amounts of horizontally acquired sequences, the threshold to consider DNA to be anomalous varies accordingly, and a conservative cut-off value is therefore advised. As an example, although Neisseria meningitidis MC58 is thought to contain over 20% of horizontally acquired DNA (Nakamura et al., 2004), we used the conservative threshold of 10% in a previous study (van Passel et al., 2004). However, as compositional dissimilarity is merely indicative of horizontal transfer, more evidence, such as phylogenetic validation or the presence of species-specific motifs (such as DNA uptake sequences) (Sandberg et al., 2001), is desirable before making claims concerning a heterologous origin.

To determine the probability that the genomic dissimilarity of the input sequence differs from the average genomic dissimilarity of the collection of genomic fragments, δρ-web compares $\delta^*$ of the input sequence with the empirical distribution of $\delta^*$ values of the genomic fragments. This empirical distribution is also graphically represented by δρ-web (Figure 1A). The position of $\delta^*$ or the GC percentage of the input sequence in the plot of the distribution of genomic fragments can be expressed as the percentage of the genomic fragments which have a lower $\delta^*$ value or a lower GC percentage. Fragments are scored anomalous if $\delta^*$ has a high dissimilarity value compared with the genomic values (i.e. many genomic fragments have a lower $\delta^*$), whereas the GC percentage may be either high or low compared with the genomic values. An extensive list of genomic fractions of horizontally acquired DNA is supplied by Nakamura et al. (2004), which may be used to determine cut-off values for genomic composition dissimilarity values. However, the fractions of horizontally acquired DNA described by Nakamura et al. (2004) are based only on computational approaches; as long as further evidence is lacking, cut-off values based on these fractions should be considered arbitrary.
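The comparison of an input sequence with the empirical distribution of genomic fragments can be sketched as follows. This is an illustrative sketch, not the δρ-web code. The helper names are hypothetical, and three points are assumptions: the genome is cut into non-overlapping fragments (the text does not state whether fragments overlap), each fragment is compared with the complete genome, and dinucleotides whose nucleotides are absent from a sequence get a relative abundance of 0.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rho_star(seq):
    """Dinucleotide relative abundances over seq plus its reverse complement."""
    seq = seq.upper()
    both = seq + "N" + seq.translate(_COMPLEMENT)[::-1]  # "N" blocks the join
    mono = Counter(b for b in both if b in BASES)
    di = Counter(
        both[i:i + 2]
        for i in range(len(both) - 1)
        if both[i] in BASES and both[i + 1] in BASES
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no countable dinucleotides")
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        # Assumption: rho* is 0 when X or Y does not occur in the sequence.
        rho[x + y] = (di[x + y] / n_di) / expected if expected > 0 else 0.0
    return rho


def delta_star(rho_f, rho_g):
    """Mean absolute difference between two rho* profiles."""
    return sum(abs(rho_f[k] - rho_g[k]) for k in rho_f) / 16


def gc_percent(seq):
    """GC percentage over the unambiguous bases of seq."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in BASES)
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def score_input(input_seq, genome):
    """Place the input sequence in the distribution of same-length genomic fragments."""
    size = len(input_seq)
    rho_g = rho_star(genome)
    # Assumption: non-overlapping fragments; a trailing partial fragment is dropped.
    fragments = [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]
    if not fragments:
        raise ValueError("genome is shorter than the input sequence")
    frag_delta = [delta_star(rho_star(f), rho_g) for f in fragments]
    frag_gc = [gc_percent(f) for f in fragments]
    d_in = delta_star(rho_star(input_seq), rho_g)
    gc_in = gc_percent(input_seq)
    n = len(fragments)
    return {
        "delta": d_in,
        "gc": gc_in,
        "pct_lower_delta": 100.0 * sum(d < d_in for d in frag_delta) / n,
        "pct_lower_gc": 100.0 * sum(g < gc_in for g in frag_gc) / n,
    }
```

Under this sketch, a high `pct_lower_delta` marks the input as compositionally dissimilar from the genome, whereas `pct_lower_gc` is informative at either extreme, matching the scoring rule described above.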
In addition, δρ-web allows whole-genome composition analysis with a selectable window size, supplying an alternative analysis based on both the GC composition and the genomic signature to visualize large anomalous gene clusters in a prokaryotic genome. This is performed by dividing the genome sequence into non-overlapping windows, after which the composition of each window is compared with the composition of the complete genome. In Figure 1B, not only are the different large islands of horizontal transfer (IHTs) annotated by Tettelin et al. (2000) visible as islands with both high $\delta^*$ values and aberrant GC percentages, but smaller anomalous gene clusters are visible as well. The island designated B in Figure 1B was previously recognized by Karlin (2001), whereas the island designated X was previously identified by Garcia-Vallve et al. (2003).
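The windowed whole-genome analysis can be sketched in the same way. Again this is a hedged illustration with hypothetical names (`scan_genome`, `rho_star`), not the δρ-web code. Dropping a trailing window shorter than the chosen size, and setting the relative abundance to 0 for dinucleotides whose nucleotides are absent, are assumptions.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rho_star(seq):
    """Dinucleotide relative abundances over seq plus its reverse complement."""
    seq = seq.upper()
    both = seq + "N" + seq.translate(_COMPLEMENT)[::-1]  # "N" blocks the join
    mono = Counter(b for b in both if b in BASES)
    di = Counter(
        both[i:i + 2]
        for i in range(len(both) - 1)
        if both[i] in BASES and both[i + 1] in BASES
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no countable dinucleotides")
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        # Assumption: rho* is 0 when X or Y does not occur in the sequence.
        rho[x + y] = (di[x + y] / n_di) / expected if expected > 0 else 0.0
    return rho


def scan_genome(genome, window):
    """Return (start, delta*, GC%) for each non-overlapping window of the genome."""
    rho_g = rho_star(genome)
    rows = []
    # Assumption: a trailing window shorter than `window` is dropped.
    for start in range(0, len(genome) - window + 1, window):
        w = genome[start:start + window].upper()
        delta = sum(abs(r - rho_g[k]) for k, r in rho_star(w).items()) / 16
        counted = sum(w.count(b) for b in BASES)
        gc = 100.0 * (w.count("G") + w.count("C")) / counted
        rows.append((start, delta, gc))
    return rows
```

Plotting the second and third elements of each row against the start coordinate gives the kind of two-track overview described for Figure 1B, in which candidate islands appear as windows with a high $\delta^*$ and an aberrant GC percentage.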
In conclusion, δρ-web allows composition similarity scoring for individual prokaryotic sequence entries compared with a selected representative prokaryotic genome sequence, including a many-to-many interface as well as genome composition visualizations.
Received on December 8, 2004; revised on April 15, 2005; accepted on April 20, 2005
| REFERENCES |
|---|
Bart, A., et al. (2001) NmeSI restriction-modification system identified by representational difference analysis of a hypervirulent Neisseria meningitidis strain. Infect. Immun., 69, 1816-1820.
Blum, G., et al. (1994) Excision of large DNA regions termed pathogenicity islands from tRNA-specific loci in the chromosome of an Escherichia coli wild-type pathogen. Infect. Immun., 62, 606-614.
Dobrindt, U., et al. (2004) Genomic islands in pathogenic and environmental microorganisms. Nat. Rev. Microbiol., 2, 414-424.
Garcia-Vallve, S., et al. (2003) HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res., 31, 187-189.
Hacker, J. and Kaper, J.B. (2000) Pathogenicity islands and the evolution of microbes. Annu. Rev. Microbiol., 54, 641-679.
Hsiao, W., et al. (2003) IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics, 19, 418-420.
Karlin, S. (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol., 9, 335-343.
Karlin, S. and Burge, C. (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends Genet., 11, 283-290.
Lawrence, J.G. and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol., 44, 383-397.
Lisitsyn, N. and Wigler, M. (1993) Cloning the differences between two complex genomes. Science, 259, 946-951.
Mantri, Y. and Williams, K.P. (2004) Islander: a database of integrative islands in prokaryotic genomes, the associated integrases and their DNA site specificities. Nucleic Acids Res., 32, D55-D58.
Nakamura, Y., et al. (2004) Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat. Genet., 36, 760-766.
Sandberg, R., et al. (2001) Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res., 11, 1404-1409.
Straus, D. and Ausubel, F.M. (1990) Genomic subtraction for cloning DNA corresponding to deletion mutations. Proc. Natl Acad. Sci. USA, 87, 1889-1893.
Tettelin, H., et al. (2000) Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science, 287, 1809-1815.
van Passel, M.W., et al. (2004) An in vitro strategy for the selective isolation of anomalous DNA from prokaryotic genomes. Nucleic Acids Res., 32, e114.



