Skip Navigation


Bioinformatics Advance Access originally published online on January 10, 2006
Bioinformatics 2006 22(5):517-522; doi:10.1093/bioinformatics/btk029
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/5/517    most recent
btk029v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Dalevi, D.
Right arrow Articles by Hermansson, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dalevi, D.
Right arrow Articles by Hermansson, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Bayesian classifiers for detecting HGT using fixed and variable order markov models of genomic signatures

Daniel Dalevi 1,*, Devdatt Dubhashi 1 and Malte Hermansson 2

1Department of Computing Science, Chalmers University SE 412 96 Göteborg, Sweden
2Department of Cell and Molecular Biology, Microbiology, Göteborg University 405 30 Göteborg, Sweden

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 

Motivation: Analyses of genomic signatures are gaining attention as they allow studies of species-specific relationships without involving alignments of homologous sequences. A naïve Bayesian classifier was built to discriminate between different bacterial compositions of short oligomers, also known as DNA words. The classifier has proven successful in identifying foreign genes in Neisseria meningitis. In this study we extend the classifier approach using either a fixed higher order Markov model (Mk) or a variable length Markov model (VLMk).

Results: We propose a simple algorithm to lock a variable length Markov model to a certain number of parameters and show that the use of Markov models greatly increases the flexibility and accuracy in prediction to that of a naïve model. We also test the integrity of classifiers in terms of false-negatives and give estimates of the minimal sizes of training data. We end the report by proposing a method to reject a false hypothesis of horizontal gene transfer.

Availability: Software and Supplementary information available at www.cs.chalmers.se/~dalevi/genetic_sign_classifiers/

Contact: dalevi{at}cs.chalmers.se


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
Mutational and selectional forces acting at the cellular level result in species-specific DNA characteristics that manifest in, for example, G + C content (e.g. Muto and Osawa, 1987), nucleotide dependencies (e.g. Hooper and Berg, 2002), codon bias (e.g. Sharp and Matassi, 1994) and oligomer frequencies (e.g. Burge et al., 1992; Sandberg et al., 2001). These characteristics can be used to identify foreign genes in bacterial genomes. Introduction of genes by horizontal gene transfer (HGT), rather than vertical inheritance of genes from parents by the offspring, is an important mechanism of bacterial adaptation, and has profound implications for phylogenetic analyses (Woese, 1987; Doolittle, 1999). Existence of highly homologous genes in evolutionarily distantly related organisms has been taken as an indication of HGT. These can be detected using, for example, whole-genome phylogenies (e.g. Sicheritz-Ponten and Andersson, 2001) or by comparing gene trees with a species tree (e.g. Hallet and Lagergren, 2000). These approaches obviously require knowledge and sequence data of homologous genes from many organisms in order to work. Analyses of DNA characters on the other hand, can identify foreign genes directly with access to only one genome (e.g. Dufraigne et al., 2005).

In the case of G + C content much work has been done based on the assumption that organisms occupy a certain ‘G + C color’ shaped by different mutational and selectional forces (e.g. Muto and Osawa, 1987; Lawrence and Ochman, 1997; Lee et al., 2004; Sandberg et al., 2003; Forsdyke and Mortimer, 2000). However, these have proven to capture only about 40% of the species-specificity (Sandberg et al., 2003). This is not surprising since reliance on the G + C content alone places a lot of faith on a single number. This was also noted by Koski et al. (2001) who announced that G + C content on its own is a poor indicator of HGT. They also found that a high Codon Adaption Index (CAI) was a good indicator of positional conservation, meaning no HGT. But on the other hand, a low CAI was not as good an indicator of HGT as a atypical G + C content in the first and third codon positions. Their main conclusion was that any estimates of HGT ought to be confirmed using phylogenies whenever such can be made. Furthermore, when comparing the third position G + C content of the set of positionally conserved genes with the set of unique genes in a pair of Chlamydia species, the distributions were not significantly different indicating that either no HGT had occurred since time of divergence or G + C content was a poor indicator of such (Dalevi et al., 2002).

A better indicator of species-specificity is to study the genomic composition of oligomers of different lengths, or, so called genomic signatures (Karlin and Burge, 1995; Dufraigne et al., 2005; Sandberg et al., 2003; Wang et al., 2005). Burge et al. (1992) showed that the composition of dimers is highly specific and conserved through the genome for bacterial chromosomes and plasmids. Pride et al. (2003) declared similar results for tetramers. The use of overlapping k-mers can capture many more characteristics than any individual estimates of for example codon usage, restriction sites and transcription factor binding sites—all of which may be host-specific. Recently, Teeling et al. (2004) demonstrated that the discriminatory power of correlations in a tetramer derived Z-score was far superior to that of differences in G + C content using a dataset of 118 bacteria. More generally, in the studies of HGT, Sandberg et al. (2001), built a naïve classifier using models with arbitrary word size. The classifier was applied to different data verifying that it was indeed capable of identifying HGT that were already well-documented, such as HGT from Haemophilus influenzae to Neisseria meningitis (Kroll et al., 1998). The approach based on oligomer profiles and the technique of Sandberg et al. (2001) for classifying unknown DNA deserves further attention. One obvious direction in which to extend this work is to use more flexible and mathematically sound models that can take into account statistical dependencies between words. This is the approach we pursue here using fixed and variable length Markov models.

Markov models are very natural for modeling DNA and have been used successfully in many applications, e.g. identifying transcription binding sites (Ellrott et al., 2002; Zhao et al., 2004), modeling oligomers (Reinert et al., 2000; Scherer et al., 1994), gene finding (Borodovsky and McIninch, 1993; Salzberg et al., 1998) and identifying horizontally transferred genes (Nakamura et al., 2004). DNA sequences have the following properties: (1) A finite alphabet {Gamma} isin {A,C,G,T}, (2) There is a ‘time’ index taken as the direction of replication and transcription (5'->3') and (3) The dependencies on earlier nucleotides vanish with time, a sort of amnesia, also known as the Markov property. All (1)–(3) are typical characteristics of Markov chains.

In this report, we expand previous work (Karlin and Burge, 1995; Dufraigne et al., 2005; Sandberg et al., 2003; Wang et al., 2005) by proposing the framework of (fixed and variable order) Markov chains to capture dependencies in genomic signatures. In particular, we demonstrate that variable order Markov chains (Bühlmann and Wyner, 1999; Ron et al., 1996) are very well suited when dependencies are long in specific directions, and that the sparsity of their parameter space renders them very suitable for model inference. We demonstrate improvements over previous work and propose a novel method for locking the number of parameters in a variable length Markov model together with giving estimates of minimal sizes of training data. We confirm the integrity of the classifiers by estimating the rate of false negatives and propose a novel method for assigning confidence to candidates of HGT. The contribution of this study is a novel classifier with high flexibility, accuracy and mathematical soundness.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
This section is divided into several subsections. First the basic common principle behind all the Bayesian classifiers is presented. Next we discuss the different implementations of the likelihood function.

2.1 The Bayesian classifiers
The general classification problem we address may be formulated as follows. We are given a test sequence S and a collection of possible source genomes of origin Formula. We would like to classify the sequence S as arising from a particular Formula. The criterion for classification we adopt here is a standard simple Bayesian rule of maximum a posteriori. By Bayes' rule,

Formula 1(1)
Note, when using a flat prior, P(G) = const for all Formula 1, Equation (1) simplifies to the likelihood. To account for strand-specificity we simply take the maximum of the two possible probability values for the sequences, S and Sc, where Sc is the reverse complement of S,

Formula 2(2)
The genome, Formula 2, that maximizes L(S,G) is the most likely source of S.

2.2 The naïve classifier, Nk
The naïve classifier Nk of order k, earlier implemented in Sandberg et al. (2001), considers words M of length k + 1 and estimates each P(M|G) by the relative frequency with which M occurs in G (as a window of size k + 1 is slid from left to right across G). It then estimates P(S|G) by

Formula 3(3)
Here Formula 3 is the set of words of length k + 1 that occur in S (as we slide a window of size k across S from left to right). The sequence S is then classified among a set of genomes Formula 3 by a simple maximum likelihood rule: assign S to that Formula 3 which maximizes P(S|G).

The drawback of this classifier is that (3) assumes that the probabilities P(M|G) are all independent for Formula 3, an assumption which is clearly invalid given that the words Formula 3 are overlapping. This was pointed out by the authors themselves (Sandberg et al., 2001): ‘The naïve Bayesian classifier assumes each motif to be independent of the other motifs, which is clearly false’. To remedy this drawback, we consider Markov models.

2.3 The fixed order Markov classifier, Mk
A Markov chain is a discrete stochastic process (X0, X1, X2, X3, ... , Xi, ...), with Xi taking values in a finite set {Gamma} called the alphabet, such that there is a k ≥ 0 satisfying for all n ≥ k,

Formula 4(4)
Here and in the sequel, Formula 4 denotes the sequence xm, xm+1, ... , xn. The order of the Markov chain is the smallest integer k = k0 ≥ 0 such that (4) holds. A Markov chain of order k is completely specified by giving the initial sequence x0, ... , xk–1 and then the transition probabilities Formula 4 for u isin {Gamma}k, a isin {Gamma}. Thus specifying a Markov chain of order k involves |{Gamma}|k(|{Gamma}| – 1) degrees of freedom (since for each u isin {Gamma}k, we have {sum}a pu,a = 1). For example, when the alphabet is DNA, {Gamma} = {A, C, G, T} and the order k = 1, we have 12 degrees of freedom. Table 1 lists the number of free parameters for the first ks.


View this table:
[in this window]
[in a new window]
 
Table 1 The corresponding number of free parameters O(k) in a Markov model of order k when the alphabet is DNA

 
The transition probabilities are estimated by maximum likelihood using the empirical distribution, Formula 4) estimated from frequency counts Nx(w) of words w in the data x.

2.4 The variable length markov classifier, VLMk
The basic idea behind variable length Markov models (VLMMs, Bühlmann and Wyner, 1999; Mächler and Bühlmann, 2004; Ron et al., 1996) is that the memory of the stationary Markov chain should be allowed to depend on the context, i.e. the position in the data. Depending on the context, one may have to look back a variable number of symbols in order to determine the probabilities of the next transition.

This can be made precise via the notion of a Prediction Suffix Tree (PST). This is a rooted tree where each vertex v is labeled with a string sv isin {Gamma}* and each edge is labeled with a letter a isin {Gamma}. The root is labeled with the empty string. If a vertex v is labeled sv the child w is labeled sva, where a is the label of the edge (v, w). A terminal node v with label sv is also associated with transition probabilities {gamma}s,a, a isin {Gamma}. The transitions in a VLMM are determined as follows: given x1, ... , xn, follow the unique path from the root given by the edges labeled by xn,xn–1, ... , (i.e. in reverse order) until we arrive at a leaf node. Use the transition probabilities associated with the leaf node to generate the next transition.

VLMMs are a structurally much richer class of models allowing vastly increased flexibility in choosing the number of parameters to fit a given set of data.They may be seen as a broad class of Markov models that adapt to the environment. The optimal model can be found using standard model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) by generating all possible sub-models. However, in practice, it is not possible to perform an exhaustive search. Therefore other approaches have to be considered.

2.5 Locking the model
Instead of finding the optimal model we suggest that one should use the model with the most number of parameters that the data allows when classifying genes. Sandberg et al. (2001) found that the longer the oligomers they used the better the classification. This is true as long as there is enough data. This simply reflects the fact that a higher order model can always approximate a lower order model. However, when data get too sparse the approximation will not work because of unreliable estimates of the transition probabilities. In this section we propose a simple locking algorithm, based on the algorithm in Bühlmann and Wyner (1999), to generate a PST with a specified number of parameters m. This locking procedure is also necessary in order to make a sort of fair comparison with the fixed order model when comparing performance.

The algorithm first builds a maximal context tree having a minimum number of counts of each element and a maximal depth. For nodes with next-symbol probabilities equal to zero we apply the correction of pseudo-counts with Laplace's rule (e.g. Durbin et al., 1998). Then the algorithm assigns the node with string v the score, {Delta}w,v, where w is the |v| – 1 suffix of v, that is based on a Kullback–Liebler distance of next-symbol probabilities (Bühlmann and Wyner, 1999).

Formula 5(5)
The tree is then pruned, by removing terminal nodes with lower scores, until the PST has approximately m parameters (see also the Supplementary Material).


    3 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
3.1 Comparing performance
Two datasets, one with 28 taxa (same as used in, Sandberg et al., 2001) and one with 157 taxa, were used for testing the performance of the classifiers. Both datasets are available in the Supplementary Material. Of the data 90% was used for training and the remaining part for testing. The same k was used for the naïve and the fixed order Markov model meaning that we compare performance for oligomers of the same length (Fig. 1). The variable length Markov model was constrained to the same number of free parameters but with no limit to the length of the oligomers. The only exception was made for k = 1 where the VLMk got the opportunity to choose a model with 15 parameters if more appropriate. Otherwise, there would be no difference between M1 and VLM1 since there are very few ways of selecting a model of 12 parameters that is different from M1.


Figure 1
View larger version (26K):
[in this window]
[in a new window]
 
Fig. 1 Comparing performance. The y-axis shows the efficiency measured in percentage of correctly classified samples and the x-axis indicates the number of nucleotides. A total of 100 sequences were drawn randomly from the 10% test set and classified to the profiles constructed from the remaining 90%. In (a)–(e) 28 taxa were used. In (a) variable of 12–15 free parameters (vlm1), k = 1 for naïve (n1) and fixed order (m1), (b) k = 2, (c) k = 3 (d) k = 4, (e) improvement for the VLMk for k = 1, ... , 5 (f) k = 4, same as (d) but with 157 taxa.

 
100 sequences of different lengths were randomly chosen from each bacterium's test data and classified to the different profiles. The procedure was repeated for sequences with lengths of 35, 60, 100, 200, 400, 1000, 1500 and 2000. Figure 1 displays the results for different orders and different models. The improvement of using words longer than five (Fig. 1e, k = 4) is insignificant but in these simulations there is an improvement using words longer than the popular tetramers (e.g. Pride et al., 2003; Teeling et al., 2004; Dufraigne et al., 2005). Figure 1f shows the result for the bigger dataset of 157 taxa for k = 4. The sequences need to be twice as long to obtain approximately the same accuracy as with the lower number of taxa, but the signal is still well defined.

The major improvement is observed when going from the naïve model to a fixed order Markov model (Fig. 1). There is not a single case where the naïve performs better than either of the Markov models. This is more pronounced for the bigger dataset (Fig. 1f). Further, the VLMk shows a slight improvement compared with the Mk for all k. The improvement in favor of VLMk over Mk is consistent for the different runs but it is not a large improvement and it may be hard to argue for using VLMk only based on this result. However, what is important with the VLMk is that more or less any number of parameters can be used and not just those identified in Table 1. There are models in between that can fit almost any number of parameters.

Figure 1 indicates that a proportion of the sequences in bacteria is not species-specific in genomic signatures. This fraction contains candidates of horizontally transferred genes. These candidates must be further processed in a second step where biological constraints are considered, see e.g. Nakamura et al. (2004). This is true for all classifiers of genomic signatures since there are atypical genes that are not HGTs.

3.2 The number of data in training
A very important issue is to know when to use a certain model: How many parameters can we estimate accurately when training on a certain number of nucleotides? Deriving an exact formula is not trivial owing to the overlapping nature of the words. Reinert et al. (2000) discuss the issue and present various approximations using the Law of Large Numbers and a Poisson distribution when the probability of a word is low. However, one can do a simple heuristic estimate as follows: Assume data are generated under a multinomial model where each nucleotide occurs with probability pa where a isin {Gamma} = {A, C, G, T} (this corresponds to an M0 model). Under this model the standard deviation is given by Formula 5 where N is the number of nucleotides. By fixing the standard deviation we can derive an N for this independent model. This means that ~300 nt are required for having {sigma}a ≤ 0.025 when all nucleotides have the same probability. Extending this to counts of all dimers that start in nucleotide b, approximately 300 counts of ba will be required for a similar accuracy in pba for each a isin {Gamma}. Since |{Gamma}| = 4 this results in N > 300 x 4l–1 for an arbitrary l-mer. In the variable model, with a further assumption that the counts are more or less balanced in the tree, we need to keep the number of parameters less than N/100. The rules are of course rough approximations since the words are over-lapping, however, using simulations we can confirm that they serve as approximations at a specified accuracy.

A standard statistical approach to these kinds of question is via parametric bootstrapping (Efron, 1985). The technique relies on the assumption that the error measured between the true distribution and the observed sample is approximately the same as between a distribution based on the sample and new samples derived from this. Selecting 10 different bacteria we ran 1000 bootstrap replications on sequences of length N and calculated the error between the sample probabilities and the re-sampled probabilities. The result, giving estimates of the data needed for a given accuracy in various models is summarized in Table 2. Observe that the error increases when dependencies get longer.


View this table:
[in this window]
[in a new window]
 
Table 2 Listing the results of the parametric bootstrap

 
3.3 False negatives I: the IncP-ß dataset
Plasmids are important in the study of HGT since they are often the carriers of genes between bacteria. The members of the IncP-ß group are known to be promiscuous and often isolated from Pseudomonas species (e.g. Adamczyk and Jagura-Burdzy, 2003). Since plasmids may exist in different hosts they are affected by different selective forces. However, a plasmid group (like the IncP) often has a common set of conserved backbone genes that are essential for survival (i.e. maintenance, replication, transfer and stability). It is well known that this backbone is enriched in certain restriction sites and oligomers (such as GGCC and GCCG, Wilkins et al., 1996). The PST of VLMMs can be used for visualizing such dependencies in data. The algorithm proposed in Section 2.5 was used for this purpose with a constraint of 60 parameters. Many of the pB10 specific oligomers show up as long branches in the tree, e.g. the GGCC patterns (Fig. 2).


Figure 2
View larger version (30K):
[in this window]
[in a new window]
 
Fig. 2 A PST of the backbone genes in plasmid pB10 obtained after pruning with locking of ~60 parameters. The PST captures important features of the underlying data. The patterns within the square brackets all occur with high counts, however, not necessary the highest of them all. The backbone of IncP-ß plasmids are known to be enriched in the GGCC oligomer and this dependency is captured by the PST. The numbers within the brackets are next-symbol counts for A, C, G, T followed by the total number of next-symbols for the context in []. T denotes terminal node (i.e. leaf).

 
It is interesting to test if the classifiers can identify the backbone structure as plasmid-specific and by that capture the plasmid identity. We assume that these genes have not been acquired by HGT and define the error rate {varepsilon} as the probability of classifying a backbone gene to a non-IncP-ß genome. This should be seen as a test of integrity that can be applied to any classifier of HGT. For the evaluation of the VLMk we ran the backbone genes when removing the genes to classify from the profiles, one-by-one, from all of the six IncP-ß. This is done to remove the signal of the gene itself. Using a VLMk with 768 parameters, corresponding to pentamers, we manage to correctly classify all but two genes, these being kleB and traJ (Table 3). This results in an error {varepsilon} {approx} 5%.


View this table:
[in this window]
[in a new window]
 
Table 3 Percentage of highly conserved IncP-ß backbone genes that correctly classify to this group of plasmids (R751, pB4, pB10, pADP-1, pUO1, pJP4)

 
3.4 False negatives II: Escherichia coli and Salmonella typhi
The gene order between pairwise closely related bacteria holds useful information for studying evolutionary processes (e.g. Dalevi et al., 2002). Positional conserved genes have a preserved gene-order meaning that their corresponding permutation is not disrupted other than by possible inversion of the whole segment. These genes are very unlikely to result from HGT, at least not after the two species diverged. If introduced earlier they would have been exposed to cellular forces, resulting in amelioration, over a long time and hence would have a composition similar to the host. Koski et al. (2001) identified a list consisting of 2951 positionally conserved genes between E.coli K-12 and S.typhi. A list of these genes was obtained (G.B. Golding, personal communication). When running all genes >1000 nt with the VLMk classifier of 768 parameters, in the dataset of 157 bacteria, the classifier identifies 95% of the genes to be of E.coli origin (see Supplementary Material). This corresponds to an error {varepsilon} {approx} 5%. This indicates that a proportion of the atypical genes in Figure 1f will be naturally deviating in genomic signatures and hence not because of HGT.

3.5 Simulated data
Molecular data are hard to model. None of the above models is a ‘true’ representation of the mechanisms in any bacterium. However, simulations can be used to study the effect on classification of certain statistical properties of a bacterium (B) in a controlled manner. To produce data reflecting the forces that capture some of the important rules, we specified three characteristics the simulated data should have. These are listed below together with how they are modeled.

  1. Codon usage: Simulated using a fixed order Markov model, k = 2, trained on real data (coding regions only) in bacterium (B).
  2. Longer oligomers: These are over-represented words in a bacterium and inserted at a site with probability poligo > 0.
  3. Random noise: Transforming a site in the sequence with probability pnoise > 0. using a zero order Markov model reflecting the composition of the bacterium (B).
Using the schema we constructed a set of 15 artificial genomes of a million base pairs. The three classifiers were then tested on the data. The VLMk outperformed the others for all k and for all lengths of data except for longer test sequences. Having ~2000 bp, all classifiers were 100% correct. The results are shown in the Supplementary Material.

3.6 Assigning confidence: when the group is not in the sample
The problem with all classifiers of the types we present is that there will always be a most likely candidate, an answer. However, what if the organism is not in the sample? This is a very likely scenario since there are millions of bacterial species and only a tiny fraction is available in databases. The probability that the correct organism is in the sample is hence quite small and there is a need for methods of rejecting hypotheses of HGT.

For a target sequence s, suppose the maximum-likelihood genome pinpointed by our classifier is G. We wish to test the following null-hypothesis,

  1. H0: The unknown sequence s has been generated from source G

We adopt the following approach inspired by the technique of parametric bootstrapping (Efron, 1985):

  • Calculate the likelihood of sequence L(s) of s under H0.
  • Generate n samples of the same length as s from G at random. For each sample sequence si calculate L(si).
  • Count the number of L(si) that are above L(s). If this value is above {alpha}, say 0.05, we reject H0 at the level {alpha} = 0.05.
(Note that the same method can be applied for whatever model—fixed or variable order—we use and will be reflected in the way the likelihood is computed.) Table 4 shows P-values that result for some of the backbone genes of the R751 backbone-genes against various profiles. As a basic check, we note that the test does not reject any of the genes from the same profile. Many of the distant bacteria get clear rejections. However, it is interesting to observe that for especially Pseudomonas aeruginosa, which is a natural host for R751, and similar in genomic signatures, not so many rejections are made.


View this table:
[in this window]
[in a new window]
 
Table 4 P-values for the 12 first [according to annotation] backbone genes of R751 against different profiles using an M4 model

 

    4 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
In this report, we expand previous work (Karlin and Burge, 1995; Dufraigne et al., 2005; Sandberg et al., 2003; Wang et al., 2005) by proposing the framework of fixed and variable length Markov chains to capture dependencies in genomic signatures. The contribution is a novel classifier with high flexibility, accuracy and mathematical soundness. We proposed a novel method for locking the model to a certain number of parameters so that it can naturally adjust to the size of the data. We showed that the integrity of the classifier is high in terms of the number of false negatives and proposed a novel method for rejecting false candidates of HGT.


    Acknowledgments
 
The authors wish to thank Marianne Månsson and Olle Nerman for useful discussions and Hsien Toh for critically reading the manuscript. This work was supported by the Swedish National Research School in Genomics and Bioinformatics, funded by the Swedish Foundation for Strategic Research (SSF).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Christos Ouzounis

Received on June 21, 2005; revised on December 8, 2005; accepted on December 27, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 

    Adamczyk, M. and Jagura-Burdzy, G. (2003) Spread and survival of promiscuous IncP-1 plasmids. Acta Biochim. Pol, . 50, 425–453[Web of Science][Medline].

    Borodovsky, M. and McIninch, J. (1993) Recognition of genes in DNA sequence with ambiguities. Biosystems, 30, 161–171[CrossRef][Web of Science][Medline].

    Bühlmann, P. and Wyner, A. (1999) Variable length Markov chains. Ann. Statist, . 27, 480–513[CrossRef].

    Burge, C., et al. (1992) Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl Acad. Sci. USA, 89, 1358–1362[Abstract/Free Full Text].

    Dalevi, D., et al. (2002) Measuring genome divergence in bacteria: a case study using chlamydian. J. Mol. Evol, . 55, 24–36[CrossRef][Web of Science][Medline].

    Doolittle, W.F. (1999) Phylogenetic classification and the universal tree. Science, 284, 2124–2129[Abstract/Free Full Text].

    Dufraigne, C., et al. (2005) Detection and characterization of horizontal transfers in prokaryotes genomic signature. Nucleic Acids Res, . 33, e6[Abstract/Free Full Text].

    Durbin, R., Eddy, S., Krogh, A., Mitchison, G. Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids, . (1998) , Cambridge Cambridge University Press.

    Efron, B. (1985) Bootstrap confidence intervals for a class of parametric problems. Biometrika, 72, 45–58[Abstract/Free Full Text].

    Ellrott, K., et al. (2002) Identifying transcription factor binding sites through Markov chain. Bioinformatics, 18, Suppl. 2, s100–s109.

    Forsdyke, D.R. and Mortimer, J.R. (2000) Chargaff's legacy. Gene, 261, 127–137[CrossRef][Web of Science][Medline].

    Hallet, H. and Lagergren, J. (2000) New Algorithms for the Duplication-Loss Model. Proceedings of the Research on Computational Molecular Biology, . , New York ACM Press, pp. 138–146.

    Hooper, S.D. and Berg, O.G. (2002) Detection of genes with atypical nucleotide sequence in microbial genomes. J. Mol. Evol, . 54, 365–375[Web of Science][Medline].

    Karlin, S. and Burge, C. (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends Genet, . 11, 283–290[CrossRef][Web of Science][Medline].

    Koski, L.B., et al. (2001) Codon bias and base composition are poor indicators of horizontally transferred genes. Mol. Biol. Evol, . 18, 404–412[Abstract/Free Full Text].

    Kroll, J.S., et al. (1998) Natural genetic exchange between Haemophilus and Neisseria: intergeneric of chromosomal genes between major human pathogens. Proc. Natl Acad. Sci. USA, 95, 12381–12385[Abstract/Free Full Text].

    Lawrence, J.G. and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol, . 44, 383–397[CrossRef][Web of Science][Medline].

    Lee, S.J., et al. (2004) Genomic conflict settled in favour of the species rather than the gene at GC percentage values. Appl. Bioinformatics, 3, 219–228[CrossRef][Medline].

    Mächler, M. and Bühlmann, P. (2004) Variable Length Markov chains: methodology, computing, and software. J. Comp. Graph. Stat, . 13, 435–455[CrossRef].

    Muto, A. and Osawa, S. (1987) The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl Acad. Sci. USA, 84, 166–169[Abstract/Free Full Text].

    Nakamura, Y., et al. (2004) Biased biological functions of horizontally transferred genes in genomes [Erratum (2004) Nat. Genet, 36 1126.]. Nat. Genet, . 36, 760–766[CrossRef][Web of Science][Medline].

    Pride, D.T. (2003) Evolutionary implications of microbial genome tetranucleotide frequency. Genome Res, . 13, 145–158[Abstract/Free Full Text].

    Reinert, G., et al. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol, . 7, 1–46[CrossRef][Web of Science][Medline].

    Ron, D., et al. (1996) The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn, . 25, 117–149.

    Salzberg, S.L., et al. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res, . 26, 544–548[Abstract/Free Full Text].

    Sandberg, R., et al. (2001) Capturing whole-genome characteristics in short sequences using a naive classifier. Genome Res, . 11, 1404–1409[Abstract/Free Full Text].

    Sandberg, R., et al. (2003) Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene, 311, 35–42[CrossRef][Web of Science][Medline].

    Scherer, S., et al. (1994) Atypical regions in large genomic DNA sequences. Proc. Natl Acad. Sci. USA, 91, 7134–7138[Abstract/Free Full Text].

    Sharp, P.M. and Matassi, G. (1994) Codon usage and genome evolution. Curr. Opin. Genet. Dev, . 4, 851–860[CrossRef][Medline].

    Sicheritz-Ponten, T. and Andersson, S.G. (2001) A phylogenomic approach to microbial evolution. Nucleic Acids Res, . 29, 545–552[Abstract/Free Full Text].

    Teeling, H., et al. (2004) Application of tetranucleotide frequencies for the assignment of genomic. Environ. Microbiol, . 6, 938–947[CrossRef][Medline].

    Wang, Y., et al. (2005) The spectrum of genomic signatures: from dinucleotides to chaos game. Gene, 346, 173–185[CrossRef][Web of Science][Medline].

    Wilkins, B.M., et al. (1996) Distribution of restriction enzyme recognition sequences on broad host range plasmid RP4: molecular and evolutionary implications. J. Mol. Biol, . 258, 447–456[CrossRef][Web of Science][Medline].

    Woese, C.R. (1987) Bacterial evolution. Microbiol. Rev, . 51, 221–271[Free Full Text].

    Zhao, X., Huang, H., Speed, T. (2004) Finding short DNA motifs using permuted Markov models. Proceedings of the Research on Computational molecular Biology, . , New York ACM Press, pp. 68–75.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
Y. Sun, Y. Cai, L. Liu, F. Yu, M. L. Farrell, W. McKendree, and W. Farmerie
ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences
Nucleic Acids Res., June 1, 2009; 37(10): e76 - e76.
[Abstract] [Full Text] [PDF]


Home page
Microbiol. Mol. Biol. Rev.Home page
V. Kunin, A. Copeland, A. Lapidus, K. Mavromatis, and P. Hugenholtz
A Bioinformatician's Guide to Metagenomics
Microbiol. Mol. Biol. Rev., December 1, 2008; 72(4): 557 - 578.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/5/517    most recent
btk029v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Dalevi, D.
Right arrow Articles by Hermansson, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dalevi, D.
Right arrow Articles by Hermansson, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?