Bioinformatics Advance Access originally published online on March 17, 2005
Bioinformatics 2005 21(11):2783-2784; doi:10.1093/bioinformatics/bti389
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
seq++: analyzing biological sequences with a range of Markov-related models
UMR CNRS 8071 Statistique et Génome 523 place des Terrasses, 91000 Evry, France
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Summary: The seq++ package offers a reference set of programs and an extensible library to biologists and developers working on sequence statistics. Its generality arises from the ability to handle sequences described with any alphabet (nucleotides, amino acids, codons and others). seq++ enables sequence modelling with various types of Markov models, including variable length Markov models and the newly developed parsimonious Markov models, all of them potentially phased. Simulation modules are supplied for Monte Carlo methods. Hence, this toolbox allows the study of any biological process which can be described by a series of states taken from a finite set.
Availability: Under the GNU General Public Licence at http://stat.genopole.cnrs.fr/seqpp
Contact: miele{at}genopole.cnrs.fr
| INTRODUCTION |
|---|
|
|
|---|
A considerable range of genomic data can be modelled as sequences of characters taken from various alphabets. Obvious candidates are the nucleic acids or protein sequences. However, RNA or protein secondary structures can similarly be described using character strings. This paper presents seq++, a C++ software library, aiming to be a reference environment for studying the statistical properties of these sequences using a comprehensive set of Markov models. It already implements most of the classical methods, as well as some more recent ones, and offers an extensible framework to experiment with new models, which can be included in subsequent releases of the library.
| MODELS AND FEATURES |
|---|
|
|
|---|
The key to the flexibility of seq++ is its independence from any given alphabet: the algorithms implemented in seq++ handle strings of tokens. These tokens, which can be several characters long are enumerated in an alphabet file also allowing the definition of synonymous token groups. For instance, studying amino acid composition properties of protein sequences is possible by grouping amino acids according to their physical or chemical properties and assigning them a suitable label.
The algorithms themselves focus on Markov chains (MC) sequence modelling, including phased models. These models can be useful when phased emission heterogeneity occur in the observations. For example, it is well known that the third nucleotide of a codon has different occurrence patterns than the two preceding ones. This can be taken into account by fitting one transition matrix (which is the matrix determining emission probabilities for the observations) per phase to yield more accurate results (Borodovsky et al., 1995).
Models provided by seq++ include variable length Markov chains (VLMC) (Bühlmann and Wyner, 1999) and parsimonious Markov chains (PMC) [Bourguignon and Robelin, 2004; P.Y. Bourguignon, 2005 (submitted for publication)]. Let Yi,0<i
l be the set of tokens in a sequence of length l modelled by a d-order Markov model. In VLMC, the prediction of Yi (the token at position i) can be determined by the preceding words of variable length (Yid'
Yi1,d'
d). Using words of length less than d can considerably reduce the number of parameters and increase the quality of the model adjustment. The same motivations underlie the use of PMC, where predictors with identical emission probabilities are grouped into motifs representing degenerated token words with possible gaps (Fig. 1). Nevertheless such improvements in the model accuracy can require computational costs.
|
Therefore, methods are implemented in the library for Markovian transition matrix estimation, stationary distribution calculus, word probabilities, total variation distance between two Markovian matrices, likelihood and Bayesian information criterion (BIC) calculus. The efficiency of eigenproblems computation is ensured by the use of Arnoldi algorithms (Lehoucq et al., 1996). These methods aim, for example, to contribute to a motif-detection algorithm (Bulyk, 2003): given a model based on a selection of known motifs of interest and a background model on a target sequence, a sliding window approach can be developed based on the ratio of the likelihoods of the site model and the background model (high scores corresponding to potential sites; Fig. 2).
|
In addition, the programs estim_m, estim_vlm and estim_pm are also released in seq++ to perform a set of calculus associated with models MC, VLMC and PMC, respectively. Moreover, the program simul_m simulates a sequence according to a Markovian matrix (or p matrices for p phases) previously estimated. When working on homogeneous sequences, biologists are often interested in the P-value of an observation. This P-value is frequently impossible to calculate analytically, thus it can be estimated using simulated control sequences. As a result seq++ provides an efficient environment for Monte Carlo methods.
| DESIGN AND AVAILABILITY |
|---|
|
|
|---|
The object-oriented design of seq++ allows for further evolution. A module dedicated to the Mixture Transition Distribution (MTD) model (Lebre, 2004) is planned for future seq++ releases. To our knowledge, seq++ is the only available library for Markovian sequence analysis. A similar project, libsequence (Thornton, 2003), is dedicated to single nucleotide polymorphism analysis. The GHMM library (http://ghmm.org) for hidden Markov models may be a valuable alternative for various problematics on heterogeneous sequences.
The package is written in ANSI C++ and developed on x86 GNU/Linux systems with GCC 3.4. It has beensuccessfully tested with Intel ICC 8.0, on Sun systems using GCC 3.3 and Apple Mac OSX systems with GCC 3.1. Compilation and installation are compliant with the GNU standard procedure. The library is free and available at http://stat.genopole.cnrs.fr/seqpp. Online documentation is also available. Software using seq++ (Robelin et al., 2003) dedicated to DNA bioinformatics can also be accessed online. seq++ is licensed under the GNU General Public License (http://www.gnu.org/licences.html).
| Acknowledgments |
|---|
We are grateful to M.Baudry for the hardware support, M.Hoebeke and P.Nicolas for the programming advice.
Received on October 28, 2004; revised on February 1, 2005; accepted on March 9, 2005
| REFERENCES |
|---|
|
|
|---|
Borodovsky, M., et al. (1995) Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Res., 25, 35543562.
Actes de JOBIM Bourguignon, P.Y. and Robelin, D. (2004) Modèles de Markov parcimonieux.
Bühlmann, P. and Wyner, A.J. (1999) Variable length Markov chains. Ann. Stat., 27, 480513[CrossRef].
Bulyk, M.L. (2003) Computational prediction of transcription-factor binding site locations. Genome Biol., 5, 201[CrossRef][Medline].
Evry University Internal Report Lebre, S. (2004) Estimation de modèle MTD.
Rice University Technical Report Lehoucq, R.B., Sorensen, B., Yang, C. (1996) ARPACK user's guide: solution of large-scale eigenvalue problems by implicitly restarted Arnoldi methods.
Robelin, D., et al. (2003) SIC: a tool to detect short inverted segments in a biological sequence. Nucleic Acids Res., 31, 36693671
Thornton, K. (2003) libsequence, a C++ class library for evolutionary genetic analysis. Bioinformatics, 19, 23252327
This article has been cited by other articles:
![]() |
S.-H. Bae, H. Tang, J. Wu, J. Xie, and S. Kim dPattern: transcription factor binding site (TFBS) discovery in human genome using a discriminative pattern analysis Bioinformatics, October 1, 2007; 23(19): 2619 - 2621. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


