Skip Navigation


Bioinformatics Advance Access originally published online on March 17, 2005
Bioinformatics 2005 21(11):2783-2784; doi:10.1093/bioinformatics/bti389
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/11/2783    most recent
bti389v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Miele, V.
Right arrow Articles by Richard, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Miele, V.
Right arrow Articles by Richard, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

seq++: analyzing biological sequences with a range of Markov-related models

Vincent Miele *, Pierre-Yves Bourguignon , David Robelin , Grégory Nuel and Hugues Richard

UMR CNRS 8071 Statistique et Génome 523 place des Terrasses, 91000 Evry, France

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 MODELS AND FEATURES
 DESIGN AND AVAILABILITY
 REFERENCES
 

Summary: The seq++ package offers a reference set of programs and an extensible library to biologists and developers working on sequence statistics. Its generality arises from the ability to handle sequences described with any alphabet (nucleotides, amino acids, codons and others). seq++ enables sequence modelling with various types of Markov models, including variable length Markov models and the newly developed parsimonious Markov models, all of them potentially phased. Simulation modules are supplied for Monte Carlo methods. Hence, this toolbox allows the study of any biological process which can be described by a series of states taken from a finite set.

Availability: Under the GNU General Public Licence at http://stat.genopole.cnrs.fr/seqpp

Contact: miele{at}genopole.cnrs.fr


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 MODELS AND FEATURES
 DESIGN AND AVAILABILITY
 REFERENCES
 
A considerable range of genomic data can be modelled as sequences of characters taken from various alphabets. Obvious candidates are the nucleic acids or protein sequences. However, RNA or protein secondary structures can similarly be described using character strings. This paper presents seq++, a C++ software library, aiming to be a reference environment for studying the statistical properties of these sequences using a comprehensive set of Markov models. It already implements most of the classical methods, as well as some more recent ones, and offers an extensible framework to experiment with new models, which can be included in subsequent releases of the library.


    MODELS AND FEATURES
 TOP
 Abstract
 INTRODUCTION
 MODELS AND FEATURES
 DESIGN AND AVAILABILITY
 REFERENCES
 
The key to the flexibility of seq++ is its independence from any given alphabet: the algorithms implemented in seq++ handle strings of tokens. These tokens, which can be several characters long are enumerated in an alphabet file also allowing the definition of synonymous token groups. For instance, studying amino acid composition properties of protein sequences is possible by grouping amino acids according to their physical or chemical properties and assigning them a suitable label.

The algorithms themselves focus on Markov chains (MC) sequence modelling, including phased models. These models can be useful when phased emission heterogeneity occur in the observations. For example, it is well known that the third nucleotide of a codon has different occurrence patterns than the two preceding ones. This can be taken into account by fitting one transition matrix (which is the matrix determining emission probabilities for the observations) per phase to yield more accurate results (Borodovsky et al., 1995).

Models provided by seq++ include variable length Markov chains (VLMC) (Bühlmann and Wyner, 1999) and parsimonious Markov chains (PMC) [Bourguignon and Robelin, 2004; P.Y. Bourguignon, 2005 (submitted for publication)]. Let Yi,0<i≤l be the set of tokens in a sequence of length l modelled by a d-order Markov model. In VLMC, the prediction of Yi (the token at position i) can be determined by the preceding words of variable length (Yid'···Yi–1,d' ≤ d). Using words of length less than d can considerably reduce the number of parameters and increase the quality of the model adjustment. The same motivations underlie the use of PMC, where predictors with identical emission probabilities are grouped into motifs representing degenerated token words with possible gaps (Fig. 1). Nevertheless such improvements in the model accuracy can require computational costs.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 1 A Markovian model can be represented by a tree: the root corresponds to the letter Yt to be predicted, the possible values of Yt–1 are on the first level, those of Yt–2 on the second and so on. (a) represents an MC of order 2. In a VLMC, branches can be cut: as in (b), when Yt–1 = C, the law of Yt does not depend on Yt–2. In a PMC, subtrees can be merged: as in (c) the two trees below Yt–1 = A and Yt–1 = T generate a unique tree. In each case, the number of predictors is equal to the number of leaves in the tree.

 
Therefore, methods are implemented in the library for Markovian transition matrix estimation, stationary distribution calculus, word probabilities, total variation distance between two Markovian matrices, likelihood and Bayesian information criterion (BIC) calculus. The efficiency of eigenproblems computation is ensured by the use of Arnoldi algorithms (Lehoucq et al., 1996). These methods aim, for example, to contribute to a motif-detection algorithm (Bulyk, 2003): given a model based on a selection of known motifs of interest and a background model on a target sequence, a sliding window approach can be developed based on the ratio of the likelihoods of the ‘site’ model and the ‘background’ model (high scores corresponding to potential sites; Fig. 2).



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 2 Code example for motif detection on DNA, where m_motif and m_backg are the models of order ord estimated on the known motifs and the target sequence, respectively. proba returns the probability of the word observed between positions t-nphase+1 and t.

 
In addition, the programs estim_m, estim_vlm and estim_pm are also released in seq++ to perform a set of calculus associated with models MC, VLMC and PMC, respectively. Moreover, the program simul_m simulates a sequence according to a Markovian matrix (or p matrices for p phases) previously estimated. When working on homogeneous sequences, biologists are often interested in the P-value of an observation. This P-value is frequently impossible to calculate analytically, thus it can be estimated using simulated control sequences. As a result seq++ provides an efficient environment for Monte Carlo methods.


    DESIGN AND AVAILABILITY
 TOP
 Abstract
 INTRODUCTION
 MODELS AND FEATURES
 DESIGN AND AVAILABILITY
 REFERENCES
 
The object-oriented design of seq++ allows for further evolution. A module dedicated to the Mixture Transition Distribution (MTD) model (Lebre, 2004) is planned for future seq++ releases. To our knowledge, seq++ is the only available library for Markovian sequence analysis. A similar project, libsequence (Thornton, 2003), is dedicated to single nucleotide polymorphism analysis. The GHMM library (http://ghmm.org) for hidden Markov models may be a valuable alternative for various problematics on heterogeneous sequences.

The package is written in ANSI C++ and developed on x86 GNU/Linux systems with GCC 3.4. It has beensuccessfully tested with Intel ICC 8.0, on Sun systems using GCC 3.3 and Apple Mac OSX systems with GCC 3.1. Compilation and installation are compliant with the GNU standard procedure. The library is free and available at http://stat.genopole.cnrs.fr/seqpp. Online documentation is also available. Software using seq++ (Robelin et al., 2003) dedicated to DNA bioinformatics can also be accessed online. seq++ is licensed under the GNU General Public License (http://www.gnu.org/licences.html).


    Acknowledgments
 
We are grateful to M.Baudry for the hardware support, M.Hoebeke and P.Nicolas for the programming advice.

Received on October 28, 2004; revised on February 1, 2005; accepted on March 9, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 MODELS AND FEATURES
 DESIGN AND AVAILABILITY
 REFERENCES
 

    Borodovsky, M., et al. (1995) Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Res., 25, 3554–3562.

    Actes de JOBIM Bourguignon, P.Y. and Robelin, D. (2004) Modèles de Markov parcimonieux.

    Bühlmann, P. and Wyner, A.J. (1999) Variable length Markov chains. Ann. Stat., 27, 480–513[CrossRef].

    Bulyk, M.L. (2003) Computational prediction of transcription-factor binding site locations. Genome Biol., 5, 201[CrossRef][Medline].

    Evry University Internal Report Lebre, S. (2004) Estimation de modèle MTD.

    Rice University Technical Report Lehoucq, R.B., Sorensen, B., Yang, C. (1996) ARPACK user's guide: solution of large-scale eigenvalue problems by implicitly restarted Arnoldi methods.

    Robelin, D., et al. (2003) SIC: a tool to detect short inverted segments in a biological sequence. Nucleic Acids Res., 31, 3669–3671[Abstract/Free Full Text].

    Thornton, K. (2003) libsequence, a C++ class library for evolutionary genetic analysis. Bioinformatics, 19, 2325–2327[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
S.-H. Bae, H. Tang, J. Wu, J. Xie, and S. Kim
dPattern: transcription factor binding site (TFBS) discovery in human genome using a discriminative pattern analysis
Bioinformatics, October 1, 2007; 23(19): 2619 - 2621.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/11/2783    most recent
bti389v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Miele, V.
Right arrow Articles by Richard, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Miele, V.
Right arrow Articles by Richard, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?