
© Oxford University Press

Extendable words in nucleotide sequences

Michael S. Gelfand, Constantine G. Kozhukhin¹ and Pavel A. Pevzner²,³

Institute of Protein Research, Acad. Sci. USSR, Pushchino, Moscow Region, 142292 USSR
¹Department of Computer Sciences, Brandeis University, Waltham, MA 02254
²Department of Mathematics, University of Southern California, Los Angeles, CA 90089-1113, USA and Laboratory of Mathematical Methods, Institute of Genetics of Microorganisms, Moscow 113545, USSR

³To whom reprint requests should be sent at the University of Southern California

Previous statistical analyses revealed several peculiarities of nucleotide sequences that preclude their description by existing models and thus allow one to distinguish DNA and RNA sequences from random A, T, G, C-texts. This is a consequence of the unusual distribution of certain words in nucleotide sequences: while the distribution of (most) words is consistent with Markov models of small orders, the distribution of certain words cannot be described by any previous model (anomalies in distribution of homonucleotide/homopurine/homopyrimidine runs, complementary and mirror palindromes, and non-stationary words). In this work we introduce a probabilistic approach that is partly motivated by analogy with linguistics. We also describe another important feature of DNA/RNA sequences: anomalies in distribution of words of poor nucleotide composition. We show that some classes of these words are the major obstacle for the simple Markov description of nucleotide sequences.
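
As a rough illustration of what it means for a word's distribution to be "consistent with Markov models of small orders", the sketch below compares the observed count of a word with the count expected under a first-order Markov model fitted to the same sequence. It uses a common textbook estimator, not necessarily the procedure used in the article, and the function names (count_words, expected_count_markov1, observed_count) are hypothetical names chosen for this example.

```python
from collections import Counter


def count_words(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_markov1(seq, word):
    """Expected count of `word` under a first-order Markov model fitted to seq.

    Assumed estimator (illustrative only):
        N(w1w2) * N(w2w3) * ... * N(w[k-1]w[k]) / (N(w2) * ... * N(w[k-1]))
    where N(.) is the observed count of a letter or dinucleotide in seq.
    """
    if len(word) < 2:
        raise ValueError("word must have at least two letters")
    pairs = count_words(seq, 2)
    singles = count_words(seq, 1)
    expected = float(pairs[word[:2]])
    for i in range(1, len(word) - 1):
        denom = singles[word[i]]
        if denom == 0:
            # An inner letter never occurs in seq, so the word cannot occur.
            return 0.0
        expected *= pairs[word[i:i + 2]] / denom
    return expected


def observed_count(seq, word):
    """Observed number of (overlapping) occurrences of `word` in seq."""
    return count_words(seq, len(word))[word]
```

A word whose observed count is far from this expectation, for example a long homonucleotide run, is the kind of word that a simple Markov description handles poorly. Judging how far is "far" requires a variance estimate, which this sketch leaves out.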


Received on March 21, 1991; accepted on August 23, 1991



This article has been cited by other articles:


Y. Park and J. L. Spouge
Searching for Multiple Words in a Markov Sequence
INFORMS Journal on Computing, January 1, 2004; 16(4): 341 - 347.



Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.