Bioinformatics Advance Access published online on October 12, 2006

Bioinformatics, doi:10.1093/bioinformatics/btl515
© 2006 The Author(s)
Received July 24, 2006
Revised September 15, 2006
Accepted October 4, 2006

Article

A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge

Jiang Du 1, Joel S. Rozowsky 2, Jan O. Korbel 2, Zhengdong D. Zhang 2, Thomas E. Royce 3, Martin H. Schultz 1, Michael Snyder 4, and Mark Gerstein 5 *

1 Department of Computer Science, Yale University, New Haven, CT 06520, USA
2 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
3 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
4 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA; Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA
5 Department of Computer Science, Yale University, New Haven, CT 06520, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA

* To whom correspondence should be addressed.
Mark Gerstein, E-mail: mark.gerstein{at}yale.edu


   Abstract

Motivation: Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array data sets into "active regions" (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion mainly based on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for doing this. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing.

Methodology: We use a hidden Markov model (HMM) framework, which explicitly models the dependency between neighboring probes and whose extended version (the generalized HMM) also allows an explicit description of the state duration density. We introduce a formal definition of the tiling-array analysis problem and explain how it can be used to describe the sampling of small genomic regions for experimental validation, which builds up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively).
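To make the supervised setting concrete, the following is a minimal sketch of a two-state HMM (background versus active) trained by counting over validated, labeled probes and then used to segment a new signal track with the Viterbi algorithm. It is an illustration only, not the authors' implementation: the Gaussian emission model, the pseudocounts, the variance floor, and all function and variable names are assumptions introduced here, and the generalized HMM with explicit state durations is not covered.

```python
import math

def train_supervised_hmm(signals, labels, n_states=2, pseudocount=1.0, min_var=1e-3):
    """Estimate HMM parameters by counting over labeled (validated) regions.

    signals: list of per-region probe intensity lists (hypothetical input format)
    labels:  matching lists of integer states, e.g. 0 = background, 1 = active
    Assumes every state occurs at least once in the labels.
    """
    start = [pseudocount] * n_states
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    total = [0.0] * n_states
    total_sq = [0.0] * n_states
    count = [0] * n_states
    for sig, lab in zip(signals, labels):
        start[lab[0]] += 1
        for a, b in zip(lab, lab[1:]):      # neighboring-probe transitions
            trans[a][b] += 1
        for x, s in zip(sig, lab):          # per-state emission statistics
            total[s] += x
            total_sq[s] += x * x
            count[s] += 1
    start_sum = sum(start)
    start = [v / start_sum for v in start]
    trans = [[v / sum(row) for v in row] for row in trans]
    means = [total[s] / count[s] for s in range(n_states)]
    # Assumed Gaussian emissions; the floor keeps the variance strictly positive.
    variances = [max(total_sq[s] / count[s] - means[s] ** 2, min_var)
                 for s in range(n_states)]
    return start, trans, means, variances

def log_gauss(x, mean, var):
    """Log density of a normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(signal, model):
    """Return the most likely state path for a list of probe intensities."""
    if not signal:
        return []
    start, trans, means, variances = model
    n = len(start)
    score = [math.log(start[s]) + log_gauss(signal[0], means[s], variances[s])
             for s in range(n)]
    back = []
    for x in signal[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + log_gauss(x, means[s], variances[s]))
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):              # trace back the best path
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy usage with made-up intensities and labels (not data from the paper).
model = train_supervised_hmm([[0.1, 0.2, 0.0, 2.1, 1.9, 2.0, 0.1, 0.0]],
                             [[0, 0, 0, 1, 1, 1, 0, 0]])
segmentation = viterbi([0.0, 0.1, 2.0, 2.2, 1.8, 0.2, 0.1], model)
```

Runs of consecutive probes assigned to the active state in `segmentation` would correspond to candidate active regions. Because the transition probabilities are estimated from labeled neighbors, isolated noisy probes are less likely to break a region than under an independent per-probe threshold.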

Results: For the practical sampling and training strategies, we show how the size of and noise in the validated training data affect the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework processes tiling array data efficiently and performs as well as or better than previous approaches. For the idealized sampling strategies, we show how their performance can be assessed in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the best-performing gold standard. This latter result has strong implications for how medium-scale validation experiments should best be carried out to verify the results of genome-scale tiling array experiments.
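One way to read the maximum entropy sampling idea is sketched below: slide a fixed-size window along the probe signal, bin the intensities, and pick the window whose binned intensities have the highest Shannon entropy, so that the selected sub-region mixes very different signal levels. This is a hedged illustration under stated assumptions, not the procedure from the paper: the equal-width binning, the window size, the number of bins, and the function names are all hypothetical choices made here.

```python
import math
from collections import Counter

def window_entropy(values, lo, hi, n_bins=8):
    """Shannon entropy (bits) of intensities binned into equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins or 1.0       # avoid zero width for a flat signal
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def pick_max_entropy_window(signal, window, n_bins=8):
    """Return (start index, entropy) of the first window with maximal entropy."""
    lo, hi = min(signal), max(signal)
    best_start, best_h = 0, -1.0
    for start in range(len(signal) - window + 1):
        h = window_entropy(signal[start:start + window], lo, hi, n_bins)
        if h > best_h:
            best_start, best_h = start, h
    return best_start, best_h

# Toy usage with a made-up signal: flat low, a mixed stretch, then flat high.
toy_signal = [0.0] * 7 + [4.0, 0.0] + [4.0] * 7
start, entropy = pick_max_entropy_window(toy_signal, window=4, n_bins=2)
```

In this toy signal the flat stretches have zero entropy, so the selected window is the one straddling the mixed stretch. A region chosen this way would contain both low and high intensities, which is the property the abstract attributes to the best-performing gold standard.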

Supplementary information: The supplementary materials are available at http://tiling.gersteinlab.org/hmm/.


Associate Editor: Martin Bishop




