Bioinformatics Advance Access originally published online on June 26, 2009
Bioinformatics 2009 25(18):2341-2347; doi:10.1093/bioinformatics/btp395
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Transcriptional landscape estimation from tiling array data using a model of signal shift and drift
1INRA, Mathématique Informatique et Génome UR1077, 78350 Jouy-en-Josas, 2AgroParisTech/INRA, Mathématiques et Informatique Appliquées UMR518, 16 rue Claude Bernard, 75005 Paris, France and 3Technical University of Denmark, Center for Biological Sequence analysis, Building 208, 2800 Lyngby, Denmark
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: High-density oligonucleotide tiling array technology holds the promise of a better description of the complexity and the dynamics of transcriptional landscapes. In organisms such as bacteria and yeasts, transcription can be measured on a genome-wide scale with a resolution >25 bp. The statistical models currently used to handle these data remain however very simple, the most popular being the piecewise constant Gaussian model with a fixed number of breakpoints.
Results: This article describes a new methodology based on a hidden Markov model that embeds the segmentation of a continuous-valued signal in a probabilistic setting. For a computationally affordable cost, this framework (i) alleviates the difficulty of choosing a fixed number of breakpoints, and (ii) permits retrieving more information than a unique segmentation by giving access to the whole probability distribution of the transcription profile. Importantly, the model is also enriched and accounts for subtle effects such as signal drift and covariates. Relevance of this framework is demonstrated on a Bacillus subtilis dataset.
Availability: A software is distributed under the GPL.
Contact: pierre.nicolas{at}jouy.inra.fr
Supplementary information: Supplementary data is available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
High-density oligonucleotide tiling arrays carry tightly spaced probes that provide uniform covering of the genomic sequence. By hybridization with RNA samples (cDNA), they have been used to query the transcriptional activity of the whole genome in an array of model organisms (Bertone et al., 2004; Biemar et al., 2006; He et al., 2007; Stolc et al., 2005). The approach is particularly attractive for organisms with small-sized genome such as bacteria and yeasts where a resolution >25 bp is more easily achieved (David et al., 2006; S.Rasmussen et al., submitted for publication). The generalization of the use of such arrays should provide unbiased and high-quality pictures of the complexity and the dynamics of transcriptional landscapes (Xu et al., 2009). The great promise of these data justifies the improvement of the currently available statistical methods dedicated to their analysis.
From the methodological standpoint, the problem is naturally stated in terms of finding segments where the hybridization signal is relatively constant, delimited by breakpoints that are expected to correspond to biological features such as transcript start and stop sites or splicing sites. A variety of tools including local non-parametric smoothing (Royce et al., 2007; Wang et al., 2009) and iterative hypothesis testing (Olshen et al., 2004) have been proposed to answer this question. Probably the most popular and best mathematically grounded methodology consists of seeking the piecewise constant model with Gaussian noise that best fits the signal (Huber et al., 2006; Picard et al., 2005). Namely, for a fixed number of segments S, fitting the model consists of finding the combination of breakpoints 1 < t1
···
tS–1
n that minimizes the sum of squared residuals:
|
| (1) |
The simplicity of this approach is appealing but hinders a number of difficulties, the most important being the choice of the number of segments. In principle, this issue can be tackled by embedding the segmentation model in a probabilistic setting that includes not only the noise but also the evolution of the signal. This idea stimulated the development of hidden Markov models (HMMs) (Fridlyand et al., 2004; Marioni et al., 2006; Stjernqvist et al., 2007) for the analysis of comparative genomic hybridization data. For transcriptomic data, a different approach consists of training HMMs to distinguish between transcribed and non-transcribed regions (Du et al., 2006; Munch et al., 2006). When the quality of the data is good enough it is both more natural and more ambitious to try to recover the denoised transcription signal instead of directly summarizing the data via a classification algorithm. Transcript level is, however, a continuous quantity and none of the available models is satisfactory for a continuous-valued underlying signal. An HMM that achieves this aim at a computationally affordable cost is described in the present article. The proposed model does also extend the piecewise constant model in two directions. First, it integrates the influence of covariates that serve to account for differential affinity between probes. This allows to achieve segmentation and within-array normalization in one step. Second, the proposed model relaxes the assumption of strictly constant transcript levels between abrupt shifts by also allowing progressive drift of the signal. Inference based on this model is examined and discussed.
| 2 METHODS |
|---|
|
|
|---|
2.1 Experimental data
The main example dataset used here comes from pilot experiments conducted on Bacillus subtilis within the European Consortium BaSysBio (S.Rasmussen et al., submitted for publication). This array consists of 383 149 probes starting every 22 nt on each strand of the B.subtilis genome (GenBank: AL009126). Probe lengths range between 45 nt and 65 nt and were adjusted to reduce melting temperature (TM) variations (isothermal design). Production of the tiling arrays, synthesis of labeled cDNA from the RNA samples with random priming, hybridization and signal acquisition were carried out by Nimblegen. Antisense artifacts were controlled by using actinomycin D during reverse transcription (Perocchi et al., 2007). RNA was extracted from B.subtilis culture during exponential growth on rich medium. One out of four biological replicates gave a high-quality signal and is analyzed here (S.Rasmussen et al., submitted for publication). For comparison with the algorithm of Huber et al. (2006), we also analyzed a dataset corresponding to the chromosome 1 of the yeast Saccharomyces cerevisiae (David et al., 2006). This second array was produced by Affymetrix and uses shorter oligonucleotide (25 nt) tiled at intervals of 8 nt on each strand. The data from the three biological replicates were averaged after quantile normalization.
Both experimental settings included hybridization of genomic DNA (gDNA) preparations to assess variation of affinity between probes (four replicates for B.subtilis and three replicates for S.cerevisiae). Data were averaged across replicates after quantile normalization. Bacillus subtilis gDNA data varied smoothly between the replication origin and the replication terminus, presumably reflecting the chromosome dosage. Taking the residuals after median smoothing (window size 110 011 bp) removed this trend. For S.cerevisiae data, we preferred to compute the residuals as the distance to the mode rather than to the median to account for the highly skewed distribution of probe affinities. The formatted datasets are distributed with the software.
2.2 Shift and drift in an HMM framework
Like in previous approaches (Huber et al., 2006; Olshen et al., 2004; Picard et al., 2005), the log2 of the observed intensity xt is modeled as the sum of an unobservable signal ut that is the focus of interest plus a Gaussian noise with SD
. This general model can be written as:
|
| (2) |
(ut, ut+1) and (xt, ut)1
t
n is thus said to be an HMM (Durbin et al., 1998; Rabiner, 1989). Compared with traditional use of HMMs, the complication comes from the continuous nature of ut, whereas the efficient algorithmic machinery of the HMMs (Viterbi algorithm, forward–backward algorithm, expectation-maximization (EM) algorithm) works well for discrete and typically small number of hidden states (Rabiner, 1989). In general, with K hidden states, the time complexity of the algorithms is O(nK2).
Here, we propose a structure of the transition matrix
(ut, ut+1) accounting for abrupt shifts and progressive drifts in the unobservable signal ut that allows to discretize the continuous range Umin
ut
Umax in K points spaced by a regular interval, h=(Umax–Umin)/(K – 1). This particular structure warrants time complexity O(nK) for the classical HMM algorithms and thus permits appropriately high resolution of discretization.
For values of ut and ut+1 taken in the discretized hidden state space, the transition probability writes
|
| (3) |

n,
s,
u,
d
1,
n+
s+
u+
d=1 and 0
u,
d<1, with
{X} standing for 1 if X is true, 0 otherwise.
This transition kernel is best understood as a mixture of four types of moves with weights
n,
s,
u and
d. The parameter
n accounts for unchanged u between successive probes. Shift moves have probability
s and the distribution of the signal after the move is independent of the value of the signal before the move. This distribution is given by
h and it approximates the marginal distribution of the signal. Namely,
h(ut+1) = 
(u)du, where
is the kernel density estimate computed on x with a Gaussian kernel and Scott's bandwidth (Scott, 1992). The possibility of small drift, either upward or downward, is accounted for by
u and
d. Drift amplitudes are modeled by two geometric distributions of parameters
u and
d and average amplitudes write h+h/(1 –
).
It can be verified that as h
0 and h/(1 –
)
the transition kernel of the discrete-valued Markov chain of Equation (3) converges in distribution toward the transition kernel of a continuous-valued Markov chain. In its continuous version, the kernel writes as a mixture of a point mass at ut of weight
n, a continuous-valued distribution of density
and weight
s, and two shifted exponential distributions of rates
u and
d and weights
u and
d. With an appropriately high K it should thus be possible to approach, using the discrete-valued model of Equation (3), the results that one would obtain with the continuous-valued model.
The Supplementary Material available online gives a detailed presentation of the equations that allow O(nK) implementations of the HMM classical algorithms, namely:
- likelihood computation (
(x1...n)),
- forward–backward algorithm (computation of
(ut|x1...n) for each t),
- Viterbi algorithm (finding the trajectory u1...n that maximizes
(u1...n|x1...n)).
2.3 gDNA signal as a covariate
gDNA hybridization data were used in a preprocessing step by Huber et al. (2006) for the purpose of between-probe signal normalization and outlier trimming. The model proposed here accounts for these effects by modeling the gDNA hybridization intensities as a covariate.
The probability distribution for the observed variable xt given the underlying signal ut and the gDNA residuals rt writes as a mixture model
|
| (4) |
(rt) corresponds to the probability of outliers,
(Umin, Umax) is the uniform distribution that models outlier data and
(ut +
(ut)rt,
(ut)2) is the Gaussian distribution modeling non-outlier data. This model is markedly richer than Equation (2). Notice (i) the non-constant proportionality factor
(ut) applied to rt; (ii) the non-constant standard error
(ut) of the Gaussian distribution; and (iii) the probability of outliers
that depends on rt. More precisely,
and
are modeled as piecewise constant function of ut with eight intervals, and
is a two-parameter logistic function of the absolute value of rt,
(rt) = 1/(1 + e–(a+b|rt|)). All the parameters are simultaneously estimated with the EM algorithm (see Supplementary Material). Finally, left and right censoring are incorporated in the model to account for the experimental limitations that preclude exact measurements of extremely high and extremely low intensities. In practice, the lower and upper 5% of the original range of variation of the intensity x are considered as censored.
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
3.1 Selecting the appropriate level of discretization
The model was designed with the explicit aim of modeling a continuous-valued underlying signal. In other words, discretization of the hidden state space is seen only as a necessary technicality and the step h
1/K should ideally be sufficiently small to have no impact on the results. Intuitively, the smaller the SD of the noise
, the smaller the step h should be. The results obtained on the B.subtilis dataset and presented in Figure 1 confirm this intuition and thereby provide some form of validation for the model.
|
Figure 1 shows that increasing K (and thus decreasing h) actually increases the model adequation to the data as measured by the log-likelihood after ML estimation. Beyond a certain value of K the impact of this change becomes, however, almost unnoticeable. Figure 1 also reports the parallel evolution of h and
. According to this plot, having h around 0.5
seems more than sufficient. Indeed, with such a value of h, the 95% confidence interval (CI) of the distribution of the noise is about eight times as large as the discretization interval h. K was set to 100 for this particular dataset. This choice of K = 100 corresponds to an acceptable running time for the algorithm. Our setting throughout this study consisted to explore 10 random starting points for the EM algorithm. Here, it resulted in a total of 885 iterations taking 5 h 6 min on an Intel(R) Xeon(TM) CPU 3.40 GHz CPU, less than the 5 h 36 min needed for the segmentation algorithm of Huber et al. (2006) with maximum segment length l = 1000 (22 000 bp) and segment number on each strand S = 1500.
3.2 Importance of modeling drift and covariates
Parameter estimates in model-based analyses are an invaluable source of information to understand both the behavior of the model and the data. The model contains a total of 23 parameters. Figure 2 is intended to provide an overview of their ML estimates on the B.subtilis data. The first row of Table 1 gives numerical values for a selection of parameters.
|
|
The shape of the transition matrix that describes the trajectory of the underlying signal is defined by the parameters in Equation (3), one row of this matrix is shown in Figure 2A. The sharp peak reflects the high value of
n: it is estimated that the underlying signal remains unchanged between adjacent probes in > 85% of the cases (
s in Table 1). The narrow shoulders on both sides of the peak correspond to the upward and downward drift moves and reflect the value of the parameters (
u,
u) and (
d,
d), respectively. Close inspection reveals a small asymmetry, with upward moves being less frequent than downward moves (5.0% versus 7.8%). The small estimated proportion of abrupt shift moves between adjacent probes is almost invisible at this scale (1.2%). As expected, the probability of outliers is estimated to increase with the magnitude of the residuals of the gDNA signal. The two-parameter logistic curve that models this relationship is shown in Figure 2C. Remarkably, the probability of outliers is found to be overall very small.
The parameters
and
that model the observed intensity xt are modeled as eight-parameter piecewise constant functions of the underlying signal level ut. Figures 2B and D show these two functions. Whereas the SD of the noise
is a relatively flat function of ut, the parameter
that serves to account for the gDNA covariate varies by more than a factor of eight. An obvious characteristic of the latter is its sharp decrease for low values of the signal. This behavior probably reflects higher level of non-specific signal in the lower end of the intensity spectrum. It is also re-insuring to observe that the value of
in the middle of the spectrum is just slightly below unity, the value that we expect in an idealized situation [see the rationale behind the preprocessing step in Huber et al. (2006)].
As a whole, these results emphasize the importance of two specificities of our model: the modeling of drift moves as a complement to shift moves and the non-constant
that provides a simple adaptive method to account for the variation of affinity between probes.
To better understand the behavior of the model and the characteristics of the data, we carried out a comparative analysis of eight models. For the purpose of robust assessment of model fitness with respect to the B.subtilis dataset each model was fitted two times, once on each strand of the chromosome, and the likelihood was each time computed on the other strand. The sum of both log-likelihood terms is reported as the cross-validated log-likelihood in Table 1. Parameter values in Table 1 were estimated on the full dataset.
Sorted by decreasing value of adequacy with the data, the models ranged from
1, the full 23-parameter model, to
8, a nine-parameter model that does not account for drifts, outliers nor covariates. Not accounting for outliers has only a small impact on the overall model fitness (
2 versus
1), but the probability of shift moves is increased by >30% in this simpler model. This can have a non-negligible impact in practice given that these particular shift moves are indeed likely to be spurious. Not modeling drifts has a much more pronounced impact (
5 versus
1). Fitness is 6.5% better for
1 than for
5 and the estimated proportion of shift moves is about four times lower in
1 (1.2% versus 4.6%), suggesting that a substantial fraction of the drift moves in
1 are interpreted as shift moves in
5. A closer examination underscores the importance of downward drift as compared with upward drift. Not accounting for downward drift has 74% more effect on the overall fitness that not accounting for upward drift (
3 and
4 versus
1). More spectacularly, if a single drift direction is allowed, modeling downward drift improves the model
4.5 times more than modeling only upward drift (
3 versus
5 and
4 versus
5). Setting
to either 1 or 0 were both found to result in a dramatic drop in fitness but with different specific effects. Setting
to 1 in
6 results in estimation of high drift compared with original model, whereas setting
to 0 in
7 results in estimation of high noise.
3.3 Estimation of transcriptional landscape: illustration on B.subtilis data
The ultimate goal of the use of the model is to infer the underlying signal supposed to reflect the actual transcriptional landscape.
The adoption of a probabilistic setting for the trajectory of the underlying signal allows for a considerably richer signal reconstruction than just optimal trajectory reconstruction. Figure 3 gives an illustration of these possibilities by superimposing a number of results obtained with the model on a 10 000 bp region of the B.subtilis chromosome. Results include: (i) the prediction interval for the value of the signal ut at each chromosome position; (ii) a point prediction for the signal value by the conditional mean of ut (the best predictor in terms of quadratic error); (iii) the inferred position of the experimental point after correction for differential probe affinity [computed as
]; (iv) the exact position of each type of move in the best trajectory given by the Viterbi path (abrupt shift, upward drift and downward drift); and (v) the probability of having each type of move at each position. All these values can be read directly from the output of our software.
|
The biological pertinence of the distinction between shifts and drifts seems remarkable in Figure 3. Inferred shifts are found mostly in intergenic regions that a priori correspond to possible positions for transcriptional promoters and terminators.
The position of each move (2893 shifts and 13 460 drifts) was compared with sequence predictions for two biological features: Rho-independent (intrinsic) terminators predicted with the algorithm of d'Aubenton-Carafa et al. (1990); promoters dependent on Sigma-A predicted using an HMM whose structure was chosen according to the results of Nicolas et al. (2006). To fulfill the needs of an unbiased analysis, both categories of predictions were made without prior on the position of the genes and confidence cutoffs were set relatively low to increase sensitivity (a total 4164 Sigma-A predictions and 3492 terminator predictions are considered).
The results presented in Figure 4 confirm the practical relevance of the distinction between shift and drift moves. For upward moves, it shows the difference between shift and drift with respect to the distance between the breakpoint and the nearest promoter prediction. Similar results for downward moves and terminator predictions are presented in the Supplementary Material (Fig. S1). Although shifts represent only 18% of all moves, a clear majority of the moves lying at <22 bp of a predicted biological feature are shifts. The proportion of shifts is 59% among the 977 upward moves near a predicted promoter, and 71% among the 1157 moves near a predicted terminator.
|
Drift might partly reflect local variations of labeled cDNA that result from technical artifacts such as random priming bias. Drift could also reflect biological differences in the amount of mRNA. In particular, Figures 4 and S1 leave no doubt that a fraction of the drifts correspond to promoters or terminators whose activity is too weak to be detected as shifts in this biological condition. A preliminary exploration of the patterns of drift is reported in the Supplementary Material. Figure S2 shows that downward drift is most pronounced after upward shifts and before downward shifts, near the 5' and 3' ends of transcriptionally active regions. An excess of upward drift is found before upward shifts, at the 3' end of regions with low-transcriptional activity. Random priming artifacts could most easily be invoked to explain downward drift at the 3' end of transcriptionally active regions (Xu et al., 2009). Downward drift may also, for instance, be partly caused by molecules whose synthesis is still incomplete. Here, no single explanation could apparently account the patterns of upward and downward drift. Instead, drift is observed in a variety of chromosomal and transcriptional contexts that the landscape snapshots presented in Figures S3, S4 and S5 intend to illustrate. As an example, some spectacular cases of downward drift are found for transcription units apparently lacking a clear terminator. The intensity of the resulting downstream antisense transcription drifts downward progressively. In Figure S3, a pattern reminiscent of the bidirectional transcriptional activity recently described in S.cerevisiae (Xu et al., 2009) can also be observed.
3.4 Benchmark comparisons
In addition to allow insightful reconstructions of the transcriptional landscape, good algorithms should identify breakpoints that match, as closely as possible, the position of the promoters and terminators. To compare different sets of breakpoints, promoter and terminator predictions were used as a proxy for the true (unknown) reference. Results are shown in Figure 5.
|
The results obtained with the HMMs,
1,
5 and
8, give another confirmation of the biological pertinence of the distinction between shift moves and drift moves in
1. It also revealed the deep impact of the correction for variation of affinity between probes using covariates, not implemented in
8. The misbehavior of
8 translates paradoxically in an apparent success at detecting terminators. This most likely does not reflect the transcription signal itself, but rather the low probe affinity due to the stem–loop secondary structure distinctive of the rho-independent terminators. For the comparison of the new HMM segmentation method and the piecewise constant regression implemented in the algorithm of Huber et al. (2006), the later was run on the data after correction for difference of affinity between probes (as shown in Fig. 3) with maximum segment length l = 22000 bp and number of segments on each strand S between 1000 and 3000. Results clearly demonstrate the benefit of the new HMM framework. For S = 1 500, the number of breakpoints matching promoter and terminator predictions were, respectively, 8.9% and 25% higher for the HMM.
3.5 Results on S.cerevisiae data
Examination of the segmentation produced by piecewise constant regression on Watson (+)-strand of S.cerevisiae yeast chromosome 1 leads to the choice of 152 (average segment size 1500 bp) as a sensible number of breakpoints (Huber et al., 2006). A question was thus whether the automatic procedure presented here will identify a similar number of shift moves. The model was fitted on the mRNA and gDNA data of the 57 616 probes representing both strands of the chromosome 1.
The Viterbi path of our HMM on the (+)-strand contained 125 shift moves and 373 drift moves with a median distance of 60 bp between each of the 152 breakpoints of Huber et al. (2006) and the closest of the 125 shift moves. On this dataset, modeling drift can thus be useful to single out the most abrupt changes in the signal intensity.
Interestingly, further comparisons of models with and without drift indicated that drift improve the model fitness by only 1.4% on the S.cerevisiae data, much less than the 6.5% found on B.subtilis data. Biology and array technology are two sources of possible differences between S.cerevisiae and B.subtilis datasets. Our model of drift seems more relevant for prokaryotic data obtained using long isothermal probes.
| 4 CONCLUSIONS |
|---|
|
|
|---|
This article describes a new methodology based on an HMM that embeds the segmentation of a continuous-valued signal in a probabilistic setting. For a computationally affordable cost, this framework alleviates the difficulty of choosing a fixed number of breakpoints and permits retrieving more information than a unique segmentation. Probabilistic modeling makes it straightforward to compute confidence measures on the estimated transcriptional landscape. This information should prove particularly useful to pinpoint the differences in large collections of arrays. Extension of the model could also be imagined to tackle the problem of the joint segmentation of datasets where transcript boundaries and expression level differ.
By accounting for gDNA hybridization data as a covariate, the model automatically corrects the data for the variation of affinity between probes. David et al. (2006) proposed for this purpose a preprocessing step to be carried out on the raw data, before log-transformation, and producing a significant fraction of negative values. The data could thus no longer be simply log-transformed and more complicated variance stabilization transformation, requiring multiple arrays, was used (Huber et al., 2002). In comparison, the normalization carried out by the model needs only one array and it alters only minimally the overall distribution of the log of the original data.
The model is also enriched and accounts for subtle effects such as signal drift and covariates. Interestingly, our results unambiguously document the existence of a drifts in the B.subtilis dataset. The interest of this observation is 2-fold. First, drift have not been accounted in the previous models and this may partially explain why selecting the number of breakpoints on real dataset proved so difficult (Huber et al., 2006; Picard et al., 2005). Second, the causes and the patterns of drift deserve to be investigated if we want to make the best use of tiling array expression data.
The software is distributed under the GNU Public License http://genome.jouy.inra.fr/
pnicolas/hmmtiling/.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Etienne Dervyn, Philippe Noirot and Franck Picard for constructive comments on the content of the manuscript.
Funding: BaSysBio project, European Commission research grant (LSHG-CT2006-037469).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
Received on January 27, 2009; revised on May 11, 2009; accepted on June 19, 2009
| REFERENCES |
|---|
|
|
|---|
d'Aubenton Carafa Y, et al. Prediction of rho-independent Escherichia coli transcription terminators. A statistical analysis of their RNA stem-loop structures. J. Mol. Biol. (1990) 216:835–858.[Web of Science][Medline]
Bertone P, et al. Global identification of human transcribed sequences with genome tiling arrays. Science (2004) 306:2242–2246.
Biemar F, et al. Comprehensive identification of Drosophila dorsal-ventral patterning genes using a whole-genome tiling array. Proc. Natl. Acad. Sci. USA (2006) 103:12763–12768.
David L, et al. A high-resolution map of transcription in the yeast genome. Proc. Natl Acad. Sci. USA (2006) 103:5320–5325.
Du J, et al. A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics (2006) 22:3016–3024.
Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. (1998) Cambridge University Press.
Fridlyand J, et al. Hidden Markov model analysis of array CGH data. J. Multivar. Anal. (2004) 90:132–153.[CrossRef][Web of Science]
He H, et al. Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray. Genome Res. (2007) 17:1471–1477.
Huber W, et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics (2002) 18(Suppl. 1):96–104.
Huber W, et al. Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics (2006) e22:1963–1970.
Marioni JC, et al. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics (2006) 22:1144–1146.
Munch K, et al. A hidden Markov model approach for determining expression from genomic tiling micro arrays. BMC Bioinformatics (2006) 7:e239.[CrossRef]
Nicolas P, et al. A reversible jump Markov chain Monte Carlo algorithm for bacterial promoter motifs discovery. J. Comput. Biol. (2006) 13:651–667.[CrossRef][Web of Science][Medline]
Olshen AB, et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (2004) 5:557–572.[Abstract]
Perocchi F, et al. Antisense artifacts in transcriptome microarray experiments are resolved by actinomycin D. Nucleic Acids Res. (2007) 35:e128.
Picard F, et al. A statistical approach for array CGH data analysis. BMC Bioinformatics (2005) 6:e27.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE (1989) 77:257–286.[CrossRef]
Royce TE, et al. An efficient pseudomedian filter for tiling microrrays. BMC Bioinformatics (2007) 8:e186.[CrossRef]
Scott DW. Multivariate Density Estimation. Theory, Practice and Visualization (1992) New York: Wiley.
Stolc V, et al. Identification of transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling arrays. Proc. Natl Acad. Sci. USA (2005) 102:4453–4458.
Stjernqvist S, et al. Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics (2007) 23:1006–1014.
Wang L-Y, et al. MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome Res. (2009) 19:106–117.
Xu Z, et al. Bidirectional promoters generate pervasive transcription in yeast. Nature (2009) 457:1033–1037.[CrossRef][Web of Science][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






