Skip Navigation


Bioinformatics Advance Access originally published online on December 1, 2007
Bioinformatics 2008 24(3):396-403; doi:10.1093/bioinformatics/btm592
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/3/396    most recent
btm592v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Reiss, D. J.
Right arrow Articles by Baliga, N. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Reiss, D. J.
Right arrow Articles by Baliga, N. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Model-based deconvolution of genome-wide DNA binding

David J. Reiss *, Marc T. Facciotti and Nitin S. Baliga

Institute for Systems Biology, 1441 N. 34th St. Seattle, WA 98103-8904, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Chromatin immunoprecipitation followed by hybridization to a genomic tiling microarray (ChIP-chip) is a routinely used protocol for localizing the genomic targets of DNA-binding proteins. The resolution to which binding sites in this assay can be identified is commonly considered to be limited by two factors: (1) the resolution at which the genomic targets are tiled in the microarray and (2) the large and variable lengths of the immunoprecipitated DNA fragments.

Results: We have developed a generative model of binding sites in ChIP-chip data and an approach, MeDiChI, for efficiently and robustly learning that model from diverse data sets. We have evaluated MeDiChI's performance using simulated data, as well as on several diverse ChIP-chip data sets collected on widely different tiling array platforms for two different organisms (Saccharomyces cerevisiae and Halobacterium salinarium NRC-1). We find that MeDiChI accurately predicts binding locations to a resolution greater than that of the probe spacing, even for overlapping peaks, and can increase the effective resolution of tiling array data by a factor of 5x or better. Moreover, the method's performance on simulated data provides insights into effectively optimizing the experimental design for increased binding site localization accuracy and efficacy.

Availability: MeDiChI is available as an open-source R package, including all data, from http://baliga.systemsbiology.net/medichi.

Contact: dreiss{at}systemsbiology.org

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Localization of the genomic targets of DNA-binding proteins (e.g. transcription factors (TFs), histones, etc.) is central to understanding gene regulation and chromatin organization. Chromatin ImmunoPrecipitation (ChIP) is an experimental protocol that can be used to enrich protein–DNA complexes using antibodies against a protein of interest. The chromatin in these enriched complexes are then randomly fragmented and hybridized against a tiling microarray (chip). This combined procedure, hybridizing a ChIP sample against a DNA microarray, is often referred to as ChIP-chip or ChIP-to-chip (Ren et al., 2000; Lee et al., 2002).

The enriched DNA fragments are of variable length, usually ~300–500 bp long—often far longer than the spacing (tiling array resolution; hereafter TAR) of the microarray probes. Thus, each binding event influences the intensities of multiple probes adjacent to the one containing the binding site (TFBS). Most methods for detecting binding sites in such data identify contiguous windows of significantly enriched probes (e.g. Buck et al., 2005; Ji and Wong, 2005; Johnson et al., 2006). Therefore, regardless of the TAR or the accuracy of these detection methods, the locations of the identified TFBS are still limited to windows of at least the average length of the enriched DNA fragments.

Of course, the accuracy of all TFBS prediction methods using ChIP-chip data (including the one we will describe here) depends upon proper normalization of the data (e.g. normalization versus a reference sample of sonicated chromatin, or via models of probe-specific hybridization, such as those of MAT (Johnson et al., 2006) or others (Cambon et al., 2007). Some methods for detecting TFBS integrate normalization into their procedure; others assume that the normalization is performed ahead-of-time. However, no method is immune to poor data quality or improper normalization (see e.g. Euskirchen et al., 2007 for a review).

Recently, Qi et al. (2006) developed the Joint Binding Deconvolution (JBD), a method that learns a Bayesian graphical model of joint binding events in normalized ChIP-chip data to infer posterior-binding probabilities (and strengths) at resolutions (~30 bp) better than that allowed by previous methods—and better than the TAR itself. Uniquely, the method uses a model for peak profiles derived from experimentally-measured DNA fragment lengths. JBD requires multiple replicates or previous experiments (in order to parameterize the noise model), as well as DNA fragment lengths, measured separately for each sample via electrophoretic analysis. Additionally, as it is a Bayesian inference procedure, JBD requires several user-defined prior parameters, e.g. two hyperprior parameters for the likelihood of binding (which will vary depending on the type of DNA-binding molecule). JBD's generated binding distributions may be used as a prior constraint for motif detection (Qi et al., 2006). However, their interpretation as discrete TFBS necessitates post-processing (via a user-defined threshold), which can decrease the effective resolution and information content of the predictions.

In this article, we introduce MeDiChI, a regression-based method that learns a generative model of joint (multiple, potentially overlapping) binding events in normalized ChIP-chip data. It estimates the positions of likely binding sites via constrained, penalized least squares regression. In Section 2, we derive a model for the ChIP-chip profile of an individual binding event, which becomes the deconvolution kernel in a joint model of multiple binding events. We show how this model is learned from data, and demonstrate (Section 3) that our procedure is robust, and is able to correctly predict locations of binding events at high resolution from data of widely varying quality and resolution: original H. salinarium NRC-1 general transcription factor (GTF) TFBd binding data; previously-published data for the yeast TF Gcn4 (Pokholok et al., 2005); and simulated binding data. We also perform an in-depth comparison of MeDiChI with other popular ChIP-chip analysis methods, including JBD (Qi et al., 2006). Finally, we show that the application of MeDiChI to simulated data reveals important considerations for the design of future ChIP-chip experiments.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The model of a single binding event
The ChIP-chip assay has been described previously in detail (Ren et al., 2000; Lee et al., 2002), and we depict the protocol and primary aspects of our single binding event model (on a tiling array) in Figure 1. We introduce the following physical parameters (all genomic units are in base-pairs):

  1. the probe length, 2 x lP (assumed to be the same for all probes),
  2. the footprint 2 x {lambda} of the DNA binding protein on the DNA,
  3. the distribution f(l) of enriched, sonicated DNA fragments of length l cross-linked to the TF,
  4. the intensity I(l) of a labeled DNA fragment of length l and
  5. the conditional probability {Phi}(v) of hybridization of each fragment to those probes to which it has contiguous complementary by v bp.
An enriched DNA fragment of length l containing a binding site at genomic coordinate x0 must begin at a coordinate somewhere on the interval [x0 + {lambda}l; x0{lambda}]. Such a fragment can hybridize with any probe that intersects the interval [x0 + {lambda}l; x0{lambda} + l ]. These probes' center coordinates will then lie in the range [x0 + {lambda}llP; x0{lambda} + l + lP]. Therefore, the fragment has the following conditional probability of binding to a probe centered at position x:


Formula 1

(1)


Figure 1
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Top: The ChIP-chip protocol for a DNA-binding protein results in a profile, in a tiling array, for a single binding event similar to the one shown here. Bottom: Physical model parameters, as described in the text.

 
The final intensity profile is obtained by integrating the fragment length distribution f(l) and intensities I(l) over all fragment lengths l, weighted by the probability Pl(xx0) (Equation 1) of hybridizing with the probe at xx0:


Formula 2

(2)
K(xx0) is a symmetric, inverted, nearly-triangular profile, with broader wings contributed by longer DNA fragments. This profile is used as a kernel for subsequent deconvolution via linear regression.

Model parameters: The parameter with the greatest influence on the shape of K(xx0) is the distribution of fragment lengths f (l), which we parameterize using a Gamma distribution, {Gamma}({alpha}, β). We use a Sigmoidal distribution, 1/(1 + exp[–(v v0) x vS]) for {Phi}(v). We find that the effects of I(l) and {Phi}(v) on the profile shape are highly covariant with that of f(l), and therefore these first two are fixed at canonical values: I(l) {propto} l; v0 {equiv} 15; vS {equiv} 0.1 (making {Phi}(v) a step function [1 iff v ≥ 15; 0 otherwise]). Any difference between these settings and the ‘real’ values can be captured by changes in the scale {alpha} and shape β of f(l). Practically, while {lambda} can strongly influence the profile's shape (by dulling the point of the peak profile), it can also be captured by imposing a lower-limit on f(l). Thus only f(l) is necessary to uniquely define K(xx0) for a given data set (for which lP is given). f(l) for each ChIP-chip sample may be obtained via electrophoretic analysis, such as via a DNA BioAnalyzer (Agilent Technologies, Santa Clara, CA) (Qi et al., 2006). Alternatively, f(l) (and, optionally {lambda}) may be learned directly from the tiling array data itself, as described in Section 2.6.

2.2 A constrained linear model of multiplebinding events
Equation (2) gives the intensities K(xxb) of probes at distance x from the location xb of an individual binding event. We now discuss the joint model of all functionally independent binding events for a given TF simultaneously across the genome. The measured intensity yi of probe i centered at chromosomal location xi is the sum of intensities arising from each of the binding events at nearby locations xb:


Formula 3

(3)
{sigma}i is the error term for the intensity of probe i, which includes random noise and systematic (possibly correlated) variance which is not explained by our model (such as cross-hybridization; see below). The parameters βb quantify the intensity of the binding event at xb, which we note is an (unknown) function of various factors, including the fraction of cells with the binding event occurring at the time of cross-linking, the binding affinity at that location, the efficiency of cross-linking or of IP of the protein–DNA complex, and the efficiency of IP, amplification and hybridization of the DNA fragments containing the site. However, for a given TF we can expect that each βb is related to the TF's affinity for binding at location xb.

Kib = K(xixb) is a N x Nb matrix containing the set of predictors for the Nb potential binding locations. Nb is defined by the desired resolution rb. We have found (Section 3) that a reasonable default for this quantity is on the order of rb ~= max({Delta}xP/10, 10).

The coefficients β are solved by minimizing the residual sum of squares,


Formula 4

(4)
Here we have assumed independence in the observations y and identical and uncorrelated errors {sigma}i. We are thereby ignoring secondary effects such as (1) nonlinearity (saturation) in the microarray intensities, which could potentially be alleviated through the use of a generalized linear model (Bonneau et al., 2006); (2) independence of the relevant functional DNA-binding unit (where a dimer, for example, is considered one independent functional-binding unit); (3) cross-hybridization and probe sequence-related systematics, which is usually (mostly) removed through normalization, or via a model such as that proposed by Johnson et al. (2006); and (4) non-constant variances {sigma}i, which can be accounted for with weighted least squares.

Finally, binding events generate only positive signal, so we enforce, for each coefficient βj,


Formula 5

(5)

2.3 Constrained variable selection via positive Lasso
In most cases, for the regression defined above, we will have N << Nb. In addition, many of the predictors Kj are highly-correlated and therefore redundant. Therefore, to avoid overfitting, we wish to select a sparse model defined by a small subset Nb << Nb of the complete set of variables. There are many methods available for variable subset selection or coefficient shrinkage, and of these, we chose the Lasso (Hastie et al., 2001), which applies an L1 penalty to the coefficients. The L1 constraint results in a sparser solution, which is what is desired here, than other commonly-used penalties, such as L2 (ridge). Under the Lasso constraint, we update Equation (4) to:


Formula 6

(6)
where we have also added the positivity constraint of Equation (5). We choose a value for the ‘tuning parameter’ Formula in Equation 6 to select a model Formula that increases in complexity (and bias) as s increases from zero (the null model Formula ) to one [the OLS estimate Formula Equation (4)]. Least angle regression (Lars; Efron et al. (2004)), an efficient procedure for generating piecewise-linear solutions to Equation 6 over small intervals of s isin [0; 1], produces a set of kLars models with increasing complexity as s increases. In practice, we limit the number of Lars steps computed kmax ~= Nb / 4, corresponding to the model of greatest complexity that is considered, Formula . In addition, we modified the authors' implementation (Efron et al., 2004) to enforce the additional positivity constraint in the coefficients (Equation 6).

The Lasso constraint has been effectively utilized for other bioinformatics problems such as learning regulatory influences from microarray data (Bonneau et al., 2006; van Someren et al., 2006) and measuring genomic copy number alterations in tiling array data (Huang et al., 2005). The Lasso has also been previously applied to the deconvolution of mass spectra (Du and Angeletti, 2006), and 2D images (Ting et al., 2006).

2.4 Model selection via BIC
Assuming that the underlying model is correct, parametric model selection criteria such as Akaike information criterion (AIC), or Bayesian information criterion (BIC) can be more accurate and less costly than cross-validation (Zou et al., 2004). When log (N) >> 2, BIC tends to penalize complex models more heavily than AIC (Hastie et al., 2001), which is precisely what is desired here. In addition, it has been shown that BIC is typically more consistent in selecting the true model if it is among the list of candidate models (Zou et al., 2004). These motivations led us to choose BIC for model selection:


Formula 7

(7)
(Zou et al., 2004), for a given set of parameters Formula selected for model Formula at Lasso shrinkage parameter s. Here, Formula can be estimated by the mean squared error of a high-complexity model (Hastie et al., 2001); we use Formula ; and df(s) is the effective number degrees of freedom of model Formula . It was empirically found (Efron et al., 2004; Zou et al., 2004) that df(s) is well-approximated by the step number kLars of the most complex model in the Lars sequence that has the same number of non-zero parameters as does Formula . Model selection via BIC involves simply identifying the model Formula that minimizes BIC(s).

The final selected model Formula consists of a limited set of nonzero coefficients Formula , defining the fitted intensities of a set of binding sites at their predicted high-resolution locations. However, we find (via simulations; see below) that Lasso often shrinks the nonzero coefficients Formula to slightly below their optimal values, thereby underestimating their intensities. We therefore update these by estimating the final solution to Equation (4) (for only the nonzero coefficients) under the positivity constraint [Equation (5)] via quadratic programming (Goldfarb and Idnani, 1983).

2.5 Deconvolution of an entire ChIP-chip data set
Because the size of the kernel matrix K increases with the square of the size of the genomic region being modeled, we break up the deconvolution of an entire ChIP-chip data set into smaller (e.g. 30 000 bp) overlapping segments, so that they can be deconvolved at high resolution (e.g. 10 bp) with limited memory requirements. These resulting fits are then combined into a single set of peak locations and intensities for each chromosome. Such a procedure is easily distributed to multiple processor cores or on a computing cluster.

Additionally, we note that this procedure is performed simultaneously on multiple technical and/or biological replicates without signal averaging.

2.6 Learning peak profile parameters
The accurate model of a single, isolated peak [Equation (2)] for each data set is important to enable precise predictions of correct binding sites. The model of Equation (2) depends upon several parameters, which can be determined experimentally with some significant overhead. As an alternative, we developed a procedure to learn certain important model parameters (including, but not limited to f (l) = {Gamma}({alpha}, β) and {gamma}) by minimizing the model residuals for a number of bright (Np ~ 20), isolated peaks across the data set. The iterative procedure begins with reasonable starting values for the parameters, and repeats the following:

  1. Compute a peak profile K(xx0) [Equation (2)] using the current parameter values.
  2. Apply the Lars/BIC procedure described above to identify the Np + N{Delta} ~30 brightest peaks matching the profile.
  3. Compute, for each of the Np + N{Delta} peaks, the mean RSS for the probes within ~1 000 bp of its center, and identify the Np of these peaks with the lowest mean RSS (presumably isolated peaks).
  4. Compute the total mean RSS for all probes surrounding the Np isolated peaks, and update the model parameters using the Simplex method (Nelder and Mead, 1965) to minimize this score.
This procedure usually requires ~30 iterations.

2.7 Assessment of model uncertainty
Due to the complexity of the Lars procedure (and selection via BIC), we cannot compute parametric estimates of the uncertainties in the inferred binding site parameters. Therefore, we use the bootstrap to estimate the uncertainties in either (1) which coefficients are selected by Lars/BIC (and thus in the coordinates of the predicted binding sites), or (2) their significance. Further, the effects of the model's i.i.d. assumptions (describe above) on the predictions can be tested.

For this article, we chose the wild bootstrap (Liu, 1988) data generation procedure for this analysis. The advantage of this data generation procedure is that its simulated data sets mimic the heteroskedastic and spatially-correlated characteristics of the original data. For bootstrap j isin [1; Nboot], new residuals Formula are sampled from an asymmetric two-point distribution (Mammen, 1993),


Formula

where the data residuals Formula [Equation (4)] are computed from model Formula (parameterized by Formula ). A sample data set Formula is generated, for which a set of Formula coefficients are selected using the Lars/BIC procedure described above. The sampled data can mimic either (1) the intensities of the original data (Formula ) or (2) the residuals alone (Formula ). In the first case, a distribution of predicted binding locations may be computed via a kernel-based density estimator of the resulting Formula . In the second case, the p-value for binding site Formula is estimated from the CDF of all nonzero Formula over all bootstraps:


Formula

where the Formula are among the Formula non-zero Formula detected (in the entire data set) for bootstrap j, and {Delta}(x) equals 1 when x is true; 0 when x is false. The computational time required for the Nboot bootstraps is significantly less than Nboot x (the time for a single non-bootstrap MeDiChI estimate) because many computationally intensive operations (e.g. construction of the kernel matrix K) need only be performed once.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We assessed the performance of MeDiChI deconvolution using three real-world data sets: ChIP-chip data for the H. salinarium NRC-1 TFBd from two microarray platforms of varying resolution (Facciotti et al., 2007) (Section 3.1); and previously-published S. cerevisiae Gcn4 binding data (Pokholok et al., 2005) (section 3.2). We also tested MeDiChI on simulated data (section 3.3), to evaluate the accuracy and limitations of the algorithm on data of varying resolution and quality.

3.1 MeDiChI performs well on both low-resolution and high-resolution data
Genome-wide ChIP-chip data for H. salinarium NRC-1 were collected for the cmyc epitope-tagged GTF TFBd, by hybridizing the same biological sample against two different whole-genome tiling array platforms. The first of these was a low-resolution (‘low-res’) HaloSpan (Facciotti et al., 2007) 500 mer PCR-product array spotted in-house in octuplicate with a TAR of 500 bp. The second array utilized a 392 000-feature Nimblegen Systems (Madison, WI) array with single-stranded 50 mer probes spotted in duplicate with a TAR of 13 bp (‘hi-res’). For each set, we estimated the parameters for the peak profile model (Section 2.6). The resulting profiles for both data sets are shown in Figure 2.


Figure 2
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Best-fit peak profile model (red line) for low-res (left) and hi-res (right) H. salinarium NRC-1 TFBd data. Points are scaled intensity values for the 20 brightest, isolated peaks, with their central positions offset to zero.

 
A sample binding site is shown in Figure 3 (a larger segment is shown in Supplementary Fig. 1). The high density of binding sites for this global regulator suggests that we may consider this data set a rather stringent test of the efficacy of the algorithm. Since the TFBSs appear, as expected, to be concentrated in the small fraction of non-coding regions (NCRs; only ~13% of the entire H. salinarium NRC-1 genome is non-coding), we chose to use the fraction of detected binding sites in NCRs as a measure of the method's deconvolution capabilities. (However, some TFBS will inevitably fall in coding regions, due to the high coding density of the H. salinarium NRC-1 genome.)


Figure 3
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. A 2000 bp segment of TFBd ChIP-chip data (Top: low-res HaloSpan chip; Bottom: hi-res Nimblegen chip), with intensity-scaled MeDiChI bootstrap-based predictions (green). Also shown are the model fits (red lines; 10%, 50%, 90% bootstrap quantiles) and the JBD (Qi et al., 2006) solution (blue dotted line; posterior probability x intensity). Known gene coding regions are shown below each plot.

 
This analysis (Fig. 4) reveals that the TFBSs identified by all methods fall prevalently in NCRs, and that MeDiChI performs roughly equally well in this measure on both the high- and low-res data. We also ran JBD (Qi et al., 2006) (using default parameters and the peak profile learned directly from the data as described above), MPeak (Kim et al., 2005), and a naive threshold-based peak detection procedure (whereby TFBS are assigned to the central coordinates of probes whose IP enrichment ratio surpasses a threshold) on both data sets. MeDiChI appears to outperform all of these methods, for both arrays, and to a greater degree as the detection threshold is decreased (thus as fainter peaks are included).


Figure 4
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Fraction of predicted TFBSs in non-coding regions, for MeDiChI and three other methods, as a function of number of detections, on both low-res and hi-res H. salinarium NRC-1 TFBd binding data.

 
The Nimblegen array's extremely high resolution of 13 bp allows us to consider using the peaks identified in this data set to test the TFBS localization capabilities of an algorithm on the low-res data. An all-versus-all comparison of such a comparison, for the 220 brightest peaks detected by each method (corresponding to the number of peaks detected by MPeak; the fewest of any of the methods), is shown in Table 1. We find that the binding site predictions of all four methods in the hi-res data are, on average, closer to the MeDiChI predictions in the low-res data, than they are to the predictions of any other method in the low-res data. This suggests that MeDiChI was the most accurate on this very low-res data set.


View this table:
[in this window]
[in a new window]

 
Table 1. Median distances between the 220 brightest TFBd binding site predictions from the low-res array (Rows) versus those from the hi-res array (Columns), for MeDiChI, JBD (Qi et al., 2006), MPeak (Kim et al., 2005), and basic thresholding (see text)

 
We note that we had difficulty finding parameters for JBD that resulted in (visually) satisfactory JBD predictions on all peaks in the hi-res data (e.g., the left-most peak in Supplementary Fig. 1b). Therefore, a further in-depth comparison of the two methods on data in which the JBD default parameters are known to be appropriate follows in the next section.

3.2 MeDiChI shows superior performance on yeast Gcn4 binding data
Qi et al. (2006) performed a complete comparison of their JBD procedure against several others on a 266 bp resolution in vivo Gcn4 ChIP-chip binding array consisting of ~44 000 60 mer oligonucleotide probes (Pokholok et al., 2005). They convincingly showed that their bayesian method outperformed several others (Boyer et al., 2005; Kim et al., 2005) at resolving the positions of known Gcn4 targets. We now compare the spatial resolution of TFBSs detected by MeDiChI with those of JBD (and transitively, the methods which JBD were compared against (Qi et al., 2006)) on this same data set. To ensure that the optimal JBD predictions for this data set were used, we utilize the posterior-binding probabilities and strengths (hereafter PBDs) generated by the authors http://cgs.csail.mit.edu/jbd/). JBD PBDs were discretized into binding events using the authors' scripts with the default parameters.

TFBS predictions (examples in Fig. 6 and Supplementary Fig. 2) were compared with the 128 known Gcn4 targets (Harbison et al., 2004) after filtering all predictions to only the npred isin {80, 128, 200} brightest across the entire genome (Table 2). A "missed" binding event was one for which none of the npred brightest predictions were within 500 bp of any known location. For comparison, we also used a basic ‘thresholding’ technique in which we identified the npred brightest probes in the data, and chose their center coordinates to be predicted TFBS. We also computed similar statistics for randomly-chosen probes. While both MeDiChI and JBD attain significantly better average resolutions than the thresholding method, MeDiChI appears to outperform JBD by this measure. Although the necessity of discretizing JBD's predictions (e.g. red rectangles in Fig. 6) clearly results in loss of information contained in the PBDs (solid blue lines in Fig. 6), it reveals a problem in interpreting JBD's PBDs as discrete binding events.


Figure 6
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Example MeDiChI predictions on yeast Gcn4 data (black points) (Pokholok et al., 2005), (see Fig. 3 for a legend). Rectangles at the bottom show the JBD discrete TFBSs (red), and known Gcn4 TFBSs (Harbison et al., 2004) (green).

 

View this table:
[in this window]
[in a new window]

 
Table 2. Median distances and false negatives (missed events) between the npred brightest predicted Gcn4 TFBSs, and the 128 known TFBSs (Harbison et al. 2004), for MeDiChI, JBD (Qi et al. 2006), and basic thresholding (see text)

 
Median resolutions for Gcn4 TFBS predictions were computed for both MeDiChI and JBD over a range of sensitivities (number of predicted TFBS, npred) (Fig. 5). The MeDiChI TFBS predictions were computed using peak profiles both (1) estimated by Qi et al. (2006) for this data set, and (2) fit by our procedure directly from the data (Section 2.6). The JBD TFBS predictions were selected using both (1) the authors' filtering script, and (2) a custom procedure that we developed. For a wide range of sensitivities, by this measure, MeDiChI appears to outperform the other methods, in particular when the peak profile learned directly from the data (Section 2.6) is used.


Figure 5
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Median resolution of TFBS predictions, as a function of number of detections, on yeast Gcn4 data (Pokholok et al., 2005), for MeDiChI (using both the JBD binding site profile and ours – see text); JBD (Qi et al. (2006); using both the JBD peak filtering script and our own procedure) and threshold-based peak selection.

 
In general, there was a large degree of agreement between the predictions of MeDiChI and JBD, (for example, Fig. 6a and Supplementary Figure 2b), with some exceptions (e.g. Fig. 6b and Supplementary Fig. 2d). Clearly, the results such as those shown in Table 2 and Figures 6 and 5 suggest that by integrating the predictions of both methods, one could both decrease the uncertainty in binding site predictions, and reduce the number of false negatives. For example, with npred = 128, MeDiChI locates 18 TFBS that JBD misses, and likewise, JBD identifies 13 sites that MeDiChI misses.

3.3 Analysis of synthetic data highlights trade-offs in experimental design
We generated synthetic ChIP-chip data, where the locations and intensities of all binding events are precisely known, to assess the accuracy of MeDiChI's predictions on data of widely varying quality and resolution. Since the peak profile is also known, we do not use this data to test the accuracy of the profile model itself. The synthetic data were used to test the effect of probe spacing, replicate count, DNA fragment length, and binding site separation on the accuracy of MeDiChI's predictions. The tests were divided into two classes: one to evaluate MeDiChI's predictions on single, isolated binding events, and one to test MeDiChI's ability to correctly separate two binding events at various separations.

We generated 7000 bp regions, each with varying tiling resolution (between 10 and 500 bp), and replicate counts (between 1 and 9). Each had zero signal except for either one or two binding events of equal intensity (peak intensities were set to one in all cases; see below), varying separation (between 50 and 700 bp), and profile width [by varying f (l) in the model of Equation (3)]. Gaussian-distributed noise was then added to each probe, at a level of {sigma}i = {sigma}n (yi + {sigma}n), where yi is the probe intensity and {sigma}n the input noise scale (we used {sigma}n isin {0.05, 0.1, 0.2, 0.3}). Because the simulated peaks all had the same intensities, we considered the detection efficiency as a function of {sigma}n (effectively, signal-to-noise) rather than absolute peak intensity. Projections of the median measured resolutions of the MeDiChI predictions on these simulated data are summarized in Figure 7 and Supplementary Figure 3.


Figure 7
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Median resolution (a, b, d, e) and fractional recovery (c) of MeDiChI predictions on synthetic data for two binding sites, as a function of their separation (a–c); and for isolated binding sites as a function of the TAR (d) and mean chromatin fragment length (e). Curves show: varying TAR (a, c, e); varying number of replicate measurements (b) and varying noise level (d; see text); see legends for details.

 
Inspection of the Figures reveals that MeDiChI increases the effective resolution of data over all tested TARs, with the greatest fractional gain for the lowest-resolution data. Not surprisingly, the resolving power of MeDiChI on pairs of binding sites (Figures 7a and b) improves for greater separations, and asymptotically approaches that for isolated sites (Fig. 7a–c).

We see that while there is a clear trade-off between TAR and number of (technical or biological) replicates, the gain seems limited when the number of replicates exceeds approximately three. The benefit of additional replicate observations is also less pronounced for higher TARs, suggesting that the parameter to consider in experiment design is actually the number of unique observations per binding site. In fact, the biggest apparent gain seems to come from increasing the TAR (and therefore the number of observations at unique locations across the binding site) than from increasing the number of observations per probe. This effect is not surprising, and is similar to what is known in imaging (e.g. observational astronomy) circles.

We also see this latter effect in a plot of median resolution (on isolated binding sites) as a function of the mean chromatin fragment length (Fig. 7e). TFBSs are localized more accurately when this length is approximately greater than the probe spacing. Thus, for lower-resolution tiling arrays, less aggressive sonication may be beneficial. This result, as above, suggests that TFBS localization accuracy increases if there are more observations at multiple locations across each observed peak. For very high-resolution arrays, this effect vanishes, although there is of course a limit to the degree of sonication that is possible. While this result is true for individual binding sites, it should be noted, however, that increasing the mean DNA fragment length will decrease the effective resolvability of two (or more) closely-overlapping peaks.

Finally, Figure 7c and Supplementary Figures 3(c and d) show the fraction of peaks that were not detected by MeDiChI (the number of false negatives) on the simulated data. On the two-peak simulations, a non-detection was defined as a case where only zero or one peak was reported. These curves explain the apparent increase in resolution for even the low-resolution data at small peak separations (Fig. 7a). The detection rate increases quickly as the separation increases, roughly independent of the TAR (except for the 500 bp case, which is a clear outlier). For isolated peaks, the number of peaks detected (to within 1 000 bp) drops from nearly 100% as the TAR decreases (Supplementary Figures 3(c and d)). The number of replicates and, even more so, the degree of signal-to-noise of the binding sites, also affects the number of false negatives.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have presented a generative model the probe intensities in ChIP-chip data and an efficient, accurate, and robust method for solving that model via constrained linear regression. We demonstrated that the procedure recovers accurate binding site locations from data covering a wide range of resolutions: both simulated data covering 10–500 bp resolution, and real arrays collected for a H. salinarium NRC-1 GTF and for the yeast TF Gcn4 at very different resolutions (500, 13 and 266 bp, respectively). Our analyses suggest that the lower-resolution arrays benefit most from the resolution gain associated with the use of this algorithm, although the potential still exists for resolution gains for TARs as low as ~10 bp or even lower. We also demonstrated that the generative nature of this model can be used to produce realistic simulated ChIP-chip data, which enables the analysis of experimental design tradeoffs. For example, we showed that the utility of increasing the number of replicates beyond three is questionable, however increasing the average sonicated DNA fragment size can be beneficial for localizing binding sites using low-resolution arrays, in particular.

For localizing TFBSs in ChIP-chip data, MeDiChI clearly outperforms existing methods that cannot increase the resolution of the binding locations above that of the microarray probe spacing, (e.g. Buck et al., 2005; Kim et al., 2005). JBD (Qi et al., 2006), which resembles MeDiChI in both its model and in its ultimate goal, performs the most similarly to MeDiChI. Indeed, one may consider MeDiChI to be the frequentist alternative to the Bayesian JBD, whose predicted posterior-binding distributions are related to MeDiChI's predictions of discrete binding locations and intensities (which may be broadened to posterior-probabilities via the bootstrap; see Section 2.7). In fact, as we presented in Section 3.2), in some instances these two algorithms complement one another and we suggest that their integration may improve overall performance in terms of accuracy, sensitivity, and specificity.

While the goals and overall model of MeDiChI and JBD are quite similar, we believe that MeDiChI exhibits several advantages that make it more practical for general use. First, MeDiChI requires only one user-defined parameter: the desired deconvolution resolution (and even for that, a reasonable default may be chosen). Second, MeDiChI includes a procedure by which the peak profile model (required by both MeDiChI and JBD) can be learned directly from the data. This enables, for example, the reanalysis previously published data where the DNA fragment length distribution f (l) was not measured. Finally, the computational cost of a non-bootstrapped MeDiChI deconvolution (once the peak profile is known) is only slightly higher than the simpler window-based algorithms listed above, and the procedure is easily distributed on multiple processor cores or a computing cluster. All of these features make MeDiChI suitable for integration into an automated ChIP-chip analysis pipeline, and for processing the much larger data sets that are beginning to arise from multi-million-feature or multiplexed arrays.

There are some simplifications in the current MeDiChI model that could be improved upon in the future. The first of these is that there is no probe-specific hybridization model, (as in Johnson et al., 2006), to account for sequence-related hybridization efficiencies and cross-hybridization. It should be possible to integrate such a model into this procedure, which would enable it to more accurately handle data from array technologies that are more subject to such issues (e.g. Affymetrix 25-mer arrays; Euskirchen et al., 2007). Also, the use of a weighted constrained generalized regression (as in Park and Hastie, 2007) would more accurately handle the intrinsic heteroskedasticity of the intensities, as well as potential nonlinear effects (e.g. saturation) of the microarray.

We also believe that, with simple modifications, the MeDiChI procedure may also aid in the analysis of ‘ChIP-Seq’ data (Johnson et al., 2007). Such data, while of higher effective resolution, will still be limited by the sonicated DNA fragment lengths. The MeDiChI model could be applied to such data by altering the peak profile model to (1) count DNA fragments themselves, rather than their intensities, (2) eliminate the hybridization efficiency term {Phi}(v), and (3) consider the data to have a resolution of 1 bp. In addition, because of this effective high resolution of ChIP-seq data, we expect that a MeDiChI-based procedure may be able to infer the size of the TF's footprint (2 x {lambda}) on the chromatin.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We would like to thank Richard Bonneau, Greg Carter, Harri Lähdesmäki, Vesteinn Thorsson, and the anonymous reviewers for helpful comments. This work was supported by NSF Postdoctoral Fellowship DBI 0400598 to M.T.F and the following research grants to N.S.B: NSF (DBI-0640950); DoE (DE-FG02-07ER64327); and NIH (P50 GM076547).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on August 16, 2007; revised on November 27, 2007; accepted on November 27, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bonneau R, et al. The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol (2006) 7:R36.[CrossRef][Medline]

    Boyer LA, et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell (2005) 122:947–956.[CrossRef][Web of Science][Medline]

    Buck MJ, et al. ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data [Evaluation Studies]. Genome Biol (2005) 6:R97.[CrossRef][Medline]

    Cambon AC, et al. Analysis of probe level patterns in Affymetrix microarray data [Comparative Study]. BMC Bioinformatics (2007) 8:146.[CrossRef][Medline]

    Du P, Angeletti RH. Automatic deconvolution of isotope-resolved mass spectra using variable selection and quantized peptide mass distribution. Anal. Chem (2006) 78:3385–3392.[Medline]

    Efron B, et al. Least angle regression. Ann. of Stat. (with discussion) (2004) 32:407–499.

    Euskirchen GM, et al. Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies [Comparative Study]. Genome Res (2007) 17:898–909.[Abstract/Free Full Text]

    Facciotti MT, et al. General transcription factor specified global gene regulation in archaea. Proc. Natl Acad. Sci. USA (2007) 104:4630–4635.[Abstract/Free Full Text]

    Goldfarb D, Idnani A. A numerically stable dual method for solving strictly convex quadratic programs. Math. Program (1983) 27:1–33.[CrossRef]

    Harbison CT, et al. Transcriptional regulatory code of a eukaryotic genome. Nature (2004) 431:99–104.[CrossRef][Medline]

    Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. (2001) New York: Springer-Verlag.

    Huang T. Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics (2005) 21:3811–3817.[Abstract/Free Full Text]

    Ji H, Wong WH. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics (2005) 21:3629–3636.[Abstract/Free Full Text]

    Johnson DS, et al. Genome-wide mapping of in vivo protein–DNA interactions. Science (2007) 316:1497–1502.[Abstract/Free Full Text]

    Johnson WE, et al. Model-based analysis of tiling-arrays for ChIP-chip [Evaluation Studies]. Proc. Natl Acad. Sci. USA (2006) 103:12457–12462.[Abstract/Free Full Text]

    Kim TH, et al. A high-resolution map of active promoters in the human genome. Nature (2005) 436:876–880.[CrossRef][Medline]

    Lee TI, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science (2002) 298:799–804.[Abstract/Free Full Text]

    Liu RY. Bootstrap procedure under some non-i.i.d. models. Annals of Stat (1988) 16:1696–1708.[CrossRef]

    Mammen E. Bootstrap and wild bootstrap for high dimensional linear models. Ann. Stat (1993) 21:255–285.[CrossRef]

    Nelder JA, Mead R. A simplex algorithm for function minimization. Comput. J (1965) 7:308–313.

    Park MY, Hastie T. L1 regularization path algorithm for generalized linear models. J. R. Statist. Soc. B (2007) 69:659–677.[CrossRef]

    Pokholok DK, et al. Genome-wide map of nucleosome acetylation and methylation in yeast. Cell (2005) 122:517–527.[CrossRef][Web of Science][Medline]

    Qi Y, et al. High-resolution computational models of genome binding events. Nat. Biotechnol (2006) 24:963–970.[CrossRef][Web of Science][Medline]

    Ren B, et al. Genome-wide location and function of DNA binding proteins. Science (2000) 290:2306–2309.[Abstract/Free Full Text]

    Ting M, et al. Sparse image reconstruction using sparse priors. (2006) Proceedings of the IEEE International Conference on Image Processing. 1261–1264.

    van Someren EP, et al. Least absolute regression network analysis of the murine osteoblast differentiation network. Bioinformatics (2006) 22:477–484.[Abstract/Free Full Text]

    Zou H, et al. On the ‘degrees of freedom’ of the lasso. In: Technical report. (2004) Stanford University Department of Statistics.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
Y. Kim, S. Bekiranov, J. K. Lee, and T. Park
Double error shrinkage method for identifying protein binding sites observed by tiling arrays with limited replication
Bioinformatics, October 1, 2009; 25(19): 2486 - 2491.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/3/396    most recent
btm592v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Reiss, D. J.
Right arrow Articles by Baliga, N. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Reiss, D. J.
Right arrow Articles by Baliga, N. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?