Skip Navigation


Bioinformatics Advance Access originally published online on December 12, 2006
Bioinformatics 2007 23(3):314-320; doi:10.1093/bioinformatics/btl606
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/3/314    most recent
btl606v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Lerman, G.
Right arrow Articles by Mishra, B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lerman, G.
Right arrow Articles by Mishra, B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Published by Oxford University Press 2006
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Functional genomics via multiscale analysis: application to gene expression and ChIP-on-chip data

Gilad Lerman 1,*, Joseph McQuown 3, Alexandre Blais 4, Brian D. Dynlacht 4,5, Guangliang Chen 1 and Bud Mishra 2,4

1 Department of Mathematics, University of Minnesota 127 Vincent Hall, 206 Church St. S.E., Minneapolis, MN 55455, USA
2 Courant Institute of Mathematical Sciences, New York University 251 Mercer Street, New York, NY, 10012, USA
3 Department of Applied Mathematics and Statistics, State University of New York Stony Brook, NY 11794, USA
4 NYU School of Medicine 550 First Avenue, New York, NY, 10016, USA
5 NYU Cancer Institute 550 First Avenue, New York, NY, 10016, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM AND METHODS
 3 CASE STUDIES
 4 BRIEF DISCUSSIONS
 REFERENCES
 

We present a fast, versatile and adaptive-multiscale algorithm for analyzing a wide-variety of DNA microarray data. Its primary application is in normalization of array data as well as subsequent identification of ‘enriched targets’, e.g. differentially expressed genes in expression profiling arrays and enriched sites in ChIP-on-chip experimental data.

We show how to accommodate the unique characteristics of ChIP-on-chip data, where the set of ‘enriched targets’ is large, asymmetric and whose proportion to the whole data varies locally.

Contact: lerman{at}umn.edu

Supplementary information: Supplementary figures, related preprint, free software as well as our raw DNA microarray data with PCR validations are available at http://www.math.umn.edu/~lerman/supp/bioinfo06 as well as Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM AND METHODS
 3 CASE STUDIES
 4 BRIEF DISCUSSIONS
 REFERENCES
 
Microarray analysis is a high-throughput method to measure abundance of multiple species of target DNA by simultaneous hybridization to an array of DNA probes. When the target DNA is cDNA corresponding to gene expression, it measures the transcriptomic state of a cell under an experimental condition. When the target DNA is sampled from genomic DNA, it measures copy-number variations within a genome as population polymorphisms or as chromosomal aberrations in a tumor genome (Pollack et al., 1999). Finally, when the target is genomic DNA selected by immunoprecipitation (IP) with a protein, it identifies those regions of genome that interact with proteins, such as transcriptional factors, thus elucidating regulatory genetic controls (Ren et al., 2000).

All these applications use comparative methods. In a ‘two-color’ scheme, simultaneous array hybridization detects target DNAs of two different experiments, which are labeled with different fluorescent dyes. Target DNAs that have differential behavior from one experiment to the other are called ‘enriched,’ the objects sought after in these high-throughput experiments. Enriched targets are found in two steps: first, the measurements are transformed through a ‘normalization’ step in order to assign similar local means (or medians) to ‘presumed unenriched’ targets. The purpose of this step is to compensate for experimental sources of variation, like dye-specific effects and hybridization unevenness in DNA arrays (Smyth et al., 2003; Buck and Lieb, 2004). Next, the normalized data is used as a basis for statistical identification of enriched targets that truly differ between the two analyzed DNA samples.

In practice, the targets are measured optically in terms of a raw intensity value, and analyzed after being transformed by a logarithmic function. With just two samples involved, the following transformed data representation is standard: for every target, the logarithms of intensities (according to the two samples) are averaged to create an A value (log of their geometric mean) and subtracted to create an M value (log of their ratio), and then plotted in an M versus A plot. The majority of targets will belong to a stable unenriched set of targets, and after correct normalization their M values will be close to zero. The normalized M values of the enriched targets will lie either significantly above or below zero, but not necessarily with any known distributions, or even symmetry.

This paper describes a fast and general multiscale method for normalization of microarray data and identification of enriched targets without assuming any prior distribution. Its utility is greatest when the data is difficult to model statistically, for instance, when they contain unavoidable distortion and asymmetry. Such problems are most acute for ChIP-on-chip experimental data.

ChIP-on-chip experiments combine microarrays (‘chips’) with Chromatin immunoprecipitation (ChIP) assays to identify the genomic loci bound by any given transcription factor (Ren et al., 2000; Blais and Dynlacht, 2005; van Steensel, 2005). The immunoprecipitated sample represents gene fragments attached to the transcription factor, and is compared with a sample not subjected to immunoprecipitation and thus representing all genes equally (‘input’ sample). The two samples are labeled with different fluorescent dyes and are co-hybridized onto a DNA microarray representing all gene promoters of the particular species examined. Those spots that show a significant increase in fluorescence in IP sample relative to the input sample are termed enriched spots, and are considered to represent the target genes bound by the transcription factor in the cell nucleus.

There are currently well-established methods for normalization and identification that work well for many cDNA array datasets (Quackenbush, 2002; Smyth et al., 2003). However, in ChIP-on-chip experiments, enriched target DNA segments deviate in only one direction. That is, the M values (log ratio of input to IP signals) of enriched sites are mostly negative. In this case, the statistical distribution of the corresponding M values is asymmetric, hard to model and thus difficult to estimate. Moreover, in such data the whole M values are frequently skewed, their observed distribution varies locally, and the dependence of their local means on the A values is nonlinear. Consequently, common probabilistic techniques have been difficult to adapt to data arising from ChIP-on-chip experiments (Buck and Lieb, 2004).

Occasionally, one also encounters cDNA array data with asymmetric distribution of expression values. For example, the sex-biased genes of Drosophila melanogaster constitute a subpopulation whose expression values deviate asymmetrically from the bulk of expression values (Parisi et al., 2003).

Three algorithms for ChIP-on-chip experimental data have been developed very recently. Two of them: ChIPOTle (Buck et al., 2005) and Mpeak (Zheng et al., 2005) take into account the ‘neighbor effect’, observed in experiments using locus tiling arrays, to improve the identification of targets. However, those methods cannot be used with microarrays where genes are represented by only one spot. Another method Chipper (Gibbons et al., 2005) applies to the latter type of microarrays (one spot per gene). We show here better performance of our algorithm over the latter method for a specific dataset.

We model the microarray data as arising from a mixture of two distributions. The first one (the ‘stable’ part) represents unenriched targets. In the M versus A plots introduced earlier, this part is concentrated along a graph of an unknown function f mapping A (mean log-intensity) to M (difference in log-intensities): M = f (A) + noise. In the ideal case of perfect correlation between the two intensities, the function f is zero, and the noise is symmetric and homoscedastic (its local variance is independent of A). In reality, the graph is frequently curved due to the systematic sources of variation discussed above, and the local variances around the normalizing curve are not necessarily constant (namely, the data is heteroscedastic). The second component of the mixture distribution represents outliers (enriched targets). The goal of our algorithm is to estimate the conditional mean and variance of the ‘stable’ distribution, ignoring the presence of outliers.

We refer to our algorithm as Multiscale Strip Construction (MSC), as it constructs a normalizing curve (the estimated conditional mean) with an enveloping strip around it (the estimated conditional variance) in a multiscale fashion. The algorithm zooms in adaptively on the ‘stable’ part of the data by constructing parallelograms of decreasing sizes, centered and oriented along the underlying curve. Higher dimensional generalizations of the algorithm for different kinds of data appear elsewhere (Lerman et al., 2006).

Our MSC algorithm performs well on various ChIP-on-chip and cDNA array data, even in problematic cases of significant outliers and strong asymmetries, skewness and curvature of main curve.


    2 ALGORITHM AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM AND METHODS
 3 CASE STUDIES
 4 BRIEF DISCUSSIONS
 REFERENCES
 
We provide a simplified description of the algorithm and leave its more detailed elaborations and analyses to a mathematical paper (Lerman et al., 2006).

The main input to our algorithm is a planar dataset E of N points. The algorithm identifies a ‘stable’ set within that data and estimates its local means and standard deviations (SDs). It also uses those estimates in order to assess ‘outliers’ coming from a different model.

In order to simplify the technical description, we assume that the data is normalized along the x-axis so that the M values of the data remain constant at 0. We also assume that the ‘unenriched’ (‘stable’) part of the data is distributed symmetrically around the x-axis. In this case, the algorithm only estimates the local SDs of the ‘stable’ set. We refer to this ideal case as the linear-symmetric case, or LS-Case, and explain its generalization later.

In some cases (e.g. the artificial data exemplifying the algorithm in the Supplementary material) the task of the algorithm is well-studied. Indeed, it can be addressed by robust estimation of local means (robust regression) and local SDs (Rousseeuw and Leroy, 1987). However, in many cases, in particular, ChIP-on-chip data, the local percentages of outliers are significantly high, in particular >50% in some regions, and their distribution is asymmetric and hard to model. In order to overcome this obstacle, we suggest a stopping procedure which separates significant ‘outliers’ in a multiscale fashion and forms local regions that cover the ‘stable’ set (excluding those ‘outliers’). In each local region standard techniques could then be applied to estimate local statistics and then use it to reassess the significance of outliers coming from a different distribution.

The algorithm depends on the following parameters: l0, c0, n0, nsh and {alpha}0. We explain later how to choose their values and elaborate further in Lerman et al. (2006).

We sketch our algorithm below (Algorithm 1) for the linear-symmetric case with nsh = 1. Later subsections contain the details of the different steps of this scheme and its generalization to other instances. Figure 2 of Lerman et al. (2006) illustrates its various steps when applied to an artificial dataset.

2.1 Dyadic intervals and rectangles
We fix an interval Q0 of nearly minimal length containing the projection of the dataset onto the x-axis. A dyadic interval with respect to Q0 is an interval that occurs when dividing recursively Q0 to halves. It is of level l, if it has been obtained by l consecutive partitions. We denote all dyadic intervals with respect to Q0 by Formula . If Q is a dyadic interval, we denote its length by {ell}(Q). If Formula , then denote by PQ the dyadic parent of Q according to the grid Formula and also define PQ0 colone Q0.

In order to describe the stopping constructions more formally, we will need to define properties of several regions, which extend dyadic intervals to the plane. Supplementary Figure S1 illustrates those regions. For an interval Formula , its extension Str (Q) to an infinite strip is


Formula

Its extension Rec(Q) to a rectangle (centered around Q) is


Formula

The ‘outer’ or ‘putative-enriched’ part of Rec(Q) is defined, by removing a subrectangle of appropriate height, as follows:


Formula

Algorithm 1. MSC Algorithm (LS-Case)

Formula

2.2 The stopping criterion
We describe the formal steps of the stopping construction and then explain their motivation. Supplementary Figures S2 and S3 illustrate the stopping construction for an artificial dataset. More properties implied by the stopping construction are formulated and proved in Lerman et al. (2006).

The algorithm proceeds in a top-down procedure and computes fQ and FQ at any dyadic interval Q it visits. The fraction fQ has the form


Formula

The cumulative sum of fractions, FQ, is computed as follows: First, the algorithm initializes Formula , then it applies the reduction formula (from coarse levels to fine levels):


Formula

While proceeding from upper to lower levels, the algorithm stops at an interval Formula [together with all of its descendants in Formula ] if any one of the following two conditions is satisfied:


Formula 1

(1)

The main stopping criterion [Equation (1)] implies a global estimate on the percentages of initially detected outliers (points outside the rectangular regions) as a function of the parameter {alpha}0 (Lerman et al., 2006, Proposition 5.1). The second stopping criterion is necessary for having valid local estimates in each interval.

The stopping construction results in local rectangles which aim to cover most of the ‘stable’ set and to separate away ‘significant’ outliers. The heuristic justification for the success of this separation can be given as follows. The local quantity fQ measures the local fraction of ‘putative outliers’ in Rec(Q). High values of fQ, occurring in combination with sufficiently farther local distance from the core of the ‘stable’ distribution, imply presence of locally significant outliers. In order to identify outliers that are also globally significant, we follow several strategies commonly used in harmonic analysis, which combine local quantities at different scales to identify global structure. We use an additive function FQ, whose analogs have appeared in similar formulations (Jones, 1990; Lerman, 2003).

2.3 The output functions
The main output function Formula for the LS-Case estimates the local ‘SDs’ of the stable distribution. i.e.


Formula

We create a smoother version of the above function by generating nsh instances of the corresponding piecewise constant function according to different grids and averaging those piecewise constant functions (Lerman et al., 2006, Section 4.6).

The function Formula estimates ‘SDs’ in the rectangles associated with stopping intervals, and requires a small correction to extend it to the region outside those rectangles. We thus alter it by assuming that for each stopping interval Q, the points in Rec(Q) were sampled from the restriction of a Gaussian random variable to that region. The function Formula estimates the SDs of the underlying local Gaussian distributions (Lerman et al., 2006, Section 4.5). Note that except in this last stage, the algorithm need not make any assumptions about the exact nature of the statistical distributions of data.

2.4 Generalization of the algorithm
Our approach permits two generalizations of the LS-Case algorithm that are necessary for our applications. The first generalization constructs the normalizing curve, which we denote by C, instead of assuming an underlying line. The generalized algorithm approximates the data by lines at different scales and shear the regions Formula around those lines. That is, it uses appropriately chosen parallelograms at different scales and locations instead of the rectangles used in the LS-Case. Once the sheared regions Rec(Q) and Out(Q) are defined appropriately for any Formula , the algorithm then proceeds mutatis mutandis. The curve C is obtained as the union of the corresponding line segments. The averaging process described above results in a smooth curve. Supplementary Figures S4–S7 illustrate this process. We initialize the process by shifting and rotating the data on its principal axis (Lerman et al., 2006).

The second generalization allows the algorithm to adapt to asymmetric data, in particular ChIP-on-chip experimental data. It uses asymmetric regions Formula (illustrated by Supplementary Figures S8 and S9). Details of both generalizations appear in Lerman et al. (2006).

We also modify slightly the function FQ following Lerman et al. (2006), Section 4.9.1.

2.5 Ranking and identification of outliers
In order to rank and identify enriched targets, we define a scoring function R for a point (A,M) as follows:


Formula

Initial P-values are obtained from those scores by assuming that the stable distribution is normal. i.e.


Formula

Following Reiner et al. (2003), we have adjusted the P-values in order to control the false discovery rate (FDR) of the multiple testing procedure. That is, given a FDR level q, we order the computed P-values: p(1) ≤ ... ≤ p(N) and set


Formula

We identify the points with P-values less than or equal p* as enriched.

2.6 Choice of parameters
We fix the values of the following parameters: l0 colone 10, n0 colone 30, nsh colone 30 and c0 is the minimal constant (or almost minimal) for which E {subseteq} Rec(Q).

The parameter {alpha}0 is important for good performance of the algorithm. It describes the global expected percentage of outliers. We have developed an algorithm for estimating this parameter (Lerman et al., 2006, Section 4.8). The main idea is to apply the MSC algorithm with different values of {alpha}0 and identify outliers at different fixed levels of FDR. For each value of {alpha}0, we draw the curve of the number of outliers detected by the algorithm as a function of the FDR level. We then observe the jumps between the curves. We choose the value of {alpha}0 according to the first significant jump in the profile curves, so that it corresponds to separating the first significant subgroup of outliers (Lerman et al., 2006, Section 4.8). In cases of ambiguity of first significant jump of a given replicate, we choose the one closest to the median jump (among all replicates). We show later that the output of our algorithm is not too sensitive to the choice of {alpha}0, but is optimal when fixing {alpha}0 according to our method.

2.7 Complexity
The speed of the algorithm for a dataset of N points, when using {ell}0 levels and nsh shifts is of order O(N · {ell}0 · nsh) and the required storage is O(N) (Lerman et al., 2006, Section 5.3).

Note that in the ideal situation (homoscedastic variance), the algorithm never recurs beyond the highest level Q0 and outputs essentially the same constant width strip in same time complexity as would the binding-ratio-algorithm, the most popular alternative algorithm currently used to isolate outliers in ChIP-on-chip array data.

In practice, the CPU time of our algorithm (written in a Matlab code, which was not optimized) was 1.11 s when computing C and the strip Formula and using a data with N = 5823 points and a laptop with Intel Pentium processor of 1.60 GHZ and 1 GB of RAM (the data is replicate A of Myogenin ChIP-on-chip described in Subsection 3). When also computing the strip Formula , the CPU time was 7.96 s. For comparison, the CPU times of LOESS using the same data and pc with the bandwidths parameters 0.1, 0.3 and 0.7 are 8.54, 15.28 and 28.05 s, respectively. While LOESS only normalizes the data, our algorithm also estimates the local SDs of the stable distribution and the significance of enriched points.

Clearly, the use of Formula instead of Formula reduces considerably the computational time. Our experience shows that for values of {alpha}0 <0.2, the differences between the two functions are not significant.


    3 CASE STUDIES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM AND METHODS
 3 CASE STUDIES
 4 BRIEF DISCUSSIONS
 REFERENCES
 
We demonstrate results of our algorithm for both gene expression and ChIP-on-chip data with emphasis on the latter.

3.1 Clostridium acetobutylicum gene expression data
Using our algorithm, we have analyzed cDNA array data comparing gene expression of megaplasmid pSOL1 deficient C.acetobutylicum strain M5 relative to its wild-type (WT) strain (Yang et al., 2003). The pSOL1 genes are postulated to have expression with a broad range of levels in WT, but unexpressed in M5. Therefore, these genes were expected to be characterized as enriched in the WT strain versus the M5 strain.

To measure the statistical power of our algorithm, we focused on the following quantities: the false positive rate (FPR), the true positive rate (TPR) and the identification error (Er), all described in Yang et al. (2003).

Yang et al. (2003) have used the same data in order to compare various algorithms for identification of differentially expressed genes, including their own algorithm: SNN-LERM (segmental nearest neighbor method of logarithmic expression ratios). They concluded that their algorithm performed better than the other algorithms.

We have compared FPR, TPR and Er of both MSC and SNN-LERM for the six glass arrays of M5 versus WT in the Supplementary material of Yang et al. (2003). We maintained a similar ratio of outliers and summarized our findings in Table 1. Supplementary Figure S10 displays the separating strips of the two algorithms for slide 804 and Supplementary Figure S11 demonstrate the corresponding ROC curves.


View this table:
[in this window]
[in a new window]

 
Table 1 Comparison of SNN-LERM and MSC for identification of C.acetobutylicum pSOL1 genes in six slides of M5 versus WT experiments [using data where SNN-LERM was shown to be superior to other methods (Yang et al., 2003)]

 
The results indicate better identification by MSC in four out of the six experiments, though the magnitude of improvement is arguably small. In view of the superiority of SNN-LERM over other existing algorithms for this particular data (as claimed by Yang et al., 2003), we find our results noteworthy.

3.2 Mouse DNA microarray from ChIP-on-chip experiment
We have performed ChIP-on-chip experiments using the Mm4.7 k mouse promoter DNA microarray, with highly specific antibodies against well-characterized transcription factors. The experiments have been replicated three times. A detailed biological analysis of these experiments is published elsewhere (Blais et al., 2005).

In the main data described here, the antibodies recognized Myogenin and the experiment was performed in myotubes. In the Supplementary material we have also analyzed ChIP-on-chip experiments where antibodies recognized MyoD in both growing cells and myotubes. Following Blais and Dynlacht (2005), we have excluded any data point with more than one replicate with dust speckles on the glass slide or with low spot fluorescence intensity (65% with respect to background).

With an aim to independently validate observed binding of a transcription factor to a given genomic locus, we have performed confirmatory, gene-specific PCR on immunoprecipitated chromatin in the special case of the Myogenin data. This is a method that does not involve DNA amplification, DNA labeling and microarray hybridization, which are the most prominent sources of error and noise in the ChIP-on-chip procedure. We have chosen microarray spots from different levels of binding ratios. Thirty-five tested genes were determined as unambiguously enriched (and thus considered true targets of the transcription factor under study), while 35 were considered unenriched. Original data for this comparison is described in Blais et al. (2005) and also provided in the Supplementary material of this paper.

In order to determine an optimal value of the parameter {alpha}0, we have applied our method of detecting first jumps in the number of outliers found by MSC. Supplementary Figure S12 describes those jumps. Accordingly we have chosen {alpha}0 = 0.2 for replicates A and C and {alpha}0 = 0.21 for replicate B.

We have compared the MSC with the binding ratio (BR) method as applied to this data in Blais et al. (2005), binding ratios with respect to the principal axis of the data, binding ratios together with LOESS normalization and the recent Chipper algorithm (Gibbons et al., 2005). The binding ratio method identifies enriched sites by selecting (according to a subjective threshold) the points with highest ratios of IP signal to input signal (equivalently, lowest M values). BR can be combined with LOESS by initial application of LOESS normalization and then identifying enriched sites by selecting points with lowest second coordinates. Similarly, BR can be applied with respect to the principal axis by shifting the data so its center of mass is zero, rotating it so the x-axis coincides with the main principal axis and then applying BR. Other approaches for normalization and identification of cDNA arrays yielded even less compelling results than the four methods and are thus not presented here.

A comparison along a full-range of true positive and negative rates is described by a ROC curve. Figure 1 presents ROC curves comparing MSC with the other four Methods. The ranks of the various methods are averaged among unexcluded replicates and their sorted values are used for identification. The areas below the curves for the different methods are recorded in Table 2. We have chosen carefully the LOESS span parameter to maximize its area below the ROC curve. In both instances, MSC performs slightly better than the four other algorithms over a full-range of FDR. However, only 1.19% of the data has been verified to be either enriched or unenriched and therefore the differences between the methods (in particular BR combined with LOESS versus MSC) are not statistically significant. We nevertheless find those results important as we are not aware of ChIP-on-chip experimental data with larger percentage of confirmatory PCRs (in practice, not commonly used with large data, since it is both a time-consuming and an expensive process).


Figure 1
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 (ad) ROC Curve of MSC compared with ROC curves of the methods: BR, BR with PCA, BR with LOESS and Chipper. The MSC curve is described by a solid line and the other curves by dashed lines.

 


View this table:
[in this window]
[in a new window]

 
Table 2 Areas below ROC curves for the different methods

 
While the ROC curve describes a full range of true positive and negative rates, in practice, there is a specific range, which is important for identification. We identify such a range by controlling the FDR level. We have chosen a level of 0.1. In the absence of clear underlying models in some of the other competing methods, we have balanced them with the same number of identified enriched points (combining all three replicates by a weighted score) for the purpose of fairly comparing identification rates. Figure 2a illustrates such a comparison between BR and MSC. MSC has identified correctly Chrnb1, Chrng, Cited2 and Myc as enriched, whereas BR misidentified them. However, BR has identified correctly Sema6c as enriched whereas MSC missed it. BR has falsely identified Hist1h2bc and Cacng1 as enriched, unlike MSC. MSC true positive rate is 0.629, whereas that of BR is 0.542. MSC false positive rate is 0, whereas that of BR is 0.057.


Figure 2
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 In (a) regular MSC (with initial transformation) is compared with BR for the three replicates of our data. In (b), we have applied MSC without initial shift and rotation onto principal axis. BR threshold is indicated by a straight horizontal line, while MSC strip is represented by the thicker curve. We have identified the enriched points of MSC in at least two replicates while applying FDR level of 0.1 for each replicate; We identified the same number of enriched points with a weighted BR score. Circles reveal enriched spots that the MSC algorithm distinguished and the binding ratio method failed to distinguish over all three replicates, while squares reveal enriched points identified by BR and not by MSC. Triangles reveal points which are not enriched and were identified by BR as enriched, unlike MSC. There were no spots that MSC failed to identify as not enriched, while BR did not. Supplementary Figure S14 describes the comparison of regular MSC and BR with respect to the coordinates obtained after applying the initial transformation. Supplementary Figure S17 compares the normalizing curves of MSC with and without initial transformation.

 
Optimal performance of MSC depends on a correct choice of the parameter {alpha}0 and our algorithm for detecting such a choice is a distinctive advantage of our method. Nevertheless, MSC is not highly sensitive to variations in the parameter {alpha}0. Table 3 illustrates this point, by recording areas below ROC curve for different values of {alpha}0 (more details appear in Supplementary Table S1). The variations in areas are not significant and the optimal area is near our choice of the optimal parameter. Similarly, in Supplementary Table S4 we record identification of true and false positives and negatives for MSC with different values of {alpha}0 and the corresponding identification values for the other methods with same percentage of detected enriched points, while maintaining a FDR level of 0.1 for MSC.


View this table:
[in this window]
[in a new window]

 
Table 3 Areas below ROC curves for MSC with different values of {alpha}0 s

 
Our application of MSC includes an initial rotation on principal axis. We have compared it to BR with respect to this axis in order to show that the initial transformation is not enough for good identification (it is worse or at least comparable to regular BR). When applying our method without this initial transformation the area below the ROC is 0.904 (Supplementary Figure S13 explains the choice of {alpha}0 and Supplementary Figure S18 presents the corresponding ROC curves). However, the area decreases with higher values of {alpha}0 (Supplementary Table S2). Nevertheless, those differences are not statistically significant. Indeed, they are mainly due to a single spot: Cacng1. This spot is falsely identified by MSC without initial rotation as enriched with very low FPR. The other methods have identified it as enriched with higher FPR (Supplementary Figures S15, S16, S19 and Supplementary Table S3).

The identification for fixed FDR (we have chosen 0.1 but other values worked as well) of the MSC without the initial transformation is good when compared with the other methods, even for a large range of {alpha}0 (Supplementary Table S5). Figure 2b illustrates such a comparison between BR and MSC. Supplementary Figure S17 demonstrates the normalizing curves obtained by MSC with and without the initial transformation as well as that of LOESS. The differences are noticeable only in a very sparse region.

Our conclusion is that applying MSC without initial rotation can also work well in identifying outliers. It is more convenient to plot the resulting curve and strip that way, as there is no need to rotate backward. Differences of the two implementations have also been compared for the MyoD data (Supplementary Figures S26–S31). The good performance of our method irrespective of the initial transformation is a strong indicator of its robustness. On the other hand, LOESS did not perform as well when rotated on the principal axis (e.g. its area under ROC is 0.896).

We believe that our experimental results provide clear indication of the attractive performance of the MSC method, as it allows investigators to extract more meaningful and reliable information from their datasets. Supplementary Figures S20 and S21 show instances of failures of some standard techniques to the latter data. We have also applied our algorithm to ChIP-on-chip experiments where antibodies recognized MyoD in both growing cells and myotubes, but with no confirmatory PCRs and report the results in Supplementary Figures S22–S31. There are several marked advantages enjoyed by our algorithm: its adaptability to regions of lower or higher variance and to areas of the data, which exhibit significant nonlinearity between channels as well as asymmetry; its model for identifying enriched points under a fixed FDR; its fast implementation; its robustness to transformation of the data and to change of parameters and its ability to choose the main parameter {alpha}0 to improve the identification results. In the ideal case when there is no nonlinearity between channels and the variance is relatively constant throughout the data, we expect MSC to perform similarly to the binding ratio method. However, MSC proves its effectiveness in analyzing many important datasets, because one is frequently confronted with experimental results that stray far from the ideal. This is due to numerous types of artifacts (unequal dye incorporation, unequal background in the two channels, different quantum yield of the dyes, etc) that remain difficult to control and confound the analysis in the presence of the inherent asymmetry of ChIP-on-chip experiments.


    4 BRIEF DISCUSSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM AND METHODS
 3 CASE STUDIES
 4 BRIEF DISCUSSIONS
 REFERENCES
 
The approach described here fills in a substantial void in the analysis of general DNA arrays, in particular arrays from ChIP-on-chip experiments. Namely, it represents an effective method for identifying enriched targets while handling logarithmic ratios of intensities with asymmetric and heteroscedastic characteristics. Currently, most standard techniques fail to analyze a large fraction of these data and many investigators resort to the simple binding ratio method in order to rank ‘outliers’ (e.g. IP-enriched sites in ChIP-on-chip experiments).

The MSC will prove most advantageous for ChIP-on-chip datasets that display mild or pronounced non-linearity, as well as for datasets where the proportion of enriched spots is very large. However, when working on datasets that are close to ideality, it still performs as well as other existing methods.


    Acknowledgments
 
The authors are grateful to the NYU Cancer Institute Genomics Facility for providing necessary instrumentation and expertise. The authors thank Mary Tsikitis and Diego Acosta for their help in performing confirmatory PCR, and Rick Young and Duncan Odom of the Whitehead Institute for advice in the design of a mouse promoter Microarray. The authors also thank Fang Cheng, Ronald R. Coifman, Peter Jones and Yi (Joey) Zhou for helpful discussions; E. Terry Papoutsakis and Carles Paredes for their help in interpreting the data appearing in their original paper on pSOL1 genes; James Glimm and Jacob Schwartz for commenting on earlier versions of this paper and finally, Mark Green and IPAM (UCLA) for inviting G.L. and J.M. to take part in their bioinformatics as well as multiscale geometry meetings, where discussions of similar topics stimulated our research. Special thanks for the careful anonymous reviewers and their constructive suggestions. J.M., G.L. and B.M. are supported by grants from NSF's ITR program, Defense Advanced Research Projects Agency (DARPA), and New York State Office of Science, Technology & Academic Research (NYSTAR). G.L. is supported by NSF grant #0612608. B.D. is supported by an NIH grant #5R01 GM067132-02. A.B. is supported by a postdoctoral training fellowship from the Fonds de la Recherche en Sante du Quebec. Funding to pay the Open Access publication charges for this article was provided by NSF grant #0612608.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on May 2, 2006; revised on November 8, 2006; accepted on November 23, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM AND METHODS
 3 CASE STUDIES
 4 BRIEF DISCUSSIONS
 REFERENCES
 

    Blais, A. and Dynlacht, B.D. (2005) Devising transcriptional regulatory networks operating during the cell cycle and differentiation using ChIP-on-chip. Chromosome Res, . 19, 1499–1511.

    Blais, A., et al. (2005) An initial blueprint for myogenic differentiation. Genes Dev, . 19, 553–569[Abstract/Free Full Text].

    Buck, M.J. and Lieb, J.D. (2004) ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83, 349–360[CrossRef][ISI][Medline].

    Buck, M.J., Nobel, A.B., Lieb, J.D. (2005) ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biol, . 6, R97[CrossRef][Medline].

    Gibbons, F.D., et al. (2005) Chipper: discovering transcription-factor targets from chromatin immunoprecipitation microarrays using variance stabilization. Genome Biol, . 6, R96[CrossRef][Medline].

    Jones, P.W. (1990) Rectifiable sets and the traveling salesman problem. Invent. Math, . 102, 1–15[CrossRef].

    Lerman, G. (2003) Quantifying curvelike structures of measures by using L2 Jones quantities. Commun. Pure Appl. Math, . 56, 1294–1365[CrossRef].

    Lerman, G., McQuown, J., Mishra, B. (2006) Multiscale curve and strip constructions, preprint; attached in supplemental material.

    Parisi, M., et al. (2003) Paucity of genes on the Drosophila X chromosome showing male-biased expression. Science, 299, 697–700[Abstract/Free Full Text].

    Pollack, J.R., et al. (1999) Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet, . 23, 41–46[ISI][Medline].

    Quackenbush, J. (2002) Microarray data normalization and transformation. Nat. Genet, . 32, 496–501.

    Reiner, A., Yekutieli, D., Benjamini, Y. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19, 368–375[Abstract/Free Full Text].

    Ren, B., et al. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306–2309[Abstract/Free Full Text].

    Rousseeuw, P.J. and Leroy, A.M. Robust Regression and Outlier Detection, (1987) , New York Wiley.

    Smyth, G.K., Yang, Y.H., Speed, T.P. (2003) Statistical issues in microarray data analysis, In: Functional Genomics: Methods and Protocols. In Brownstein, M.J. and Khodursky, A.B. (Eds.). Methods Mol. Biol, . 224, , pp. 111–136[Medline].

    van Steensel, B. (2005) Mapping of genetic and epigenetic regulatory networks using microarrays. Nat. Genet, . 37, S18–S24.

    Yang, H., et al. (2003) A segmental nearest neighbor normalization and gene identification method gives superior results for DNA-array analysis. Proc. Natl Acad. Sci. USA, 100, 1122–1127[Abstract/Free Full Text].

    Zheng, M., et al. (2005) A probability theory of ChIP-chip data. Proceedings of Joint Statistical Meetings, (extended preprint can be found at http://preprints.stat.ucla.edu/443/CHIP_zmdl_1036.pdf).


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/3/314    most recent
btl606v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Lerman, G.
Right arrow Articles by Mishra, B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lerman, G.
Right arrow Articles by Mishra, B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?