Skip Navigation

Bioinformatics 2007 23(2):e77-e83; doi:10.1093/bioinformatics/btl316
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lipson, D.
Right arrow Articles by Aumann, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lipson, D.
Right arrow Articles by Aumann, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Computational Genomics

Optimization of probe coverage for high-resolution oligonucleotide aCGH

Doron Lipson 1,*, Zohar Yakhini 1,2 and Yonatan Aumann 3

1 Computer Science Department Technion, Israel
2 Agilent Laboratories CA, USA
3 Computer Science Department, Bar-Ilan University Israel

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATHEMATICAL FORMULATION
 3 ALGORITHMS
 4 APPLICATION
 APPENDIX
 REFERENCES
 

Motivation: The resolution at which genomic alterations can be mapped by means of oligonucleotide aCGH (array-based comparative genomic hybridization) is limited by two factors: the availability of high-quality probes for the target genomic sequence and the array real-estate. Optimization of the probe selection process is required for arrays that are designed to probe specific genomic regions in very high resolution without compromising probe quality constraints.

Results: In this paper we describe a well-defined optimization problem associated with the problem of probe selection for high-resolution aCGH arrays. We propose the whenever possible isin-cover as a formulation that faithfully captures the requirement of probe selection problem, and provide a fast randomized algorithm that solves the optimization problem in O(n logn) time, as well as a deterministic algorithm with the same asymptotic performance. We apply the method in a typical high-definition array design scenario and demonstrate its superiority with respect to alternative approaches.

Availability: Address requests to the authors.

Contact: dlipson{at}cs.technion.ac.il


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATHEMATICAL FORMULATION
 3 ALGORITHMS
 4 APPLICATION
 APPENDIX
 REFERENCES
 
Alterations in DNA copy number are characteristic of many cancer types and are thought to drive some cancer pathogenesis processes. These alterations include large chromosomal gains and losses as well as smaller scale amplifications and deletions. Because of their role in cancer development, regions of chromosomal instability are useful for elucidating other components of the process. For example, since genomic instability can trigger the activation of oncogenes and the silencing of tumor suppressors, mapping regions of common genomic aberrations has been used to discover cancer-related genes. Understanding genome aberrations is important for both the basic understanding of cancer and for diagnosis and clinical practice.

Alterations in DNA copy number have been initially measured using local fluorescence in situ hybridization-based techniques. These evolved to a genome-wide technique called Comparative Genomic Hybridization [CGH, (Kallioniemi et al., 1993)], now commonly used for the identification of chromosomal alterations in cancer (Mertens et al., 1997; Balsara and Testa, 2002). In this genome-wide cytogenetic method differentially labeled tumor and normal DNA are co-hybridized to normal metaphases. Ratios between the two labels allow the detection of chromosomal amplifications and deletions of regions that may harbor oncogenes and tumor suppressor genes. Classical CGH has, however, a limited resolution (10–20 Mb). With such low resolution it is impossible to predict the borders of the chromosomal changes or to identify changes in copy numbers of single genes and small genomic regions. In a more advanced method termed array-based CGH (aCGH), tumor and normal DNA are co-hybridized to a microarray of thousands of BAC, cDNA or oligonucleotide probes (Pinkel et al., 1998; Pollack et al., 1999; Lucito et al., 2003; Brennan et al., 2004; Bignell et al., 2004; Barrett et al., 2004). The use of aCGH allows the determination of changes in DNA copy number of relatively small chromosomal regions. Using oligonucleotides arrays the resolution can, in theory, be finer than single genes.

In fact, when using oligonucleotides the resolution appears to have no limitation as these can be designed to probe any region in the genome of interest. In reality the resolution is limited for two main reasons. One is the fact that not all genomic locations can be effectively probed. For example, genomic sequences that are not unique (genomic repeats) or genomic regions with extreme GC content limit the design of specific probes. The other limitation is array real-estate—the number of genomic regions that can be probed in one array. Note that for technology implementations that use redundant probes this number is much smaller than the number of features on the array.

In this work we address the joint optimization of these two resolution limiting factors. The methods we develop are useful in the context of designing high-definition arrays. These are arrays designed to probe specific genomic regions in very high resolution. For this purpose a dense set of probes is selected for the region of interest. We are interested in doing so while maintaining optimal coverage and without compromising probe quality constraints. Major determinants of probe quality are its expected sensitivity, which is predicted by thermodynamic evaluation of the probe–target complex, and specificity, which is inferred from the extent to which close repeats of the probe can be found in the genome. In previous work (Lipson et al., 2002; Li and Stromo, 2001; Mei et al., 2003; Rouillard et al., 2003) probe sensitivity and specificity have been studied mostly in the context of expression profiling. The background context in CGH is different and so is the thermodynamics but the overall considerations remain valid. The methods described in the current paper are completely independent from those used for assigning quality to candidate probes and can therefore be applicable to any such assignment, including one derived from experimental data.

Lucito et al. (2003) describe design criteria that are based only on an empirically derived quality assignment. That is, they choose to use all probes with a given quality or better without taking any uniformity considerations into account. This approach leads to having regions that are more densely probed than others as well as to coverage discontinuities. When studying a given genomic region we are typically not biased to any specific subregions and seek to obtain copy number information for all subregions. It is therefore best to have a uniform (equidistant) distribution of probes in the region. This way we best avoid coverage discontinuities and have comparable information about all subregions. However, a uniformly distributed set of probes will greatly compromise quality considerations. We may be selecting probes that are highly non-specific, for example. This paper describes methods that assume a given quality threshold above which probes are acceptable and then optimize the coverage using only these candidate probes and working with the array real-estate constraints.

The rest of the paper is organized as follows: in Section 2 we describe a new score that accounts for probe coverage, and formally define the optimization problem. In Section 3 we describe two algorithms that efficiently solve the defined problem: a fast stochastic algorithm, and a deterministic variant. Finally, in Section 4, we demonstrate the application of our method in a biological scenario, and compare the obtained results with some alternative methods.


    2 MATHEMATICAL FORMULATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATHEMATICAL FORMULATION
 3 ALGORITHMS
 4 APPLICATION
 APPENDIX
 REFERENCES
 
In this section we formally define the optimization problems associated with probe selection for aCGH arrays, as discussed in the Introduction. We believe that such a formal definition is of independent interest, as the problem is most commonly stated in ill-defined terms.

Two types of parameters are typically considered when evaluating candidate probes for a hybridization assay. Sensitivity is a measure of a probe's ability to strongly interact with its target, and is typically assessed by considering the thermodynamic stability of the probe–target complex. Specificity is a measure of a probe's ability to discriminate between its intended target and other non-specific molecules it might cross hybridize to, and is typically assessed by considering the similarity of the target to the expected molecular background (e.g. the entire transcriptome in an expression profiling assay, or the entire genome in an aCGH assay). Methods for predicting the sensitivity and specificity of candidate probes have been extensively studied in the past (see Introduction) and will not be considered here. Instead, we shall assume we are provided with a quality parameter q(p), specifying the overall predicted performance of the probe p.

When designing oligonucleotide probes for aCGH an additional criterion should also be taken into consideration—that of coverage. Specifically, when an array is designed for the purpose of pinpointing genomic breakpoints, an important consideration in the design is to minimize the uncertainty at which the breakpoints are mapped, wherever they may be. Thus, probes should be somehow uniformly spaced, so as to be not ‘too far’ from the location of any possible breakpoint.

Accordingly, when designing an aCGH array, we are interested in selecting ‘highest quality’ probes, with the ‘best possible’ coverage. In this section we provide formal definitions that capture the intuitive meaning of these notions.

2.1 Probe quality
Let P be the set of candidate probes. For brevity, we identify each probe with its genomic location, thus Formula. Let Formula be the quality function associating a quality score with each probe. Intuitively, we are interested in using the ‘best possible’ probes, i.e. those with the highest quality score. However, the high quality probes may not be evenly distributed within the genomic area of interest. Thus, choosing the globally best probes may result in poor coverage of the genome. Hence, rather than using the globally highest quality probes, we seek to use the locally best probes. Specifically, for a probe p isin P and window size w, we define the w-local quality of p to be the percentile of q(p) within scores in the w window around p. Formally, let Pw be the set of candidate probes within the window [p w/2, p + w/2]. Then

Formula 1(1)
We now define the ‘good probes’ to be those with high local quality. Formally, for a window size w and threshold {tau} isin [0,1], the w-local {tau}-good probes are all the probes p for which qw(p) ≥ {tau}. We shall seek to use only probes which are w-local {tau}-good, for some proper choice of w and {tau}. We note that it is also possible to optimize for {tau}, as explained later.

2.2 Probe coverage
As mentioned, we are also interested that the probes ‘cover’ the genomic region of interest with the most granular coverage possible. Specifically, if there is a breakpoint at some point on the genome, we would like to be able to determine its location as precisely as possible. However, two factors limit the precision that can obtained:

  • With a limited number of probes in the array, probes cannot be placed at each genomic location. Rather, they must be spread across the region of interest. In this case, the localization of genomic events can only be determined by the pair of flanking probes. We shall seek that for all genomic locations, the gap between this pair of probes is as small as possible.
  • Some genomic regions do not contain any candidate probes, or only probes of poor quality. In such cases, large gaps between probes are inevitable. Any breakpoint occurring within these gaps can only possibly be localized to within the pair of candidate probes bordering the gap.
Thus, we seek to choose a set of probes that uniformly cover all genomic locations—whenever possible, and as close as possible to the points within the large gaps—otherwise. The following definition captures this intuition:

DEFINITION 2.1
Given a genomic region G, a set of candidate probes P and a parameter isin, a subset C = (c1, ... , ck) {subseteq} P is a whenever possible (WP) isin-cover of G with respect to P if for any genomic location x isin G, the following holds. Let ci and ci+1 be the two selected probes closest to x from the left and from the right, respectively (if x < c1 then c0 is set to be the left-end of G, and for x > ck, ck+1 is the right-end of G). Then, one of the following holds:
  1. ci+1 – ci ≤ isin (i.e. the flanking selected probes are within isin distance of each other), or
  2. there is no candidate probe between ci and ci+1.
For such a cover C, we say that the resolution of C is isin.

Thus, a WP isin-cover of G guarantees that for any possible breakpoint x, it can either be localized to within isin base pairs, or to within the best resolution that could have been obtained even if all candidate probes would have been used. We believe that the notion of WP isin-cover faithfully captures the requirements of probe selection.

2.3 The optimization problem
Given a fixed number of probes in the array, the problem of designing an aCGH array is therefore a bicriteria optimization problem: select a subset of probes that are (i) of high quality (ii) with high resolution. A standard approach to such bicriteria problems is to optimize one criterion, given a constraint on the other. In our case, this gives rise to the following two optimization problems:

PROBLEM 2.2
(Probe selection—resolution optimization). Given an integer k, genomic region G, window size w and quality threshold {tau}, find the minimal possible resolution isin* and a probe subset C sub P such that:
  1. |C| = k,
  2. C contains only {tau}-good probes,
  3. C is a WP isin*-cover of G with respect to the {tau}-good probes.

PROBLEM 2.3
(Probe selection—quality optimization). Given an integer k, genomic region G, window size w and resolution threshold isin, find a maximum quality score {tau}*, and a probe subset C sub P such that:
  1. |C| = k,
  2. C contains only {tau}*-good probes,
  3. C is a WP isin-cover of G with respect to the {tau}*-good probes.

In this paper we focus on the resolution optimization version, but also provide an efficient solution to the quality optimization version.

2.4 Problem variants
2.4.1 Multiple genomic segments
A simple but important variant of the problem involves multiple genomic segments. In many cases, an aCGH array is designed to assay more than a single genomic segment at a time, most typically different chromosomal segments. In this case we would like to uniformly cover all the genomic regions of interest.

2.4.2 Biased selection
Another useful variant of the problem involves biased selection of probes in certain genomic segments. The most typical application of this variant is for increasing the resolution of probes in gene coding regions at the expense of non-coding regions, while preserving the uniform resolution within each of the subtypes.


    3 ALGORITHMS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATHEMATICAL FORMULATION
 3 ALGORITHMS
 4 APPLICATION
 APPENDIX
 REFERENCES
 
In this section we describe efficient algorithms for the probe selection problem, focusing on the resolution optimization version. We first show how to determine the minimal number of probes required for a WP isin-cover, for a given isin. We then use this procedure to search for the minimal isin for a given number of probes. We provide a fast randomized algorithm, described in Section 3.3.1, that finds the optimal isin in O(nlogn) steps. A deterministic algorithm with the same asymptotic performance is described in Section 3.3.2. Finally, we show how these algorithms can be extended to accommodate the variants of the problem that were described in Section 2.4.

3.1 Determining the {tau}-good probes
Given a window size w, it is possible to calculate qw(p) for each p isin P by scanning the genome with a sliding window of size w. As we slide the window from left to right, at each step, at most one candidate probe is added and one omitted. We maintain the quality scores of the probes in the window in a balanced binary search tree. This tree can be maintained in O(nlogm) steps, using any of the known balanced tree data structures (here, m is the maximum number of probes in a window of size w, n = |P|). For each probe p, its w-local quality score can be obtained in O(nlogm) steps, by finding its rank in the search tree. Thus, the w-local quality of all candidate probes can be determined in O(nlogm) steps. This also allows to determine the {tau}-good probes, for any threshold {tau}. From here and on, we denote by Formula 1(p1 < p2 < ··· <Formula 1) the sequence of {tau}-good candidate probes, for the chosen threshold {tau}.

3.2 Finding the minimal size for a WP isin-cover
In order to solve the resolution optimization problem, we first consider the reverse optimization problem:

PROBLEM 3.1
Given a genomic region G = [gbeg,gend] and a fixed value of isin, find a WP isin-cover C for G of minimal size.

Algorithm 1, based on a greedy approach, solves this problem in Formula 1 steps.


Algorithm 1 Find a minimal size WP isin-cover

FindMinCover(isin)
1: Formula 1
2: Formula 1
3: While Formula 1 do
4:     Set ci+1 to be the rightmost candidate probe following ci such that Formula 1.
5:     If no such probe exists: ci+1 is the next candidate probe to the right of ci.
6:     i <- i + 1
7: return Formula 1.

CLAIM 3.2
The subset C returned by Algorithm 1 is a WP isin-cover of G.

PROOF. Consider a point x isin G. Let ci be leftmost probe in C to the right of x. Consider two cases:

  1. ci is within distance isin of ci – 1.
  2. ci is not within distance isin of ci 1. In this case, by the algorithm, ci – 1 and ci are consecutive candidate probes in Formula 1.

CLAIM 3.3
There is no WP isin-cover C* for G with |C*|<|C|.

PROOF. Assume there is such a WP isin-cover C*, with Formula 1. Let Formula 1 be the first i such that Formula 1 (there must be one since Formula 1). Accordingly, Formula 1. Two cases are possible:

  1. (ci ci–1) > isin and therefore (Formula 1.
  2. ci was chosen as the rightmost probe following ci–1 such that (cici–1) ≤ isin. Consequently, Formula 1.

Let x be a genomic location strictly between the two selected probes Formula 1 and Formula 1. Formula 1 and there exists an unselected candidate probe ci between Formula 1 and Formula 1 contradicting the fact that C* is a WP isin-cover.

3.3 Optimizing resolution
In the probe selection problem, we are given a number k of probes and seek to optimize the resolution. We do so by performing a binary search on the resolutions, using Algorithm 1 as the decision criteria. The recursive binary search procedure is described in Algorithm 2.


Algorithm 2 Find a k-sized WP isin-cover with minimal isin.

UniProbe (k, G, Formula 1)
1: return RecursiveOptimizeResolution (k, 1,|G|)
RecursiveOptimizeResolution (k,low,high)
1: if low = high then
2:     C <- FindMinCover (low)
3:     return C
4: split <- FindSplit (low,high)
5: C <- FindMinCover (split)
6: if |C| > k then
7:     return RecursiveOptimizeResolution (k,split,high)
8: else
9:     return RecursiveOptimizeResolution (k,low,split)

The procedure performs a binary search for the optimal isin within the range [1,|G|]. Clearly, the optimal isin is a distance between some two probes Formula 1 (otherwise, isin could be reduced to the closest distance). There are Formula 1 such pairwise distances. Thus a balanced binary search could pinpoint the optimal value in Formula 1 recursive calls. However, these Formula 1 pairwise distances are not provided to us explicitly, and enumerating them all would take Formula 1 steps by itself. Thus, we seek to somehow perform a balanced binary search on this Formula 1-sized space without explicitly enumerating it. Interestingly, this can be done in Formula 1 time, as explained hereunder. The key element is to find a good split value (Line 4) in linear time. We provide both randomized and deterministic procedures to find such a split value efficiently.

Each invocation of the RecursiveOptimizeResolution procedure takes Formula 1 steps (the complexity of FindMinCover). Provided that the depth of the recursion is Formula 1, the overall complexity is Formula 1.

3.3.1 Fast randomized split selection
We first describe a fast randomized procedure for the selection of the split value. The procedure provides that the expected number of recursive calls in Algorithm 2 is Formula 1. A description of the algorithm is provided as Algorithm 3, and an explanation follows.


Algorithm 3 Fast randomized selection of split value

Randomized-FindSplit (low,high)
1: for Formula 1 do
2:     {ell}(i)<-argmin{pj:|pjpi|isin[low,high]}
3:     h(i)<-argmax{pj:|pjpi|isin[low,high]}
4: Formula 1
5: r <- random integer in [1,{sigma}]
6: <i0,j0> <- r-th element of the set Formula 1
7: return split = |pj0pi0|

Let X = X (low, high) be the set of all pairs of probes <pi,pj> such that the distance between the two is within the range [low, high]. We wish to find a split value split such that the number of pairs in X for which the distance is above split is roughly the same as the number of pairs for which the distance is under split. For a given starting probe pi, let {ell}(i) be the smallest index such that <pi, p{ell}(i)> isin X (i.e. |p{ell}(i)pi| isin [low,high]), and let h(i) be the largest such index. Clearly, for any j between {ell}(i) and h(i), <pi, pj> isin X. The algorithm first determines {ell}(i) and h(i), for all Formula 1. This can be completed in Formula 1 steps, as explained later. Given these values, we can compute the size of X (Line 4), and choose a random element from this set (Lines 5–6). The distance of this pair is the split value (Line 7). It is easy to see that with probability half the split value is between the 25-th and 75-th percentile of the distances of pairs in X. Hence, after an expected Formula 1 recursive calls of Algorithm 2, the recursion ends.

It remains to show how to efficiently compute {ell}(i) and h(i). Note that Formula 1, and similarly for h(i). Thus, with a single scan over the indexes Formula 1, we can determine all {ell}(i)s and h(i)s. We thus obtain:

CLAIM 3.4
Algorithms 2 and 3 solve the resolution optimization version of the probe selection problem in Formula 1 expected time (where Formula 1 is the number of {tau}-good candidate probes).

3.3.2 Deterministic split selection
A deterministic procedure for split selection, which also runs in Formula 1 time, can be obtained using the methods for selection in sorted matrices, directed sum matrices, point distances in Formula 1, and other multisets (6,17,1). A full description of the algorithm, as it applies to our setting, is provided in the appendix. With this algorithm we obtain:

CLAIM 3.5
The resolution optimization version of the probe selection problem can be solved deterministically in Formula 1 steps (where Formula 1 is the number of {tau}-good candidate probes).

In practice, the randomized algorithm is substantially simpler and faster, and provides the same results.

3.4 Algorithmic variants
We now show how to solve the different problem variants discussed in Sections 2.3 and 2.4.

3.4.1 Quality optimization
Suppose we wish to optimize quality, rather than resolution, as described in Problem 3. In this case we perform an algorithm similar to Algorithm 2, using a binary search on quality, rather than resolution. In this case the number of different quality values is bounded by n, so a simple binary search can be employed.

3.4.2 Multiple genomic segments
The case of multiple genomic regions is handled almost identically to the single segment case. To do so, note that for any collection of disjoint genomic segments G1, G2, ... , Gt, any WP isin-cover C for this collection can be divided into a collection of disjoint covers C1, C2, ... , Ct, each constituting a WP isin-cover for the corresponding Gj. Thus, the minimal WP isin-cover for G1, G2, ... , Gt, is the union of the minimal covers for each of the Gjs. Accordingly, in the binary search algorithm (Algorithm 2), we create the minimal cover for the collection of segments (Lines 2 and 5) as the union of the individual covers. All the rest of the algorithms remain the same, with the understanding that the distance between probes in separate segments is infinite.

3.4.3 Biased selection
Suppose that we have two types of regions, say coding and non-coding, and we seek a cover for which the resolution in the coding regions is {gamma} more granular than that in the non-coding regions. To do so, we multiply all distances within the coding regions by a factor of {gamma} and run the algorithm described above. We will obtain an optimal cover, such that the resolution of the coding region is smaller by a factor of {gamma} from that of the non-coding region. This technique can be extended for more than two types of regions as well.


    4 APPLICATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATHEMATICAL FORMULATION
 3 ALGORITHMS
 4 APPLICATION
 APPENDIX
 REFERENCES
 
We demonstrate the application of UniProbe (Algorithm 2) on a typical scenario of high-resolution probe design. Assume we are searching for a genomic breakpoint at the 6 Mb p-terminus of chromosome 10. Given a set of candidate probes, we would like to select 3000 oligonucleotide probes that offer the best guaranteed resolution of detecting breakpoints. Figure 1 depicts a set of 82 500 candidate probes, taken from a probe database (unpublished data). Each point represents the genomic position and quality score of a single probe, predicted from thermodynamical consideration of the probe sequence and comparison to the entire genome background.


Figure 1
View larger version (50K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 The 82 500 candidate probes in chromosome 10 0–6 Mb. Each point represents the genomic position and predicted quality score of a single probe.

 
In theory, placing the probes at equal spacing would provide a guaranteed 2 Kb resolution, in the sense that the genomic distance between each pair of consecutive probes would be 2 Kb. However, achieving this theoretical resolution is impossible. For example, unclosed gaps in the genomic sequence at positions 0–50 Kb and 5.63–5.68 Mb [UCSC Genome Browser, Kent et al., (2002)] are genomic regions at which maximal resolution cannot be obtained. In addition, we require that highest quality probes are used while sustaining a high degree of uniformity. As can be seen in Figure 1 this dual objective may force us to use probes of lower quality in genomic regions that contain only inferior probes, e.g. at 50–130 Kb or 4.97–5.07 Mb.

We compare the set of probes selected by UniProbe to two different naive methods for uniformly spaced probe selection. The first method (Qual) selects probes on the basis of their quality alone, assuming that randomness in the positions of the superior probes will lead to some degree of uniformity in their coverage. This method was described by Lucito et al. (2003) for designing a whole-genome representational oligonucleotide array, where the probes with the best empirical performance were chosen. Here, we select the 3000 probes with highest quality score. The second naive method (Bin) is based on binning: the total genomic segment is divided into 3000 equally sized bins, and the highest quality probe in each bin is selected. Note that no probes can be selected from empty bins, reducing the total number of selected probes to below 3000.

Figure 2a compares of the resolution performance of the different algorithms—Qual, Bin and UniProbe with three different values of {tau}: 0.5, 0.75 and 0.85 (w = 2 Kb). For each algorithm, and for each genomic distance d ≤ 6 Kb, we note the fraction of genomic positions in the complete segment that lie within gaps of size ≤d between adjacent selected probe. For comparison we depict the optimal performance that could be expected from probes that are placed at equal spacing along the segment (Unif). Figure 2b depicts the quality distributions of the selected probes.


Figure 2
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Comparison of the resolution performance of the naive algorithms Qual and Bin, and the new algorithm UniProbe (UP) with quality threshold {tau} = 0.5, 0.75, 0.85. (a) Resolution performance—the curve for each algorithm describes the fraction of genomic positions within the target genomic segment that lie within a gap of up to the given size between adjacent selected probes. Unif describes the theoretical optimal result that could be expected from probes that are placed at equal spacing along the segment. (b) Quality performance—the curve for each algorithm describes the distribution of quality scores of the selected probes.

 
It is clear that the algorithm Qual performs very badly, in terms of resolution, as could be expected from an algorithm that does not take uniformity directly into consideration, although the probe qualities are clearly superior. The algorithm UniProbe guarantees a ‘whenever possible’ resolution of isin = 2085,2186,2258 for {tau} = 0.5, 0.75, 0.85, respectively, and some 80% of all genomic positions are located within gaps of size ≤isin between adjacent selected probes. The remaining 20% are located within gaps that cannot be probed at the guaranteed resolution. The new algorithm outperforms Bin for all three values of {tau}. The performance of Bin converges with UniProbe for values of d > 4000. This observation is explained by the fact that Bin may deviate from the optimal solution by a maximal factor of 2 (the maximal distance between probes in two consecutive non-empty bins is twice the size of the bin). Overall, the resolution obtained by UniProbe is close to optimal, with values of isin approaching the optimal 2 Kb. By definition, this resolution is guaranteed for all genomic locations for which this is possible whereas the closest possible probes are guaranteed for the remaining locations, which are located within gaps in the candidate probe set. The quality of the probes selected by UniProbe, although slightly inferior to those selected by Bin, are still satisfactory high.


    APPENDIX
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATHEMATICAL FORMULATION
 3 ALGORITHMS
 4 APPLICATION
 APPENDIX
 REFERENCES
 
Here, we show how to deterministically find a split point in Formula 1. As noted, the algorithm follows the ideas of (Johnson and Mizoguchi, 1978; Salowe, 1989; Agarwal et al., 1993) for selection in sorted matrices, direct sum sets, point distances in Formula 1, and other multisets. We describe the algorithm as it pertains to our problem. The algorithm provides that we find a split value that is guaranteed to be ‘relatively-balanced’, i.e. for at least one-fourth of pairs in X the distance is above the split value, and for at least one-fourth of the pairs—the distance is below (recall that X is the set of all probe pairs <pi, pj> such that the distance between the two is in the range [low, high]). A pseudocode description of the procedure is provided as Algorithm 4, and an explanation follows.

In the procedure, we use the following notations. For each Formula 1 we denote:

  • {ell}(i)—smallest index such that |p{ell}(i)pi|isin [low,high]
  • h(i)—largest index such that |ph(i)pi| isin [low,high]
  • (d(i) = h(i)–{ell}(i) + 1—number of pairwise distances within the range [low, high] that start at the probe pi
  • Formula —midpoint between {ell}(i) and h(i)
  • Formula 1—distance from pi to its corresponding mid-point.

The values of {ell}(i) and h(i) can be computed in Formula 1 steps for all i collectively, as explained above in the description of the randomized procedure. The other values can be computed from {ell}(i) and h(i) in an additional Formula 1 steps.


Algorithm 4 Deterministic selection of split value

Deterministic-FindSplit (low,high)
1: for all Formula 1 do
2:     initialize d(i),m(i),v(i)
3: Formula 1
4: Formula 1
5: return RecursiveSplit (S,0)
RecursiveSplit (S,b)
1: if Formula 1 then
2:     let i0 be such that S = {i0}
3:     return v(i0)
4: {alpha} <- median of { v(i):i isin S}
5: S <- {i isin S : v(i)≤{alpha}}
6: S+ <- {i isin S : v(i) ≥{alpha}}
7: Table 4
8: if Formula 1 then
9:     return RecursiveSplit (S, b)
10: else
11:     return RecursiveSplit (Formula 1)

Consider the set of mid-point distances Formula 1. For a given Formula 1, let

Formula 2(2)
For any Formula 2, necessarily Formula 2. Similarly, setting

Formula 3(3)
we have that Formula 3 for all Formula 3. Thus, we seek to find an i for which both below (i) and above (i) are large. Specifically, let i* be such that:

  1. |below(i*)|≥|X|/4,
  2. below(i*) is the smallest possible, subject to the above constraint.

Then,

CLAIM 0.1
For i* as defined above,
  1. |below (i*)|≥|X|/4
  2. |above (i*)|≥|X|/4

PROOF.

  1. By definition |below(i*)|≥|X|/4.
  2. Suppose that |above(i*)|<|X|/4

Let i1 = argmax{v(i):v(i)<v(i*)}. Note that below(i1){cup}above(i*) covers at least half of X. Thus, |below(i1)|≥|X|/4. However, below(i1) sub below (i*), in contradiction to the minimality of below(i*).

Thus, choosing v(i*) as a split value guarantees that at least a constant fraction of the search space is eliminated.

CLAIM 0.2
Algorithm 4 returns v(i*) in Formula 3 steps.

PROOF. We first prove that it returns v(i*). By induction we prove that at all recursive calls to the function RecursiveSearch, S always contains i*. Initially, S is Formula 3 and the claim holds. Suppose that i* isin S for some call of the recursive function. Consider the i for which v(i) is the median chosen in Line 4. If i* < i then i* isin S. The value of Formula 3 is exactly the size of below(i). Hence, the test at Line 8 will succeed and the next iteration will be with S = S and the claim holds. Similarly, if i* ≥ i then i* isin S+. Now the test at Line 8 will fail and the next iteration will be with S = S+, and the claim holds.

The running time of executing a single iteration of the function RecursiveSearch is dominated by Lines 4–7, all of which can be completed in O(|S|) steps. At each recursive call, the size of S is reduced by half. Hence, the entire process is completed in Formula 3 steps.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATHEMATICAL FORMULATION
 3 ALGORITHMS
 4 APPLICATION
 APPENDIX
 REFERENCES
 

    Agarwal, P.K., et al. (1993) Selecting distances in the plane. Algorithmica, 9, 495–514[CrossRef].

    Balsara, B.R. and Testa, J.R. (2002) Chromosomal imbalances in human lung cancer. Oncogene, 21, 6877–6883[CrossRef][Web of Science][Medline].

    Barrett, M.T., et al. (2004) Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA. PNAS, 101, 17765–17770[Abstract/Free Full Text].

    Bignell, G.R., et al. (2004) High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res, . 14, 287–295[Abstract/Free Full Text].

    Brennan, C., et al. (2004) High-resolution global profiling of genomic alterations with long oligonucleotide microarray. Cancer Res, . 64, 4744–8[Abstract/Free Full Text].

    Johnson, D.B. and Mizoguchi, T. (1978) Selecting the kth element in x + y and x1 + x2 + ... + xm. SIAM J. Comput, . 7, 147–153[CrossRef].

    Kallioniemi, O.P., et al. (1993) Comparative genomic hybridization: a rapid new method for detecting and mapping DNA amplification in tumors. Semin Cancer Biol, . 4, 41–46[Web of Science][Medline].

    Kent, W.J., et al. (2002) The human genome browser at UCSC. Genome Res, . 12, 996–1006[Abstract/Free Full Text].

    Li, F. and Stormo, G.D. (2002) Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17, 1067–1076.

    Lipson, D., et al. (2002) Designing specific oligonucleotide probes for the entire S. cerevisiae transcriptome. In Second Workshop on Algorithms in Bioinformatics (WABI 02). LNCS, 2452, 491–505.

    Lucito, R., et al. (2003) Representational oligonucleotide microarray analysis: s high-resolution method to detect genome copy number variation. Genome Res, . 13, 2291–2305[Abstract/Free Full Text].

    Mei, R., et al. (2002) Probe selection for high-density oligonucleotide arrays. PNAS, 100, 11237–11242.

    Mertens, F., et al. (2002) Chromosomal imbalance maps of malignant solid tumors: a cytogenetic survey of 3185 neoplasms. Cancer Res, . 57, 2765–2780.

    Pinkel, D., et al. (1998) High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat. Genet, . 20, 207–211[CrossRef][Web of Science][Medline].

    Pollack, J.R., et al. (1999) Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet, . 23, 41–46[Web of Science][Medline].

    Rouillard, J., et al. (2003) Oligoarray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res, . 31, 3057–3062[Abstract/Free Full Text].

    Salowe, J.S. (1989) L-infinity interdistance selection by parametric search. Inform. Process. Lett, . 30, 9–14[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
S. Lemoine, F. Combes, and S. Le Crom
An evaluation of custom microarray applications: the oligonucleotide design challenge
Nucleic Acids Res., April 1, 2009; 37(6): 1726 - 1739.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lipson, D.
Right arrow Articles by Aumann, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Lipson, D.
Right arrow Articles by Aumann, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?