Bioinformatics Advance Access originally published online on January 24, 2006
Bioinformatics 2006 22(7):849-856; doi:10.1093/bioinformatics/btl004
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Computation of recurrent minimal genomic alterations from array-CGH data
1LRI, UMR CNRS 8623, Université Paris Sud, bât 490 91405 Orsay cedex, France
2UMR CNRS, 144, Institut Curie 26 rue d'Ulm 75248 Paris cedex 05, France
3Service de Bioinformatique, Institut Curie 26 rue d'Ulm 75248 Paris cedex 05, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The identification of recurrent genomic alterations can provide insight into the initiation and progression of genetic diseases, such as cancer. Array-CGH can identify chromosomal regions that have been gained or lost, with a resolution of
1 mb, for the cutting-edge techniques. The extraction of discrete profiles from raw array-CGH data has been studied extensively, but subsequent steps in the analysis require flexible, efficient algorithms, particularly if the number of available profiles exceeds a few tens or the number of array probes exceeds a few thousands.
Results: We propose two algorithms for computing minimal and minimal constrained regions of gain and loss from discretized CGH profiles. The second of these algorithms can handle additional constraints describing relevant regions of copy number change. We have validated these algorithms on two public array-CGH datasets.
Availability: From the authors, upon request.
Contact: celine{at}lri.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Cancer is a genetic disease. Tumour initiation and progression result from the activation of oncogenes and the inactivation of tumour suppressor genes (Fearon and Vogelstein, 1990; Vogelstein and Kinzler, 2004). Genomic instability is a hallmark of cancer and most cancers display various genomic alterations, such as losses, gains and amplifications of chromosome regions. Cancer-associated gains and amplifications are thought to be responsible for oncogene activation and chromosomal deletions are thought to result in the inactivation of at least one copy of a tumour suppressor gene, the other copy being inactivated by a point mutation or other genetic or epigenetic event (Pinkel and Albertson, 2005).
Large-scale analysis of genomic alterations is now possible with array-CGH (comparative genomic hybridization) (Solinas-Toldo et al., 1997; Pinkel et al., 1998). Fragments of genomic DNA are spotted as probes on a glass slide and hybridized with a mixture of tumour and normal DNA, labelled with two different fluorophores. An alternative approach to the spotting of genomic DNA fragments is the use of cDNA arrays (Pollack et al., 2002) or oligonucleotide arrays (Lucito et al., 2003; Herr et al., 2005). The data obtained with array-CGH techniques should provide a rapid, precise identification of the chromosomal regions altered in tumours. An increasing number of tools [see, in particular, CGHAnalyzer (Margolin et al., 2005), ChARMView (Myers et al., 2005)] have recently become available for managing, discretizing, visualizing sets of CGH profiles. CGHAnalyzer also provides statistical tools for supervised or unsupervised analysis of sets of genes, based on copy number status, with or without discretization. However, transverse analyses of array-CGH profiles to define genomic regions frequently subject to copy-number change are frequently performed manually, as a preliminary step before further analysis [see, among others Veltman et al. (2005), Schraders et al. (2005), de Leeuw et al. (2004)]. Attempts have recently been made to construct common alteration regions automatically (Aguirre et al., 2004; Tonon et al., 2005), but this crucial task is still mostly carried out on a manual, ad hoc basis. No general, reusable formalization or tool for finding common or recurrent alteration regions in a CGH-array dataset is currently available. We define a recurrent region as a sequence of altered probes common to a set of CGH profiles and a minimal recurrent region as a recurrent region that contains no smaller recurrent region. In many cases, the accurate determination of minimal regions of chromosomal alterations is the first, crucial step towards the identification of new oncogenes and tumour suppressor genes. If the number of array-CGH profiles to be analysed approaches a few tens, or if there are more than a few thousand array probes, it is very difficult to develop a global view of all the genomic alterations in the dataset, and therefore to identify recurrent regions of gain and loss. Characterization of the minimal regions of alteration can also improve our understanding of tumour progression, even before identification of the genes involved. Minimal regions can be used as new variables for the analysis of array-CGH profiles, e.g. to explore the patterns of copy number alterations in groups of tumours. As these minimal regions are thought to convey concise, biologically meaningful information, their use should improve both supervised and unsupervised classification analyses.
We propose a formalization and two algorithms for computing minimal copy number alteration regions. The first step is identificationstarting from normalized array-CGH dataof the chromosomal regions altered in a tumour. We recently described a method, the Gain and Loss Analysis of DNA (GLAD) algorithm (Hupé et al., 2004), for the automatic detection of breakpoints, using array-CGH profiles. GLAD assigns a status (Gain, Loss or Normal) to each chromosomal region. We now describe two algorithms that compute recurrent alteration regions from such sets of discretised profiles generated by GLAD. The first one, MAR, efficiently computes all minimal recurrent alteration regions from a set of discretized profiles. MAR may identify too many minimal regions (see Section 4), the manual validation of which may be time-consuming for biologists. In such situations, the tumour biologist may have expert knowledge concerning what makes an alteration region relevant. It may be possible to express this knowledge as a set of simple constraints, such as a minimum frequency of a given alteration region in a dataset, or the number of observations defining the border of the alteration region. We therefore propose a second algorithm, CMAR, that computes minimal constrained regions of chromosomal gain and loss from the discretized CGH profiles generated by GLAD. This algorithm can easily be extended to handle additional constraints. Although CMAR is less computationally efficient than MAR (quadratic rather than linear in terms of the number of probes describing the profiles), it should generate fewer, potentially more relevant alteration regions. The parameters of the current constraints implemented in CMAR can easily be adapted to any given dataset.
In Section 2, we introduce the terminology and notations required for the two algorithms and present the first algorithm. Section 3 introduces a number of constraints and their properties, and presents the extension of the first algorithm to the computation of minimal constrained regions. Section 4 provides an experimental validation of the approach, using public CGH data for various types of cancer and, finally, Section 5 sums up the advantages and current limitations of the method and indicates promising directions for further research.
| 2 MINIMAL REGIONS |
|---|
|
|
|---|
The notion of a minimal common alteration region has not been formalized as such in the bioinformatics community. This concept is, however, used by biologists searching for candidate genes involved in tumour initiation and progression, using data describing genomic alterations across sets of genomic profiles. We provide here a formalization based on Formal Concept Analysis theory (Ganter and Wille, 1999).
2.1 Formalization
We assume that we have, as input data, a three-value discrete matrix describing each observation in terms of gain, loss and normal probes. It is straightforward to transform such a discrete matrix into two boolean contexts, Mg and Ml, describing gain and loss events in array-CGH profiles, respectively, with no loss of information.
DEFINITION 1. A context is a triplet (O, P, M) where O is a finite set of observations of size NO, P is a finite set of probe attributes of size NP and M is a binary relationship between O and P, (M
O x P). For simplicity, we will also refer to M as a context when O and P are clearly known.
The contexts Mg and Ml are computed such that Mg(o,p) = 1 if probe p is gained in o, Mg(o, p) = 0 if p is not gained in o, Ml(o, p) = 1 if probe p is lost in o and Ml(o, p) = 0 if p is not lost in o. The computations of minimal gain and loss regions can therefore be handled as identical, independent problems. We will now illustrate our algorithms in the context described in Table 1.
|
With this representation of the array-CGH data, recurrent gained and lost genomic regions can first be seen as rectangles of ones in two 10 matrices. This problem has been extensively studied in Data Mining, in the area of frequent itemset mining [see the seminal paper Agrawal and Srikant (1994)]. The problem of computing closed (Pasquier et al., 1999) and constrained (Ng et al., 1998) patterns has recently received much attention, particularly for large and dense extraction contexts, such as those for most DNA array data (Pang et al., 2003; Besson et al., 2005).
The problem dealt with here is more specific: the set of probes P is totally ordered by the relationship
P, where
P is the ordering of probes in the genome. Consequently, the patterns of interest are not subsets of P, but are instead sequences of probes.
We introduce the following notation for the representation and handling of our specific sequences. Given two probe attributes pi and pj, pi
P pj if and only if i
j. A sequence of contiguous probes is denoted by [pi.pj]. For simplicity, a single probe sequence [pi.pi] is denoted by pi. Strict inclusion between sets or sequences is denoted by
, whereas loose inclusion is denoted by
.
DEFINITION 2. A lattice is a partially ordered set (L,
) such that any two nodes n1 and n2
L have a single least upper bound (lub) and a single greatest lower bound (glb). The lub s
L of two nodes n1 and n2
L is such that n1
s and n2
s and there is no other node s'
L such that n1
s', n2
s' and s'
s. Symmetrically, given any two nodes, n1 and n2
L, the glb g
L of n1 and n2 is such that g
n1, g
n2 and there is no other node g'
L such that g'
n1 and g'
n2 and g
g'.
The set of all probe sequences of P is denoted by S(P). (S(P),
) is a lattice, isomorphic to the lattice of intervals of [1.NP]. Therefore, for simplicity, a sequence of S(P),[pi.pj] with 1
i
j
NP, will be denoted [i.j].
DEFINITION 3. Let s be a sequence of probes of S(P). The extension of s, given the context M, denoted ext(s, M) [or ext(s), when there is no ambiguity concerning the reference context], is the set of all observations oi of M such that all probe attributes of s are set to 1 in oi, denoted in short as s
oi. The frequency of s is the size of its extension.
Some sequences of S(P), the closed ones, are remarkable in the context M: they are the largest sequences occurring in a given set of profiles. Closed sets and sequences are useful for Data Mining because they provide a complete and compact representation of all possible solutions to a mining problem.
Formally, a closure operation on S(P) can be defined as follows.
DEFINITION 4. Let s a sequence of P, the closure of s given a context M, denoted closureP(s, M) be the largest supersequence of s that has the same extension as s. A sequence s is closed if closureP(s, M) = s. Hereafter, a closed sequence of S(P) will be referred to as a region.
The closure of a sequence s can be computed iteratively by intersecting the largest supersequences of s in each observation of s extension. As a consequence, a closed sequence of P, s = [pi.pj]
S(P) of extension e, is one for which e
ext(pi1)
e and e
ext(pi+1)
e.
EXAMPLE 1. In the context of Table 1, [p4.p5] is a sequence of S(P) of extension {o3, o4, o5}, but is not closed. Its closure is [p3.p6].
From the above definitions, we can derive the following properties. The proofs of these properties are provided in Appendix 1 of the Supplementary Material.
PROPOSITION 1. Let us denote by R(P) the set of closed sequences of P. (R(P),
) forms a lattice.
EXAMPLE 2. The set of all closed sequences of the context defined in Table 1 is [p1.p1], [p3.p3], [p5.p5], [p7.p8], [p10.p10], [p11.p11], [p5.p6], [p5.p8], [p3.p6], [p1.p8]. The lattice of all such sequences is given in figure 1 of Appendix 1 of the Supplementary Material.
DEFINITION 5. A region r
R(P) is minimal if there is no other region r'
R(P) such that r'
r.
EXAMPLE 3. In the context of Table 1, [p3.p6] is a region, but not a minimal one, because [p5.p6] is a closed subsequence of [p3.p6]. Note that ext([p3.p6]) = {o3,o4,o5}
ext([p5.p6]) = {o2,o3,o4,o5}.
Note that the minimal regions are the smallest elements of this lattice. The sequential organization of probes in the genome could be used to design an efficient algorithm for detecting minimal regions.
2.2 Computing minimal regions
This algorithm, MAR is based on a transformation of the context, provided that the computation of minimal zones does not require access to the extension of probes and requires only knowledge concerning changes of extension in the genome: breakpoints. This approach is conceptually similar to the traditional definition of minimally altered regions based on multiple alignments of alterations. In this case the region is the intersection of the aligned alterations and is therefore delimited by the breakpoints that narrow down the intersection the most.
DEFINITION 6. Given a context (O, P, M), we define a breakpoint as an index i, 1 < i < NP such that there is at least one observation o for which M(o, pi) = 0 and M(o, pi+1) = 1 or vice versa. Additionally, 1 is a breakpoint if M(o, p1) = 1 and NP is a breakpoint if M(o, pNe) = 1. If M(o, pi) = 0 and M(o, pi+1) = 1, i is an in-breakpoint; if M(o, pp) = 1 and M(o, pi+1) = 0, i is an out-breakpoint. Note that 1 can only be an in-breakpoint and NP an out-breakpoint. If b is a breakpoint, we will denote shift_in(b) as the set of all observations oi
O such that M(oi, b) = 0 and M(oi, b + 1) = 1 and shift_out(b) as the set of observations oi such that M(oi, b) = 1 and M(oi, b + 1) = 0.
Note that for a given index i, 1
i
NP can be an in- and an out-breakpoint simultaneously. Figures 2 and 3, in Appendix 1 of the Supplementary Material, illustrate the notions introduced in the definition. In the MAR algorithm, we will make use of the following:
THEOREM 1. A region r = [in.out] is a minimal region if and only if (1) in is an in-breakpoint and out is an out-breakpoint and (2) there is no breakpoint b such that in < b < out.
The proof of this Theorem 1 is provided in Appendix 1 of the Supplementary material. From this theorem, we can readily derive a linear algorithm for finding all minimal regions (Fig. 1). This algorithm clearly has complexity in O(No * Np) with NO the number of observations and NP the number of probes.
|
EXAMPLE 4. Lin = {p1, p3, p5, p7, p10, p11} and Lout = {p1, p3, p5, p6, p8, p10, p11}. The minimal regions of the context of Table 1 are [p1.p1], [p3.p3], [p5.p5], [p7.p8], [p10.p10], [p11.p11].
Note that if the genome studied consists of several chromosomes and if we set the constraint that a region does not overlap two chromosomes, the above algorithm will be iteratively applied to all chromosomes in the genome.
| 3 CONSTRAINED MINIMAL REGIONS |
|---|
|
|
|---|
The definition of gain/loss regions above may yield a large number of minimal regions (see Section 4), the manual validation of which may be time-consuming for biologists. Biologists may have expert knowledge about what constitutes a relevant alteration region that is much more useful than a frequency test. We therefore introduce the notion of a minimal constrained region, which extends the Definition 5 to regions that satisfy a particular combination of properties C = C1, ..., Cn.
DEFINITION 7. A region r is minimal for the conjunction of constraints C = C1, ..., Cn if and only if r satisfies each Ci, 1
i
n and there is no region r', r'
r such that r' satisfies C.
3.1 Constraintsproperties for use in the search for minimal regions
We have identified the following constraints as relevant for finding recurrent chromosomal regions of gain/loss. These constraints concern either the sequence or the extension of the region:
- Minimum/maximum frequency of the region in M.
- Minimum/maximum size of the region in number of probes.
- The regions extension contains/does not contain a given observation.
- The region is well bounded (see Definition 9).
The first three of these constraints are intuitive and have been extensively studied in the domain of data mining [see, among others De Raedt and Kramer (2001)]. Some of these constraints are anti-monotone with respect to set inclusion (
) and can be used to search the lattice efficiently for subsets of P (and of sequences of P) satisfying these constraints.
DEFINITION 8. A constraint Cam is anti-monotone with respect to
if, for all sets s and g such that g
s, if s satisfies Cam, then g satisfies it also.
EXAMPLE 5. Setting a minimum frequency or a maximum size for a region, or imposing that a particular observation belongs to the extension of a region are anti-monotone constraints.
These constraints can be used to search efficiently for constrained closed sequences, avoiding the exploration of parts of the search space that cannot contain solutions, based on current information collected during the search. For instance, if a set or sequence of probes does not satisfy an anti-monotone constraint C, there is no need to explore and evaluate its supersequences, because they will not satisfy C. In particular, if a sequence s is infrequent in a given context M, all supersequences of s are infrequent in M, and need not be evaluated.
Other properties of constraints may be useful for improving the efficiency of pattern search [e.g. monotone or convertible constraints (Ng et al., 1998)], but are not dealt with in this paper (see Supplementary Material for a discussion). For instance, our experience with CGH data analysis led us to use the following constraint, which is neither anti-monotone nor monotone, but is nonetheless essential for selecting relevant regions.
DEFINITION 9. Given a context (O, P, M), a region r = [in.out] is well bounded on the left given a fixed parameter b if and only if there are at least b observations oi in ext(r, M) such that M(oi, in 1) = 0 and M(oi, in) = 1. In other words, in is an in-breakpoint for at least b observations of the extension of r. Symmetrically, r is well bounded on the right if out is an out-breakpoint for at least b observations of ext(r, M). A region is well bounded if and only if it is well bounded on both the left and right.
EXAMPLE 6. Let us consider well bounded regions with b = 2. In the context of Table 1, [p5.p5] and [p7.p8] are minimal regions according to the Definition 5, but [p5.p5] is not well bounded on the right, whereas [p7.p8] is not well bounded on the left. [p3.p6] is a well bounded supersequence of [p5.p5], and there is no well bounded supersequence of [p7.p8].
Well boundedness is not anti-monotone, as demonstrated in the above example.
The above definition can be relaxed to the following definition, which is more suitable for our biological context.
DEFINITION 10. Given a context (O, P, M), a region r = [in.out] of extension e is well fuzzy bounded on the left given parameters b and m (m is referred to hereafter as the margin parameter) if and only if there are at least b observations of the extension of r that switch from 0 to 1 in the interval [(in m).in]. Formally, for all in-breakpoints i, in m
i
in, such that shift_in(i)
e
, |
i(shift_in(i)
e)|
b. The definition is symmetric for regions well fuzzy bounded on the right. A region r is well fuzzy bounded for parameters b and m if and only if it is both well fuzzy bounded on both the left and right for b and m.
This definition is illustrated in figure 3 of Appendix 1.
EXAMPLE 7. Given the bound b = 2, the smallest m for which [p5.p5] is well fuzzy bounded is m = 1. [p7.p8] is well fuzzy bounded for m = 2.
Computing constrained minimal regions (Fig. 2) is a more complex problem than computing minimal regions, as demonstrated by the Example 6. If the problem is defined exclusively in terms of anti-monotonic constraints, the MAR algorithm can be used to find all minimal regions, and those minimal regions satisfying the anti-monotonic constraints can then be selected. However, if non-anti-monotonic constraints are involved, a level-wise exploration (Mannila and Toivonen, 1997) of the R(P) lattice should be carried out, and this exploration should be as efficient as possible. In the following, we assume, without loss of generality, that the set of constraints on the solution regions can be split into AC, a conjunction of anti-monotone constraints with respect to
and OC, a conjunction of non-anti-monotone constraints for the problem.
|
CMAR searches R(P), the lattice of closed probe sequences breadth first. The first sequences it considers are minimal regions, as computed with the algorithm in Figure 1, because no smaller sequence of S(P) can be closed, according to Theorem 1. If a candidate region r satisfies all constraints of the problem (i.e. both AC and OC), then r is a solution. Regions that do not satisfy AC are stored to prune the search space (Fig. 3). If a region r satisfies AC but does not satisfy OC, it will be used to generate candidate regions for the next level.
|
When generating candidate minimal zones of level L + 1 (Fig. 3), the algorithm first generates all the smallest closed supersequences of regions that failed against OC at level l (Step 1 of Algorithm 3). It then checks that none of the resulting regions is either a superset of a smaller region already in CMR (minimality constraint) or of a region that failed against AC (anti-monotonicity of AC).
EXAMPLE 8. Given the context of Table 1 and its minimal regions for Example 4, we will search for all minimal regions that have both a minimal support of two (AC) and are delimited in at least two observations (OC). At level 1, [p3.p3] and [p10.p10] succeed against AC and OC, [p1.p1], [p5.p5] and [p7.p8] fail against OC and [p11.p11] fails against AC. [p1.p1] cannot be left extended, its right extension, [p1.p8], is a superset of [p3.p3], and should therefore be disregarded. The left extension of [p5.p5] is [p3.p6] and its right extension is [p5.p6]. The smallest of these two regions, [p5.p6] is not well delimited on the left. Finally, [p5.p6] cannot be left-extended any further without covering a minimal region. The result is thus {[p3.p3], [p10.p10]}. If OC is modified to be the region should be fuzzy delimited in at least two observations with margin m = 1, the above solution can be extended with [p5.p5].
THEOREM 2. The above algorithm is completeit generates all minimal closed sequences of P that satisfy the constraints of the problem.
The proof of this theorem is detailed in Appendix 1 of the Supplementary Material.
CMAR differs from algorithms that compute closed constrained itemsets in the context of biological constraints (Pang et al., 2003; Besson et al., 2005), because it handles sequential data. CMAR therefore searches the lattice of intervals of [1.NP], the size of which is NP(NP 1)/2, i.e. much smaller than the search space for itemsets, of size
. It therefore does not need to rely on the Galois connexion used in the other approaches, to search the power-set of observations, 2|O|, which in most applications1 is larger than
. The main difference between CMAR and state-of-the art sequence mining algorithms (Pei et al., 2002; Yan et al., 2003) is the type of sequences handled. The sequences CMAR handles are totally pre-aligned on a fixed set of probes spread throughout a given genome (here, the human genome), explaining why the algorithm has a quadratic worst-time complexity, whereas the other methods handle unaligned data streams. As a consequence, CMAR requires a simpler and more efficient partial ordering and fully exploits the characteristics of the handled sequences to generate candidate closed sequences efficiently (see Algorithm in Fig. 3). Finally, and unusually in the context of pattern mining (Mannila and Toivonen, 1997), this algorithm computes the most general (rather than the most specific) patterns satisfying the set of constraints.
Our approach can also be seen, from a different viewpoint, as a kind of biclustering algorithm (Madeira and Oliveira, 2004) for discrete data, with the user explicitly setting constraints concerning the shape of the 1-containing rectangles that he or she wishes to extract from the 01 context matrix (the height of the rectangle sets the frequency threshold, the closeness constraint ensures that this rectangle has maximal width for a given set of observations, etc.). The ordering of probes in the genome optimizes the efficiency of search for rectangles of 1s satisfying the constraints (Gionis et al., 2004).
3.2 Complexity
The algorithm enumerates closed sequences of S(P), starting from the smallest ones. The number of probe sequences, and therefore of regions, is finite [in the worst case,
], so the algorithm terminates. It stops when no candidate regions can be generated for a given level (i.e. all closed sequences of level l are either supersequences of a region of CMR or FailedAC). In other words, the complexity of the algorithm is in
, but it is efficient as AC prunes small sequences and small sequences satisfy all the constraints of the problem. A more detailed discussion of complexity issues with CMAR can be found in Appendix 2 of the Supplementary Material.
| 4 VALIDATION |
|---|
|
|
|---|
In this section, we validate the proposed algorithms by applying them to two different public datasets, containing CGH-array data for two kinds of tumour: colorectal tumours studied with BAC arrays (Nakao et al., 2004) and breast tumours studied with cDNA arrays (Pollack et al., 2002). These datasets have been handled as uniformly as possible: each dataset was first discretized, pre-processed and provided as input to the algorithms, which then computed the minimal (constrained) recurrent regions. Finally, genes were extracted from the obtained regions; the regions were visualised with VAMP software (http://bioinfo.curie.fr/projects/VAMP) and analysed manually.
Discretization was performed using the GLAD algorithm, as previously described (Hupe et al., 2004). The default parameters of the R function glad.R were used. The status (i.e. Gain, Normal, Loss) given by the Label assignment step is used as the input for the computation of minimal recurrent regions.
Missing values, which are frequent in microarray experiments (some spots and/or clones are discarded owing to poor quality), need not be preprocessed. During the minimal region computation step, unmeasured probes take the value of their neighbouring probes, as assigned by GLAD. They are therefore, by default, included in neighbouring regions.
GLAD automatically detects outliers, which are difficult to handle in the minimal region computation step: outliers may correspond to noise (e.g. mislocated probes, polymorphisms, etc.), or to highly valuable information (i.e. very narrow alteration regions). The taking into account of outliers during the minimal region computation step may yield very short (i.e. one-probe-long) regions, the statistical relevance of which may be difficult to evaluate. For this reason, many approaches simply ignore outliers and one-probe-long regions (Aguirre et al., 2004). We have implemented an outlier selection procedure (see details in Appendix 3 of the Supplementary Material) that makes use of the distributions of gain and loss outlier log2 ratios to select gain and loss outliers with significantly large (for gain outliers) or small (for loss outliers) log2 ratios. A similar strategy has been implemented in the CLAC approach (Wang et al., 2005). Outliers which are not selected are set as unmeasured.
All datasets were treated with both MAR and CMAR algorithms. MAR does not handle constraints and has no parameters that must be adjusted. The current version of CMAR has three such parameters: minimal frequency threshold, and the bound and margin parameters (see Definition 10). These parameters can be adjusted according to the characteristics of the dataset and we describe briefly here and more precisely in the appendices, the adjustment of these parameters for the datasets studied. First, a region r should have a minimum frequency of 10% in the dataset; second, r should be bounded on the left and right in at least two profiles. Note that these left and right delimiting profiles are not necessarily the same. These constraints are very permissive: the minimum frequency is low (i.e. much lower than the frequency used in most current approaches), making it possible to detect relevant regions with a low recurrence rate. Setting b to 2 ensures that a region is not delimited because of noise in a single profile, thereby increasing the biological relevance of the regions obtained.
Finally, as a means of setting the value for the last parameter, the margin m (see Definition 10), we have studied the distribution of distances between two consecutive breakpoints on the same chromosome, for both gain and loss regions, and both in and out breakpoints (see Appendices 4 and 5 of the Supplementary Material). Intuitively, the distance between two related in- or out-breakpoints (related in the sense that they correspond to the same region) should be smaller than the distance between two unrelated in or out breakpoints (see figure 3 in Appendix 1 of Supplementary Material). The left and right margins for computing gain and loss regions can be set to the n-th percentile of such distributions. Basically, increasing the margin for a given bound has the effect of both increasing the number of regions and decreasing the mean size of regions. We will discuss here the results for m equals the first quartile of breakpoint distance distributions, for both gain and loss regions. This seems to provide a good compromise between the size of the minimal regions obtained and the number of regions obtained. This value of m also gives very good results in terms of the number of known oncogenes and tumour suppressor genes occurring in the constrained minimal regions.
For both datasets and for the parameter setting described below, full lists of minimal regions, and associated genes for regions containing 20 or fewer genes, are provided in Appendices 7a and b of the Supplementary Material for the Nakao et al. (2004) dataset and Appendices 8a and b of the Supplementary Material for the Pollack et al. (2002) dataset.
4.1 Colon cancer dataset
The (Nakao et al., 2004) dataset describes 125 CGH profiles, generated with a resolution of 1.5 Mb, on a human array. Each sample is described in terms of 2120 clones, 2081 of which were selected after pre-processing. A summary of the computed minimal regions can be found in Table 2.
|
MAR computed 142 minimal gain regions from this dataset and 173 minimal loss regions. Based on predefined constraints, CMAR computed 121 minimal constrained regions, 55 gain regions and 66 loss regions. We found that 17% of the total number of human genes considered, as defined in Appendix 3 of the Supplementary Material, belonged to gained regions whereas 16% of these genes belonged to lost regions. All the regions identified by Nakao et al., 2004 were identified by this algorithm, including the regions on chromosomes 8p and 20q. The mean length of gain regions was 7 BACs and 61 genes. Loss regions were slightly smaller: 5 BACs and 46 genes. The size of the regions in BAC clones ranged from 1 to 61, with 85% of the regions containing no more than 10 BACs. Most of the oncogenes and tumour suppressor genes known to be involved in colorectal cancer are found in the minimal regions of alterations (Table 2). Serpin genes, which have been identified as potential tumour suppressor genes, are located in the frequent minimal region of loss GS-385N22.
4.2 Breast cancer dataset
We used the dataset described by Pollack et al., (2002), for which both mRNA and DNA copy numbers had been determined with cDNA arrays. This dataset describes 41 profiles, 4 cell lines and 37 tumours, originally described in terms of 6095 cDNA probes, including 5758 retained after pre-processing. The cDNA technology is less sensitive for the detection of losses (Bilke et al., 2005), and this dataset seems much more noisy than the colon cancer dataset described in section 4.1: before pre-processing, about 1500 minimal regions were identified in the dataset, >90% of which were one probe long. This high level of noise and the tendency of breast cancer tumours to display a high level of genetic rearrangement made it much more difficult to set the threshold for selecting outliers. We observed the distribution of outliers log2 ratios for both tumoural and normal additional profiles (denoted X0, XX, XXX, XXXX and XXXXX), which the authors initially used to assess the sensitivity of the cDNA technique (the details can be found in Appendix 5 of the Supplementary Material).
MAR computed 350 minimal gain regions and 302 minimal loss regions, whereas CMAR computed 71 altered regions, including 36 loss and 35 gain regions. These regions contained 6.4 and 1.6% of all the genes considered, respectively. The mean lengths of the regions of gain and loss were 1.6 and 9.8 cDNA probes, respectively, or 8.7 and 34 genes, respectively. Most of the gain regions identified by Pollack et al., (2002) or known to be involved in breast cancer are found in this list: I:773724 contains CCND1, I:825577.I:783729 contains ERBB2, a close but different region, I:236059 contains GRB7. The algorithm also identified regions containing RPS6KB1, NCOA3, ABC1, TP53. Clusterin, which has been identified both as a potential oncogene and as a tumour suppressor gene, is located in the frequently lost region, I:810358. PDCD4, a putative tumour suppressor gene, is located in the frequently lost region, I:328567. I:268258. Some of the regions frequently lost and gained seem to be fragmented, lying very close to one another (see in particular, the various regions on 8q24 and 17q, in Appendix 8a of the Supplementary Material). Some of these neighbouring alteration regions probably correspond to a single minimal region as these two regions are separated by a single or a small number of cDNAs. This would be consistent with the findings of most studies that the minimal regions of amplification on 17q1217q21 always contain both ErbB2 and GRB7.
| 5 DISCUSSION |
|---|
|
|
|---|
As datasets describing copy-number genomic alterations in sets of samples obtained from large-scale analyses become increasingly common, the need for adequate formalization and tools for analysing such discrete datasets is also increasing.
We propose here two algorithms dedicated to the computation of minimal recurrent alteration regions. The first computes all minimal regions observed in a set of discretized alteration profiles. We then introduced a set of constraints to increase the efficiency of selection of biologically relevant regions, generating a second algorithm designed to compute all the minimal constrained alteration regions. The identification of minimal regions is extremely important in the search for genes involved in tumour progression. If the minimal regions are small enough (i.e. do not contain too many genes), the genes located in these regions can be studied in more detail. The genes located in a region of loss can be screened for inactivating mutations in the remaining allele. The tumour biologist involved in this study established, by reviewing the literature, a list of the most common oncogenes and tumour suppressor genes (putative or proven) involved in breast and colon tumours, and most of these genes were found to be located in the minimal regions identified. Moreover, as expected, the status of the regions (gained or lost) was consistent with the supposed function of the genes involved: gained for the oncogenes and lost for the tumour suppressor genes.
Although many biological studies have handled minimal regions for cancer-related studies [e.g. (Veltman et al., 2003; Tonon et al., 2005; Schraders et al., 2005; de Leeuw et al., 2004; Veltman et al., 2005)], very few studies (Aguirre et al., 2004; Tonon et al., 2005; Diskin et al., 2005) have tried to formalize the notion of relevant recurrent minimal regions of alterations or the process for automatically computing such regions across a set of observations. Aguirre et al., 2004 and Tonon et al., 2005 have made the most sophisticated attempt to date to formalize the process for computing common alteration regions in sets of CGH profiles. They introduce a method that selects relevant alteration regions based on both smoothed log2-ratios and frequency in the data. However, this method seems to focus more on high-amplitude deviations for the definition of potentially interesting regions. An empirical comparison of this method with CMAR will become possible once Aguirre et al. (2004) make their code available.
In this paper, we have dealt with datasets obtained with a BAC arrays of
2100 probes and a cDNA array of
6000 probes. Comprehensive segmental copy number arrays covering the whole genome (Ishkanian et al., 2004) and oligonucleotide arrays (Lucito et al., 2003; Herr et al., 2005) have recently been developed. We checked the generality of our approach by applying CMAR to a dataset describing eight mantle cell lymphoma (MCL) cell line profiles, obtained with tiling BAC technology (de Leeuw et al., 2004). Each profile in this dataset is described in terms of 32 433 probes (each spotted in triplicate), making it possible to evaluate the scaling-up capabilities of CMAR. With the same parameters as in the publication, with the margin parameter set as for the two previous datasets (see Appendix 6 of the Supplementary Material), CMAR obtained the regions listed in Appendix 9a of the Supplementary Material. This informal comparison showed a good overlap with the regions obtained by de Leeuw et al. (2004).
The identification of minimal regions should make it possible to decrease considerably the number of variables associated with a given tumour. Rather than having to know the status of all the probes used in the array, copy number alterations can be coded as the status of the minimal regions only, reducing the complexity from 2000 to 6000 variables (in the examples we have studied) to a few tens or hundreds of variables. Machine learning or statistics techniques, which could not be applied efficiently to the initial CGH data, could be applied to the simplified dataset. We are currently extracting association rules relating combinations of alteration regions to biological (specific gene mutations) or anatomical/clinical attributes, such as the stage of tumours (Rouveirol and Radvanyi, 2005).
This work could be developed in many different directions. First, CMAR performed well with data that had a low signal-to-noise ratio. Performance may be poorer in the presence of high levels of noise or considerable sample rearrangement, as we observed that some minimal regions computed from the Pollack dataset seemed to be fragmented, possibly owing to noise. However, most of the important cancer-related genes were still found in the minimal region computed for this dataset. An additional parameter could be added to CMAR to merge these regions, as proposed by Aguirre et al., 2004. This would involve minor changes to the minimal regions obtained in this case, as 10% of these regions were contiguous. This also suggests that another type of constraint may be more suitable for coping with noisy data. One such constraint might involve the computation of chromosomal regions with a high density of alterations rather than fully altered in a set of observations.
Genomic alterations have recently been studied with arrays composed of 100 000 SNPs (Matsuzaki et al., 2004) or oligonucleotide arrays (Lucito et al., 2003; Herr et al., 2005). These arrays differ from the arrays considered in this study in providing datasets with a much larger number of attributes. The CMAR algorithm, as demonstrated by the first experiment conducted on a tiling array dataset (with
30 000 probes), should be easy to adapt to the determination of minimal regions of alteration in genomic data obtained with much denser arrays.
| Acknowledgments |
|---|
The authors are particularly grateful to Ch. Froidevaux for her constant support and H. Radvanyi for fruitful discussions and A.V. Salle for her support during experimentations. They thank the anonymous referees for their pertinent suggestions. The authors also thank Alex Edelman & Associates for careful reading of the manuscript. This work was initiated, partially supported as part of European project HKIS IST-2001-38153 and mostly carried out while the first author was seconded to the CNRS and working in the Molecular Oncology Group, UMR 144. This work was supported by the CNRS, the Institut Curie and the Ligue Nationale Contre le Cancer, Comité d'Ile de France (Laboratoire Associé).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
1If the number of observations to handle exceeds 20 or 30. ![]()
Received on June 16, 2005; revised on December 28, 2005; accepted on January 13, 2006
| REFERENCES |
|---|
|
|
|---|
Agrawal, R. and Srikant, R. (1994) Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, , San Francisco, CA Morgan Kaufmann, pp. 487499.
Aguirre, A.J., et al. (2004) High-resolution characterization of the pancreatic adenocarcinoma genome. Proc. Natl Acad. Sci. USA, 101, 90679072
Besson, J., et al. (2005) Constraint-based concept mining and its application to microarray data analysis. Intell. Data Anal, . 9, 5982.
Bilke, S. (2005) Detection of low level genomic alterations by comparative genomic hybridization based on cDNA micro-arrays. Bioinformatics, 217, 11381145.
de Leeuw, R.J. (2004) Comprehensive whole genome array CGH profiling of mantle cell lymphoma model genomes. Hum. Mol. Genet, . 13, 18271837
De Raedt, L. and Kramer, S. (2001) The level-wise version space algorithm and its application to molecular fragment finding. Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01) , San Francisco, CA Morgan Kaufmann, pp. 853859.
Diskin, S.J., Eck, T., Greshock, J., Mosse, Y.P., Naylor, T., Stoeckert, C.J., Weber, B.L., Maris, J.M., Grant, G.R. (2005) Statistical analysis of aCGH (STAC) a novel method for analysing multiple experiments. AACR 2005, , Anaheim, CA.
Fearon, E.R. and Vogelstein, B. (1990) A genetic model for colorectal tumorigenesis. Cell, . 61, 759767[CrossRef][Web of Science][Medline].
Ganter, B. and Wille, R. (1999) Formal Concept Analysis Mathematical Foundations. , Berlin Springer.
Gionis, A., Mannila, H., Seppänen, J. (2004) Geometric and combinatorial tiles in 0-1 data. Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) , Berlin Springer Verlag, pp. 173184.
Herr, A., et al. (2005) High-resolution analysis of chromosomal imbalances using the Affymetrix 10K SNP genotyping chip. Genomics, 85, 392400[CrossRef][Web of Science][Medline].
Hupé, P., et al. (2004) Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20, 34133422
Ishkanian, A.S., et al. (2004) A tiling resolution DNA microarray with complete coverage of the human genome. Nat. Genet, . 36, 299303[CrossRef][Web of Science][Medline].
Lucito, R., et al. (2003) Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res, . 13, 22912305
Madeira, S.C. and Oliveira, A.L. (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform, . 1, 2445.
Mannila, H. and Toivonen, H. (1997) Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Disc, . 1, 241258.
Margolin, A.A., et al. (2005) CGHAnalyzer: a stand-alone software package for cancer genome analysis using array-based DNA copy number data. Bioinformatics, 21, 33083311
Matsuzaki, H., et al. (2004) Genotyping over 100 000 SNPs on a pair of oligonucleotide arrays. Nat. Methods, 1, 109111[CrossRef][Web of Science][Medline].
Myers, C.L. (2005) Visualization-based discovery and analysis of genomic aberrations in microarray data. BMC Bioinformatics, 6, 146[CrossRef][Medline].
Nakao, K., et al. (2004) High resolution analysis of DNA copy number alterations in colorectal cancer by array-based comparative genomic hybridization. Carcinogenesis, 25, 13451357
Ng, R.T., Lakshmanan, V.S., Han, J., Pang, A. (1998) Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of the 1998 ACM SIGMOD International Conference Management of Data Seattle, WA, pp. 1324.
Pang, F., Cong, G., Tung, A., Yang, J., Zaki, M. (2003) Carpenter : Finding closed patterns in long biological datasets. Proceedings of the SIGKDD'03 ACM, pp. 637642.
Pasquier, N., et al. (1999) Efficient mining of association rules using closed itemset lattices. Inform. Syst, . 24, 2546.
Pei, J., Han, J., Wang, W. (2002) Mining sequential patterns with constraints in large databases. Proceedings of the 2002 ACM CIKM Conference (2002)Mac Lean, VA , pp. 1825.
Pinkel, D. and Albertson, D.G. (2005) Array comparative genomic hybridization and its applications in cancer. Nat. Genet, . 37, S1117.
Pinkel, D., et al. (1998) High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat. Genet, . 20, 207211[CrossRef][Web of Science][Medline].
Pollack, J.R., et al. (2002) Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl Acad. Sci, . 99, 1296312968
Rouveirol, C. and Radvanyi, F. (2005) Local pattern discovery in array-CGH data. In Morik, K., Boulicaut, J.F., Siebes, A. (Eds.). Local Pattern Detection, Internal Seminar. Dagstuhl Castle, Revised Selected Paper Springer, pp. 135152.
Schraders, M., et al. (2005) Novel chromosomal imbalances in mantle cell lymphoma detected by genome-wide array-based comparative genomic hybridization. Blood, 105, 16861693
Solinas-Toldo, S., et al. (1997) Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer, 20, 399407[CrossRef][Web of Science][Medline].
Tonon, G., et al. (2005) High-resolution genomic profiles of human lung cancer. Proc. Natl Acad. Sci, . 102, 96259630
Veltman, I., et al. (2005) Identification of recurrent chromosomal aberrations in germ cell tumors of neonates and infants using genomewide array-based comparative genomic hybridization. Genes Chromosomes Cancer, 367376.
Veltman, J.A., et al. (2003) Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Res, . 63, 28722880
Vogelstein, B. and Kinzler, K.W. (2004) Cancer genes and the pathways they control. Nat. Med, . 10, 789799[CrossRef][Web of Science][Medline].
Wang, P., et al. (2005) A method for calling gains and losses in array CGH data. Biostatistics, 6, 4558[Abstract].
Yan, X., Han, J., Afshar, R. (2003) Clospan: Mining closed sequential patterns in large datasets. Proceedings of the 2003 SIAM Data Mining Conference (SDM 2003)San Francisco, CA , pp. 166177.
This article has been cited by other articles:
![]() |
L. Y. Wu, H. A. Chipman, S. B. Bull, L. Briollais, and K. Wang A Bayesian segmentation approach to ascertain copy number variations at the population level Bioinformatics, July 1, 2009; 25(13): 1669 - 1679. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. P. Shah, K-J. Cheung Jr, N. A. Johnson, G. Alain, R. D. Gascoyne, D. E. Horsman, R. T. Ng, and K. P. Murphy Model-based clustering of array CGH data Bioinformatics, June 15, 2009; 25(12): i30 - i38. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Trolet, P. Hupe, I. Huon, I. Lebigot, C. Decraene, O. Delattre, X. Sastre-Garau, S. Saule, J.-P. Thiery, C. Plancher, et al. Genomic Profiling and Identification of High-Risk Uveal Melanoma by Array CGH Analysis of Primary Tumors and Liver Metastases Invest. Ophthalmol. Vis. Sci., June 1, 2009; 50(6): 2572 - 2580. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Andre, B. Job, P. Dessen, A. Tordai, S. Michiels, C. Liedtke, C. Richon, K. Yan, B. Wang, G. Vassal, et al. Molecular Characterization of Breast Cancer with High-Resolution Oligonucleotide Comparative Genomic Hybridization Array Clin. Cancer Res., January 15, 2009; 15(2): 441 - 451. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. van Doorn, M. S. van Kester, R. Dijkman, M. H. Vermeer, A. A. Mulder, K. Szuhai, J. Knijnenburg, J. M. Boer, R. Willemze, and C. P. Tensen Oncogenomic analysis of mycosis fungoides reveals major differences with Sezary syndrome Blood, January 1, 2009; 113(1): 127 - 136. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. I. Ferreira, J. F. Garcia, J. Suela, M. Mollejo, F. I. Camacho, A. Carro, S. Montes, M. A. Piris, and J. C. Cigudosa Comparative genome profiling across subtypes of low-grade B-cell lymphoma identifies type-specific and common aberrations that target genes with a role in B-cell neoplasia Haematologica, May 1, 2008; 93(5): 670 - 679. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. H. Vermeer, R. van Doorn, R. Dijkman, X. Mao, S. Whittaker, P. C. van Voorst Vader, M.-J. P. Gerritsen, M.-L. Geerts, S. Gellrich, O. Soderberg, et al. Novel and Highly Recurrent Chromosomal Alterations in Sezary Syndrome Cancer Res., April 15, 2008; 68(8): 2689 - 2698. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Klijn, H. Holstege, J. de Ridder, X. Liu, M. Reinders, J. Jonkers, and L. Wessels Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data Nucleic Acids Res., February 2, 2008; 36(2): e13 - e13. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. P. Shah, W. L. Lam, R. T. Ng, and K. P. Murphy Modeling recurrent DNA copy number alterations in array CGH data Bioinformatics, July 1, 2007; 23(13): i450 - i458. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Liu, S. Ranka, and T. Kahveci Markers improve clustering of CGH data Bioinformatics, February 15, 2007; 23(4): 450 - 457. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. J. Diskin, T. Eck, J. Greshock, Y. P. Mosse, T. Naylor, C. J. Stoeckert Jr., B. L. Weber, J. M. Maris, and G. R. Grant STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments Genome Res., September 1, 2006; 16(9): 1149 - 1158. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Liu, J. Mohammed, J. Carter, S. Ranka, T. Kahveci, and M. Baudis Distance-based clustering of CGH data Bioinformatics, August 15, 2006; 22(16): 1971 - 1978. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Liva, P. Hupe, P. Neuvial, I. Brito, E. Viara, P. L. Rosa, and E. Barillot CAPweb: a bioinformatics CGH array Analysis Platform. Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W477 - W481. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










