Bioinformatics Advance Access originally published online on February 5, 2008
Bioinformatics 2008 24(6):768-774; doi:10.1093/bioinformatics/btn048

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays

Guillem Rigaill 1,2,5,†, Philippe Hupé 1,2,3,5,*,†, Anna Almeida 4, Philippe La Rosa 1,2,5, Jean-Philippe Meyniel 4, Charles Decraene 3,4 and Emmanuel Barillot 1,2,5

1Institut Curie, Service de Bioinformatique, 2INSERM, U900, 3CNRS UMR144, 4Institut Curie, Translational Research Department, 26 rue d’Ulm, Paris F-75248 and 5Ecole des Mines de Paris, ParisTech, Fontainebleau, F-77300 France

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 DISCUSSION AND PERSPECTIVES
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Affymetrix SNP arrays can be used to measure the DNA copy number of 11 000–500 000 SNPs along the genome. Their high density facilitates the precise localization of genomic alterations and makes them a powerful tool for studies of cancers and copy number polymorphism. Like other microarray technologies, they are influenced by non-relevant sources of variation, which require correction. Moreover, the amplitude of variation induced by non-relevant effects is similar to or greater than that of the biologically relevant effect (i.e. true copy number), making it difficult to estimate non-relevant effects accurately without including the biologically relevant effect.

Results: We addressed this problem by developing ITALICS, a normalization method that estimates both biological and non-relevant effects in an alternate, iterative manner, accurately eliminating irrelevant effects. We compared our normalization method with other existing and available methods, and found that ITALICS outperformed these methods for several in-house datasets and one public dataset. These results were validated biologically by quantitative PCR.

Availability: The R package ITALICS (ITerative and Alternative normaLIzation and Copy number calling for affymetrix Snp arrays) has been submitted to Bioconductor.

Contact: italics@curie.fr

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
The development of high-throughput technologies, and of microarrays in particular, has made it possible to analyze DNA copy number throughout the entire genome, with ever-increasing resolution. Various techniques for detecting DNA copy number alterations are available (for a review, see Ylstra et al., 2006). Affymetrix SNP arrays, such as the Affymetrix GeneChip Human Mapping 100K Set (Kennedy et al., 2003), seem to be among the most widely used tools. These chips can be used for simultaneous genotyping and copy number determination for single nucleotide polymorphisms (SNPs), at high resolution. This technology has various uses, including studies of copy number variations in populations and the identification of genomic alterations in developmental genetics or cancer (for a review, see Pinkel and Albertson, 2005). In cancer studies, Affymetrix SNP arrays provide new insight into the mechanisms of tumor progression. They can be used to pinpoint new candidate tumor-suppressor genes (Liu et al., 2007) and oncogenes (thought to be present in loss and gain regions, respectively), and to classify tumors, improving diagnosis for new patients and the evaluation of prognosis.

Like all microarrays, Affymetrix SNP arrays are affected by systematic non-relevant sources of experimental variation. For accurate extraction of the biologically relevant effect (i.e. the true DNA copy number of each SNP in the genome, corresponding to the biological signal), the raw data must be corrected, taking these different effects into account. We present here a normalization algorithm for this purpose, which can be used for the simultaneous correction of different sources of experimental variation and biological signal estimation when trying to infer DNA copy number.

Several methods have already been developed for correcting non-relevant sources of variation. These methods include CNAG (Nannya et al., 2005), GIM (Komura et al., 2006) and CARAT (Huang et al., 2006). However, none of these methods takes into account the fact that the range of variation due to the non-relevant effects is similar to or greater than that due to the biologically relevant effect. The impacts of the biologically relevant effect and of the non-relevant effects may therefore easily be confused, and correct estimation of the non-relevant effects depends on correct estimation of copy number. We therefore propose an alternating, iterative method for estimating the biologically relevant effect and non-relevant effects, to improve biological signal estimation. We begin by briefly presenting Affymetrix SNP arrays. We then describe in detail our algorithm for data normalization (ITerative and Alternative normaLIzation and Copy number calling for affymetrix Snp arrays: ITALICS). We then present the results obtained with this algorithm, comparing them with those obtained with other algorithms. Finally, we discuss the advantages of ITALICS and possible improvements to this method.


    2 MATERIALS AND METHODS
 
2.1 Affymetrix SNP arrays
Technology: Affymetrix SNP arrays can be used to detect DNA copy number alterations at a resolution of 6–210 kb, using around 11 000–500 000 human SNPs. The Affymetrix GeneChip Human Mapping 100K and 500K Sets each comprise two arrays. Each array is based on a specific restriction enzyme: XbaI and HindIII for the 100K set and StyI and NspI for the 500K set. The Affymetrix 50K XbaI and HindIII arrays contain no common SNPs and their combination provides the DNA copy numbers of more than 115 000 SNPs.

Each allele of each SNP is represented by ni perfect match (PM) probes and ni mismatch (MM) probes. Reverse or forward probes may be used and these probes may be centered on the SNP position or offset by –4 to +4 base pairs. Thus, all the PM probes of an SNP allele have different DNA sequences. Probes are grouped into probe quartets of four probes: one PM and one MM probe for each of alleles A and B. All four probes have the same orientation and offset.

The Affymetrix SNP arrays assay is carried out as follows. Genomic DNA is digested with a restriction endonuclease. Adaptors are ligated to all fragments. These fragments are amplified by PCR and then fragmented, labeled with biotin and hybridized with the chip. The chip is then washed and scanned to generate the cell intensity file (.CEL) which is used as input to the proposed algorithm.

Hereafter, the raw signal Yi. of a given SNP i is given by:

$$Y_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}, \qquad Y_{ij} = Y_{ij}^{A} + Y_{ij}^{B}$$

where $Y_{ij}^{A}$ and $Y_{ij}^{B}$ are the log-intensities of the PM probes A and B of the j-th probe quartet for SNP i, and Yij is the sum of these PM log-intensities for the j-th quartet. Yi. is the mean PM log-intensity of the ni quartets for SNP i. MM probes are not taken into account in our algorithm. The two PM probes defining the entity Yij are referred to subsequently as a QuartetPM, the subscript i refers to SNP i, and the subscript j to one of its ni quartets.
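As an illustration of the definition above, the following minimal sketch computes the QuartetPM signals and the raw SNP signal for a single SNP. The function name, the input layout and the use of base-2 logarithms are assumptions made for this example; the text does not specify the logarithm base.

```python
import numpy as np

def raw_snp_signal(pm_a, pm_b):
    """Return (Y_ij, Y_i.) for one SNP.

    pm_a, pm_b : raw PM intensities of alleles A and B, one value per quartet.
    The base-2 logarithm is an assumption of this sketch.
    """
    pm_a = np.asarray(pm_a, dtype=float)
    pm_b = np.asarray(pm_b, dtype=float)
    # QuartetPM signal: sum of the two PM log-intensities of the quartet
    y_ij = np.log2(pm_a) + np.log2(pm_b)
    # Raw SNP signal: mean over the n_i quartets of the SNP
    y_i = y_ij.mean()
    return y_ij, y_i

# Toy SNP with three quartets (intensities chosen as powers of two)
y_ij, y_i = raw_snp_signal([1024, 2048, 512], [256, 1024, 2048])
```

MM probes do not appear anywhere in this computation, in line with the statement that they are not taken into account.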

Non-relevant sources of variation: ITALICS deals with known systematic sources of variation, such as the GC-content of the QuartetsPM (QGCij), the length of the PCR-amplified fragment (FLi) and the GC-content of the fragment amplified by PCR (FGCi) (Nannya et al., 2005; Komura et al., 2006). It also takes into account the QuartetPM effect (Qij), resulting from the systematically low intensity of some QuartetsPM and the systematically high intensity of others.

We also found that some Affymetrix SNP arrays suffer from spatial artifacts, as reported by Neuvial et al. (2006) for CGH array data. A spatial artifact is illustrated in Figure 1A: neighboring QuartetsPM on the chip present abnormal intensities. The corresponding SNPs appear as outliers in the genomic profile, as shown in Figure 1C, D and E, and should be removed. We have addressed this issue using a filtering criterion that makes it possible to discard bad probes, as described subsequently.


 
Fig. 1. Impact of spatial artifacts on genomic profiles. Image of an XbaI 100K Set chip (HF0844_Xba, Kotliarov et al. (2006)) before (A) and after normalization with ITALICS (B) (flagged QuartetsPM in white). The Yij value of each QuartetPM is represented, using a gradient from green to red. (C), (D), (E) and (F) are the genomic profiles normalized with CNAT 3.0, CNAG, GIM and ITALICS. Vertical dashed red lines represent the breakpoints detected with GLAD and the assigned statuses are indicated by a color code: green for loss, yellow for normal and red for gain. Two stains of abnormally high QuartetsPM values (in red) are visible in (A) and their corresponding SNP values correspond to outliers (colored in red) in the genomic profiles (C), (D) and (E), for which 1661, 1818 and 2331 outliers respectively, were detected. ITALICS flagged most of these QuartetsPM (B) but evaluated the signals for their SNPs using the QuartetsPM from the rest of the chip, resulting in the removal of only 13 of the 57 500 SNPs. ITALICS eventually identified only 88 outliers (F).

 
2.2 The ITALICS algorithm
Overview: In Affymetrix SNP arrays, non-relevant sources of variation (NonRelij) have comparable or greater influence on the raw signal variability than the biological signal (CopyNbi) (see Section 3.2 to compare the type III sum of squares of the different effects in a multiple linear model). We therefore propose an iterative, alternating normalization method that estimates the biological signal and the non-relevant effects in turn and thereby eliminates most of the non-relevant effects while preserving most of the biological information. During each iteration, ITALICS:
  1. Estimates the biological signal CopyNbi using the GLAD algorithm (Hupé et al., 2004),
  2. Estimates the non-relevant effects NonRelij on the raw data by multiple linear regression, assuming the biological signal to be known.

After the last iteration, the QuartetsPM for which multiple linear regression predicts the signal poorly are flagged. They correspond to QuartetsPM with abnormal values and are excluded from the final step, in which ITALICS uses GLAD to estimate the biological effect CopyNbi on the remaining normalized QuartetsPM. The algorithm is presented in more detail below.

Biological signal estimation (CopyNb_step): ITALICS applies the GLAD algorithm to Yi. values to estimate the biological signal. The GLAD algorithm segments the genomic profile, defining regions of homogeneous DNA copy number. For each of these regions, it provides a smoothing value and a status (gain, normal or loss). The smoothing value is the median of the Yi. values within the region concerned, and corresponds to the inferred copy number CopyNbi.
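The smoothing value can be illustrated with a short sketch. It does not implement GLAD: the breakpoints are taken as given, and the function name and arguments are hypothetical. It only shows how a per-region median yields the inferred copy number for each SNP.

```python
import numpy as np

def smoothing_values(y, breakpoints):
    """Assign to each SNP the median signal of its region.

    y           : per-SNP signals Y_i., ordered along the genome
    breakpoints : sorted indices at which a new region starts
                  (assumed to come from a segmentation such as GLAD)
    """
    y = np.asarray(y, dtype=float)
    edges = [0, *breakpoints, len(y)]
    smoothed = np.empty_like(y)
    for start, end in zip(edges[:-1], edges[1:]):
        # The median is robust to isolated outliers within the region
        smoothed[start:end] = np.median(y[start:end])
    return smoothed

# Two regions; the last SNP is an outlier in the second region
profile = [0.1, -0.1, 0.0, 1.1, 0.9, 1.0, 5.0]
copy_nb = smoothing_values(profile, breakpoints=[3])
```

Using the median rather than the mean keeps the outlying last value from dominating the level of the second region.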

Non-relevant effect estimation (NonRel_step): After estimating the biological effect CopyNbi, ITALICS infers the non-relevant effects by multiple linear regression. The model used is as follows:


Formula

with:


Formula

The multiple linear regression can also be expressed in classical matrix notation:


$$Y = X\theta + \varepsilon$$

with:


Formula

The parameter θ is estimated using the ordinary least-squares method. The degrees of the polynomial functions Pk were chosen using the BIC criterion (Schwarz, 1978) on a training data set of 128 reference diploid chips (Matsuzaki et al., 2004).
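The selection of a polynomial degree by BIC can be sketched as follows for a single covariate. This is an illustration under stated assumptions, not the ITALICS implementation: the function names are hypothetical, the data are synthetic, and the BIC is written in its usual Gaussian form, n ln(RSS/n) + k ln(n), up to an additive constant.

```python
import numpy as np

def fit_polynomial_ols(x, y, degree):
    """Ordinary least-squares fit of a polynomial of the given degree."""
    X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ theta) ** 2))
    return theta, rss

def bic(rss, n_obs, n_params):
    """Gaussian BIC up to an additive constant."""
    return n_obs * np.log(rss / n_obs) + n_params * np.log(n_obs)

def choose_degree(x, y, max_degree=5):
    """Return the degree with the lowest BIC and all BIC scores."""
    scores = {}
    for degree in range(1, max_degree + 1):
        _, rss = fit_polynomial_ols(x, y, degree)
        scores[degree] = bic(rss, len(y), degree + 1)
    return min(scores, key=scores.get), scores

# Synthetic covariate with a quadratic effect on the signal
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.1, size=500)
best_degree, scores = choose_degree(x, y)
```

On data of this kind a linear fit is heavily penalized through its residual sum of squares, whereas degrees beyond the true one are penalized through the ln(n) term.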

The QuartetPM effect is dealt with by calculating Qij as the mean of each QuartetPM on the 64 female chips of the same Affymetrix reference data set (Matsuzaki et al., 2004).

Once the non-relevant effects have been estimated, the Yij values are corrected as follows:

$$\tilde{Y}_{ij} = Y_{ij} - \widehat{NonRel}_{ij}$$

where $\widehat{NonRel}_{ij}$ corresponds to the estimate of non-relevant effects based on multiple linear regression. The corrected $\tilde{Y}_{ij}$ is used in the next step of the GLAD procedure, to re-estimate the biological effect. These two steps are repeated until the number of iterations reaches the predetermined value itermax.
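The alternation between the two steps can be sketched as below. Several points are assumptions of this example rather than details given in the text: the segmentation is a toy stand-in for GLAD with a known breakpoint, the estimated copy number enters the regression as a fixed offset subtracted from the raw signal, and a single linear covariate replaces the polynomial terms of the model.

```python
import numpy as np

def alternate_normalization(y, snp_index, covariates, segment, itermax=2):
    """Alternate copy-number and non-relevant effect estimation.

    y          : (n_quartets,) raw QuartetPM signals Y_ij
    snp_index  : (n_quartets,) SNP index of each QuartetPM, SNPs in genome order
    covariates : (n_quartets, p) design matrix of non-relevant covariates,
                 including an intercept column
    segment    : callable mapping per-SNP signals to per-SNP copy-number
                 estimates (hypothetical stand-in for GLAD)
    """
    y = np.asarray(y, dtype=float)
    n_snps = int(snp_index.max()) + 1
    counts = np.bincount(snp_index, minlength=n_snps)

    def snp_means(values):
        return np.bincount(snp_index, weights=values, minlength=n_snps) / counts

    corrected = y.copy()
    for _ in range(itermax):
        # CopyNb_step: estimate the biological signal on the current data
        copy_nb = segment(snp_means(corrected))
        # NonRel_step: regress the raw signal, minus the copy number
        # treated as known, on the non-relevant covariates
        theta, *_ = np.linalg.lstsq(covariates, y - copy_nb[snp_index], rcond=None)
        # Correct the raw signal with the estimated non-relevant effects
        corrected = y - covariates @ theta
    return corrected, segment(snp_means(corrected))

def toy_segment(signal, breakpoint=100):
    """Two regions split at a known breakpoint, each summarized by its median."""
    out = np.empty_like(signal)
    out[:breakpoint] = np.median(signal[:breakpoint])
    out[breakpoint:] = np.median(signal[breakpoint:])
    return out

# Synthetic chip: 200 SNPs with 4 QuartetsPM each, one gain of amplitude 1
rng = np.random.default_rng(0)
snp_index = np.repeat(np.arange(200), 4)
true_copy_nb = np.where(snp_index < 100, 0.0, 1.0)
gc = rng.uniform(0.0, 1.0, size=snp_index.size)
y = true_copy_nb + 2.0 * gc + rng.normal(0.0, 0.05, size=snp_index.size)
covariates = np.column_stack([np.ones_like(gc), gc])
corrected, copy_nb = alternate_normalization(y, snp_index, covariates, toy_segment)
```

In this toy setting the covariate accounts for more of the raw variability than the copy-number step, which is the situation that motivates estimating the two kinds of effect in turn.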

ITALICS uses GLAD; we therefore investigated whether the normalization was influenced by the choice of GLAD parameters. In the Supplementary information, we give guidelines for choosing parameters and present a sensitivity analysis showing that ITALICS is robust to parameter settings.

Elimination of poorly predicted QuartetsPM: After the last iteration, QuartetsPM whose Yij values are poorly predicted by multiple linear regression are flagged. This is achieved by calculating the 95% prediction interval; all Yij outside this interval are flagged. SNPs with fewer than three non-flagged QuartetsPM out of a total of ni are then discarded. If at least three Yij are not flagged, the corrected SNP signal is recalculated as:

$$\tilde{Y}_{i.} = \frac{1}{n_i - NbF_i} \sum_{j \notin F_i} \tilde{Y}_{ij}$$

with Fi the set of flagged QuartetsPM for SNP i, NbFi the number of flagged QuartetsPM for SNP i and $\tilde{Y}_{ij}$ the corrected QuartetPM signal.
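The flagging rule and the recalculation of the SNP signal can be sketched as follows. The prediction interval is approximated here by a normal quantile times the residual standard deviation, which ignores the leverage term and is only reasonable for the very large number of QuartetsPM on a chip; this simplification, like the function names, is an assumption of the example.

```python
import numpy as np
from statistics import NormalDist

def flag_quartets(y, fitted, n_params, level=0.95):
    """Flag QuartetsPM lying outside an approximate prediction interval."""
    resid = np.asarray(y, dtype=float) - np.asarray(fitted, dtype=float)
    sigma = np.sqrt(np.sum(resid ** 2) / (resid.size - n_params))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return np.abs(resid) > z * sigma

def snp_values(corrected, flagged, snp_index, min_quartets=3):
    """Mean of non-flagged corrected signals per SNP.

    SNPs with fewer than min_quartets non-flagged QuartetsPM are discarded
    (value set to NaN, kept set to False).
    """
    n_snps = int(snp_index.max()) + 1
    keep = (~flagged).astype(float)
    n_kept = np.bincount(snp_index, weights=keep, minlength=n_snps)
    sums = np.bincount(snp_index, weights=keep * corrected, minlength=n_snps)
    values = np.full(n_snps, np.nan)
    kept = n_kept >= min_quartets
    values[kept] = sums[kept] / n_kept[kept]
    return values, kept

# One aberrant QuartetPM among 101
resid = np.r_[np.tile([0.1, -0.1], 50), 5.0]
fitted = np.zeros(resid.size)
flagged = flag_quartets(fitted + resid, fitted, n_params=1)

# Two SNPs with four QuartetsPM each; the second keeps only two of them
toy_corrected = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 9.0, 9.0])
toy_flagged = np.array([False, False, False, True, False, False, True, True])
toy_index = np.array([0, 0, 0, 0, 1, 1, 1, 1])
values, kept = snp_values(toy_corrected, toy_flagged, toy_index)
```

A SNP is discarded only when too few of its QuartetsPM survive, so a localized artifact usually removes individual QuartetsPM rather than whole SNPs.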

Data scaling: The data are scaled to allow between-chip comparison. After the first GLAD step, the biological signal is subtracted and the standard deviation s of (Yi. − CopyNbi) is calculated for each chip using all SNPs i of the chip. The data are then scaled as follows:


Formula

The ITALICS procedure is summarized in Table 1.



 
Table 1. ITALICS algorithm overview

 
2.3 Comparison with other methods
Other methods: Several other methods have already been developed. Most use linear regression to estimate and correct for non-relevant effects. They differ in the effects taken into account and in their pre- and post-processing steps.

CNAG: Copy Number Analysis for GeneChip (Nannya et al., 2005). CNAG corrects the raw signal intensity of a sample, by introducing the notion of averaged best fit, corresponding to a pseudochip constructed from the five samples most similar to the reference samples. CNAG subtracts this averaged best fit from the raw signal and then corrects for the length of the PCR-amplified fragment and GC-content effects by linear regression. This method is available within CNAG 2.0 and is also used in CNAT 4.0 (Copy Number Analysis Tool, see below).

CNAT 3.0: Chromosome Copy Number Analysis Tool 3.0. Affymetrix developed this method for the extraction of DNA copy number. No specific step for the correction of non-relevant effects is included. This method uses samples with varying chromosome X copy number for intensity calibration and transforms SNP intensity into copy number values.

CNAT 4.0: Chromosome Copy Number Analysis Tool 4.0. This tool uses CNAG to normalize the data and then smoothes the data with a user-defined window. This step artificially reduces the variance of the data and visibly improves the quality of the profile.

CARAT: Copy Number Analysis with Regression And Tree (Huang et al., 2006). CARAT uses a reference data set to select probes showing a high-allelic response and to remove those with no such response. For each new sample, it first standardizes the probe signal, based on mismatch probe information. It then corrects for probe GC-content and PCR fragment length effects, by linear regression. Finally, each SNP intensity is regressed against the average intensity of the reference samples with the same genotype.

GIM: Genomic Imbalance Map (Komura et al., 2006). GIM roughly estimates the biological effect and subtracts it from the raw signal, using a simpler version of ChARM (Myers et al., 2004). It removes defective probes with a high local GC-content and then re-estimates the biological effect without using the defective probes and subtracts this effect from the raw signal. It takes into account probe GC-content, the length of the PCR-amplified fragment and its GC-content, and mean SNP intensity for the reference dataset, by linear regression. GIM is implemented in Matlab and is freely available.

We compared ITALICS with CNAG, CNAT 3.0 and GIM. We did not compare ITALICS with CARAT, because no software was available for CARAT at the time of the study, or with CNAT 4.0, which presents no improvement over CNAG. For the CNAG, CNAT 3.0 and GIM genomic profiles, copy number and the status of the genomic regions were inferred with the GLAD algorithm, using the same parameters as for the ITALICS algorithm.

Quality criteria: As described by Neuvial et al. (2006), we used several quality criteria to compare the various normalization algorithms.

As defined by Neuvial et al. (2006), the dyn criterion estimates the dynamics of the DNA copy number signal. Its value is:


Formula

with G and N the regions considered to correspond to Gain and Normal and Formula the corrected signal of SNP i using the normalization method a. Formula for ordered Formula throughout the genome. smt quantifies the smoothness of the signal over the genome, and dyn assesses the dynamics of the signal, as defined by the signal-to-noise ratio (SNR). If no gain region has been identified, the dyn criterion is computed over loss regions. A good normalization method should give a high dyn.

The criterion out is the number of outliers detected by GLAD. GLAD defines regions of homogeneous DNA copy number and outliers are SNPs with values different from those of other SNPs in the same region. These abnormal values may be accounted for by point mutations in the genome. However, a large number of such changes is unlikely, so the total number of outliers should be relatively low and the out parameter close to zero.

The criterion flag is the number of flagged SNPs. We introduced this criterion for the comparison of methods that remove SNPs, such as GIM and ITALICS. These methods may artificially improve the quality of the signal (as measured by dyn and out), by removing SNPs with abnormal behavior. The number of flagged SNPs should, therefore, not be too high. When faced with a choice between two methods with equal SNR, the method with the lowest flag should be preferred.

Comparison of two normalization methods: These three criteria can be used to determine which of two normalization methods gives the better results for a given array. In this pairwise comparison context, dyn must be calculated with the same definition of gain, normal and loss regions for both normalized arrays. We therefore define consensus gain, normal and loss regions associated with an array processed with two different normalization methods, as the intersection of the two corresponding gain, normal and loss regions obtained with the two different normalization methods [see also Neuvial et al. (2006) for details].
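The consensus regions can be sketched as an intersection of per-SNP status calls. The representation used here (one status label per SNP for each normalization method) and the function name are assumptions made for illustration.

```python
import numpy as np

def consensus_regions(status_a, status_b):
    """Indices of SNPs given the same status by both normalization methods.

    status_a, status_b : per-SNP labels ("gain", "normal" or "loss") obtained
    after normalization with methods a and b, respectively.
    """
    status_a = np.asarray(status_a)
    status_b = np.asarray(status_b)
    agree = status_a == status_b
    return {
        label: np.flatnonzero(agree & (status_a == label))
        for label in ("gain", "normal", "loss")
    }

regions = consensus_regions(
    ["gain", "gain", "normal", "loss", "normal"],
    ["gain", "normal", "normal", "loss", "loss"],
)
```

SNPs on which the two methods disagree belong to no consensus region and therefore do not enter the calculation of dyn for either method.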

For the comparison of two different methods, a and b, in terms of a certain criterion, we calculate relative performances as follows:


Formula

RP measures the percentage improvement observed with method a, with respect to method b. The minus signs for the out and flag criteria ensure that a positive RPcri(a,b) always means that method a is better than method b for criterion cri.

2.4 Datasets
We carried out our study on two public datasets: a dataset for 128 reference diploid chips (Matsuzaki et al., 2004) and a glioma dataset corresponding to 356 chips (Kotliarov et al., 2006). We also used datasets produced by the Affymetrix platform of the Institut Curie obtained with 22 uveal melanoma samples, 40 ovarian cancer samples and 26 breast cancer samples.


    3 RESULTS
 
3.1 Choosing the number of iterations
We assessed the extent to which each iteration within the ITALICS algorithm improved the SNR, by calculating the dyn criterion for different values of itermax (0, 1, 2, 3 and 5) for each chip of the 356-chip glioma dataset. The percentage improvement RPdyn for different values of itermax (1, 2, 3 and 5) with respect to no iteration was then calculated (Fig. 2). One iteration gave 53.8% improvement, two gave 56.1% improvement and three and five gave 56.3% improvement. As the third and subsequent iterations gave only a very slight improvement, we set itermax to two in the ITALICS algorithm.


 
Fig. 2. Improvement in SNR with the number of ITALICS iterations. The improvement in SNR obtained with each iteration was assessed by calculating the percentage improvement RPdyn for 1, 2, 3, and 5 iterations with respect to no iterations. The results are summarized in this graph, showing RPdyn as a function of the number of iterations. The SNR improved with the first two iterations, with no major improvement observed for subsequent iterations.

 
3.2 Importance of each effect on the signal
For each chip of the glioma dataset, we calculated the type III sum of squares for each effect in our multiple linear regression model. A low type III sum of squares indicates that the difference between the full model and the model excluding the studied effect is very small. The QuartetsPM effect gave the highest type III sum of squares, with a mean of 550 × 10³ versus 10.4 × 10³, 16 × 10³ and 14 × 10³ for QuartetsPM GC-content, fragment length and fragment GC-content, respectively. The biological effect was the second most important effect, with a mean of 24 × 10³.
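For a model containing only main effects, the type III sum of squares of an effect equals the increase in residual sum of squares when that effect is dropped from the full model. The sketch below computes it in this way on synthetic data; the function names and the two toy effects are hypothetical and do not correspond to the covariates of the ITALICS model.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the ordinary least-squares fit of y on X."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ theta) ** 2))

def type3_ss(blocks, y):
    """Type III sum of squares of each effect in a main-effects model.

    blocks : dict mapping an effect name to its (n, k) block of columns;
             an intercept is always included.
    """
    intercept = np.ones((len(y), 1))
    full = rss(np.hstack([intercept, *blocks.values()]), y)
    result = {}
    for name in blocks:
        others = [b for key, b in blocks.items() if key != name]
        # Extra residual variation when the effect is left out
        result[name] = rss(np.hstack([intercept, *others]), y) - full
    return result

# Synthetic signal driven mostly by one of two effects
rng = np.random.default_rng(1)
n = 300
strong = rng.normal(size=n)
weak = rng.normal(size=n)
y = 3.0 * strong + 0.1 * weak + rng.normal(size=n)
ss = type3_ss({"strong": strong[:, None], "weak": weak[:, None]}, y)
```

An effect whose removal barely changes the fit receives a small value, which is how the relative importance of the effects is read in this section.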

3.3 ITALICS outperformed the other methods
We calculated dyn and out with ITALICS, GIM, CNAT 3.0 and CNAG, using three different cancer datasets: two in-house datasets corresponding to 22 choroidal melanoma chips and 40 ovarian cancer chips and one public data set of 356 glioma chips. All methods were used with their default parameters.

We calculated the percentage improvement (RP) for CNAT 3.0, CNAG and GIM, in terms of dyn and out, with respect to ITALICS (Fig. 3). For each of the three competitors, we calculated RPcri(competitor, ITALICS) and performed t-tests to assess the significance of the improvement. We found that ITALICS outperformed CNAT 3.0, CNAG and GIM, in terms of dyn and out, with t-test P-values below 10⁻⁵ for all three data sets. For GIM, RPdyn ranged from –10.9% to –6.5%, for CNAG, it ranged from –23.9% to –16.0% and for CNAT 3.0 it ranged from –33.4% to –26.0%. RPout ranged from –98.1% to –89.0% for all three methods. Chip data normalized with ITALICS therefore had a significantly better SNR than those normalized with CNAT 3.0, CNAG and GIM, with fewer outliers.


 
Fig. 3. Comparison of ITALICS with other normalization methods. We compared ITALICS with CNAT 3.0, CNAG and GIM for two quality criteria—dyn and out—using three different cancer datasets: two in-house data sets corresponding to 22 choroidal melanoma chips and 40 ovarian cancer chips and one public dataset corresponding to 356 glioma chips (Kotliarov et al., 2006). Each color corresponds to the comparison of ITALICS with a different method or data set. ITALICS is taken as the reference [red point 0 at (0, 0)]. For each method, the cross indicates the mean relative performance on the data set concerned, for the dyn and out criteria, and the lines give the corresponding 95% quantile for relative performance. ITALICS significantly outperforms all methods for both quality criteria, dyn and out.

 
Both ITALICS and GIM flag certain SNPs for elimination. The improvement in SNR obtained with these methods may therefore be partially due to the mechanical effect of this removal. We compared the number of SNPs flagged between GIM and ITALICS and found that ITALICS flagged significantly fewer SNPs than GIM, with a mean of 300 SNPs per chip for ITALICS versus 3000 for GIM. The RPflag(GIM,ITALICS) is –90%.

3.4 Spatial artifact correction
Some Affymetrix SNP arrays suffer from spatial artifacts. The step flagging poorly predicted QuartetsPM removes most QuartetsPM with abnormal intensity detected by visual inspection, as shown in Figures 1A and B. To our knowledge, ITALICS is the only method capable of doing this. Moreover, the removal of these abnormal QuartetsPM increases the quality of the signal, by removing many outliers from the genomic profile: 1661, 1818 and 2331 outliers were detected for CNAT 3.0, CNAG and GIM, respectively (Figure 1C, D and E). With ITALICS, there were only 88 outliers (Figure 1F), and only 13 of the 56 000 SNPs were removed because they had fewer than three non-flagged QuartetsPM.

3.5 Biological validation
Quantitative PCR validation: We used QPCR (see Supplementary Material for more detail) to validate our method with a different technology. As a test case, we used a set of paired breast cancer samples (primary tumor and relapse, Bollet et al., 2008) and tried to identify a breakpoint in chromosome 20. We compared the results obtained with QPCR with those obtained with ITALICS, CNAG, GIM and CNAT, for the XbaI and HindIII arrays. We also carried out QPCR on two breast cancer tumors, each with a normal chromosome 20 (white and striped bars in Fig. 4) to assess noise for QPCR and to validate the significance of copy number change. As shown in Figure 4, ITALICS was more accurate than CNAG, GIM and CNAT 3.0 for comparisons of copy numbers, based on the estimates obtained with QPCR. ITALICS, CNAG, GIM and CNAT 3.0 detected changes in copy number in this region of chromosome 20. However, ITALICS breakpoints were closer to QPCR breakpoints than CNAT breakpoints (see Fig. 4A, C and D) and CNAG and GIM breakpoints (see Figure 4A). In Figure 4A, QPCR and ITALICS breakpoints are found at identical positions (between P14 and P15). In Figure 4C and D, CNAG, GIM and ITALICS detect a copy number change between P12 and P13, close to that detected by QPCR between P14 and P15, whereas CNAT detects this breakpoint further away, between P06 and P07 in Figure 4C and between P08 and P09 in Figure 4D. In Figure 4B, QPCR, CNAT, GIM, CNAG and ITALICS found the same breakpoint.


 
Fig. 4. Affymetrix SNP arrays and QPCR DNA copy number profiles for a patient with breast cancer relapse. CNAT 3.0 (dashed line) and ITALICS (solid line) DNA copy number determination along chromosome 20, from position 17453432 (P01) to position 49386812 (P22), for the primary tumor (A, C) and the relapse (B, D) using the HindIII (C, D) and XbaI (A, B) Affymetrix SNP arrays. CNAG and GIM results are identical to CNAT for (A) and identical to ITALICS for (B, C and D). We performed QPCR on two breast cancer tumors with a normal chromosome 20, to estimate the noise associated with QPCR and to validate the significance of copy number change. The bar charts show the QPCR estimation of DNA copy number in two breast cancer tissues with a normal chromosome 20 (white and striped bars, A, B, C and D), the primary breast tumor (black bars, A and C) and the corresponding relapse (black bars, B and D). In (A), both ITALICS and QPCR detect a copy number change between P14 and P15, whereas GIM, CNAG and CNAT detect a change between P21 and P22. In (C) and (D), ITALICS detects a copy number change between P12 and P13, close to that detected by QPCR between P14 and P15, whereas CNAT detects a breakpoint further away, between P06 and P07 in (C), and between P08 and P09 in (D). In (B), QPCR, CNAT and ITALICS found the same breakpoints.

 
Patients with breast cancer relapses: The problem tackled was determining whether the second cancer was a true recurrence of the first cancer or a new primary tumor, based on the two Affymetrix SNP array profiles (Bollet et al., 2008). We tried to identify common breakpoints between the cancer chips for the two tumors. The breakpoints detected with CNAT 3.0 or ITALICS normalization are represented in Figure 5A and B for chromosomes 6 and 9, respectively, for one patient. GIM and CNAG results are similar to ITALICS for chromosome 6 and similar to CNAT for chromosome 9 (data not shown). ITALICS identified breakpoints at identical locations for both cancers, and this is true for the two chromosomes presented in Figure 5A and B. Notably, this was not possible with CNAT 3.0, CNAG and GIM. The precise match between the breakpoints mapped in the two cancers with ITALICS suggests that the second cancer is a true recurrence, whereas the opposite conclusion would have been drawn with CNAT 3.0. As CNAG and GIM detect less precise matches, they lead to the same conclusion as ITALICS, but the evidence for this conclusion is weaker. Expert assessment based on clinical data also indicated that this was a true recurrence, and was therefore consistent with the results obtained with ITALICS. Similar conclusions were drawn for the rest of the data set (13 first and second cancer pairs). Thus, ITALICS improves the classification of true recurrences and new primary tumors.


 
Fig. 5. Detection of breakpoints common to first and second cancers, using ITALICS. We present part of the chromosome 6 (A) and 9 (B) profiles obtained with VAMP (La Rosa et al., 2006) for a patient with two breast tumors. For both (A) and (B), the first two profiles are CNAT 3.0 profiles of the first and second cancers and the last two profiles are ITALICS profiles of the first and second cancers. GIM and CNAG results are similar to ITALICS for chromosome 6 and similar to CNAT for chromosome 9 (data not shown). CNAT 3.0 identified no breakpoints (red dashed lines) common to the two cancers, whereas ITALICS did (red arrows), strongly suggesting that the second cancer was a true recurrence. Moreover, the results obtained with ITALICS are supported by an expert classification based on clinical data.

 

    4 DISCUSSION AND PERSPECTIVES
 
We present here a new method for normalizing Affymetrix SNP arrays: ITALICS. This method is highly efficient and outperforms other normalization methods, such as CNAT 3.0, CNAG and GIM, in terms of SNR, giving a more accurate localization of breakpoints validated by QPCR. This improvement may be due to various features of the ITALICS algorithm. This algorithm estimates non-relevant and biologically relevant effects alternately and iteratively. The correct estimation of non-relevant effects depends on correct estimation of the biological signal and vice versa, as the non-relevant effects induce ranges of variation similar to or greater than that of the biologically relevant effect. By estimating both the non-relevant and biologically relevant effects in an iterative manner, we avoid overestimation of the non-relevant effects and a loss of biological signal. The first estimation on raw data is necessarily rough, but improves the subsequent estimation of non-relevant effects. Each new estimation of the biological or non-relevant effects leads to a better estimation of the other effects. In practice we iterate our algorithm twice, as additional iterations were found to lead to no significant improvement in the SNR. This algorithm also includes a flagging step, making it possible to remove aberrant SNPs. Indeed, some PM intensity values are subject to spatial artifacts. The PM intensity of their QuartetsPM is therefore abnormal, poorly predicted by the regression model and flagged. The discarding of poorly predicted QuartetsPM does not necessarily lead to the discarding of the corresponding SNP, provided that enough QuartetsPM remain elsewhere on the chip. As a result, very few SNPs are removed from the final genomic profile. This filtering step detects spatial artifacts only indirectly, but nevertheless gives good results in practice.
Methods for the precise detection of spatial artifacts and the removal of all probes within spatial artifacts have already been developed (Neuvial et al., 2006). However, they cannot be applied directly to SNP chips because of the very high density of these chips (more than 2 million probes per chip). Computing the QuartetsPM effect on an in-house reference dataset would certainly improve the quality of the normalization. Nevertheless, the QuartetsPM effect is the most important effect, and ignoring it would decrease the efficiency of the normalization.
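
The flagging step discussed above can be sketched in the same spirit. This is only an illustration under stated assumptions: the residuals are taken to come from the regression model, the cut-off (a multiple `k` of a robust standard deviation) and the minimum number of remaining quartets `min_quartets` are hypothetical parameters, and the function name is invented.

```python
import numpy as np


def flag_quartets(residuals, snp_ids, k=3.0, min_quartets=2):
    """Flag poorly predicted quartets and decide which SNPs survive.

    residuals    : (n,) regression residuals, one per quartet
    snp_ids      : (n,) identifier of the SNP each quartet belongs to
    k            : hypothetical cut-off, in robust standard deviations
    min_quartets : hypothetical minimum of unflagged quartets per SNP
    """
    residuals = np.asarray(residuals, dtype=float)
    snp_ids = np.asarray(snp_ids)
    centre = np.median(residuals)
    # Robust scale: median absolute deviation rescaled to match the
    # standard deviation under a Gaussian model.
    scale = 1.4826 * np.median(np.abs(residuals - centre))
    if scale == 0:
        # Degenerate case: no spread to compare against, flag nothing.
        flagged = np.zeros(len(residuals), dtype=bool)
    else:
        flagged = np.abs(residuals - centre) > k * scale
    kept = {}
    for snp in np.unique(snp_ids):
        remaining = int(np.sum((snp_ids == snp) & ~flagged))
        # A SNP is kept as long as enough of its quartets are unflagged.
        kept[snp.item()] = remaining >= min_quartets
    return flagged, kept
```

With these assumptions, a SNP that loses one quartet to an artifact is still retained, whereas a SNP is dropped only when too few of its quartets remain, which is consistent with very few SNPs being removed from the final profile.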

We normalized XbaI and HindIII chips separately. The same major changes were detected with both chips. However, it is difficult to merge XbaI and HindIII data due to the difference in signal amplitude for consecutive alterations between the two chips. The merging of the XbaI and HindIII genomic profiles would result in a higher resolution profile, but also in a lower SNR. The ITALICS algorithm could be improved by taking into account the enzyme effect (XbaI and HindIII) to overcome this problem.

Technically, the ITALICS algorithm could be applied to higher density chips, such as the Affymetrix GeneChip Human Mapping 500K Set and even the Genome Wide SNP array 5.0 and 6.0, which do not have MM probes, as ITALICS is based solely on PM probes. Of course, we would have to check whether the non-relevant effects in our model are also observed with these higher density chips. We would also need to obtain a reference dataset for calculating the quartet effect.


    5 CONCLUSION
 
We developed ITALICS, a new normalization algorithm for Affymetrix SNP arrays. This method was designed for the normalization and analysis of DNA copy number. It significantly outperformed other methods, such as CNAT 3.0, CNAT 4.0, CNAG and GIM, in terms of SNR, and can also be used to correct for experimental artifacts due to spatial effects. The method was validated by QPCR and accurately detected the breakpoints in genomic profiles. It could therefore be used to improve the characterization of samples in genomic studies.


    ACKNOWLEDGEMENTS
 
This work was supported by the Institut Curie and the Centre National de la Recherche Scientifique. We thank Sophie Piperno-Neumann and Simon Saule, Jean-Paul Thiery and Marc Bollet, who were kind enough to provide us with access to their uveal melanoma, ovarian cancer and breast cancer datasets, respectively. We thank Marc Bollet, Nicolas Servant and Pierre Neuvial for fruitful discussions. We thank Audrey Rapinat and David Gentien for performing the Affymetrix Genechip experiments.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Chris Stoeckert

†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received on August 21, 2007; revised on January 29, 2008; accepted on January 29, 2008

    REFERENCES
 

    Bollet M, et al. High resolution mapping of breakpoints to define true recurrences among ipsilateral breast tumor recurrences. J. Natl Cancer Inst (2008) 100:48–58.

    Huang J, et al. CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics (2006) 7:83.

    Hupé P, et al. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (2004) 20:3413–3422.

    Kennedy GC, et al. Large-scale genotyping of complex DNA. Nat. Biotechnol (2003) 21:1233–1237.

    Komura D, et al. Noise reduction from genotyping microarrays using probe level information. In Silico Biol (2006) 6:79–92.

    Kotliarov Y, et al. High resolution global genomic survey of 178 gliomas reveals novel regions of copy number alteration and allelic imbalances. Cancer Res (2006) 66:9428–9436.

    La Rosa P, et al. VAMP: visualization and analysis of array-CGH, transcriptome and other molecular profiles. Bioinformatics (2006) 22:2066–2073.

    Liu W, et al. Deletion of a small consensus region at 6q15, including the MAP3K7 gene, is significantly associated with high-grade prostate cancers. Clin. Cancer Res (2007) 13:5028–5033.

    Matsuzaki H, et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat. Methods (2004) 1:109–111.

    Myers CL, et al. Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics (2004) 20:3533–3543.

    Nannya Y, et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res (2005) 65:6071–6079.

    Neuvial P, et al. Spatial normalization of array-CGH data. BMC Bioinformatics (2006) 7:264.

    Pinkel D, Albertson DG. Comparative genomic hybridization. Annu Rev. Genomics Hum. Genet (2005) 6:331–354.

    Schwarz G. Estimating the dimension of a model. Ann. Stat (1978) 6:461–464.

    Ylstra B, et al. BAC to the future! or oligonucleotides: a perspective for micro array comparative genomic hybridization (array CGH). Nucl. Acids Res (2006) 34:445–450.





