Bioinformatics Advance Access originally published online on March 10, 2009
Bioinformatics 2009 25(10):1223-1230; doi:10.1093/bioinformatics/btp119
Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA
¹Signal and Image Processing Institute, Ming Hsieh Department of Electrical Engineering, Viterbi School of Engineering, University of Southern California, EEB 400, 3740 McClintock Ave, Los Angeles, CA 90089-2564 and ²Departments of Pediatrics and Pathology, Saban Research Institute, Childrens Hospital Los Angeles, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
*To whom correspondence should be addressed.
Abstract
Motivation: The complexity of a large number of recently discovered copy number polymorphisms is much higher than initially thought, thus making it more difficult to detect them in the presence of significant measurement noise. In this scenario, separate normalization and segmentation is prone to lead to many false detections of changes in copy number. New approaches capable of jointly modeling the copy number and the non-copy number (noise) hybridization effects across multiple samples will potentially lead to more accurate results.
Methods: In this article, the genome alteration detection analysis (GADA) approach introduced in our previous work is extended to a multiple sample model. The copy number component is independent for each sample and uses a sparse Bayesian prior, while the reference hybridization level is not necessarily sparse but identical on all samples. The expectation maximization (EM) algorithm used to fit the model iteratively determines whether the observed hybridization levels are more likely due to a copy number variation or to a shared hybridization bias.
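The alternation described above, deciding whether an observed level belongs to a sample-specific copy number change or to the shared reference, can be illustrated with a deliberately simplified sketch. It is not the GADA implementation: soft-thresholding stands in for the sparse Bayesian prior and the EM updates, probes are treated independently rather than as piecewise-constant segments along the genome, and the names `joint_fit`, `soft_threshold`, `threshold` and `n_iter` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink values toward zero; entries with |v| <= t become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def joint_fit(y, threshold=0.5, n_iter=50):
    """Alternate between a shared reference and sparse per-sample deviations.

    y: array of shape (n_samples, n_probes) holding log-intensities.
    Assumed model (illustrative only): y = reference + deviations + noise,
    with `reference` shared by all samples and `deviations` mostly zero.
    Returns (reference, deviations).
    """
    y = np.asarray(y, dtype=float)
    # Start from the per-probe median across samples.
    reference = np.median(y, axis=0)
    deviations = np.zeros_like(y)
    for _ in range(n_iter):
        # Sparse step: attribute to a sample only residuals above the threshold.
        deviations = soft_threshold(y - reference, threshold)
        # Shared step: average what remains once the deviations are removed.
        reference = np.mean(y - deviations, axis=0)
    return reference, deviations
```

With a matrix of samples by probes, `reference, deviations = joint_fit(y)` returns one reference value per probe and a matrix of deviations that is zero wherever a sample agrees with the reference. Soft-thresholding shrinks the surviving deviations by roughly `threshold`, a bias that a proper sparse Bayesian fit is designed to avoid.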
Results: The newly proposed approach is compared with the currently used strategy of separate normalization followed by independent segmentation of each array. Real microarray data obtained from HapMap samples are randomly partitioned to create different reference sets. With the new approach, copy number and reference intensity estimates vary significantly less when the reference set changes, and copy numbers detected within HapMap family trios are more consistent. Finally, the running time to fit the model grows linearly in the number of samples and probes.
Availability: http://biron.usc.edu/piquereg/GADA
Contact: rpique{at}ieee.org; shahab{at}chla.usc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: John Quackenbush
Received on October 14, 2008; revised on February 17, 2009; accepted on February 26, 2009