Bioinformatics Advance Access originally published online on December 16, 2007
Bioinformatics 2008 24(3):433-434; doi:10.1093/bioinformatics/btm614
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
affyGG: computational protocols for genetical genomics with Affymetrix arrays
Groningen Bioinformatics Centre, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Haren, The Netherlands
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Affymetrix arrays use multiple probes per gene to measure mRNA abundances. Standard software takes averages over probes. Important information may be lost if polymorphisms in the mRNA affect the hybridization of individual probes.
Results: We present custom software to analyze genetical genomics experiments in human, mouse and other organisms: (i) an R package providing functions for QTL analysis at the individual probe level and (ii) Perl scripts providing custom tracks in the UCSC Genome Browser to check for sequence polymorphisms in probe regions.
Availability: http://gbic.biol.rug.nl/supplementary
Contact: r.c.jansen{at}rug.nl
| 1 INTRODUCTION |
|---|
|
|
|---|
In genetical genomics (Jansen and Nap, 2001), gene expression profiles of genetically different individuals are combined with molecular markers in the DNA to reveal expression quantitative trait loci (QTL). Especially the Affymetrix technology, which is a popular platform for profiling gene expression, poses many challenges in data analysis since it uses multiple 25 mer probes per gene to measure its mRNA abundance. Alberts et al. (2005) proposed a statistical multi-probe model: log(y ij) = m + Pj + Gi + PGij + ei + eij, where yij is the signal of the jth probe of the ith individual, Pj is the probe effect and Gi is the genotype effect at the marker under study. PGij is the probe-specific genotype effect, which may be caused by (unassayed) single nucleotide polymorphisms (SNPs) in probe regions leading to a difference in hybridization, and not in gene expression (Alberts et al., 2007a). The model is computed at all marker positions to find the position (QTL) with the most significant Gi; the significance of the corresponding PGij quantifies the probe specificity of the QTL (QTL x probe). Using parametric bootstrap the significance is calculated from data drawn from the no-QTL model log(yij) = m + ei + eij. See Alberts et al. (2005) and Alberts et al. (2007a) for applications in human (association analysis of cell lines) and mouse (linkage analysis of recombinant inbred lines). The model is also applicable to backcross, intercross and other experimental designs (optionally with additive QTL effects). The model can also include a batch factor to remove unwanted batch effects.
| 2 PROTOCOL |
|---|
|
|
|---|
Figure 1 shows the workflow from the raw Affymetrix data in the form of .CEL files to QTL visualization. In the pre-processing part, the .CEL files are background corrected and normalized using the RMA method (Irizarry et al., 2003). In the processing part, QTL analysis is performed as described in Alberts et al. (2005), and deviating probes are detected using a procedure for backward elimination as described in Alberts et al. (2007a). Finally, a custom track in the UCSC Genome Browser is created, visualizing individual Affymetrix probes and known sequence polymorphisms (SNPs, splicing variants) in probe regions. The affyGG software accepts missing marker genotypes, but excludes them from the QTL analysis. In crosses, most likely genotypes can be imputed in advance using R/QTL (www.rqtl.org/manual/html/argmax.geno.html).
|
2.1 Pre-processing
- Create and load in R a comma separated values (CSV) file containing the genotype data (e.g. genotypes.csv), with format:{molecular marker names} x {individuals}, where the cells contain the genotype labels (e.g. 1 or 2; AA, AC or CC; U for missing data).> genotypes
read.csv(genotypes.csv)
- Create and load in R a CSV file (e.g. markerpositions.csv) containing the positions of the markers, with format:{molecular marker names} x {chromosome number (chr), position in basepairs (bp)}> markersPos
read.csv(markerpositions.csv)
- Create a vector containing the names of the .CEL files to be used, and specify the directory where the .CEL files are located:> celfiles
c("bxd1a.cel", "bxd2a.cel", "bxd5a.cel", "bxd6a.cel", ...)> directory
"C: /myproject/celfiles"
- Run the pre-processing function by typing:> probesignals
rma.preprocessing(directory, celfiles, filename = "probesignals.csv")Custom CDF files, as provided by Alberts et al. (2007b), can be used as follows:>probesignals
rma.preprocessing(directory, celfiles, filename = "probesignals.csv", cdffile = cdffile, cdfpackage = cdfpackage)
- (5) Specify in which batch each individual was processed:>batch
c(2, 2, 2, 1, 1, 2, 1, 1, 2, 3, 2, 2, 2, 2, 3, 3, 3, 3, 2, 3, 3, 1, 1, 1, 2, 3, 1, 2, 1, 1)
- (6) Select the probe level data for one probe set:>traits
probesignals[probesignals$probeset = = 96254_f_at,]
- (7) Perform QTL analysis on probe level:>qtlmap
qtlMap.xProbe(genotypes = genotypes, traits = traits, batch = batch)>qtlProbe
cbind(qtlmap[,1:3], minlog10pmarker =–log10(qtlmap[,4]))
- (8) Perform QTL analysis on probe set level:>qtlmapProbeset
qtlMap.xProbeSet(genotypes = genotypes, traits = traits, batch = batch)>qtlProbeset
–log10(qtlmapProbeset$pmarker)>intProbeset
–log10(qtlmapProbeset$pinteraction)
- (9) Calculate the genome-wide significance threshold:>qtlThres
qtlThresholds.sma(genotypes = genotypes, batch = batch, nProbes = 16, nIndiv = 30, n.simulations = 1000, filename = "qtlThres.csv")
- (6) Select the probe level data for one probe set:>traits
- (10) Create plot of the probe signals for a given probe set (Figure 2):Specify the interrogation positions of the probes on the mRNA:>pos
c(1893, 1894, 1897, 1904, 1906, 1911, 1912, 1913, 1916, 1925, 1929, ...)>probePlot(traits = traits, probesetName = "96243_f_at", probesPos = pos, alleleColors = mycolors)
- (11) Create QTL plots for each of the probes of one probe set:Collect the starting positions of each chromosome (in basepairs):>chrOffsets
c(0.000000, 197.842934, 379.178330, 540.742912, 693.473822, ...)>qtlPlot.xProbe(probesetName = "96243_f_at", markersPos = markersPos, probeQtlProfiles = qtlProbe, qtlThres = 3.72, chrOffsets = chrOffsets, filename = "out1.png")
- (12) Create QTL plots on probe set level (Figure 3):>qtlPlot.xProbeSet(probesetName = "96243_f_at", markersPos = markersPos, probesetQtlProfile = qtlProbeset, interactionProfile = intProbeset, qtlThres = 3.65, interactionThres = 3.75, chrOffsets = chrOffsets, filename = "out2.png")
- (11) Create QTL plots for each of the probes of one probe set:Collect the starting positions of each chromosome (in basepairs):>chrOffsets
|
|
2.4 Eliminate and check deviating probes
- (13) Eliminate deviating probes using the statistical method developed in Alberts et al. (2007a):>pe
probeElimination(probesetName = "160371_at", markerName = "D7Mit100", genotypes = genotypes, traits = traits, batch = batch)
- (14) Users can download our Perl scripts to check probes in the UCSC Genome Browser (http://genome.ucsc.edu) for known SNPs not used in the genotyping of the mapping population.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on July 4, 2007; revised on November 22, 2007; accepted on December 10, 2007
| REFERENCES |
|---|
|
|
|---|
Jansen RC, Nap JP. Genetical genomics: the added value from segregation. Trends Genet (2001) 17:388–391.[CrossRef][Web of Science][Medline]
Irizarry RA, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (2003) 4:249–264.[Abstract]
Alberts R, et al. A statistical multiprobe model for analyzing cis and trans genes in genetical genomics experiments with short-oligonucleotide arrays. Genetics (2005) 171:1437–1439.
Alberts R, et al. Sequence polymorphisms cause many false cis eQTLs. PLoS ONE (2007a) 2:e262.[CrossRef]
Alberts R, et al. A verification protocol for the probe sequences of Affymetrix genome arrays reveals high probe accuracy for studies in mouse, human and rat. BMC Bioinformatics (2007b) 8:132.[CrossRef][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


