Bioinformatics Advance Access originally published online on June 29, 2006
Bioinformatics 2006 22(16):1942-1947; doi:10.1093/bioinformatics/btl341
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GEL: a novel genotype calling algorithm using empirical likelihood
1 Department of Statistics, The University of Chicago
2 Department of Medicine, The University of Chicago
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Preliminary results on the data produced using the Affymetrix large-scale genotyping platforms show that it is necessary to construct improved genotype calling algorithms. There is evidence that some of the existing algorithms lead to an increased error rate in heterozygous genotypes, and a disproportionately large rate of heterozygotes with missing genotypes. Non-random errors and missing data can lead to an increase in the number of false discoveries in genetic association studies. Therefore, the factors that need to be evaluated in assessing the performance of an algorithm are the missing data (call) and error rates, but also the heterozygous proportions in missing data and errors.
Results: We introduce a novel genotype calling algorithm (GEL) for the Affymetrix GeneChip arrays. The algorithm uses likelihood calculations that are based on distributions inferred from the observed data. A key ingredient in accurate genotype calling is weighting the information that comes from each probe quartet according to the quality/reliability of the data in the quartet, and prior information on the performance of the quartet.
Availability: The GEL software is implemented in R and is available by request from the corresponding author at nicolae{at}galton.uchicago.edu
Contact: nicolae{at}galton.uchicago.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Large-scale genotyping platforms are used in genetic association studies that aim to identify the DNA variation affecting risk to disease, or modifying the distribution of phenotypic variables such as quantitative traits and response to drug. The Affymetrix genotyping sets (Matsuzaki et al., 2004), GeneChip®Mapping 10K, 100K and 500K Arrays, were the first widely available platforms, and are used in many of current studies. They provide a large number of genotypes at low (per genotype) costs, but the technology used in producing the data can lead to high variation in the genotype calling rates. The high variability in calling rates can be owing to many factors including quality of DNA and reagents, and small variation in the experimental protocols. We have observed that the decrease in the calling rate is usually associated with a disproportionate increase (relative to other genotypes) in the number of heterozygous genotypes assigned a NoCall label (see also, section 4). We think that a significant percentage of the NoCall heterozygotes are due to the difference in binding efficiency of the probes corresponding to the two alleles, a difference that is not properly modeled by the algorithm implemented in the GeneChip DNA Analysis Software. We propose an algorithm that is based on a weighted likelihood, using weights that incorporate information on the properties of each quartet (e.g. the interrogating base), on the sample specific characteristics (e.g. presence of outliers), and prior information on the performance of the quartet that is obtained from data for which the genotypes are known. Calling genotypes is, for biallelic markers, a three-label classification problem so it is possible that many well-developed supervised learning algorithms can be modified for this problem. We think that an off-the-shelf approach might not be suitable for these data because the arrays are generally not exchangeable. The proposed likelihood-based approach can be used even when there is only one array available (and no training data), and it offers a framework for simple generalizations to methods for detecting chromosomal abnormalities such as copy-number polymorphisms.
| 2 DATA STRUCTURE AND GENOTYPE CALLING ALGORITHMS |
|---|
|
|
|---|
The Affymetrix GeneChip Mapping Arrays consist of a large number of probes containing 25-base oligonucleotides designed to test a given sequence. The probes are grouped in quartets: perfect match for allele A (PA), mismatch for allele A (MA), perfect match for allele B (PB) and mismatch for allele B (MB). The mismatch probes are used to measure the magnitude of cross-hybridization. For the AA genotype, it is expected that the intensity of PA is higher than that of MA, PB and MB, and for the AB genotype it is expected that the intensities of PA and PB are higher than those of MA and MB. For each SNP there are 10 quartets used on the Affymetrix GeneChip 100K Mapping Arrays (Matsuzaki et al., 2004), and six quartets on the Affymetrix GeneChip 500K Mapping Arrays. The information available in the data files that are part of the output of the imaging software consist, for each probe, of the number of pixels (nx) corresponding to the probe, the mean (
) and the standard deviation (sx) of the pixel intensity. These summaries of the pixel-level information are used by the Affymetrix GeneChip DNA Analysis Software to make the genotype calls. Figure 1 shows summaries of the probe data for the same SNP in two reference samples, one with a NoCall assigned by the Dynamic Model (DM) Mapping Algorithm (Di et al., 2005) implemented in the Affymetrix GeneChip DNA Analysis Software, and one with a correctly assigned AB genotype. The DM algorithm is calculating log-likelihood statistics for four models (NULL, AA, AB and BB) for the 10 probe quartets using a Gaussian model that assumes independent pixel intensities and equal mean intensities for probes in the same group (e.g. same mean intensity for PB, MA and MB if the genotype is AA, same mean intensity for PA and PB if the genotype is AB). The data for the first quartet in Figure 1 show that the latter assumption is in general not valid [see also Zhan and Kulp (2005)]. A one-sided Wilcoxon signed-rank test is then used on differences in log-likelihood values to calculate reliability rank scores (p-values corresponding to the signed-rank test). The smallest reliability score indicates the most probable genotype. In the NoCall example shown in Figure 1, the fifth quartet favors the model AA, and the sixth and seventh quartet favor the model BB; the Wilcoxon rank test is not powerful enough to overcome this, and therefore the DM algorithm is unable to call this genotype. The (lack of) power of the Wilcoxon signed-rank test is even more problematic for the Affymetrix GeneChip 500K Mapping Arrays where only six quartets are available for each SNP; one outlier quartet is sufficient to yield reliability rank scores as high as 0.4375.
The DM algorithm performs remarkably well and it has several strengths. The main strength is that calling genotypes does not require prior training that is difficult for rare alleles, or information from other variants on the array. As a consequence, the accuracy of the call is not affected by the SNP allele frequency. Also, the DM algorithm efficiently uses the low signal (small differences in PM and MM), and the no-signal quartets (i.e. quartets where the PMs means are smaller than the MMs means), and it aggregates robustly the quartet information. Some of the deficiencies of the approach are in the modeling data: it assumes that the PM means of the two alleles are equal for heterozygous genotypes, and independence of pixel intensities in calculating the likelihoods. Also there is no obvious way to incorporate the Quality Scores from the output of the software in the analytic methods for association studies.
Several data characteristics are important for design of a more effective allele calling algorithm. The standard deviations of the probe intensities are highly correlated with the mean intensities (for our arrays, the correlation between the mean and the standard deviation of the mis-matches are between 0.891 and 0.925), so the pixel standard deviations offer mostly redundant information. This has been discussed also in more general contexts of microarray-based hybridization (Zhan and Kulp, 2005). Also, the standard deviations of the statistics used in the algorithm depend on the location of the interrogation nucleotide, with lower variation if the offset (distance between the SNP location and the interrogation position) is 0 or 1. The offset is used explicitly in other algorithms (LaFramboise et al., 2005), but we use it only in designing weights for the quartets, with higher weights on the quartets with smaller standard deviations [see also Hekstra et al. (2003)].
Finally, the mean intensities for the two alleles are not equal for heterozygous genotypes. To investigate this we looked at the heterozygous genotype in the reference samples (where the genotypes are known), and we calculated the number of times the mean PA is bigger than the mean PB (an integer between 0 and 10). The distribution of this deviates substantially (data not shown) from what is expected under exchangeable mean intensities (i.e. a binomial 0.5 distribution).
The DM algorithm and the method proposed in this paper can be applied to individual arrays, and do not require training data or a large number of arrays to be available. Rabbee and Speed (2006) introduced an algorithm, RLMM, based on a multi-chip model that takes advantage of similarities of the intensities for the same probes across arrays. RLMM is using a supervised learning algorithm, and the construction of the classifier requires that the genotypes are known for some of the arrays. We compare the performance of the three algorithms in the Results section.
| 3 THE PROPOSED ALGORITHM |
|---|
|
|
|---|
The ideas behind the algorithm we introduce are straightforward. For each quartet, we calculate two statistics: tA that measure the difference between PA and the mis-matches and tB that measure the difference between PB and the mismatches. The statistics are used to quantify the support for the presence of allele A and/or B. Optimal statistics can be inferred from assumptions on the distribution of the probe intensities, and our choice is described in the next subsection. We denote with
,
and
the sampling distributions of tA when the genotype is AA, AB and BB respectively. Note that these distributions generally depend on the array, on the SNP, and on the allele under investigation. Also note that we assume that the distributions of the statistics are different when allele A is present in a homozygous or heterozygous genotype (LaFramboise et al., 2005). If these distributions are known, the likelihood of the data under each genotype can be inferred, e.g.
![]() |
is the observed value for the q-th quartet (we implicitly assume that tA and tB are sufficient statistics). The (posterior) probabilities for each genotype can be calculated using Bayes rule as
![]() |
There are several issues to deal with: (1) what are the optimal statistics for classification? (2) how to calculate/estimate the distributions of the statistics used in the likelihood calculations? (3) how to deal with outliers, data quality etc.? Note that optimal statistics are such that their distributions given the presence and given the absence of the allele should be well separated, similar for all levels of hybridization, and similar across arrays (so we can combine information from multiple arrays).
3.1 Empirical likelihood classification
We noted before that the data on the arrays we analyzed show that the standard deviation of the probe intensities is proportional to their mean, suggesting a Gamma model for the intensities. It is well-known that optimal tests for difference in the means of two Gammas are based on the ratio of the sample means [e.g. Shuie and Bain (1983)], and therefore we propose to use
![]() |
![]() |
and
based on data on a large number of normalized arrays. We plan to do this in future work. It is easy to see that the distributions of the tA and tB statistics can be approximated by Beta distributions, and one solution for calculating the likelihoods is to estimate the parameters of the distributions from the data (an empirical Bayes approach). The disadvantage of this method is that it relies on a good fit of the Beta distribution to the observed data. We therefore propose using empirical approximations to the distributions of tA and tB, and the corresponding algorithm is called GEL: genotype calling using empirical likelihood. Based on the results of a preliminary genotype calling algorithm (e.g. the DM algorithm), we can assign a genotype call to each SNP, and we can use the observed values of the statistics under a given genotype to build the empirical estimates for the distributions. Ideally, we would like to build empirical distributions for each SNP separately based on the data on a large number of arrays. This can be done if it would be possible to normalize the arrays such that the data on them are exchangeable. A simple alternative is to assume that the SNPs on the same array are exchangeable (the statistics have the same marginal distribution) and estimate the three distributions for each array separately.
Finally, we need to adjust the inference for the presence of outliers and other data quality issues. We propose to do this by constructing weights for each quartet that incorporate several aspects of the data. The weights are used in combining the information across quartets,
![]() |
The algorithm has the following steps (the current implementation is run for each algorithm separately): (1) run a preliminary algorithm (described next) to get calls for each SNP; (2) use the calls from (1) to construct empirical estimates for
,
and
; (3) use the empirical distributions to calculate the likelihoods for each quartet; (4) calculate the weight of each quartet; (5) combine the data on each quartet and make the genotype call.
3.2 The preliminary algorithm
This step can be skipped if calls are available from a different algorithm such as DM. These initial calls are used to build empirical distributions for the statistics tA and tB.
- For each quartet, calculate statistics that indicate presence or absence of the two alleles,




- For each SNP, average the s statistics from Step 1. Ignore the quartets for which both PM means are smaller than their corresponding MM means. The corresponding statistics are:
where i indexes the quartets that are used, and n is their number. Similarly, get tB, tAB, tM.
- For each value of tA calculate qA, the value of the distribution function of tM evaluated at tA, i.e. qA is the proportion of tMs smaller than tA. Similarly for tB, tAB.
- Call the genotypes:
- if qA > 0.99 and qAB > 0.95 call AA;
- if qB > 0.99 and qAB < 0.05 call BB;
- if qA > 0.975 and qB > 0.975 call AB;
- otherwise, call NoCall.
- if qA > 0.99 and qAB > 0.95 call AA;
3.3 The main algorithm
The main part of the algorithm is described in this subsection. Preliminary genotype calls (e.g. using the above module) and specification of the weights assigned to individual quartets (next module) are necessary to run this part of the algorithm. The user needs to specify the threshold used in calling the genotypes, q
[0,1]. For example, if q = 0.95, the SNPs for which the maximum posterior probability is <0.95 are assigned a NoCall.
- Calculate statistics that indicate presence or absence of the two alleles,
and

- Define empirical distributions for the statistics. We will define three distributions: the distribution of the statistic when the allele is present in a homozygous genotype (
), the distribution of the statistic when the allele is present in a heterozygous genotype (
), and the distribution of the statistic when the allele is absent (
). We use the calls from the Preliminary Algorithm (PrAl) to do this:
is the union of the set of tAs when the Preliminary Algorithm call is AA and the set of tBs when the PrAl call is BB;
is the set of tAs and tBs when the PrAl call is AB;
is the union of the set of tAs when the PrAl call is BB and the set of tBs when the PrAl call is AA.
- Discretize the empirical distributions for efficient computations, i.e. we use histogram approximations to their distribution. The first step is to define a partition of the range of the statistics. We use the intervals: (0,0.02], (0.02, 0.04], ... , (0.98,1]. Because some of the intervals might contain no points, we add a count of 10 to each bin, and this will lead to shrinkage estimators for the bin probabilities. For each interval Ij, (j = 1,J) calculate
as the fraction of points in
that are in Ij (and similarly for
and
).
- For each quartet calculate the likelihood of AA, AB, BB, NoCall in the following way:
- Find jA the index for the interval where tA is, and similarly for jB.
- Calculate




- Find jA the index for the interval where tA is, and similarly for jB.
- The contribution of individual quartets to the final call depends only on the relative probabilities; e.g. (AA, AB, BB)-genotype probabilities equal to (0.001, 0.001, 0.01) have the same information as the probabilities (0.01, 0.01, 0.1), i.e. BB is 10 times more likely than AA and AB. To make more robust calls, we shrink them by adding1/J2 to each of them. We denote the resulting quantities by pAA, pAB, pBB, pNoCall.
- Calculate weights for each quartet, wi, i = 1, 2, 3, ... 10, using the module described next.
- For each SNP, calculate the probabilities for AA, AB, BB, NoCall, by averaging the corresponding probabilities from the ten quartets with the weights from Step 6. That is,
and similarly for AB and BB. Calculate P(AA), P(AB) and P(BB) from these probabilities such that their sum is equal to 1.
- Make the call as the genotype that has the highest probability.
- Calculate the quality measure, Q, as the probability of the called genotype.
- All the genotype calls with quality measures lower than q are changed to NoCall.
3.4 The weights for each quartet
As discussed above, the 10 quartets that are used to call the genotype for one SNP do not yield data of the same quality. The differences are because of design (location of the interrogation probe), variable hybridization kinetics owing to different oligonucleotide sequence composition, and local array characteristics. Initially all the weights, wi, are equal to 1, and the weight changes are done in two modules: standard (applied all the time), and multi-chip (uses information from many arrays to quantify the quartet efficiency and reliability).
Standard weighting:
- Down-weight the quartets with a high probability of NoCall. That is we multiply the weight with (1 –pNoCall).
- Up-weight the probes with the polymorphic nucleotide close to the center of the oligo. We multiply the weights with 1.43 when the offset is zero, and with 1.07 when the offset is 1 (the factors come from standard deviations of the statistics for different offsets, data not shown).
Multi-chip weighting: the relative weights are obtained from arrays with called genotypes. Assuming we have n arrays, for a given quartet we calculate the set of probabilities
such that pj is the probability of the called genotype for the i-th array in the given quartet. If this set consist mostly of high probabilities, the quartet probabilities are generally consistent with the made calls. The weights are modified as follows:
- For each quartet we calculate the median of
, and the weights are multiplied by these reliability scores.
| 4 RESULTS |
|---|
|
|
|---|
The methods described above, the DM and RLMM algorithms, were applied to data from 15 XbaI arrays that were assayed at The University of Chicago. The hybridizations for these arrays were done on reference genomic DNA. The reference DNA was used for quality control and troubleshooting, and consensus genotypes for this sample were provided by Affymetrix based on nine independent replicates. Note that it is possible that some of the consensus genotypes are different than the underlying genotypes because they reflect only the majority call from the DM algorithm. The RLMM and GEL algorithms require the probe intensities for the genotype calling, and these were extracted from the .CEL files using the Gtype software provided by Affymetrix. The R code for RLMM was downloaded from Bioconductor, and the necessary internal files were downloaded from http://www.stat.berkeley.edu/nrabbee/RLMM/.
The RLMM algorithm relies on information from files containing information necessary for classification/genotype calling, files that are provided by the developer. Although we used the default of calling all genotypes, 4743 SNPs out of the total of 58 960 were not called. We believe that this is a consequence of the small training sample (the 90 CEU individuals in HapMap). For example, the SNP SNP_A-1641749 (rs16936866) is monomorphic in HapMap and no training data are available for two of the possible genotypes, and as a consequence it is not called by RLMM. Because
8% of the genotypes are missing even when the default is used, there is no meaningful comparison of call rates between RLMM, DM and GEL. The RLMM classification is based on determining a cutoff for the quality scores, and it was suggested (Rabbee and Speed, 2006) to use the quantiles of
. The quality scores for these arrays do not follow a
distribution (median scores range from 2.06 to 24.76), so we could not decide on threshold based on this distribution. To allow a direct comparison of error rate for RLMM and GEL, we use only the most accurate RLMM genotypes that give the same call rate as GEL. For example, for the first array in Table 1, we used the top (lowest distances to clusters) 88.7% genotypes out of the 54217 RLMM-called genotypes.
Table 1 show the results of the three algorithms applied to the reference samples. GEL was applied in its basic form (multi-chip weights were not used). Note that there is a trade-off between accuracy and call rates, i.e. decreasing the threshold q will increase both the call and error rates. We choose a threshold for GEL (q = 0.85) that gives similar error rates to those given by the DM algorithm. Note that we could have chosen a higher threshold that would lead to call rates that are similar to those obtained using DM, but much smaller error rates.
The arrays show variable call rates, e.g. DM call rates range from 88.7 to 99.6%. The variability in calling rate is surprising because hybridization was done on DNA of similar quality coming from a single subject. We think that this variability is owing to many factors including differences in the reagents used, sample preparation and variation in experimental protocol. Note that the call rates obtained using GEL are uniformly higher than those given by the DM algorithm, and the improvement is larger for the samples with low-throughput. For example, there is an additional
4% genotypes called for the two arrays with DM call rates <90%.
Genotyping errors were calculated with respect to the consensus genotypes. Note that for both DM and GEL, the proportion of errors are <1%, and they are negatively correlated with the call rates. The proportion of errors when using RLMM was much higher. To investigate whether this is a consequence of incorrect consensus genotypes, we also looked at the replicability of the genotype calls. For RLMM and GEL we calculated the number of SNPs for which at least two genotypes were called at least three times each in the 15 reference arrays. For GEL, there were 111 SNPs with inconsistent genotypes, and for RLMM there were 659.
Strikingly, when using the DM and RLMM algorithms, the percentage of heterozygous genotypes in the set of missing genotypes or in the set of discordant genotypes is very large (29% of the reference genotypes are heterozygous). This implies that there is a higher chance for a heterozygous genotype to be missing or miscalled than a homozygous genotype. When using GEL, there is a significant decrease in the proportion of heterozygous in the missing and miscalled genotypes, but there is still an excess of heterozygous genotypes in the missing dataset.
We also applied the GEL algorithm using multi-chip weights. The reliability of the quartets was inferred from 30 independent arrays for which there was no prior information on the genotypes. We noticed an improvement in call rates (data not shown); as before the improvement is larger in the low-throughput samples—almost 1% increase over basic GEL for the two arrays with DM call rates under 90%. There was also a substantial improvement in the heterozygous rates.
| 5 DISCUSSION |
|---|
|
|
|---|
The algorithm described in this paper provides results for the Affymetrix 100K platform that are more accurate than the DM results for similar call rates, or more abundant (higher call rates) for similar error rates. The algorithm can be applied to individual arrays, but can also incorporate information from other arrays by defining reliability measures for each quartet that are used in averaging the information across quartets. The biggest improvement when using GEL is that there seems to be no inflation in the percentage of heterozygous genotypes in the missing data or in the errors.
There are several directions for improvement of the algorithm. First, the error rates are not constant across arrays, and this complicates the adjustment for errors in the genetic association studies. Array-specific genotype calling thresholds can be defined such that the error rates are constant at a pre-specified level. Another improvement to the current implementation can be made by calibrating the quality scores (posterior probabilities), i.e. in the long run, p x 100% of the genotypes with maximum posterior probability equal to p are called correctly. Also, we assume the independence of the A and B statistics. An algorithm that would look at their joint distribution (i.e. derive the bivariate empirical distributions for (tA, tB) and use them to calculate posterior probabilities) can provide more accurate posterior probabilities. Finally, the method we introduce in this paper is not designed to handle structural polymorphisms such as copy number variation.
It is probable that an optimal genotyping calling algorithm would be based on a combination of single-chip and multi-chip approaches. Approaches such as RLMM that require training data and exchangeable arrays are likely to be optimal when the training data are informative (e.g. when sufficient data on all three genotypes are available). Methods that can be applied for individual arrays with no prior information, such as DM and GEL, work well for rare variants and unusual arrays. We plan to investigate hybrid algorithms in future work.
|
|
| Acknowledgments |
|---|
The authors would like to thank Graeme Bell, Xinmin Li, Laura Martinolich and Raluca Nicolae for their contribution to producing the data used in this paper. The research was supported in part by NIH grants DK-55889, DK-20595, and DK-47486. K.M. was supported in part by a fellowship from the Manpei Suzuki Diabetes Foundation.
Conflict of Interest. none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on April 22, 2006; revised on June 16, 2006; accepted on June 20, 2006
| REFERENCES |
|---|
|
|
|---|
Di, X., et al. (2005) Dynamic model based algorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays. Bioinformatics, 21, 1958–1963
Hekstra, D., et al. (2003) Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res, . 31, 1962–1968
LaFramboise, T., et al. (2005) Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput. Biol, . 1, e65[CrossRef][Medline].
Matsuzaki, H., et al. (2004) Genotyping over 100,000 SNPs on a Pair of Oligonucleotide Arrays. Nat. Meth, . 1, 109–111[CrossRef].
Rabbee, N. and Speed, T.P. (2006) A genotype calling algorithm for affymetrix SNP arrays. Bioinformatics, 22, 7–12
Shiue, W.K. and Bain, L.J. (1983) A two-sample test of equal gamma distribution scale parameters with unknown common shape parameter. Technometrics, 25, 377–381[CrossRef][ISI].
Zhan, Y. and Kulp, D. (2005) Model-P: a basecalling method for resequencing microarrays of diploid samples. Bioinformatics, 21, ii182–ii189[Abstract].
This article has been cited by other articles:
![]() |
M. G. Hayes, A. Pluzhnikov, K. Miyake, Y. Sun, M. C.Y. Ng, C. A. Roe, J. E. Below, R. I. Nicolae, A. Konkashbaev, G. I. Bell, et al. Identification of Type 2 Diabetes Genes in Mexican Americans Through Genome-Wide Association Studies Diabetes, December 1, 2007; 56(12): 3033 - 3044. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Xiao, M. R. Segal, Y.H. Yang, and R.-F. Yeh A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays Bioinformatics, June 15, 2007; 23(12): 1459 - 1467. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






