

Bioinformatics Advance Access originally published online on September 10, 2007
Bioinformatics 2007 23(20):2741-2746; doi:10.1093/bioinformatics/btm443

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A genotype calling algorithm for the Illumina BeadArray platform

Yik Y. Teo 1,2,*,†, Michael Inouye 2,†, Kerrin S. Small 1,2, Rhian Gwilliam 2, Panagiotis Deloukas 2, Dominic P. Kwiatkowski 1,2 and Taane G. Clark 1,2

1Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN and 2Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: Large-scale genotyping relies on the use of unsupervised automated calling algorithms to assign genotypes to hybridization data. A number of such calling algorithms have been recently established for the Affymetrix GeneChip genotyping technology. Here, we present a fast and accurate genotype calling algorithm for the Illumina BeadArray genotyping platforms. As the technology moves towards assaying millions of genetic polymorphisms simultaneously, there is a need for integrated and easy-to-use software for calling genotypes.

Results: We have introduced a model-based genotype calling algorithm which does not rely on having prior training data or require computationally intensive procedures. The algorithm can assign genotypes to hybridization data from thousands of individuals simultaneously and pools information across multiple individuals to improve the calling. The method can accommodate variations in hybridization intensities which result in dramatic shifts of the position of the genotype clouds by identifying the optimal coordinates to initialize the algorithm. By incorporating the process of perturbation analysis, we can obtain a quality metric measuring the stability of the assigned genotype calls. We show that this quality metric can be used to identify SNPs with low call rates and accuracy.

Availability: The C++ executable for the algorithm described here is available by request from the authors.

Contact: teo@well.ox.ac.uk or tgc@well.ox.ac.uk


    1 INTRODUCTION
 
The possibility of genome-wide studies hinges on advances in genotyping technology to perform large-scale genotyping quickly and cheaply, assaying up to a million single nucleotide polymorphisms (SNPs) simultaneously. In this setting, it is difficult and time-consuming to determine genotypes manually from the examination of fluorescent dye intensities representing the presence or absence of alleles, and automated genotype calling procedures are necessary for such large-scale genotyping.

A number of genotype calling algorithms have been established recently for the Affymetrix GeneChip and ParAllele Molecular Inversion Probe genotyping technology (Affymetrix Inc., 2006; Di et al., 2005; Moorhead et al., 2006; Plagnol et al., 2007; Rabbee and Speed, 2006; The Wellcome Trust Case Control Consortium, 2007; Xiao et al., 2007). There has been a confluence in the style and input of the calling algorithms towards the use of multi-component mixture models on hybridization intensities. These intensities are typically normalized and transformed to coordinates which yield distinct genotype clouds that are easier to call.

Generally speaking, the end-products of the genotyping process yield hybridization intensities for the alleles, and genotypes are assigned by comparing the relative strength of these intensities. While the genotype of a SNP for each sample can be assigned independently, it has been shown that pooling information across multiple samples at each SNP can enhance the calling process and result in higher quality calls (Affymetrix Inc., 2006). However, this requires non-biological differences of intensities to be minimized. Various normalization schemes have been explored and implemented, although these schemes are generally unique to each genotyping technology (Bolstad et al., 2003; Carvalho et al., 2007; Rabbee and Speed, 2006; Xiao et al., 2007). These normalized signal intensities are often transformed to different coordinate scales for ease of calling, and the contrast-strength transformation has been the preferred scale for a number of algorithms (Affymetrix Inc., 2006; Moorhead et al., 2006; Plagnol et al., 2007). The strength and contrast can be respectively interpreted as the equivalent of r and θ in polar coordinates for the allelic signals (see Fig. 1).
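As a rough illustration of this polar-coordinate analogy, the sketch below converts a pair of allelic intensities into a contrast-strength pair. The exact definitions differ between algorithms and are not reproduced in this text, so the particular form used here (a normalized difference for contrast, the log of the summed signal for strength), the sign convention and the function name are all assumptions made for illustration.

```python
import math

def contrast_strength(x, y):
    """Map allelic intensities (x for allele A, y for allele B) to (contrast, strength).

    Assumed form, for illustration only:
      contrast = (y - x) / (x + y), which lies in [-1, 1] for non-negative signals
      strength = log(x + y)
    The sign is chosen so that AA samples fall at negative contrast;
    the opposite convention is equally plausible.
    """
    total = x + y
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return (y - x) / total, math.log(total)
```

Under this assumed form, a sample with equal signal on both alleles has contrast 0, and a sample with signal on only one allele has contrast of magnitude 1, which is what makes the three genotype clouds separate along the contrast axis.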


 
Fig. 1. A typical clusterplot of the allelic hybridization signals for a SNP: (a) after normalization; (b) after transformation of the same data to yield the contrast-scale coordinates. Each point in the figures represents the intensity data for an individual.

 
While suitably normalized intensities generally result in genotype clusters with similar location characteristics across the SNPs, assigning genotypes for hundreds of thousands of SNPs means that a calling algorithm has to be suitably flexible to variations in the signal intensities. For most of the genotyping technologies, a fraction of the SNPs can have genotype clusters that are significantly shifted away from the expected positions (see Fig. 2). This can potentially result in erroneous genotype calls when the calling algorithm is unable to account for such extreme shifts.


 
Fig. 2. Clusterplot for a SNP with shifted genotype clusters. Points in grey represent the observed signal data and the black ellipses represent the expected positions of the three genotype clouds.

 
In an independent pursuit of a calling algorithm for Illumina data, we formalize a calling strategy which is similar to the approach by Moorhead et al. (2006) and Plagnol et al. (2007), and has a number of features which are specifically designed for the BeadArray genotyping technology. This strategy is set within an Expectation-Maximization framework, which is extremely fast without compromising on the quality of the genotype calls. We also explored the use of perturbation analysis to quantify the stability of the genotype calls, and provide a metric for assessing the quality of the assigned genotypes for each SNP as a whole. We emphasize the difference between the quality of the assigned genotype for each individual at a SNP, versus the quality of the assigned genotypes for each SNP across all the individuals. Our algorithm also has the ability to accommodate noisier data from whole-genome amplified DNA. It has been implemented in a C++ program Illuminus and is available by request from the authors.


    2 METHODS
 
2.1 Chip design
Here, we provide a description of the chip design for the Illumina HumanHap 550 K SNP microarray. Other Illumina HumanHap microarrays (e.g. the 650 K and 1 M versions), which assay different numbers of SNPs, are of similar technology and construction (Gunderson et al., 2006; Steemers et al., 2007). Each microarray consists of lateral stripes, each of which contains a beadpool of 55 000 different beadtypes. Every beadtype assays a single SNP and is represented by 20 beads on average. Each of these beads accommodates locus-specific 50-mer probes which correspond to the nucleotide sequence directly adjacent to the SNP. The single base extension (SBE) biochemistry format allows each beadtype to assay both SNP alleles, thus potentially providing 20 allele measurements per SNP on average for each DNA sample. Visualization of bead hybridization is achieved through the addition of hapten-labelled ddNTPs. If the labelled ddNTP corresponds to the complement of the assayed SNP, it will be incorporated by the extending polymerase and, following a staining step, the resultant fluorescent wavelength and intensity are measured by a scanner.

2.2 Normalization
Each microarray is divided into a number of sub-beadpools (25 sub-beadpools for the 550 K chip) and normalization of the bead intensities occurs at the sub-beadpool level. The normalization algorithm uses a six-degree-of-freedom affine transformation which occurs in five steps (Kermani, 2005):

  1. outlier removal
  2. background estimation
  3. rotational estimation
  4. shear estimation
  5. scaling estimation.
A brief description of the algorithm is as follows: within each sub-beadpool, outlier SNPs are removed if their allelic intensities are smaller than either the 5th smallest or 1st percentile as compared to all SNPs, or if their intensities are larger than the 5th largest or 99th percentile as compared to all SNPs. Background estimation occurs by uniform sampling of 400 points along each intensity axis to create a linear fitting to candidate homozygotes. The intercept of the linear fittings from both homozygotes then defines the origin. Rotation and shear of the data points by the same uniform sampling then occurs with respect to this defined origin. The final normalized intensities are then determined by mean scaling via virtual control points. This procedure at present occurs automatically within the Illumina BeadStudio software and outputs the normalized intensities, which provide a pair of coordinates corresponding to the signals for the two alleles at each SNP.

2.3 Manual and automated Illumina calling
The automated GenCall proprietary algorithm which Illumina provides with BeadStudio was run within the BeadStudio analysis on all Illumina 550 K SNPs. These genotypes are herein referred to as ‘GenCall’ genotypes. Manual curation of the Illumina SNPs was then performed on the GenCall clustered genotypes, and the curated genotypes are herein called ‘GenCall-C’ genotypes. Manual inspection and adjustment of the genotype classifications was performed on all SNPs with: (a) call rates < 95.0% at a GC score cutoff of 0.20; (b) call rates > 95.0% with cluster separation scores < 0.25 or average GC score < 0.60; (c) Hardy–Weinberg equilibrium χ² P-values < 0.0001. In addition, all mitochondrial SNPs and all SNPs located on the X or Y chromosomes were inspected.
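The inspection criteria above can be written as a single predicate. The following is a minimal sketch rather than the procedure actually used for curation: the function name, the argument names and the chromosome labels ("X", "Y", "MT") are hypothetical, and the per-SNP summary statistics are assumed to have been exported from BeadStudio beforehand.

```python
def needs_manual_inspection(call_rate, cluster_sep, mean_gc, hwe_p, chrom):
    """Return True if a SNP meets any of the inspection criteria listed above.

    call_rate   -- fraction of samples called at a GC score cutoff of 0.20
    cluster_sep -- cluster separation score
    mean_gc     -- average GC score
    hwe_p       -- Hardy-Weinberg equilibrium chi-square P-value
    chrom       -- chromosome label (labels here are assumed, not from the source)
    """
    if chrom in {"X", "Y", "MT"}:        # sex-linked and mitochondrial SNPs
        return True
    if call_rate < 0.95:                 # criterion (a)
        return True
    if call_rate > 0.95 and (cluster_sep < 0.25 or mean_gc < 0.60):  # criterion (b)
        return True
    return hwe_p < 1e-4                  # criterion (c)
```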

2.4 Mixture model
Let (xjl, yjl) denote the normalized signal intensities for the two alleles which we generically define as A and B, respectively, for sample j at SNP l. We define the contrast and strength respectively as


Formula 1

(1)


Formula 2

(2)
We fit a three-component bivariate mixture model for Xjl = (cjl, sjl) using multivariate truncated t distributions, where the three components correspond to the genotype classes of AA, AB and BB, respectively. Let f(x; µ, Σ, ν) denote the density function for data x at a t distribution with location parameter µ, variance–covariance matrix Σ and ν degrees of freedom. The density for Xjl can be written as


Formula 3

(3)
where (λ1, λ2, λ3) corresponds to the mixture proportions obtained by assuming Hardy–Weinberg equilibrium, and


Formula 4

(4)


Formula 5

(5)


Formula 6

(6)
We also introduce a fourth bivariate Gaussian component with zero covariance and significantly large variances such that the density is flat across the possible range of values. This serves as an outlier class for samples with intensity profiles which do not clearly belong to any of the three valid genotypes.

The parameters µk and Σk are estimated from the data, while the νk are pre-determined. As the Illumina BeadArray technology yields extremely sharp and uniform signals for the homozygotes as compared to the heterozygotes at the bulk of the SNPs, the variance profiles for the distributions of the contrast for the AA and BB genotype clusters are significantly peaked (see Fig. 3). This can potentially bias the genotype calling, since any homozygote samples with contrast intensities which are marginally different from the uniform signals may be assigned heterozygous genotypes. The use of t distributions means we can model this feature with heavier tails for the homozygous clusters as compared to the heterozygous cluster. In practice, this means we fit ν1 = ν3 < ν2 for most SNPs, and where the variance profiles for the contrast of the three genotype clusters are similar, we fit a constant degree of freedom for all three components.


 
Fig. 3. The clusterplot of a typical SNP on the Illumina array which yields highly homogeneous signals for the homozygous clusters, resulting in significantly peaked variance profiles for the homozygous clusters. Lines in black represent the kernel densities of the observed data (in grey).

 
An Expectation-Maximization procedure is implemented which alternates between recalibrating the parameters µk, Σk and λk using maximum-likelihood estimation conditional on the assigned genotypes (the M-step), and reassigning the genotypes to the intensity data conditional on the recalibrated cluster characteristics (the E-step). We iterate between recalibration and reassignment until the reassignment yields exactly the same genotype configuration as that of the previous iteration. At every reassignment step, we introduce an additional constraint which reflects the natural decision-making process of assigning genotypes. Samples with contrast values that are larger than the mean contrast of the heterozygous cluster will never be assigned an AA genotype, while samples with contrast values that are smaller than the mean contrast of the heterozygous cluster will never be assigned a BB genotype. The genotype for sample j is assigned to the genotype class with the maximum posterior probability, subject to the constraint that the maximum posterior probability is above some threshold. Samples where the maximum posterior probability is below the threshold will be assigned a NULL genotype.
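The reassignment rule for a single sample can be sketched as follows. This is an illustrative reading of the description above, not the Illuminus source: the function name and the dictionary representation of the posteriors are hypothetical, the default threshold of 0.95 is the value quoted in the Results section, and the posteriors are assumed not to be renormalized after a class is excluded, since the text does not say either way.

```python
def assign_genotype(posteriors, contrast, het_mean_contrast, threshold=0.95):
    """Assign a genotype to one sample from its posterior probabilities.

    posteriors        -- dict mapping "AA", "AB", "BB" to posterior probabilities
    contrast          -- contrast value of the sample
    het_mean_contrast -- current mean contrast of the heterozygous (AB) cluster
    """
    allowed = dict(posteriors)
    if contrast > het_mean_contrast:
        allowed.pop("AA", None)      # right of the AB mean: never AA
    elif contrast < het_mean_contrast:
        allowed.pop("BB", None)      # left of the AB mean: never BB
    best = max(allowed, key=allowed.get)
    # Calls whose best posterior does not exceed the threshold become NULL.
    return best if allowed[best] > threshold else "NULL"
```

For example, a sample with posterior 0.97 for AA but a contrast to the right of the heterozygous mean is not called AA; with the remaining posteriors far below the threshold, it is returned as NULL.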

Initialization of the EM procedure for each SNP is performed using a one-dimensional three-component Gaussian mixture model with equal mixing for the contrast axis, and searches across five guided starts to identify the set of parameters which best fits the data. Of the five starting sets of parameters, two are guided by the data, and this modification allows genotype clusters which are shifted to be accommodated accordingly. The spreads of the Gaussian distributions are similarly specified for the initialization and are assumed to be identical across all three genotype classes for each of the five guided starts. As with the location parameters, the SDs for the first three starts are pre-determined, and are calculated from the data for the last two starts. A NULL component with a flat density is introduced to accommodate samples with contrast scores which are ambiguous with respect to cluster membership. The EM algorithm is initialized with the set of starting parameters which yields the largest likelihood.

2.5 Chromosome X
For SNPs on chromosome X which are not located in the pseudo-autosomal regions, the genotype calling procedure is modified to incorporate gender information as males contain only one copy of chromosome X and thus will never be heterozygous at these SNPs. As we expect only the strength coordinates to differ between males and females with similar distribution for the contrast coordinates, there is no change to the calling procedure for females, whereas


Formula

in Equation (5). The mixture proportions under Hardy–Weinberg equilibrium are calculated such that males contribute only one allele copy when evaluating the allele frequencies, while females contribute two allele copies as usual.
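The allele-counting rule for chromosome X can be sketched as below. The exact expressions used in the model are not reproduced in this text, so the standard Hardy–Weinberg proportions (p², 2pq, q²) are assumed here; the function names and the genotype labels are hypothetical, and NULL calls are simply skipped.

```python
def allele_frequency_x(female_genotypes, male_genotypes):
    """Frequency of allele A at a non-pseudo-autosomal chromosome X SNP.

    Females contribute two allele copies per valid genotype; males contribute
    one copy, with "AA" and "BB" standing for hemizygous A and B respectively.
    """
    copies_a = 0
    copies_total = 0
    for g in female_genotypes:
        if g in ("AA", "AB", "BB"):
            copies_a += g.count("A")
            copies_total += 2
    for g in male_genotypes:
        if g in ("AA", "BB"):
            copies_a += 1 if g == "AA" else 0
            copies_total += 1
    if copies_total == 0:
        raise ValueError("no valid genotypes")
    return copies_a / copies_total

def hwe_proportions(p):
    """Standard Hardy-Weinberg proportions for (AA, AB, BB) given P(A) = p."""
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)
```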

2.6 Modelling parameters
The parameters used for our implementation of the calling algorithm are listed in detail here:

2.6.1 Initialization
The location parameters for the five guided starts are:

  • (−0.9, 0.0, 0.9)
  • (−0.9, −0.5, 0.9)
  • (−0.9, 0.5, 0.9)
  • (−0.9, 0.5 (max(c) + min(c)), 0.9 max(c))
  • (0.9 min(c), 0.5 (max(c) + min(c)), 0.9).
The SDs for the three genotype classes are identical for each guided start, and they are:
  • 0.1
  • 0.1
  • 0.1
  • 0.05 (max (c) + min (c))
  • 0.05 (max (c) + min (c)).
The location and SD of the NULL component are 0 and 100 000, respectively for genomic DNA, and 0 and 1000, respectively for whole-genome amplified DNA.
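The selection among the five guided starts can be sketched as follows, using the location parameters and SDs listed above and a one-dimensional three-component Gaussian mixture with equal mixing. This is a simplified illustration, not the Illuminus implementation: the flat NULL component is omitted, the function names are hypothetical, and the data-driven SD is wrapped in an absolute value with a small positive floor (an assumption of this sketch) so that it remains a valid standard deviation.

```python
import math

def normal_pdf(c, mu, sd):
    """Density of N(mu, sd^2) at c."""
    return math.exp(-0.5 * ((c - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def guided_starts(contrasts):
    """Return the five (means, sd) starting sets for the contrast axis."""
    lo, hi = min(contrasts), max(contrasts)
    mid = 0.5 * (hi + lo)
    # Listed as 0.05 (max(c) + min(c)); abs() and the floor are assumptions.
    data_sd = max(abs(0.05 * (hi + lo)), 1e-3)
    return [
        ((-0.9, 0.0, 0.9), 0.1),
        ((-0.9, -0.5, 0.9), 0.1),
        ((-0.9, 0.5, 0.9), 0.1),
        ((-0.9, mid, 0.9 * hi), data_sd),
        ((0.9 * lo, mid, 0.9), data_sd),
    ]

def log_likelihood(contrasts, means, sd):
    """Log-likelihood under an equal-mixing three-component Gaussian mixture."""
    total = 0.0
    for c in contrasts:
        density = sum(normal_pdf(c, m, sd) for m in means) / 3.0
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total

def best_start(contrasts):
    """Pick the starting set with the largest likelihood."""
    return max(guided_starts(contrasts),
               key=lambda start: log_likelihood(contrasts, start[0], start[1]))
```

For a SNP whose clusters sit near the expected positions, the first start is selected; the last two starts are there for SNPs whose clusters are shifted towards one end of the contrast axis.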

2.6.2 EM mixture model
The location parameter for genotype class k is updated by:


Formula

where nk denotes the number of samples assigned to genotype class k, and the superscript refers to the points which have been assigned to genotype class k.

The variance–covariance matrix for genotype class k, Σk, is updated by:


Formula

For most of the SNPs, we fit multivariate t distributions with heavier tails for the two homozygous classes. This is represented by ν1 = ν3 = 6, ν2 = 20. For a fraction of the SNPs where the variance profiles for the contrast of the three genotype clusters are similar, we fit ν1 = ν2 = ν3 = 20.

The location and variance–covariance matrix for the NULL component are


Formula

respectively for genomic DNA, and


Formula

respectively for whole-genome amplified DNA.
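The per-class recalibration described in this section can be sketched as below. The update formulas themselves are not reproduced in this text, so this sketch assumes the plain sample mean and the maximum-likelihood sample covariance (divisor nk) of the points currently assigned to a class; the actual updates may differ, for instance in the divisor or in any weighting implied by the t distribution. The function name is hypothetical.

```python
def update_cluster(points):
    """Recompute location and variance-covariance for one genotype class.

    points -- list of (contrast, strength) pairs currently assigned to the class
    Returns ((mean_c, mean_s), ((var_c, cov_cs), (cov_cs, var_s))).
    """
    n = len(points)
    if n == 0:
        raise ValueError("no samples assigned to this class")
    mean_c = sum(c for c, _ in points) / n
    mean_s = sum(s for _, s in points) / n
    var_c = sum((c - mean_c) ** 2 for c, _ in points) / n
    var_s = sum((s - mean_s) ** 2 for _, s in points) / n
    cov_cs = sum((c - mean_c) * (s - mean_s) for c, s in points) / n
    return (mean_c, mean_s), ((var_c, cov_cs), (cov_cs, var_s))
```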

2.7 Perturbation analysis
In the process of calling the genotypes for hundreds of thousands of SNPs, it is inevitable that there will be SNPs with high-confidence and yet erroneous calls which will filter through despite stringent thresholds on the maximum posterior probabilities. As it is impossible to manually curate the assigned genotypes for all the SNPs, we implemented a perturbation analysis step, which provides a metric for quantifying the stability of the assigned genotypes to minor perturbations in the normalized intensities (submitted). The analysis compares the genotype calls obtained from two independent runs of the algorithm, using the original and the perturbed intensities respectively; a high rate of discordance between the two runs implies that the genotype assignments are less reliable for the particular SNP. Every SNP therefore needs to be called twice, once on the original intensities (xjl, yjl) and once on the perturbed intensities (xjl + ε1, yjl + ε2), where ε1 and ε2 are independent and identically distributed as N(0, 0.05²). Because the algorithm within the EM framework is extremely fast, the time taken to run the analysis with perturbation analysis remains practical.
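A minimal sketch of the perturbation step for one SNP is given below. It assumes a caller function that maps a list of intensity pairs to a list of genotype labels; that callable, the function name and the fixed random seed are hypothetical, and the default noise SD of 0.05 is this sketch's reading of the perturbation distribution stated above.

```python
import random

def perturbation_discordance(intensities, call_genotypes, sd=0.05, seed=0):
    """Fraction of samples whose call changes when intensities are perturbed.

    intensities    -- list of (x, y) normalized allelic intensities for one SNP
    call_genotypes -- hypothetical caller: list of (x, y) -> list of genotype labels
    sd             -- SD of the independent Gaussian noise added to x and to y
    """
    if not intensities:
        raise ValueError("no samples")
    rng = random.Random(seed)
    perturbed = [(x + rng.gauss(0.0, sd), y + rng.gauss(0.0, sd))
                 for x, y in intensities]
    original_calls = call_genotypes(intensities)
    perturbed_calls = call_genotypes(perturbed)
    mismatches = sum(a != b for a, b in zip(original_calls, perturbed_calls))
    return mismatches / len(intensities)
```

A SNP would then be flagged when the returned fraction exceeds a chosen discordance threshold, such as the 5% used in the Results section.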


    3 RESULTS
 
We compared the performance of Illuminus on genotype data for 1409 samples with genomic DNA which have been genotyped on both the Illumina 550 K and the Affymetrix 500 K GeneChip genotyping arrays. These samples are from the 1958 British Birth Cohort and were genotyped as part of the Wellcome Trust Case Control Consortium (The Wellcome Trust Case Control Consortium, 2007). We identified 82 981 SNPs which overlap on the two chips and for which genotypes from the Affymetrix data are available. Genotypes for the Affymetrix data have been called using the automated program Chiamo (The Wellcome Trust Case Control Consortium, 2007). Four metrics are used in assessing the performance of Illuminus against the genotype calls obtained using the GenCall algorithm from the BeadStudio software (Version 3): (i) call rate, defined as the percentage of valid genotype calls (excluding the NULL calls); (ii) concordance, defined as the percentage of valid calls made by each method which are identical to the genotypes from the Affymetrix GeneChip technology (see Discussion Section); (iii) overall concordance, defined as the percentage of concordant genotype calls out of all possible calls, essentially the product of the concordance and the call rate; (iv) filtered SNPs, defined as the number of SNPs which have been dropped from further analysis due to low SNP-wise call rates, with the threshold set at 0.95 for samples with genomic DNA and 0.90 for whole-genome amplified samples. GenCall genotypes have been thresholded at a confidence score of 0.20, and any genotype with a score less than 0.20 is assigned a NULL genotype. The BeadStudio software also allows the calls made by GenCall to be manually curated, so we additionally assess the performance of Illuminus against GenCall genotypes which have undergone manual curation. We calculated the call and concordance rates for genotypes by Illuminus at a threshold of 0.95 on the maximum posterior probability. We also ran Illuminus with perturbation analysis, flagging and excluding SNPs with more than 5% discordant genotypes between two runs of Illuminus with the original and perturbed intensities. The results are summarized in Table 1.
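The first three metrics can be computed from a pair of genotype vectors as in the sketch below. The function name and the "NULL" label are hypothetical, and the reference genotypes are assumed to contain no missing calls, which the definitions above do not address.

```python
def calling_metrics(calls, reference):
    """Return (call rate, concordance, overall concordance) as fractions.

    calls     -- genotype labels from the method under test; "NULL" means no call
    reference -- genotype labels for the same samples from the other platform
    """
    if not calls or len(calls) != len(reference):
        raise ValueError("calls and reference must be non-empty and equal length")
    valid = [(c, r) for c, r in zip(calls, reference) if c != "NULL"]
    concordant = sum(c == r for c, r in valid)
    call_rate = len(valid) / len(calls)
    concordance = concordant / len(valid) if valid else 0.0
    overall = concordant / len(calls)   # equals call_rate * concordance
    return call_rate, concordance, overall
```

Because the overall concordance counts concordant calls out of all possible calls, it penalizes a method both for wrong calls and for calls it declines to make.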



 
Table 1. Comparison of Illuminus against GenCall on 82 981 SNPs for 1409 samples genotyped using genomic DNA

 
GenCall yielded high quality genotypes, successfully assigning a valid genotype to 99.597% of the data, with high accuracy as reflected by a concordance of 99.785% with genotypes from a different platform. This yielded an overall concordance of 99.383%. The process of manual curation appeared to be conservative, classifying more potentially ambiguous calls as NULL genotypes. This reduced the call rate to 99.419% but increased the concordance to 99.801%. While this reduced the overall concordance to 99.221%, manual curation can increase the accuracy of the calls made, and also reduce the number of SNPs with < 90% call rate. At a threshold of 0.95, Illuminus made around 0.28% more calls, which translated to 322 699 genotypes that were assigned by Illuminus but not by GenCall. The concordance for Illuminus was marginally lower, but Illuminus achieved a higher overall concordance of 99.603%. Perturbation analysis identified 2805 SNPs to be removed. These SNPs have mean call and concordance rates of 99.302 and 98.787%, and removing these SNPs from the analysis resulted in higher call and concordance rates of 99.896 and 99.762% for the remaining SNPs. This suggests that perturbation analysis can correctly identify SNPs with higher rates of erroneous calls. In order to provide a fair comparison of GenCall and Illuminus, we also investigated the performance of GenCall on the same set of remaining SNPs after exclusion based on perturbation analysis on Illuminus. Our analysis showed that removing SNPs identified by perturbation analysis also improved the performance of GenCall, suggesting that the SNPs removed were performing poorly. SNP-wise call rates have often been used as a criterion for filtering SNPs (Gudmundsson et al., 2007; Rioux et al., 2007; Saxena et al., 2007; Scott et al., 2007; The Wellcome Trust Case Control Consortium, 2007; Yeager et al., 2007), and a common approach is to remove SNPs with < 95% call rates. Under this criterion, the use of GenCall resulted in the loss of at least 636 SNPs, while Illuminus removed only 215 SNPs.

We also investigated the performance of Illuminus on 252 samples which have been genotyped on both the Illumina 650 K and the Affymetrix 500 K GeneChip genotyping arrays with whole-genome amplified DNA (Table 2). We analysed 95 578 SNPs found on both platforms, and the genotypes for the Affymetrix data have similarly been called using Chiamo. Perturbation analysis with a discordance threshold of 5% identified 7530 SNPs to be removed. These SNPs have mean call and concordance rates from Illuminus of 92.624 and 88.819%, and removing these SNPs from the analysis resulted in higher call and concordance rates for the remaining SNPs. We observed a drop in the performance of both GenCall and Illuminus for whole-genome amplified samples. GenCall has a lower call rate of 82%, but the calls that were made are generally highly concordant with Chiamo calls. Illuminus performed significantly better, with a much higher call rate of 95% while achieving concordance rates similar to those of GenCall. Overall, this means Illuminus made at least 2.5 million more concordant calls across 252 samples, corresponding to 12% more concordant calls on top of GenCall. While these figures appear low, we caution against over-interpreting them, since the performance of both GenCall and Illuminus is necessarily bounded by the error rates of Chiamo, and we believe this may be substantial for whole-genome amplified DNA (see Discussion Section). GenCall removed 41 332 SNPs with SNP-wise call rates of < 90%. By comparison, Illuminus experienced a more tolerable loss of 1607 SNPs.



 
Table 2. Comparison of Illuminus against GenCall on 95 578 SNPs for 252 samples genotyped using whole-genome amplified DNA

 

    4 DISCUSSION
 
We have introduced a model-based approach to call genotypes for the Illumina BeadArray platforms and this has been implemented within an Expectation-Maximization framework in the program Illuminus. By comparing the performance of Illuminus against GenCall on data where the genotypes for the same SNPs from a different genotyping platform are available, we see that Illuminus made more concordant calls and resulted in a smaller number of SNPs which are excluded on the basis of per-SNP call rates (see below). This improvement over GenCall is considerably more substantial for DNA samples which have undergone whole-genome amplification (see below). We have also investigated the use of perturbation analysis to provide a metric for assessing the stability of the assigned genotypes to minor changes in the allelic signals. Empirical results suggest that perturbation analysis correctly identified SNPs with lower call and concordance rates.

In quantifying the performance of Illuminus, we have compared the assigned genotypes against calls made on the same set of SNPs on the Affymetrix 500 K genotyping array. This is clearly not ideal as it assumes that the genotype calls made by Chiamo on the Affymetrix 500 K array are correct. The concordance rates calculated through such comparisons are naturally bounded above by the error rates of Chiamo. We visually inspected the clusterplots for SNPs where the discordance between Chiamo genotypes and GenCall or Illuminus genotypes is > 90% and found that all these SNPs have been correctly called using GenCall and Illuminus but the genotypes are inconsistent with the Chiamo genotypes from the Affymetrix data. As we expect such inconsistencies to be more prevalent for whole-genome amplified DNA, we visually verified the Chiamo calls on the Affymetrix data for a random collection of 4000 SNPs and discarded 1481 (37%) SNPs with suspect genotype calls. On the remaining set of 2519 SNPs with ascertained accurate Chiamo calls, the concordance rates of Illuminus and GenCall genotypes increased to 82.769 and 90.509%, respectively. This suggests that while the use of genotypes from an independent platform can provide a measure of comparative performance between GenCall and Illuminus, the concordance rates calculated here tend to underestimate the actual performance.

While whole-genome amplification allows the recovery of DNA samples with inadequate concentration by performing in vitro reproduction of quality template DNA, empirical evidence suggests that amplified DNA suffers a reduction in hybridization signals. In the context of genotyping a large number of samples simultaneously, which is common in a genome-wide association study, this can result in a proportion of SNPs experiencing shifts in the positions of the genotype clusters (see Fig. 4), resulting in indistinct clusters and increasing the uncertainty of a call. While the performance of both GenCall and Illuminus was less optimal for hybridization data from whole-genome amplified DNA, Illuminus comes with a modification for calling data from amplified DNA and this provides a substantial improvement over GenCall.


 
Fig. 4. Clusterplots of a SNP which has been typed on both genomic and whole-genome amplified DNA. Points in black correspond to samples with genomic DNA while points in grey correspond to samples with amplified DNA. The plots are made on the (a) normalized allelic signal coordinates; (b) strength-contrast transformed coordinates.

 
Genotyping technology is moving towards assaying millions of polymorphisms simultaneously. This requires automated genotyping algorithms to be efficient and accurate. While the use of computationally intensive algorithms may aim to maximize the potential of the hybridization data, they require the use of large computing clusters which the typical laboratory may not have access to. There is a need for an integrated, simple-to-use and yet accurate genotype calling software. The software for the algorithm comes equipped with a metric for identifying SNPs with unstable genotypes. We have presented a fast and accurate calling algorithm which is designed to work with data from the Illumina BeadArray genotyping platforms.


    ACKNOWLEDGEMENTS
 
The work of all the authors is supported by the Wellcome Trust. In addition, Y.Y.T., K.S.S., D.P.K. and T.G.C. also acknowledge support from the Grand Challenges in Global Health initiative (Gates Foundation, Wellcome Trust and FNIH) and the UK Medical Research Council.

Conflict of Interest: none declared.


    FOOTNOTES
 
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Associate Editor: Martin Bishop

Received on July 31, 2007; revised on August 20, 2007; accepted on August 20, 2007

    REFERENCES
 

    Affymetrix Inc. BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500K Array Set. (2006) http://www.affymetrix.com/support/technical/whitepapers/brlmm_whitepaper.pdf.

    Bolstad BM, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (2003) 19:185–193.

    Carvalho B, et al. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics (2007) 8:485–499.

    Di X, et al. Dynamic model based algorithms for screening and genotyping over 100K SNPs on oligonucleotide microarrays. Bioinformatics (2005) 21:1958–1963.

    Gudmundsson J, et al. Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat. Genet (2007) 39:631–637.

    Gunderson KL, et al. Whole-genome genotyping of haplotype tag single nucleotide polymorphisms. Pharmacogenomics (2006) 7:641–648.

    Kermani BG. Artificial intelligence and global normalization methods for genotyping. (2005) US Patent 20060224529.

    Moorhead M, et al. Optimal genotype determination in highly multiplexed SNP data. Eur. J. Hum. Genet (2006) 14:207–215.

    Plagnol V, et al. A method to address differential bias in genotyping in large-scale association studies. PLoS Genet (2007) 3:e74.

    Rabbee N, Speed TP. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics (2006) 22:7–12.

    Rioux JD, et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat. Genet (2007) 39:596–604.

    Saxena R, et al. Genome-wide association analysis identifies loci for Type 2 diabetes and triglyceride levels. Science (2007) 316:1331–1336.

    Scott LJ, et al. A genome-wide association study of Type 2 diabetes in Finns detects multiple susceptibility variants. Science (2007) 316:1341–1345.

    Steemers FJ, et al. Whole genome genotyping technologies on the BeadArray platform. Biotechnol. J (2007) 2:41–49.

    The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature (2007) 447:661–678.

    Xiao Y, et al. A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays. Bioinformatics (2007) 23:1459–1467.

    Yeager M, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat. Genet (2007) 39:645–649.[CrossRef][Medline]

