Bioinformatics Advance Access originally published online on April 14, 2008
Bioinformatics 2008 24(11):1406-1407; doi:10.1093/bioinformatics/btn136
Analysing georeferenced population genetics data with Geneland: a new algorithm to deal with null alleles and a friendly graphical user interface
1Centre for Ecological and Evolutionary Synthesis, Department of Biology, University of Oslo, P.O Box 1066 Blindern, 0316 Oslo, Norway, 2Applied Mathematics Department, INRA, Paris and 3Centre de Biologie et de Gestion des Populations, INRA / IRD / CIRAD / Montpellier SupAgro, Campus international de Baillarguet, CS 30016, F-34988 Montferrier-sur-Lez cedex, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: We introduce a new algorithm to account for the presence of null alleles when inferring population clusters from individual multilocus genetic data. We show by simulations that null alleles can reduce the accuracy of inferences if not properly accounted for, and that our algorithm significantly improves their accuracy.
Availability: This new algorithm is implemented in the program Geneland. It is freely available under the GNU public license as an R package on the Comprehensive R Archive Network. It now includes a fully clickable graphical interface. Information on how to get the software is available at folk.uio.no/gillesg/Geneland.html
Contact: gilles.guillot{at}bio.uio.no
Supplementary information: Details on the simulation study are available from folk.uio.no/gillesg/BioInformatics_Geneland
| 1 INTRODUCTION |
|---|
Bayesian clustering algorithms have become extremely useful tools to investigate the structure of population genetics data (Excoffier and Henkel, 2006), but the conclusions drawn from such algorithms can be markedly influenced by the presence of genotyping errors (Pompanon et al., 2005). A well-known source of such problems is the presence of null alleles, which arise from variation in the nucleotide sequences of flanking regions that prevents the primer from annealing to template DNA during PCR amplification of the microsatellite locus (Dakin and Avise, 2004). The presence of null alleles results in an excess of homozygous genotypes within a population compared to the proportion expected under Hardy-Weinberg Equilibrium (HWE) and Linkage Equilibrium (LE) (Callen et al., 1993; Paetkau et al., 1995). Although all population genetics clustering programs assume HWE and LE within the sought clusters, there is no study to date on the effect of null alleles on the accuracy of inferences made with such programs. In this note, we introduce a new statistical model and an MCMC step that explicitly take into account the putative presence of null allele(s) in the analysed dataset. We briefly illustrate how the presence of null alleles affects the accuracy of inferences with and without our null allele filtering scheme.
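The origin of the homozygote excess can be illustrated with a short calculation. This is only a sketch: the symbols $p_a$ (frequency of a visible allele $a$) and $p_0$ (cumulative frequency of null alleles at the same locus) and the numerical values are assumptions introduced for this example. Under HWE, an individual carrying $a$ together with a null allele is scored as homozygous for $a$, so that

$$P(\text{scored as } (a,a)) = p_a^2 + 2\,p_a\,p_0 > p_a^2 .$$

For instance, with $p_a = 0.3$ and $p_0 = 0.2$, genuine homozygotes have frequency $0.09$ whereas apparent homozygotes have frequency $0.09 + 0.12 = 0.21$. In addition, a fraction $p_0^2 = 0.04$ of individuals carry two null alleles and yield no amplification product at this locus.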
| 2 METHODS |
|---|
When the presence of null alleles is suspected, we distinguish the observed genotypes, denoted by $z = (z_{i,l})$ (where the subscripts $i$ and $l$ refer to the individual and the locus, respectively), from the true unobserved genotypes, denoted by $y = (y_{i,l})$. For each locus, we introduce an extra fictional allele, denoted by $\nu_l$, coding for the putative presence of one or several null alleles whose cumulative frequency has to be estimated. The presence of null alleles is taken into account by estimating $y$ jointly with the other parameters of the model in an MCMC simulation. A generic step updating $y$ sequentially visits the genotypes of all individuals at all loci. If $z_{i,l}$ is a heterozygous genotype, there is no ambiguity and $y_{i,l} = z_{i,l}$. If $z_{i,l}$ consists of two missing data, there is no ambiguity either, as the true unobserved genotype necessarily consists of two null alleles and $y_{i,l} = (\nu_l, \nu_l)$. If $z_{i,l} = (a, a)$, there is an ambiguity: the true genotype could be either genuinely homozygous, $y_{i,l} = (a, a)$, or could be $y_{i,l} = (a, \nu_l)$. We denote by $\theta$ the vector of all unknown quantities to be inferred (including $y$). Under HWE within populations, the conditional probability of a genuine homozygote is

$$\pi\left(y_{i,l} = (a,a) \mid z_{i,l} = (a,a), \theta_{-y}\right) = \frac{f_{kla}^2}{f_{kla}^2 + 2 f_{kla} f_{kl\nu_l}}$$

where $\theta_{-y}$ denotes the vector of all parameters except $y$, $f_{klj}$ denotes the frequency of allele $j$ at locus $l$ in population $k$, and $k$ is the population to which individual $i$ is currently assigned. The full conditional probability of the presence of a null allele is

$$\pi\left(y_{i,l} = (a,\nu_l) \mid z_{i,l} = (a,a), \theta_{-y}\right) = 1 - \pi\left(y_{i,l} = (a,a) \mid z_{i,l} = (a,a), \theta_{-y}\right).$$

The genotype $y_{i,l}$ is hence sampled randomly according to these two probabilities. The other steps of the Markov chain simulation are similar to those described in Guillot et al. (2005), except that the likelihood is built on $y$ instead of $z$.

To assess the benefit of this extra step, we produced data according to the model implemented for simulations in Geneland. Loosely speaking, it produces spatially organised panmictic populations. For each simulated dataset, we tampered with the genotypes of the initial datasets (i.e. the datasets without null alleles) in a way that mimics the presence of null alleles with various frequencies. For each simulated dataset, we carried out inference of the number of populations K and of the population memberships of individuals. Details on the simulation study are given as Supplementary Material.
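The update of a single true genotype described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of Geneland: the names `NULL`, `MISSING`, `prob_genuine_homozygote`, `update_true_genotype` and `observe` are hypothetical, the homozygote probability is the one implied by HWE within a population, and `observe` is only one plausible way of mimicking null alleles in simulated data (the procedure actually used is described in the Supplementary Material).

```python
import random

NULL = "null"    # hypothetical label for the fictional null allele at a locus
MISSING = None   # hypothetical label for a missing allele call


def prob_genuine_homozygote(f_a, f_null):
    """P(true genotype is (a, a) | observed (a, a)) under HWE.

    f_a and f_null are the frequencies of allele a and of the null
    allele in the population the individual is currently assigned to.
    """
    hom = f_a ** 2                  # genuine homozygote (a, a)
    het_null = 2.0 * f_a * f_null   # heterozygote (a, null), scored as (a, a)
    return hom / (hom + het_null)


def update_true_genotype(z, freqs, rng):
    """Draw the true genotype y given the observed genotype z at one locus.

    z     : pair of allele labels (MISSING for a missing call)
    freqs : dict mapping allele label (including NULL) to its frequency
    rng   : random.Random instance
    """
    a, b = z
    if a is MISSING and b is MISSING:
        return (NULL, NULL)   # double missing data: two null alleles
    if a != b:
        return z              # heterozygote: no ambiguity
    p = prob_genuine_homozygote(freqs[a], freqs[NULL])
    return (a, a) if rng.random() < p else (a, NULL)


def observe(y):
    """Genotype that would be scored if the null allele fails to amplify."""
    visible = [allele for allele in y if allele != NULL]
    if not visible:
        return (MISSING, MISSING)
    if len(visible) == 1:
        return (visible[0], visible[0])
    return tuple(visible)
```

In a full sampler, this update would be applied to every individual and locus at each iteration, with the allele frequencies (including that of the null allele) updated in the other steps of the chain.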
| 3 RESULTS AND DISCUSSION |
|---|
Results are shown in Table 1. We found that inferences with Geneland are robust to the presence of a relatively small proportion of null alleles (i.e. <10%; Table 1). However, the presence of null alleles at higher proportions (e.g. 20%) substantially alters the accuracy of inferences on the number of populations with a systematic overestimation of K (Table 1, line 7).
With regard to the new statistical model and MCMC algorithm accounting for null alleles, we found that they efficiently restored the accuracy of inferences (Table 1, line 8). Incidentally, we observed that when one or several null alleles were simulated, their cumulative frequency at each locus was very accurately estimated (results not shown). Interestingly, we observed that the use of our new statistical model and MCMC algorithm did not alter the accuracy of inferences when the dataset did not contain null alleles (Table 1, lines 1 and 2). Finally, we found that the extra algorithmic step accounting for null alleles had a negligible effect on computing times (an increase of only a few percent, depending on the thinning of the chain).
The resilience of Geneland to the presence of null alleles with frequencies up to 10% is fortunate in light of previous studies, as null alleles at microsatellite loci have been reported frequently in PCR primer characterization and in population genetics studies (Dakin and Avise, 2004). This resilience can be explained by the fact that in our simulations (and in real datasets as well), null alleles occur at random in space, so that the locations of individuals carrying null alleles do not display any spatial pattern. Therefore, although the presence of null alleles creates an excess of homozygous genotypes, this excess cannot be corrected by creating spurious populations while maintaining the geometric constraints of the spatial model on which Geneland is based. In agreement with this, when using Geneland under the non-spatial model option (making the prior model on population membership similar to that of Structure or BAPS in the non-spatial mode), we found that the inferences became largely unreliable, with a systematic overestimation of the number of populations, even for low null allele frequencies; for instance, we obtained […] = 24.3%, […] = 22.4% and ERCAi = 8.03% when analysing the datasets with only 2% of null alleles.

For mean frequencies of null alleles greater than or equal to 20%, the presence of null alleles becomes an issue even when the spatial model option is used. In this case, the accuracy of inferences is efficiently restored by the new statistical model and MCMC algorithm we specifically proposed for dealing with null alleles. In practice (i.e. when working with a real dataset), the presence of null alleles may often be suspected, but their proportions are unknown. Since the extra algorithmic step accounting for their putative presence restores the accuracy of inferences when null alleles are present and does not alter it when the dataset contains none, we recommend carrying out inferences with Geneland with this option, which does not increase the computing time significantly.
The present algorithm as well as the previously existing functionalities of Geneland are now available through a graphical user interface (GUI). This GUI is written in Tcl/Tk through the R library tcltk. For the growing community of R users in population genetics (see, e.g. the related CRAN task view cran.r-project.org/web/views/Genetics.html), this new GUI should prove very useful, as it allows Geneland to be used without any knowledge of the R language.
| ACKNOWLEDGEMENTS |
|---|
We thank J.M. Cornuet, J.F. Cosson, M. Fontaine, R. Leblois, J.M. Marin, F. Mortier, C.P. Robert and G. Roderick for comments at various stages of this work.
Funding: This work was financially supported by the French Agence Nationale de la Recherche grant No NT05-4-42230.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on March 27, 2008; revised on April 9, 2008; accepted on April 9, 2008
| REFERENCES |
|---|
Callen DF, et al. Incidence and origin of null alleles in the (AC)n microsatellite markers. Am. J. Hum. Genet. (1993) 52:922–927.
Dakin EE, Avise JC. Microsatellite null alleles in parentage analysis. Heredity (2004) 93:504–509.
Excoffier L, Henkel G. Computer programs for population genetics data analysis: a survival guide. Nat. Rev. Genet. (2006) 7:745–758.
Guillot G, et al. A spatial statistical model for landscape genetics. Genetics (2005) 170:1261–1280.
Paetkau D, et al. Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol. (1995) 4:347–354.
Pompanon F, et al. Genotyping errors: causes, consequences and solutions. Nat. Rev. Genet. (2005) 6:847–859.