Bioinformatics Advance Access originally published online on October 9, 2008
Bioinformatics 2008 24(23):2790-2791; doi:10.1093/bioinformatics/btn531
FABSIM: a software for generating FST distributions with various ascertainment biases
1IBE, Institute of Evolutionary Biology (UPF-CSIC), CEXS-UPF-PRBB. Doctor Aiguader, 88. 08003 Barcelona, Catalonia, Spain and 2CIBER en Epidemiologia y Salud Pública (CIBEREsp), Barcelona, Spain
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: We have developed software that applies ascertainment bias to simulated DNA sequences and calculates FST on them. The resulting neutral distributions are appropriate for testing whether the genetic differentiation of a particular gene between populations is compatible with neutral evolution or, on the contrary, suggests local adaptation by natural selection.
Availability: FABSIM is available from http://www.snpator.com/public/downloads/aRamirez/FABSIM/.
Contact: francesc.calafell{at}upf.edu.
Supplementary information: Supplementary data are available at Bioinformatics online. The data from which the figures are built can be downloaded from http://www.snpator.com/public/downloads/aRamirez/.
| 1 INTRODUCTION |
|---|
FST, the proportion of the genetic variance explained by differences among populations, can be used to find genes under local selection by comparing the FST value of a single locus against the genome-wide values. Allele frequency differences between populations are mainly caused by genetic drift, that is, by the random process driven by demographic history. Drift affects the whole genome and, thus, a genome-wide FST distribution primarily reflects drift. Against this backdrop, a gene with extremely large FST values is suspected of having undergone local adaptation in a subset of the human populations. Many studies based on this principle have been published, building genome-wide empirical distributions of FST from increasing numbers of autosomal single nucleotide polymorphisms (SNPs; Akey et al., 2002, among many others).
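The outlier logic described above can be illustrated with a minimal Python sketch. It is not part of FABSIM: the function name `empirical_p_value` and the background values are hypothetical, and the upper-tail empirical P-value is only one simple way of asking how extreme a locus is relative to a genome-wide distribution.

```python
from bisect import bisect_left


def empirical_p_value(locus_fst, background_fst):
    """Fraction of background FST values that are >= the locus value."""
    ordered = sorted(background_fst)
    # Index of the first background value that is >= locus_fst.
    first_ge = bisect_left(ordered, locus_fst)
    return (len(ordered) - first_ge) / len(ordered)


# Made-up background distribution, for illustration only.
background = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.45]
print(empirical_p_value(0.20, background))  # 0.2
```

A small value means that few background SNPs are as differentiated as the locus under study; whether that is evidence of selection depends on the background having been built under the same ascertainment scheme as the locus.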
Although empirical distributions are presumably neutral (since they report the differentiation due to demographic events), they are not built over the total variation found on the genome but on a particular subset of SNPs. The way the SNPs are ascertained may thus produce an underlying bias that affects the shape of the distribution (HapMap International Consortium, 2007). When comparing the FST of the SNPs in a gene against a published empirical distribution, the biases applied to both sets of samples can be different (Ferrer-Admetlla et al., 2008). An alternative is working with simulated distributions, as suggested by Beaumont and Nichols (1996). However, this procedure also produces unreliable distributions, as (a) FST is highly dependent on the demographic history of the samples, which may not be known with sufficient precision, and thus may not be accurately simulated, and (b) simulations do not take into account ascertainment biases. The former issue can now be addressed in humans using the calibrated demographic model proposed by Schaffner et al. (2005).
We address how to produce an FST distribution with the same bias as the genotyped samples. We have developed FABSIM, a Java software package that builds simulated FST distributions under different ascertainment biases. As a complement, it can also calculate minor (MAF) and derived (DAF) allele frequencies and a number of neutrality statistics.
| 2 IMPLEMENTATION |
|---|
FABSIM has been programmed in Java using NetBeans IDE 6.0. It has been released as an executable file, FABSIM.jar, that can be run on any platform provided that a Java Runtime Environment (JRE) 6 has been installed (see the Java web page http://java.sun.com/javase/downloads/index.jsp). Both the executable file and the source code can be freely downloaded from http://www.snpator.com/public/downloads/aRamirez/FABSIM/, together with the Help.pdf file and an Examples.zip file, which contains examples of the input file formats.
FABSIM works on coalescent-based simulation results, which may be generated using any of the available packages developed to simulate neutral genealogies. FABSIM supports as input the output file formats of ms, cosi and SelSim packages, or any other simulation translated to reproduce one of these formats. Different populations are introduced in the program in separate files.
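To make the input concrete, the following Python sketch reads the basic text output of ms into memory. It is an illustration only and does not reproduce FABSIM's own reader: the function name `parse_ms` and the dictionary layout are assumptions, and the sketch handles just the plain layout of ms output (a `//` line per replicate, then `segsites:`, `positions:` and one 0/1 string per chromosome), not tree output or the cosi and SelSim formats.

```python
def parse_ms(text):
    """Parse basic ms-style output into a list of replicates.

    Each replicate is a dict with 'positions' (floats in [0, 1]) and
    'haplotypes' (one string of '0'/'1' characters per chromosome).
    """
    replicates = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("//"):
            # A '//' line opens a new replicate.
            current = {"positions": [], "haplotypes": []}
            replicates.append(current)
        elif current is None or not line:
            continue  # command line, seeds, or blank separator lines
        elif line.startswith("segsites:"):
            continue  # the count is implied by the positions line
        elif line.startswith("positions:"):
            current["positions"] = [float(x) for x in line.split()[1:]]
        else:
            current["haplotypes"].append(line)
    return replicates


example = """ms 4 2 -t 5.0
1 2 3

//
segsites: 3
positions: 0.1000 0.5000 0.9000
010
110
001
000

//
segsites: 0
"""
reps = parse_ms(example)
print(len(reps), reps[0]["haplotypes"])
```

In ms output, 0 denotes the ancestral and 1 the derived allele, which is what makes both MAF and DAF computable from the same strings.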
FABSIM works on the data supplied by the user, to which four bias categories (comprising seven bias types in total) can be applied, alone or in combination; the input data can also be left unbiased. The four bias categories relate to (a) the discovery sample, (b) the presence of polymorphism in a population, (c) the MAF and (d) distance. If more than one population is supplied for analysis, SNPs are selected in only one population (determined by the user), but the bias is applied to all populations.
(a) Discovery sample biases imply that only some chromosomes (a subsample of size d) of a general sample of size n have been resequenced, and that the segregating sites found in them have been genotyped in the whole sample n. If this bias is applied, the program randomly selects d sequences (where 0 < d < n is specified by the user) and keeps only the SNPs that are polymorphic in these d sequences.
(b) In the bias related to the presence of polymorphism, only the SNPs that are polymorphic either in a given population or in all populations are kept.
(c) In the MAF bias, all SNPs with a MAF below a user-provided threshold are discarded.
(d) In physical distance biases, SNPs are selected with a physical spacing specified by the user. To do so, FABSIM randomly selects one segregating site among the first x base pairs, where x is the spacing, in base pairs, selected by the user. From this first selected SNP, the position x base pairs downstream is determined. If a segregating site is found at this new position, it is selected; otherwise, the nearest one is selected. FABSIM proceeds in this way until the new position falls outside the simulated fragment. This bias can be applied using the same distance along the whole gene or using different SNP densities. In the latter case, a file must be provided stating which fragments are to have particular densities, and which densities these are.
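Three of the biases described above (discovery sample, MAF and physical distance) can be sketched in Python as follows. This is a hedged illustration rather than FABSIM's Java implementation: all function names are hypothetical, haplotypes are assumed to be strings of '0'/'1', and positions are assumed to be in base pairs. In the distance bias the text does not say whether each step is measured from the selected SNP or from the target position; the sketch assumes the target position, which guarantees that the loop terminates.

```python
import random


def allele_count(haplotypes, site):
    """Number of chromosomes carrying the '1' allele at a site."""
    return sum(h[site] == "1" for h in haplotypes)


def polymorphic_sites(haplotypes):
    """Indices of sites where both alleles are present in the sample."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    return [s for s in range(n_sites) if 0 < allele_count(haplotypes, s) < n]


def discovery_bias(haplotypes, d, rng):
    """Keep the sites that segregate among d randomly drawn chromosomes."""
    if not 0 < d < len(haplotypes):
        raise ValueError("d must satisfy 0 < d < n")
    panel = rng.sample(haplotypes, d)
    return polymorphic_sites(panel)


def maf_bias(haplotypes, threshold):
    """Keep the sites whose minor allele frequency is >= threshold."""
    n = len(haplotypes)
    kept = []
    for s in range(len(haplotypes[0])):
        c = allele_count(haplotypes, s)
        if min(c, n - c) / n >= threshold:
            kept.append(s)
    return kept


def spacing_bias(positions, spacing, length, rng):
    """Pick a random site in the first `spacing` bp, then the site nearest
    to each successive target `spacing` bp further downstream."""
    first_window = [i for i, p in enumerate(positions) if p <= spacing]
    if not first_window:
        return []
    chosen = [rng.choice(first_window)]
    target = positions[chosen[0]] + spacing
    while target <= length:
        nearest = min(range(len(positions)),
                      key=lambda i: abs(positions[i] - target))
        if nearest != chosen[-1]:  # avoid selecting the same SNP twice in a row
            chosen.append(nearest)
        target += spacing  # assumption: step from the target, not from the SNP
    return chosen


haps = ["1000", "1100", "1100", "0100", "0010", "0000"]
rng = random.Random(1)
print(maf_bias(haps, 0.2))                                      # [0, 1]
print(spacing_bias([500, 1450, 1700, 2600], 1000, 3000, rng))   # [0, 1, 3]
print(discovery_bias(haps, 3, rng))
```

Each function returns site indices for a single population; applying the same indices to every population mirrors the statement that SNPs are selected in one population while the bias is applied to all of them.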
Three different statistics can be calculated with FABSIM (FST, MAF and DAF), as well as 17 neutrality statistics. FST can be calculated with or without a correction for unequal sample sizes among the populations involved, and by gene, by SNP or both. The neutrality statistics included are those used in Ramirez-Soriano et al. (2008).
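As an illustration of a per-SNP FST for two populations, the Python sketch below uses the heterozygosity-based definition FST = (HT - HS) / HT. The text does not state which estimator FABSIM implements, so both the formula and the `weight_by_size` switch (a simple stand-in for a sample-size correction) are assumptions, and the function name is hypothetical.

```python
def fst_per_snp(p1, p2, n1, n2, weight_by_size=False):
    """FST = (HT - HS) / HT for one biallelic SNP in two populations.

    p1, p2: frequency of the same allele in each population.
    n1, n2: number of chromosomes sampled in each population.
    Returns None when the SNP is monomorphic overall (HT = 0).
    """
    if weight_by_size:
        w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    else:
        w1 = w2 = 0.5
    p_bar = w1 * p1 + w2 * p2
    h_total = 2 * p_bar * (1 - p_bar)            # expected heterozygosity, pooled
    if h_total == 0:
        return None
    h_within = w1 * 2 * p1 * (1 - p1) + w2 * 2 * p2 * (1 - p2)
    return (h_total - h_within) / h_total


print(fst_per_snp(1.0, 0.0, 48, 46))   # 1.0, fixed difference
print(fst_per_snp(0.3, 0.3, 48, 46))   # 0.0, identical frequencies
```

The sample sizes of 48 and 46 chromosomes are taken from the SeattleSNPs comparison in the next section; with such similar sizes the weighted and unweighted values differ only slightly.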
| 3 BIOLOGICAL APPLICATION |
|---|
As an example of a possible application of the program, we have used it to produce several FST distributions with and without ascertainment bias and have compared them with the empirical FST distribution of all the segregating sites found in the human genes resequenced by the SeattleSNPs project (http://pga.gs.washington.edu/, Crawford et al., 2005). We have run simulations using the parameters provided by Schaffner et al. (2005), which have been shown to fit empirical human data for several statistics. Only two populations, African-Americans and Europeans, have been simulated, as they are the populations resequenced in SeattleSNPs. To match SeattleSNPs data, we have simulated 48 African-American and 46 European chromosomes. The FST values for all SNPs and simulations, together with the numerical results behind the histograms, are reported as Supplementary Material.
We have compared the distribution of FST obtained by simulation against the empirical distribution of the SNPs in the genes resequenced by SeattleSNPs (Fig. 1). The number of SNPs with low FST is higher in simulations than in SeattleSNPs data. Furthermore, the distribution of FST in SeattleSNPs data has a larger tail of SNPs at high frequencies. This could be explained by (a) the effect of imputing missing genotypes or by (b) the presence of genes affected by natural selection. To assess the weight of these two possible explanations, the analysis was repeated dropping the sites with missing data instead of imputing their alleles. When those sites are not included in the analysis, the number of low FST values (<0.05) increases, but so does the fraction of values with high FST, mainly those SNPs with FST > 0.95. However, removing SNPs with missing genotypes does not make the empirical FST distribution significantly closer to the simulated distribution. This result points to positive selection as a significant force in shaping the FST distribution for SeattleSNPs genes, a plausible explanation given that the genes in this database have been chosen for their relationship with the human inflammatory response.
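One common way to quantify how far apart two binned FST distributions are is a chi-square test of homogeneity on the histogram counts. The Python sketch below computes only the statistic and its degrees of freedom; the function name and the counts are made up for illustration, and nothing here reproduces the actual SeattleSNPs or simulated histograms.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for two histograms
    defined over the same FST bins."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    used_bins = 0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # bins empty in both histograms carry no information
        used_bins += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, used_bins - 1


# Hypothetical counts in two FST bins (low, high) for two data sets.
stat, dof = chi_square_homogeneity([10, 20], [20, 10])
print(round(stat, 2), dof)
```

The statistic is then compared with the chi-square distribution with `dof` degrees of freedom to obtain a P-value, for example with a statistics library.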
Furthermore, we have compared simulation data with and without several degrees of ascertainment bias, producing biases by (a) MAF, (b) presence of polymorphism and (c) discovery sample (see Supplementary Results). When testing the statistical significance of the differences between the distributions of FST obtained from non-ascertained and from ascertained simulations by means of the χ2-test, all ascertained distributions are significantly different from the non-ascertained ones, with P < 0.0001.
| ACKNOWLEDGEMENTS |
|---|
We thank Ignacio Guerra for his support in developing FABSIM, Arcadi Navarro and Urko M. Marigorta for their helpful comments on this article, Deborah Nickerson for making SeattleSNPs sequences widely available and the Spanish National Institute for Bioinformatics (http://www.inab.org), a platform of Genoma España.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on September 5, 2008; revised on October 7, 2008; accepted on October 7, 2008
| REFERENCES |
|---|
Akey JM, et al. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. (2002) 12:1805–1814.
Beaumont MA, Nichols RA. Evaluating loci for use in the genetic analysis of population structure. Proc. R Soc. B Biol. Sci. (1996) 263:1619–1626.
Crawford DC, et al. The patterns of natural variation in human genes. Annu. Rev. Genomics Hum. Genet. (2005) 6:287–312.
Ferrer-Admetlla A, et al. Balancing selection is the main force shaping the evolution of innate immunity genes. J. Immunol. (2008) 181:1315–1322.
HapMap International Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature (2007) 449:851–861.
Ramirez-Soriano A, et al. Statistical power analysis of neutrality tests under demographic expansions, contractions and bottlenecks with recombination. Genetics (2008) 179:555–567.
Schaffner SF, et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. (2005) 15:1576–1583.
