Skip Navigation


Bioinformatics Advance Access originally published online on August 27, 2008
Bioinformatics 2008 24(20):2356-2362; doi:10.1093/bioinformatics/btn455
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/20/2356    most recent
btn455v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Li, M.
Right arrow Articles by Hanson, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, M.
Right arrow Articles by Hanson, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A semiparametric test to detect associations between quantitative traits and candidate genes in structured populations

Meijuan Li *, Cavan Reilly and Timothy Hanson

Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building MMC 303, Minneapolis, MN 55455-0378, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 CONCLUSIONS
 ACKNOWLEDGEMENT
 REFERENCES
 

Motivation: Although population-based association mapping may be subject to the bias caused by population stratification, alternative methods that are robust to population stratification such as family-based linkage analysis have lower mapping resolution. Recently, various statistical methods robust to population stratification were proposed for association studies, using unrelated individuals to identify associations between candidate genes and traits of interest. The association between a candidate gene and a quantitative trait is often evaluated via a regression model with inferred population structure variables as covariates, where the residual distribution is customarily assumed to be from a symmetric and unimodal parametric family, such as a Gaussian, although this may be inappropriate for the analysis of many real-life datasets.

Results: In this article, we proposed a new structured association (SA) test. Our method corrects for continuous population stratification by first deriving population structure and kinship matrices through a set of random genetic markers and then modeling the relationship between trait values, genotypic scores at a candidate marker and genetic background variables through a semiparametric model, where the error distribution is modeled as a mixture of Polya trees centered around a normal family of distributions. We compared our model to the existing SA tests in terms of model fit, type I error rate, power, precision and accuracy by application to a real dataset as well as simulated datasets.

Contact: meijuanl{at}biostat.umn.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 CONCLUSIONS
 ACKNOWLEDGEMENT
 REFERENCES
 
Population-based association mapping has become a powerful, general tool for identifying loci associated with the inheritance of complex traits. Because the allelic association due to linkage disequilibrium usually operates over shorter genetic distances, association mapping permits higher resolution mapping than does linkage analysis (Hästbacka et al., 1992). However, population-based association mapping may lead to both false positives and failure to detect true associations due to population stratification (Lander and Schork, 1994). Several statistical methods were recently proposed to utilize genomic markers to correct for potential population stratification in the analysis of qualitative traits (Devlin and Roeder, 1999; Devlin and Roeder, 2001; Price et al., 2006; Pritchard et al., 2000) and in the analysis of quantitative traits (Bacanu et al., 2002; Hoggart et al., 2003; Redden et al., 2006; Zhang et al., 2006). Among them, genomic control (GC) (Devlin and Roeder, 1999; Reich and Goldstein, 2001), structured association (SA) (Pritchard et al., 2000) and principal component analysis (PCA) (Price et al., 2006) are the three prevailing methods for dealing with stratification.

The GC method corrects for stratification by adjusting association statistics at each marker by a uniform overall inflation factor (Devlin and Roeder, 1999; Reich and Goldstein, 2001). However, if the actual distribution has thicker tails than predicted, this could lead to a high type I error rate (Pritchard and Donnelly, 2001). The SA method uses a Bayesian clustering method to assign individuals membership of subpopulations using random genetic marker data. The method is computationally expensive. Assignments of individuals to subpopulations are highly sensitive to the number of subpopulations, which is not well defined. In contrast, the PCA approach is computationally efficient. These methods have been proven useful in a variety of contexts; however, they have limitations. For many real-life data having population structure along with diverse levels of familial relatedness within subpopulations, i.e. continuous population stratification, GC, SA and PCA approaches may lead to either inadequate control for false positives or a loss in power owing to genetic correlation within subpopulations (Yu et al., 2006; Zhao et al., 2007). Yu et al. (2006) recently introduced a unified mixed-model approach in which the fixed population structure effect and random subject-specific effects attributable to kinship K with the covariance matrix being 2K{sigma}Formula, where {sigma}Formula is the additive genetic variance, are included to adjust for continuous population stratification.

The association between a candidate gene and a quantitative trait is often evaluated via a linear regression model, where inferred population structure variables are included as covariates. In such cases, the residual distribution is customarily assumed to be from a symmetric and unimodal parametric family, such as a Gaussian (Aranzana et al., 2005; Kang et al., 2008; Lettre et al., 2007; Tommasini et al., 2007; Yu et al., 2006; Zhao et al., 2007), although this may be inappropriate for the analysis of many real-life datasets, where increased skewness, kurtosis and multimodality may hold. For example, many plant traits are highly skewed due to selection (Kelker and Kelker, 1986). Moreover, for a quantitative trait determined by the sequence variants of a few genes of large effects and normally distributed random error(s), the associated residuals from a single marker and the trait association test (the most common practice in association mapping) are indeed distributed as a mixture of normal distributions with non-identical means, and hence are not normally distributed. Furthermore, exploratory analysis such as residual plots for assessing the error distribution is often impractical in association mapping, in particular for genome-wide scans, because there are far too many markers (one cannot examine 100 000 or more plots). Suitable transformation of the response to achieve comfort with the assumptions under a standard parametric family for studies with large numbers of candidate loci may be difficult because the appropriate transformation may vary from one marker to another, and interpretation problems may arise. Enrichment of standard families usually fails to capture all the desirable features. Hence, association tests that avoid parametric specification for the residual distribution and are robust against potential population stratification are desirable; our goal in this article is to describe a statistical method that is intended to have such desired properties. First, we proposed a new computationally efficient method to infer population structure. DNA-based genetic dissimilarity has been used to measure the degree of genetic difference between individuals in association studies (Jakobsson et al., 2008; Wessel and Schork, 2006). In our approach, the genetic distance matrix is estimated from a set of random markers, multidimensional scaling (MDS) is performed on this genetic distance matrix. MDS and PCA are both used for dimensionality reduction. When compared to PCA, MDS gives more readily interpretable solutions of lower dimension and does not depend on the assumption of a linear relationship between variables (Lacher, 1987). In our method, the estimated population structure is given by the MDS representations of the distance matrix. Like Yu's mixed model, in addition to population structure, our model also includes kinship K to adjust for continuous population stratification, and the two approaches in our model for uncovering population stratification are complementary; however, unlike Yu's mixed model, the two approaches in our model are orthogonal, so the effects of kinship on a phenotypic variation are estimated after taking the effects of population structure into account. Next, the relationship between trait values, genotypic scores at a candidate gene and genetic background variables is modeled using a semiparametric regression approach, where the error distribution is modeled as a mixture of Polya trees (MPTs) centered around a family of normal distributions and thus may be viewed as a generalization of standard models in which important, data-driven features, such as skewness and multimodality, are allowed. The MPTs prior provides an intermediate choice between a strictly parametric model and a completely arbitrary model (Ferguson, 1974; Hanson, 2006; Lavine, 1992). To illustrate its utility and interpretation, we applied the proposed model to a previously published dataset of association mapping in 95 Arabidopsis thaliana lines (Zhao et al., 2007), where the residuals were assumed to be normally distributed. We compared our proposed model to the SA method as well as the PCA method in terms of model fit, power and type I error rate. We also conducted two different simulation studies to demonstrate the power, precision and accuracy of the proposed method.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 CONCLUSIONS
 ACKNOWLEDGEMENT
 REFERENCES
 
Let y'=(y1,..., yn) be the n response vector. Let xFormula=(xi1,...,xip) be a p-component column vector of the design matrix for the i-th subject, where xi1=1 is an intercept term, xi2 is the genotypic score (haplotype, genotype or gene expression level) for individual i at a candidate gene for a given trait and xi3,...,xip are the values of the genetic background variables for individual i for i=1,...,n. For a candidate gene with L categorical values, there are (L–1) orthogonal dummy variables corresponding to the candidate gene, and in such cases, xFormula=(xi21,...,xi2(L–1)); βFormula=(β21,...,β2(L–1)). Let β'=(β1Formula,...,βp) be the regression coefficients corresponding to p predictors (including the intercept). A kinship coefficient is often defined as the probability of identity by descent of the markers compared (Ritland, 1996), but estimators based on genetic markers actually estimate a ‘relative kinship’, that can be defined as ratios of differences of probabilities of identity by state (Hardy and Vekemans, 2002). Note that with this definition, negative relative kinship coefficients naturally occur between some individuals. This is interpreted as meaning subjects that these are less related than random individuals and the negative kinship coefficients are set to 0 (Hardy and Vekemans, 2002; Ritland, 1996; Yu et al., 2006; Zhao et al., 2007). Like Yu et al. (2006), we also assume K is symmetric and positive definite and let Ks denote the square root of 2K.

2.1 Methods for inferring population structure
We first consider the SA method for inferring population structure. Let Q be the population structure for n subjects estimated using the software STRUCTURE (Pritchard et al., 2000). Each individual's genetic background is represented by a vector QFormula=(Qi1,...,QiJa), where Qij is the portion of the individual's genome which originated in subpopulation j for j=1,...,Ja (thus {sum}Formula Qij=1 for i=1,...,n). For the SA method, xij=Qi(j–2) for j=3,...,(Ja+1).

Second, we consider the PCA method for population structure inference. Let C=(C1,..., Cj,..., CJb) be the Jb largest principal components of a set of genetic marker data for n individuals, where CFormula=(C1j,..., Cnj). For this model, xij=Ci(j–2) for j=3,...,(Jb+2). We used the Bayesian information criteria (BIC) to choose the number (Jb) of principal components by fitting linear regression models with l components for l=1,...,≤n (the sample size) as the l continuous explanatory variables and a trait of interest as the dependent variable, and Jb is the value of l corresponding to the smallest BIC.

Next, we propose a new method for inferring population structure using data at a set of genetic markers. Suppose that there are N SNP markers which have been genotyped on subjects i and j (for i=1,...,n and j=1,...,n), let N11 be the total number of alleles shared between individuals i and j, then the modified Rogers’ genetic distance (GDMR) (Rogers, 1972) between individuals i and j is calculated as:

Formula
Note here N may vary among different pairs of subjects if the data is incomplete. MDS (Wickelmaier, 2003) was then performed on this n by n genetic distance matrix D with its (i,j)th element being d(i,j) described above. Let B=(B1,...,Bj,...,BJc) with BFormula=(B1j,...,Bnj) be the Jc dimensional MDS representations of genetic distance matrix D. Note here, Jc is determined using BIC by fitting linear regression models with (B1,...,Bl) for l = 1,...,≤r (the rank of D) as the l continuous explanatory variables and a trait of interest as the dependent variable, and Jc is set to the value of l corresponding to the smallest BIC. We denoted this method as MDS, and here xij=Bi(j–2) for j=3,...,(Jc+2).

2.2 Parametric association tests
We first consider Yu's parametric mixed model for association tests in structured population, which may be formulated as a regression model:

Formula 1(1)
for i = 1,...,n, where {varepsilon}i are assumed to be i.i.d normal random variables with mean 0 and variance {sigma}2 and independent of the value of xi2,...,xip and {eta}i = {sum}Formula Ks(i,j){gamma}j. Throughout this article, we assume {gamma}'=({gamma}1,...,{gamma}n), β, {sigma} are independent in their prior distributions with Formula . For priors, we assume {gamma}i~N(0, {tau}–1) for i=1,...,n and {tau} ~ Gamma (a,b) with a = b = 0.01, i.e. Formula . This is equivalent to setting {eta}'=({eta}1,...,{eta}n) and {eta}~N(0, 2K{tau}–1). The joint posterior is given by:

Formula

2.3 Semiparametric association tests
Now, we consider semiparametric structured association tests, which are different fundamentally from the above parametric models in two aspects. First, the error distribution is modeled as a MPTs constrained to have median 0 and centered around a normal family of distributions. Second, we restrict kinship effects to the space orthogonal to population structure, thus the effects of kinship only explains the variations in response that are not captured by the population structure. Let P be the population structure inferred by the SA, PCA or MDS methods. Let Pc = I–P(P'P)–1P' (I here is an identity matrix), denote PFormula=PcKs. The semiparametric association test may be formulated as a regression model:

Formula 2(2)
for i=1,...,n, where {varepsilon}i follows a MPT and {eta}i = {sum}Formula PFormula(i,j){gamma}j. Specifically, let G{sigma}2=N(0, {sigma}2), and Formula . Consider

Formula
We briefly describe the MPT prior but leave technical details to Hanson (2006) and Lavine (1992). A Polya tree prior is constructed from a set of partitions {Pi}Formula={BFormula:{varepsilon}isin{cup}Formula{0, 1}l} and a family A of positive real numbers. Here, the partition points are quantiles of the centering family: if j is the base-10 representation of the binary number {varepsilon}={varepsilon}1,...,{varepsilon}k at level k, then BFormula is defined to be the interval (GFormula(j/2k), GFormula((j+1)/2k)], except the rightmost set is BFormula=GFormula((2k–1)/2k), {infty}). For example, with k=3, and {varepsilon}=000, then j=0 and BFormula=(0, GFormula (1/8)], and with {varepsilon}=010, then j=2 and BFormula=(GFormula(2/8), GFormula(3/8)], etc. Note then that at each level k for k=1,...,(M–1), the class {BFormula : {varepsilon}isin{0, 1}k} forms a partition of the positive reals and furthermore BFormula=BFormula{cup}BFormula for any binary {varepsilon}1,...,{varepsilon}k. We take the family A={{alpha}{varepsilon} :{varepsilon}isin{cup}Formula {0, 1}j} to be defined by {alpha}{varepsilon}1,...,{varepsilon}k=c {rho}(k) for some c>0 (Hanson, 2006). The parameter c is the amount of weight attached to the underlying normal distribution. As c tends to zero, the posterior G|data is almost entirely data driven. As c tends to infinity, we obtain a fully parametric analysis. Here {rho}(.) is any increasing positive function of k, and {rho}(k)=k2 is used in the vast majority of applications. Given prodFormula and A, the Polya tree prior is defined up to level M by the random vectors ZM={(Z{varepsilon}0, Z{varepsilon}1): {varepsilon}isin{cup}Formula {0, 1}j} through the product of conditional probabilities

Formula
For k = 1,...,M, the vectors (Z{varepsilon}0,Z{varepsilon}1) are independent Dirichlet:

Formula
Define the vector of probabilities

Formula
through

Formula
where {varepsilon}1,...,{varepsilon}M is the base-10 binary representation of the integer k, k=0,...,(2M–1).

Let µi1+xFormulaβ23xi3+···+βpxip+{eta}i, and Ni denote the integer part of Formula [here {Phi} is the cdf of N(0,1)]. After simplification, the pdf of yi for i=1,...,n is given by:

Formula
the likelihood involving n independent subjects is then given by:

Formula 3(3)
The joint posterior is then given by:

Formula
We tried various values for c and M for the MPT prior and we also put a Gamma prior on c. The model appears to be robust to the values of c and M, hence our discussion is based on the case with c=1, M=4 and {rho}(k)=ck2, i.e. {alpha}{varepsilon}1,..., {varepsilon}k=c{rho}(k)=k2. All algorithms use a Metropolis–Hastings (M–H) step for updating the components (Z{varepsilon}0, Z{varepsilon}1) one at a time by sampling candidates (ZFormula, ZFormula) from a Dirichlet (mZ{varepsilon}0, mZ{varepsilon}1) distribution, where m=20. Let L*=L(y; ZFormula, {sigma}2,β,{gamma}) and L=L(y; ZM,{sigma}2,β,{gamma}), this candidate is accepted as (ZFormula, ZFormula) with probability

Formula
For the M–H step for updating the parameter vector (β,{sigma}2), we used a multivariate normal random-walk proposal constrained to ({sigma}2)*>0. The proposal covariance matrix is a scaled version of the estimated covariance matrix of the maximum likelihood estimate from fitting the ordinary linear regression:

Formula
The variance scaling factor was set to be a value such that a Markov chain accepted 30–40% of the proposed moves. The candidate (β*,({sigma}2)*) is accepted with probability

Formula
For the M–H step for updating parameter vector {gamma}, sample {gamma}Formula ~ N({gamma}i,s{tau}–1) for i=1,...,n (here s is a scaling factor so that a Markov chain accepted 30–40% of the proposed moves), accept {gamma}* with probability

Formula
We used the full conditional to update of {tau} by sampling {tau}* ~ {Gamma}(a+0.5n, b+0.5{sum}Formula{gamma}Formula). For each of the three population structure inferring methods, we consider both parametric tests denoted as SA, PCA and MDS methods, and semiparametric tests denoted as SPSA, SPPCA and SPMDS methods. We fit both parametric and semiparametric tests using a Bayesian approach. For each of the six tests, two Markov chains were run in parallel to monitor the convergence of the Metropolis algorithm with 60 000 iterations per chain for burn-in and 5000 Monte Carlo samples from 50 000 iterations per chain for inference. C++software was used for the analysis and is available at www.biostat.umn.edu/~cavanr.

2.4 Model comparison and diagnostics
Bayes factors (Han and Carlin, 2001; Kass and Raftery, 1995) are difficult to obtain in practice. Instead, for each model, we compute the log pseudo marginal likelihood (LPML) (Geisser and Eddy, 1979) The LPML is a cross-validated ‘leave-one-out’ measure of a model's ability to predict the data. It is valid for small and large samples and does not suffer from a heuristic justification based on large sample normality. Let p1(·) and p2(·) denote probability densities corresponding to models 1 and 2, respectively. The conditional predictive ordinate (CPO) for observation j under model i is given by CPOij=pi(Dj|Dj), where Dj={(xl1,..., xlp, Ksl, yl)}l!=j. The ratio CPO1j/CPO2j measures how well model 1 supports the observation Dj relative to model 2, based on the remaining data Dj. The product of the CPO ratios gives an overall aggregate summary of how well supported the data are by model 1 relative to model 2 and is called the pseudo Bayes factor:

Formula
The log of the product of the n CPO statistics under a given model is the LPML statistic for that model, LPMLi=log{Pi}FormulaCPOij, and therefore

Formula

2.5 Application to the A. thaliana dataset
The 95 A. thaliana lines used in this study have been described previously (Zhao et al., 2007), where the errors were assumed to be normally distributed. For this specific dataset, subjects are the 95 Arabidopsis lines. The trait we are interested in for association mapping is the plant flowering time measured in days from germination to the first opening of flowers with 5-week vernalization and under short-day conditions (8 hour (h) light/16 h dark) at the University of Southern California (SDV). Using the software STRUCTURE (Pritchard et al., 2000), Nordborg et al. (2005) estimated Q with Ja=8 for the 95 Arabidopsis lines. For inference for B for the MDS method and C for the PCA method, we used 5000 high-quality SNP markers, and both Jb and Jc were determined to be 8 as well. The residuals for SNPs and SDV association tests by the three parametric models demonstrated deviations from normality. As an example, random SNP S106.215 and SDV association tests by the parametric models have excessive skewness (1.27 for the PCA model, 1.27 for the MDS model and 1.95 for the SA model) as well as excessive kurtosis (3.85 for the PCA model, 3.86 for the MDS model and 5.12 for the SA model). Therefore, the semiparametric MPT model which relaxes the normality assumption on the error distribution may be more appropriate than the three parametric models for this dataset.

To compare the performance of the six models, we tested the association between SDV and FRIGIDA (FRI), a known central regulator of flowering time for unvernalized Arabidopsis plants. The effects of FRI is eliminated by vernalization (Lee and Amasino, 1995), a procedure that accelerates flowering by a long period of cold temperatures, generally <10{circ}C but >0{circ}C. There are three genotypes at FRI for 95 lines, where 1 represents the wild-type lines, 2 represents the mutated lines with the deletion of the Ler allele and 3 represents the mutated lines with the deletion of the Col allele (Nordborg et al., 2005). In addition, we did the association tests between SDV and random SNP markers, S1.140 and S1.415, respectively.

The LPML statistics from the FRI and SDV association test for the six models are summarized in Table 1. The LPML values vary a great deal among the six models. For each of the three population inference methods, the semiparametric test has larger LPML than the corresponding parametric test. For example, the LPML for proposed SPMDS model is –369.5 compared to –402.1 for the MDS model. The pseudo Bayes factor for comparing the SPMDS model to the MDS model is approximately exp((–369.5)–(–402.1))>1000. The SPSA model versus the parametric SA or SPPCA model versus the PCA model also yields a pseudo Bayes factor >1000. This observation suggests that the proposed semiparametric MPT models that relax the normality assumption on the error distribution, are better fit than the parametric models for the current dataset. The same conclusion is held from the SNPs and SDV association tests (Table 1).


View this table:
[in this window]
[in a new window]

 
Table 1. Comparisons of the SA, SPSA, PCA, SPPCA, MDS and SPMDS models in LPML and 95% credible sets (95% CI) genetic effects of the Ler and Col deletion alleles of FRI from the association test between FRI and SDV

 
Also listed in Table 1 are the effect ({delta}1) of the Col deletion haplotype and the effect ({delta}2) of the Ler deletion haplotype from the FRI and SDV association test by the six models. Here {delta}1 and {delta}2 are estimated by the mean difference in flowering time between the wild-type group and the Col deletion group, the wild-type group and the Ler deletion group, respectively. Noted here, SDV is a type of flowering time for vernalized plants, hence SDV and FRI, a gene for non-vernalized plants, are expected to be unrelated. The 95% credible sets for {delta}1 covered 0 for all models except the SA model. The 95% credible sets for {delta}2 covered 0 for all six models, as expected. Figure 1 shows the residual predictive density estimates from the SNP S1.415 and SDV association test by the six models. The three semiparametric models provide a better predictive density of the residuals than the three corresponding parametric models.


Figure 1
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The histograms and predictive densities of the residuals from the SNP S1.415 and SDV association test by the SA, SPSA, PCA, SPPCA, MDS and SPMDS models. The y-axis is the residual predictive density.

 
To compare the false positive rates among the six models, we randomly selected 500 random SNP haplotype makers with two haplotypes, 0 and 1, for each SNP. These SNPs are distributed over the whole A. thaliana genome and have minor allele frequencies >5%. We applied the six models to each of the 500 random SNPs for testing the association with SDV. We estimated the false positive rate which is defined as 500p with p being the proportion of 500 random SNPs whose (1–{alpha}) credible sets of the SNP regression coefficients excluded 0 with {alpha} = 0.05 and 0.01, respectively. As shown in Table 2, the type I error rate at {alpha} = 0.05 is the largest for the SA model (10.5%), followed by the PCA model (8.0%) and the MDS model (7.8%), and the SPSA model has the smallest type I error rate (5.2%), followed by SPMDS model of 5.7%, and SPPCA model of 5.8%. Compared to the parametric models, the semiparametic models show an average of a 36% reduction in the type I error rate. At {alpha} = 0.01, the SPMDS and SPPCA have a type I error rate of 1.2%, SPSA model has a type I error rate of 1.5%, again close to the expected value of 1%. In contrast, the three parametric models which assume normally distributed error have a type I error rate substantially higher than the expected value of 1%.


View this table:
[in this window]
[in a new window]

 
Table 2. Comparisons of the SA, SPSA, PCA, SPPCA, MDS and SPMDS models in power to detect the association between simulated QTN and SDV and the mean SD of the regression coefficients for QTNs (Table 2s) from simulation study I

 
2.6 Simulation study I
Our goal was to simulate a dataset with the same covariate structure as the data from the previous section given a true complex genetic correlation structure among individuals. Our simulations are similar to those carried out by Yu et al. and Zhao et al. in their association studies. Underlying the phenotype simulation, we simulated 500 Arabidopsis flowering genetic loci as follows. For each of the randomly selected 500 SNPs in Section 2.5, a fixed additive genetic effect ranging from {kappa} = 0.1s to 1.05s was added, here s = Formula , where Formula is the usual unbiased estimate of the variance, assuming lines are independent. If we let p be the sample minor allele frequency of a SNP, and n = 95 be the sample size, then the percentage ({zeta}) of the total phenotypic variation explained by this fixed genetic effect can be approximated by {zeta} = p(1–p){kappa}2/(p(1–p){kappa}2+1–1/n) (Long and Langley, 1999; Yu et al., 2006). To each of the 500 random SNPs, we have six simulated datasets; each with the added genetic effect expressed as one of the values 1%, 2.5%, 5%, 7.5%, 10% and 15%. We denoted those simulated flowering time genetic loci as quantitative trait nucleotides (QTN). For each of 3000 simulated datasets i.e. six genetic effects by 500 SNPs, we applied the six models with the same Bayesian implementation as in the analysis from Sections 2.12.22.3. For each model and each genetic effect, we calculated the power as 500p with p being the proportion of the 500 QTNs whose (1–{alpha}) credible set for regression coefficient excluded 0 with {alpha}=0.05 and {alpha}=0.01, respectively. As shown in Table 2, while the power steadily increased as the simulated genetic effect increased from 1% to 15% for all models, the three semiparametric models showed consistently higher power than their corresponding parametric models for all the cases considered. At {alpha}=0.05, for example, compared to the SA model, the SPSA model showed a 0.1–87.5% increase in power for {zeta}=1–15%, while compared to the MDS, the SPMDS showed a 3.5–85.4% increase in power. We also estimated the SD of the regression coefficients corresponding to QTNs. The SA model has the largest mean SD (10.11) averaged across the 500 QTNs and 6 gene effects, followed by the PCA of 9.62, MDS model of 9.57, the three semiparmetric models have smaller SD than any of the parametric models (Table 2).

2.7 Simulation study II
Our second simulation studies are similar to those carried out by Zhang et al. (2006) in their association mapping studies. One limitation of the simulations based on coalescent models is that these models may not represent population evolutionary histories accurately (Zhang et al., 2006). Therefore, our simulations are based on empirical population genetic data. We assume that sampled individuals were genotyped at a series of unlinked SNP loci with two alleles A and a. There are three genotypes at a SNP marker: aa (or –1), Aa (or 0) and AA (or 1). We also assume that the population under study is a continuous admixture of two ancestral human populations including Biaka and Danes. The allele frequencies of 500 unlinked SNPs with minor allele frequency >5% across the two ancestral populations were extracted from a population genetics database ALFRED (Rajeevan, 2003) and we repeated these 500 SNPs 2 times as if we had 1000 SNPs. First, we generated 1000 independent null SNPs with n = 200 individuals, those 1000 null SNPs will be used to estimate population structure. Let pi1, pi2 represent the probabilities that allele A of the ith individual originated from Biaka and Danes, respectively, and pi1 + pi2 =1 for i =1,..., n. We simulate 180 out of 200 individuals as follows. We assume pil ~ Uniform (0,1) independently for each individual i and each SNP l. Therefore, the allele frequency of allele A at marker l for individual i can be written as

Formula 4(4)
where ql1, ql2 are the allele frequencies of allele A at marker l in Biaka and Danes, respectively. Individual i was assigned genotype –1, 0 or 1 at marker l with probabilities (1 pil)2, 2pil(1 – pil), pFormula, respectively for i = 1,..., 180 and l = 1,..., 1000. For the remaining 20 individuals, we assume they are from two genetically related groups with genetic correlation between individuals being {rho}=0.8 for the first group and 0.6 for the second group, and genetically independent between the two groups. For each of the two groups, we assume p ~ Uniform (0,1), then pi = p + {varepsilon}i for i = 1,..., 10, where {varepsilon}i is a small quantity generated from N(0,0.001) constrained pi > 0, so that the individuals within a group have similar ancestry background. Genotypes –1, 0 or 1 at marker l for 10 individuals are generated according to a multivariate binomial distribution with marginal probabilities being(1 – pil)2, 2pil(1 – pil), pFormula (estimated by Equation 4) and the correlation between any pair of individuals being {rho} (Leisch et al., 1998). Next, we simulated 500 additive genetic loci using the simulated null SNP genotype data. We randomly selected 500 out of 1000 SNPs, let xil be the genotype score of individual i at marker l, let µ01 = 10, and µi0 = µ01pi1, the quantitative trait values are generated according to the following model:

Formula 5(5)
for i = 1,..., n and l = 1,..., 500, where {varepsilon}il is a random variable distributed as a skew-normal with mean 0.073 and variance 0.873 i.e. Formula where {theta} = –1.1, {nu}=1.5, {lambda}=5, {varphi}(·) and {Phi}(·) denote the standard normal pdf and cdf, respectively. Note that the error distribution is a unimodal skewed distribution, its median is equal to 0, up to two decimal points. We set µ1=0 and 0.4 for comparing type I error rates and powers, respectively.

We applied the six models to each of the 500 datasets at µ1 = 0 and 500 datasets at µ1 = 0.4 with the same Bayesian implementation as in the analysis from Sections 2.12.22.3. As shown in Table 3, the false positive rates are similar between the three parametric models and the three semiparametric models, which are not significantly different from the expected value of 0.05 at {alpha} = 0.05 [95%CIp=0.05 is (0.0309, 0.0691)] or 0.01 at {alpha}=0.01 [95%CIp = 0.01 is (0.0013, 0.0187)]. The powers for the semiparimetric models are higher than those for the parametric models. On the average, the SD for the candidate gene regression coefficients are smaller for the semiparametric models than for the parametric model. Furthermore, the semiparametric models provide a smaller bias for the candidate gene regression coefficients than the parametric models on the average.


View this table:
[in this window]
[in a new window]

 
Table 3. Comparisons of the SA, SPSA, PCA, SPPCA, MDS and SPMDS models in type I error rate, power, average of biases of candidate gene regression coefficient Table 3 and the mean SD of Table 3s from simulation study II

 

    3 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 CONCLUSIONS
 ACKNOWLEDGEMENT
 REFERENCES
 
The current available quantitative trait association mapping methods often assume that the errors are normally distributed. The normality assumption of the underlying error distributions greatly simplifies the form of the likelihood. However, it may be unrealistic. The normality assumption, if violated, may lead to false association detection and a reduction in power (Morton, 1984). The most commonly adopted method to achieve normality involves log-transforming the data. Although this method may give reasonable empirical results, it should be avoided if a more suitable theoretical model can be found (Azzalini and Capitanio, 1999). Furthermore, we applied the MDS model to the log-transformed dataset, and the type I error rate from 500 random SNPs and SDV association tests was much higher (28% or higher) than that for the SA, MDS, PCA, SPSA, SPPCA and SPMDS models using the untransformed data as we presented in this article. Moreover, the semiparimetric model using a MPTs to model the residual error is more flexible and it can be applied to a more broad range of datasets than the log-transformation method or other types of data transformation methods.

In this article, we have developed a semiparametric test to detect associations between candidate markers and quantitative traits using population-based data. Our model is advantageous in terms of the following aspects: (1) the error distribution is flexibly modeled as a MPTs centered around a family of normal distributions, hence important, data-driven features, such as skewness and multimodality, are allowed; and (2) both population structure and kinship are included in the model to adjust for continuous population stratification. In comparison to Yu's mixed model, the two aspects in our model for uncovering population stratification are orthogonal, so the effect of kinship on the phenotypic variation is estimated after taking the effects of population structure into account. Moreover, orthogonalizing the population structure and kinship reduces multicollinearity problems. Multicollinearity may lead to a large standard error for the regression coefficients, hence it can reduce the power to detect a true association between a candidate gene and a trait of interest. Compared to the parametric SA, PCA and MDS methods, the semiparametric methods demonstrate several advantages. An application to the real dataset of 95 Arabidopsis lines for the flowering time demonstrated that the semiparametric models fit better the data than the three parametric models by comparing pseudo Bayes factor values. Moreover, the semiparametric models achieve a lower rate of false positive associations than the three parametric models; the semiparametric models also have a better precision than the parametric models. Our simulation results show that the semiparametric models demonstrate higher power and better accuracy compared to three parametric models. Therefore, for the current dataset, the proposed semiparametric models are preferred, offering sufficient improvement in goodness of fit, higher power and better precision in detecting gene effects to offset the increased computational complexity.

Our approach uses both inferred population structure and kinship to adjust for continuous population structure. However, for samples under Hardy–Weinberg equilibrium (HWE) within each subpopulation, the population structure alone is sufficient for adjusting for population structure. If HWE does not hold within each subpopulation, there will be multilevel-genetic correlation among the sampled individuals. In this case, the use of population structure alone will have less power, i.e. both K and population structure will be needed. The limitation of our algorithm is the computational expense incurred due to the use of Markov chain Monte Carlo (MCMC) in the inference for association stage. For example, for a sample of 100 individuals, our algorithm takes several minutes on a 2.13 GHz Pentium 4 with 1.99 GB of RAM PC for one candidate gene compared to several seconds for the other parametric methods (since MCMC is unnecessary for those methods). Our algorithm is more suited for candidate gene mapping; however, it is still possible for genome-wide scans where millions of genetic markers are tested for association with a phenotype if parallel computing is used. Finally, incorporating the uncertainty in inferring population structure in estimating association at a candidate locus might be theoretically superior (Zhang et al., 2006), however, the approach can be computationally enormous. It is almost an impossible task for association tests involving hundreds of candidate genes if a 1000 random genetic markers are included in the model for inferring population structure.


    ACKNOWLEDGEMENT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 CONCLUSIONS
 ACKNOWLEDGEMENT
 REFERENCES
 
The authors would like to thank Dr Magnus Nordborg and his group at University of Southern California for kindly providing all the flowering data and the sequence data from 95 Arabidopsis lines.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on April 28, 2008; revised on August 20, 2008; accepted on August 21, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 CONCLUSIONS
 ACKNOWLEDGEMENT
 REFERENCES
 

    Aranzana MJ, et al. Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genes. PLoS Genet. (2005) 1:e60.[CrossRef][Medline]

    Azzalini A, Capitanio A. Statistical applications of the multivariate skew-normal distribution. J. R. Stat. Soc. Ser. B (1999) 61:579–602.[CrossRef]

    Bacanu S, et al. Association studies for quantitative traits in structured populations. Genet. Epidemiol. (2002) 22:78–93.[CrossRef][Web of Science][Medline]

    Devlin B, Roeder K. Genomic control for association studies. Biometrics (1999) 55:997–1004.[CrossRef][Web of Science][Medline]

    Devlin B, et al. Genomic control, a new approach to genetic-based association studies. Theor. Popul. Biol. (2001) 60:155–166.[CrossRef][Web of Science][Medline]

    Ferguson TS. Prior distributions on spaces of probability measures. Ann. Stat. (1974) 2:615–629.[CrossRef]

    Geisser S, Eddy WF. A predictive approach to model selection. J. Am. Stat. Assoc. (1979) 74:153–160.[CrossRef][Web of Science]

    Han C, Carlin BP. Markov chain Monte Carlo methods for computing Bayes factors: a comparative review. J. Am. Stat. Assoc. (2001) 96:1122–1132.[CrossRef][Web of Science]

    Hanson T. Inference for mixtures of finite Polya tree models. J. Am. Stat. Assoc. (2006) 101:1548–1565.[CrossRef][Web of Science]

    Hardy OJ, Vekemans X. SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Mol. Ecol. Notes (2002) 2:618–620.[CrossRef][Web of Science]

    Hastbacka J, et al. Linkage disequilibrium mapping in isolated founder populations: diastrophic dysplasia in Finland. Nat. Genet. (1992) 2:204–211.[CrossRef][Web of Science][Medline]

    Hoggart CJ, et al. Control of confounding of genetic associations in stratified populations. Am. J. Hum. Genet. (2003) 72:1492–1504.[CrossRef][Web of Science][Medline]

    Jakobsson M, et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature (2008) 451:998–1003.[CrossRef][Web of Science][Medline]

    Kang HK, et al. Efficient control of population structure in model organism association mapping. Genetics (2008) 178:1709–1723.[Abstract/Free Full Text]

    Kass RE, Raftery AE. Bayes factors. J. Am. Stat. Assoc. (1995) 90:773–795.[CrossRef][Web of Science]

    Kelker D, Kelker H. The effect of skewness on selection in a plant breeding program. Euphytica (1986) 99:33–54.

    Lacher D. Interpretation of laboratory results using multidimensional scaling and principal component analysis. Ann. Clin. Lab. Sci. (1987) 17:412–417.[Abstract]

    Lander ES, Schork NJ. Genetic dissection of complex traits. Science (1994) 99:33–54.

    Lavine M. Some aspects of Polya tree distributions for statistical modeling. Ann. Stat. (1992) 20:1222–1235.[CrossRef]

    Lee I, Amasino RM. Effect of vernalization, photoperiod, and light quality on the flowering phenotype of arabidopsis plants containing the FRIGIDA gene. Plant Physiol. (1995) 108:157–162.[Abstract]

    Leisch F, et al. On the generation of correlated artificial binary data. In: Working Article Series, SFB ‘Adaptive Information Systems and Modelling in Economics and Management Science’ (1998) Vienna University of Economics.

    Lettre G, et al. Genetic model testing and statistical power in population-based association studies of quantitative traits. Genet. Epidemiol. (2007) 31:358–362.[CrossRef][Web of Science][Medline]

    Long AD, Langley CH. The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res. (1999) 9:720–731.[Abstract/Free Full Text]

    Morton NE. Trials of segregation analysis by deterministic and macro simulation. In: Human Population Genetics: The Pittsburgh Symposium—Chakravarti A, ed. (1984) New York: Van Nostrand Reinhold. 83–107.

    Nordborg M, et al. The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol. (2005) 3:e196.[CrossRef][Medline]

    Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. (2006) 38:904–909.[CrossRef][Web of Science][Medline]

    Pritchard JK, Donnelly P. Case-control studies of association in structured or admixtured populations. Theor. Popul. Biol. (2001) 60:227–237.[CrossRef][Web of Science][Medline]

    Pritchard JK, et al. Inference of population structure using multilocus genotype data. Genetics (2000) 155:945–959.[Abstract/Free Full Text]

    Rajeevan H. ALFRED the ALlele FREquency Database update. Nucleic Acids Res. (2003) 31:270–271.[Abstract/Free Full Text]

    Redden DT, et al. Regional admixture mapping and structured association testing: conceptual unificcation and an extensible general linear model. PLoS Genet. (2006) 2:e137.[CrossRef][Medline]

    Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet. Epidemiol. (2001) 20:4–16.[CrossRef][Web of Science][Medline]

    Ritland K. Estimators for pairwise relatedness and individual inbreeding coefficients. Genet. Res. (1996) 67:175–185.[Web of Science]

    Rogers JS. Measures of genetic similarity and genetic distance. Studies in genetics, VII. Univ. Tex. Publ. (1972) 2713:145–153.

    Tommasini L, et al. Association mapping of Stagonospora nodorum blotch resistance in modern European winter wheat varieties. TAG Theor. Appl. Genet. (2007) 115:697–708.[CrossRef]

    Wickelmaier, F. (2003) An Introduction to MDS. Available at http://perception.inrialpes.fr/~Arnaud/indexation/mds03.pdf.

    Wessel J, Schork NJ. Generalized genomic distance based regression methodology for multilocus association analysis. Am. J. Hum. Gen. (2006) 79:792–807.[CrossRef][Web of Science][Medline]

    Yu J, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. (2006) 38:203–208.[CrossRef][Web of Science][Medline]

    Zhang S, et al. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet. Epidemiol. (2003) 24:44–56.[CrossRef][Web of Science][Medline]

    Zhang L, et al. Bayesian modeling for genetic association in case-control studies: accounting for unknown population substructure. Stat. Modelling (2006) 6:352–372.[CrossRef]

    Zhao K, et al. An arabidopsis example of association mapping in structured samples. PLoS Genet (2007) 3:e4.[CrossRef][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/20/2356    most recent
btn455v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Li, M.
Right arrow Articles by Hanson, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, M.
Right arrow Articles by Hanson, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?