Bioinformatics Advance Access originally published online on January 12, 2006
Bioinformatics 2006 22(5):573-580; doi:10.1093/bioinformatics/btk050
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Multiple association analysis via simulated annealing (MASSA)
1Institut Catalá de Reçerca i Estudis Avançats Pg Lluis Companys 23, 08010 Barcelona, Spain
2Departament de Ciència Animal i del Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona 08193 Bellaterra, Spain
| ABSTRACT |
|---|
|
|
|---|
Summary: Genome-wide association studies are now technically feasible and likely to become a fundamental tool in unraveling the ultimate genetic basis of complex traits. However, new statistical and computational methods need to be developed to extract the maximum information in a realistic computing time. Here we propose a new method for multiple association analysis via simulated annealing that allows for epistasis and any number of markers. It consists of finding the model with lowest Bayesian information criterion using simulated annealing. The data are described by means of a mixed model and new alternative models are proposed using a set of rules, e.g. new sites can be added (or deleted), or new epistatic interactions can be included between existing genetic factors. The method is illustrated with simulated and real data.
Availability: An executable version of the program (MASSA) running under the Linux OS is freely available, together with documentation, at http://www.icrea.es/pag.asp?id=Miguel.Perez
Contact: miguel.perez{at}uab.es
| 1 INTRODUCTION |
|---|
|
|
|---|
Linkage disequilibrium (LD) genetic mapping is currently a very active research area, since more classical pedigree-based techniques are not suitable for finding the causal mutations. However, traditional LD techniques have been oriented towards fine mapping and thus show optimum properties only when applied to small genomic regions. By focusing only on small regions, LD methods can fail to identify interaction between loci or may result in inaccurate estimates because variation at other loci outside the region studied are not accounted for. Thus, although LD-based techniques for fine-scale gene mapping have a long history, their development and application to genome-wide studies are less well developed (Lin et al., 2004a; Marchini et al., 2005). It is important to develop new methods that specifically account for this challenging increase in data complexity. By searching globally across the genome, the probabilities of detecting loci with small effect and/or epistatic effects are increased (Marchini et al., 2005). Thus, we expect this approach to provide a more complete view of the trait genetic architecture, compared with classical strategies.
Recently Marchini et al. (2005) have investigated several approaches to uncover epistatic relations between loci, but limiting their study to, at most, pairs of loci analyzed simultaneously. In contrast, the multifactor-dimensionality reduction approach (Ritchie et al., 2001) was developed for seeking high order interactions; nonetheless, this approach is unpractical unless the number of markers is very small. Lin et al. (2004a) proposed an exhaustive transmission disequilibrium test that enumerates all contiguous haplotypes of all possible sizes for a set of polymorphisms, but the method is limited to contiguous haplotypes, i.e. neither epistasis nor more than one causal mutation are considered.
Epistasis has been long recognized as a potentially important phenomenon in genetics (Crow and Kimura, 1970; Wolf et al., 2000); it is found to influence disease and quantitative traits across all organisms, ranging from RNA viruses (Sanjuán et al., 2004a) or yeast (Segre et al., 2005) to human (Ukkola et al., 2001) and chicken (Carlborg et al., 2003). Furthermore, whole genome scanning for epistasis is becoming nowadays an important area of research as a consequence of new experimental strategies (Segre et al., 2005) and statistical techniques (Carlborg and Haley, 2004).
The objective of this work is to introduce a method tailored to identify the causal mutations for a trait, or traits, assuming that a (very) large number of genotyped single nucleotide polymorphisms (SNPs) are available. In a general manner, the problem can be posed as follows: What is the genetic model that best explains the data? Ideally, the number of assumptions made in searching for the true model has to be kept the lowest possible. To gain maximum flexibility, we propose the use of mixed model techniques and allow any number of loci and interaction levels between them. The optimum model is found using simulated annealing and defining a series of movements that permits shifting from the current to a new configuration, as detailed below. This multiple association analysis via simulated annealing (MASSA) will be illustrated with simulated and real data.
| 2 METHODS |
|---|
|
|
|---|
Suppose a number of SNPs (1, ... , s) is genotyped for n individuals, s can be >>n, and a trait (or series of traits) has been recorded on the individuals. For the sake of simplicity, we will assume that phases have been identified. A convenient and general explicative model for the phenotypes is
![]() | (1) |
But of course the problem is that neither the number of causal factors nor the matrices Zj are known and one needs to choose among many competing models. The two sources of uncertainty are as follows: Which are the causal loci and what are the interactions between them? For notational convenience, here we unambiguously characterize a model with an integer vector of length s, where a zero value at the i-th position means that the i-th SNP is not included in the model and a value j from one through q means that the i-th SNP is included in the j-th genetic factor. Take three SNPs for illustration, the list of possible models, excluding the null model [000], is [100], [010], [001], [110], [011], [101], [111], [120], [102], [012], [112], [121], [122] and [123]. A typical genome scan is limited to the first three instances, in our example, [100], [010] and [001], thus leaving unexplored the vast majority of possible models. Note that two different SNPs harboring the same genetic factor code j means epistasis between them. Similarly, contiguous SNPs with the same factor number means that the whole haplotype is considered rather than each individual effect. Thus, there is no formal distinction between contiguous and non-contiguous SNP epistasis in our modelisation. For instance model [121] means that there is epistasis between SNPs 1 and 3 (locus 1) together with a second locus (SNP 2) affecting the trait. Under epistasis, the effect of having, say, allele 1 at SNP 1 will depend on the allele at the interacting SNP(s), thus one needs to estimate the effect for each allelic combination and the number of levels for the genetic factor is increased compared with that of additivity. The way MASSA handles epistasis is illustrated in Table 1.
|
It is evident that enumerating all possible models becomes impossible even for relatively small number of SNPs: there are over 1.6 million different models with just 10 SNPs. The goal of our work is to determine which model best explains the data in a reasonable finite computing time while keeping in mind the following two questions: (1) How to measure the validity of the model? (2) How to ensure that the best model has been chosen without explicitly enumerating all possible models?
As not all possible models are nested, likelihood ratio tests cannot be employed. Thus we resorted to the Bayesian information criterion (BIC, also called Schwartz criterion), defined as BIC = 2logl + k logN, where l is the likelihood for the current model, k, the number of parameters and N, the number of records. The goal is to find the model with the lowest BIC. The reader is referred to the ample literature on the properties of other criteria like Akaike's (Burnham and Anderson 2004; Hastie et al., 2001). The difference between two BIC values can be used to approximate the logarithm of the Bayes factor between the two models (Hastie et al., 2001). Thus, assuming uniform priors for each competing model, the posterior probability of i-th model (i = 1, K) is, approximately,
![]() | (2) |
BICi is the difference between the BIC of the i-th model and the minimum BIC. | 3 IMPLEMENTATION |
|---|
|
|
|---|
The Qxpak package (Pérez-Enciso and Misztal, 2004) was used to obtain the log-likelihood and parameter estimates, adapted to the particularity of the genetic Zj matrices, which contain two non-zero elements for each individual's row (an incidence matrix normally contains only one non-zero element per row per effect). In the absence of additional random effects, this is equivalent to a multiple regression procedure.
We relied on simulated annealing (Kirkpatrick et al., 1983) to find the model with minimum BIC. Suppose that current model has BICc, and a new model with associated BICnew is proposed. The probability of accepting the new model is Paccept = Min{1, exp[(BICnew BICc)/T ]}, where T is the annealing temperature. In our implementation, the initial temperature was 10 and decreased by a factor of 5% every m iterations, with m typically between 400 and 3000. Preliminary trials suggested the method was robust to departures from these parameters, and the initial temperature was high enough to have an initial acceptance rate >50%, as suggested in the literature. The number of cycles, i.e. different temperatures, ranged between 40 and 60. The program terminated when the temperature was arbitrarily low or when no new solution was accepted within the same temperature cycle.
The following series of movements were defined to propose new models (Figure 1):
- In: a new site with position randomly chosen is inserted.
- Out: one of the existing genetic factors is deleted.
- Join: two distinct factors are fused (epistasis is created).
- Split: a factor is split into two (if it contains more than one site).
- Glue: two consecutive factors are fused, i.e. combined into one.
- Grow: additional sites are added to the current factor.
- Shave: the most external sites from a factor are deleted.
- Shift: the factor position is shifted with direction (upstream or downstream) randomly chosen.
|
The probability with which a new movement is proposed was not constant. For instance, the frequency of proposing In (Out) was inversely (directly) proportional to the number of factors in the current model. The factor deleted was randomly chosen, as was the position of insertion of the new factor. Shave was more likely proposed for factors comprising many SNPs, conversely, Grow was more frequently applied to single SNP factors.
During the simulated annealing process, the best 20 solutions found are saved. This allows judging whether the global maximum has been reached but, importantly, also to quantitatively asses the strength in favor of the optimum model and carry out approximate model-averaged estimates:
![]() | (3) |
is the weighted parameter estimate,
is the estimate obtained with the i-th model and P(Mi|y) is the probability of model i given the data, which is proportional to exp(BICi/2) when equal priors are assumed for each model. Because the whole methodology that follows from Equation (1) is based on mixed model theory, modeling is very flexible, e.g. any number of effects, fixed or random, can be included in the model, or multitrait models can be run (so far, the same model must be fitted to all traits). Similarly, several thousand markers or individuals can potentially be studied because efficient sparse matrix techniques are employed to maximize the likelihood (Misztal and Perez-Enciso 1993).
| 4 EMPIRICAL PROPERTIES: SIMULATION STUDY |
|---|
|
|
|---|
4.1 Simulated data
Data generated by programs from Lin et al. (2004a) were used for the simulation study. Each dataset consisted of a population of
1000 independent individuals genotyped for 500 SNPs. The data were simulated using coalescence techniques with recombination and with parameters 4Ner = 2000, where Ne is the effective size and r is the recombination rate, nucleotide diversity was 4000. Each SNP was spaced at 10 kb, on average. The original data were unphased and the phase was reconstructed using the hap2 program (Lin et al., 2004b). Two site and three site epistasis models were considered, although only two site epistasis was analyzed in detail. For the two site epistasis, SNPs in positions 8 and 490 were chosen arbitrarily as causal SNPs, with phenotypes simulated conditional on observed genotypes at those SNPs. For our purposes, we retained N = 1000 (all), 500 or 250 of the original individuals in order to mimic different population sizes. Five genetic scenarios were considered. In the simplest scenario (Scenario 1), a continuous trait governed by two additive loci was simulated. A genotypic effect
was defined such that
![]() |
is the genetic variance; h2, the heritability and pj, the allele frequency at the j-th SNP. The phenotype of a given i-th individual was obtained by sampling from a normal distribution N(gi, 1), with gi being the individual's genetic value, e.g. g = 2
for an individual homozygous for the alleles increasing the mean, and so on. Scenarios 2 and 3 involved two kind of extreme epistasis, as detailed in Table 2: coadaptative epistasis and extreme diminishing epistasis. In the coadaptative epistasis, the phenotype is increased when the alleles at the two loci are from the same origin. As noted by Carlborg and Haley (2004), the identification of the loci involved in this kind of epistasis require simultaneous scanning for pairs of loci because the individual locus effect can be negligible. In the extreme diminishing epistasis, the dominant alleles are complementary, this case was discussed by Crow and Kimura (1970). In the fourth scenario (Scenario 4) a binary trait governed by two additive independent loci was simulated. Here the phenotype was truncated to integer values 1 and 2, quantitative phenotypes below the mean was assigned a value of 1, and those above, 2; this procedure ensures that the expected frequency of both the values is equal, to mimic a casecontrol study. Three values of h2 0.30, 0.15 and 0.05 were considered, although not all possible combinations of parameters were explored because of computing constraints. Finally, Scenario 5 consisted of a three site epistasis model: a binary trait took phenotype 1 if at least three alleles in any of the three sites were of type 1, e.g. genotypes 111/222 and 121/221 resulted in the same phenotype. The causal SNPs were in positions 8, 200 and 490. Note that no locus is neither necessary nor sufficient to cause the disease; this kind of inheritance has been described in complex diseases like BardetBiedl syndrome (Katsanis et al., 2001). A total of 30 replicates per case were run in all scenarios, each took between 2 and 5 h in a Pentium 4 PC under Linux OS.
|
In order to characterize the behavior of the method under the null hypothesis and assess power, we permuted the data 100 times and run MASSA with the three data sizes, N = 250, 500 and 1000 in Scenario 2 (a preliminary study indicated that results were highly comparable across scenarios). Power was estimated as the number of replicates that exceeded the 5% cutoff and we only retained these replicates for further study.
4.2 Results
The permutation test applied to MASSA allows the user to have a guideline as to how likely is getting a given result just by chance. The 5% cutoff for the difference in BIC between the null model (a model without any locus) and the model retained by MASSA was 10, 10.6 and 10.5 for N = 1000, 500 and 250, respectively. The different values obtained can be ascribed merely to chance, and thus we retained the 10.5 as threshold for all sizes. Figure 2 shows the estimated distributions, which were rather similar across N. As seen in Table 3, all replicates exceeded this threshold except those simulated under the coadaptative epistasis (Scenario 2). Table 3 summarizes the main results obtained with MASSA. For all parameters studied, the estimate obtained using the best model, together with the result obtained averaging across models [Equation (2)] are reported. As an example, Figure 3 shows the relative probabilities assigned to the best 10 models in the particular case of Scenario 2 (results were very similar across scenarios). In all cases studied, the relative probability of the best model ranged between 35 and 45%, and between 15 and 20% for the second best model. This does not mean that the estimates are equally valid across all scenarios and cases studied but, instead, that the credibility of the best model relative to the rest of competing models was comparable in all situations. All in all, the evidence in favor of the best model was reasonably high and, in fact, point estimates were very similar to model-averaged estimates (Table 3).
|
|
|
Regarding the simplest scenario, i.e. a quantitative trait controlled by two independent loci (Scenario 1), the chances of identifying both causal loci were very high (>80%) with MASSA, except for the lowest heritability, when it was 0.63. The chances of exactly identifying the true model (i.e. two independent loci at positions 8 and 490) are also remarkably high for the most favorable situation (N = 1000, h2 = 0.30), with
50% of replicates correctly identifying the true model. In
60% of replicates, all SNPs in the selected model were at <50 kb (five SNPs) from the causal SNPs for N = 1000. This frequency slightly decreased as the size of the experiment decreased. The presence of epistasis decreased the chances of including both causal loci in the model compared with the strictly additive Scenario 1, illustrating that epistasis always makes it more difficult the genetic dissection of complex traits. In general, Scenario 2 (adaptative epistasis) was a much less favorable setting than Scenario 3 (diminishing epistasis), but the former is also the most difficult epistasis to detect (Carlborg and Haley 2004). This was evidenced also by the largest number of replicates below the 5% threshold (see also Figure 2). Nevertheless, the likelihood of selecting at least one causal locus was often close to 100% (the sum of 2 and 1% columns in Table 3), except for very low h2, 0.05, confirming MASSA as a powerful tool for detecting gene interactions.
Regarding binary traits, the performance of MASSA was as good as for quantitative traits. The probability of identifying both causal loci was well >80% in all instances, except for the smallest dataset. Similarly, the chances of identifying the true or the approximate model were not affected and were quite similar to those in Scenario 1. Scenario 5 was a particular case because there were three interacting causal sites, and it is encouraging that epistasis between the three causal SNPs (or very nearby) were detected in almost all replicates.
Among the SNPs selected, there were some that were far from the causal SNPs (>50 kb) and thus were classified as false positives. Its percentage (last column in Table 3) varied from 10 to 30 for most cases, except in the lowest h2 in Scenario 2, which was close to 50%. In general, the associated P-values to these false positives were much higher (less significant) than for the true causal SNPs but, certainly, this result prompts caution on the problem of identifying the correct causal positions when the allele effects are small, even for relatively large samples. Scenario 5 is extreme: extra sites were retained in the final model and the exact true model was almost never retained, but again the significance associated to these redundant loci was far smaller (average P = 2 x 103) than the significance of the retained true causal loci (P << 1015).
| 5 REAL APPLICATION 1: GENETICAL GENOMICS |
|---|
|
|
|---|
5.1 Data
The number of genetical genomics studies is exploding owing to the huge amount of information provided. However, to realize its full potential, adequate statistical tools need to be employed. It is interesting to note that several experiments have already identified epistasis in these studies (Brem et al., 2005; Chesler et al., 2005). Here we applied our method to publicly available mouse data (Chesler et al., 2005). In brief, the data consist of expression records of
12 000 transcripts, obtained with Affymetrix chip U74Av2, from 35 recombinant B x D mouse lines that were genotyped for 779 non-redundant markers across the whole genome. As our principal interest was epistasis, we analyzed message levels from Grin2b gene. As noted by Chesler et al. (2005), it is known that this locus is modulated by loci on chromosomes 6 and 8; these authors applied an epistatic model to their data and found evidence of epistasis between a distal locus on chromosome 2 (c. 2) and a locus on distal chromosome 6 (c. 6).
5.2 Results
An initial run with our strategy uncovered several markers associated with message level, with sites D2Mt200 (c. 2), D5Mit180 (c. 5), D6Mit301 (c. 6) and D8Mit128 (c. 8) appearing in the majority of models, i.e. the three regions identified in previous studies plus a fourth locus in (c. 5; D5Mit180). In order to explore as exhaustively as possible all models, we allowed the program to fit only these markers. The five models with largest probability are shown in Table 4. The marginal probabilities of these markers appearing in the model were 0.57, 0.40, 1 and 0.84 for markers D2Mt200, D5Mit180, D6Mit301 and D8Mit128, respectively. Note that most models contain an epistatic interaction between chromosomes 6 and 8 (weighted P across best 20 models = 0.80), whereas the evidence in favor of epistasis between chromosomes 2 and 6 is weaker (weighted P = 0.26), confirming previous analyses.
|
| 6 REAL APPLICATION 2: VIRUS DATA |
|---|
|
|
|---|
6.1 Data
We have reanalyzed the data by Sanjuán et al. (2004a), who studied the contribution of epistasis to fitness of the vesicular stomatitis virus (VSV). To that end, these authors created large collection of single and double single-nucleotide substitution mutants by site-directed mutagenesis on an infectious cDNA. The fitness of each double mutant, as well as of each corresponding single mutant, were measured relative to their unmutated ancestor by means of competition experiments (Sanjuán et al., 2004a,b). In brief, both the mutant and unmutated ancestor were mixed at equal numbers and the mixture was used to innoculate a susceptible culture of baby hamster kidney fibroblast cells and fitness was estimated by comparing the growth rates of each genotype during the exponential growth phase. The VSV genome encodes for five proteins: N (nucleocapside that protects the RNA), P (phosphoprotein involved in replicase activity), M (virion inner matrix), G (glycoprotein forming the virion spikes that interact with cellular receptors) and L (the main component of the RNA polymerase, RpRd). In total, there were 33 mutated sites and 96 different genotypes for which fitness was assessed at least with 4-fold replication. Three genotypes were lethal and were excluded from the present analyses, leaving the total number of data as 93.
6.2 Results
The main interest of applying MASSA to this dataset was to investigate whether the method was able of identifying high order epistasis, rather than as an example with a large number of markers. For that purpose, we computed the frequency, weighted by the model probability [Equation (2)], that two sites interact (Table 5). Factors with probabilities of appearing in a model <60% were not included in the table. In total, five mutations selected from genes G and L, four for gene N, two for gene M and one for gene P have been selected. These 15 mutations were grouped into three main epistatic factors, which are marked with boxes of different shadow patterns in Table 5. All but two mutations selected were non-synonymous changes. The first epistatic factor, B1, comprises mutations G406U, U1323A, U1585A and U3615A, the first two mutations are located at the nucleocapside N gene, the third is in the minor component of the RdRp (gene P) and the fourth, at the glycoprotein G gene. In all cases, the probability associated with incorporating the mutation to B1 were larger than P
0.76. The second epistatic factor, B2, was formed by mutations C2607U, U2617G and C7287U (all with probabilities P
0.63). Finally, the third epistatic factor, B3, was constituted by mutations U877C, U3192C, A3802G, U4398C, G4459U, U5124A, G7123C and G7128U (in all cases, P = 1). By far, the largest and strongest epistatic interactions occurred within B3, which contains mutations in genes N (one), G (four), and L (three, including a synonymous change).
|
| 7 DISCUSSION |
|---|
|
|
|---|
One of the immediate challenges in the genetic analysis of complex traits will be to make sense out of the vast amount of data being generated by the HapMap and related projects. As the number of markers increases, so does the probability of genotyping polymorphisms that are either causal or located nearby and thus in strong disequilibrium with the relevant mutation(s). In this context, it becomes feasible to carry out genome scans using linkage disequilibrium even in outbreed populations. Thus far, the vast majority of methods have employed a linear scan consisting in assessing the marker or set of markers that shows maximum association with the trait moving from one extreme through the other of the genome. This strategy is simple and easy to implement but suffers from several drawbacks. First, the classical strategy of adding a new quantitative trait loci to the model to test for additional loci is likely to result in a decreased power when a very stringent significance threshold is set; in addition, every new locus requires recomputing the new positions for all loci. As the number of loci in the model increases, computational cost becomes insurmountable. Second, epistatic effects are likely to be missed. Even if a two-dimensional search is done, it is impossible to exhaustively search all orders of interactions between loci. The use of simulated annealing in MASSA overcomes these two problems as it is able of searching over many interaction orders between loci. MASSA is neither a genome scan nor a greedy search, because the algorithm slowly imposes more and more restrictions on the models to be explored. Nevertheless, other optimization strategies are available, notably genetic algorithms (Mitchell, 1998), but they tend to be much slower than simulated annealing. These alternatives will be considered in the future. The statistical framework used by MASSA, mixed model theory, is extremely flexible. Note that, in contrast to methods based in genome linear scans, we are interested in identifying the best model fitting the data (balanced by the number of parameters). The validity of the final model can be checked via permutation (as done here) or with alternative techniques like cross-validation.
MASSA employs a frequentist maximum likelihood strategy. Although the Bayesian approach is conceptually more rigorous than the BIC or AIC criteria (Yi et al., 2005), it is heavily demanding in terms of computing time and suffers from other drawbacks. For instance, each BIC computation took
0.01 s in a regular PC, whereas obtaining the Bayes factor needs at least 1 min of the CPU's time even with optimized algorithms (Varona et al., 2001). Besides, obtaining Bayes factors can be numerically unstable and posing uninformative priors (the fixed effects in the frequentist paradigm) also results in practical problems to the Bayesian framework. Thus, while implementing efficient and stable Bayesian approaches, using the BIC method can be a good compromise.
Other authors (Bogdan et al., 2004) have proposed modified BIC criteria to improve its properties but they require expliciting priors about the expected interacting terms, which can be somewhat abitrary and are also approximate. In the light of our results, in particular those of simulated Scenario 5 and of genetical genomics data, it can be suggested that MASSA is run in a two-step manner, the first run can be used to pick up the most relevant loci, which will be used in the second run. Manual inspection of selected models together with biological external data should be employed to draw final conclusions.
We have assumed that phases are known. Nevertheless, this is rarely the case in practice. Three options can then be followed: (1) considering unphased genotypes, this is equivalent to having three alleles (the three genotypes) per site, which is the procedure currently implemented in MASSA; (2) estimating the phases with a statistical procedure (Li and Stephens 2003; Lin et al., 2004b) and taking the most likely phase or (3) estimate via Monte Carlo Markov chain methods the phases in each new simulated annealing iteration and using the current phase configuration, note that only the phases of SNPs included in the model need to be dealt with so the CPU time should not be increased much. The latter would be the most desirable strategy but it is also the most difficult to implement.
In their original report, Sanjuán et al. (2004b) estimated the strength and sign of epistasis between nucleotide substitutions as the difference between observed and expected (multiplicative) fitnesses for each double mutant in their collection. A good test of the performance of MASSA would be to compare the results it produced with the experimental determinations of epistasis. Three out of four mutations belonging to B1 were experimentally qualified as involved in significant interactions with other alleles outside B1 (U1585A being the odd case). Interactions between two out of three members of B2 (C7287U was not) were also experimentally detected to be involved in significant interactions outside B2. Finally, five out of eight members of B3 were qualified as significantly interacting with other sites outside the epistatic factor (U877C, A3802G and U4398C were not).
With regard to the genetical genomics study, the distribution of model probabilities is flatter than, for example, that in Figure 3, given the small dataset, and thus results should be interpreted with caution. Nevertheless, our reanalysis confirmed that sites on chromosomes 6 and 8 are very important to the expression of Grin2b gene (the significance of the interaction was P < 1015), whereas the evidence related to the epistasis between chromosomes 2 and 6 is less clearcut. Finally, we note that locus D5Mit180 may also play a role to be confirmed with larger datasets.
An important result of the simulation study is that, although this method has optimum statistical properties for normally distributed traits, its performance with binary traits was equally satisfactory, at least when the underlying model is the threshold model (Table 2). This concords with long existing simulations and experimental results in animal breeding showing that the advantage of specific threshold affected/non-affected) is very rare and the size of the experiment is small (Meijering and Gianola, 1985). This result makes it possible to apply MASSA directly to binary traits.
| Acknowledgments |
|---|
The author is grateful to Santiago Elena, Rob Williams, Rachel Brem and Dave Cutler and their co-workers for sharing their data and software, with special thanks to Santiago Elena for comments and discussions. The author is also indebted to Jesús Fernández for guidelines about the simulated annealing procedure, and to the referees for their valuable suggestions. Work funded by grants AGL2004-00103GAN and Acción especial en genómica y proteómica GEN2003-20658 (Ministerio de Educación y Ciencia, Spain). Part of these simulations were carried out at the supercomputing facilities of CESCA (Barcelona, Spain).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Frank Dudbridge
Received on October 20, 2005; revised on January 5, 2006; accepted on January 9, 2006
| REFERENCES |
|---|
|
|
|---|
Bogdan, M., et al. (2004) Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics, 167, 989999
Brem, R.B., et al. (2005) Genetic interactions between polymorphisms that affect gene expression in yeast. Nature, 436, 701703[CrossRef][Medline].
Burnham, K.P. and Anderson, D.R. (2004) Understanding AIC and BIC in model selection. Soc. Meth. Res, . 33, 261304.
Carlborg, O. and Haley, C.S. (2004) Epistasis: too often neglected in complex trait studies? Nat. Rev. Genet, . 5, 618625[CrossRef][ISI][Medline].
Carlborg, O., et al. (2003) A global search reveals epistatic interaction between QTL for early growth in the chicken. Genome Res, . 13, 413421
Chesler, E.J., et al. (2005) Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat. Genet, . 37, 233242[CrossRef][ISI][Medline].
Crow, J.F. and Kimura, M. An Introduction to Population Genetics Theory, (1970) , New York Harper & Row.
Hastie, T., Tibshirani, R., Friedman, J.H. The Elements of Statistical Learning, (2001) , New York Springer Verlag.
Katsanis, N., et al. (2001) Triallelic inheritance in BardetBiedl syndrome, a Mendelian recessive disorder. Science, 293, 22562259
Kirkpatrick, S., et al. (1983) Optimization by simulated annealing. Science, 220, 671680
Li, N. and Stephens, M. (2003) Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165, 22132233
Lin, S., et al. (2004a) Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat. Genet, . 36, 11811188[CrossRef][ISI][Medline].
Lin, S., et al. (2004b) Haplotype and missing data inference in nuclear families. Genome Res, . 14, 16241632
Marchini, J., et al. (2005) Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet, . 37, 413417[CrossRef][ISI][Medline].
Meijering, A. and Gianola, D. (1985) Observations on sire evaluation with categorical data using heteroscedastic mixed linear models. J. Dairy Sci, . 68, 12261232
Misztal, I. and Perez-Enciso, M. (1993) Sparse matrix inversion in restricted maximum likelihood estimation of variance components by expectation-maximization. J. Dairy Sci, . 76, 1479[Abstract].
Mitchell, M. An Introduction to Genetic Algorithms, (1998) , Cambridge, Massachussets MIT Press.
Pérez-Enciso, M. and Misztal, I. (2004) Qxpak: a versatile mixed model application for genetical genomics and QTL analyses. Bioinformatics, 20, 27922798
Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet, . 69, 138147[CrossRef][ISI][Medline].
Sanjuán, R., et al. (2004a) The contribution of epistasis to the architecture of fitness in an RNA virus. Proc. Natl Acad. Sci. USA, 101, 1537615379
Sanjuán, R., et al. (2004b) The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus. Proc. Natl Acad. Sci. USA, 101, 83968401
Segre, D., et al. (2005) Modular epistasis in yeast metabolism. Nat. Genet, . 37, 7783[CrossRef][ISI][Medline].
Ukkola, O., et al. (2001) Interactions among the glucocorticoid receptor, lipoprotein lipase and adrenergic receptor genes and abdominal fat in the Quebec Family Study. Int. J. Obes. Relat. Metab. Disord, . 25, 13321339[CrossRef][ISI][Medline].
Varona, L., et al. (2001) Bayes factors for detection of quantitative trait loci. Genet. Sel. Evol, . 33, 133152[CrossRef][ISI][Medline].
Wolf, J.B., Brodie, E.D., Wade, M.J. Epistasis and the Evolutionary Process, (2000) , Oxford Oxfor University Press.
Yi, N., et al. (2005) Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics, 170, 13331344
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






