Skip Navigation


Bioinformatics Advance Access originally published online on January 12, 2006
Bioinformatics 2006 22(5):573-580; doi:10.1093/bioinformatics/btk050
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/5/573    most recent
btk050v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pérez-Enciso, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pérez-Enciso, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Multiple association analysis via simulated annealing (MASSA)

M. Pérez-Enciso 1,2

1Institut Catalá de Reçerca i Estudis Avançats Pg Lluis Companys 23, 08010 Barcelona, Spain
2Departament de Ciència Animal i del Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona 08193 Bellaterra, Spain


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 

Summary: Genome-wide association studies are now technically feasible and likely to become a fundamental tool in unraveling the ultimate genetic basis of complex traits. However, new statistical and computational methods need to be developed to extract the maximum information in a realistic computing time. Here we propose a new method for multiple association analysis via simulated annealing that allows for epistasis and any number of markers. It consists of finding the model with lowest Bayesian information criterion using simulated annealing. The data are described by means of a mixed model and new alternative models are proposed using a set of rules, e.g. new sites can be added (or deleted), or new epistatic interactions can be included between existing genetic factors. The method is illustrated with simulated and real data.

Availability: An executable version of the program (MASSA) running under the Linux OS is freely available, together with documentation, at http://www.icrea.es/pag.asp?id=Miguel.Perez

Contact: miguel.perez{at}uab.es


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 
Linkage disequilibrium (LD) genetic mapping is currently a very active research area, since more classical pedigree-based techniques are not suitable for finding the causal mutations. However, traditional LD techniques have been oriented towards fine mapping and thus show optimum properties only when applied to small genomic regions. By focusing only on small regions, LD methods can fail to identify interaction between loci or may result in inaccurate estimates because variation at other loci outside the region studied are not accounted for. Thus, although LD-based techniques for fine-scale gene mapping have a long history, their development and application to genome-wide studies are less well developed (Lin et al., 2004a; Marchini et al., 2005). It is important to develop new methods that specifically account for this challenging increase in data complexity. By searching globally across the genome, the probabilities of detecting loci with small effect and/or epistatic effects are increased (Marchini et al., 2005). Thus, we expect this approach to provide a more complete view of the trait genetic architecture, compared with classical strategies.

Recently Marchini et al. (2005) have investigated several approaches to uncover epistatic relations between loci, but limiting their study to, at most, pairs of loci analyzed simultaneously. In contrast, the multifactor-dimensionality reduction approach (Ritchie et al., 2001) was developed for seeking high order interactions; nonetheless, this approach is unpractical unless the number of markers is very small. Lin et al. (2004a) proposed an exhaustive transmission disequilibrium test that enumerates all contiguous haplotypes of all possible sizes for a set of polymorphisms, but the method is limited to contiguous haplotypes, i.e. neither epistasis nor more than one causal mutation are considered.

Epistasis has been long recognized as a potentially important phenomenon in genetics (Crow and Kimura, 1970; Wolf et al., 2000); it is found to influence disease and quantitative traits across all organisms, ranging from RNA viruses (Sanjuán et al., 2004a) or yeast (Segre et al., 2005) to human (Ukkola et al., 2001) and chicken (Carlborg et al., 2003). Furthermore, whole genome scanning for epistasis is becoming nowadays an important area of research as a consequence of new experimental strategies (Segre et al., 2005) and statistical techniques (Carlborg and Haley, 2004).

The objective of this work is to introduce a method tailored to identify the causal mutations for a trait, or traits, assuming that a (very) large number of genotyped single nucleotide polymorphisms (SNPs) are available. In a general manner, the problem can be posed as follows: What is the genetic model that best explains the data? Ideally, the number of assumptions made in searching for the true model has to be kept the lowest possible. To gain maximum flexibility, we propose the use of mixed model techniques and allow any number of loci and interaction levels between them. The optimum model is found using simulated annealing and defining a series of movements that permits shifting from the current to a new configuration, as detailed below. This multiple association analysis via simulated annealing (MASSA) will be illustrated with simulated and real data.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 
Suppose a number of SNPs (1, ... , s) is genotyped for n individuals, s can be >>n, and a trait (or series of traits) has been recorded on the individuals. For the sake of simplicity, we will assume that phases have been identified. A convenient and general explicative model for the phenotypes is

Formula 1(1)
where y is the vector containing the phenotypic records, X, W and Z are incidence matrices, b is a vector with fixed effects (e.g. sex, age, race, etc.), u is a vector with random effects (e.g. an infinitesimal genetic effect, litter, etc.), and g contains the genetic values, that can be decomposed in a series of q genetic factors. A genetic factor can be a single causal SNP, a series of consecutive SNPs (haplotype) or, more generally, a series of interacting SNPs in different parts of the genome. Thus, we use the term factor because it is more general than the usual term ‘locus’. The matrix Zj links the individual to its two alleles for genetic factor j. The elements in g can be treated either as random or fixed. Here we report only the results when g was fixed because it resulted in overall better performance, but the software can handle both options. Conditional on matrices Zj and on the number of factors, q, it is trivial to estimate the parameters in Equation (1) using mixed model techniques.

But of course the problem is that neither the number of causal factors nor the matrices Zj are known and one needs to choose among many competing models. The two sources of uncertainty are as follows: Which are the causal loci and what are the interactions between them? For notational convenience, here we unambiguously characterize a model with an integer vector of length s, where a zero value at the i-th position means that the i-th SNP is not included in the model and a value j from one through q means that the i-th SNP is included in the j-th genetic factor. Take three SNPs for illustration, the list of possible models, excluding the null model [000], is [100], [010], [001], [110], [011], [101], [111], [120], [102], [012], [112], [121], [122] and [123]. A typical genome scan is limited to the first three instances, in our example, [100], [010] and [001], thus leaving unexplored the vast majority of possible models. Note that two different SNPs harboring the same genetic factor code j means epistasis between them. Similarly, contiguous SNPs with the same factor number means that the whole haplotype is considered rather than each individual effect. Thus, there is no formal distinction between contiguous and non-contiguous SNP epistasis in our modelisation. For instance model [121] means that there is epistasis between SNPs 1 and 3 (locus 1) together with a second locus (SNP 2) affecting the trait. Under epistasis, the effect of having, say, allele 1 at SNP 1 will depend on the allele at the interacting SNP(s), thus one needs to estimate the effect for each allelic combination and the number of levels for the genetic factor is increased compared with that of additivity. The way MASSA handles epistasis is illustrated in Table 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Allele recoding in MASSA

 
It is evident that enumerating all possible models becomes impossible even for relatively small number of SNPs: there are over 1.6 million different models with just 10 SNPs. The goal of our work is to determine which model best explains the data in a reasonable finite computing time while keeping in mind the following two questions: (1) How to measure the validity of the model? (2) How to ensure that the best model has been chosen without explicitly enumerating all possible models?

As not all possible models are nested, likelihood ratio tests cannot be employed. Thus we resorted to the Bayesian information criterion (BIC, also called Schwartz criterion), defined as BIC = –2logl + k logN, where l is the likelihood for the current model, k, the number of parameters and N, the number of records. The goal is to find the model with the lowest BIC. The reader is referred to the ample literature on the properties of other criteria like Akaike's (Burnham and Anderson 2004; Hastie et al., 2001). The difference between two BIC values can be used to approximate the logarithm of the Bayes factor between the two models (Hastie et al., 2001). Thus, assuming uniform priors for each competing model, the posterior probability of i-th model (i = 1, K) is, approximately,

Formula 2(2)
where {Delta}BICi is the difference between the BIC of the i-th model and the minimum BIC.


    3 IMPLEMENTATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 
The Qxpak package (Pérez-Enciso and Misztal, 2004) was used to obtain the log-likelihood and parameter estimates, adapted to the particularity of the genetic Zj matrices, which contain two non-zero elements for each individual's row (an incidence matrix normally contains only one non-zero element per row per effect). In the absence of additional random effects, this is equivalent to a multiple regression procedure.

We relied on simulated annealing (Kirkpatrick et al., 1983) to find the model with minimum BIC. Suppose that current model has BICc, and a new model with associated BICnew is proposed. The probability of accepting the new model is Paccept = Min{1, exp[–(BICnew – BICc)/T ]}, where T is the ‘annealing’ temperature. In our implementation, the initial temperature was 10 and decreased by a factor of 5% every m iterations, with m typically between 400 and 3000. Preliminary trials suggested the method was robust to departures from these parameters, and the initial temperature was high enough to have an initial acceptance rate >50%, as suggested in the literature. The number of cycles, i.e. different temperatures, ranged between 40 and 60. The program terminated when the temperature was arbitrarily low or when no new solution was accepted within the same temperature cycle.

The following series of movements were defined to propose new models (Figure 1):

  • In: a new site with position randomly chosen is inserted.
  • Out: one of the existing genetic factors is deleted.
  • Join: two distinct factors are fused (epistasis is created).
  • Split: a factor is split into two (if it contains more than one site).
  • Glue: two consecutive factors are fused, i.e. combined into one.
  • Grow: additional sites are added to the current factor.
  • Shave: the most external sites from a factor are deleted.
  • Shift: the factor position is shifted with direction (upstream or downstream) randomly chosen.


Figure 1
View larger version (13K):
[in this window]
[in a new window]
 
Fig. 1 Model representation and movements defined to change from model to model. In the cartoon, each square respresents a marker position, blank squares are not included in the model, while squares with the same pattern are contained in the same locus. During the simulated annealing process, a new model is proposed using one of the seven movements in the cartoon. The actual movement chosen depends on a number of factors like the current number of loci (see main text for details).

 
The probability with which a new movement is proposed was not constant. For instance, the frequency of proposing In (Out) was inversely (directly) proportional to the number of factors in the current model. The factor deleted was randomly chosen, as was the position of insertion of the new factor. Shave was more likely proposed for factors comprising many SNPs, conversely, Grow was more frequently applied to single SNP factors.

During the simulated annealing process, the best 20 solutions found are saved. This allows judging whether the global maximum has been reached but, importantly, also to quantitatively asses the strength in favor of the optimum model and carry out approximate model-averaged estimates:

Formula 3(3)
where Formula 3 is the weighted parameter estimate, Formula 3 is the estimate obtained with the i-th model and P(Mi|y) is the probability of model i given the data, which is proportional to exp(–BICi/2) when equal priors are assumed for each model.

Because the whole methodology that follows from Equation (1) is based on mixed model theory, modeling is very flexible, e.g. any number of effects, fixed or random, can be included in the model, or multitrait models can be run (so far, the same model must be fitted to all traits). Similarly, several thousand markers or individuals can potentially be studied because efficient sparse matrix techniques are employed to maximize the likelihood (Misztal and Perez-Enciso 1993).


    4 EMPIRICAL PROPERTIES: SIMULATION STUDY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 
4.1 Simulated data
Data generated by programs from Lin et al. (2004a) were used for the simulation study. Each dataset consisted of a population of ~1000 independent individuals genotyped for 500 SNPs. The data were simulated using coalescence techniques with recombination and with parameters 4Ner = 2000, where Ne is the effective size and r is the recombination rate, nucleotide diversity was 4000. Each SNP was spaced at 10 kb, on average. The original data were unphased and the phase was reconstructed using the hap2 program (Lin et al., 2004b). Two site and three site epistasis models were considered, although only two site epistasis was analyzed in detail. For the two site epistasis, SNPs in positions 8 and 490 were chosen arbitrarily as causal SNPs, with phenotypes simulated conditional on observed genotypes at those SNPs. For our purposes, we retained N = 1000 (all), 500 or 250 of the original individuals in order to mimic different population sizes. Five genetic scenarios were considered. In the simplest scenario (Scenario 1), a continuous trait governed by two additive loci was simulated. A genotypic effect {alpha} was defined such that

Formula 3
where Formula 3 is the genetic variance; h2, the heritability and pj, the allele frequency at the j-th SNP. The phenotype of a given i-th individual was obtained by sampling from a normal distribution N(gi, 1), with gi being the individual's genetic value, e.g. g = 2{alpha} for an individual homozygous for the alleles increasing the mean, and so on. Scenarios 2 and 3 involved two kind of extreme epistasis, as detailed in Table 2: coadaptative epistasis and extreme diminishing epistasis. In the coadaptative epistasis, the phenotype is increased when the alleles at the two loci are from the same origin. As noted by Carlborg and Haley (2004), the identification of the loci involved in this kind of epistasis require simultaneous scanning for pairs of loci because the individual locus effect can be negligible. In the extreme diminishing epistasis, the dominant alleles are complementary, this case was discussed by Crow and Kimura (1970). In the fourth scenario (Scenario 4) a binary trait governed by two additive independent loci was simulated. Here the phenotype was truncated to integer values 1 and 2, quantitative phenotypes below the mean was assigned a value of 1, and those above, 2; this procedure ensures that the expected frequency of both the values is equal, to mimic a case–control study. Three values of h2 0.30, 0.15 and 0.05 were considered, although not all possible combinations of parameters were explored because of computing constraints. Finally, Scenario 5 consisted of a three site epistasis model: a binary trait took phenotype 1 if at least three alleles in any of the three sites were of type ‘1’, e.g. genotypes 111/222 and 121/221 resulted in the same phenotype. The causal SNPs were in positions 8, 200 and 490. Note that no locus is neither necessary nor sufficient to cause the disease; this kind of inheritance has been described in complex diseases like Bardet–Biedl syndrome (Katsanis et al., 2001). A total of 30 replicates per case were run in all scenarios, each took between 2 and 5 h in a Pentium 4 PC under Linux OS.


View this table:
[in this window]
[in a new window]
 
Table 2 Genotypic coefficients for two-loci epistasis scenariosa

 
In order to characterize the behavior of the method under the null hypothesis and assess power, we permuted the data 100 times and run MASSA with the three data sizes, N = 250, 500 and 1000 in Scenario 2 (a preliminary study indicated that results were highly comparable across scenarios). Power was estimated as the number of replicates that exceeded the 5% cutoff and we only retained these replicates for further study.

4.2 Results
The permutation test applied to MASSA allows the user to have a guideline as to how likely is getting a given result just by chance. The 5% cutoff for the difference in BIC between the null model (a model without any locus) and the model retained by MASSA was 10, 10.6 and 10.5 for N = 1000, 500 and 250, respectively. The different values obtained can be ascribed merely to chance, and thus we retained the 10.5 as threshold for all sizes. Figure 2 shows the estimated distributions, which were rather similar across N. As seen in Table 3, all replicates exceeded this threshold except those simulated under the coadaptative epistasis (Scenario 2). Table 3 summarizes the main results obtained with MASSA. For all parameters studied, the estimate obtained using the best model, together with the result obtained averaging across models [Equation (2)] are reported. As an example, Figure 3 shows the relative probabilities assigned to the best 10 models in the particular case of Scenario 2 (results were very similar across scenarios). In all cases studied, the relative probability of the best model ranged between 35 and 45%, and between 15 and 20% for the second best model. This does not mean that the estimates are equally valid across all scenarios and cases studied but, instead, that the credibility of the best model relative to the rest of competing models was comparable in all situations. All in all, the evidence in favor of the best model was reasonably high and, in fact, point estimates were very similar to model-averaged estimates (Table 3).


Figure 2
View larger version (20K):
[in this window]
[in a new window]
 
Fig. 2 The densities of BIC differences (BIC under the null model minus BIC under the retained model with MASSA) in different scenarios. Densities were estimated with function ‘density’ in the R language (http://www.r-project.org/) using the default parameters. Left column, BIC differences under the null hypothesis obtained via permutation with N = 250 (top), 500 (middle) and 1000 (bottom). Right column, BIC differences for N = 250 in Scenarios 1 (top), 2 (middle) and 3 (bottom). Note that it may be possible that BIC differences are negative under the null hypothesis, meaning that the null hypothesis has a lower (better) BIC than the alternative. This occurs because BIC penalizes the number of parameters. Regarding the different scenarios for N = 250, Scenario 1 (strict additivity) is optimum in terms of BIC differences exceeding those under the null distribution, whereas Scenario 2 (coadaptive epistasis) is the worst of the three scenarios (see also Table 3).

 

View this table:
[in this window]
[in a new window]
 
Table 3 Simulation study

 

Figure 3
View larger version (9K):
[in this window]
[in a new window]
 
Fig. 3 Approximate relative probabilities of the 10 best models given the data in genetic Scenario 2. Results were very similar across scenarios.

 
Regarding the simplest scenario, i.e. a quantitative trait controlled by two independent loci (Scenario 1), the chances of identifying both causal loci were very high (>80%) with MASSA, except for the lowest heritability, when it was 0.63. The chances of exactly identifying the true model (i.e. two independent loci at positions 8 and 490) are also remarkably high for the most favorable situation (N = 1000, h2 = 0.30), with ~50% of replicates correctly identifying the true model. In ~60% of replicates, all SNPs in the selected model were at <50 kb (five SNPs) from the causal SNPs for N = 1000. This frequency slightly decreased as the size of the experiment decreased.

The presence of epistasis decreased the chances of including both causal loci in the model compared with the strictly additive Scenario 1, illustrating that epistasis always makes it more difficult the genetic dissection of complex traits. In general, Scenario 2 (adaptative epistasis) was a much less favorable setting than Scenario 3 (diminishing epistasis), but the former is also the most difficult epistasis to detect (Carlborg and Haley 2004). This was evidenced also by the largest number of replicates below the 5% threshold (see also Figure 2). Nevertheless, the likelihood of selecting at least one causal locus was often close to 100% (the sum of 2 and 1% columns in Table 3), except for very low h2, 0.05, confirming MASSA as a powerful tool for detecting gene interactions.

Regarding binary traits, the performance of MASSA was as good as for quantitative traits. The probability of identifying both causal loci was well >80% in all instances, except for the smallest dataset. Similarly, the chances of identifying the true or the approximate model were not affected and were quite similar to those in Scenario 1. Scenario 5 was a particular case because there were three interacting causal sites, and it is encouraging that epistasis between the three causal SNPs (or very nearby) were detected in almost all replicates.

Among the SNPs selected, there were some that were far from the causal SNPs (>50 kb) and thus were classified as false positives. Its percentage (last column in Table 3) varied from 10 to 30 for most cases, except in the lowest h2 in Scenario 2, which was close to 50%. In general, the associated P-values to these false positives were much higher (less significant) than for the true causal SNPs but, certainly, this result prompts caution on the problem of identifying the correct causal positions when the allele effects are small, even for relatively large samples. Scenario 5 is extreme: extra sites were retained in the final model and the exact true model was almost never retained, but again the significance associated to these redundant loci was far smaller (average P = 2 x 10–3) than the significance of the retained true causal loci (P << 10–15).


    5 REAL APPLICATION 1: GENETICAL GENOMICS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 
5.1 Data
The number of genetical genomics studies is exploding owing to the huge amount of information provided. However, to realize its full potential, adequate statistical tools need to be employed. It is interesting to note that several experiments have already identified epistasis in these studies (Brem et al., 2005; Chesler et al., 2005). Here we applied our method to publicly available mouse data (Chesler et al., 2005). In brief, the data consist of expression records of ~12 000 transcripts, obtained with Affymetrix chip U74Av2, from 35 recombinant B x D mouse lines that were genotyped for 779 non-redundant markers across the whole genome. As our principal interest was epistasis, we analyzed message levels from Grin2b gene. As noted by Chesler et al. (2005), it is known that this locus is modulated by loci on chromosomes 6 and 8; these authors applied an epistatic model to their data and found evidence of epistasis between a distal locus on chromosome 2 (c. 2) and a locus on distal chromosome 6 (c. 6).

5.2 Results
An initial run with our strategy uncovered several markers associated with message level, with sites D2Mt200 (c. 2), D5Mit180 (c. 5), D6Mit301 (c. 6) and D8Mit128 (c. 8) appearing in the majority of models, i.e. the three regions identified in previous studies plus a fourth locus in (c. 5; D5Mit180). In order to explore as exhaustively as possible all models, we allowed the program to fit only these markers. The five models with largest probability are shown in Table 4. The marginal probabilities of these markers appearing in the model were 0.57, 0.40, 1 and 0.84 for markers D2Mt200, D5Mit180, D6Mit301 and D8Mit128, respectively. Note that most models contain an epistatic interaction between chromosomes 6 and 8 (weighted P across best 20 models = 0.80), whereas the evidence in favor of epistasis between chromosomes 2 and 6 is weaker (weighted P = 0.26), confirming previous analyses.


View this table:
[in this window]
[in a new window]
 
Table 4 Most probable models for Grin2b expression level

 

    6 REAL APPLICATION 2: VIRUS DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 
6.1 Data
We have reanalyzed the data by Sanjuán et al. (2004a), who studied the contribution of epistasis to fitness of the vesicular stomatitis virus (VSV). To that end, these authors created large collection of single and double single-nucleotide substitution mutants by site-directed mutagenesis on an infectious cDNA. The fitness of each double mutant, as well as of each corresponding single mutant, were measured relative to their unmutated ancestor by means of competition experiments (Sanjuán et al., 2004a,b). In brief, both the mutant and unmutated ancestor were mixed at equal numbers and the mixture was used to innoculate a susceptible culture of baby hamster kidney fibroblast cells and fitness was estimated by comparing the growth rates of each genotype during the exponential growth phase. The VSV genome encodes for five proteins: N (nucleocapside that protects the RNA), P (phosphoprotein involved in replicase activity), M (virion inner matrix), G (glycoprotein forming the virion spikes that interact with cellular receptors) and L (the main component of the RNA polymerase, RpRd). In total, there were 33 mutated sites and 96 different genotypes for which fitness was assessed at least with 4-fold replication. Three genotypes were lethal and were excluded from the present analyses, leaving the total number of data as 93.

6.2 Results
The main interest of applying MASSA to this dataset was to investigate whether the method was able of identifying high order epistasis, rather than as an example with a large number of markers. For that purpose, we computed the frequency, weighted by the model probability [Equation (2)], that two sites interact (Table 5). Factors with probabilities of appearing in a model <60% were not included in the table. In total, five mutations selected from genes G and L, four for gene N, two for gene M and one for gene P have been selected. These 15 mutations were grouped into three main epistatic factors, which are marked with boxes of different shadow patterns in Table 5. All but two mutations selected were non-synonymous changes. The first epistatic factor, B1, comprises mutations G406U, U1323A, U1585A and U3615A, the first two mutations are located at the nucleocapside N gene, the third is in the minor component of the RdRp (gene P) and the fourth, at the glycoprotein G gene. In all cases, the probability associated with incorporating the mutation to B1 were larger than P ≥ 0.76. The second epistatic factor, B2, was formed by mutations C2607U, U2617G and C7287U (all with probabilities P ≥ 0.63). Finally, the third epistatic factor, B3, was constituted by mutations U877C, U3192C, A3802G, U4398C, G4459U, U5124A, G7123C and G7128U (in all cases, P = 1). By far, the largest and strongest epistatic interactions occurred within B3, which contains mutations in genes N (one), G (four), and L (three, including a synonymous change).


View this table:
[in this window]
[in a new window]
 
Table 5 Interaction between mutated sites in virusa

 

    7 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 
One of the immediate challenges in the genetic analysis of complex traits will be to make sense out of the vast amount of data being generated by the HapMap and related projects. As the number of markers increases, so does the probability of genotyping polymorphisms that are either causal or located nearby and thus in strong disequilibrium with the relevant mutation(s). In this context, it becomes feasible to carry out genome scans using linkage disequilibrium even in outbreed populations. Thus far, the vast majority of methods have employed a linear scan consisting in assessing the marker or set of markers that shows maximum association with the trait moving from one extreme through the other of the genome. This strategy is simple and easy to implement but suffers from several drawbacks. First, the classical strategy of adding a new quantitative trait loci to the model to test for additional loci is likely to result in a decreased power when a very stringent significance threshold is set; in addition, every new locus requires recomputing the new positions for all loci. As the number of loci in the model increases, computational cost becomes insurmountable. Second, epistatic effects are likely to be missed. Even if a two-dimensional search is done, it is impossible to exhaustively search all orders of interactions between loci. The use of simulated annealing in MASSA overcomes these two problems as it is able of searching over many interaction orders between loci. MASSA is neither a genome scan nor a greedy search, because the algorithm slowly imposes more and more restrictions on the models to be explored. Nevertheless, other optimization strategies are available, notably genetic algorithms (Mitchell, 1998), but they tend to be much slower than simulated annealing. These alternatives will be considered in the future. The statistical framework used by MASSA, mixed model theory, is extremely flexible. Note that, in contrast to methods based in genome linear scans, we are interested in identifying the best model fitting the data (balanced by the number of parameters). The validity of the final model can be checked via permutation (as done here) or with alternative techniques like cross-validation.

MASSA employs a frequentist maximum likelihood strategy. Although the Bayesian approach is conceptually more rigorous than the BIC or AIC criteria (Yi et al., 2005), it is heavily demanding in terms of computing time and suffers from other drawbacks. For instance, each BIC computation took ~0.01 s in a regular PC, whereas obtaining the Bayes factor needs at least 1 min of the CPU's time even with optimized algorithms (Varona et al., 2001). Besides, obtaining Bayes factors can be numerically unstable and posing uninformative priors (the ‘fixed effects’ in the frequentist paradigm) also results in practical problems to the Bayesian framework. Thus, while implementing efficient and stable Bayesian approaches, using the BIC method can be a good compromise.

Other authors (Bogdan et al., 2004) have proposed modified BIC criteria to improve its properties but they require expliciting priors about the expected interacting terms, which can be somewhat abitrary and are also approximate. In the light of our results, in particular those of simulated Scenario 5 and of genetical genomics data, it can be suggested that MASSA is run in a two-step manner, the first run can be used to pick up the most relevant loci, which will be used in the second run. Manual inspection of selected models together with biological external data should be employed to draw final conclusions.

We have assumed that phases are known. Nevertheless, this is rarely the case in practice. Three options can then be followed: (1) considering unphased genotypes, this is equivalent to having three alleles (the three genotypes) per site, which is the procedure currently implemented in MASSA; (2) estimating the phases with a statistical procedure (Li and Stephens 2003; Lin et al., 2004b) and taking the most likely phase or (3) estimate via Monte Carlo Markov chain methods the phases in each new simulated annealing iteration and using the current phase configuration, note that only the phases of SNPs included in the model need to be dealt with so the CPU time should not be increased much. The latter would be the most desirable strategy but it is also the most difficult to implement.

In their original report, Sanjuán et al. (2004b) estimated the strength and sign of epistasis between nucleotide substitutions as the difference between observed and expected (multiplicative) fitnesses for each double mutant in their collection. A good test of the performance of MASSA would be to compare the results it produced with the experimental determinations of epistasis. Three out of four mutations belonging to B1 were experimentally qualified as involved in significant interactions with other alleles outside B1 (U1585A being the odd case). Interactions between two out of three members of B2 (C7287U was not) were also experimentally detected to be involved in significant interactions outside B2. Finally, five out of eight members of B3 were qualified as significantly interacting with other sites outside the epistatic factor (U877C, A3802G and U4398C were not).

With regard to the genetical genomics study, the distribution of model probabilities is flatter than, for example, that in Figure 3, given the small dataset, and thus results should be interpreted with caution. Nevertheless, our reanalysis confirmed that sites on chromosomes 6 and 8 are very important to the expression of Grin2b gene (the significance of the interaction was P < 10–15), whereas the evidence related to the epistasis between chromosomes 2 and 6 is less clearcut. Finally, we note that locus D5Mit180 may also play a role to be confirmed with larger datasets.

An important result of the simulation study is that, although this method has optimum statistical properties for normally distributed traits, its performance with binary traits was equally satisfactory, at least when the underlying model is the threshold model (Table 2). This concords with long existing simulations and experimental results in animal breeding showing that the advantage of specific threshold affected/non-affected) is very rare and the size of the experiment is small (Meijering and Gianola, 1985). This result makes it possible to apply MASSA directly to binary traits.


    Acknowledgments
 
The author is grateful to Santiago Elena, Rob Williams, Rachel Brem and Dave Cutler and their co-workers for sharing their data and software, with special thanks to Santiago Elena for comments and discussions. The author is also indebted to Jesús Fernández for guidelines about the simulated annealing procedure, and to the referees for their valuable suggestions. Work funded by grants AGL2004-00103GAN and Acción especial en genómica y proteómica GEN2003-20658 (Ministerio de Educación y Ciencia, Spain). Part of these simulations were carried out at the supercomputing facilities of CESCA (Barcelona, Spain).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Frank Dudbridge

Received on October 20, 2005; revised on January 5, 2006; accepted on January 9, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION
 4 EMPIRICAL PROPERTIES:...
 5 REAL APPLICATION 1:...
 6 REAL APPLICATION 2:...
 7 DISCUSSION
 REFERENCES
 

    Bogdan, M., et al. (2004) Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics, 167, 989–999[Abstract/Free Full Text].

    Brem, R.B., et al. (2005) Genetic interactions between polymorphisms that affect gene expression in yeast. Nature, 436, 701–703[CrossRef][Medline].

    Burnham, K.P. and Anderson, D.R. (2004) Understanding AIC and BIC in model selection. Soc. Meth. Res, . 33, 261–304.

    Carlborg, O. and Haley, C.S. (2004) Epistasis: too often neglected in complex trait studies? Nat. Rev. Genet, . 5, 618–625[CrossRef][ISI][Medline].

    Carlborg, O., et al. (2003) A global search reveals epistatic interaction between QTL for early growth in the chicken. Genome Res, . 13, 413–421[Abstract/Free Full Text].

    Chesler, E.J., et al. (2005) Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat. Genet, . 37, 233–242[CrossRef][ISI][Medline].

    Crow, J.F. and Kimura, M. An Introduction to Population Genetics Theory, (1970) , New York Harper & Row.

    Hastie, T., Tibshirani, R., Friedman, J.H. The Elements of Statistical Learning, (2001) , New York Springer Verlag.

    Katsanis, N., et al. (2001) Triallelic inheritance in Bardet–Biedl syndrome, a Mendelian recessive disorder. Science, 293, 2256–2259[Abstract/Free Full Text].

    Kirkpatrick, S., et al. (1983) Optimization by simulated annealing. Science, 220, 671–680[Abstract/Free Full Text].

    Li, N. and Stephens, M. (2003) Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165, 2213–2233[Abstract/Free Full Text].

    Lin, S., et al. (2004a) Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat. Genet, . 36, 1181–1188[CrossRef][ISI][Medline].

    Lin, S., et al. (2004b) Haplotype and missing data inference in nuclear families. Genome Res, . 14, 1624–1632[Abstract/Free Full Text].

    Marchini, J., et al. (2005) Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet, . 37, 413–417[CrossRef][ISI][Medline].

    Meijering, A. and Gianola, D. (1985) Observations on sire evaluation with categorical data using heteroscedastic mixed linear models. J. Dairy Sci, . 68, 1226–1232[Abstract/Free Full Text].

    Misztal, I. and Perez-Enciso, M. (1993) Sparse matrix inversion in restricted maximum likelihood estimation of variance components by expectation-maximization. J. Dairy Sci, . 76, 1479[Abstract].

    Mitchell, M. An Introduction to Genetic Algorithms, (1998) , Cambridge, Massachussets MIT Press.

    Pérez-Enciso, M. and Misztal, I. (2004) Qxpak: a versatile mixed model application for genetical genomics and QTL analyses. Bioinformatics, 20, 2792–2798[Abstract/Free Full Text].

    Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet, . 69, 138–147[CrossRef][ISI][Medline].

    Sanjuán, R., et al. (2004a) The contribution of epistasis to the architecture of fitness in an RNA virus. Proc. Natl Acad. Sci. USA, 101, 15376–15379[Abstract/Free Full Text].

    Sanjuán, R., et al. (2004b) The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus. Proc. Natl Acad. Sci. USA, 101, 8396–8401[Abstract/Free Full Text].

    Segre, D., et al. (2005) Modular epistasis in yeast metabolism. Nat. Genet, . 37, 77–83[CrossRef][ISI][Medline].

    Ukkola, O., et al. (2001) Interactions among the glucocorticoid receptor, lipoprotein lipase and adrenergic receptor genes and abdominal fat in the Quebec Family Study. Int. J. Obes. Relat. Metab. Disord, . 25, 1332–1339[CrossRef][ISI][Medline].

    Varona, L., et al. (2001) Bayes factors for detection of quantitative trait loci. Genet. Sel. Evol, . 33, 133–152[CrossRef][ISI][Medline].

    Wolf, J.B., Brodie, E.D., Wade, M.J. Epistasis and the Evolutionary Process, (2000) , Oxford Oxfor University Press.

    Yi, N., et al. (2005) Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics, 170, 1333–1344[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/5/573    most recent
btk050v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pérez-Enciso, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pérez-Enciso, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?