Skip Navigation

Bioinformatics 2007 23(13):i212-i221; doi:10.1093/bioinformatics/btm217
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Huang, J. C.
Right arrow Articles by Winn, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Huang, J. C.
Right arrow Articles by Winn, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations

Jim C. Huang 1, Anitha Kannan 2 and John Winn 2,*

1Probabilistic and Statistical Inference Group, University of Toronto, Toronto, ON, M5S 3G4, Canada and 2Microsoft Research Cambridge, Cambridge, CB3 0FB, UK

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: With the recent availability of large-scale data sets profiling single nucleotide polymorphisms (SNPs) and quantitative traits data across different human subpopulations, there has been much attention directed towards discovering patterns of genetic variation and their connection to gene regulation and the onset/progression of disease. While previous work has focused primarily on correlating individual SNP markers with gene expression and disease, it has been suggested that using haplotype blocks instead of individual markers can significantly increase statistical power.

Results: We present BlockMapper, a probabilistic generative model for genotype data and quantitative traits data, such as gene expression or phenotype measurements. BlockMapper discovers the block structure of genotype data and associates these inferred blocks to patterns of variation in quantitative traits data, whilst accounting for non-genetic factors. Our model achieves high accuracy for predicting Crohn's disease phenotype in Chromosome 5q31 and reveals novel cis-associations between two haplotype blocks in the ENm006 genomic region and GDI1, a gene implicated in X-linked mental retardation. Our results underscore the importance of accounting for the influence of large sets of SNPs on patterns of regulatory/phenotypic variation and represent a step towards an understanding of human genetic variation.

Contact: jwinn{at}microsoft.com


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
An important question in molecular biology and medicine is how patterns of genetic variation influence gene regulation and phenotype in humans. Previous studies for understanding the influence of genetic variation on gene regulation and phenotypic traits involved searching for genetic variants in unrelated individuals and mapping these to patterns of gene expression (Stranger et al., 2007) or phenotypic traits (Cheung et al., 2005; Morley et al., 2004; Wang et al., 2005). These studies are based on one of the two approaches. One approach makes use of linkage analysis to find candidate genomic regions which vary between pedigrees, and then perform linkage disequilibrium (LD) mapping to search for genetic markers in these putative disease-correlated regions found by linkage analysis (Cheung et al., 2005; Morley et al., 2004).

The problem with this approach is that it may miss weaker associations due to lack of sufficiently large phenotypic differences between individuals in the study (Ardlie et al., 2002). As a way to overcome this, the second approach genotypes a dense set of SNPs across multiple individuals, and then make use of the allele frequencies in these individuals to correlate for disease (Wang et al., 2005). In this genome-wide association approach, high-throughput technologies for profiling quantitative traits (such as microarrays) allow for the detection of more subtle interactions, but at the cost of having to simultaneously test an extremely large number of statistical hypotheses and maintain both high sensitivity and specificity (Stranger et al., 2007). Instead of testing against individual SNPs, neighboring SNP markers can be grouped together into haplotype blocks (Botstein and Risch, 2003), which can then be correlated with disease. Using haplotype blocks in this way can drastically reduce the number of hypotheses to be tested, while providing resistance to genotyping errors. This approach is facilitated by the fact that neighboring SNPs are often in linkage disequilibrium with one another (Gabriel et al., 2002), such that they tend to be inherited jointly in haplotype blocks with relatively little haplotype diversity within a block (Daly et al., 2001). Such blocks can arise due to recombination hotspots in the genome, population bottlenecks or genetic drift, all of which act to preserve low haplotype diversity in distinct regions of the genome.

In this article, we present a joint probabilistic model for genotype and quantitative traits data which enables associating patterns of genetic variation with patterns of gene expression and other phenotype measurements. We call the model BlockMapper since it learns the relationships between haplotype blocks and the patterns of gene expression and other phenotype measurements. The BlockMapper model has two parts: the first part infers reliable haplotype blocks from the genotype data which are used in the second part to infer relationships to gene expression and other phenotype measurements. While both parts of BlockMapper model can be learned in tandem, for clarity, we show in this article how to learn them sequentially. A flowchart summarizing the joint BlockMapper model is shown in Figure 1: the first component accounts for the haplotype block structure of genotype data and allows us to simultaneously infer recombination hotspots and mutations and the phase at each locus in the genotype. The second component then associates the discovered haplotype blocks to a set of quantitative trait measurements across the same set of genotyped individuals. We will show that the BlockMapper model both provides an accurate model for variability in genotype data and infers biologically significant associations between haplotype blocks and patterns of gene expression and disease in both the Chr5q31 and ENm006 genomic regions.


Figure 1
View larger version (41K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The joint BlockMapper model for haplotype and quantitative traits data. We first partition genotype data into a set of haplotype blocks and then assign labels to each haplotype block to summarize genetic variation within the block. These labels, along with non-genetic factors such as gender and subpopulation, are then related to a set of quantitative traits measurements, such as patterns of gene expression or phenotypic measurements. The result is a set of associations between haplotype and quantitative traits which allows the prediction of new measurements given genotype data.

 

    2 A MODEL FOR HAPLOTYPE BLOCK DISCOVERY IN GENOTYPE DATA
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
The first component of the BlockMapper model allows us to discover sets of haplotype blocks consisting of regions of low recombination. The goal here is not to develop a completely novel model for genotype data, but rather to extend the existing model of (Jojic et al., 2004) to give improved reliability for learning the underlying haplotype block structure of the genotype data. The generative process described by the Jojic et al. model for a given individual is illustrated in Figure 2 and is as follows:
  1. Sample a class c (e.g. a subpopulation) according to P(c).
  2. Sample a pair of ancestors (s1, t1) independently according to initial distributions Formula over the ancestors.
  3. Given the pair of ancestors Formula , sample a pair of alleles (xk, yk) from the ancestral haplotypes R with probabilities Formula .
  4. To account for two possible orderings of a pair of alleles, select the phase Formula with equal probability. If mk = 1, swap the values of (xk, yk), otherwise leave them unchanged.
  5. Conditioned on the current pair of ancestral states (sk, tk), transition to the next pair of ancestor states Formula with transition probabilities Formula , increment the locus k and return to 3.


Figure 2
View larger version (37K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The generative process used to model the genotype data of each individual (shown for a subset of markers). For each SNP marker, a pair of ancestral indices s and t are generated: alleles are then sampled from the shared ancestral haplotypes Figure 2 for ancestors s and t to generate maternal and paternal haplotypes. Genotypes are obtained by swapping the s and t indices with equal probability at each locus, to capture the fact that the phase is unknown.

 
We repeat the above process for all individuals Formula . The graphical representation of the above process is shown in Figure 3 as a Bayesian network, consisting of a set of nodes encoding random variables in our model and directed edges indicating dependencies between variables.


Figure 3
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Bayesian network for the model of genotype data. Nodes correspond to observed and unobserved variables as well as model parameters, with directed edges between nodes representing conditional dependencies encoded by our probability model. For each individual j and at each genomic locus k, we first sample a class, then a pair of ancestors Figure 3 conditioned on the ancestors Figure 3 at the previous locus, then generate a genotype Figure 3 from the ancestral haplotypes. The phase Figure 3 at locus k determines which genotype measurement is associated with the paternal haplotype and which with the maternal.

 
When we apply the model, we assume the genotype data Formula was generated by this stochastic process and aim to infer the values of all the unobserved variables and parameters. Hence, the model allows us both to perform haplotype phasing by inferring the phase variables {mk}, and to partition the resulting haplotypes into blocks of SNP markers by inferring the sequences of ancestral indices Formula and Formula for each individual. These blocks are extracted from a learned library of A ‘ancestral’ haplotypes Formula . The model also accounts for the presence of different classes of individuals c, such as those arising from different sub-populations (Gabriel et al., 2002; HapMap Consortium, 2005; Pritchard et al., 2000). Different classes are assigned different initial and transition probabilities Formula . We frequently assume that all individuals come from a single class, in which case the c superscript is dropped.

We make a number of key extensions to the model of Jojic et al. (2004):

  • ‘Soft’ ancestral haplotypes: Previously, Jojic et al. used a library of ancestral haplotypes consisting of single alleles, where any difference between these ancestral haplotypes and the observed individual haplotypes was accounted for by a single noise parameter shared across all loci. We modify the model to define ‘soft’ ancestral haplotypes where each locus contains not a single allele, but a probability distribution over alleles corresponding to emission probabilities for each possible allele. This allows for highly localized variability as well as genotyping errors in the data to be accounted for in the learned ancestral haplotype.
  • Parent-child phase constraints: The above model assumes that the genotypes for individuals are generated i.i.d. from one another: in practice however, we are often given genotypes for parent–parent–child trios which provides us with constraints on some of the phase variables mk for the children. In particular, we can use this parent–child information to infer the phase variables at heterozygous loci, where the phase can often be unambiguously resolved for a child's genotype given the genotypes of the parents. We can incorporate this parent-child information at these loci into our model by fixing the phase variables at loci where the phase can be unambiguously resolved.
  • Dirichlet prior on emission probability: To allow for the possibility of alleles being observed at particular loci which are not present in the training set, we define a Dirichlet prior over the ancestral allele distribution. This allows a small prior probability for each possible allele, where the probability is given by counts of pseudo-observations (pseudo-counts) ß. The quantity ß is set to be equal for all alleles and loci.
  • Recombination prior: In a similar fashion, we define a Dirichlet prior over the transition probabilities Formula . However, in this case, we define a different prior probabilities for staying in the same ancestral haplotype and switching between ancestral haplotypes (recombination). The pseudo-counts for switching are set to {alpha}, and those for remaining in the same pattern are set to {alpha}{gamma}. Hence, {gamma} acts as a recombination prior so that higher values of {gamma} favor staying in the same ancestral state and lower values favor transitions to other ancestral states.

2.1 Variational inference and learning
To perform inference and parameter estimation in the generative model, we must compute the posterior distributions over unobserved variables (namely, the class variables Formula , the ancestral indices Formula and the phase variables {Formula }), as well as estimate the transition probabilities Formula and the ancestral haplotypes R. Computing the exact posterior over these unobserved variables is intractable: to address this, we follow the learning procedure adopted by Jojic et al., (2004) and resort to a tractable variational approximation to the required posterior distributions (Jordan et al.,, 1999). This variational approximation allows us to infer the chains of ancestral indices {sk} and {tk} separately from one another via the forward-backward algorithm (Rabiner et al., 1989). The transition probabilities Formula and emission probabilities R are modified to account for the phase variable at each locus, as well as the Dirichlet prior distributions on these parameters, so that


Formula

where Q refers to the approximate posterior distributions computed via the variational method and z refers to a specific observed allele. Thus, Formula is updated according to the frequency with which one emits allele z given state i at locus k, and Formula is updated according to the frequency with which one transitions from state i to state n. Both variational inference and parameter estimation can be accomplished here in a fashion analogous to the standard parameter updates in the Baum–Welch algorithm for learning HMMs (Rabiner et al., 1989), with the basic parameter updates for our model given by Jojic et al., (2004). By iteratively updating approximate posterior distributions over phase, ancestral indices and the model parameters via an EM algorithm (Dempster et al., 1977; Neal and Hinton, 1998), we are guaranteed to increase a lower bound on the log-likelihood of the data with each update step and thus improve the fit of our model to the data. Note that our inference procedure can be naturally modified to account for missing genotype data by summing over all possible alleles for a given missing marker.

For a given locus, we could re-arrange the ancestral indices, and the corresponding distributions indexed by these, without affecting the probability distribution defined by the model. Thus, as a post-processing step, we permute ancestral indices at each locus so that the most probable transition is to the same ancestral index. This permutation leads to longer contiguous stretches in which no transitions between ancestors occur. Figure 4a shows an example of this learning process.


Figure 4
View larger version (81K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. (a) Individual haplotypes spanning SNP markers 15 to 64 in the Chr5q31 data, along with the mappings to the ancestral haplotypes: each color indicates the ancestral haplotype from which the observed allele is drawn; (b) Haplotype block partitioning of the Daly 5q31 data. Each block shows the mapping to the ancestral haplotype library for all 387 genotyped individuals. Colorings within a block for a given individual indicate the most likely ancestral index labelling associated with the block for that individual. The blue graphs above each block show how the probability of recombination varies across the 103 markers. Vertical blue lines indicate recombination hotspots: loci where the probability of recombination exceeds 0.05. A haplotype block is defined as the region between two consecutive recombination hotspots. As the recombination prior parameter {gamma} increases, recombination is made less likely, leading to fewer and longer haplotype blocks. Experiments in the rest of this paper use {gamma}= 2 which allows for many, short haplotype blocks. For comparison, the haplotype blocks discovered in Daly et al. (2001) are shown in red.

 
2.2 Discovering haplotype blocks
Our model discovers haplotype blocks implicitly as regions where the probability of transitioning from one ancestral haplotype to another is small. This recombination probability can be computed at between any locus k and k + 1 as


Formula 1

(1)

where Formula is the marginal probability under our model of being in state sk at locus k. Given this recombination probability, we define a haplotype block as a region where the recombination probability is below Formula . Any point where the probability of recombination exceeds this value is therefore considered to be a boundary between two blocks and is likely to represent a recombination hotspot. Note that the recombination probability is affected by the recombination prior {gamma} discussed earlier. If {gamma} is large, recombination is discouraged, favoring a smaller number of longer haplotype blocks: similarly, if {gamma} is small, the haplotype is divided into a larger number of shorter blocks, as shown in Figure 4b.

For any individual, we can compactly approximate their haplotype by requiring that only one pair of ancestral haplotypes is used within each haplotype block: so s and t are forced to be constant within a block for that individual. We assign the label Formula to that block as a summary of the genetic variation for that block. As we know that few transitions between ancestral haplotypes occur within a block, this will be a reasonably accurate approximation to the underlying mappings to ancestral haplotypes. For any block b, we can infer the distribution over the ancestral haplotypes used and use these as labels for the given block. Distributions over the block labels Sb and Tb for the maternal and paternal chromosomes can then be computed from the posterior distributions Formula and Formula so that Formula and Formula , where k ranges over loci within haplotype block b. Thus, our model for genotype data allows us to partition the data into haplotype blocks and determine posterior distributions over the block labels for each individual. These block labels will be used in Section 4 to predict the value of quantitative traits measurements for that individual, such as levels of gene expression or phenotypic measurements.


    3 COMPARATIVE RESULTS ON LEARNING HAPLOTYPE BLOCK STRUCTURE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Learning the block structure of Chromosome 5q31
We applied our model to data from Chromosome 5q31 (Daly et al., 2001), consisting of genotypes of 129 parent-parent-children trios profiled across 103 genomic loci. All 129 children in the data set are patients with Crohn's disease. The model was learned using A = 7 ancestral haplotypes and model parameters of Formula , ß = 0.01 for all loci k, ancestral indices sk and alleles. The probability of recombination at each locus was then computed using Equation (1). We repeated this for recombination priors Formula and threshold the probability of transition at a constant value of 0.05 in each case. The genotypes for a subset of the training individuals and the set of learned ancestral haplotypes are shown in Figure 4a and haplotype block boundaries are shown in Figure 4b, along with the haplotype block boundaries reported by Daly et al. shown in red. Here, increasing {gamma} effectively smoothes over the probability of recombination, leading to a smaller number of longer haplotype blocks. We found that {gamma} = 2 gave us the best predictive error on independent test data: this setting of {gamma} also led to a partitioning which gave good agreement with the partitioning reported by Daly et al. (Figure 4b).

3.2 Comparison with original Jojic et al. model
We now compare the following three models:

  • our model excluding parent–child information,
  • our model including parent–child information, so that the child phase is known at many loci,
  • the model from Jojic et al., (2004) where ‘hard’ ancestral haplotypes are learned and parent-child information is not used.

We wish to compare how well these models represent the true probability density over haplotypes. To achieve this, we tested each model's ability to predict missing values in the haplotype data. The idea is that, if a model is correctly capturing the correlations between adjacent SNP measurements, then it will be better at predicting SNPs which are not observed. Whenever our data is fully observed, we can artificially remove observations at random locations and then test the ability of each model to fill-in the gaps correctly.

We applied this comparison method to the three models using the Chr5q31 data described earlier. Training/test sets were constructed from all individuals who were Crohns’-positive so that the training set contained 115 individuals (100 children and 15 parents) and the test set consisted of 29 children. We trained each model three times with different random initializations and in each case selected the one that achieved best bound on the training data likelihood.

For the test data, we artificially removed a fraction {rho} of the SNP measurements and used each learned model to fill-in these missing values. Fill-in was achieved by sampling from the posterior distribution over the missing alleles. Figure 5 shows the average prediction error rates across 10 training/test splits for each model, as {rho} varies from 0 to 0.5. Compared to Jojic et al., (2004), our model provides a significant improvement in accuracy on this fill-in task, indicating that it more accurately captures the true distribution of haplotypes. Making use of parent-child information leads to a further improvement in performance, so that our model is making a quarter of the number of fill-in errors compared to the Jojic model.


Figure 5
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Mean and standard deviations of prediction error on independent test genotype data as a function of the fraction of data {rho} which is missing. The prediction errors shown are obtained from the method of Jojic et al. (red), from our method which accounts for variability in the ancestral haplotypes (green) and combined with making use of parent–child information (blue).

 
3.3 Comparison with the HaploBlock and FastPHASE haplotype inference algorithms
We also used this fill-in task to compare the predictive accuracy of our model with other haplotype inference algorithms: HaploBlock (Greenspan and Geiger, 2004) and fastPHASE (Scheet and Stephens, 2006). Each algorithm was used to infer missing SNP values, exactly as described in the previous section. To make the comparison fair, we used the version of our model which does not make use of the known phase information for the children's genotypes. The corresponding predictive accuracy plots are shown in Figure 6: as can be seen, our model (blue) outperforms both HaploBlock (green dashed) and fastPHASE (red dotted) by a significant margin, where we make about half the number of errors. This result, in tandem with the improved predictive accuracy when using parent-child relationships, indicates that we can more accurately model genotype data by accounting for the underlying haplotype block structure and through the improvements presented earlier.


Figure 6
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Mean and standard deviations of test prediction error as a function of the fraction of test genotype locations made missing for our model (blue), fastPHASE (red dotted) and HaploBlock (green dashed).

 
3.4 Learning the block structure of the ENm006 region
We applied our model to genotype data consisting of 573 SNPs from the ENm006 region (located on X chromosome) common to the 270 individuals of the HapMap Phase II data (HapMap Consortium, 2005). Briefly, the 270 individuals consist of 30 parent–parent–child trios of Yoruba individuals from Ibadan, Nigeria (YRI), 30 trios of Utah individuals with European ancestry (CEU), 45 Han Chinese individuals from Beijing, China (CHB) and 45 Japanese individuals from Tokyo, Japan (JPT).

Using the same parameters for Formula and {gamma} as earlier, we learned our model from this data and partitioned the data into 19 haplotype blocks (Fig. 7). As can be seen, individuals in the three populations (CEU, YRI and CHB+JPT) have noticeably different usage of the ancestral haplotypes. We computed the average block lengths for each subpopulation separately using the block partitioning derived from each individual's recombination probability. We find the average block lengths are 41.6, 21.3 and 39.0 markers for CEU, YRI and CHB+JPT subpopulations, respectively. Consistent with previous results (Gabriel et al., 2002; HapMap Consortium, 2005), we find that the average block length for the YRI subpopulation to be the lowest, followed by the CHB+JPT and CEU subpopulations. Having presented our method and results for learning a set of haplotype blocks, we now present our model for relating blocks to trait measurements.


Figure 7
View larger version (57K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Haplotype block partitioning of the ENm006 data. Each block shows the mapping to the ancestral haplotype library for all 270 HapMap genotyped individuals. Vertical blue lines indicate recombination hotspots discovered by our model. A haplotype block is defined implicitly as the region between two recombination hotspots. Colorings within a block for a given individual indicate the most likely ancestral index labeling associated with the block for that individual. Population labels are shown on the right of the diagram and known protein-coding genes in the ENm006 genomic region are shown at the top, each linked to SNPs in the genomic regions spanned by that gene.

 

    4 A JOINT BAYESIAN MODEL OF HAPLOTYPE AND REGULATORY/PHENOTYPIC VARIATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We now turn to the problem of linking the states of these haplotype blocks to patterns of quantitative trait variation. To achieve this, we introduce the model which is shown as a Bayesian network in Figure 8. This model assumes that a subset of haplotype blocks affect each quantitative measurement of regulatory and phenotypic variation through a linear relation. These quantitative measurements may include phenotypes, high throughput gene expression data or any other continuous measurement that may be putatively linked to genetic variation.


Figure 8
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. Bayesian network for the model linking haplotype block structure to regulatory and phenotypic traits. The boxes, or ‘plates’, indicate that the structures contained within them are replicated a number of times indicated at the top of each plate. This replication allows for multiple haplotype blocks to contribute to particular a gene expression/phenotypic measurement for a given individual. Each gene/phenotype in our network is assigned a set of indicator variables which select haplotype blocks that are likely to be relevant given the model parameters and the data.

 
Our model considers the blocks jointly and is able to suppress spurious relations due to correlations between block states across different blocks. For example, if variation in a single haplotype block b completely explains variation in the expression of a particular gene g, then a model that would consider blocks separately (e.g. using multiple linear regressions) would also find relationships between the gene g and all blocks whose states are correlated with the state of b. Such spurious relations are ignored by our approach and the remaining relations which are most informative are discovered by the model.

Let us assume the haplotypes for J individuals have been partitioned into B non-overlapping haplotype blocks using the method described earlier. For each individual j, we define the label for a particular block b as an ordered pair Formula , representing the pair of ancestral haplotype indices inferred for this block, with the ancestral indices ranging from 1 to A. We represent Sb and Tb as indicator vectors of length A so that, if a block is associated with the a-th ancestor, the a-th element of the indicator vector is one and the remaining elements are zero.

Let Z be a Formula matrix, where Formula is the measurement of the g-th quantitative trait of the j-th individual. We can represent the measurements of a particular trait g for all individuals by Formula . We would now like to infer associations between haplotype block labels and quantitative trait measurements whilst rejecting noisy or spurious associations. To optimize predictive accuracy, we will encourage sparsity in our model to favor only a small number of blocks influencing any given trait. We model this explicitly using a binary relevance variable wbg associated with each combination of block and trait, such that Formula indicates an association between the block b and measurement g, and Formula indicates no association. We place a Bernoulli prior distribution over each Formula in which Formula so that smaller values of p favor more sparse solutions.

We model a particular trait g for an individual j as a weighted sum of contributions from the subset of B blocks which are considered relevant according to W, so that Formula , where Formula is a A-dimensional vector of weights for a particular block-trait pair. As the influence of a particular haplotype block on a trait is independent of the individual, the relevance variables W and the block contributions Formula are shared across all individuals. Thus, the A weights for any given block-trait pair allow us to model the joint patterns of variation in block label and trait measurement across individuals.

We define a matrix Formula , where (j,a)-th element is given by Formula . Assuming a zero-mean isotropic Gaussian noise with precision Formula , we have


Formula 2

(2)
We place a Normal-Gamma prior distribution on the weights Formula and the inverse variances Formula such that:


Formula 3

(3)
where the parameters of the prior distributions are shared across all blocks and traits. Given the quantitative measurements and the haplotype block structure across J individuals, we then aim to learn a model that maximizes Formula :


Formula

4.1 Variational Bayes learning of associations
The marginalization over the random variables required to compute P(Z) in the above equation cannot be computed analytically, and hence we cannot maximize this quantity exactly. As in Section 2, we resort to a variational approximation (Jordan et al., 1999) where instead of maximizing Formula , we instead maximize a lower bound on Formula ,


Formula

Here, Q approximates the required posterior distributions over the latent variables: in particular, we use the approximation


Formula

where the distribution over block labels Formula is obtained from Section 2.2 and is held fixed.

To carry out the optimization, we use variational Bayes (Attias, 1999) that performs approximate Bayesian inference by iteratively updating Formula and Formula so that each update increases the bound on Formula . To prevent the premature switching off of haplotype blocks for certain genes, we use an annealing strategy in which we progressively decrease the sparsity prior probability p over the relevance variables for a fixed number of variational Bayes iterations, until it reaches the target sparsity value.


    5 RESULTS ON LEARNING ASSOCIATIONS BETWEEN HAPLOTYPE BLOCKS AND TRAITS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
5.1 Linking haplotype block structure in Chr5q31 to Crohn's disease phenotype
Here, we applied our BlockMapper model to predict Crohn's disease phenotype in the Chr5q31 data (Daly et al., 2001). The data consists of the genotypes and the phenotypes (presence (+1) /absence (–1) of Crohn's disease) for 387 individuals. We first inferred the haplotype blocks from the genotype data for all 387 individuals (Section 3.1). Then, we used the Bayesian learning technique described above to learn the association between the haplotype block labels and the phenotypes of a randomly chosen subset of 287 individuals. We evaluated the accuracy of our model by measuring the error when predicting presence/absence of the disease on the remaining 100 individuals. We repeated this experiment for 10 independent train-test splits. Since our model predicts a real-valued trait corresponding to the input haplotype block labels, we used the sign of the predicted mean trait value to indicate presence (positive) or absence (negative) of the disease. To investigate the sensitivity of our model to the sparsity prior p, we repeated the above experiment for a range of values and found that the error rate was consistently similar for Formula . For example, p = 0.3 yielded an error rate of Formula with a standard deviation of Formula computed over the 10 splits. We found that the genotype information relevant to predicting Crohn's was highly localized, with haplotype blocks 2 and 10 (Fig. 4b) being the most informative towards predicting Crohn's disease (Formula , Wilcoxon–Mann–Whitney test, alpha value of Formula Bonferroni-corrected). These results indicate that patterns of genetic variation in Chr5q31 are informative towards predicting the Crohn's disease phenotype and that our haplotype blocks manage to capture this variation.

5.2 Linking genetic variation in ENm006 to GDI1 expression
We then evaluated our model's predictive accuracy on continuous gene expression measurements from the HapMap project. These measurements consist of the expression profiles for 47 294 genes profiled in EBV-transformed lymphoblastoid cell lines (Stranger et al., 2007). In particular, we focused on a set of 28 genes located in the ENm006 region, allowing us to search for cis-associations with the haplotype blocks established in Section 3.

We randomly split the set of haplotype block labels and gene expression data across to the 270 individuals into three sets—170 for training, 50 for validation and 50 for testing. We learned the model using the training set, while optimizing the sparsity prior parameter p on the validation set using prediction error as the criterion. The prediction error used was the sum of squared residuals between the model's estimate of the gene expression and the true level of expression. We found p = 0.2 minimized the error on the validation set and subsequently used this value for evaluating prediction error on the test set. We repeated the above analysis for five different train-validation-test splits with three random initializations in each split. Figure 9a shows the frequency (over five data splits and three random initializations) with which each gene is associated to each haplotype block according to the average magnitude of the relevance variables wbg.


Figure 9
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. (a) Relevance variables linking haplotype blocks and gene expression: each entry represents how frequently a given haplotype block is associated with a given gene under our model; (b) relevance variables learned when non-genetic factors are included in the model accounting for non-genetic factors has removed many spurious associations which can be better explained by non-genetic factors, such as population; (c,d) GDI1 expression versus haplotype block labels for haplotype blocks 2 and 5 in the ENm006 region, with measurements broken down according to the 3 sub-populations (CEU, YRI, CHB + JPT). Lines indicate the median of gene expression for a given block label. Each gene expression measurement is displayed twice, showing the expression value against the ancestral index for each of the maternal and paternal haplotype blocks. The plots indicate that the ancestral index in haplotype blocks 2 and 5 is informative of variability in GDI1 expression, particularly in the YRI and CHB + JPT subpopulations.

 
To assess the impact of accounting for non-genetic factors in explaining gene expression variation, we made use of all the available non-genetic traits (gender, population and parental information). We use the parent–child relationships in the HapMap data to recast age as a categorical variable (child/parent/unknown). We incorporated these additional factors into our probability model from Equation (2) by encoding the states for each non-genetic factor in a fashion analogous to the encoding for terms Formula for the haplotype blocks, effectively treating the three additional variables as three ‘virtual’ blocks. Using the training/testing methodology described above, we learned a model which also accounts for both the influence of non-genetic factors on gene expression as well as the influence of haplotype blocks. Figure 9b shows the frequency (over five data splits and three random initializations) with which each gene is associated with haplotype blocks and the non-genetic attributes: in contrast to Figure 9a, we can see that the model has eliminated many spurious associations between genes and haplotype blocks which are in fact due to population-specific differences in gene expression. Indeed, we find that gene expression is well predicted by the population variable, consistent with recent reports that support this hypothesis (Spielman et al., 2007).

In particular, Figure 9b also highlights significant cis-associations between blocks 2 and 5 in the ENm006 region and the GDI1 gene, which has been implicated in X-linked mental retardation (Shisheva et al., 1994). Figures 9c,d show the relationship between GDI1 expression levels and the haplotype block labels for blocks 2 and 5 across all individuals: as can be seen, the particular label assigned to a haplotype block is informative of variation in GDI1 expression, as given by variations in the median GDI1 expression with respect to haplotype block label within the YRI subpopulation in block 2 (Formula , Wilcoxon–Mann–Whitney test, alpha value of 0.01 Bonferroni-corrected for all pairwise tests of six assigned ancestral indices over three subpopulations) and the CHB+JPT subpopulation in block 5 (Formula ). We only tested associations between genes and which were discovered by our relation model and thus we did not use a Bonferroni correction for the number of genes or haplotype blocks.

To see the extent to which gene expression is explained by non-genetic information, we also learned a baseline model that explains the gene expression using only non-genetic factors consisting of age, gender and population. In the case of GDI1, we found that incorporating the haplotype block information in addition to the non-genetic attributes used by the baseline model reduced the population-specific test prediction errors (i.e. squared prediction error computed over subsets of individuals from a particular subpopulation) by Formula for CHB+JPT, Formula for YRI and Formula for CEU, suggesting a contribution from both subpopulation and genetic variation within populations to the variation of GDI1 across individuals. We computed test error for other genes in the ENm006 region, such as ARHGAP4, RENBP, HCFC1 and TKTL1 and did not observe corresponding reductions in prediction error for these genes. Although the number of individuals here is relatively small, the fact that we are able to infer significant associations between haplotype blocks and gene expression suggests that considering larger data sets would allow the prediction of more reliable associations.


    6 DISCUSSION AND CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, we presented BlockMapper, a Bayesian model for learning haplotype block structure from genotype data and associating the discovered blocks with quantitative traits. We have presented two components of the model, where the first component infers a set of reliable haplotype blocks from genotype data and the second learns associations between these blocks and quantitative traits, such as gene expression. We have shown that our model for genotype data significantly outperforms standard haplotype inference algorithms on the task of predicting missing alleles in test data. We also demonstrated that BlockMapper has good predictive accuracy when predicting Crohn's disease phenotype in the Chr5q31 region. Using our model, we discovered novel cis-associations between haplotype blocks 2 and 5 in the ENm006 region of ChrX and GDI1, a gene implicated in X-linked mental retardation.

In this article, we learned the two parts of the model in sequence: in future we would like to extend this so that the block structure and relationships are learned in tandem, allowing haplotype blocks to be discovered which are more relevant to the traits of interest. In addition, we primarily focused on finding cis-associations in this work: we intend to scale-up our analysis to a larger set of genes to find both cis- and trans- associations. Also, we currently treat each gene expression as an independent observation but we can also take into account co-expression/co-regulation between genes by learning these relationships from data (Segal et al., 2003). Finally, it will be important to study alternative non-linear relationships between the haplotype blocks and traits. This work also calls for access to large data sets of genotypes and traits measured across a larger number of individuals than used here. Our model and results, in tandem with the impending improvements in high throughput genotyping and profiling technologies, underscores the potential of discovering thousands of previously uncharacterized associations between genetic variants and quantitative traits.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We would like to thank Paul Scheet for discussions about fastPHASE, and Richard Durbin and Manolis Dermitzakis of the Wellcome Trust Sanger Institute for valuable feedback on the model. JCH was supported by an internship at Microsoft Research Cambridge. We also thank Nebojsa Jojic for helpful discussions.

Conflict of Interest: none declared.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 A MODEL FOR...
 3 COMPARATIVE RESULTS ON...
 4 A JOINT BAYESIAN...
 5 RESULTS ON LEARNING...
 6 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Ardlie KG, et al. Patterns of linkage disequilibrium in the human genome. Nat Rev Genet, ( (2002) ) 3, : 299–309.[CrossRef][ISI][Medline].

    Attias H. Inferring parameters and structure of latent variable models by variational Bayes. ( (1999) ) Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI). Sweden: Stockholm. 21–30..

    Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet, ( (2003) ) 33, ((Suppl.)): 228–237.[CrossRef][ISI][Medline].

    Cheung VG, et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature, ( (2005) ) 437, : 1365–1369.[CrossRef][Medline].

    Daly MJ, et al. High-resolution haplotype structure in the human genome. Nat. Genetics, ( (2001) ) 29, : 229–232.[CrossRef][ISI][Medline].

    Dempster A, et al. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B, ( (1977) ) 39, : 138..

    Gabriel SB, et al. The structure of haplotype blocks in the human genome. Science, ( (2002) ) 296, : 2225–2229.[Abstract/Free Full Text].

    Greenspan G, Geiger D. High density linkage disequilibrium mapping using models of haplotype block variation. Bioinformatics, ( (2004) ) 20, (Suppl. 1): i137–i144.[Abstract].

    The International HapMap Consortium. A haplotype map of the human genome. Nature, ( (2005) ) 437, : 1299–1320.[CrossRef][Medline].

    Jojic N, et al. Joint discovery of haplotype blocks and complex trait associations from SNP sequences without family data. ( (2004) ) Proceedings of the 20th Conference on Uncertainty in Aritificial Intelligence (UAI)..

    Jordan MI, et al. An introduction to variational methods for graphical models. In: Learning in Graphical Models, ( (1999) ) MIT Press: Cambridge..

    Morley M, et al. Genetic analysis of genome-wide variation in gene expression. Nature, ( (2004) ) 430, : 743–747.[CrossRef][Medline].

    Neal RM, Hinton GE. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Model, —Jordan MI, ed. ( (1998) ) Kluwer. http://www.cs.toronto.edu//char126radford/em.abstract.html..

    Pritchard JK, et al. Inference of population structure using multilocus genotype data. Genetics, ( (2000) ) 155, : 945–959.[Abstract/Free Full Text].

    Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, ( (1989) ) 77, (2): 257–286.[CrossRef].

    Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet, ( (2006) ) 78, : 629–644.[CrossRef][ISI][Medline].

    Segal E, et al. A module map showing conditional activity of expression modules in cancer. Nat. Genet, ( (2003) ) 34, : 166–76.[CrossRef][ISI][Medline].

    Shisheva A, et al. Cloning, characterization, and expression of a novel GDP dissociation inhibitor isoform from skeletal muscle. Mol. Cell. Biol, ( (1994) ) 14, : 3459–3468.[Abstract/Free Full Text].

    Spielman RS, et al. Common genetic variatnts account for differences in gene expression among ethnic groups. Nat. Genet, ( (2007) )..

    Stranger BE, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science, ( (2007) ) 315, : 848–853.[Abstract/Free Full Text].

    Wang WY, et al. Genome-wide association studies: Theoretical and practical concerns. Nat. Rev. Genet, ( (2005) ) 6, : 109–118.[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Huang, J. C.
Right arrow Articles by Winn, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Huang, J. C.
Right arrow Articles by Winn, J.
Social Bookmarking
 Add to CiteULike