Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations
1Probabilistic and Statistical Inference Group, University of Toronto, Toronto, ON, M5S 3G4, Canada and 2Microsoft Research Cambridge, Cambridge, CB3 0FB, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: With the recent availability of large-scale data sets profiling single nucleotide polymorphisms (SNPs) and quantitative traits data across different human subpopulations, there has been much attention directed towards discovering patterns of genetic variation and their connection to gene regulation and the onset/progression of disease. While previous work has focused primarily on correlating individual SNP markers with gene expression and disease, it has been suggested that using haplotype blocks instead of individual markers can significantly increase statistical power.
Results: We present BlockMapper, a probabilistic generative model for genotype data and quantitative traits data, such as gene expression or phenotype measurements. BlockMapper discovers the block structure of genotype data and associates these inferred blocks to patterns of variation in quantitative traits data, whilst accounting for non-genetic factors. Our model achieves high accuracy for predicting Crohn's disease phenotype in Chromosome 5q31 and reveals novel cis-associations between two haplotype blocks in the ENm006 genomic region and GDI1, a gene implicated in X-linked mental retardation. Our results underscore the importance of accounting for the influence of large sets of SNPs on patterns of regulatory/phenotypic variation and represent a step towards an understanding of human genetic variation.
Contact: jwinn{at}microsoft.com
| 1 INTRODUCTION |
|---|
|
|
|---|
An important question in molecular biology and medicine is how patterns of genetic variation influence gene regulation and phenotype in humans. Previous studies for understanding the influence of genetic variation on gene regulation and phenotypic traits involved searching for genetic variants in unrelated individuals and mapping these to patterns of gene expression (Stranger et al., 2007) or phenotypic traits (Cheung et al., 2005; Morley et al., 2004; Wang et al., 2005). These studies are based on one of the two approaches. One approach makes use of linkage analysis to find candidate genomic regions which vary between pedigrees, and then perform linkage disequilibrium (LD) mapping to search for genetic markers in these putative disease-correlated regions found by linkage analysis (Cheung et al., 2005; Morley et al., 2004).
The problem with this approach is that it may miss weaker associations due to lack of sufficiently large phenotypic differences between individuals in the study (Ardlie et al., 2002). As a way to overcome this, the second approach genotypes a dense set of SNPs across multiple individuals, and then make use of the allele frequencies in these individuals to correlate for disease (Wang et al., 2005). In this genome-wide association approach, high-throughput technologies for profiling quantitative traits (such as microarrays) allow for the detection of more subtle interactions, but at the cost of having to simultaneously test an extremely large number of statistical hypotheses and maintain both high sensitivity and specificity (Stranger et al., 2007). Instead of testing against individual SNPs, neighboring SNP markers can be grouped together into haplotype blocks (Botstein and Risch, 2003), which can then be correlated with disease. Using haplotype blocks in this way can drastically reduce the number of hypotheses to be tested, while providing resistance to genotyping errors. This approach is facilitated by the fact that neighboring SNPs are often in linkage disequilibrium with one another (Gabriel et al., 2002), such that they tend to be inherited jointly in haplotype blocks with relatively little haplotype diversity within a block (Daly et al., 2001). Such blocks can arise due to recombination hotspots in the genome, population bottlenecks or genetic drift, all of which act to preserve low haplotype diversity in distinct regions of the genome.
In this article, we present a joint probabilistic model for genotype and quantitative traits data which enables associating patterns of genetic variation with patterns of gene expression and other phenotype measurements. We call the model BlockMapper since it learns the relationships between haplotype blocks and the patterns of gene expression and other phenotype measurements. The BlockMapper model has two parts: the first part infers reliable haplotype blocks from the genotype data which are used in the second part to infer relationships to gene expression and other phenotype measurements. While both parts of BlockMapper model can be learned in tandem, for clarity, we show in this article how to learn them sequentially. A flowchart summarizing the joint BlockMapper model is shown in Figure 1: the first component accounts for the haplotype block structure of genotype data and allows us to simultaneously infer recombination hotspots and mutations and the phase at each locus in the genotype. The second component then associates the discovered haplotype blocks to a set of quantitative trait measurements across the same set of genotyped individuals. We will show that the BlockMapper model both provides an accurate model for variability in genotype data and infers biologically significant associations between haplotype blocks and patterns of gene expression and disease in both the Chr5q31 and ENm006 genomic regions.
|
| 2 A MODEL FOR HAPLOTYPE BLOCK DISCOVERY IN GENOTYPE DATA |
|---|
|
|
|---|
The first component of the BlockMapper model allows us to discover sets of haplotype blocks consisting of regions of low recombination. The goal here is not to develop a completely novel model for genotype data, but rather to extend the existing model of (Jojic et al., 2004) to give improved reliability for learning the underlying haplotype block structure of the genotype data. The generative process described by the Jojic et al. model for a given individual is illustrated in Figure 2 and is as follows:
- Sample a class c (e.g. a subpopulation) according to P(c).
- Sample a pair of ancestors (s1, t1) independently according to initial distributions
over the ancestors.
- Given the pair of ancestors
, sample a pair of alleles (xk, yk) from the ancestral haplotypes R with probabilities
.
- To account for two possible orderings of a pair of alleles, select the phase
with equal probability. If mk = 1, swap the values of (xk, yk), otherwise leave them unchanged.
- Conditioned on the current pair of ancestral states (sk, tk), transition to the next pair of ancestor states
with transition probabilities
, increment the locus k and return to 3.
|
We repeat the above process for all individuals
|
When we apply the model, we assume the genotype data
We make a number of key extensions to the model of Jojic et al. (2004):
- Soft ancestral haplotypes: Previously, Jojic et al. used a library of ancestral haplotypes consisting of single alleles, where any difference between these ancestral haplotypes and the observed individual haplotypes was accounted for by a single noise parameter shared across all loci. We modify the model to define soft ancestral haplotypes where each locus contains not a single allele, but a probability distribution over alleles corresponding to emission probabilities for each possible allele. This allows for highly localized variability as well as genotyping errors in the data to be accounted for in the learned ancestral haplotype.
- Parent-child phase constraints: The above model assumes that the genotypes for individuals are generated i.i.d. from one another: in practice however, we are often given genotypes for parent–parent–child trios which provides us with constraints on some of the phase variables mk for the children. In particular, we can use this parent–child information to infer the phase variables at heterozygous loci, where the phase can often be unambiguously resolved for a child's genotype given the genotypes of the parents. We can incorporate this parent-child information at these loci into our model by fixing the phase variables at loci where the phase can be unambiguously resolved.
- Dirichlet prior on emission probability: To allow for the possibility of alleles being observed at particular loci which are not present in the training set, we define a Dirichlet prior over the ancestral allele distribution. This allows a small prior probability for each possible allele, where the probability is given by counts of pseudo-observations (pseudo-counts) ß. The quantity ß is set to be equal for all alleles and loci.
- Recombination prior: In a similar fashion, we define a Dirichlet prior over the transition probabilities
. However, in this case, we define a different prior probabilities for staying in the same ancestral haplotype and switching between ancestral haplotypes (recombination). The pseudo-counts for switching are set to
, and those for remaining in the same pattern are set to 
. Hence,
acts as a recombination prior so that higher values of
favor staying in the same ancestral state and lower values favor transitions to other ancestral states.
2.1 Variational inference and learning
To perform inference and parameter estimation in the generative model, we must compute the posterior distributions over unobserved variables (namely, the class variables
, the ancestral indices
and the phase variables {
}), as well as estimate the transition probabilities
and the ancestral haplotypes R. Computing the exact posterior over these unobserved variables is intractable: to address this, we follow the learning procedure adopted by Jojic et al., (2004) and resort to a tractable variational approximation to the required posterior distributions (Jordan et al.,, 1999). This variational approximation allows us to infer the chains of ancestral indices {sk} and {tk} separately from one another via the forward-backward algorithm (Rabiner et al., 1989). The transition probabilities
and emission probabilities R are modified to account for the phase variable at each locus, as well as the Dirichlet prior distributions on these parameters, so that
|
|
where Q refers to the approximate posterior distributions computed via the variational method and z refers to a specific observed allele. Thus,
is updated according to the frequency with which one emits allele z given state i at locus k, and
is updated according to the frequency with which one transitions from state i to state n. Both variational inference and parameter estimation can be accomplished here in a fashion analogous to the standard parameter updates in the Baum–Welch algorithm for learning HMMs (Rabiner et al., 1989), with the basic parameter updates for our model given by Jojic et al., (2004). By iteratively updating approximate posterior distributions over phase, ancestral indices and the model parameters via an EM algorithm (Dempster et al., 1977; Neal and Hinton, 1998), we are guaranteed to increase a lower bound on the log-likelihood of the data with each update step and thus improve the fit of our model to the data. Note that our inference procedure can be naturally modified to account for missing genotype data by summing over all possible alleles for a given missing marker.
For a given locus, we could re-arrange the ancestral indices, and the corresponding distributions indexed by these, without affecting the probability distribution defined by the model. Thus, as a post-processing step, we permute ancestral indices at each locus so that the most probable transition is to the same ancestral index. This permutation leads to longer contiguous stretches in which no transitions between ancestors occur. Figure 4a shows an example of this learning process.
|
2.2 Discovering haplotype blocks
Our model discovers haplotype blocks implicitly as regions where the probability of transitioning from one ancestral haplotype to another is small. This recombination probability can be computed at between any locus k and k + 1 as
|
| (1) |
where
is the marginal probability under our model of being in state sk at locus k. Given this recombination probability, we define a haplotype block as a region where the recombination probability is below
. Any point where the probability of recombination exceeds this value is therefore considered to be a boundary between two blocks and is likely to represent a recombination hotspot. Note that the recombination probability is affected by the recombination prior
discussed earlier. If
is large, recombination is discouraged, favoring a smaller number of longer haplotype blocks: similarly, if
is small, the haplotype is divided into a larger number of shorter blocks, as shown in Figure 4b.
For any individual, we can compactly approximate their haplotype by requiring that only one pair of ancestral haplotypes is used within each haplotype block: so s and t are forced to be constant within a block for that individual. We assign the label
to that block as a summary of the genetic variation for that block. As we know that few transitions between ancestral haplotypes occur within a block, this will be a reasonably accurate approximation to the underlying mappings to ancestral haplotypes. For any block b, we can infer the distribution over the ancestral haplotypes used and use these as labels for the given block. Distributions over the block labels Sb and Tb for the maternal and paternal chromosomes can then be computed from the posterior distributions
and
so that
and
, where k ranges over loci within haplotype block b. Thus, our model for genotype data allows us to partition the data into haplotype blocks and determine posterior distributions over the block labels for each individual. These block labels will be used in Section 4 to predict the value of quantitative traits measurements for that individual, such as levels of gene expression or phenotypic measurements.
| 3 COMPARATIVE RESULTS ON LEARNING HAPLOTYPE BLOCK STRUCTURE |
|---|
|
|
|---|
3.1 Learning the block structure of Chromosome 5q31
We applied our model to data from Chromosome 5q31 (Daly et al., 2001), consisting of genotypes of 129 parent-parent-children trios profiled across 103 genomic loci. All 129 children in the data set are patients with Crohn's disease. The model was learned using A = 7 ancestral haplotypes and model parameters of
effectively smoothes over the probability of recombination, leading to a smaller number of longer haplotype blocks. We found that
= 2 gave us the best predictive error on independent test data: this setting of
also led to a partitioning which gave good agreement with the partitioning reported by Daly et al. (Figure 4b).
3.2 Comparison with original Jojic et al. model
We now compare the following three models:
- our model excluding parent–child information,
- our model including parent–child information, so that the child phase is known at many loci,
- the model from Jojic et al., (2004) where hard ancestral haplotypes are learned and parent-child information is not used.
We wish to compare how well these models represent the true probability density over haplotypes. To achieve this, we tested each model's ability to predict missing values in the haplotype data. The idea is that, if a model is correctly capturing the correlations between adjacent SNP measurements, then it will be better at predicting SNPs which are not observed. Whenever our data is fully observed, we can artificially remove observations at random locations and then test the ability of each model to fill-in the gaps correctly.
We applied this comparison method to the three models using the Chr5q31 data described earlier. Training/test sets were constructed from all individuals who were Crohns-positive so that the training set contained 115 individuals (100 children and 15 parents) and the test set consisted of 29 children. We trained each model three times with different random initializations and in each case selected the one that achieved best bound on the training data likelihood.
For the test data, we artificially removed a fraction
of the SNP measurements and used each learned model to fill-in these missing values. Fill-in was achieved by sampling from the posterior distribution over the missing alleles. Figure 5 shows the average prediction error rates across 10 training/test splits for each model, as
varies from 0 to 0.5. Compared to Jojic et al., (2004), our model provides a significant improvement in accuracy on this fill-in task, indicating that it more accurately captures the true distribution of haplotypes. Making use of parent-child information leads to a further improvement in performance, so that our model is making a quarter of the number of fill-in errors compared to the Jojic model.
|
3.3 Comparison with the HaploBlock and FastPHASE haplotype inference algorithms
We also used this fill-in task to compare the predictive accuracy of our model with other haplotype inference algorithms: HaploBlock (Greenspan and Geiger, 2004) and fastPHASE (Scheet and Stephens, 2006). Each algorithm was used to infer missing SNP values, exactly as described in the previous section. To make the comparison fair, we used the version of our model which does not make use of the known phase information for the children's genotypes. The corresponding predictive accuracy plots are shown in Figure 6: as can be seen, our model (blue) outperforms both HaploBlock (green dashed) and fastPHASE (red dotted) by a significant margin, where we make about half the number of errors. This result, in tandem with the improved predictive accuracy when using parent-child relationships, indicates that we can more accurately model genotype data by accounting for the underlying haplotype block structure and through the improvements presented earlier.
|
3.4 Learning the block structure of the ENm006 region
We applied our model to genotype data consisting of 573 SNPs from the ENm006 region (located on X chromosome) common to the 270 individuals of the HapMap Phase II data (HapMap Consortium, 2005). Briefly, the 270 individuals consist of 30 parent–parent–child trios of Yoruba individuals from Ibadan, Nigeria (YRI), 30 trios of Utah individuals with European ancestry (CEU), 45 Han Chinese individuals from Beijing, China (CHB) and 45 Japanese individuals from Tokyo, Japan (JPT).
Using the same parameters for
and
as earlier, we learned our model from this data and partitioned the data into 19 haplotype blocks (Fig. 7). As can be seen, individuals in the three populations (CEU, YRI and CHB+JPT) have noticeably different usage of the ancestral haplotypes. We computed the average block lengths for each subpopulation separately using the block partitioning derived from each individual's recombination probability. We find the average block lengths are 41.6, 21.3 and 39.0 markers for CEU, YRI and CHB+JPT subpopulations, respectively. Consistent with previous results (Gabriel et al., 2002; HapMap Consortium, 2005), we find that the average block length for the YRI subpopulation to be the lowest, followed by the CHB+JPT and CEU subpopulations. Having presented our method and results for learning a set of haplotype blocks, we now present our model for relating blocks to trait measurements.
|
| 4 A JOINT BAYESIAN MODEL OF HAPLOTYPE AND REGULATORY/PHENOTYPIC VARIATION |
|---|
|
|
|---|
We now turn to the problem of linking the states of these haplotype blocks to patterns of quantitative trait variation. To achieve this, we introduce the model which is shown as a Bayesian network in Figure 8. This model assumes that a subset of haplotype blocks affect each quantitative measurement of regulatory and phenotypic variation through a linear relation. These quantitative measurements may include phenotypes, high throughput gene expression data or any other continuous measurement that may be putatively linked to genetic variation.
|
Our model considers the blocks jointly and is able to suppress spurious relations due to correlations between block states across different blocks. For example, if variation in a single haplotype block b completely explains variation in the expression of a particular gene g, then a model that would consider blocks separately (e.g. using multiple linear regressions) would also find relationships between the gene g and all blocks whose states are correlated with the state of b. Such spurious relations are ignored by our approach and the remaining relations which are most informative are discovered by the model.
Let us assume the haplotypes for J individuals have been partitioned into B non-overlapping haplotype blocks using the method described earlier. For each individual j, we define the label for a particular block b as an ordered pair
, representing the pair of ancestral haplotype indices inferred for this block, with the ancestral indices ranging from 1 to A. We represent Sb and Tb as indicator vectors of length A so that, if a block is associated with the a-th ancestor, the a-th element of the indicator vector is one and the remaining elements are zero.
Let Z be a
matrix, where
is the measurement of the g-th quantitative trait of the j-th individual. We can represent the measurements of a particular trait g for all individuals by
. We would now like to infer associations between haplotype block labels and quantitative trait measurements whilst rejecting noisy or spurious associations. To optimize predictive accuracy, we will encourage sparsity in our model to favor only a small number of blocks influencing any given trait. We model this explicitly using a binary relevance variable wbg associated with each combination of block and trait, such that
indicates an association between the block b and measurement g, and
indicates no association. We place a Bernoulli prior distribution over each
in which
so that smaller values of p favor more sparse solutions.
We model a particular trait g for an individual j as a weighted sum of contributions from the subset of B blocks which are considered relevant according to W, so that
, where
is a A-dimensional vector of weights for a particular block-trait pair. As the influence of a particular haplotype block on a trait is independent of the individual, the relevance variables W and the block contributions
are shared across all individuals. Thus, the A weights for any given block-trait pair allow us to model the joint patterns of variation in block label and trait measurement across individuals.
We define a matrix
, where (j,a)-th element is given by
. Assuming a zero-mean isotropic Gaussian noise with precision
, we have
|
| (2) |
|
| (3) |
|
|
4.1 Variational Bayes learning of associations
The marginalization over the random variables required to compute P(Z) in the above equation cannot be computed analytically, and hence we cannot maximize this quantity exactly. As in Section 2, we resort to a variational approximation (Jordan et al., 1999) where instead of maximizing
, we instead maximize a lower bound on
,
|
|
|
|
To carry out the optimization, we use variational Bayes (Attias, 1999) that performs approximate Bayesian inference by iteratively updating
and
so that each update increases the bound on
. To prevent the premature switching off of haplotype blocks for certain genes, we use an annealing strategy in which we progressively decrease the sparsity prior probability p over the relevance variables for a fixed number of variational Bayes iterations, until it reaches the target sparsity value.
| 5 RESULTS ON LEARNING ASSOCIATIONS BETWEEN HAPLOTYPE BLOCKS AND TRAITS |
|---|
|
|
|---|
5.1 Linking haplotype block structure in Chr5q31 to Crohn's disease phenotype
Here, we applied our BlockMapper model to predict Crohn's disease phenotype in the Chr5q31 data (Daly et al., 2001). The data consists of the genotypes and the phenotypes (presence (+1) /absence (–1) of Crohn's disease) for 387 individuals. We first inferred the haplotype blocks from the genotype data for all 387 individuals (Section 3.1). Then, we used the Bayesian learning technique described above to learn the association between the haplotype block labels and the phenotypes of a randomly chosen subset of 287 individuals. We evaluated the accuracy of our model by measuring the error when predicting presence/absence of the disease on the remaining 100 individuals. We repeated this experiment for 10 independent train-test splits. Since our model predicts a real-valued trait corresponding to the input haplotype block labels, we used the sign of the predicted mean trait value to indicate presence (positive) or absence (negative) of the disease. To investigate the sensitivity of our model to the sparsity prior p, we repeated the above experiment for a range of values and found that the error rate was consistently similar for
5.2 Linking genetic variation in ENm006 to GDI1 expression
We then evaluated our model's predictive accuracy on continuous gene expression measurements from the HapMap project. These measurements consist of the expression profiles for 47 294 genes profiled in EBV-transformed lymphoblastoid cell lines (Stranger et al., 2007). In particular, we focused on a set of 28 genes located in the ENm006 region, allowing us to search for cis-associations with the haplotype blocks established in Section 3.
We randomly split the set of haplotype block labels and gene expression data across to the 270 individuals into three sets—170 for training, 50 for validation and 50 for testing. We learned the model using the training set, while optimizing the sparsity prior parameter p on the validation set using prediction error as the criterion. The prediction error used was the sum of squared residuals between the model's estimate of the gene expression and the true level of expression. We found p = 0.2 minimized the error on the validation set and subsequently used this value for evaluating prediction error on the test set. We repeated the above analysis for five different train-validation-test splits with three random initializations in each split. Figure 9a shows the frequency (over five data splits and three random initializations) with which each gene is associated to each haplotype block according to the average magnitude of the relevance variables wbg.
|
To assess the impact of accounting for non-genetic factors in explaining gene expression variation, we made use of all the available non-genetic traits (gender, population and parental information). We use the parent–child relationships in the HapMap data to recast age as a categorical variable (child/parent/unknown). We incorporated these additional factors into our probability model from Equation (2) by encoding the states for each non-genetic factor in a fashion analogous to the encoding for terms
In particular, Figure 9b also highlights significant cis-associations between blocks 2 and 5 in the ENm006 region and the GDI1 gene, which has been implicated in X-linked mental retardation (Shisheva et al., 1994). Figures 9c,d show the relationship between GDI1 expression levels and the haplotype block labels for blocks 2 and 5 across all individuals: as can be seen, the particular label assigned to a haplotype block is informative of variation in GDI1 expression, as given by variations in the median GDI1 expression with respect to haplotype block label within the YRI subpopulation in block 2 (
, Wilcoxon–Mann–Whitney test, alpha value of 0.01 Bonferroni-corrected for all pairwise tests of six assigned ancestral indices over three subpopulations) and the CHB+JPT subpopulation in block 5 (
). We only tested associations between genes and which were discovered by our relation model and thus we did not use a Bonferroni correction for the number of genes or haplotype blocks.
To see the extent to which gene expression is explained by non-genetic information, we also learned a baseline model that explains the gene expression using only non-genetic factors consisting of age, gender and population. In the case of GDI1, we found that incorporating the haplotype block information in addition to the non-genetic attributes used by the baseline model reduced the population-specific test prediction errors (i.e. squared prediction error computed over subsets of individuals from a particular subpopulation) by
for CHB+JPT,
for YRI and
for CEU, suggesting a contribution from both subpopulation and genetic variation within populations to the variation of GDI1 across individuals. We computed test error for other genes in the ENm006 region, such as ARHGAP4, RENBP, HCFC1 and TKTL1 and did not observe corresponding reductions in prediction error for these genes. Although the number of individuals here is relatively small, the fact that we are able to infer significant associations between haplotype blocks and gene expression suggests that considering larger data sets would allow the prediction of more reliable associations.
| 6 DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
In this article, we presented BlockMapper, a Bayesian model for learning haplotype block structure from genotype data and associating the discovered blocks with quantitative traits. We have presented two components of the model, where the first component infers a set of reliable haplotype blocks from genotype data and the second learns associations between these blocks and quantitative traits, such as gene expression. We have shown that our model for genotype data significantly outperforms standard haplotype inference algorithms on the task of predicting missing alleles in test data. We also demonstrated that BlockMapper has good predictive accuracy when predicting Crohn's disease phenotype in the Chr5q31 region. Using our model, we discovered novel cis-associations between haplotype blocks 2 and 5 in the ENm006 region of ChrX and GDI1, a gene implicated in X-linked mental retardation.
In this article, we learned the two parts of the model in sequence: in future we would like to extend this so that the block structure and relationships are learned in tandem, allowing haplotype blocks to be discovered which are more relevant to the traits of interest. In addition, we primarily focused on finding cis-associations in this work: we intend to scale-up our analysis to a larger set of genes to find both cis- and trans- associations. Also, we currently treat each gene expression as an independent observation but we can also take into account co-expression/co-regulation between genes by learning these relationships from data (Segal et al., 2003). Finally, it will be important to study alternative non-linear relationships between the haplotype blocks and traits. This work also calls for access to large data sets of genotypes and traits measured across a larger number of individuals than used here. Our model and results, in tandem with the impending improvements in high throughput genotyping and profiling technologies, underscores the potential of discovering thousands of previously uncharacterized associations between genetic variants and quantitative traits.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We would like to thank Paul Scheet for discussions about fastPHASE, and Richard Durbin and Manolis Dermitzakis of the Wellcome Trust Sanger Institute for valuable feedback on the model. JCH was supported by an internship at Microsoft Research Cambridge. We also thank Nebojsa Jojic for helpful discussions.
Conflict of Interest: none declared.
| REFERENCES |
|---|
|
|
|---|
Ardlie KG, et al. Patterns of linkage disequilibrium in the human genome. Nat Rev Genet, ( (2002) ) 3, : 299–309.[CrossRef][ISI][Medline].
Attias H. Inferring parameters and structure of latent variable models by variational Bayes. ( (1999) ) Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI). Sweden: Stockholm. 21–30..
Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet, ( (2003) ) 33, ((Suppl.)): 228–237.[CrossRef][ISI][Medline].
Cheung VG, et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature, ( (2005) ) 437, : 1365–1369.[CrossRef][Medline].
Daly MJ, et al. High-resolution haplotype structure in the human genome. Nat. Genetics, ( (2001) ) 29, : 229–232.[CrossRef][ISI][Medline].
Dempster A, et al. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B, ( (1977) ) 39, : 138..
Gabriel SB, et al. The structure of haplotype blocks in the human genome. Science, ( (2002) ) 296, : 2225–2229.
Greenspan G, Geiger D. High density linkage disequilibrium mapping using models of haplotype block variation. Bioinformatics, ( (2004) ) 20, (Suppl. 1): i137–i144.[Abstract].
The International HapMap Consortium. A haplotype map of the human genome. Nature, ( (2005) ) 437, : 1299–1320.[CrossRef][Medline].
Jojic N, et al. Joint discovery of haplotype blocks and complex trait associations from SNP sequences without family data. ( (2004) ) Proceedings of the 20th Conference on Uncertainty in Aritificial Intelligence (UAI)..
Jordan MI, et al. An introduction to variational methods for graphical models. In: Learning in Graphical Models, ( (1999) ) MIT Press: Cambridge..
Morley M, et al. Genetic analysis of genome-wide variation in gene expression. Nature, ( (2004) ) 430, : 743–747.[CrossRef][Medline].
Neal RM, Hinton GE. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Model, —Jordan MI, ed. ( (1998) ) Kluwer. http://www.cs.toronto.edu//char126radford/em.abstract.html..
Pritchard JK, et al. Inference of population structure using multilocus genotype data. Genetics, ( (2000) ) 155, : 945–959.
Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, ( (1989) ) 77, (2): 257–286.[CrossRef].
Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet, ( (2006) ) 78, : 629–644.[CrossRef][ISI][Medline].
Segal E, et al. A module map showing conditional activity of expression modules in cancer. Nat. Genet, ( (2003) ) 34, : 166–76.[CrossRef][ISI][Medline].
Shisheva A, et al. Cloning, characterization, and expression of a novel GDP dissociation inhibitor isoform from skeletal muscle. Mol. Cell. Biol, ( (1994) ) 14, : 3459–3468.
Spielman RS, et al. Common genetic variatnts account for differences in gene expression among ethnic groups. Nat. Genet, ( (2007) )..
Stranger BE, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science, ( (2007) ) 315, : 848–853.
Wang WY, et al. Genome-wide association studies: Theoretical and practical concerns. Nat. Rev. Genet, ( (2005) ) 6, : 109–118.[CrossRef][ISI][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










