Bioinformatics Advance Access originally published online on January 31, 2007
Bioinformatics 2007 23(7):785-788; doi:10.1093/bioinformatics/btm003
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Large scale genotype–phenotype correlation analysis based on phylogenetic trees
1Department of Physics, 2Department of Pharmacology and 3Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
We provide two methods for identifying changes in genotype that are correlated with changes in a phenotype implied by phylogenetic trees. The first method, VENN, works when the number of branches over which the change occurred are modest. VENN looks for genetic changes that are completely penetrant with phenotype changes on a tree. The second method, CCTSWEEP, allows for a partial matching between changes in phenotypes and genotypes and provides a score for each change using Maddison's concentrated changes test. The mutations that are highly correlated with phenotypic change can be ranked by score. We use these methods to find SNPs correlated with resistance to Bacillus anthracis in inbred mouse strains. Our findings are consistent with the current biological literature, and also suggest potential novel candidate genes.
Contact: farhat{at}pacific.mps.ohio-state.edu for software requests.
| 1 INTRODUCTION |
|---|
|
|
|---|
Single nucleotide polymorphisms (SNPs) are an abundant form of genome variation, distinguished from rare variations by a requirement for the least abundant allele to have a frequency of at least 1%. By studying stretches of DNA that harbor a SNP associated with a disease trait, researchers may reveal relevant genes associated with a disease or susceptibility to a pathogen or other phenotypic traits.
Researchers have used population-based approaches to study the power of SNP–phenotype association using different statistical, data analysis and experimental approaches. Comeron et al. (2003) did a simulation study on the power to detect SNP/phenotype association in candidate quantitative trait loci (QTL) genomic regions and concluded that the study of only a small percentage of all SNPs present in a genomic region gives almost the maximum power to detect association, and greater power is achieved by increasing sample size than by increasing the marker density. Grupe et al. (2001) conducted in silico mapping of complex disease-related traits in mice using a computational method for predicting chromosomal regions regulating phenotypic traits. Their methods treat the organisms as essentially independent and ignore any interrelationships between them.
Felsenstein (1985) argues that due to shared ancestry the organisms cannot be treated as statistically independent and the interrelationships between them, which can be obtained via phylogenetic analysis, should not be ignored. A character-based phylogenetic analysis can produce not only a tree of relationships of the taxa but an apomorphy list containing mutations and phenotypic changes optimized to the branches. Indeed, it has been hypothesized that the character optimizations in a tree contain patterns that reflect historical constraints, selective forces and functional dependencies. Thus, correlated genomic and phenotypic changes along the tree could indicate a functional relationship between phenotype and genotype. Although this information is correlative rather than causative, it is useful for identification of candidate genes. Nevertheless, in most studies involving phylogenetic trees, nucleotide or amino acid sequences and phenotypic observations serve only as sources of variation and the tree is intended for taxonomic purposes only, with little consideration of functional correlations between genotype and phenotype. The implementations of the few phylogeny-based statistical methods for genotype/phenotype correlations that have been described (Maddison, 1990; Pagel, 1994) are primarily GUI based (e.g., Maddison and Maddison 2003a,b; Nixon 2001) and thus unsuitable for the analysis of genome-scale data.
Character evolution is becoming increasingly important in phylogenetic studies of infectious disease (Obenauer et al., 2006). For the human genome, theoretical studies indicate that a genome wide association scan employing every polymorphic marker may have greater power to detect complex disease causing polymorphisms than genome wide linkage studies, even after compensating for the increased number of false positives expected from testing such a large number of markers (Altmüeller et al., 2001; Risch and Merikangas, 1996). This kind of analysis could be especially useful in multifactorial phenotypes where there are potentially hundreds of candidate genes for functional studies and informatic prioritization of candidates is important to progress in biomedical research (Kurc et al., 2006). While the present level of data on the human genome may limit this approach, as more data is acquired our programs would be more applicable. Our script-based programs are particularly suited for sifting through large amounts of SNP data.
In this article, we use correlations between phenotypes and genotypes fully taking into account the underlying phylogenetic tree to identify candidate SNPs related to a given phenotype. To this end, we provide two methods. The first method, VENN, works well when the number of SNPs is large and the number of branches over which change is occurring is modest. VENN operates on apomorphy lists output by popular phylogenetic analysis packages PAUP (Swofford, 2002), POY (Wheeler et al., 2005) or TNT (Goloboff et al., 2003). In contrast, our second method, CCTSWEEP, works even when there are no SNPs that are completely penetrant with the phenotype of interest. VENN and CCTSWEEP both need a phylogeny which can be obtained from genetic data using software such as PAUP, POY, TNT and many others. As a case study, we apply our methods to correlate genotypes and phenotypes associated with susceptibility to infection by Bacillus anthracis among inbred mouse strains. Inbred mice afford several simplifying assumptions that make them attractive for candidate gene discovery. Large segments of the mouse genome have been compared phylogenetically to discover regulatory regions and genes that underlie complex traits (Frazer et al., 2004; Pletcher et al., 2004; Wiltshire et al., 2003). Furthermore, there are abundant and growing genetic polymorphism and comparative phenotype data in the public domain in digital formats (Grubb et al., 2004).
| 2 METHODS AND RESULTS |
|---|
|
|
|---|
2.1 Identifying completely penetrant SNPs
2.1.1 Approach
For a modest number of branches containing change, set theory and Venn diagrams provide the logical bases we employ to find genotype–phenotype relationships. To illustrate, consider a phylogenetic tree on which there are three branches with significant change in the phenotypic character. The three circles in Figure 1 represent all genotypic characters with change optimized to that branch. The intersection of these three sets contains SNPs that have potential to be functionally related to the phenotype. A P-value can be assigned to these candidates by calculating their CCT as described in Section 2.2. This analysis can be extended to any number of branches in the tree.
|
To further filter the candidate loci it could be useful to exclude SNPs that also change in branches where no change in the phenotype is observed. These candidate SNPs would lie in the relative complement of the intersection. In Figure 1, if only branches A and B exhibit a change in the phenotype while branch C does not contain that change, then the candidate SNPs would be contained in the white region at the intersection of the purple and green regions.
In applying these filters, nucleotide data missing due to incomplete datasets is inferred using the tree. To avoid artifacts, we require that SNP nucleotides be known for at least 50% of the strains for a potential candidate SNP. Increases in quality and quantity of SNP data will eventually make this cutoff irrelevant for most regions of the mouse genome (Frazer et al., 2004).
2.1.2 Case study
We use our tool VENN, which implements the approach described above, on the variable phenotype of susceptibility to B.anthracis among 15 strains of inbred mice. The data on Bacillus susceptibility (Dietrich et al., 1998) were obtained from the Mouse Phenome Database (Grubb et al., 2004). Then, using VENN, we identified SNPs that change only on branches with a transition in phenotype from susceptible to resistant. All SNPs located by VENN along with the chromosome and their annotation are shown in Table 1. As will be discussed in Section 3, the last of these SNPs is known to be associated with anthrax susceptibility, leaving the other identified SNPs as potential candidate SNPs.
|
2.2 Identifying SNPs with incomplete penetrance
With an increase in the number of intersecting branches, the number of genetic changes within the intersecting region decrease rapidly. This necessitates a more sophisticated approach when few or no SNPs are completely penetrant with the phenotype. Under these circumstances, we propose a modified version of Maddison's concentrated changes test (CCT) that identifies SNPs that may not be completely penetrant with the phenotype of interest. The CCT also provides a score for the correlation between a particular observed genetic change and a phenotype.
2.2.1 Approach
The CCT as described by Maddison (1990), calculates the probability that a given number of gains and losses, in this case a change in the reconstructed state of a SNP on a branch, fall on the distinguished branches of the tree, i.e., ones on which a change is observed in the phenotype. We account for changes that are not on branches with the observed change in phenotype, by summing over all cases where the distribution of gains and losses is as good as or better than the observed distribution to obtain the CCT,
, for a case where p gains and q losses are observed on branches of the tree having a significant change in the phenotype and n gains and m losses over the entire tree. B(u,r|n,m) is the number of ways in which u gains and r losses can be distributed over the distinguished branches given n gains and m losses over the entire tree. W(n, m) is the number of ways in which n gains and m losses can be distributed over the entire tree. The major limitation with CCT is that it calculates the correlation between binary variables only, though this is not a concern in our case study as the SNPs are biallelic and B.anthracis susceptibility is a binary variable.
2.2.2 Case study
In this case study, we focus only on SNPs on chromosome 11 as literature surveys indicated a higher possibility of finding correlated SNPs within this region. While there may be other regions in the genome (e.g., chromosomes 2, 3, 7) with a functional relation to anthrax susceptibility, we would not be able to corroborate our findings as the literature is not sufficiently advanced regarding other contributing loci. We recalculate the tree as different parts of the genome could have different lineages (Wade et al., 2002), and a globally optimized tree may not reflect the changes happening in a smaller region completely. We use 21 taxa and a smaller number of SNPs (
600 as compared to the previous
13,000 after the 50% cutoff). The high ranking SNPs obtained using CCTSWEEP, are shown in Table 2. A mirror tree visualizing the mouse strain patterns for one of the identified SNPs and anthrax susceptibility is shown in Figure 2.
|
|
A number of markers on chromosome 11 were identified as significantly associated with mouse susceptibility to B.anthracis. No completely penetrant markers were observed for this dataset. These markers were analyzed for their strain distribution patterns and for candidate genes nearby their genomic location. Highly ranked markers included four SNPs in Nalp1b, one 2.3 Mb away from Dock2, one about 0.9 Mb from a locus with 5 CC chemokine genes (CCL3-6,9), one in proximity to IL12b (rs3654344), one 0.3 Mb from Nalp1b and one in Kif1c. One of these was also identified by VENN in the genomewide case (see Table 1). Since linkage disequilibrium blocks are typically large among inbred mouse strains, the Kif1c marker (rs3694522) located only 0.42 Mb away from Nalp1b and another (rs3726991) quite possibly reflect the same QTL. In contrast, rs3654344, which was nearly as penetrant as the best Nalp1b marker is located on chromosome 11 rather distantly (26.9 Mb proximal from Nalp1b), suggesting the possibility of additional QTLs for anthrax susceptibility.
2.2.3 Comparison to non-tree based methods
To compare CCTSWEEP with a non-phylogeny based method, we rank the SNPs using phi-coefficient (Cheetham and Hazel, 1969) and display results in Table 2. Although many SNPs that are strongly correlated with CCTSWEEP are also strongly correlated with phi-coefficient, there are several SNPs that rank much higher with CCTSWEEP than with phi-coefficient and vice versa. For instance, rs3726991 which is functionally related appears at rank 22 using the phi-coefficient but at rank 6 using CCTSWEEP. Thus, our methods provide distinctly different and thus complementary results to non-tree based methods. In addition, CCTSWEEP and VENN consider missing data using character optimization whereas other methods simply ignore missing data.
| 3 ANTHRAX SUSCEPTIBILITY CANDIDATES |
|---|
|
|
|---|
In order to evaluate our candidates, we rely on the literature on B.anthracis susceptibility. Inbred mouse strains have various responses to B.anthracis, the causative agent of anthrax (Welkos et al., 1986). Specifically, cultured macrophages of various mice strains exhibit differences in their susceptibility to cytolysis due to exposure to anthrax lethal toxin (LeTx). Watters et al. (2001) present in vitro and phylogenetic studies suggesting that the variability of resistance and susceptibility among mice is due to a single locus, Kif1C, a kinesin-like motor protein. This locus was identified both by VENN and by CCTSWEEP.
Other groups have found support for a multigenic nature of susceptibility to the anthrax lethal toxin. McAllister et al. (2003) reported at least three QTLs on chromosome 11 control susceptibility to anthrax lethal toxin. Macrophage cytolysis has been found to play a minor role in anthrax pathology, with experiments indicating that LeTx primarily alters signaling cascades in immune cells and blunts immune upregulation, thus reducing bacteriocidal potential against the pathogen (Agrawal et al., 2003; Moayeri et al., 2003; Popov et al., 2002). At the same time, Boyden and Dietrich (2006) recently demonstrated that a major determinant in mediating anthrax lethality is Nalp1b, a member of the inflammasome located near Kif1c on chromosome 11 and again one of the genes identified in our case study. Expression of a Nalp1b allele from susceptible mice in resistant mouse strain macrophages conferred the susceptibility trait. They conclude that previous results regarding Kif1c are either artifactual due to linkage or indicate only a minor role for that candidate (Watters et al., 2001). Other highly ranked SNPs are close to regions with genes important in lymphocyte chemotaxis (Dock2) (Janardhan et al., 2004) and T cell activation upon infection (Bergman et al., 2005; Cote et al., 2006; Popov et al., 2004).
Thus, despite a demonstrated role for Nalp1b it remains possible that multiple loci mediate anthrax susceptibility in mice, and that there is considerable variation among the strains in the profile of their response to the toxin (Moayeri et al., 2004). A promising marker identified in our case study without previously known implication for Bacillus susceptibility is rs3654344. The nearest gene to rs3654344, IL12b, is located 0.2 Mb distal. IL12b expression is fairly restricted to monocytes, macrophages and dendritic cells where it plays a role in the TH1 immune response. Humans with defects in IL12b expression show decreased production of IFN
and increased susceptibility to mycobacterial infections (Remus et al., 2001). Further studies have associated human polymorphisms in IL12b with phenotypic stratification of individuals infected with hepatitis C virus (Mueller et al., 2004). Notably, differences in basal and inducible expression of IL12b have also been observed among mouse strains and associated with polymorphisms in that gene (Ymer et al., 2002). However, these data are incomplete for all strains and do not strictly follow the pattern of LeTx susceptibility. Nonetheless, the strain variation, tissue expression pattern, biological function and prior genetic associations do suggest this as a potential novel candidate locus for anthrax susceptibility, along with others in regions highly correlated here (Dock2 and CC chemokine locus).
| 4 CONCLUSION |
|---|
|
|
|---|
In conclusion, we find that our tools, VENN and CCTSWEEP, can identify changes in genotypes that happen in synchrony with changes in phenotype thus allowing for prioritization of candidate genes for further research. We performed a study where these tools were used to identify candidate genes for the susceptibility to B.anthracis in mice. A review of B.anthracis literature reveals that Nalp1b, the strongest candidate identified by our tools, as well as another gene identified, Kif1c, have been specifically studied and implicated in resistance. This is a good indication for the validity of our methods, and supports the conclusion that other SNPs identified by our tools are potential candidates for further investigation. Thus, the correlation of genetic and phenotypic changes in phylogenetic reconstructions on a large scale may significantly aid in identifying candidate genes for disease-related traits.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
D.J. acknowledges the support of the Department of Biomedical Informatics and OSU Medical Center and that this material is based upon work supported in part by the US Army Research Laboratory and the US Army Research Office under contract/grant number W911NF-05-1-0271.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on August 23, 2006; revised on December 22, 2006; accepted on January 9, 2007
| REFERENCES |
|---|
|
|
|---|
Agrawal A, et al. Impairment of dendritic cells and adaptive immunity by anthrax lethal toxin. Nature (2003) 424:329–334.[CrossRef][Medline]
Altmüeller J, et al. Genomewide scans of complex human diseases: true linkage is hard to find. Am. J. Hum. Genet (2001) 69:936–950.[CrossRef][Web of Science][Medline]
Bergman NH, et al. Murine macrophage transcriptional responses to Bacillus anthracis infection and intoxication. Infect. Immun (2005) 73:1069–1080.
Boyden ED, Dietrich WF. Nalp1b controls mouse macrophage susceptibility to anthrax lethal toxin. Nat. Genet (2006) 38:240–244.[CrossRef][Web of Science][Medline]
Cheetham AH, Hazel JE. Binary (presence-absence) similarity coefficients. J. Paleontol (1969) 43:1130–1136.
Comeron JM, et al. On the power to detect SNP/phenotype association in candidate quantitative trait loci genomic regions: a simulation study. Pacific Symp. Biocomput (2003) 8:478–489.
Cote CK, et al. Roles of macrophages and neutrophils in the early host response to Bacillus anthracis spores in a mouse model of infection. Infect. Immun (2006) 74:469–480.
Dietrich WF. Bacillus anthracis lethal factor susceptibility and identification of a gene, kif1C, mediating resistance. MPD: 702. In: Mouse Phenome Database Web Site. (2006) Bar Harbor, Maine, USA: The Jackson Laboratory. http://www.jax.org/phenome.
Felsenstein J. Phylogenies and the comparative method. Am. Nat (1985) 125:1–15.[CrossRef][Web of Science]
Frazer KA, et al. Segmental phylogenetic relationships of inbred mouse strains revealed by fine-scale analysis of sequence variation across 4.6 Mb of mouse genome. Genome Res (2004) 14:1493–1500.
Goloboff P, et al. Tree Analysis using New Technology. (2003) http://www.zmuc.dk/public/phylogeny/tnt.
Grubb SC, et al. A collaborative database of inbred mouse strain characteristics. Bioinformatics (2004) 20:2857–2859.
Grupe A, et al. In silico mapping of complex disease-related traits in mice. Science (2001) 292:1915–1918.
Janardhan A, et al. HIV-1 nef binds the DOCK2ELMO1 complex to activate rac and inhibit lymphocyte chemotaxis. PLoS Biol (2004) 2:65–76.[CrossRef][Web of Science]
Kurc TM, et al. An XML-based system for synthesis of data from disparate databases. J. Am. Med. Inform. Assoc (2006) 13:289–301.
Maddison D, Maddison W. Mesquite. (2003a) http://www.mesquite.org.
Maddison D, Maddison W. Macclade. (2003b) http://www.macclade.org.
Maddison WP. A method for testing the correlated evolution of two binary characters: are gains or losses concentrated on certain branches of a phylogenetic tree? Evolution (1990) 44:539–557.[CrossRef][Web of Science]
McAllister RD, et al. Susceptibility to anthrax lethal toxin is controlled by three linked quantitative trait loci. Am. J. Pathol (2003) 163:1735–1741.
Moayeri M, et al. Bacillus anthracis lethal toxin induces TNF-alpha-independent hypoxia-mediated toxicity in mice. J. Clin. Invest (2003) 112:670–682.[CrossRef][Web of Science][Medline]
Moayeri M, et al. Mouse susceptibility to anthrax lethal toxin is influenced by genetic factors in addition to those controlling macrophage sensitivity. Infect. Immun (2004) 72:4439–4447.
Mueller T, et al. Influence of interleukin 12B (IL12B) polymorphisms on spontaneous and treatment-induced recovery from hepatitis C virus infection. J. Hepatol (2004) 41:652–658.[CrossRef][Web of Science][Medline]
Nixon K. Winclada. (2001) http://www.cladistics.com.
Obenauer JC, et al. Large-scale sequence analysis of avian influenza isolates. Science (2006) 311:1576–1580.
Pagel M. Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters. R. Soc. Lond. Proc. Series B (1994) 255:37–45.
Pletcher MT, et al. Use of a dense single nucleotide polymorphism map for in silico mapping in the mouse. PLoS Biol (2004) 2:2159–2169.[Web of Science]
Popov SG, et al. Effect of Bacillus anthracis lethal toxin on human peripheral blood mononuclear cells. FEBS Lett (2002) 527.
Popov SG, et al. Systemic cytokine response in murine anthrax. Cell Microbiol (2004) 6:225–233.[CrossRef][Web of Science][Medline]
Remus N, et al. Impaired interferon gamma-mediated immunity and susceptibility to mycobacterial infection in childhood. Pediatr. Res (2001) 50:8–13.[Web of Science][Medline]
Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science (1996) 273:1516–1517.
Swofford DL. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). (1989–2002) Sunderland, Massachusetts: Sinauer Associates. Version 4.
Wade CM, et al. The mosaic structure of variation in the laboratory mouse genome. Nature (2002) 420:574–578.[CrossRef][Medline]
Watters JM, et al. Kif1C and a kinesin-like motor protein and mediates mouse macrophage resistance to anthrax lethal factor. Curr. Biol (2001) 11:1503–1511.[CrossRef][Web of Science][Medline]
Welkos SL, et al. Differences in susceptibility of inbred mice to Bacillus anthracis. Infect. Immun (1986) 51:795–800.
Wheeler WC, et al. POY version 3.0, Documentation by Daniel Janies and Ward Wheeler. Commandline documentation by J. De Laet and W. C. Wheeler. In: Technical Report. (2005).
Wiltshire T, et al. Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse. PNAS (2003) 100:3380–3385.
Ymer SI, et al. Polymorphisms in the Il12b gene affect structure and expression of IL-12 in NOD and other autoimmune-prone mouse strains. Genes and Immun (2002) 3:151–157.[CrossRef]
This article has been cited by other articles:
![]() |
D. A. Leiman, N. M. Lorenzi, J. C. Wyatt, A. S.F. Doney, and S. T. Rosenbloom US and Scottish Health Professionals' Attitudes toward DNA Biobanking J. Am. Med. Inform. Assoc., May 1, 2008; 15(3): 357 - 362. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


