Bioinformatics Advance Access originally published online on January 28, 2005
Bioinformatics 2005 21(9):2130-2132; doi:10.1093/bioinformatics/bti293
TreeScan: a bioinformatic application to search for genotype/phenotype associations using haplotype trees
1Variagenics, Inc., 60 Hampshire Street, Cambridge, MA 02139, USA
2Department of Biology, Washington University, St Louis, MO 63130-4899, USA
*To whom correspondence should be addressed at Departamento de Bioquímica, Genética e Inmunología, Facultad de Biología, Campus Universitario, Universidad de Vigo, Vigo 36310, Spain.
| Abstract |
|---|
Summary: We present the software implementation of the tree scanning method to detect associations between genetic haplotypes and quantitative traits, utilizing the evolutionary history of the haplotypes, in samples of unrelated individuals.
Availability: The program is available free of charge, under the GNU General Public License. A package including C source code, a Makefile, and Windows (DOS) and Macintosh binaries, can be downloaded from http://darwin.uvigo.es
Contact: dposada{at}uvigo.es
The use of haplotypes found at candidate genes increases the power of association studies, eliminates the difficulties of statistical dependence among single nucleotide polymorphisms (SNPs) showing linkage disequilibrium and adds important information about the genetic context in which SNPs are actually placed (Balciuniene et al., 2002; Drysdale et al., 2000; Knoblauch et al., 2002; Van Eerdewegh et al., 2002; Zaykin et al., 2002). Indeed, haplotypes in a candidate gene are themselves correlated due to a common evolutionary history, and the consideration of this history should augment the amount of biological information available in the sample (Lam et al., 2000; Seltman et al., 2001; Seltman et al., 2003; Templeton, 1995; Templeton et al., 1987; Templeton et al., 2000). Templeton et al. (1987) pioneered the use of evolutionary information in phenotype/genotype association studies, and very recently proposed a new and efficient method, tree scanning, to search for genotype/phenotype associations using haplotype trees (Templeton et al., in press). Basically, tree scanning cuts each branch in the haplotype tree in turn, with each cut dividing the haplotypes into two or more mutually exclusive groups or classes. These groups of haplotypes can then be treated as alleles and incorporated in a straightforward manner into a genotypic analysis of phenotypic associations using ANOVA. Because the tests for the different branches are correlated, tree scanning estimates the association P-values with a step-down resampling method with enforced monotonicity that incorporates the correlation structure and corrects for multiple testing (Westfall and Young, 1993).
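The branch-cutting step described above can be illustrated with a minimal Python sketch. It is not TreeScan's C code: the edge-list tree representation, the function names (`split_by_branch`, `genotype_classes`, `anova_f`) and the class labels "A" and "B" are assumptions made for illustration. The sketch removes one branch, recodes each individual's two haplotypes as class alleles, and computes a one-way ANOVA F-statistic and the proportion of trait variation explained.

```python
from collections import defaultdict


def split_by_branch(edges, cut):
    """Return the haplotypes on the cut[0] side of the tree once edge `cut` is removed."""
    adj = defaultdict(set)
    for a, b in edges:
        if {a, b} != set(cut):          # skip the branch being cut
            adj[a].add(b)
            adj[b].add(a)
    side, stack = {cut[0]}, [cut[0]]
    while stack:                        # depth-first walk of one component
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in side:
                side.add(nxt)
                stack.append(nxt)
    return side


def genotype_classes(individuals, side):
    """Recode each (hap1, hap2) pair as a sorted pair of class alleles, e.g. 'AB'."""
    def code(hap):
        return "A" if hap in side else "B"
    return ["".join(sorted(code(h) for h in haps)) for haps in individuals]


def anova_f(groups, values):
    """One-way ANOVA: return (F, proportion of variation explained)."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    n, k = len(values), len(by_group)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and n > number of classes")
    grand = sum(values) / n
    ss_between = sum(len(v) * (sum(v) / len(v) - grand) ** 2
                     for v in by_group.values())
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_within = ss_total - ss_between
    if ss_within == 0:
        return float("inf"), 1.0
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, ss_between / ss_total


# Hypothetical four-haplotype chain and six diploid individuals.
edges = [("H1", "H2"), ("H2", "H3"), ("H3", "H4")]
individuals = [("H1", "H1"), ("H1", "H2"), ("H3", "H4"),
               ("H4", "H4"), ("H2", "H3"), ("H1", "H4")]
traits = [1.0, 2.0, 5.0, 6.0, 3.0, 4.0]

for cut in edges:
    side = split_by_branch(edges, cut)
    classes = genotype_classes(individuals, side)
    print(cut, anova_f(classes, traits))
```

Looping over every edge, as in the last lines, corresponds to the first round of the scan; the permutation-based P-values and the second round are not covered by this sketch.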
Although evolutionary methods for association studies seem very promising, one of their main drawbacks so far may have been the difficulty of actually implementing them (but see Seltman et al., 2003), especially for researchers not familiar with evolutionary jargon. To facilitate the application of the tree scanning method, we have implemented it in a program called TreeScan. TreeScan is written in ANSI C and can be compiled on any operating system with a C compiler using the provided Makefile. The program is available free of charge under the GNU General Public License, and it can be downloaded from http://darwin.uvigo.es. Detailed documentation is included in the package, with instructions for installation, compilation and execution, along with an example input file.
The input file for the TreeScan program consists of haplotype designations and quantitative traits for each individual in the sample, and a haplotype tree. The haplotype tree has to be in parenthetical notation and must include branch lengths (Fig. 1). This tree can be rooted or unrooted, and can include multifurcations, but it cannot be reticulated. Reticulations or loops arise in the tree because of phylogenetic ambiguity due to recombination or to parallel, convergent or reverse changes. In some cases, these reticulations can be resolved using predictions from coalescent theory (Crandall and Templeton, 1993; Pfenninger and Posada, 2002). In other cases the reticulations are very difficult to resolve without arbitrariness, and the user might want to analyze alternative trees separately. Another strategy to deal with phylogenetic ambiguity would be to simultaneously consider all tree splits derived from the different loop solutions (Templeton et al., in press), but this approach is not currently implemented in TreeScan. To facilitate the use of the same tree for different samples, or to use trees from previous studies, the input tree may contain haplotypes that are not present in the sample. It is also possible to have haplotypes in the sample that are not in the tree, although that situation should not occur often in practice. In such a case, individuals with haplotypes not present in the haplotype tree are effectively ignored. Any disagreement between the haplotype composition of the tree and that of the sample is reported in the logfile. The upper part of Figure 1 depicts a haplotype tree obtained with the TCS software (Clement et al., 2000), accompanied by the parenthetical notation of the tree required by the program. Note that internal nodes are represented as terminal nodes with zero branch lengths.
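As a rough illustration of the tree requirements above, the following Python sketch parses a parenthetical (Newick-style) tree with branch lengths, lists the terminal nodes with zero branch length (the convention for interior haplotypes), and compares the haplotypes in the tree with those in a sample. The tree string, the sample, and the function names are hypothetical, and the parser is deliberately minimal (no quoted labels, comments or error recovery); the actual TreeScan input layout is described in the program documentation, not here.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with 'name', 'length', 'children'."""
    text = text.strip()
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                                  # consume '('
            children.append(node())
            while text[pos] == ",":                   # multifurcations allowed
                pos += 1
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                                  # consume ')'
        start = pos
        while pos < len(text) and text[pos] not in ",:();":
            pos += 1
        name = text[start:pos].strip()
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()


def leaves(tree):
    """Return all terminal nodes of the parsed tree, left to right."""
    if not tree["children"]:
        return [tree]
    return [leaf for child in tree["children"] for leaf in leaves(child)]


def compare_haplotypes(tree, sample_haps):
    """Return (haplotypes only in the tree, haplotypes only in the sample)."""
    tree_haps = {leaf["name"] for leaf in leaves(tree)}
    sample_haps = set(sample_haps)
    return sorted(tree_haps - sample_haps), sorted(sample_haps - tree_haps)


# Hypothetical tree: H2 and H4 are interior haplotypes written as
# zero-length terminal nodes; the root is a trifurcation.
tree = parse_newick("((H1:1,H2:0):1,(H3:2,H4:0):1,H5:1);")
interior = [leaf["name"] for leaf in leaves(tree) if leaf["length"] == 0]
only_tree, only_sample = compare_haplotypes(tree, ["H1", "H2", "H3", "H6"])
print(interior, only_tree, only_sample)
```

In this sketch, individuals carrying a haplotype listed in `only_sample` would be the ones to drop before the analysis, mirroring the behaviour described in the text.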
TreeScan produces three different output files: outfile, logfile and tabfile (optional). The outfile contains the outcome of the analysis. Results are displayed for each trait in turn and, within each trait, first for the first round of tests and then for the conditional tests (second round). For each branch in the haplotype tree, this file reports the proportion of trait variation explained by the branch, the F-statistic from the ANOVA, the uncorrected P-value from the F-distribution, the uncorrected permutational P-value and the corrected step-down permutational P-value before and after enforcing monotonicity, the latter being the main association P-value (Fig. 1). This file also includes results from the second round of the tree scanning, where new haplotype subgroups are defined and contrasted within those groups that resulted in significant genotype/phenotype associations in the first round of the analysis (Templeton et al., in press). The logfile contains information that helps verify that the input was properly interpreted and that the analysis proceeded as expected. It includes the data (individual haplotypes and traits), the haplotype tree and information on all nodes in the tree, a comparison of haplotypes in the sample and haplotypes in the tree and a complete description of the tree splits defined during the analysis. The tabfile is optional and includes a complete description of the ANOVA tables built during the analysis, and a summary of the ANOVA results.
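The relationship between the corrected step-down P-values before and after enforcing monotonicity can be sketched as follows. This is a generic step-down max-statistic procedure in the style of Westfall and Young (1993), written in Python under stated assumptions: the function name `stepdown_maxt` is hypothetical, the per-branch statistics for each permutation are assumed to be precomputed, and TreeScan's own implementation may differ in its details.

```python
def stepdown_maxt(observed, permuted):
    """Step-down max-statistic adjusted P-values.

    observed : list of test statistics, one per branch (larger = stronger).
    permuted : list of lists, the same statistics recomputed on each
               permutation of the trait values.
    Returns (raw, monotone): adjusted P-values before and after
    enforcing monotonicity, in the original branch order.
    """
    m = len(observed)
    # Branch indices ordered from the largest to the smallest observed statistic.
    order = sorted(range(m), key=lambda j: observed[j], reverse=True)
    counts = [0] * m
    for stats in permuted:
        running_max = float("-inf")
        # Walk from the weakest branch upward, keeping successive maxima.
        for j in reversed(order):
            running_max = max(running_max, stats[j])
            if running_max >= observed[j]:
                counts[j] += 1
    raw = [c / len(permuted) for c in counts]
    # Enforce monotonicity: a weaker branch cannot get a smaller P-value
    # than a stronger one.
    monotone = list(raw)
    for prev, cur in zip(order, order[1:]):
        monotone[cur] = max(monotone[cur], monotone[prev])
    return raw, monotone


# Hypothetical F-statistics for three branches and six permutations.
observed = [5.0, 1.0, 3.0]
permuted = [[1.0, 4.0, 2.0], [6.0, 0.5, 0.5], [2.0, 2.0, 3.5],
            [0.1, 0.2, 0.3], [7.0, 0.0, 0.0], [5.5, 0.1, 0.1]]
raw, monotone = stepdown_maxt(observed, permuted)
print(raw, monotone)
```

Because every branch is evaluated on the same permuted data sets, the correlation among the tests is carried through to the adjusted P-values, which is the motivation for a resampling-based correction given in the text.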
Program arguments are entered on the command line and override the default values of some settings. The user can specify the number of permutations for the estimation of P-values, the significance level, the minimum number of individuals required in each observed genotypic class to proceed with the analysis, whether to print the ANOVA tables, whether to sort results by the F-statistic and whether to use the genetic variance as the main test statistic (instead of the F). TreeScan results have been validated with the R package MULTTEST (Ge et al., 2003) to check that the F-statistics and the step-down correction were correctly implemented, and with an independent script written by one of us (T. J. M.) for the program MACANOVA (Oehlert and Bingham, 2003), available at http://www.stat.umn.edu/macanova/, that implements the ANOVA calculations given the tree partitions.
| Acknowledgments |
|---|
Part of this work was done at Variagenics, Inc. (Cambridge, MA, USA). We thank Vincent Stanton, Jr, Daniel Chasman, Lakshman Subrahmanyan and Carsten Wiuf for helpful discussions. Financial support from the Burroughs Wellcome Fund Innovation Award in Functional Genomics 100133, an NSF predoctoral fellowship award to T. J. M., and the National Institutes of Health grant GM65509 is gratefully acknowledged.
Received on November 5, 2004; revised on January 10, 2005; accepted on January 25, 2005
| REFERENCES |
|---|
Balciuniene, J., et al. (2002) Investigation of the functional effect of monoamine oxidase polymorphisms in human brain. Hum. Genet., 110, 1-7.
Boerwinkle, E. and Sing, C.F. (1986) Bias of the contribution of single locus effects to the variance of a quantitative trait. Am. J. Hum. Genet., 39, 137-144.
Clement, M., et al. (2000) TCS: a computer program to estimate gene genealogies. Mol. Ecol., 9, 1657-1659.
Crandall, K.A. and Templeton, A.R. (1993) Empirical tests of some predictions from coalescent theory with applications to intraspecific phylogeny reconstruction. Genetics, 134, 959-969.
Drysdale, C.M., et al. (2000) Complex promoter and coding region beta 2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl Acad. Sci. USA, 97, 10483-10488.
Ge, Y., et al. (2003) Resampling-based multiple testing for microarray data hypothesis. Test, 12, 1-44.
Knoblauch, H., et al. (2002) Common haplotypes in five genes influence genetic variance of LDL and HDL cholesterol in the general population. Hum. Mol. Genet., 11, 1477-1485.
Lam, J.C., et al. (2000) Haplotype fine mapping by evolutionary trees. Am. J. Hum. Genet., 66, 659-673.
Oehlert, G.W. and Bingham, C. (2003) MacAnova: a program for statistical analysis and matrix algebra. School of Statistics, University of Minnesota, St Paul, MN.
Pfenninger, M. and Posada, D. (2002) Phylogeographic history of the land snail Candidula unifasciata (Helicellinae, Stylommatophora): fragmentation, corridor migration, and secondary contact. Evolution Int. J. Org. Evolution, 56, 1776-1788.
Seltman, H., et al. (2001) Transmission/Disequilibrium test meets measured haplotype analysis: family-based association analysis guided by evolution of haplotypes. Am. J. Hum. Genet., 68, 1250-1263.
Seltman, H., et al. (2003) Evolutionary-based association analysis using haplotype data. Genetic Epidemiology, 25, 48-58.
Templeton, A.R. (1995) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping or DNA sequencing. V. Analysis of case/control sampling designs: Alzheimer's disease and the apoprotein E locus. Genetics, 140, 403-409.
Templeton, A.R., et al. (1987) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. I. Basic theory and an analysis of alcohol dehydrogenase activity in Drosophila. Genetics, 117, 343-351.
Templeton, A.R., et al. (2000) Cladistic structure within the human Lipoprotein Lipase gene and its implications for phenotypic association studies. Genetics, 156, 1259-1275.
Templeton, A.R., Maxwell, T., Posada, D., Stengard, J.H., Boerwinkle, E. and Sing, C.F. (in press) Tree scanning: a method for using haplotype trees in phenotype/genotype association studies. Genetics, 169, 441-453.
Van Eerdewegh, P., et al. (2002) Association of the ADAM33 gene with asthma and bronchial hyperresponsiveness. Nature, 418, 426-430.
Westfall, P.H. and Young, S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. Wiley, NY.
Zaykin, D.V., et al. (2002) Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum. Hered., 53, 79-91.
