Bioinformatics Advance Access originally published online on June 9, 2005
Bioinformatics 2005 21(16):3445-3447; doi:10.1093/bioinformatics/bti529
PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data
Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI 48103, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: We describe a tool that produces summary statistics and basic quality assessments for gene-mapping data, accommodating either pedigree or case-control datasets. Our tool can also produce graphic output in the PDF format.
Availability: http://www.sph.umich.edu/csg/abecasis/Pedstats/download/
Contact: wiggie{at}umich.edu
Supplementary information: http://www.sph.umich.edu/csg/abecasis/Pedstats/
A crucial first step in the analysis of gene mapping data is the careful description of the available data, including, for example, genotyping completeness and heterozygosities for genetic markers, and distributions and familial correlations for quantitative traits. Although a number of programs now provide some facilities for data checking or summary (Mukhopadhyay et al., 2005; Lange et al., 1988; Elston et al., 2004; O'Connell and Weeks, 1998), complete screening and summary of genetic data frequently involves the use of multiple programs and/or in-house tools. As the scale of the datasets available for analysis increases, this process can become particularly challenging. For example, with the advent of high-throughput single nucleotide polymorphism genotyping technologies, datasets will soon be available that include genotypes for hundreds of thousands or millions of markers for each individual. In addition, with the focus on uncovering the genetic basis of complex disease, it is likely that collaborative projects will collect samples with hundreds or thousands of phenotypes, each measured on thousands of individuals. We have developed PEDSTATS, a freely available utility for summarizing salient features and performing basic quality checks on gene-mapping data. Our utility can conveniently handle these very large datasets, and here we summarize its main features.
PEDSTATS runs on any platform where a modern C++ compiler is available, including those based on the Linux, UNIX, Windows and Mac OS X operating systems. It is a command-line utility that can produce both text output to the console and graphical output to a PDF file. Its major capabilities can be grouped into four areas: (1) checks of input formats and pedigree consistency, (2) checks and descriptions of genetic marker data, (3) checks and descriptions of quantitative traits and covariates and (4) descriptions of discrete traits. We describe each of these in turn below.
The first step in any analysis is the validation of input files. At this stage, common data-format errors such as missing or extraneous columns are reported. Next, the reported family structures are validated to ensure that all connecting individuals are present and that sex-codes are consistent for the various individuals. If desired, large pedigrees can be trimmed to remove uninformative individuals with no phenotype or genotype data, or separated into disconnected family units. A brief summary of the number of pedigrees, individuals and a distribution of individuals per family is produced. This information can be graphically summarized (Fig. 1A is an example summarizing the distribution of family sizes in one large dataset) and, optionally, includes counts for various types of relative pairs which can be further broken down by sex. Individuals with no phenotype or genotype information can be automatically removed and a new set of input files generated. PEDSTATS readily accepts files prepared for other packages we have developed, including those prepared for linkage analyses with Merlin (Abecasis et al., 2002), association analyses with QTDT (Abecasis et al., 2000) and relationship inference with GRR (Abecasis et al., 2001). Other popular formats, such as those used by the LINKAGE package (Lathrop et al., 1985) and by MENDEL and related tools (Lange et al., 1988) are also accommodated.
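The structural checks described above (every referenced parent is present, and sex codes agree with parental roles) can be illustrated with a minimal Python sketch. This is not PEDSTATS code: the `Person` record, the `check_pedigree` function and the coding used here ("0" for an unknown parent, "1" for male, "2" for female, as in LINKAGE-style files) are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record layout: "0" marks an unknown parent,
# sex is coded "1" (male) or "2" (female).
Person = namedtuple("Person", "family person father mother sex")


def check_pedigree(people):
    """Return a list of human-readable problems found in the pedigree records."""
    problems = []
    index = {(p.family, p.person): p for p in people}
    for p in people:
        for role, parent_id, expected_sex in (
            ("father", p.father, "1"),
            ("mother", p.mother, "2"),
        ):
            if parent_id == "0":
                continue  # founder or unknown parent: nothing to check
            parent = index.get((p.family, parent_id))
            if parent is None:
                problems.append(
                    f"{p.family}/{p.person}: {role} {parent_id} is not in the file"
                )
            elif parent.sex != expected_sex:
                problems.append(
                    f"{p.family}/{p.person}: {role} {parent_id} has sex code {parent.sex}"
                )
    return problems


records = [
    Person("F1", "1", "0", "0", "1"),
    Person("F1", "2", "0", "0", "2"),
    Person("F1", "3", "1", "2", "1"),
    Person("F1", "4", "2", "9", "2"),  # father is coded female, mother is absent
]
for line in check_pedigree(records):
    print(line)
```

Indexing individuals by (family, person) keeps the check linear in the number of records, which matters for the large datasets discussed in the introduction.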
When verifying genetic marker data, PEDSTATS reports basic statistics like heterozygosity and genotyping completeness and can produce graphical summaries of allele and genotype frequencies. After automatic grouping of rare alleles, conformance of observed genotypes with Hardy-Weinberg equilibrium can be checked with a χ² test for multi-allelic markers or an exact test for bi-allelic markers (Wigginton et al., 2005). Results of Hardy-Weinberg tests, including an exact distribution for the number of heterozygotes in the sample, can be presented graphically (e.g. Fig. 1B). Mendelian inheritance checks for both autosomal and X-linked marker data are also carried out using a genotype elimination algorithm that finds all inconsistencies in pedigrees without loops (Lange and Goradia, 1987; O'Connell and Weeks, 1999). Verifying Mendelian consistency prior to analysis of genetic marker data can be a crucial step (Lange and Goradia, 1987; O'Connell and Weeks, 1998), since most genetic analysis programs do not model genotyping error explicitly (for an exception, see Sobel et al., 2002).
For quantitative traits and covariates, PEDSTATS reports the range, mean and variance of the trait distribution along with the correlation between siblings. Several graphics, including histograms of the overall trait distribution and comparisons of distributions between males and females, can be generated (as illustrated in Fig. 1, Panel C, which summarizes the distribution of Height in one large dataset). These can be helpful in detecting outliers as well as deviations from approximate normality, which is important for many quantitative trait analyses (Allison et al., 1999). Optionally, correlations for other relative pair types can be calculated and plotted (as illustrated in Fig. 1, Panel D, which summarizes the correlations between Weight for different relative pairs) and stratified by sex, if desired. Correlations between relatives can provide information about the overall impact of genes on a particular trait. In the example, it is clear that the correlation of the variable Weight for first-degree relatives (in this case, parent-offspring and sibling pairs) is higher than for more distant relatives (half-sibling, avuncular, grandparent-grandchild and cousin pairs).
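To make the sibling correlation concrete, the sketch below computes a Pearson correlation over all sib pairs, entering each pair in both orders so that the result does not depend on how siblings are listed. The estimator actually used by PEDSTATS is not described in this note, so this is only one plausible choice; the function name `sib_pair_correlation` and the input layout (one list of trait values per sibship, `None` for missing) are assumptions.

```python
from itertools import combinations
from statistics import mean


def sib_pair_correlation(sibships):
    """Pearson correlation over all sib pairs, each pair entered in both orders.

    `sibships` is an iterable of lists of trait values, one list per sibship;
    missing values are given as None. Returns None when fewer than two pairs
    are available or the trait does not vary.
    """
    xs, ys = [], []
    for values in sibships:
        observed = [v for v in values if v is not None]
        for a, b in combinations(observed, 2):
            xs += [a, b]
            ys += [b, a]
    if len(xs) < 4:  # fewer than two distinct pairs
        return None
    m = mean(xs)  # xs and ys hold the same values, so they share one mean
    cov = sum((x - m) * (y - m) for x, y in zip(xs, ys))
    var = sum((x - m) ** 2 for x in xs)
    return cov / var if var > 0 else None


# Hypothetical heights (cm) for three sibships
heights = [[170.0, 172.5, None], [158.0, 161.0], [181.0, 179.5, 183.0]]
print(sib_pair_correlation(heights))
```

The same function could be applied to other relative pair types by passing two-element lists, one per pair.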
When an age variable is present, we have implemented checks to ensure that values recorded for each individual are compatible with those of their ancestors, subject to user-specified minimum and maximum generation times.
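A minimal sketch of such an age check follows. It assumes that all ages were recorded at the same time and that an ancestor k generations up should be between k times the minimum and k times the maximum generation time older than the descendant. The function name `check_ages`, the dictionary layout and the bounds in the example are hypothetical and do not reflect the PEDSTATS implementation.

```python
def check_ages(ages, parents, min_gen, max_gen):
    """Flag descendant/ancestor pairs whose age gap is implausible.

    ages:    dict mapping individual id -> age, or None if unrecorded
    parents: dict mapping individual id -> (father_id, mother_id),
             with None for an unknown parent
    Returns a list of (descendant, ancestor, generations, age_gap) tuples.
    """
    problems = []
    for person, age in ages.items():
        if age is None:
            continue
        frontier = [(person, 0)]
        while frontier:
            current, depth = frontier.pop()
            for parent in parents.get(current, (None, None)):
                if parent is None:
                    continue
                generations = depth + 1
                parent_age = ages.get(parent)
                if parent_age is not None:
                    gap = parent_age - age
                    if not generations * min_gen <= gap <= generations * max_gen:
                        problems.append((person, parent, generations, gap))
                # keep climbing so that ancestors beyond an unrecorded age are checked
                frontier.append((parent, generations))
    return problems


ages = {"gf": 80, "f": None, "kid": 20, "m": 25}
parents = {"kid": ("f", "m"), "f": ("gf", None)}
print(check_ages(ages, parents, min_gen=15, max_gen=60))
```

Walking past individuals with a missing age lets a grandparent still constrain a grandchild. In pedigrees where an ancestor is reachable by more than one path, this simple version may report the same pair more than once.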
Finally, for discrete traits, PEDSTATS reports the proportion of phenotyped individuals and provides a breakdown of affected individuals. A summary of affected, unaffected and discordant pairs can also be produced, and may help guide decisions on whether a dataset contains sufficient information for an affected relative pair analysis to be carried out (Risch, 1990; Whittemore and Halpern, 1994). As with the other analysis options, discrete trait reports can be segregated by sex.
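The pair summary can be illustrated for sibling pairs with a short sketch. The status coding (2 for affected, 1 for unaffected, 0 for unknown) follows the common LINKAGE convention and is an assumption here, as is the function name `count_sib_pairs`; other relative pair types would need the pedigree structure and are not covered.

```python
from collections import Counter
from itertools import combinations


def count_sib_pairs(sibships):
    """Count affected, unaffected and discordant sib pairs.

    `sibships` is an iterable of lists of status codes, one list per sibship:
    2 = affected, 1 = unaffected, 0 = unknown (ignored).
    """
    counts = Counter()
    for statuses in sibships:
        known = [s for s in statuses if s in (1, 2)]
        for a, b in combinations(known, 2):
            if a == 2 and b == 2:
                counts["affected"] += 1
            elif a == 1 and b == 1:
                counts["unaffected"] += 1
            else:
                counts["discordant"] += 1
    return counts


# Hypothetical status codes for two sibships
summary = count_sib_pairs([[2, 2, 1], [1, 1, 0]])
print(dict(summary))
```

A small number of affected pairs in such a summary would suggest that an affected relative pair analysis is unlikely to be informative.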
In addition to the ability to report statistics separately for different relative pairs and segregate results by sex, PEDSTATS can produce reports for individual families and allows various filters to be applied to input data prior to analysis. For example, all analyses can be restricted to affected individuals (for a specific trait) or to individuals with a minimal amount of genotype data.
We hope our tool will prove valuable to scientists hoping to discern important features of their data, and ease the burdensome task of verifying the consistency and integrity of input formats. Executables, source code and a web-based tutorial that explains input file format, implementation details and output for various tests are available from our website.
| Acknowledgments |
|---|
This work was supported by research grants from the National Human Genome Research Institute and the National Eye Institute.
Conflict of Interest: none declared.
Received on April 28, 2005; revised on June 5, 2005; accepted on June 6, 2005
| REFERENCES |
|---|
Abecasis, G.R., et al. (2000) A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet., 66, 279–292.
Abecasis, G.R., et al. (2001) GRR: graphical representation of relationship errors. Bioinformatics, 17, 742–743.
Abecasis, G.R., et al. (2002) Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101.
Allison, D.B., et al. (1999) Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci-mapping procedure. Am. J. Hum. Genet., 65, 531–544.
Elston, R., Bailey-Wilson, J., Bonney, G., Tran, L., Keats, B., Wilson, A. (2004) SAGE: Statistical Analysis for Genetic Epidemiology, Version 5.0.
Lange, K. and Goradia, T.M. (1987) An algorithm for automatic genotype elimination. Am. J. Hum. Genet., 40, 250–256.
Lange, K., et al. (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genet. Epidemiol., 5, 471–472.
Lathrop, G.M., et al. (1985) Multilocus linkage analysis in humans: detection of linkage and estimation of recombination. Am. J. Hum. Genet., 37, 482–498.
Mukhopadhyay, N., et al. (2005) Mega2: data-handling for facilitating genetic linkage and association analyses. Bioinformatics, 21, 2556–2557.
O'Connell, J.R. and Weeks, D.E. (1998) PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet., 63, 259–266.
O'Connell, J.R. and Weeks, D.E. (1999) An optimal algorithm for automatic genotype elimination. Am. J. Hum. Genet., 65, 1733–1740.
Risch, N. (1990) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet., 46, 229–241.
Sobel, E., et al. (2002) Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet., 70, 496–508.
Whittemore, A.S. and Halpern, J. (1994) A class of tests for linkage using affected pedigree members. Biometrics, 50, 118–127.
Wigginton, J.E., et al. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am. J. Hum. Genet., 76, 887–893.