Bioinformatics Advance Access originally published online on October 25, 2005
Bioinformatics 2005 21(24):4430-4431; doi:10.1093/bioinformatics/bti725
GSMA: software implementation of the genome search meta-analysis method
1Department of Medical and Molecular Genetics, King's College London 8th Floor Guy's Tower, Guy's Hospital, London SE1 9RT, United Kingdom
2Division of Clinical Neurobiology and Behavior, University of Pennsylvania School of Medicine 3535 Market Street, Rm 4006, Philadelphia, PA 19104-3309, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Meta-analysis can be used to pool results of genome-wide linkage scans. This is of great value in complex diseases, where replication of linked regions occurs infrequently. The genome search meta-analysis (GSMA) method is widely used for this analysis, and a computer program is now available to implement the GSMA.
Availability: http://www.kcl.ac.uk/depsta/memoge/gsma/
Contact: Cathryn.lewis{at}genetics.kcl.ac.uk
| INTRODUCTION |
|---|
Genome-wide linkage searches are widely used to identify regions of the genome that may harbour susceptibility genes for complex diseases. The value of these studies has been confirmed by a few genes that were localized by following up regions highlighted in linkage studies, e.g. CARD15 for Crohn's disease and CAPN10 for type 2 diabetes. However, linkage studies for many complex diseases have been disappointing, with few regions showing significant evidence for linkage, and little replication between studies of the same disease (Altmuller et al., 2001). Meta-analysis of genome-wide results provides a rapid method to identify linked regions that individual studies may lack the power to detect.
GSMA method
The most widely used meta-analysis method for linkage studies is the genome search meta-analysis (GSMA) (Wise et al., 1999; Levinson et al., 2003). The GSMA is a non-parametric method, which is applicable to results for any genome-wide linkage study, regardless of family structure (e.g. extended pedigrees, affected sib pairs), markers or statistical analysis method. The GSMA method has been applied to 13 diseases to date, with many further studies in progress. Studies in schizophrenia, rheumatoid arthritis and type 2 diabetes have shown that the GSMA can identify novel regions that were not highlighted by results from original studies (Demenais et al., 2003; Fisher et al., 2003; Lewis et al., 2003).
For each scan, the GSMA requires output statistics across the genome. These may be non-parametric LOD scores calculated at 1 cM intervals (e.g. from Genehunter), parametric LOD scores calculated for a series of models and recombination fractions, or a single-point linkage statistic for each marker. Any linkage statistic (NPL score, LOD score, P-value) may be used. Data on linkage test statistics and markers are usually obtained from output files of linkage analysis programs or read from published graphs and tables. Data extraction is a key component of any meta-analysis, and detailed information on this stage of the GSMA is given on the GSMA website.
The genome is divided into n bins of approximately equal cM width (e.g. 120 bins of 30 cM). For each study, the maximum evidence for linkage in each bin is identified, and bins are ranked (n, n − 1, ..., 1) on the basis of their relative evidence for linkage, so that the bin with the strongest evidence receives rank n. These ranks are summed across studies. The summed rank (SR) forms a test statistic for each bin and can be tested for significance using its distribution function (Wise et al., 1999) or by simulation. This bin-wise statistic presents a multiple-testing problem, since with no linkage we expect 5% of the bins to achieve nominal significance (P_SR < 0.05). A genome-wide interpretation of results is obtained through the ordered rank (OR) statistic: bins are ordered by summed rank, and each order statistic, e.g. the observed k-th highest summed rank, is compared with the distribution of k-th highest summed ranks obtained through simulation by re-assigning ranks within each study. Simulation studies have shown that any bin with a significant summed rank and ordered rank statistic (P_SR < 0.05, P_OR < 0.05) has a high probability of containing a true susceptibility gene (Levinson et al., 2003). A weighted analysis, where study ranks are multiplied by a weight reflecting the informativeness of the study, can also be performed.
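The ranking, summing and simulation steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of the GSMA program: the function names (`rank_bins`, `summed_ranks`, `gsma_pvalues`) are hypothetical, tied statistics are assumed to share their average rank, and P-values are estimated as the simple proportion of simulated replicates at least as extreme as the observed value.

```python
import random

def rank_bins(stats):
    """Rank one study's bin statistics: strongest evidence gets rank n, weakest 1.
    Assumption: tied statistics share the average of the ranks they span."""
    n = len(stats)
    order = sorted(range(n), key=lambda i: stats[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and stats[order[j + 1]] == stats[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def summed_ranks(rank_table):
    """Sum ranks across studies; rank_table holds one list of n ranks per study."""
    n = len(rank_table[0])
    return [sum(study[b] for study in rank_table) for b in range(n)]

def gsma_pvalues(rank_table, n_sim=10000, seed=0):
    """Estimate P_SR and P_OR for every bin by re-assigning ranks within each study."""
    rng = random.Random(seed)
    n = len(rank_table[0])
    observed = summed_ranks(rank_table)
    # Bins ordered from highest to lowest observed summed rank.
    by_sr = sorted(range(n), key=lambda b: observed[b], reverse=True)
    obs_sorted = [observed[b] for b in by_sr]
    sr_hits = [0] * n
    or_hits = [0] * n
    for _ in range(n_sim):
        # Shuffle each study's ranks independently across bins.
        sim = summed_ranks([rng.sample(study, n) for study in rank_table])
        sim_sorted = sorted(sim, reverse=True)
        for b in range(n):
            if sim[b] >= observed[b]:
                sr_hits[b] += 1
        for k in range(n):
            # Compare k-th highest simulated SR with k-th highest observed SR.
            if sim_sorted[k] >= obs_sorted[k]:
                or_hits[k] += 1
    p_sr = [h / n_sim for h in sr_hits]
    p_or = [0.0] * n
    for k, b in enumerate(by_sr):
        p_or[b] = or_hits[k] / n_sim
    return p_sr, p_or

# Toy data (made up): 3 studies, 5 bins, maximum linkage statistic per bin.
studies = [
    [3.0, 0.1, 0.5, 1.0, 0.2],
    [2.5, 0.3, 0.1, 0.9, 1.1],
    [4.0, 1.0, 0.2, 0.3, 0.5],
]
table = [rank_bins(s) for s in studies]
print(summed_ranks(table))  # [15.0, 7.0, 5.0, 9.0, 9.0]
p_sr, p_or = gsma_pvalues(table, n_sim=2000, seed=1)
```

In this toy example the first bin is ranked highest in every study, so its summed rank reaches the maximum possible value of 15 and its simulated P_SR is small.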
We have recently developed a software package to perform the GSMA. Executables are available for Sun, Windows, Mac and Linux from the GSMA website; source code (C++) is available from the authors. The GSMA website also has program documentation, a Procedure Guide (detailing methods for data extraction), a list of boundary markers for defining bins, and a bibliography of GSMA studies and methodology development.
Statistical software packages can also be used to obtain the SR statistics and P-values, using simulation or the Koziol and Feng method for P-value calculation (Koziol and Feng, 2004). However, the OR statistic is more difficult to calculate, and this useful statistic has therefore been used in only a few GSMA studies. A method for testing heterogeneity across studies in the GSMA was recently developed and is implemented by the HEGESMA software, which also performs a meta-analysis of the study ranks (Zintzaras and Ioannidis, 2005a,b). The GSMA program, together with the HEGESMA software, provides analysis tools which will enable a full range of testing procedures in the GSMA to be performed by any investigator.
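As an illustration of how a bin-wise SR P-value can be obtained without simulation in a general-purpose environment, the sketch below builds the exact null distribution of the summed rank by direct convolution. It assumes an unweighted analysis in which each study's rank for a given bin is independently uniform on 1, ..., n (no ties). The function names are hypothetical, and this is neither the closed-form expression of Wise et al. (1999) nor the Koziol and Feng (2004) method, only a brute-force calculation of the same null distribution under the stated assumptions.

```python
def summed_rank_pmf(n_bins, n_studies):
    """Exact null distribution of the summed rank of one bin.
    Assumption: each study contributes an independent rank, uniform on 1..n_bins."""
    pmf = {0: 1.0}
    for _ in range(n_studies):
        nxt = {}
        for total, p in pmf.items():
            for r in range(1, n_bins + 1):
                nxt[total + r] = nxt.get(total + r, 0.0) + p / n_bins
        pmf = nxt
    return pmf

def summed_rank_pvalue(observed_sr, n_bins, n_studies):
    """P(SR >= observed_sr) under the null distribution above."""
    pmf = summed_rank_pmf(n_bins, n_studies)
    return sum(p for total, p in pmf.items() if total >= observed_sr)

# With 5 bins and 3 studies, the maximum summed rank 15 needs rank 5 in every study.
print(summed_rank_pvalue(15, n_bins=5, n_studies=3))  # approximately (1/5)**3 = 0.008
```

The OR statistic has no equally simple calculation, because the k-th highest summed rank depends on the joint distribution of all bins, which is why simulation is used for it.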
| IMPLEMENTATION |
|---|
The GSMA program allows for an arbitrary number of bins (n), and studies (m), with no maximum values specified in the program. Significance tests for the summed rank and the ordered rank are performed, for weighted and unweighted analyses. The P-values are assessed by simulation of observed ranks within each study.
Two input files are required: a matrix of the maximum linkage statistic (e.g. NPL score, LOD score) for each bin for each study (with bin labels in column 1 and study names in row 1), and a file listing the weighting factor for each study. For studies reporting P-values, the data should be entered as 1 − P-value to ensure the correct ranking of results, with significant results assigned high ranks. Tied observations within studies are permitted. Most genome-wide linkage studies have results available for all bins, but the program deals with any missing values by replacing them with the median linkage statistic for that study (giving a rank of (n + 1)/2). The weight of each study should reflect its informativeness, although the relative weighting of extended pedigrees and affected sib pairs will depend on the genetic effects contributing to disease risk, and is therefore difficult to quantify. One commonly used weighting function is the square root of the number of affected individuals.
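The data-preparation rules above can be illustrated with a short Python sketch. It is not the GSMA program's own code: all function names are hypothetical, missing values are represented by `None`, and ties are assumed to receive their average rank. Under that assumption, a missing value filled with the study median does receive the rank (n + 1)/2 in the example.

```python
import math
import statistics

def to_linkage_scale(p_values):
    """Convert P-values to 1 - P so that significant results receive high ranks."""
    return [None if p is None else 1.0 - p for p in p_values]

def fill_missing_with_median(stats):
    """Replace missing bins (None) with the median of the study's observed statistics."""
    med = statistics.median([s for s in stats if s is not None])
    return [med if s is None else s for s in stats]

def average_ranks(stats):
    """Rank from 1 (weakest) to n (strongest); ties share their average rank (assumption)."""
    return [
        sum(1 for t in stats if t < s) + (sum(1 for t in stats if t == s) + 1) / 2
        for s in stats
    ]

def sqrt_affected_weights(n_affected):
    """One commonly used weight: square root of the number of affected individuals."""
    return [math.sqrt(a) for a in n_affected]

def weighted_summed_ranks(rank_table, weights):
    """Multiply each study's ranks by its weight, then sum across studies per bin."""
    n = len(rank_table[0])
    return [sum(w * study[b] for study, w in zip(rank_table, weights)) for b in range(n)]

# Toy study with five bins, one of them missing (values are made up).
study = fill_missing_with_median([0.5, None, 2.0, 1.0, 3.0])
print(average_ranks(study))  # [1.0, 3.0, 4.0, 2.0, 5.0]; the filled bin gets (5 + 1)/2 = 3
```

The weights returned by `sqrt_affected_weights` are raw values; any standardization of the weights is left out of this sketch.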
The program is run from the command line, with options for the file name, the number of simulations performed (default 10 000), and the P-value threshold for interesting results (default P_SR = 0.1). Three output files are produced. A summary file lists the bins showing the highest evidence for linkage in the weighted and unweighted analyses, and the standardized weights (Fig. 1). Other output files contain a full listing of results by chromosome bin and an output table of ranks, for data checking purposes. The program outputs summed ranks, but these can be converted to average ranks, if required (Levinson et al., 2003).
Conflict of Interest: none declared.
Received on August 11, 2005; revised on October 12, 2005; accepted on October 17, 2005
| REFERENCES |
|---|
Altmuller, J., et al. (2001) Genomewide scans of complex human diseases: true linkage is hard to find. Am. J. Hum. Genet., 69, 936–950.
Demenais, F., et al. (2003) A meta-analysis of four European genome screens (GIFT consortium) shows evidence for a novel region on chromosome 17p11.2–q22 linked to type 2 diabetes. Hum. Mol. Genet., 12, 1865–1873.
Fisher, S.A., et al. (2003) Meta-analysis of four rheumatoid arthritis genome-wide linkage studies: confirmation of a susceptibility locus on chromosome 16. Arthritis Rheum., 48, 1200–1206.
Koziol, J.A. and Feng, A.C. (2004) A note on the genome scan meta-analysis statistic. Ann. Hum. Genet., 68, 376–380.
Levinson, D.F., et al. (2003) Genome scan meta-analysis of schizophrenia and bipolar disorder, part I: methods and power analysis. Am. J. Hum. Genet., 73, 17–33.
Lewis, C.M., et al. (2003) Genome scan meta-analysis of schizophrenia and bipolar disorder, part II: schizophrenia. Am. J. Hum. Genet., 73, 34–48.
Wise, L.H., et al. (1999) Meta-analysis of genome searches. Ann. Hum. Genet., 63, 263–272.
Zintzaras, E. and Ioannidis, J.P.A. (2005a) HEGESMA: genome search meta-analysis and heterogeneity testing. Bioinformatics, 21, 2672–2673.
Zintzaras, E. and Ioannidis, J.P.A. (2005b) Heterogeneity testing in meta-analysis of genome searches. Genet. Epidemiol., 28, 123–137.