Bioinformatics Advance Access originally published online on September 14, 2007
Bioinformatics 2007 23(22):3105-3107; doi:10.1093/bioinformatics/btm458
IGG: A tool to integrate GeneChips for genetic studies
1Department of Biochemistry, 2Department of Psychiatry, 3Department of Medicine, 4The Centre for Reproduction, Development and Growth, 5Genome Research Center, The University of Hong Kong, Pokfulam, Hong Kong and 6Hunan Business College, Changsha, Hunan 410205, China
*To whom correspondence should be addressed.
ABSTRACT
Summary: To facilitate genetic studies using high-throughput genotyping technologies, we have developed an open-source tool to integrate genotype data across the Affymetrix and Illumina platforms. It can efficiently integrate large amounts of data from various GeneChips, add genotypes from the HapMap Project to a specific project, flexibly trim the integrated data and export them in the input formats of popular genetic analysis tools, and check the quality of the genotype data. Furthermore, the tool is easy to use through its user-friendly graphical interface and is independent of third-party databases. IGG has been successfully applied to a genome-wide linkage scan in a Charcot-Marie-Tooth disease pedigree by integrating three types of GeneChips and HapMap Project genotypes.
Availability: http://bioinfo.hku.hk/iggweb (version 0.9).
Contact: limx54{at}yahoo.com and songy{at}hku.hk
1 INTRODUCTION
High-throughput technologies for genotyping single nucleotide polymorphisms (SNPs) are advancing very rapidly. For instance, the Human Mapping GeneChip Array platform of Affymetrix has been updated from 10 K (kilo-SNP) to 100 K to 500 K and then to 1 M (mega-SNP) GeneChip Arrays within a few years. Its key commercial competitor, Illumina, has also promptly offered a series of genotyping arrays for genome-wide genetic studies. These encouraging developments have recently stimulated a great number of genome-wide genetic studies (Benjafield et al., 2005; Rioux et al., 2007; Sladek et al., 2007). As a result of this progress in genotyping technologies, the integration of large amounts of GeneChip data for genetic studies has become a very important issue.
The integration is of interest in at least two respects. One is the integration of genotyping data from various GeneChips in a long-term project, where it takes time to collect enough samples over several stages. In later stages, updated but compatible GeneChips with denser markers may well be used to finely localize interesting regions that have already been found. As a result, the project may hold genotypes from various types of chips, and these genotypes must be integrated for the project to make the most of its genotyping information. The other is the addition of available public resources (such as genotypes from the HapMap Project) to a specific project. This can enlarge the original sample size of a genetic study and thus increase its statistical power (Gibbs et al., 2005). In addition, a number of methods have recently been developed to impute genotypes across various GeneChip data and public resources (e.g. Burdick et al., 2006; Marchini et al., 2007). A powerful tool that can map various GeneChips onto a common set of markers for further imputation by these methods would be very helpful to this community. As far as we know, there is currently no tool or software package available for geneticists to flexibly integrate high-throughput genotypes for genetic analyses. This is also true of commercially available software packages such as GTYPE (Affymetrix GeneChip® Genotyping Analysis Software, http://www.affymetrix.com) and BeadStudio (Illumina, http://www.illumina.com/). Neither of them can integrate GeneChip data across platforms and from public resources for further genetic analyses.
In order to meet these requirements of integration and overcome limitations of available software packages, we have developed an open-source tool named Integration of Genotypes from GeneChips (IGG) to integrate genotype data from the same or different high-throughput platforms. It is equipped with a set of quality control functions for the integration as well as powerful exporting functions facilitating genome-wide linkage and/or association analyses.
2 FEATURES AND ALGORITHMS
2.1 Integration of genotype data from various chips
IGG can consistently and efficiently integrate large-scale genotype data from various Affymetrix and Illumina GeneChips. The integration poses two challenges. One is that the same SNP may carry different identifiers on different types of chips, even within the same platform. The other is that the genotype/allele designation systems differ across platforms; that is, the genotype call AA of a given SNP from different GeneChips may correspond to different polymorphisms. For the first problem, the reference SNP ID number (RS ID) of the dbSNP database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp) might seem an obvious identifier for all SNPs. However, we found that not all SNPs in the annotation files have an RS ID, and some RS IDs used in the files were even out of date. Therefore, we identify SNPs by their physical positions on the reference genome, because these positions are unique and consistent under the same genome build version. The second problem is solved by matching the flanking sequences of the SNPs. The flanking sequences and the alleles corresponding to the A and B calls can also be found in the annotation files. Together they determine whether the allele designations of different chips are consistent. If the flanking sequences and the alleles are all identical for a SNP, the genotypes (denoted by A and B) in the final output files are already consistent. If the flanking sequences are complementary but the alleles are not, the genotypes in one of the corresponding output files need to be reversed to make them consistent. The remaining two cases can be deduced similarly.
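The two steps described in this section, keying SNPs by physical position and comparing flanking sequences to reconcile A/B designations, can be illustrated with a minimal Python sketch. This is one possible reading of the rule, not IGG's actual code: the record layout, the function names (`reverse_complement`, `orientation`, `swap_call`, `shared_snps`) and the simplification that both chips report flanks of equal length with the SNP site written as `N` are all assumptions made for illustration.

```python
# Hypothetical sketch: data layout and names are illustrative, not IGG's API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string; symbols such as 'N' are left unchanged."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orientation(ref_flank, ref_alleles, other_flank, other_alleles):
    """Compare the A/B designation of one SNP on two chips.

    Returns 'same' if the A/B calls can be merged directly, 'swap' if the
    calls of the second chip must be exchanged (A <-> B), and 'unresolved'
    if neither the flanks nor the alleles can be matched.
    """
    a, b = (x.upper() for x in other_alleles)
    ref = tuple(x.upper() for x in ref_alleles)
    if other_flank.upper() == ref_flank.upper():
        pass  # same strand: alleles are directly comparable
    elif other_flank.upper() == reverse_complement(ref_flank):
        # opposite strand: express the second chip's alleles on the reference strand
        a, b = a.translate(COMPLEMENT), b.translate(COMPLEMENT)
    else:
        return "unresolved"
    if (a, b) == ref:
        return "same"
    if (b, a) == ref:
        return "swap"
    return "unresolved"


def swap_call(call):
    """Exchange the A and B symbols of a genotype call, e.g. 'AA' -> 'BB'."""
    return call.translate(str.maketrans("AB", "BA"))


def shared_snps(chip1, chip2):
    """Match SNPs by physical position.

    Each chip maps (chromosome, position) -> (flank, (allele_A, allele_B)),
    with positions taken from the same genome build.
    """
    result = {}
    for key in chip1.keys() & chip2.keys():
        flank1, alleles1 = chip1[key]
        flank2, alleles2 = chip2[key]
        result[key] = orientation(flank1, alleles1, flank2, alleles2)
    return result
```

In this sketch a SNP reported on the opposite strand with alleles (C, T) against a reference of (A, G) yields `swap`, so its calls would be passed through `swap_call` before merging. SNPs returned as `unresolved` would need manual inspection.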
2.2 Consolidation of HapMap genotypes
HapMap genotypes can easily be consolidated into an existing project in IGG. This function has several interesting applications. For instance, in genetic studies of low-prevalence diseases, the HapMap pedigrees or subjects can be added as controls or as subjects of unknown status (Gibbs et al., 2005). This can increase the sample size and thus the power of these studies. Another application is in studies involving haplotyping, where the added genotypes can increase the accuracy of haplotyping methods (Fallin and Schork, 2000). To integrate the HapMap genotypes, users first have to download the genotype files of interest from the HapMap website, http://www.hapmap.org/genotypes/?N=D. The downloaded data can then be imported into IGG through a graphical interface. During integration, users only need to determine which of the four HapMap samples (Caucasian, Chinese, Japanese and Yoruba) will be added to their own dataset; the system then automatically retrieves the genotypes of the SNPs on the GeneChips and compiles them. The allele designations of the HapMap genotypes are also unified with those from the GeneChips by matching the flanking sequences of the SNPs as described above.
2.3 Trimming of dataset
A function for trimming a dataset has been devised to speed up preliminary linkage scans. A whole-genome scan with hundreds of thousands of SNPs is time-consuming and often impractical on workstations. A pruned dataset of selected informative SNPs can be analyzed first, before a fine scan. IGG can select SNPs with maximum heterozygosity while controlling the intervals between them. The interval can be flexibly customized by users, as a centimorgan or base-pair distance, in a graphical interface.
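The text does not specify how IGG balances heterozygosity against marker spacing, so the following Python sketch shows only one plausible greedy strategy. It uses the standard expected heterozygosity of a biallelic SNP, 2p(1 - p), and accepts SNPs in order of decreasing heterozygosity as long as they lie at least `min_gap` away from every SNP already chosen. The function names and the tuple layout are assumptions for illustration.

```python
import bisect


def heterozygosity(freq_a):
    """Expected heterozygosity 2p(1 - p) of a biallelic SNP with A-allele frequency p."""
    return 2.0 * freq_a * (1.0 - freq_a)


def trim_snps(snps, min_gap):
    """Greedy marker selection on one chromosome (illustrative, not IGG's algorithm).

    snps: iterable of (name, position, freq_a); position and min_gap share one
    unit (base pairs or centimorgans). Returns selected names in map order.
    """
    chosen_pos = []   # positions of accepted SNPs, kept sorted
    chosen = []       # (position, name) of accepted SNPs
    ranked = sorted(snps, key=lambda s: heterozygosity(s[2]), reverse=True)
    for name, pos, _freq in ranked:
        i = bisect.bisect_left(chosen_pos, pos)
        left_ok = i == 0 or pos - chosen_pos[i - 1] >= min_gap
        right_ok = i == len(chosen_pos) or chosen_pos[i] - pos >= min_gap
        if left_ok and right_ok:
            chosen_pos.insert(i, pos)
            chosen.append((pos, name))
    return [name for _pos, name in sorted(chosen)]
```

With this strategy every pair of selected markers is separated by at least `min_gap`, at the cost of possibly leaving some regions without a marker when all candidates there are too close to better ones.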
2.4 Connection to available genetic analysis tools
IGG has a powerful export function to connect with genetic analysis tools. At present, it can export the integrated data in the input formats of a number of popular tools, such as Merlin, Gene Hunter, Solar, Plink, Phase, Linkage, SuperLink and Mega2. More tools will be supported in the future according to user feedback. The export is specified in a graphical dialog, where users can easily set the chromosome(s), region(s) or SNP(s), phenotypes, missing genotypes and reference populations for allele frequencies. Users can thus easily select regions or SNPs of interest for a quick analysis.
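As an illustration of what such an export involves, the sketch below writes one individual as a line of a LINKAGE-style pedigree file (family, individual, father, mother, sex, affection status, then two allele columns per marker, with 0 for a missing allele). The mapping of chip calls A and B to allele codes 1 and 2 and the function name `ped_line` are assumptions for this example; IGG's actual output conventions may differ.

```python
# Assumed recoding of chip calls to numeric allele codes (illustrative only).
ALLELE_CODE = {"A": "1", "B": "2"}


def ped_line(family, person, father, mother, sex, affection, calls):
    """Format one pedigree line.

    sex: 1 = male, 2 = female; affection: 0 = unknown, 1 = unaffected,
    2 = affected; calls: list of 'AA'/'AB'/'BB' strings, or None if missing.
    """
    fields = [family, person, father, mother, str(sex), str(affection)]
    for call in calls:
        if call is None:
            fields += ["0", "0"]          # missing genotype
        else:
            fields += [ALLELE_CODE[call[0]], ALLELE_CODE[call[1]]]
    return " ".join(fields)
```

For example, `ped_line("F1", "3", "1", "2", 1, 2, ["AB", None, "BB"])` describes an affected son of individuals 1 and 2 with one missing genotype.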
2.5 Simplification of usage
We tried to make IGG easy to use for most genetic investigators. It has three features in this respect. First, the whole package is coded in Java, a programming language known for its cross-platform capabilities, which avoids the frequent trouble of compiling code. Second, a user-friendly graphical interface is provided to facilitate operation. Finally, it handles large-scale data without depending on any third-party database software such as MySQL (http://www.mysql.com/). According to our previous experience in software development (Zhao et al., 2005), a third-party database can considerably increase the difficulty of software installation, because most genetic investigators are not familiar with the configuration of professional databases. Excluding professional databases, however, means that IGG itself must implement the optimization procedures for processing large amounts of data that such databases already provide.
2.6 Quality control
IGG has some basic functions for quality control. It can report SNPs whose allele frequencies in the loaded dataset differ markedly from those in the annotation files. An important reason for such frequency discrepancies is genotyping error. IGG can also check the consistency of Mendelian inheritance within families; SNPs with inconsistent Mendelian inheritance also point to possible genotyping errors. In addition, if a subject has more than one genotype call for the same SNP, IGG can detect conflicting genotypes at that SNP. Users can easily pick out all of these problematic SNPs from IGG's output to double-check their genotypes in the raw datasets.
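The three checks described above can be sketched in a few lines of Python. The sketch is illustrative only: the function names, the data layout and the frequency threshold `max_diff` are assumptions (the text does not state how large a discrepancy IGG reports), and the Mendelian check covers autosomal trios only.

```python
from collections import defaultdict


def frequency_outliers(observed, annotated, max_diff=0.2):
    """SNPs whose observed A-allele frequency differs from the annotated one.

    observed, annotated: dicts mapping SNP id -> frequency.
    max_diff is an arbitrary illustrative threshold.
    """
    return sorted(snp for snp, freq in observed.items()
                  if snp in annotated and abs(freq - annotated[snp]) > max_diff)


def mendel_consistent(father, mother, child):
    """True if the child's genotype can arise from one allele of each parent.

    Genotypes are two-letter strings over 'A' and 'B'; None means missing,
    which is treated as uninformative. Autosomal SNPs only.
    """
    if None in (father, mother, child):
        return True
    return any(sorted(f + m) == sorted(child) for f in father for m in mother)


def conflicting_calls(records):
    """(subject, SNP) pairs that received more than one distinct genotype call.

    records: iterable of (subject, snp, call); 'AB' and 'BA' count as equal.
    """
    seen = defaultdict(set)
    for subject, snp, call in records:
        seen[(subject, snp)].add("".join(sorted(call)))
    return sorted(key for key, calls in seen.items() if len(calls) > 1)
```

A SNP flagged by any of these functions would be a candidate for re-inspection in the raw chip output rather than automatic exclusion.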
3 IMPLEMENTATION
IGG has been professionally implemented and plays an important role in the genome-wide scan studies of our group. We conducted a careful requirements analysis to meet users' expectations. An object-oriented design was employed to make the tool easier to extend in the future, and an extensive testing plan was carried out to minimize potential bugs. To use IGG, users only need to prepare two kinds of input files: the pedigree/subject file and the GeneChip output files. The format of the former is a classic one for genetic analysis, and that of the latter is the default export format of the Affymetrix GTYPE and Illumina BeadStudio software. Interested readers can find more details on the input files in the user manual, which also includes guides and examples for large datasets.
The first successful application of IGG was a genome-wide linkage study in a Charcot-Marie-Tooth disease pedigree by our group. One and a half years ago, a maximum LOD score of 2.4 was obtained in this pedigree based on 9 Illumina Human Linkage IVb Panel chips. Half a year later, 3 Affymetrix 50K Xba240 chips were added to this pedigree. Recently, we added 8 more Affymetrix 250 K Nsp chips in the hope of increasing the LOD scores and narrowing down the regions of interest with these dense markers. IGG was used to integrate the genotypes of these chips, to generate input files for Merlin (Abecasis et al., 2002), to trim the whole dataset at a given interval for a quick preliminary analysis, to split the genome data into prioritized regions for further study with dense markers, and to incorporate HapMap genotypes for modeling marker-marker linkage disequilibrium in multipoint linkage analysis (Abecasis and Wigginton, 2005). The integration enables the linkage tool to use genotypes from various types of GeneChips jointly rather than separately. A region on chromosome X with a maximum LOD score of 3.0 was identified in the final integrated dataset.
ACKNOWLEDGEMENTS
We gratefully acknowledge the BIOSUPPORT project (http://www.bioinfo.hku.hk), the Computer Centre. This work was supported by a grant from the Research Grant Council of Hong Kong (HKU7496/04M, YQS).
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Martin Bishop
Received on June 22, 2007; revised on August 10, 2007; accepted on September 3, 2007
REFERENCES
Abecasis GR, et al. Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. (2002) 30:97–101.
Abecasis GR, Wigginton JE. Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. Am. J. Hum. Genet. (2005) 77:754–767.
Benjafield AV, et al. Genome-wide scan for hypertension in Sydney Sibships: the GENIHUSS study. Am. J. Hypertens. (2005) 18:828–832.
Burdick JT, et al. In silico method for inferring genotypes in pedigrees. Nat. Genet. (2006) 38:1002–1004.
Fallin D, Schork NJ. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet. (2000) 67:947–959.
Gibbs RA, et al. A haplotype map of the human genome. Nature (2005) 437:1299–1320.
Marchini J, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. (2007) 39:906–913.
Rioux JD, et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat. Genet. (2007) 39:596–604.
Sladek R, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature (2007) 445:881–885.
Zhao LJ, et al. SNPP: automating large-scale SNP genotype data management. Bioinformatics (2005) 21:266–268.