Bioinformatics Advance Access originally published online on November 15, 2005
Bioinformatics 2006 22(3):371-373; doi:10.1093/bioinformatics/bti785
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2SNP: scalable phasing based on 2-SNP haplotypes
Department of Computer Science, Georgia State University Atlanta, GA 30303, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: 2SNP software package implements a new very fast scalable algorithm for haplotype inference based on genotype statistics collected only for pairs of SNPs. This software can be used for comparatively accurate phasing of large number of long genome sequences, e.g. obtained from DNA arrays. As an input 2SNP takes genotype matrix and outputs the corresponding haplotype matrix. On datasets across 79 regions from HapMap 2SNP is several orders of magnitude faster than GERBIL and PHASE while matching them in quality measured by the number of correctly phased genotypes, single-site and switching errors. For example, 2SNP requires 41 s on Pentium 4 2 Ghz processor to phase 30 genotypes with 1381 SNPs (ENm010.7p15:2 data from HapMap) versus GERBIL and PHASE requiring more than a week and admitting no less errors than 2SNP.
Availability: 2SNP software package is publicly available at http://alla.cs.gsu.edu/~software/2SNP
Contact: alexz{at}cs.gsu.edu
| INTRODUCTION |
|---|
|
|
|---|
The difference between individual DNA sequences mostly occurs at a single-base site, in which more than one nucleic acid or gap is observed across the population. Such variations are called single nucleotide polymorphisms (SNPs). The number of sufficiently frequent SNPs in the human population is estimated to be around 10 million (Kruglyak and Nickerson, 2001). For complex diseases caused by more than a single gene it is important to identify a set of alleles inherited together. Identification of haplotypes, the sequences of alleles in contiguous SNP sites along a chromosomal region, is a central challenge of the International HapMap project (HapMap, 2003, http://www.hapmap.org). The number of simultaneously typed SNPs for association and linkage studies is reaching 250 000 for SNP Mapping Arrays (Affymetrix, 2005, http://www.affymetrix.com/products/arrays/). Although it is not clear how many SNPs will be phased simultaneously, processing thousands of SNPs for thousands of individuals will be soon required that brings more attention to phasing scalability and runtime.
Diploid organisms, like human, have two near-identical copies of each chromosome. Most experimental techniques for determining SNPs do not provide the haplotype information separately for each of the two chromosomes. Instead, they generate for each site an unordered pair of allele readings, one from each copy of the chromosome,which is called a genotype.
Computational inferring of haplotypes from the genotypes (or phasing) has been initiated by Clark (1990) who proposed a parsimony-based approach. It has been shown later that the likelihood-based expectation-maximization (EM) approach is more accurate (Niu, 2004). Markov chain Bayesian haplotype reconstruction methods have been proposed by Stephens et al. (2001) (PHASE) and Niu et al. (2002) (HAPLOTYPER). A combinatorial model based on the perfect phylogeny tree assumption was suggested by Gusfield (2003). By using perfect phylogeny model and block structure Halperin and Eskin (2004) (HAP) showed good performance on real genotypes with low error rates. Recently, Kimmel and Shamir (2005) (GERBIL) combined block identification and phasing steps.
Here, we present a 2SNP software package implementing a new phasing algorithm, which is very fast and scalable. Haplotypes for each genotype are inferred based on the maximum spanning tree of a graph with vertices corresponding to heterozygous sites of a genotype and edge weights representing the certainty in cis- or trans-phasing of 2-SNP genotypes.
The runtime of the implemented algorithm is O(nm(n + m)), where n and m are the number of genotypes and SNPs, respectively. It is several orders of magnitude faster than EM algorithms and solves instances of the phasing problem that are much larger than can be solved by existing software tools. For example, an instance consisting of 30 genotypes each with 1381 SNPs taken from HapMap ENCODE project is phased in 41 s on Pentium 4 2 GHz processor while GERBIL or PHASE need weeks while being no more accurate. In tests on real datasets across 79 different genomic regions 2SNP was also matching the quality of GERBIL and PHASE. We also found that HAPLOTYPER is almost as fast as 2SNP and close in quality, but currently does not handle populations with more than 300 SNPs.
| ALGORITHM AND IMPLEMENTATION |
|---|
|
|
|---|
As an input 2SNP takes genotype vectors with coordinates corresponding to SNPs. SNP values belong to {0, 1, 2, ?}, where 0's and 1's denote homozygous sites with major allele and minor allele, respectively; 2's stand for heterozygous sites, and ?'s denote missed SNP values. Phasing replaces each genotype vector by two haplotype vectors with SNP values in {0, 1} such that any genotype 0-SNP (resp. 1-SNP or 2-SNP) is replaced with two haplotype 0-SNPs (resp. two 1-SNPs or 0-SNP and 1-SNP). A 2-SNP genotype 22 can be cis-phased, i.e. represented as 00 and 11 haplotypes, or trans-phased, i.e. represented as 01 and 10 haplotypes. If for any pair of 2's one knows if they are cis- or trans-phased, then the entire phasing is known. Missing data (?'s) are recovered after phasing of 2's in time O(n2m). For each haplotype h we find the closest (w.r.t. Hamming distance) haplotype(s) h' and recover ?'s in h with the corresponding values from h'.
For each genotype g, 2SNP constructs a genotype graph, which is a complete graph with vertices corresponding to 2's (i.e. heterozygous sites) of g. The weight of the edge between heterozygous sites i and j represents the certainty in that i and j are cis- or trans-phased. The maximum spanning tree of the genotype graph uniquely determines the phasing of the corresponding genotype since it gives cis-/trans- phasing for any two 2's.
Note that Kimmel and Shamir (2005) have applied the same construction for preliminary estimation of haplotype frequencies rather than phasing per se. Therefore, for the edge weight, they have chosen LD-based formula over probabilities of full (i-j)-haplotypes given by maximum-likelihood solution.
Instead, edge weights in 2SNP do not account for SNPs between i and j. We propose to use the logarithm of linkage disequilibrium for the cis/trans odds ratio divided by squared distance between corresponding SNPs as a weight of edge between SNPs i and j
![]() |
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
We compare our 2SNP method with PHASE-2.1.1 and GERBIL on offspring genotypes in family triosthe computationally inferred haplotypes have been compared with haplotypes inferred from parental genotypes.
Chromosome 5q31: 129 genotypes with 103 SNPs derived from the 616 KB region of human Chromosome 5q31 (Daly et al., 2001).
Yoruba population (D): 30 genotypes with SNPs from 51 various genomic regions, with number of SNPs per region ranging from 13 to 114 (Gabriel et al., 2002).
Random matching 5q31: 128 genotypes each with 89 SNPs from 5q31 cytokine gene cluster generated by random matching from 64 haplotypes of 32 West African reported by Hull et al. (2004).
HapMap datasets: 30 genotypes of Utah residents and Yoruba residents available on HapMap by May 2005. The number of SNPs varies from 52 to 1381 across 40 regions including ENm010, ENm013, ENr112, ENr113 and ENr123 spanning 500 KB regions of chromosome bands 7p15:2, 7q21:13, 2p16:3, 4q26 and 12q12, respectively, and two regions spanning the gene STEAP and TRPM8 plus 10 KB upstream and downstream.
We measure the following three phasing errors. A single-site error is the percent of erroneous SNPs among all SNPs in phased haplotypes. An individual error is the percent of genotypes phased with at least one error among all genotypes. A switching error is the percent of switches (among all possible switches) between inferred haplotypes necessary to obtain a true haplotype.
Table 1 shows that 2SNP is several orders of magnitude faster than two other phasing methods handling large datasets in a matter of seconds. The reported mean errors with the respective 95% confidence intervals show that all three phasing methods have the same accuracy for real data (Chromosome 5q31, Yoruba(D), HapMap datasets). On the other hand, 2SNP and GERBIL are considerably outperformed by PHASE and HAPLOTYPER (not shown) on simulated data (Random matching 5q31). Poor performance of 2SNP can be caused by the absence of deviation from the HardyWeinberg equilibrium observed on real data.
|
In conclusion, we have presented 2SNP phasing software package which implements a new extremely fast and simultaneously highly accurate phasing algorithm based on 2-SNP haplotypes. We hope that it will be very useful for high-throughput genotype data processing, e.g. SNP Mapping Arrays (Affymetrix, 2005, http://www.affymetrix.com/products/arrays/). We are going to extend our method by applying 3-SNP haplotype analysis.
| Acknowledgments |
|---|
D.B. was partially supported by Georgia State University Molecular Basis of Disease Fellowship. A.Z. was partially supported by NIH Award 1 P20 GM065762-01A1.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Frank Dudbridge
Received on October 3, 2005; revised on November 10, 2005; accepted on November 14, 2005
| REFERENCES |
|---|
|
|
|---|
Clark, A. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol, . 7, 111122[Abstract].
Daly, M., et al. (2001) High resolution haplotype structure in the human genome. Nat. Genet, . 29, 229232[CrossRef][ISI][Medline].
Gabriel, G., et al. (2002) The structure of haplotype blocks in the human genome. Science, 296, 22252229
Gusfield, D. (2003) Haplotype inference by pure parsimony. 14th Symposium on Combinatorial Pattern Matching , pp. 144155 Springer lecture Notes in ComputerScience, No. 2676.
Halperin, E. and Eskin, E. (2004) Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics, 20, 18421849
International HapMap Consortium. (2003) The international HapMap project. Nature, 426, 789796 http://www.hapmap.org[CrossRef][Medline].
Hull, J., et al. (2004) Haplotype mapping of the bronchiolitis susceptibility locus near IL8. Am. J. Hum. Genet, . 114, 272279.
Kimmel, G. and Shamir, R. (2005) GERBIL: genotype resolution and block identification using likelihood. Proc. Natl Acad. Sci. USA, 102, 158162
Kruglyak, L. and Nickerson, D.A. (2001) Variation is the spice of life. Nat. Genet, . 27, 234236[CrossRef][ISI][Medline].
Niu, T., et al. (2002) Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet, . 70, 157169[CrossRef][ISI][Medline].
Niu, T. (2004) Algorithms for inferring haplotypes. Genet. Epidemiol, . 27, 334347[CrossRef][ISI][Medline].
Stephens, M., et al. (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet, . 68, 978989[CrossRef][ISI][Medline].
This article has been cited by other articles:
![]() |
M. Niens, R. F. Jarrett, B. Hepkema, I. M. Nolte, A. Diepstra, M. Platteel, N. Kouprie, C. P. Delury, A. Gallagher, L. Visser, et al. HLA-A*02 is associated with a reduced risk and HLA-A*01 with an increased risk of developing EBV+ Hodgkin lymphoma Blood, November 1, 2007; 110(9): 3310 - 3315. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

