Skip Navigation


Bioinformatics Advance Access originally published online on August 7, 2006
Bioinformatics 2006 22(20):2558-2561; doi:10.1093/bioinformatics/btl420
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/20/2558    most recent
btl420v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by He, J.
Right arrow Articles by Zelikovsky, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by He, J.
Right arrow Articles by Zelikovsky, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

MLR-tagging: informative SNP selection for unphased genotypes based on multiple linear regression

Jingwu He and Alexander Zelikovsky *

Department of Computer Science, Georgia State University Atlanta, GA 30303, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 ALGORITHM AND IMPLEMENTATION
 RESULTS AND DISCUSSION
 REFERENCES
 

Summary: The search for the association between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs accurately representing the rest of the SNPs. Informative SNP selection can achieve (1) considerable budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs or (2) necessary reduction of the huge SNP sets (obtained, e.g. from Affymetrix) for further fine haplotype analysis.

A novel informative SNP selection method for unphased genotype data based on multiple linear regression (MLR) is implemented in the software package MLR-tagging. This software can be used for informative SNP (tag) selection and genotype prediction. The stepwise tag selection algorithm (STSA) selects positions of the given number of informative SNPs based on a genotype sample population. The MLR SNP prediction algorithm predicts a complete genotype based on the values of its informative SNPs, their positions among all SNPs, and a sample of complete genotypes.

An extensive experimental study on various datasets including 10 regions from HapMap shows that the MLR prediction combined with stepwise tag selection uses fewer tags than the state-of-the-art method of Halperin et al. (2005).

Availability: MLR-Tagging software package is publicly available at http://alla.cs.gsu.edu/~software/tagging/tagging.html

Contact: alexz{at}cs.gsu.edu


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 ALGORITHM AND IMPLEMENTATION
 RESULTS AND DISCUSSION
 REFERENCES
 
The search for the association between complex diseases and single nucleotide polymorphisms (SNPs) has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs accurately representing the rest of the SNPs.

First, informative SNPs can be used for selective SNP typing and computationally inferring all non-typed SNPs thus achieving considerable budget savings. Traditionally, such SNPs are called tags and the selection procedure is referred as haplotype tagging since SNPs are parts of existing chromosomes (Zhang et al., 2004a). The decision which SNPs should be typed (also referred as tag SNPs) and which should be inferred is based on how well non-typed SNPs can be predicted from typed SNPs.

Second, informative SNPs can be used for compaction of unphased genotype data. Indeed, recent successes in high-throughput genotyping technologies (e.g. Affimetrix Map Arrays, http://www.affymetrics.com/products/arrays) drastically increase the length of available SNP sequences and they should be compacted to be feasible for fine genotype analysis. In this application, the selected informative SNPs are referred as index SNPs (Zhang, et al., 2004b). In order to avoid information loss, index SNPs are chosen based on how well the other non-index SNPs can be reconstructed.

Human SNPs belong to two near-identical copies of each chromosome (haplotypes). Most experimental techniques for determining SNPs generate for each site an unordered pair of allele readings, one from each haplotype, which is called a genotype. Phasing, or splitting a genotype into two haplotypes is usually inferred computationally (Niu, 2004). The selection of tag or index SNPs can be done either before or after phasing—MLR-tagging package selects tags before phasing. For both applications (either for tag- or index-SNP selection), the corresponding problem can be formulated as follows:

Informative SNP Selection problem (ISSP)
Given a sample S of a population P of individuals (either haplotypes or genotypes) on m SNPs, select positions of k(k < m) SNPs such that for any individual, one can predict non-selected SNPs from these k selected SNPs. MLR-tagging software solves the optimization version of ISSP which asks for k informative SNPs minimizing the prediction error measured by the number of incorrectly predicted SNPs. If MLR-tagging is used for tag selection, then the number of tags k depends on genotyping budget. If MLR-tagging is used for indexing, then the number of tags k depends on the desirable data size. Further, for brevity, we refer to selected informative SNPs as tags.

The tags are selected based on the sample population with intention to derive conclusions about the entire population. Statistical analysis may ensure that a high prediction quality of non-tag SNPs is not a coincidence. If certain SNPs are highly correlated (i.e. in linkage disequilibrium) in the sample, then we would expect that this correlation will be observed in the entire population. Therefore, it would be highly desirable that the tags contributing to non-tag SNP prediction will correlate with the predicted SNP. Originally, haplotype tags have been selected based on the squared correlation R2 between true and predicted SNPs in Chapman et al. (2003) and true and predicted halotype dosage in (Stram et al., (2003). Since linkage disequilibrium is usually higher for closer SNPs, Zhang et al., (2004a) partition SNPs into blocks based on limited haplotype variability and then select tags in each block separately thus ensuring high correlations between tags and predicted SNPs. Halperin et al. (2005) choose tags for the entire SNP sequence rather than blocks but predict each non-tag SNP based on two closest tags on both sides assuming that these tags will correlate with the predicted SNPs since they may be not far apart. The linear reduction (LR) method (He et al., 2005) is based on Gauss–Jordan elimination that is used to predict non-tag SNP by rounding fractional linear combination over tag SNPs.

Here we present an MLR-tagging software package for solving the ISSP on genotypes. It implements a new SNP prediction method based on multiple linear regression analysis. The proposed method directly predicts genotypes without the explicit requirement of haplotypes. There is a significant difference between LR and MLR-tagging methods: (1) LR predicts SNPs by rounding fractional linear combinations while MLR chooses the value that have best fit to the linear regression model and (2) LR selects tags without taking in account prediction accuracy of the selected tags while MLR-tagging uses the stepwise tag selection iteratively to find the best tag set expansion according to a chosen prediction quality measure.

The current version of MLR-tagging assumes that the data are complete. Before using our application, the missing data should be imputed; see Huang et al. (2004) for a relevant approach. We plan to use MLR SNP prediction for computational inference of missing genotype data. Below we give details of the implemented algorithms and results of the experimental studies on various datasets (including 10 regions from HapMap) comparing MLR-tagging with STAMPA (Halperin et al., 2005) and LR (He et al., 2005).


    ALGORITHM AND IMPLEMENTATION
 TOP
 ABSTRACT
 INTRODUCTION
 ALGORITHM AND IMPLEMENTATION
 RESULTS AND DISCUSSION
 REFERENCES
 
The general purpose of multiple linear regression is to learn the relationship between several independent variables and a response variable. The multiple linear regression model is given by

Formula
where y is the response variable (represented by a column with n coordinates (k ≤ n–1), xi, i = 1, ... , k are independent variables (columns), ßi, i = 1, ... k are regression coefficients, and {varepsilon} (a column) is the model error. The regression coefficient ßi represents the independent contribution of the independent variable xi to the prediction of y. The MLR method computes bi, i =1, ... , k to estimate unknown true coefficients ßi, i = 1, ... , k to minimize the error ||{varepsilon}|| using the least squares method. Geometrically speaking, in the estimation space span (X), which is the linear closure of vectors xi, i = 1, ... , k, we find the vector Formula = b0 + b1x1 + b2x2 + ··· +bkxk = Xb estimating y. The vector Formula minimizing distance (error) Formula is the projection of y on span (X) and equals Formula. Given the values of independent variables Formula, the MLR method can predict (estimate) the corresponding response variable y* with Formula.

SNP prediction
In SNP prediction, y is a non-tag SNP and xi, i = 1, ... , k are tags. Given the known tag values x* in an individual, we should predict the non-tag SNP value y*. There are three possible values for each SNP (–1, 0, 1), corresponding to homozygous major allele, heterozygous allele and homozygous minor allele. Note that rather than encode SNP with more common notations (0, 2, 1), we use (–1, 0, 1)-notation, called sigma-encoding. An obvious way to predict y* is to round expression (3) for Formula. Instead MLR SNP prediction algorithm finds the value of (–1, 0 or 1) that better fits the MLR model (1), i.e. minimizes the error ||{varepsilon}||.

Formally, let T be the (n + 1) x k matrix consisting of n + 1 rows corresponding to a tag-restricted genotypeFormula and n sample genotypes Formulafrom X,Formula, whose k coordinates correspond to k tag SNPs. The SNP s, a non-tag SNP, is represented by a (n + 1)-column with known values Formula, for genotypes from X and the unknown value y* for the genotype g which should be predicted. Let d = ||{varepsilon}|| be the least square distance between s and T, i.e. Formula. Our algorithm finds the value (–1, 0 or 1) for y* and selects one minimizing d.

Tag selection
STSA starts with the best tag t0, i.e. the SNP minimizing the error when predicting all other SNPs. Then STSA finds tag t1 which minimizes the prediction error of the tag set (t0, t1) as follows. Each non-tag SNP is added to the original tag set, then other SNPs are predicted from the extended set using MLR SNP prediction. The SNP that achieves the highest prediction accuracy will be t1. Best tags are added until reaching the specified size k.

Running time
Computing of Tt·T is O(nk2) since T is a nxk matrix and Tt is a k x n matrix. For inverting the kxk matrix Tt·T, we need O(k3) steps. Let k < n, then the running time for computing Formula is O(n2k). The matrix of T' is the same for all these (mk) non-tag SNPs, thus, the total running time for predicting a complete individual is Formula. If k ≥ n, then only Formula closest tags to the right and to the left of the predicted SNP are used. There are only kn + 1 different matrices T' to compute and the total running time is O(n3m). The MLR SNP prediction need knm steps for prediction, thus, the total runtime of MLR-tagging is Formulawhen k < n and O(kn4m2) when k ≥ n.


    RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 ALGORITHM AND IMPLEMENTATION
 RESULTS AND DISCUSSION
 REFERENCES
 
The following genotype datasets are used to measure the quality of our SNP prediction and informative SNP selection algorithms as well as comparison with the results of Halperin et al. (2005). We use GERBIL algorithms (Kimmel and Shamir, 2004) for resolving missing data. The SNPs with only one allele are removed from the original data.

Seven ENCODE regions from HapMap Regions ENr123 and ENm010 from 2 population: 45 Han Chinese from Beijing (HCB) and 44 Japanese from Tokyo (JPT) for three regions (ENm013, ENr112, ENr113) from 30 CEPH family trios obtained from HapMap ENCODE Project HapMap (2003, http://www.hapmap.org).

Two gene regions from HapMap Two gene regions STEAP and TRPM8 from 30 CEPH family trios were obtained from HapMap.

Chromosome 5q31 The data set collected by Daly et al. (2001) was derived from the 616 kb region of human Chromosome 5q31 from 129 family trios.

We report the prediction accuracy and the squared correlation R2 between predicted and original non-tag SNP values. The prediction accuracy is measured as the percentage of correctly predicted SNP values on non-tag SNPs. Table 1 reports the prediction accuracy and the average and the minimum correlation R2 for all non-tag SNPs.


View this table:
[in this window]
[in a new window]

 
Table 1 The quality of SNP prediction from the given number of tags (5–15%) of the total number of SNPs (in parentheses)

 
Alternatively, we apply leave-one-out cross-validation to evaluate the quality of the MLR-tagging solution for the Genotype Tagging Problem as follows: (1) one by one, each genotype vector is removed from the sample, (2) tag SNPs are selected using only the remaining genotypes and (3) the ‘left out’ genotype is reconstructed based on its tag SNPs and the values of tag and non-tag SNPs in the remaining genotypes. In Table 2, we compare MLR with STAMPA and LR. Note that if one predicts each SNP as 0 (i.e. homozygous with major allele), then the prediction accuracy on STEAP, TRPM8 and 5q31 data will be 79.36, 72.53 and 63.57%, respectively. MLR first predicts each SNP as 0 and then gets even higher prediction accuracy when it uses a single tag while STAMPA requires at least two tags for prediction. STAMPA is asymptotically faster but MLR is more accurate compared on four HapMap datasets (Table 3).


View this table:
[in this window]
[in a new window]

 
Table 2 Number of tags used by MLR-tagging, STAMPA and LR to achieve 80 and 90% prediction accuracy in leave-one-out tests

 


View this table:
[in this window]
[in a new window]

 
Table 3 The comparison of MLRs and STAMPAs prediction accuracy and running time using the number of tags (2, 5, 10, 15, 20, 25) on region ENr123 (A) and ENm010 (B) from 2 populations: CHB and JPT

 
According to the regression model (1), the tags which are more correlated with the predicted SNP have larger regression coefficients and, therefore, will contribute more to predicting the SNP. For example, for 7 ENCODE regions Hapmap (2003) and k = 10 tags, the tag with the largest regression coefficient ({approx}0.82 on average) has an average correlation 0.61 with the predicted SNP, the tag with the second largest regression coefficient has average correlation 0.28 and so on. Averaged over all considered real datasets, the correlation between regression coefficients and tag-to-non-tag correlations is 0.96. When MLR-tagging is applied to data containing both high- and low-LD regions, the high-LD region always have small number of tags since tags in the low-LD region do not correlate with SNPs in the high-LD region and, therefore, do not contribute to high-LD SNP prediction. For ENm010JRT dataset containing 11 SNPs in the high-LD region and 94 SNPs in the low-LD region, only 2 tags are chosen in the high-LD region out of total k = 59 tags.

In conclusion, we have presented the MLR-tagging software package that can be useful in selecting informative SNPs and reducing the genotyping cost.


    Acknowledgments
 
J.H. was supported by Georgia State University Molecular Basis of Disease Fellowship. A.Z. was partially supported by NIH Award 1 P20 GM065762-01A1 and NSF Award #0429735.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on December 7, 2005; revised on July 12, 2006; accepted on July 28, 2006

    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 ALGORITHM AND IMPLEMENTATION
 RESULTS AND DISCUSSION
 REFERENCES
 

    Chapman, J.M., et al. (2003) Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Human Heredity, 56, 18–31[CrossRef][ISI][Medline].

    Daly, M., et al. (2001) High resolution haplotype structure in the human genome. Nat. Genet, . 29, 229–232[CrossRef][ISI][Medline].

    Kimmel, G. and Shamir, R. (2004) GERBIL: genotype resolution and block identification using likelihood. Proc. Natl Acad. Sci. USA, 102, 158–162.

    Halperin, E., et al. (2005) Tag SNP selection in genotype data for maximizing SNP prediciton accuracy. Bioinformatics, 21, i195–203[Abstract].

    International HapMap Consortium. (2003) The International HapMap Project. Nature, 426, 789–796[CrossRef][Medline].

    He, J., et al. (2005) Linear reduction method for predictive and informative tag SNP selection. Int. J. Bioinform. Res. Appl, . 3, 249–260.

    Huang, Y.H., et al. (2004) Approximation algorithms for the selection of robust tag SNPs’, Proc. Workshop Algorith. Bioinform, . 3240, 278–289.

    Niu, T. (2004) Algorithms for inferring haplotypes. Genetic Epidemiol, . 4, 334–47.

    Stram, D., et al. (2003) Choosing haplotype-tagging SNPs based on unphased genotype data using as preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Human Heredity, 55, 27–36[CrossRef][ISI][Medline].

    Zhang, K., et al. (2004a) Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Res, . 14, 908–916[Abstract/Free Full Text].

    Zhang, P., et al. (2004b) A double classification tree search algorithm for index SNP selection. BMC Bioinformatics, 5, 89–95[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
Y. Saeys, I. Inza, and P. Larranaga
A review of feature selection techniques in bioinformatics
Bioinformatics, October 1, 2007; 23(19): 2507 - 2517.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/20/2558    most recent
btl420v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by He, J.
Right arrow Articles by Zelikovsky, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by He, J.
Right arrow Articles by Zelikovsky, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?