Bioinformatics Advance Access originally published online on December 20, 2005
Bioinformatics 2006 22(5):541-549; doi:10.1093/bioinformatics/btk011
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Accurate prediction of HIV-1 drug response from the reverse transcriptase and protease amino acid sequences using sparse models created by convex optimization
1Gene Security Network Palo Alto, CA, USA
2Department of Engineering, Stanford University Palo Alto, CA, USA
3Northwestern University School of Medicine Chicago, IL, USA
4Department of Microbiology and Immunology, Stanford University Medical Center Palo Alto, CA, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Genotypephenotype modeling problems are often overcomplete, or ill-posed, since the number of potential predictorsgenes, proteins, mutations and their interactionsis large relative to the number of measured outcomes. Such datasets can still be used to train sparse parameter models that generalize accurately, by exerting a principle similar to Occam's Razor: When many possible theories can explain the observations, the most simple is most likely to be correct. We apply this philosophy to modeling the drug response of Type-1 Human Immunodeficiency Virus (HIV-1). Owing to the decreasing expense of genetic sequencing relative to in vitro phenotype testing, a statistical model that reliably predicts viral drug response from genetic data is an important tool in the selection of antiretroviral therapy (ART). The optimization techniques described will have application to many genotypephenotype modeling problems for the purpose of enhancing clinical decisions.
Results: We describe two regression techniques for predicting viral phenotype in response to ART from genetic sequence data. Both techniques employ convex optimization for the continuous subset selection of a sparse set of model parameters. The first technique, the least absolute shrinkage and selection operator, uses the l1 norm loss function to create a sparse linear model; the second, the support vector machine with radial basis kernel functions, uses the
-insensitive loss function to create a sparse non-linear model. The techniques are applied to predict the response of the HIV-1 virus to 10 reverse transcriptase inhibitor and 7 protease inhibitor drugs. The genetic data are derived from the HIV coding sequences for the reverse transcriptase and protease enzymes. When tested by cross-validation with actual laboratory measurements, these models predict drug response phenotype more accurately than models previously discussed in the literature, and other canonical techniques described here. Key features of the methods that enable this performance are the tendency to generate simple models where many of the parameters are zero, and the convexity of the cost function, which assures that we can find model parameters to globally minimize the cost function for a particular training dataset.
Availability: Results, tables and figures are available at ftp://ftp.genesecurity.net
Contact: mrabinowitz{at}genesecurity.net
Supplementary information: An Appendix to accompany this article is available at Bioinformatics online.
| INTRODUCTION |
|---|
|
|
|---|
As of today, approved antiretroviral therapy (ART) drugs consist of a list of 11 RTIs: 7 nucleoside, 1 nucleotide and 3 non-nucleoside; 7 protease inhibitors (PIs) and 1 fusion/entry inhibitor (Carpenter et al., 2000; WHO, 2003, http://www.who.int/hiv/pub/prev_care/en/arvrevision2003en.pdf). Given the current rollout of ART drugs around the world (UNAIDS, 2004), the appearance of resistant strains of the virus is inevitable, both owing to the low genetic barrier to resistance (D'Aquila et al., 2003; Fessel et al., 2003; Kuritzkes et al., 1996; Molla et al., 1996; Rhee et al., 2003; Shafer, 2004; Zolopa et al., 1999) and to poor drug adherence (Sherr, 2000). Consequently, techniques to predict how mutated viruses will respond to anti-retroviral therapy are increasingly important as they will influence the outcome for salvage therapies. The rapidly decreasing cost of viral genetic sequencingwith volume pricing as low as $5 for pre-prepared sequences (Macrogen, 2005, http://www.macrogen.com/eng/sequencing/sequence_main.jsp)makes the selection of drugs based on viral genetic sequence data an attractive option than the more costly and technically involved in vitro phenotype measurement (Meynard et al., 2002; Shafer, 2002). The use of sequence data, however, necessitates accurate predictions of viral drug response based on the appearance of viral genetic mutations. The many different combinations of viral mutations make it difficult to design a model that includes all the genetic cofactors and their interactions, and to train the model with limited data (Kijak et al., 2003; Ravela et al., 2003). The latter problem is exacerbated in the context of modeling in vivo drug response (De Luca et al., 2003), where the many different combinations of drug regimens make it difficult to collect sufficiently large datasets for any particular regimen that contains the necessary variables, namely baseline clinical status, treatment history, clinical outcome and genetic sequence. We focus here on the in vitro scenario, but the techniques discussed are also applicable to the more complex in vivo scenario.1
Resistance to antiviral drugs can be the result of one mutation within the RT or protease sequences, or the combination of multiple mutations (DHHS, 2004, http://aidsinfo.nih.gov/Guidelines/GuidelineDetail.aspx?Menultem=Guidelines&Search=Off&GuidelineID=7&ClassID=1). The RT enzyme is coded by a key set of 560 codons; the protease enzyme by 99 codons. By considering only mutations that alter the amino acids, each amino acid locus has 19 possible mutations, so there are a total of 10 640 possible mutations that differ from wild type on the RT enzyme, and 1981 possible mutations on the protease enzyme. Using a simple linear model, where each mutation encountered in the data (not all mutations will occur) is associated with a particular weighting, or linear regression parameter, we may have several thousand parameters. If only several hundred patient samples are available for each drug, the problem is overcomplete, or ill-posed in the Hadamard sense (Hadamard, 1923), since we have more parameters to estimate than independent equations. Many techniques exist that can be applied to the problem of constructing models for the ill-posed problem. These include combining a priori expert knowledge with observations to create expert-rule based systems (ANRS, 2004, http://www.hivfrenchresistance.org/; HIVDB, 2004, http://hivdb.stanford.edu), as well as statistical methods including (1) ridge regression (Daniels and Kass, 2001, http://citeseer.ist.psu.edu/403256.html; Neter et al., 1990), (2) principal component analysis (Jolliffe, 2002; Mardia, 1980), (3) decision trees (Breiman, 1984; Hastie and Tibshirani, 1996), (4) stepwise selection techniques (Ryan, 1996; Wang et al., 2004), (5) neural networks (Draghici and Potter, 2003; Wang and Larder, 2003), (6) linear support vector machines (SVMs) (Beerenwinkel et al., 2003; Vapnik, 1998), (7) the least absolute shrinkage and selection operator (LASSO) (Efron et al., 2003; Tibshirani, 1996) and (8) non-linear SVMs (Schölkopf et al., 1999; Vapnik, 1998). We compare these techniques, and focus in particular on the latter two, which have been used to generate results superior to those of recently published approaches (Beerenwinkel et al., 2003; DiRienzo et al., 2003; Kijak et al., 2003; Wang et al., 2004). We explain why the LASSO and the SVM techniques, both of which generate convex cost functions and sparse parameter sets, achieve superior performance.
Three main industry-standard expert systems are typically used to predict the susceptibility of HIV viruses to ART drugs: the ANRS-AC11 System (ANRS, 2004), the Rega System (Rega, 2004, http://hivdb.stanford.edu/pages/algs/HIValg.html) and the Stanford HIVdb System (Rhee et al., 2003). It is commonplace in the literature for new algorithms to be benchmarked against these expert systems. None of these expert systems, however, is designed to perform direct prediction of phenotypic response, but rather to provide a numeric score by which different drugs can be compared, or to classify the drugs into discrete groupings such as Sensitive, Intermediate and Resistant (HIVDB, 2004). In addition, it has been clearly established (De Luca et al., 2004; Wang et al., 2004) that statistical algorithms, such as linear regression models trained with stepwise selection, substantially outperform expert systems in prediction of phenotypic outcome. Consequently, we will only compare a set of statistical techniques which include the best performing methods recently disclosed in the literature.
| METHODS |
|---|
|
|
|---|
Source of the dataset
We used the Stanford HIV RT and Protease drug resistance database to train and test our models (HIVDB, 2004). This data consist of several thousand in vitro phenotypic tests of HIV-1 viruses for which the Protease and/or RT encoding segments have been sequenced. Tests have been performed on 10 RTI drugs and 7 PI drugs. The RTIs include lamivudine (3TC), abacavir (ABC), zidovudine (AZT), stavudine (D4T), zalcitabine (DDC), didanosine (DDI), delaviradine (DLV), efavirenz (EFV), nevirapine (NVP) and tenofovir (TDF). The PIs include ampranavir (APV), atazanavir (ATV), nelfinavir (NFV), ritonavir (RTV), saquinavir (SQV), lopinavir (LPV) and indinavir (IDV)). The in vitro drug susceptibility methods used to gather the phenotypic data was ViroLogicTM (Virologic, 2005, http://www.virologichiv.com/). Details of the approach to gathering and rendering the data are provided by Robert Shafer's group at the site http://hivdb.stanford.edu. We will only discuss those issues directly relevant to formulating the statistical problem.
Problem formulation
For each drug, we structure the data into pairs of the form (xi, yi), i = 1, ..., N, where N is the number of samples constituting the training data, yi is the measured drug fold resistance and xi is the vector of mutations plus a constant term, xi = [1 xi1, xi2, ..., xiM]T, where M is the number of possible mutations on the relevant enzyme. We set element xim = 1 if the m-th mutation is present on i-th sample, and set xim = 0 otherwise. Each mutation is characterized both by the codon locus and the substituted amino acid. Mutations that do not affect the amino acid sequence are ignored. Note that only mutations present in >1% of the samples for each drug are included in the set of possible predictors for a model, since it is improbable that mutations associated with resistance would occur so infrequently. The measurement yi represents the fold resistance of the drug for the mutated virus as compared with the wild type. Specifically, yi is the log of the ratio of the IC50 (the concentration of the drug required to slow down replication by 50%) of the mutated virus, as compared with the IC50 of the wild-type virus. Our goal is to develop a model for each drug that accurately predicts yi from xi. In order to perform batch optimization on the data, we stack the independent variables in an N by M + 1 matrix, X = [x1, x2, ..., xN]T, and we stack all observations in a vector y = [y1, y2, ..., yN]T.
The performance of each algorithm is measured using cross-validation. For each drug, we calculate the first-order correlation coefficient R between the predicted phenotypic response of the model and the actual measured in vitro phenotypic response of the test data.
![]() | (1) |
is the prediction of phenotypes y,
denotes the mean of the elements in vector y and
denotes the vector of all ones. For each drug and each method, the data are randomly subdivided in the ratio 9:1 for training and testing, respectively. A total of 10 different subdivisions are performed in order to generate the vector
and R without any overlap of training and testing data. This entire process is then repeated 10 times to generate 10 different values of R. The 10 different values of R are averaged to generate the R reported. We also determine the standard deviation of R for each of the models measured over the 10 different experiments to ensure that we are comparing models in a statistically significant manner. | RESULTS |
|---|
|
|
|---|
Table 1 displays the results of the above mentioned models for the seven PI drugs; Table 2 displays the results for the 10 RTI drugs. Results are displayed in terms of correlation coefficient R, averaged over 10 subdivisions of the training and test data as described above. The estimated SD of the mean value of R, computed from the sample variance, is also displayed. The number of available samples for each drug is shown in the last row. The methods tested, in order of increasing average performance, are (1) RR, Ridge Regression; (2) DT, Decision Trees; (3) NN, Neural Networks; (4) PCA, Principal Component Analysis; (5) SS, Stepwise Selection; (6) SVM_L, Support Vector Machines with Linear Kernels (Beerenwinkel et al., 2003; Vapnik, 1998), (7) LASSO, least absolute shrinkage and selection operator and (8) SVM, Support Vector Machines with Radial Basis Kernels (Schölkopf et al., 1999; Vapnik, 1998). The information in the last columns of Table 1 and Table 2 is depicted in Figure 1. The circles display the correlation coefficient R averaged over 10 different experiments for each PI, and then averaged over the seven different PIs. The diamonds display the correlation coefficient R averaged over 10 different experiments for each RTI, and then averaged over the 10 different RTIs. The one SD error bars are also indicated. Wherever modeling techniques involve tuning parameters, these have been adjusted for optimal performance using a grid search approach: each point in the grid is associated with a quantized set of tuning parameter, and that point of the grid that produces the best correlation coefficient R in cross-validation is selected. Once the best-performing tuning parameters are selected, the model is re-trained and tested to generate a new correlation coefficient. In all cases, the grid quantization was fine enough that the difference in the prediction resulting from adjacent grid points lay below the experimental noise floor.
|
|
|
Although there are strong trends in the data, it should be noted that owing to differences in the number of samples, interactions of the underlying genetic predictors, and other idiosyncrasies in the data that vary between drugs, the R achieved by each algorithm varies from drug to drug. This can be seen by studying the individual drug columns of Tables 1 and 2.
Of all the methods, SVM performs best, slightly outperforming LASSO (P < 0.001 for the RTIs; P = 0.18 for the PIs).2 The performance of SVM trained with the
-insensitive loss function is considerably better than that of previously reported methods based on the SVM (Beerenwinkel et al., 2003; Wang and Larder, 2003). Note that SVM, which uses non-linear kernel functions, outperforms SVM_L which uses linear kernel functions and is also trained using the
-insensitive loss function (P = 0.003 for RTIs; P < 0.001 for PIs). The SVM considerably outperforms the other non-linear technique which uses neural networks (Wang and Larder, 2003) and which does not create a convex cost function (P < 0.001 for both RTIs and PIs). The LASSO technique, which trains a linear regression model using a convex cost function and continuous subset selection, considerably outperforms the SS technique (P < 0.001 for both PIs and RTIs). The top five methods, namely SS, PCA, SVM_L, LASSO and SVM_R, all tend to generate models that are sparse, or have a limited number of non-zero parameters. This will be discussed further in the next section.
In order to illustrate the subset of mutations selected as predictors, we will focus on the second-best performing model, namely the LASSO, which creates a linear regression model that, unlike SVM, does not attempt to emulate non-linear or logical coupling between the predictors. Consequently, it is straightforward to show how many predictors are selected. Table 3 shows the number of mutations selected by the LASSO as predictors for each PI drug, together with the number of mutations and the total number of samples used in training each model. The same table is shown for the RTI drugs in Table 4.
|
|
The selected mutations may also enhance understanding of the causes of drug resistance. Figures 24 show the value of the parameters selected by the LASSO for predicting response to PI, nucleoside RTIs (NRTIs) and non-nucleoside RTIs (NNRTIs), respectively. Each row in the figures represents a drug; each column represents a mutation. Relevant mutations are on the protease enzyme for PI drugs and on the RT enzyme for NRTI and NNRTI drugs. The shading of each square indicates the value of the parameter associated with that mutation for that drug. As indicated by the color-bar on the right, those predictors that are shaded darker are associates with increased resistance; those parameters that are shaded lighter are associated with increased susceptibility. The mutations are ordered from left to right in order of decreasing magnitude of the average of the associated parameter. The associated parameter is averaged over all rows, or drugs, in the class. Only those mutations associated with the 40 largest parameter magnitudes are shown. Note that for a particular mutation, or column, the value of the parameter varies considerably over the rows, or the different drugs in the same class.
|
For the algorithms RR, DT, NN and SS, the model was not trained on all mutations, but rather on a subset of mutations occurring at those loci that have been determined to affect resistance by the Department of Health and Human Services (DHHS, 2004; Wang et al., 2004). The reduction in the number of independent variables was found to improve the performance of these algorithms. In the case of the SVM_L algorithm, best performance for RTIs was achieved using only the DHHS mutation subset, while best performance for PIs was achieved by training the model on all mutations. For all other algorithms, best overall performance was achieved by training the model on all mutations.
The set of mutations shown in Figures 24 that were selected by the LASSO as predictors, but are not associated with loci determined by DHHS to affect resistance (DHHS, 2004; Wang et al., 2004), are for PIs19P, 91S, 67F, 4S, 37C, 11I, 14Z; for NRTIs68G, 203D, 245T, 208Y, 218E, 208H, 35I, 11K, 40F, 281K and for NNRTIs139R, 317A, 35M, 102R, 241L, 322T, 379G, 292I, 294T, 211T, 142V. Note that in some cases, such as for the LASSO,3 the performance for particular drugs, such as LPV, was significantly improved (P < 0.001) when all mutations were included in the model (R = 86.78, SD = 0.17) as compared with the case when only those loci recognized to affect resistance by DHHS (DHHS, 2004) were included (R = 81.72, SD = 0.18). This suggests that other mutations, beyond those recognized by the DHHS, play a role in resistance to such drugs, and the topic warrants further study.4
| DISCUSSION |
|---|
|
|
|---|
Motivation for the LASSO and the l1 selection function
We describe here the explicit use of the l1 norm in the LASSO (Efron et al., 2003; Tibshirani, 1996), and we extend this discussion to the SVM trained with the
-insensitive loss function in the next section. We begin by discussing the significance of sparse models.
When the number of predictors M exceeds the number of training samples N, the modeling problem is overcomplete, or ill-posed (Hadamard, 1923), since any arbitrary subset of N predictors is sufficient to yield a linear model with zero error on the training data, so long as the associated columns in the X matrix are linearly independent. Consequently, one is disinclined to put faith in an N-predictor model returned by a linear regression method. Suppose, however, a model with significantly fewer than N variables has low training error. The more sparse the model, the less probable that low training error could be a chance artifact; hence, more likely that the predictors are causally related to the dependent variable. This underlies the importance of sparse solutions in overcomplete problems, as is the case for the RTI data. A similar argument can be applied to ill-conditioned problems characterized by a large condition number on the matrix XTX. This is the case for several PI drugs for which N > M. The problem is ill-conditioned owing to co-linearity between the predictors resulting from similar genealogy of the sequences. In such a case, the estimated parameters
are highly susceptible to the model error, as well as to measurement noise, and as a result, the estimator has high variance and is unlikely to generalize accurately. Overcomplete and ill-conditioned problems are typical of genetic data, where the number of possible predictorsgenes, proteins or, in our case, mutation sitesis large relative to the number of measured outcomes.
One canonical approach to such cases is the best subset selection (Hastie and Friedman, 2001). This involves selecting a maximum number of predictors to be used, M'
N, and cross-validating the models that are generated from all possible subsets of M' or fewer predictors. When M = M', this involves 2M different subsets. Using advanced computational techniques (Hastie and Friedman, 2001), the problem may be solved for M
50. However, this approach is intractable for the number of potential predictorsand the number of predictors selectedin the results above. A computationally tractable approach is stepwise selection (Neter et al., 1990), which adds a single predictor to the model at each step, based on that variable having the highest F-test statistic indicating the level of significance with which it is correlated with prediction error. After each variable is added, the remaining variables are all checked to ensure that none of them have dropped below a threshold of statistical significance in their association with the prediction error of the model. This technique has been successfully applied to the problem of drug response prediction (Wang et al., 2004). However, owing to the discrete nature of the selection process, small changes in the data can considerably alter the chosen set of predictors. The presence or absence of one variable will affect the statistical significance associated with another variable and whether that variable is included or rejected from the model. This affects accuracy in generalization, particularly for ill-conditioned problems.
Another approach is for the values of the estimated parameters
to be constrained by means of a shrinkage function. Shrinkage functions tend to add bias to the estimator but reduce the estimator's variance (Hastie and Friedman, 2001) and hence improve the ability to generalize from ill-conditioned data. A canonical shrinkage function is the square of the l2 norm, or the sum of the squares of the parameters, and this is applied in ridge regression which finds the parameters according to
![]() | (2) |
is a tuning parameter, typically determined by cross-validation. This method is non-sparse and does not set parameters to 0. This tends to undermine accuracy in generalization and makes solutions difficult to interpret.
These problems are addressed by the LASSO technique. In contrast to stepwise selection, the LASSO does not perform discrete acceptance or rejection of predictor variables; rather it allows one to select en-masse, via a continuous subset optimization, the set of variables that together are the most effective predictors. It uses the l1 norm shrinkage function:
![]() | (3) |
is typically set by cross-validation in order to estimate the optimal trade-off between bias and variance in the estimator. The LASSO will tend to generate sparse models by setting many of the parameters to 0. Figure 5 provides insight into this feature of the LASSO, termed selectivity. Suppose that we are creating a model based on just two mutations with the training data X = [1 0; 0 1]T, y = [2 1]T, and the x-axis and y-axis represent the two parameters b1 and b2, respectively. Compare the use of the l1 and l2 shrinkage functions, where in both cases we find a solution that fits the training data equally well such that ||y Xb||2 = 2. The large circle, small circle and square respectively represent level curves for the cost functions ||y Xb||2, the l2 norm ||b||2 and the l1 norm |b|1 = |b1| + |b2|. A solution for ridge regression (l2 norm) is found where the two circles meet; a solution for the LASSO (l1 norm) is found where the square and the large circle intersect. Owong to the pointiness of the level cure for the l1 norm, we have found a solution that lies on the axis b1 and is therefore sparse. This argument, extended into higher dimensions, explains the tendency of LASSO to produce sparse solutions and suggests why the results achieved are measurably better than those of techniques reported in the literature (Beerenwinkel et al., 2003, 2002; Draghici and Potter et al., 2003; Wang and Larder, 2003; Wang et al., 2004). The l1 norm can be viewed as the most selective shrinkage function that remains convex. Since a convex function has a global minimum and no local minima, convexity guarantees that we can find the one global solution for a given dataset (Boyd, 2004; Murray et al., 1981). A highly efficient recent algorithm, termed Least Angle Regression (Efron et al., 2003; Osborne et al., 2000), is guaranteed to converge to the global solution of the LASSO in M steps. Code for solving the LASSO, for many hundreds or thousands of potential predictors, is available in MATLAB via anonymous ftp at ftp://ftp.genesecurity.net/code/LASSO.
SVMs, convexity and sparsity
In a modeling problem where there is coupling between the independent variables, or other non-linear effects, the performance of a linear regression model as used by the LASSO will be suboptimal. One approach to modeling non-linear effects in ART drug response prediction is to use neural networks (Draghici and Potter, 2003; Wang and Larder, 2003). In the results above, we focused on the techniques of Wang since these were reported to achieve the better performance of the two neural network methods, and are better suited to the regression problem that we address. In general, there are two key challenges with neural network approaches. The first is that the cost function to be minimized is non-convex, so that it is extremely difficult to converge in training to the global minimum that generates optimal parameter values, particularly when there are a large number of predictors as in our case. The second is that the complex neural network models tend to be non-sparse, and consequently do not generalize well. This problem is addressed in two ways (Wang and Larder, 2003) for the NN algorithm described above: First, using cross validation, the network training parameters and the number of hidden layer nodes that generalized most accurately for each drug are determined. Second, for each prediction, 10 independent neural networks are trained and then all 10 predictions are averaged to create a single model prediction. These techniques (Wang and Larder, 2003) are at best palliative and do not address the fundamental challenges of the non-sparse, non-convex, neural network approach.
As the results show, a preferable technique for creating non-linear models is the use of SVMs which are trained with convex cost functions, and which tend to produce sparse models that generalize well. SVMs are learning algorithms that can perform real-valued function approximation and can achieve accurate generalization of sample data even when the estimation problem is ill-posed in the Hadamard sense (Hadamard, 1923; Vapnik, 1998). The ability of SVMs to accurately generalize is influenced by two selectable features in the SVM model and training algorithm. The first is the selection of the cost function, or the function that we seek to minimize in training. The second is the selection of the kernels of the SVM, or those functions that enable the SVM to map complex non-linear functions, involving interactions between the independent variables, using a relatively small set of linear regression parameters (Burges, 1998; Vapnik, 1998). We discuss these features in the following paragraphs.
Consider modeling yi with a linear function approximation:
. Suppose we seek to estimate b by minimizing a cost function consisting of a l2 shrinkage function on the parameters, together with the
-insensitive loss function, which does not penalize data fitting errors below some
> 0. This SV regression may be formulated as (Vapnik, 1998)
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
. Parameter C allows one to scale the relative importance of the error versus the shrinkage on the weights.
Since the regression model is linear and the
-insensitive loss function is convex, the above constrained optimization is convex, and can consequently be solved by finding the saddle-point of a Lagrangian, in order to satisfy the KarushKuhnTucker (KKT) conditions (Boyd, 2004). In Appendix I, by referring to the Lagrangian dual problem, we describe how the use of the
-insensitive loss function in combination with an l2 shrinkage on the parameters is similar to the explicit use of an l1 shrinkage on the parameters, in terms of the tendency to generate sparse models.
Also in Appendix I, we describe how convexity is extended to the non-linear case with the use of kernel functions in the regression model
![]() | (8) |
![]() | (9) |
|x xi|), where
is determined by cross validation, as described above. The code to implement the SVM with the
-insensitive loss function and various kernel functions is available via anonymous ftp at ftp://ftp.genesecurity.net/code/SVM. | CONCLUSIONS |
|---|
|
|
|---|
We have described the use of convex optimization techniques to achieve continuous subset selection of sparse parameter sets in order to train drug-response prediction models that generalize accurately. We have described the use of the LASSO which applies the l1 norm shrinkage function to generate a sparse set of linear regression parameters. We have also described the use of the SVM with radial basis kernel functions and trained with the
-insensitive loss function to generate sparse non-linear models. Both the LASSO and the non-linear SVM techniques are able to model the response of mutated viruses to anti-retroviral drugs more accurately, as measured by cross-validation, than any of the techniques previously published. The superior performance of these techniques is explained in terms of the convexity of their cost functions used in training, and their tendency to produce sparse models. Convexity assures that one can find the globally optimal parameters for a particular training dataset when there are many potential predictors. Sparse models tend to generalize well, particularly in the context of underdetermined or ill-conditioned data, as is typical of genetic data. The l1 norm may be viewed as the most selective convex function. The SVM, which uses an l2 shrinkage function together with an
-insensitive loss function, tends to produce an effect similar to the explicit use of the l1 norm as a shrinkage function applied to the parameters associated with the support vectors. Two advantages of LASSO are that the resulting model is easy to interpret since the parameters directly combine predictors, or expressions involving predictors, rather than support vectors; and that the LASSO can be efficiently trained for models involving many potential predictors using the LARS algorithm. The performance of the LASSO may be enhanced by adding non-linear or logical combinations of predictors to the model. Logical combinations can be derived from those generated by a decision tree, from logical expressions described by expert rules, from the technique of logic regression (Kooperberg et al., 2001), or even from a set of random permutations of logical terms.
Note that other techniques exist that use shrinkage function more selective than the l1 norm. For example, log-shrinkage regression (Fazel, 2002; Singer, 2004) uses a shrinkage function derived from coding theory which measures the amount of information residing in the model parameter set. While offering a theoretically intriguing approach for seeking a sparse set of parameters, the non-convexity of the resulting cost function means that solving the log-shrinkage regression is computationally challenging, and for large sets of predictors will yield only a local, rather than a global, minimum for the given data.
Conflict of Interest: none declared.
|
|
|
| FOOTNOTES |
|---|
Associate Editor: Keith A Crandall
1In the in vivo case, the combined effect of multiple drugs in a regimen must be modeled, the outcome data in terms of CD4+ and viral load count is longitudinal, the virus tends to mutate over the period of treatment, and results are affected by interaction with human genetic and environmental factors (Beerenwinkel et al., 2003) (O'Brien and Nelson, 2004). ![]()
2Wherever a P-value is quoted to indicate the difference between methods, it is computed as follows. Let
and
be the sample means of the correlation coefficients for methods 1 and 2, each averaged over 10 different experiments and all different drugs. Since the data is randomly subdivided 10 times in the ratio 9:1 for training and testing for each drug and each method,
and
will be independent. Let
and
be the estimated standard deviations of the means as shown in the last columns of Table 1 and Table 2. The test statistic is computed:
. The P-value is computed using the two-tailed test:
. Since D has a Student-T distribution with >100 degrees of freedom, it can be treated as normally distributed as
N(0,1). ![]()
3A similar effect was also observed with the SVM. ![]()
4As a first step, this tendency should be verified with a new dataset. ![]()
Received on November 2, 2005; revised on December 14, 2005; accepted on December 14, 2005
| REFERENCES |
|---|
|
|
|---|
ANRS. (2004) Agence Nationale de Recherches sur le SIDA, group AC11.
Beerenwinkel, N., et al. (2003a) Geno2pheno: Estimating phenotypic drug resistance from HIV-1 genotypes. Nucleic Acids Res, . 31, 38503855
Beerenwinkel, N., et al. (2003b) Methods for optimizing antiviral combination therapies. Bioinformatics, 19, Suppl. 1, i16i25[Abstract].
Beerenwinkel, N., et al. (2002) Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype. Proc. Natl Acad. Sci. USA, 99, 82718276
Boyd, S.V.L. Convex Optimization, (2004) , Cambridge Cambridge University Press.
Breiman, L. Classification and regression trees, . (1984) , Belmont, CA Wadsworth.
Burges, C.J. A Tutorial on Support Vector Machines for Pattern Recognition, (1998) Kluwer Academic Press.
Carpenter, C.C., et al. (2000) Antiretroviral therapy in adults: updated recommendations of the International AIDS Society-USA Panel. JAMA, 283, 381390
Daniels, M.J. and Kass, R.E. (2001) Shrinkage estimators for covariance matrices.
D'Aquila, R.T., et al. (2003) Drug resistance mutations in HIV-1. Top HIV Med, . 11, 9296[Medline].
De Luca, A., et al. (2003) Variable prediction of antiretroviral treatment outcome by different systems for interpreting genotypic human immunodeficiency virus type 1 drug resistance. J. Infect. Dis, . 187, 19341943[CrossRef][Web of Science][Medline].
De Luca, A., et al. (2004) Construction, training and clinical validation of an interpretation system for genotypic HIV-1 drug resistance based on fuzzy rules revised by virologic outcomes. Antivir. Ther, . 9, 583593[Web of Science][Medline].
DHHS. (2004) Guidelines for the Use of Antiretroviral Agents in HIV-1-Infected Adults and Adolescents.
DiRienzo, A.G., et al. (2003) Non-parametric methods to predict HIV drug susceptibility phenotype from genotype. Stat. Med, . 22, 27852798[CrossRef][Web of Science][Medline].
Draghici, S. and Potter, R.B. (2003) Predicting HIV drug resistance with neural networks. Bioinformatics, 19, 98107
Efron, B., et al. (2003) Least Angle Regression. Ann. Stat, . 32, 407499[CrossRef].
Fazel, M. (2002) The Use of Log-Determinant Heuristics to Solve Rank Minimization Problems for Matrices. , Standford, CA PhD Dissertation Stanford University School of Electrical Engineering.
Fessel, W.J., Rhee, S.Y., Hurley, L., Nguyen, D.P., Slome, S., Smith, S.L., Klein, D., et al. (2003) High-level dual (NRTI and PI) and triple class multidrug resistance (MDR) in a large health maintenance organization: prevalence, risk factors, and response to salvage therapy [abstract H-915]. Interscience Conference on Antimicrobial Agents and ChemotherapyChicago, IL.
Hadamard, J. Lectures on the Cauchy Problem in Linear Partial Differential Equations, (1923) Yale University Press.
Hastie, T. and Friedman, J. Elements of Statistical Learning, Data Mining, and Prediction, (2001) Springer Verlag.
Hastie, T.J. and Tibshirani, R.J. General Additive Models, (1996) Chapman & Hall.
HIVDB. (2004) The HIV RT and Protease Sequence Database at Stanford University.
Jolliffe, I.T. Principal Component Analysis, (2002) Springer.
Kijak, G.H., et al. (2003) Discrepant results in the interpretation of HIV-1 drug-resistance genotypic data among widely used algorithms. HIV Med, . 4, 7278[CrossRef][Medline].
Kooperberg, C., et al. (2001) Sequence analysis using logic regression. Genet Epidemiol, . 21, Suppl. 1, S626S631.
Kuritzkes, D.R., et al. (1996) Drug resistance and virologic response in NUCA 3001, a randomized trial of lamivudine (3TC) versus zidovudine (ZDV) versus ZDV plus 3TC in previously untreated patients. AIDS, 10, 975981[Web of Science][Medline].
Mardia, K.V. Multivariate Analysis, (1980) Academic Press.
Meynard, J.L., et al. (2002) Phenotypic or genotypic resistance testing for choosing antiretroviral therapy after treatment failure: a randomized trial. AIDS, 16, 727736[CrossRef][Web of Science][Medline].
Molla, A., et al. (1996) Ordered accumulation of mutations in HIV protease confers resistance to ritonavir. Nat. Med, . 2, 760766[CrossRef][Web of Science][Medline].
Murray, W., Gill, P., Wright, M. Practical Optimization, (1981) , NY Academic Press.
Neter, J., Wasserman, W., Kutner, M. Applied Linear Statistical Models: Regression, Analysis of Variance and Experimental Designs, (1990) 3rd edn Richard D. Irwin, Inc.
O'Brien, S.J. and Nelson, G.W. (2004) Human genes that limit AIDS. Nat. Genet, . 36, 565574[CrossRef][Web of Science][Medline].
Osborne, M.R., et al. (2000) A new approach to variable selection in least squares problems. IMA J. Numer. Anal, . 20, 389403
Ravela, J., et al. (2003) HIV-1 protease and reverse transcriptase mutation patterns responsible for discordances between genotypic drug resistance interpretation algorithms. J. Acquir. Immune Defic. Syndr, . 33, 814[Medline].
Rhee, S.Y., et al. (2003) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res, . 31, 298303
Ryan, T.P. Modern Regression Methods, (1996) Wiley.
Schölkopf, B., Burges, C., Smola, A. Advances in Kernel MethodsSupport Vector Learning, (1999) , Cambridge, MA MIT Press.
Shafer, R.W. (2002) Genotypic testing for human immunodeficiency virus type 1 drug resistance. Clin. Microbiol. Rev, . 15, 247277
Shafer, R.W. NIAID ACTG Data Analysis Concept Sheet Proposal: Genotype predictors of clinical outcome across completed ACTG clinical trials, (2004) NIAID ACTG.
Sherr, L. (2000) Adherencesticking to the evidence. AIDS Care, 12, 373375[Medline].
Singer, J. (2004) Log-Penalized Linear Regression. , Standford, CA PhD Dissertation Stanford University School of Electrical Engineering.
Tibshirani, R. (1996) Regression shrinkage and selection via the LASSO. J Royal Statist. Soc. B, 58, 267288.
UNAIDS. (2004) UNAIDS/WHO. Report on the Global AIDS epidemic. Geneva, Switzerland.
Vapnik, V.N. The Nature of Statistical Learning Theory, (1998) Springer.
Wang, D. and Larder, B. (2003) Enhanced prediction of lopinavir resistancefrom genotype by use of artificial neural networks. J. Infect Dis, . 188, 653660[CrossRef][Web of Science][Medline].
Wang, K., et al. (2004) Simple linear model provides highly accurate genotypic predictions of HIV-1 drug resistance. Antivir Ther, . 9, 343352[Web of Science][Medline].
Zolopa, A.R., et al. (1999) HIV-1 genotypic resistance patterns predict response to saquinavir-ritonavir therapy in patients in whom previous protease inhibitor therapy had failed. Ann. Intern. Med, . 131, 813821
This article has been cited by other articles:
![]() |
H. Saigo, T. Uno, and K. Tsuda Mining complex genotypic features for predicting HIV-1 drug resistance Bioinformatics, September 15, 2007; 23(18): 2455 - 2462. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











