Genotypic predictors of human immunodeficiency virus type 1 drug resistance
See allHide authors and affiliations

Communicated by Bradley Efron, Stanford University, Stanford, CA, August 28, 2006 (received for review December 5, 2005)
Abstract
Understanding the genetic basis of HIV1 drug resistance is essential to developing new antiretroviral drugs and optimizing the use of existing drugs. This understanding, however, is hampered by the large numbers of mutation patterns associated with crossresistance within each antiretroviral drug class. We used five statistical learning methods (decision trees, neural networks, support vector regression, leastsquares regression, and least angle regression) to relate HIV1 protease and reverse transcriptase mutations to in vitro susceptibility to 16 antiretroviral drugs. Learning methods were trained and tested on a public data set of genotype–phenotype correlations by 5fold crossvalidation. For each learning method, four mutation sets were used as input features: a complete set of all mutations in ≥2 sequences in the data set, the 30 most common data set mutations, an expert panel mutation set, and a set of nonpolymorphic treatmentselected mutations from a public database linking protease and reverse transcriptase sequences to antiretroviral drug exposure. The nonpolymorphic treatmentselected mutations led to the best predictions: 80.1% accuracy at classifying sequences as susceptible, low/intermediate resistant, or highly resistant. Least angle regression predicted susceptibility significantly better than other methods when using the complete set of mutations. The three regression methods provided consistent estimates of the quantitative effect of mutations on drug susceptibility, identifying nearly all previously reported genotype–phenotype associations and providing strong statistical support for many new associations. Mutation regression coefficients showed that, within a drug class, crossresistance patterns differ for different mutation subsets and that crossresistance has been underestimated.
Twenty antiretroviral drugs are approved for treating HIV1 infection: eight protease inhibitors (PIs), seven nucleoside and one nucleotide reverse transcriptase (RT) inhibitors (NRTIs), three nonnucleoside RT inhibitors (NNRTIs), and one fusion inhibitor. Resistance to these drugs is caused by mutations in their molecular targets. Understanding the genetic basis of crossresistance is essential for designing new antiviral drugs and for using genotypic drug resistance testing to select optimal therapy. Despite the large number of PIs and RT inhibitors, therapy is challenging because drug resistance arises from complex patterns of mutations and because of the high degree of crossresistance within each drug class.
Approaches for using HIV1 drug resistance mutations to predict changes in drug susceptibility have included decision trees (1), linear regression (2), linear discriminant analysis (3), neural networks (4), and support vector regression (SVR) (5). Here, we compare five statistical learning methods each using four different sets of input mutations to develop quantitative models associating HIV1 protease and RT mutations with changes in susceptibility to 16 antiretroviral drugs. The analyses are performed on a curated publicly available data set (6) generated with a highly reproducible drug susceptibility assay (7, 8). The results provide insight into the performance of different statistical learning methods at predicting phenotypic characteristics of highly polymorphic proteins and into the genetic mechanisms of HIV1 antiretroviral crossresistance.
Results
Drug Susceptibility Results, Input Mutations, and Learning Methods.
For each of the three drug classes, we created four mutation sets that included (i) a complete set of all mutations present in ≥2 sequences, (ii) an expert panel mutation set (9), and (iii) a set of nonpolymorphic treatmentselected mutations (TSMs) derived from a database linking protease and RT sequences to the treatment histories of persons from whom the sequenced viruses were obtained (10) (Table 1). A control set of the 30 most common mutations in the data set was also created (see Supporting Text, which is published as supporting information on the PNAS web site). Predictions using these 30 mutations were consistently inferior to those using the other three mutation sets (data not shown).
Table 2 shows the number of isolates for which sequences and susceptibility results were available. Thirtyseven to 60% of isolates had reduced drug susceptibility to one or more PIs. Thirtyone to 70% had reduced drug susceptibility to one or more NRTIs. Thirtythree to 41% had reduced drug susceptibility to one or more NNRTIs. The distribution in the number of nonpolymorphic TSMs and expert panel mutations per isolate is shown in Fig. 2, which is published as supporting information on the PNAS web site.
We applied five statistical learning methods [decision trees, neural networks, leastsquares regression (LSR), SVR, and least angle regression (LARS)] to classify isolates as susceptible, low/intermediate, or highly resistant to the drugs used for testing. The regression methods were used to predict the level of reduced drug susceptibility for each isolate.
Classification.
The mean prediction accuracy (5 methods × 3 mutation sets) was highest for the NNRTIs (83.0%) compared with the PIs (78.2%; P < 0.001) and NRTIs (75.9%; P < 0.001) (Table 3). Among the PIs, the highest accuracy was for ritonavir (85.4%) and the lowest was for atazanavir (70.6%), the PI with the fewest test results. Among the NRTIs, the highest accuracies were for lamivudine (88.6%) and the lowest were for tenofovir (TDF) (67.7%), the NRTI with the fewest test results. There was minimal variation in prediction accuracy among the NNRTIs.
When the prediction accuracies of the possible combinations of drug and learning method were averaged over the different mutation sets, the mean accuracies of the learning methods ranged from 76.1% (neural networks) to 79.7% (LARS). The superiority of LARS was due to its accuracy in using the complete mutation set (80.3%), which was significantly higher than decision trees (77.2%; P = 0.02), SVR (76.1%; P < 0.001), neural networks (74.6%; P < 0.001), and LSR (72.3%; P < 0.001). Regression methods (79.9%) had higher accuracy for the PIs than the decision tree and neural network methods (75.5%; P < 0.001).
Averaged over the different learning methods and drugs, the nonpolymorphic TSM set had the highest prediction accuracy (80.1%) followed by the expert panel (77.5%; P < 0.001) and complete mutation set (76.1%; P < 0.001). However, the use of all of the expert panel mutations for a drug class (pooling the mutations associated with all of the drugs of a class) increased the accuracy from 77.5% to 79.3%.
Table 4, which is published as supporting information on the PNAS web site, shows the number of highly discordant results according to the mutation data set and learning method. For 10,624 predictions obtained by applying LARS to the nonpolymorphic TSM data set, only 33 (0.31%) were highly discordant with the measured phenotype (Table 5, which is published as supporting information on the PNAS web site).
Regression.
The TSM set had the highest correlation coefficients (r ^{2}) between actual and predicted susceptibility compared with the complete and expert panel mutation set (Table 6, which is published as supporting information on the PNAS web site). The r ^{2} values for the TSM set determined by SVR, LSR, and LARS ranged from 0.83 to 0.84 averaged over the PIs, 0.76 to 0.77 averaged over the NRTIs, and 0.77 to 0.79 averaged over the NNRTIs.
Averaged over each of the 16 drugs and the three regression methods, the meansquared errors (MSEs) were 0.22 for the TSMs and 0.32 for the expert panel mutations (Table 7, which is published as supporting information on the PNAS web site). However, the use of the complete set of expert panel mutations for a drug class reduced the overall MSE from 0.32 to 0.26. The MSEs of the regression methods using the TSMs were inversely proportional to the number of samples used for testing and training (Fig. 3, which is published as supporting information on the PNAS web site).
Regression coefficients for each of the PI, NRTI, and NNRTI TSMs determined by the different regression methods were highly correlated (Tables 8–10 and Fig. 4, which are published as supporting information on the PNAS web site). For the PIs, the mean r ^{2} between the LSR and SVR coefficients was 0.98 and between LSR and LARS was 0.96. For the NRTIs, the mean r ^{2} between LSR and SVR was 0.94 and between LSR and LARS was 0.91.
PIResistance Mutations.
Fig. 1 A shows the LSR coefficients for 35 PI TSMs occurring in ≥10 sequences and significantly associated with decreased susceptibility to one or more PIs (regression coefficient >3.0 standard deviations above or below 0). The substrate cleft mutations G48V and I84V; the flap mutations I54V and Q58E; and the mutations L24I, G73S, and L90M were associated with decreased susceptibility to all seven PIs. The substrate cleft mutations V32I, I50V, and V82A/T/F; the flap mutations K43T, M46I/L, I47V, F53L, and I54M/L; and the mutations L10F, K20I/T, G73T, T74S, and N88D/S were associated with decreased susceptibility to four or more PIs.
The TSM coefficients provided quantitative confirmation of each nonpolymorphic expert panel mutation. In addition, six nonexpertpanel TSMs (V11I, K43T, Q58E, T74S, L76V, and L89V) were significantly associated with decreased susceptibility to one or more PIs.
NRTI and NNRTIResistance Mutations.
Fig. 1 B shows the LSR regression coefficients for 23 NRTI TSMs occurring in ≥10 sequences and significantly associated with decreased susceptibility to one or more NRTIs. The TSM coefficients provided quantitative confirmation of each nonpolymorphic expert panel mutation. In addition, 8 nonexpertpanel TSMs (K43E/Q, V75M/T, E203K, D218E, K219R, and L228H) decreased susceptibility to one or more NRTIs. As previously reported, M184V increased susceptibility to zidovudine (AZT), stavudine (d4T), and TDF; L74V increased susceptibility to AZT and TDF; and K65R increased susceptibility to AZT (11–13).
The LSR coefficients for 24 NNRTI TSMs occurring in ≥2 sequences and associated with decreased susceptibility to one or more NNRTIs are shown in Table 10. Most mutations decreased susceptibility to each of the NNRTIs. Seven nonexpertpanel TSMs (K101E/P, K103S, Y181V, G190E/Q, and K238T) decreased susceptibility to one or more NNRTIs.
To explore the effect of NNRTIresistance mutations on NRTI susceptibility and NRTIresistance mutations on NNRTI susceptibility, we used LSR and the combined set of NRTI and NNRTI TSMs to predict both NRTI and NNRTI susceptibility. Two NNRTIresistance mutations, L100I and Y181C, had significantly negative regression coefficients for AZT (−0.45 and −0.45) and TDF (−0.43 and −0.33) consistent with previous reports (12, 14). Five NRTIresistance mutations had significantly negative regression coefficients for efavirenz (EFV) including M41L, D67N, M184V, L210W, and K219Q (range of −0.08 to −0.17), consistent with previous reports (15).
Discussion
Association studies with HIV1 genotype are complicated by the high dimensionality of the data caused by HIV1's high mutation rate. We used several types of external feature selection to limit the dimensionality of HIV1 genotypic data and five statistical learning methods to create models of how genotype influences susceptibility. LARS, the only regression method that performed its own feature selection, was superior to the other regression methods at predicting decreased susceptibility in the absence of external feature selection (i.e., using the complete set of mutations present in ≥2 sequences in the data set). However, LARS was not superior to the other regression methods when externally derived features were used.
The most successful approach to external feature selection was to use a set of mutations previously identified as absent in viruses from treatmentnaïve individuals and occurring at significantly increased frequencies in viruses from treated individuals (10). This set of nonpolymorphic TSMs had the highest accuracy (80.1%) averaged over the five learning methods when classifying isolates as susceptible, low/intermediate, and highlevel resistant, and had the lowest MSEs compared with the other mutation data sets when used for regression. The success of TSMs at predicting susceptibility follows from Darwinian principles: the TSMs emerge during therapy presumably because they facilitate virus escape from drug inhibition. Exploiting the results of an analysis on a separate data set containing genotypetreatment correlations thus proved superior to relying completely on the genotype–phenotype data set to predict phenotype.
The fact that prediction accuracy was lowest for the drugs (atazanavir and TDF) with the fewest samples suggests that accuracy depends to a large extent on sample size. The learning curves show that for most drugs, ≈400 genotype–phenotype examples were required for optimal predictive accuracy and that the MSE then saturates at 0.15–0.20, suggesting that factors other than TSMs influence susceptibility. For example, polymorphic sites influence susceptibility either directly or indirectly by affecting virus replication capacity (16). In addition, protease cleavage site mutations, primarily those in the gag gene, also influence viral replication capacity and drug susceptibility (17).
Although it is likely that the effect of individual mutations on drug susceptibility depends on the presence of other mutations, we did not demonstrate an improvement in predictive accuracy using a LARS regression model that included all possible twoway interactions or using a polynomial kernel for SVR (data not shown). This may be because the process of exploring interactions without model selection to choose sets of interacting mutations often leads to decreased performance due to overfitting. Alternatively, the main effects may have already included the effects of interactions because, in clinical virus isolates, only those combinations of mutations that reduce susceptibility and allow the virus to replicate are likely to emerge.
The regression coefficients provide evidence for the high level of crossresistance within the PI class and identified new mutations that were associated with decreased susceptibility to one or more drugs. Because the coefficients were derived from a model in which susceptibility results were standardized (i.e., corrected for the different ranges in susceptibility for different drugs), their magnitude indicates the mutation's contribution to resistance relative to other mutations affecting the same drug and relative to the mutation's effect on other drugs. The high correlation among the coefficients derived from three different regression methods suggests that the mutation regression coefficients represent a reproducible effect that is a real property of the data.
Although drugs of the same class shared high levels of crossresistance, the patterns of crossresistance were often different for different mutations. This may reflect the physical interactions inhibitors make with more than one part of the target molecule. For example, PIs bind to four or more binding pockets in the protease substrate cleft, whereas NRTIs interact with several parts of RT both before and after being added to a growing DNA chain. The genetic mechanisms of crossresistance and of mutational antagonism should be exploited in selecting antiretroviral drug regimens by combining drugs with the least crossresistance and in designing new compounds containing moieties that select for antagonistic mutations.
Methods
HIV1 Isolates and Genotypes.
HIV1 sequences used for this study were from publicly available isolates in the Stanford HIV Drug Resistance Database for which both sequences and in vitro susceptibility results were available (6). Isolates included viruses from the plasma of HIV1infected persons and laboratory viruses with drugresistance mutations resulting from sitedirected mutagenesis or in vitro passage. Up to two isolates from a small number of individuals were included provided their isolates differed at two or more drugresistance positions. Isolates with electrophoretic evidence of more than one amino acid at a nonpolymorphic drugresistance position were excluded from analysis.
Genotypes were derived from the amino acid sequences of positions 1–99 in protease and 1–240 in RT. Mutations were defined as amino acid differences from the subtype B consensus wildtype sequence (http://hivdb.stanford.edu/pages/asi/releaseNotes/). Mutations were classified as nonpolymorphic if they occurred with a frequency of ≤0.5% in untreated persons.
Drug Susceptibility Results.
For consistency, only susceptibility results generated by the PhenoSense method (Monogram Biosciences, South San Francisco, CA) were analyzed (7, 8). Drug susceptibility results were expressed as fold change in susceptibility defined as the ratio of the IC_{50} of an isolate and a standard wildtype control isolate. Results were classified into three categories: susceptible, low/intermediate resistance, and highlevel resistance. The cutoff between susceptible and low/intermediate resistance was based on the distribution of results in HIV1 isolates from untreated persons lacking drugresistance mutations (18). The cutoff between low/intermediate and highlevel resistance was based partly on the drug's dynamic susceptibility range (fold difference of the most highly resistance isolates) and partly based on levels of resistance associated with markedly reduced clinical activity.
For the PIs, <3.0fold resistance was considered susceptible, 3.0 to 20fold was considered low/intermediate resistant, and >20fold was considered highly resistant. For the NNRTIs and the NRTIs AZT and lamivudine, <3.0fold resistance was considered susceptible, 3.0 to 25fold was considered low/intermediate, and >25fold was considered highly resistant. For the NRTIs ddI, d4T, and TDF, a fold resistance of <1.5 was considered susceptible, 1.5–3.0 was considered low/intermediate, and >3.0 was considered highly resistant. For the NRTI ABC, <2.0fold resistance was considered susceptible, 2.0 to 6.0fold was considered low/intermediate, and >6.0fold was considered highly resistant. Susceptibility results were logtransformed and standardized before analysis.
Prediction Algorithms.
Decision trees.
The C4.5 algorithm (19) was used to construct decision trees for the 16 drugs in the study. Mutations were the attributes, and phenotypic classifications were the decision variables.
Neural network.
Feedforward neural network models with a single hidden layer were trained to predict susceptibility by using the R package AMORE (http://cran.rproject.org). The input layer contained a node for each of the mutation attributes. The hidden layer was assigned 12 nodes for all drugs. The output layer consisted of three nodes, one for each of the susceptibility classes. The weights were learned by using the error back propagation algorithm.
Support vector regression.
SVR was used to learn a regression function of the form: ƒ(x) = Σα_{i} K(x,x _{i}) + b, where ƒ(x) is the logarithm of the fold value for the training sample and x is the binary vector of mutations (20). K is the kernel function, the nonzero α_{i} correspond to the support vectors, and b is the bias term. Both linear and polynomial kernels were used. The latter was used to model interactions between mutations. SVR was performed with the PyML package (available at http://pyml.sourceforge.net).
Linear regression.
LSR and LARS (20, 21) were used to predict the logarithm of the fold decrease in susceptibility. For each of the regression methods, the coefficients were learned after scaling the logfold values to zero mean and unit variance to analyze the relative impact of a mutation on different drugs. LARS is a constrained model building procedure similar to the LASSO (23) that constructs a model by first finding the mutation most correlated with susceptibility and then incrementally builds the model by following the “equiangular vector” until another variable is equally correlated with the residual. The process would eventually terminate at a leastsquares solution. However, a validation set (20% of the data) was used to decide when to stop adding variables to the model. With LARS, secondorder polynomials were also used to model interactions among each of the input mutations.
CrossValidation.
Fivefold crossvalidation was used to determine the mean generalization accuracy of each learning method on test data. Fivefold crossvalidation was run 10 times on different subdivisions of the data set to estimate variability of the mean generalization accuracy. For decision trees, LSR, and SVR, 80% of the data were used for training and 20% for testing. For neural networks and LARS, 60% of the data were used for training, 20% for validating the selected model, and 20% for testing.
Performance Criteria.
Accuracy was the proportion of correctly predicted samples. The regression methods were also evaluated by using the MSE and r ^{2} between actual and predicted standardized logfold values. Both measurements indicate how much of the variability in the response variable (the standardized logtransformed reduction susceptibility) was explained by the regression model.
All comparisons between drug classes, drugs, mutation sets, or learning methods were done in a pairwise fashion, using paired differences of the averaged accuracy or MSE for each repeated run of crossvalidation. For instance, the accuracy for mutations sets was averaged over all drugs and learning methods, resulting in a single observation per run of crossvalidation. A onesample permutation test on the 10 paired differences was used to ensure the correct null distribution was used for computing P values.
To assess the effect of the number of training examples (genotype–phenotype correlations) on prediction accuracy, we created sets of randomly selected training examples that were multiples of 50 and ranging in size from 50 to 600 for testing and training using 5fold crossvalidation.
Acknowledgments
S.Y.R., G.W., J.T., and R.W.S. were supported in part by National Institute of Allergy and Infectious Diseases Grant AI4614801 and National Institute of General Medical Sciences Grant 5P01GM06652403. A.B.H. and D.L.B. were supported by a Stanford BioX Interdisciplinary Grant.
Footnotes
 ^{¶}To whom correspondence should be addressed at: Division of Infectious Diseases, Room S169, Stanford University, Stanford, CA 94305. Email: rshafer{at}stanford.edu

Author contributions: S.Y.R. and J.T. contributed equally to this work; R.W.S. designed research; S.Y.R., J.T., G.W., and R.W.S. performed research; A.B.H. and D.L.B. contributed new reagents/analytic tools; S.Y.R., J.T., G.W., A.B.H., D.L.B., and R.W.S. analyzed data; and S.Y.R., J.T., and R.W.S. wrote the paper.

↵ ^{§}Present address: Department of Computer Science, Colorado State University, Fort Collins, CO 80523.

The authors declare no conflict of interest.
 Abbreviations:
 AZT,
 zidovudine;
 LARS,
 least angle regression;
 LSR,
 leastsquares regression;
 MSE,
 meansquared error;
 PI,
 protease inhibitor;
 RT,
 reverse transcriptase;
 NRTI,
 nucleoside RT inhibitor;
 NNRTI,
 nonnucleoside RT inhibitor;
 SVR,
 support vector regression;
 TDF,
 tenofovir;
 TSM,
 treatmentselected mutation.

Freely available online through the PNAS open access option.
 © 2006 by The National Academy of Sciences of the USA
References

↵
 Beerenwinkel N ,
 Schmidt B ,
 Walter H ,
 Kaiser R ,
 Lengauer T ,
 Hoffmann D ,
 Korn K ,
 Selbig J
 ↵

↵
 Sevin AD ,
 DeGruttola V ,
 Nijhuis M ,
 Schapiro JM ,
 Foulkes AS ,
 Para MF ,
 Boucher CA

↵
 Wang D ,
 Larder B

↵
 Beerenwinkel N ,
 Daumer M ,
 Oette M ,
 Korn K ,
 Hoffmann D ,
 Kaiser R ,
 Lengauer T ,
 Selbig J ,
 Walter H

↵
 Rhee SY ,
 Gonzales MJ ,
 Kantor R ,
 Betts BJ ,
 Ravela J ,
 Shafer RW

↵
 Petropoulos CJ ,
 Parkin NT ,
 Limoli KL ,
 Lie YS ,
 Wrin T ,
 Huang W ,
 Tian H ,
 Smith D ,
 Winslow GA ,
 Capon DJ ,
 Whitcomb JM

↵
 Zhang J ,
 Rhee SY ,
 Taylor J ,
 Shafer RW
 ↵

↵
 Rhee SY ,
 Fessel WJ ,
 Zolopa AR ,
 Hurley L ,
 Liu T ,
 Taylor J ,
 Nguyen DP ,
 Slome S ,
 Klein D ,
 Horberg M ,
 et al.

↵
 St Clair M ,
 Martin JL ,
 TudorWilliams G ,
 Bach MC ,
 Vavro CL ,
 King DM ,
 Kellam P ,
 Kemp SD ,
 Larder BA

↵
 Parkin N ,
 Chappey C ,
 Petropoulos C ,
 Hellmann N

↵
 Parikh UM ,
 Koontz DL ,
 Chu CK ,
 Schinazi RF ,
 Mellors JW

↵
 Larder BA
 ↵

↵
 MartinezPicado J ,
 Wrin T ,
 Frost SD ,
 Clotet B ,
 Ruiz L ,
 Brown AJ ,
 Petropoulos CJ ,
 Parkin NT

↵
 Maguire MF ,
 Guinea R ,
 Griffin P ,
 Macmanus S ,
 Elston RC ,
 Wolfram J ,
 Richards N ,
 Hanlon MH ,
 Porter DJ ,
 Wrin T ,
 et al.

↵
 Parkin NT ,
 Hellmann NS ,
 Whitcomb JM ,
 Kiss L ,
 Chappey C ,
 Petropoulos CJ

↵
 Quinlan J

↵
 Schoelkopf B ,
 Smola A
 ↵

 Tibshirani R