Dr.VAE: Drug Response Variational Autoencoder
Ladislav Rampasek, Daniel Hidru, Petr Smirnov, Benjamin Haibe-Kains, Anna Goldenberg
Abstract
We present two deep generative models based on Variational Autoencoders to improve the accuracy of drug response prediction. Our models, Perturbation Variational Autoencoder and its semi-supervised extension, Drug Response Variational Autoencoder (Dr.VAE), learn latent representations of the underlying gene states before and after drug application that depend on: (i) the drug-induced biological change of each gene and (ii) the overall treatment response outcome. Our VAE-based models outperform the current published benchmarks in the field in both AUROC and AUPR. In addition, we found that better reconstruction accuracy does not necessarily lead to improvement in classification accuracy, and that jointly trained models perform better than models that minimize reconstruction error independently.
Despite tremendous advances in the pharmaceutical industry, many patients worldwide do not respond to the first medication they are prescribed. Personalized medicine, an approach that uses patients' own genomic data, promises to tailor the treatment program to increase the probability of a positive response. To build powerful predictive models, however, we need training data. The space of all possible treatments and their combinations for a given condition is enormous, and the heterogeneity of patients with complex diseases is high. Thus, while much data has been collected, it is sparsely and inefficiently sampled, making this a very hard learning problem.

∗ Corresponding authors: [email protected] and [email protected]
† University of Toronto, Department of Computer Science, Toronto, ON, Canada
‡ The Hospital for Sick Children, Toronto, ON, Canada
§ Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
¶ Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
‖ Ontario Institute of Cancer Research, Toronto, ON, Canada

arXiv preprint [stat.ML]

In the last decade there have been several public releases of high-throughput drug screens in cell lines. Cancer cell lines are cells taken from a patient's tumor that are "immortalized", i.e. modified to divide indefinitely. The greatest advantage of cell lines is that it is relatively inexpensive to test them with thousands of drugs, providing a rich basis for learning predictive models. This screening task was undertaken by several large consortia and pharmaceutical companies, resulting in public datasets of various sizes, e.g. the Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al., 2013) with 138 drugs tested on 700 cancer cell lines, and the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012) with 24 drugs tested on a panel of >1000 cell lines. The availability of these datasets spurred the development of predictive models. Jang et al. (2014) compared seven standard machine learning approaches and identified ridge and elastic net regressions as the best performers across the 24 compounds from the CCLE dataset and the 138 compounds from the GDSC dataset.

In addition, there is a perturbation database containing over 16,000 experiments showing how the expression of 1000 genes changed in response to a drug (gene expression is recorded before and after drug application) (Duan et al., 2014). This information allows for the assessment of biological change due to treating the cancer cells with drugs. Combining response and perturbation data is expected to ultimately yield a better and more biologically relevant model of drug response, though more experiments will likely be needed, since only a few drugs have been tested in each cell line.

In this paper we present two deep generative models, Perturbation Variational Autoencoder and its semi-supervised extension, Drug Response Variational Autoencoder (Dr.VAE), that learn latent representations of the underlying gene states before and after drug application that depend on both the cell line's overall response to the drug and the biological change of each of the landmark genes.
We build on the ability of VAEs to leverage the expressiveness of deep neural networks for Bayesian learning and inference (Kingma et al., 2014). In addition, as Bayesian models they are better suited to settings where very little data is available, which is the case in our drug response prediction problem. To fit our model we use a combination of Stochastic Gradient Variational Bayes (Kingma and Welling, 2013) and Inverse Autoregressive Flow (Kingma et al., 2016), a recently introduced type of Normalizing Flow (Rezende and Mohamed, 2015).

We tested our methods on 19 drugs for which both perturbation and drug response data were available. Both Dr.VAE and Semi-Supervised VAE outperform the current published benchmark models (Jang et al., 2014) in both AUROC and AUPR. Our analysis of this problem and of the models' performance shows the difficulty of fitting sparsely and inefficiently sampled high-dimensional data, but at the same time illustrates the flexibility of, and potential improvement over, the currently available state-of-the-art models for the drug response prediction problem.
We propose two models. First, we discuss an approach for modeling drug perturbation effects: given the gene expression of a cell line before the drug is applied (pre-treatment gene expression), we aim to predict gene expression after the drug is applied (post-treatment state). To this end we propose a deep generative model, Perturbation VAE (PertVAE). We then develop a drug response prediction model, Drug Response Variational Autoencoder (Dr.VAE), a semi-supervised extension of PertVAE, to tackle the problem of drug response (treatment efficacy) prediction while harnessing the unsupervised information about the particular drug from observed pre- and post-treatment gene expression perturbation pairs.

[Figure 1: Perturbation VAE: (a) Factorization of the generative distribution p; (b) Factorization of the approximate posterior distribution q. Note that we use the generative p_θ(z_2|z_1) in case x_2 is not observed.]

Perturbation Variational Autoencoder (PertVAE) is an unsupervised model for drug-induced gene expression perturbations that embeds the data space (gene expression) in a lower-dimensional latent space. In the latent space we model the drug-induced effect as a linear function, which is trained jointly with the embedding encoder and decoder. We fit PertVAE on "perturbation pairs" [x_1, x_2] of pre-treatment and post-treatment gene expression with a shared stochastic embedding encoder q_φ(z|x) and decoder p_θ(x|z). The original dimension of each vector x_k is 903 landmark genes. Additionally, we use unpaired pre-treatment data (with no known post-treatment state) to improve learning of the latent representation. The graphical representation of the PertVAE model is shown in Figure 1.

Joint distribution.
Our Perturbation VAE models the joint distribution p(x_1, x_2, z_1, z_2), which we assume factorizes as:

p(x_1, x_2, z_1, z_2) = p(x_1|z_1) · p(x_2|z_2) · p(z_2|z_1) · p(z_1)   (1)

Generative distribution p. Perturbation VAE's generative process is as follows:

p(z_1) = N(0, I)   (2)
p_θ(z_2|z_1) = N(z_2 | μ_{z_2} = f_θ(z_1), σ²_{z_2} = exp f_θ(z_1))   (3)
k ∈ {1, 2}: p_θ(x_k|z_k) = N(x_k | μ_{x_k} = f_θ(z_k), σ²_{x_k} = exp f_θ(z_k))   (4)

The parameters of these distributions are computed by functions f_θ, which are neural networks with a total set of parameters θ. For brevity we write θ instead of the more specific θ_{z→x} or θ_{z_1→z_2} when such level of detail would unnecessarily clutter the notation.

We constrain the mean function in p_θ(z_2|z_1) to be a linear function of the following form:

f_{θ_{z_1→z_2}}(z_1) ≡ z_1 + W z_1 + b   (5)

with W and b initialized close to zero, such that f_{θ_{z_1→z_2}} starts as an identity function. We found that, together with L2 penalization, this formulation improves the stability and generalization of the model.

Approximate posterior q. Depending on the type of the data, we assume the approximate posterior q, with a set of parameters φ, to factorize as:

perturbation pairs: q_φ(z_1, z_2 | x_1, x_2) = q_φ(z_1|x_1) · q_φ(z_2|x_2)   (6)
pre-treatment singleton: q_φ(z_1, z_2, x_2 | x_1) = q_φ(z_1|x_1) · p_θ(z_2|z_1) · p_θ(x_2|z_2)   (7)

Analogously to the shared generative distribution p_θ(x_k|z_k), the encoder q_φ(z_k|x_k) is shared for both k ∈ {1, 2}. Here, instead of directly using a diagonal Gaussian as the final approximate posterior,

k ∈ {1, 2}: q_φ(z_k|x_k) = N(z_k | μ_{z_k} = f_φ(x_k), σ²_{z_k} = exp f_φ(x_k))   (8)

we apply two steps of "LSTM-type" Inverse Autoregressive Flow (IAF) (Kingma et al., 2016) updates to facilitate a richer family of approximate distributions.
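The near-identity initialization of the latent drug-effect function in Eq. (5) can be sketched as follows. This is a minimal numpy illustration, not the authors' code; the latent dimension of 100 matches the experiments, while the weight scale is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 100  # latent size used in the paper's experiments

# Perturbation mean function f(z1) = z1 + W z1 + b (Eq. 5).
# W and b start near zero so the map begins as (approximately) the
# identity; L2 penalization would keep it close during training.
W = rng.normal(scale=1e-3, size=(latent_dim, latent_dim))
b = np.zeros(latent_dim)

def perturb_mean(z1):
    """Predicted mean of the post-treatment latent state z2."""
    return z1 + z1 @ W.T + b

z1 = rng.normal(size=latent_dim)
z2_mean = perturb_mean(z1)

# Near initialization the predicted perturbation is almost a no-op.
assert np.allclose(z2_mean, z1, atol=0.1)
```

Starting from the identity means the model initially predicts "no drug effect" and only learns a deviation where the perturbation data supports it.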
A sample from q_φ(z_k|x_k) is then derived by two iterations of the following step:

z_k^(0) ∼ N(μ_{z_k} = f_φ(x_k), σ²_{z_k} = exp f_φ(x_k))   (9)
t ∈ {1, 2}: z_k^(t) = sigmoid(s^(t)) ⊙ z_k^(t−1) + (1 − sigmoid(s^(t))) ⊙ m^(t)   (10)

The coefficients [m^(t), s^(t)] of the IAF are computed by a Masked Autoencoder for Distribution Estimation (MADE) model (Germain et al., 2015):

[m^(t), s^(t)] = MADE^(t)(z_k^(t−1), h(x_k))   (11)

MADE is an autoregressive model, that is, the j-th elements of the m^(t) and s^(t) vectors depend only on up to the first j−1 elements of z_k^(t−1). Using this property, the determinant of the Jacobian of each IAF step can be computed efficiently. As each IAF step is an invertible function with a known Jacobian determinant, it is possible to efficiently compute the probability of the derived sample z_k^(t) under the complex posterior q_φ(z_k|x_k), which does not have a parametric form (Kingma et al., 2016).

Fitting θ and φ parameters. We jointly optimize the generative model parameters θ and the variational parameters φ with Stochastic Gradient Variational Bayes (SGVB) (Kingma and Welling, 2013; Rezende et al., 2014) to maximize the Evidence Lower Bound (ELBO) of the data:

Σ^{N_P} log p(x_1, x_2) + Σ^{N_S} log p(x_1) ≥ ELBO_PertVAE   (12)

ELBO_PertVAE = Σ^{N_P} L_P(x_1, x_2; θ, φ) + Σ^{N_S} L_S(x_1; θ, φ)   (13)

which is the sum of the evidence lower bounds of the N_P perturbation pairs and the lower bounds of the N_S unpaired "singleton" examples that we leverage to train the latent-space Variational Autoencoder as well. The individual per-example lower bounds L_P and L_S take the following form:

L_P(x_1, x_2; θ, φ) = E_{q_φ(z_1,z_2|x_1,x_2)}[log p_θ(x_1, x_2, z_1, z_2) − log q_φ(z_1, z_2|x_1, x_2)]   (14)
 = E_{q_φ(z_1|x_1)}[log p_θ(x_1|z_1)] − D_KL[q_φ(z_1|x_1) ‖ p(z_1)]
 + E_{q_φ(z_2|x_2)}[log p_θ(x_2|z_2)] − D_KL[q_φ(z_2|x_2) ‖ p_θ(z_2|z_1)]   (15)

L_S(x_1; θ, φ) = E_{q_φ(z_1|x_1)}[log p_θ(x_1, z_1) − log q_φ(z_1|x_1)]   (16)
 = E_{q_φ(z_1|x_1)}[log p_θ(x_1|z_1)] − D_KL[q_φ(z_1|x_1) ‖ p(z_1)]
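The "LSTM-type" IAF update of Eq. (10) can be sketched in isolation as follows. This is a minimal numpy sketch; in the model the coefficients [m, s] come from a MADE network conditioned on the previous sample and the context h(x_k), which is what makes the Jacobian triangular.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iaf_step(z_prev, m, s):
    """One 'LSTM-type' IAF update (Eq. 10): an elementwise convex
    combination of the previous sample and a shift m, gated by
    sigmoid(s). When [m, s] come from an autoregressive network
    (MADE), the Jacobian is triangular with diagonal sigmoid(s),
    so its log-determinant is sum(log sigmoid(s))."""
    gate = sigmoid(s)
    z_new = gate * z_prev + (1.0 - gate) * m
    log_det = np.sum(np.log(gate))
    return z_new, log_det

rng = np.random.default_rng(1)
z0 = rng.normal(size=5)
m, s = rng.normal(size=5), rng.normal(size=5)
z1, log_det = iaf_step(z0, m, s)
assert z1.shape == z0.shape
```

The cheap log-determinant is what lets the transformed sample's density under the flow posterior be evaluated exactly, which Eq. (14)–(16) require.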
Using SGVB it is possible to backpropagate through ELBO_PertVAE, and we use Adam (Kingma and Ba, 2014) to compute gradient updates for both the θ and φ parameters. As we use IAF to model q_φ(z_k|x_k), the Kullback–Leibler divergence D_KL cannot be computed in closed form, and we therefore use a Monte Carlo estimate. Additionally, we follow Kingma et al. (2016) and allow "free bits" in D_KL to mitigate the problem of an overly strong prior causing the optimization to get stuck in bad local optima.

Analogously to the Semi-Supervised VAE, we can extend our unsupervised Perturbation VAE to a semi-supervised model by stacking a modified "M2 model" (Kingma et al., 2014). This model can be trained jointly to model both drug-induced perturbation effects and treatment response outcomes; we therefore call it Drug Response VAE (Dr.VAE). We use the same type of data to train Dr.VAE as PertVAE; however, some of the perturbation pairs and pre-treatment singletons can now have a binary outcome label y associated with them, denoting whether the drug treatment was successful. A schema of the Dr.VAE model is shown in Figure 2.

Joint distribution.
Drug Response VAE extends Perturbation VAE to model a joint distribution p(x_1, x_2, z_1, z_2, z_3, y) factorized as:

p(x_1, x_2, z_1, z_2, z_3, y) = p(x_1|z_1) · p(x_2|z_2) · p(z_2|z_1) · p(z_1|z_3, y) · p(z_3) · p(y)   (17)

[Figure 2: Dr.VAE is a semi-supervised extension of our Perturbation VAE: (a) Factorization of the generative distribution p; (b) Factorization of the approximate posterior distribution q. In case the post-treatment gene expression x_2 is not observed, we can use the expected posterior E_{q_φ(z_1|x_1)}[p_θ(z_2|z_1)] for z_2 instead.]

Generative distributions p. The individual generative distributions Dr.VAE factorizes into have the following form:

p(y) = Cat(y)   (18)
p(z_3) = N(0, I)   (19)
p_θ(z_1|z_3, y) = N(z_1 | μ_{z_1} = f_θ(z_3, y), σ²_{z_1} = exp f_θ(z_3, y))   (20)
p_θ(z_2|z_1) = N(z_2 | μ_{z_2} = f_θ(z_1), σ²_{z_2} = exp f_θ(z_1))   (21)
k ∈ {1, 2}: p_θ(x_k|z_k) = N(x_k | μ_{x_k} = f_θ(z_k), σ²_{x_k} = exp f_θ(z_k))   (22)

In the same way as in PertVAE, we share the "data decoder" p_θ(x_k|z_k) among both k ∈ {1, 2}.

Approximate posterior q. Depending on the type of the data, we assume the approximate posterior q to factorize as:

labeled pair: q_φ(z_1, z_2, z_3 | x_1, x_2, y) = q_φ(z_1|x_1) · q_φ(z_2|x_2) · q_φ(z_3|z_1, y)   (23)
unlabeled pair: q_φ(z_1, z_2, z_3, y | x_1, x_2) = q_φ(z_1|x_1) · q_φ(z_2|x_2) · q_φ(y|z_1, z_2) · q_φ(z_3|z_1, y)   (24)
labeled singleton: q_φ(z_1, z_2, z_3, x_2 | x_1, y) = q_φ(z_1|x_1) · p_θ(z_2|z_1) · p_θ(x_2|z_2) · q_φ(z_3|z_1, y)   (25)
unlabeled singleton: q_φ(z_1, z_2, z_3, x_2, y | x_1) = q_φ(z_1|x_1) · p_θ(z_2|z_1) · p_θ(x_2|z_2) · q_φ(y|z_1, z_2) · q_φ(z_3|z_1, y)   (26)

The "data encoder" q_φ(z_k|x_k), k ∈ {1, 2}, is shared and parametrized the same way as in PertVAE.
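The generative factorization of Eq. (17) directly prescribes ancestral sampling. The sketch below illustrates that sampling order with hypothetical linear stand-ins for the neural mean functions f_θ and unit variances throughout; it is an illustration of the factorization, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy latent/data dimension for illustration

# Hypothetical linear stand-ins for the neural mean functions f_theta.
A_z3y = rng.normal(scale=0.5, size=(d, d))
A_z1 = rng.normal(scale=0.5, size=(d, d))
A_x = rng.normal(scale=0.5, size=(d, d))

def sample_generative(p_y=np.array([0.6, 0.4])):
    """Ancestral sampling following the Dr.VAE factorization (Eq. 17):
    y ~ Cat, z3 ~ N(0, I), then z1 | z3, y, then z2 | z1, and finally
    x1 | z1 and x2 | z2 through the shared decoder. All conditionals
    use unit variance here for simplicity."""
    y = rng.choice(len(p_y), p=p_y)
    z3 = rng.normal(size=d)
    z1 = rng.normal(loc=A_z3y @ z3 + y, size=d)  # label y shifts the mean
    z2 = rng.normal(loc=A_z1 @ z1, size=d)
    x1 = rng.normal(loc=A_x @ z1, size=d)
    x2 = rng.normal(loc=A_x @ z2, size=d)
    return y, z3, z1, z2, x1, x2

y, z3, z1, z2, x1, x2 = sample_generative()
```

Note how the same stand-in decoder matrix A_x produces both x1 and x2, mirroring the shared data decoder p_θ(x_k|z_k).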
The additional approximate posterior distributions then take the following form:

q_φ(y|z_1, z_2) = Cat(y | π = softmax(f_φ(z_1, z_2)))   (27)
q_φ(z_3|z_1, y) = N(z_3 | μ_{z_3} = f_φ(z_1, y), σ²_{z_3} = exp f_φ(z_1, y))   (28)

The aforementioned factorizations of the joint and of the posteriors also provide a recipe for sampling and inference in the model by Monte Carlo sampling.

Fitting θ and φ parameters. We have four different sets of partially observed variables, which correspond to four different types of data. Therefore there are four different evidence lower bounds to optimize:

N_LP labeled perturbation pairs: Σ^{N_LP} L_LP(x_1, x_2, y; θ, φ)   (29)
N_UP unlabeled perturbation pairs: Σ^{N_UP} L_UP(x_1, x_2; θ, φ)   (30)
N_LS labeled pre-treatment singletons: Σ^{N_LS} L_LS(x_1, y; θ, φ)   (31)
N_US unlabeled pre-treatment singletons: Σ^{N_US} L_US(x_1; θ, φ)   (32)

The sum of these four specific evidence lower bounds, ELBO_DrVAE, is the evidence lower bound we need to maximize. We omit the derivation of these specific lower bounds in the main manuscript, since it follows the same principles as shown above for PertVAE and as shown in the derivation of the Semi-Supervised VAE (Kingma et al., 2014; Louizos et al., 2015).

Finally, we need to explicitly introduce a loss on the predictive posterior log q_φ(y|z_1, z_2) in order for it to be trained on labeled data as well. This is required because for labeled data the random variable y is observed, and the lower bounds L_LP and L_LS are therefore conditioned on y and do not contribute to the fitting of q_φ(y|z_1, z_2). Our final objective J_DrVAE to maximize is

J_DrVAE = ELBO_DrVAE + α Σ^{N_LP} E_{q_φ(z_1,z_2|x_1,x_2)}[log q_φ(y = t|z_1, z_2)] + α Σ^{N_LS} E_{q_φ(z_1,z_2|x_1)}[log q_φ(y = t|z_1, z_2)]   (33)

In our experiments we used two main data resources: (i) high-throughput screens of cancer cell lines, including pre-treatment gene expression for all tested cell lines and drug response in terms of cell line viability, and (ii) high-throughput screens of gene expression perturbation effects induced by drugs in cancer cell lines. We tested our methods on a panel of 19 drugs for which both response and perturbation experiments are available. These 19 drugs were also used in the recent AstraZeneca-Sanger DREAM Challenge, and we therefore use them as a representative sample of anti-cancer drugs.
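The semi-supervised objective described above, an ELBO plus an α-weighted classification log-likelihood on the labeled examples (Eq. 33), can be sketched as follows. This assumes hypothetical precomputed per-example quantities rather than the actual model terms.

```python
import numpy as np

def drvae_objective(elbo_terms, clf_log_probs, alpha=1.0):
    """Sketch of the Dr.VAE training objective (Eq. 33), to be
    maximized. elbo_terms: per-example evidence lower bounds (all
    four data types pooled); clf_log_probs: log q(y = true label |
    z1, z2) for the labeled examples; alpha weights the explicit
    classification term. Both inputs are hypothetical placeholders."""
    return np.sum(elbo_terms) + alpha * np.sum(clf_log_probs)

elbo = np.array([-10.0, -12.0, -9.5])
clf = np.log(np.array([0.9, 0.7]))  # log-probabilities are <= 0
obj_small = drvae_objective(elbo, clf, alpha=0.1)
obj_big = drvae_objective(elbo, clf, alpha=10.0)
```

Since log-probabilities are non-positive, a larger α trades reconstruction quality for classification accuracy; with α = 0 the objective reduces to the plain semi-supervised ELBO.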
Large high-throughput screening efforts have been undertaken, resulting in publicly available datasets of cancer cell lines with post-treatment cell viability at various drug concentrations. In our experiments we utilize the Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al., 2013) and Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012) datasets, which we obtained using the PharmacoGx R package (Smirnov et al., 2015). As the drug response outcome we use a binarized version of the dose-response curves, computed by PharmacoGx. For consistency, we use response outcomes from GDSC, while we use all other cell lines in GDSC and CCLE not screened for response to a given drug as unlabeled cell line examples. A summary of our pooled dataset is provided in the Supplementary Materials.
The Library of Integrated Network-Based Cellular Signatures (LINCS) consortium screened the perturbation effects that drugs have on the expression of the L1000 landmark genes in cancer cell lines (Duan et al., 2014). These experiments do not measure drug treatment efficacy; however, some of these cell lines were independently tested in GDSC for drug response. We cross-reference these cell lines and assign the corresponding label to their perturbation measurements. The L1000 perturbation dataset is very sparse: for the 19 drugs, only up to 56 different cell lines were screened. In fact, only 8 drugs have been measured in over 50 cell lines; the remaining 11 have been measured in fewer than 20 cell lines, albeit at various concentrations and with many biological replicates. In our results we use measurements at the highest drug concentration and all biological replicates of such experiments. In cross-validation of our models we use cell-line-wise splitting, so that the biological replicates of a particular cell line are in the same data fold.
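The cell-line-wise splitting described above can be sketched as follows. This is a minimal hypothetical helper (not the authors' code) that assigns folds to unique cell lines so that replicates never straddle a train/test boundary.

```python
import numpy as np

def cell_line_folds(cell_line_ids, n_folds=5, seed=0):
    """Cell-line-wise splitting: all biological replicates of a cell
    line land in the same cross-validation fold. Returns a fold index
    per sample."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(cell_line_ids)))
    rng.shuffle(unique)  # randomize which lines share a fold
    fold_of_line = {cl: i % n_folds for i, cl in enumerate(unique)}
    return np.array([fold_of_line[cl] for cl in cell_line_ids])

# Toy example: replicates share a cell line identifier.
ids = ["A", "A", "B", "C", "C", "C", "D", "E"]
folds = cell_line_folds(ids, n_folds=3)
assert folds[0] == folds[1] and folds[3] == folds[4] == folds[5]
```

Splitting by cell line rather than by sample prevents the optimistic bias that arises when near-identical replicates appear in both training and test folds.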
We tested the performance of our models on three tasks: (i) drug response prediction, (ii) drug perturbation prediction, and (iii) gene expression reconstruction from the latent representation.
Architecture.
All evaluated Variational Autoencoder-based models, both our proposed models (Dr.VAE, PertVAE) and the published models we used for comparison (VAE and Semi-Supervised VAE (SSVAE)), use 100 stochastic latent units, i.e. all z are stochastic vectors of length 100, and have the same architecture for the "data encoder" q_φ(z_k|x_k) and "data decoder" p_θ(x_k|z_k). The encoder has 903 input units corresponding to the 903 landmark genes (we exclude the genes that we could not uniquely map between data sets). The encoder has hidden layers with 500 and 300 units, from which the parameters μ_{z_k} and σ_{z_k} of the initial Gaussian distribution are computed, together with 200 hidden units on which the subsequent Inverse Autoregressive Flow is conditioned. We use 2 steps of IAF, each with one hidden layer of 300 units. The architecture of the data decoder mirrors that of the data encoder, but without IAF. Throughout all our models we use the ELU activation function (Clevert et al., 2015) and Weight Normalization (Salimans and Kingma, 2016). We preserve these parts of the architecture among the different models to help with further analysis of what contributes to the observed performance.

For both Dr.VAE and SSVAE, the classification posteriors q_φ(y|z_1, z_2) and q_φ(y|z_1), respectively, are linear functions with a soft-max activation over two output units. In our implementation, we use a slight modification for Dr.VAE, for which we found that using [z_1, z_2 − z_1] instead of [z_1, z_2] as the classifier input improves performance. Further, the distributions p_θ(z_1|z_3, y) and q_φ(z_3|z_1, y) (and their equivalents in SSVAE) are parametrized by a neural network with a single hidden layer of 100 units.

All our presented experiments are evaluated in 10-times randomized 5-fold cross-validation, and we report the average metric across these 50 data splits. The models were fitted independently for each of the 19 drugs, but with the same hyperparameters.
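The encoder dimensions above can be summarized in a shape-only numpy sketch. Weights are random placeholders and Weight Normalization is omitted, so this only illustrates the data flow, not the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    """ELU activation (Clevert et al., 2015)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# Shape-only sketch of the shared "data encoder": 903 landmark genes
# -> 500 -> 300 hidden units, from which the initial Gaussian
# parameters (mu, log-variance; 100-dim each) are computed, plus a
# 200-unit context h(x) that conditions the subsequent IAF steps.
W1 = rng.normal(scale=0.01, size=(903, 500))
W2 = rng.normal(scale=0.01, size=(500, 300))
W_mu = rng.normal(scale=0.01, size=(300, 100))
W_lv = rng.normal(scale=0.01, size=(300, 100))
W_h = rng.normal(scale=0.01, size=(300, 200))

def encode(x):
    h = elu(elu(x @ W1) @ W2)
    return h @ W_mu, h @ W_lv, h @ W_h

x = rng.normal(size=903)
mu, log_var, context = encode(x)
assert mu.shape == (100,) and log_var.shape == (100,)
```

The decoder mirrors this shape in reverse (100 → 300 → 500 → 903) without the IAF context head.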
We compare our models to L2-regularized (ridge) logistic regression (LR), random forest (RF), and a support vector machine with a linear kernel (SVM), following Jang et al. (2014), who found ridge LR to be the best overall classifier for drug response in the GDSC dataset. To assess the informativeness of drug-induced perturbations for the drug response prediction task, we also compare Dr.VAE to a Semi-Supervised VAE (Kingma et al., 2014). SSVAE is trained on the same data, but without the ability to model the drug effect, as the perturbation pairs are simply presented as independent unlabeled singletons.

On average, Dr.VAE is the best performing of all tested models, improving over ridge logistic regression, considered state-of-the-art in the field, on 18 of the 19 drugs. Over all 19 drugs, the average improvement in AUPR is 8.95% for Dr.VAE compared to 8.07% for SSVAE. The only drug where both Dr.VAE and SSVAE performed worse than ridge LR is paclitaxel. This is a chemotherapy drug (no specific gene target) with a much smaller sample size, so it appears that the simpler model has an advantage over all other models in this case.

Dr.VAE and SSVAE learn the latent representation of the data and the classifier jointly. To understand the importance of learning a good latent embedding, we also explored a learning paradigm in which we first optimize the latent representation in an unsupervised fashion and then train the classifier using the already learnt embedding. To this end we trained an unsupervised PertVAE on all perturbation pairs and afterwards fitted ridge LR classifiers, one using PertVAE's latent representation of pre-treatment gene expression (PertVAE+LR on Z1) and another using the latent representation of the predicted post-treatment state (PertVAE+LR on Z2).
Additionally, we compared these results to the baseline models trained on a principal component analysis (PCA) projection of the dataset onto the first 100 principal components. The third best average performance over all models was achieved by LR trained on the latent embedding of pre-treatment gene expression learned by the PertVAE model. The improvement is only 2.58% over ridge LR on average, and it does not beat SSVAE or Dr.VAE on any of the 19 drugs. Ridge LR performs better on PertVAE latent representations than on both the observed gene expression and the PCA representation with the same number of hidden units (principal components) as PertVAE. As the evaluation metric we use the area under the precision-recall curve (AUPR); the area under the ROC curve is reported in the Supplementary Materials. The performance of all models is presented in Table 1.

Table 1: Cross-validated test AUPR (area under the PR curve) of our Dr.VAE compared to SSVAE and other classification models. Methods involving PCA and PertVAE are 2-step methods: (i) fit the unsupervised model, (ii) use its latent representation to fit a standard classifier. The performance comparison is presented as the relative change (Δ) with respect to the ridge LR classifier trained on pre-treatment gene expression.
drug          Ridge LR  ΔRF       ΔSVM      ΔPCA RF   ΔPCA Ridge LR  ΔPCA SVM  ΔPertVAE+LR on Z1  ΔPertVAE+LR on Z2  ΔSSVAE   ΔDrVAE
afatinib      0.527     -1.52%    2.28%     -15.94%   -4.74%         -8.54%    2.09%              -0.76%             13.47%   14.42%
bortezomib    0.731     -2.74%    -0.68%    -5.88%    -1.78%         -3.28%    1.50%              1.50%              4.79%    5.61%
docetaxel     0.772     -4.79%    -0.91%    -7.12%    -4.02%         -5.70%    -1.94%             -3.37%             1.42%    2.33%
doxorubicin   0.644     3.73%     -1.71%    2.02%     2.80%          -0.47%    4.97%              5.12%              7.92%    8.39%
etoposide     0.654     -2.29%    -2.29%    -4.13%    -0.31%         -1.07%    1.07%              0.92%              2.45%    3.67%
GDC-0941      0.641     -3.43%    -1.56%    -2.81%    1.72%          -0.16%    1.72%              1.87%              2.03%    1.40%
gefitinib     0.285     1.05%     9.47%     -7.72%    4.21%          -0.35%    10.53%             11.23%             30.88%   32.63%
linsitinib    0.447     -5.15%    -0.89%    -10.51%   -1.12%         -2.68%    3.58%              0.45%              6.04%    5.59%
navitoclax    0.625     2.08%     -5.76%    -8.00%    -3.20%         -5.12%    2.56%              1.12%              10.08%   10.24%
NU-7441       0.306     -3.59%    1.63%     -0.65%    5.56%          5.56%     4.90%              5.56%              10.46%   11.44%
olaparib      0.197     3.55%     6.60%     0.00%     5.08%          5.08%     22.84%             20.30%             30.46%   32.99%
paclitaxel    0.712     -9.83%    0.70%     -8.85%    -2.67%         -4.49%    -7.72%             -9.27%             -3.79%   -3.37%
saracatinib   0.333     -13.51%   -12.91%   -21.92%   -6.31%         -10.21%   -9.91%             -11.11%            4.80%    10.51%
selumetinib   0.480     3.75%     -1.04%    -10.42%   5.83%          4.17%     6.88%              6.67%              16.04%   16.04%
SN-38         0.733     -3.14%    -0.95%    -4.91%    -1.91%         -2.05%    0.00%              -0.41%             2.73%    3.27%
temsirolimus  0.721     -2.50%    -1.80%    -5.83%    1.66%          0.69%     0.28%              -0.14%             4.72%    4.72%
tipifarnib    0.697     -3.59%    -1.29%    -5.16%    -0.86%         0.14%     2.30%              1.58%              1.15%    1.87%
vinorelbine   0.681     -4.41%    -1.03%    -5.73%    -0.29%         -1.76%    0.88%              0.44%              2.35%    2.64%
vorinostat    0.795     0.63%     0.25%     -4.65%    1.26%          -0.13%    2.52%              2.52%              5.28%    5.66%
average       0.578     -2.40%    -0.63%    -6.75%    0.05%          -1.60%    2.58%              1.80%              8.07%    8.95%
The Variational Autoencoder is an expressive non-linear model, while PCA has the best reconstruction loss among linear models. To evaluate how well a VAE with our architecture can model gene expression, we fitted VAEs with various numbers of stochastic latent variables and compared their reconstructions to those of a PCA with the equivalent number of principal components. As the measure of reconstruction quality we used Spearman's ρ between the reconstruction mean and the original gene expression. We plot the results in Figure 3. The Variational Autoencoder with our encoder/decoder architecture, as used in Dr.VAE and PertVAE, does better for small latent spaces, after which it appears to overfit compared to PCA. We chose this architecture with 100 stochastic units as the default for all our models. We expect our models to then have enough expressive power and capacity not just to model gene expression, but also to find a latent space that is informative for drug response and in which the drug effect can be modeled as a stochastic linear function.

We trained a PertVAE for each drug independently to see how well we can predict drug perturbation effects. That is, we optimized ELBO_PertVAE and stopped training when the perturbation prediction loss started to increase on the validation set.

[Figure 3: PCA and VAE reconstruction quality comparison for varying latent space size (train and test Spearman's ρ between reconstructed and original data, for latent sizes from 5 to 140).]

Table 2: Perturbation VAE prediction results with latent space size 100.

drug          unique CL  ρ_rec,pert  ρ_pred,pert  p-value
olaparib      56         0.529       0.517        1
selumetinib   56         0.457       0.466
vorinostat    56         0.475       0.584
bortezomib    51         0.444       0.508
navitoclax    51         0.505       0.485        1
SN-38         51         0.433       0.509
temsirolimus  51         0.488       0.504
tipifarnib    51         0.538       0.536        0.713
GDC-0941      19         0.488       0.494        0.361
gefitinib     19         0.545       0.541        0.795
NU-7441       19         0.502       0.502        0.548
saracatinib   19         0.517       0.514        0.682
vinorelbine   14         0.51        0.504        0.659
docetaxel     13         0.524       0.509        0.981
paclitaxel    13         0.465       0.443        1
afatinib      12         0.481       0.472        0.562
etoposide     12         0.49        0.481        0.745
doxorubicin   8          0.254       0.311
linsitinib    6          0.502       0.5          0.628

To evaluate the prediction performance we computed Spearman's correlation ρ_pred,pert between the mean of the predicted gene expression distribution E_{q_φ(z_1,z_2|x_1)}[p_θ(x_2|z_2)] and the observed post-treatment gene expression in the test set. We compare this correlation to the correlation ρ_rec,pert between the mean of the pre-treatment reconstruction distribution E_{q_φ(z_1|x_1)}[p_θ(x_1|z_1)] and the true post-treatment gene expression. This is done to assess whether the drug perturbation function is in fact learning anything beyond reconstructing pre-treatment gene expression. Note that in the training step the "drug effect" mean function is initialized close to identity. If PertVAE were to either underfit or overfit on the training set, we would expect ρ_pred,pert to be no larger than ρ_rec,pert.
Therefore we calculate a one-sided Mann-Whitney test with the alternative hypothesis H₁: ρ_rec,pert < ρ_pred,pert on the results of our 10-times randomized 5-fold CV. The average correlation values and p-values of the statistical test are shown in Table 2, indicating that PertVAE can at least partially predict drug perturbations for 5 out of the 8 drugs for which the data set contains perturbation experiments with at least 51 unique cell lines.

In our explorations of optimizing the latent space, we found that doing well on the reconstruction task does not directly lead to improved classification performance. The ability of Dr.VAE to model drug-induced perturbation effects on gene expression leads to limited improvement in classification performance. Compared to a fitted PertVAE, a Dr.VAE model trained predominantly for classification does not learn to predict perturbation effects along the way. However, it is the best performing classification model. This set of observations compels us to conclude that the latent space serves a different role than simply compressing observed gene expression. Given a very small set of samples, very heterogeneous and noisy input, and likely noisy output, the goal of the latent space is to capture the essence of the observed gene expression that is most useful, and likely biased, for prediction. The original goal of our work was to create a rich paradigm in which much of the available data can be incorporated to boost the predictive performance of drug response. We did achieve an improvement in predicting drug response in the flexible and powerful VAE framework, which we believe is the way to model such data in the future.
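The rank correlation underlying the ρ_rec,pert vs. ρ_pred,pert comparison can be computed as the Pearson correlation of ranks. Below is a minimal numpy sketch (assuming no ties; the Mann-Whitney test itself is not reproduced here).

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the rank
    vectors. Double argsort turns values into 0-based ranks; this
    simple version assumes no tied values."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

x = np.array([0.1, 0.4, 0.2, 0.9, 0.7])
# Any strictly monotone transform preserves ranks exactly.
assert np.isclose(spearman_rho(x, 2.0 * x + 1.0), 1.0)
assert np.isclose(spearman_rho(x, -x), -1.0)
```

Using a rank-based metric makes the reconstruction comparison insensitive to monotone differences in expression scale between the reconstruction mean and the measured values.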
References
Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V., Sonkin, D., et al. (2012). The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603–607.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.

Duan, Q., Flynn, C., Niepel, M., Hafner, M., Muhlich, J. L., Fernandez, N. F., Rouillard, A. D., Tan, C. M., Chen, E. Y., Golub, T. R., et al. (2014). LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Research, page gku476.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). MADE: Masked autoencoder for distribution estimation. In ICML, pages 881–889.

Jang, I. S., Neto, E. C., Guinney, J., Friend, S. H., and Margolin, A. A. (2014). Systematic assessment of analytical methods for drug sensitivity prediction from cancer cell line data. In Pac. Symp. Biocomput., volume 19, pages 63–74. World Scientific.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

Kingma, D. P., Salimans, T., and Welling, M. (2016). Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Louizos, C., Swersky, K., Li, Y., Welling, M., and Zemel, R. (2015). The variational fair autoencoder. arXiv preprint arXiv:1511.00830.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909.

Smirnov, P., Safikhani, Z., El-Hachem, N., Wang, D., She, A., Olsen, C., Freeman, M., Selby, H., Gendoo, D. M., Grossman, P., et al. (2015). PharmacoGx: an R package for analysis of large pharmacogenomic datasets. Bioinformatics, page btv723.

Yang, W., Soares, J., Greninger, P., Edelman, E. J., Lightfoot, H., Forbes, S., Bindal, N., Beare, D., Smith, J. A., Thompson, I. R., et al. (2013). Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(D1):D955–D961.

Data sets summary
Table 3: Dataset summary. (a) Number of labeled and unlabeled pre-treatment singletons and perturbation pairs. In the L1000 perturbation columns, the total number of experiments including biological replicates is shown, with the number of unique cell lines in parentheses. (b) Split of the response-labeled samples into negative and positive classes.

(a)
drug          CCLE ∪ GDSC labeled CL  CCLE ∪ GDSC unlabeled CL  L1000 labeled (unique CL)  L1000 unlabeled (unique CL)
afatinib      647   636   17 (3)   42 (9)
bortezomib    501   782   2 (1)    167 (50)
docetaxel     671   612   36 (3)   76 (10)
doxorubicin   646   637   10 (2)   42 (6)
etoposide     653   630   26 (2)   121 (10)
GDC-0941      636   647   41 (3)   134 (16)
gefitinib     647   636   52 (3)   138 (16)
linsitinib    646   637   5 (1)    28 (5)
navitoclax    645   638   42 (8)   127 (43)
NU-7441       643   640   24 (3)   88 (16)
olaparib      647   636   44 (8)   155 (48)
paclitaxel    338   945   0 (0)    83 (13)
saracatinib   341   942   0 (0)    171 (19)
selumetinib   617   666   41 (8)   165 (48)
SN-38         647   636   36 (8)   128 (43)
temsirolimus  647   636   44 (8)   129 (43)
tipifarnib    646   637   37 (7)   137 (44)
vinorelbine   653   630   31 (2)   139 (12)
vorinostat    647   636   44 (8)   182 (48)

(b)
drug          non-responders  responders
afatinib      483   164
bortezomib    211   290
docetaxel     262   409
doxorubicin   248   398
etoposide     276   377
GDC-0941      255   381
gefitinib     523   124
linsitinib    409   237
navitoclax    426   219
NU-7441       465   178
olaparib      542   105
paclitaxel    136   202
saracatinib   266   75
selumetinib   395   222
SN-38         234   413
temsirolimus  269   378
tipifarnib    246   400
vinorelbine   261   392
vorinostat    220   427

AUROC Results