Combining human cell line transcriptome analysis and Bayesian inference to build trustworthy machine learning models for prediction of animal toxicity in drug development
Laura-Jayne Gardiner
IBM Research, The Hartree Centre, [email protected]

Anna Paola Carrieri
IBM Research, The Hartree Centre, [email protected]

Jenny Wilshaw
IBM Research, The Hartree Centre, [email protected]

Stephen Checkley
STFC Daresbury Lab, Warrington, [email protected]

Edward O. Pyzer-Knapp
IBM Research, The Hartree Centre, [email protected]

Ritesh Krishna
IBM Research, The Hartree Centre, [email protected] ∗

Abstract
Biomedical data, particularly in the field of genomics, has characteristics which make it challenging for machine learning applications – it can be sparse, high dimensional and noisy. Biomedical applications also present challenges to model selection – whilst powerful, accurate predictions are necessary, they alone are not sufficient for a model to be deemed useful. Due to the nature of the predictions, a model must also be trustworthy and transparent, empowering a practitioner with confidence that its use is appropriate and reliable. In this paper, we propose that this can be achieved through the use of judiciously built feature sets coupled with Bayesian models, specifically Gaussian processes. We apply Gaussian processes to drug discovery, using inexpensive transcriptomic profiles from human cell lines to predict animal kidney and liver toxicity after treatment with specific chemical compounds. This approach has the potential to reduce invasive and expensive animal testing during clinical trials if in vitro human cell line analysis can accurately predict model animal phenotypes. We compare results across a range of feature sets and models, to highlight model importance for medical applications.
To progress a pharmaceutical drug candidate to the human trial phase, regulations require preclinical trials, or animal tests, to ascertain compound safety. This is balanced against a desire to reduce animal testing of compounds due to expense and ethical implications. However, although compound toxicity in animals can translate to human toxicity, this translation shows considerable variability that we need to understand [1, 2, 3]. Machine learning (ML) offers a potential path to derive insight on this complex task. To enable this, a large amount of effort has been invested in cost-effective techniques to generate datasets from which the likely effect that a compound will have after administration, particularly with respect to toxicity or adverse events, can be predicted. One such approach is L1000, a high-dimensional gene expression profiling method, which is fast emerging as a cheap alternative to produce large amounts of experimental observations suitable for ML-driven drug discovery [4].

∗ Corresponding Author. Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract.

We use L1000 profiles generated for a variety of human cell lines under different perturbation conditions for a large number of chemical compounds. We investigate the use of L1000 profiles to generate predictive models and the validity of their application on rats – a model drug testing species. We apply ML to L1000 profiles generated after chemical compound treatment, to predict rat kidney dysfunction after treatment with the same compound. By including feature information about the chemical compounds used to treat human cells and animals during preclinical trials, we translate these inexpensive and non-invasive human cell line tests to preclinical phenotypic responses in model species such as rats. L1000 gene expression profiles and chemical structures were collated by [5], including 964 “Landmark” genes following small molecule treatment.
L1000 gene expression information is available as -1, 0 and 1 values to represent down-regulation, no change and up-regulation of gene expression. Chemical structure of the compounds was in the form of a 166-bit MACCS chemical fingerprint for small molecule compounds, computed using Open Babel [6]. The target for our ML approach is the prediction, in rats, of the amount of blood urea nitrogen (BUN). Elevated BUN tests are routinely used as an indicator that the kidneys or liver may be damaged or dysfunctional. In rats, elevated levels for BUN of 28-136 mg/dl were consistent with kidney dysfunction [7, 8]. The datasets used here were collected from DrugMatrix and detail the rat blood concentration of BUN after compound administration. BUN levels range from 7.75-39.60 mg/dl, encompassing a wide range from known safe levels to high levels that exceed health indicator limits. We matched the induced L1000 gene expression signatures with rat blood test results where the same compound had been used. This allowed us to match L1000 profiles with 429 rat BUN measurements. These observations all represent different chemical compound treatments for the training set. We predict BUN levels using either human gene expression data or compound chemical structure information as features. Then, we combine both feature sets (1,130 features) and show substantial prediction improvement in line with previous work [5]. We use dimensionality reduction to tackle problems associated with sparse and high dimensionality datasets. We highlight the benefit of using Gaussian processes to understand the transparency and trustworthiness of a prediction, which is of high importance in healthcare. We integrate heterogeneous public experimental datasets and, for the first time, we translate predictive capability related to renal toxicity or BUN level from in vitro human cell line tests to in vivo blood tests from a model animal species.
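The feature construction described above can be sketched as follows. This is a minimal illustration only: the array contents are synthetic stand-ins, and the real data would come from the LINCS L1000 release and the MACCS fingerprints computed with Open Babel.

```python
import numpy as np

# Illustrative shapes only; real data comes from L1000 and Open Babel.
n_compounds = 5
rng = np.random.default_rng(0)

# 964 landmark genes per compound, encoded as -1 (down-regulated),
# 0 (no change) or 1 (up-regulated).
expression = rng.integers(-1, 2, size=(n_compounds, 964))

# 166-bit MACCS fingerprint per compound (binary presence/absence keys).
maccs = rng.integers(0, 2, size=(n_compounds, 166))

# Concatenating the two sub-domains per compound yields the combined
# 1,130-dimensional feature set used in the paper.
features = np.hstack([expression, maccs])
assert features.shape == (n_compounds, 1130)
```

Each of the 429 rows of the real feature matrix corresponds to one compound treatment, matched to a rat BUN measurement as the regression target.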
Our ultimate aim is to reduce expensive and invasive animal testing with the ability to predict clinical response using human cell line gene expression profiles and mathematical modelling, supporting the pharmaceutical industry's commitment to the 3Rs (Replacement, Reduction, Refinement) in drug development [9].

A richer feature set, coupled with an appropriate model, should improve the predictive power of the approach. As such, we augmented L1000 features with chemical structure information. To test this, we applied Gaussian Process (GP) regression [10, 11] to predict BUN level using, firstly, only the L1000 gene expression profiles as the training dataset (964 features); secondly, only chemical structure information as the training dataset (166 features); and finally, both information resources, gene expression and chemical structure, combined into a single training dataset (1,130 features). For all three analyses we used a kernel which was the sum of a simple RBF kernel and a WhiteKernel. Hyperparameters were optimized using descent on the marginal log-likelihood of the model. Due to the large dimensionality of the data, a single length scale was used over all features.

Figure 1 and Table 1 show that the combination of gene expression and chemical structure information decreases test set RMSE scores to 2.961 from the average of 3.708 seen for chemical structure and gene expression separately. There is also a marked increase in test set r2 scores after combination of the features, from 0.021 on average to 0.418. Figure 1 shows the true versus predicted values for the test dataset, where correlation coefficients increase from 0.11 and 0.17 for gene expression and chemical structure, to 0.65 for combined features.
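The GP setup described above (a sum of an RBF kernel and a WhiteKernel, with hyperparameters optimized against the marginal log-likelihood) can be sketched with scikit-learn, which the paper cites [10]. The data here is synthetic and the shapes are reduced for illustration; this is not the authors' exact configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))                        # stand-in for the 1,130-d features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)   # synthetic BUN-like target

# Sum of an RBF kernel (one shared length scale over all features) and a
# WhiteKernel that absorbs non-correlated measurement noise.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

# Fitting maximizes the log marginal likelihood over the kernel hyperparameters.
gp.fit(X, y)

# Each prediction comes with a per-point standard deviation - the
# predictive uncertainty that the paper argues is essential for trust.
mean, std = gp.predict(X, return_std=True)
```

The `return_std=True` output is what later allows every prediction to be paired with a confidence, and out-of-set points to carry a visible "warning" via a large standard deviation.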
In addition, Figure 1 shows that confidence in the predictions increases after feature combination, with the standard deviation of the predictive distribution for each of the test datapoints decreasing on average from 3.283 for both gene expression and chemical structure to 0.683 for the combined features. The combination of gene expression and chemical structure information increases our predictive capability for BUN level. This analysis also highlights the benefit of using a Gaussian Process, with the ability to directly capture the uncertainty of the predictions. We observe a dramatic decrease (4.8-fold) in the uncertainty level, or the standard deviation of the predictive distribution of the predictions, after feature combination, while the difference in RMSE scores was not so marked (1.3-fold) and could mask a poor model (Table 1).
Figure 1: True (x-axis) vs predicted (y-axis) test values for BUN level prediction, using as training data: (a) L1000 gene expression, (b) chemical structure information and (c) both; and using gene expression and chemical structure training data after (d) PCA, (e) tSVD and (f) tSVD h. Plots show marginal histograms with regression and kernel density fits. The 95% confidence interval for regression is drawn using translucent bands around the regression line. Point colour is based on the standard deviation of its underlying predictive distribution; note that point colour scales vary between plots.
Table 1: Prediction of BUN levels. Columns compare a GP trained on gene expression alone, a GP trained on chemical structure alone, and models trained on the combined feature set (GP, GP & PCA, GP & tSVD, GP & tSVD h., and LightGBM); rows report train MAE, test MAE, test RMSE, test r2 and average predictive SD. Best results from comparison of combining chemical and gene expression data, using dimensionality reduction and finally comparing the best alternative classifier from 8 tested, defined as producing the highest test prediction accuracy (lowest MAE with highest r2 score). The best model is highlighted in bold. *Weighted RMSE.
One of the main challenges for this data was a greater number of features than datapoints (1,130 vs 429), as traditional ML methods could overfit. To address this, firstly, we use a non-parametric Bayesian model known as a Gaussian process (GP) that uses a covariance matrix, rather than the input features themselves, so its predictive power scales with datapoint number rather than dimensionality. We found a kernel comprising the sum of the common Radial Basis Function (RBF) kernel, with automatic relevance determination (ARD), and a WhiteKernel to best model the data. Secondly, we used dimensionality reduction methods that are useful in building ML models from genomics datasets [12, 13, 14], implementing a truncated singular value decomposition (t-SVD) approach [15, 16]. We implemented a hierarchical approach to t-SVD (tSVD h.), where the two parts of the feature space (L1000 profiles and chemical descriptors) were considered separately and then recombined. Preserving 95% of feature information per sub-domain resulted in a 57-dimensional vector, with 17 dimensions representing chemical features and 40 dimensions representing L1000 profiles. We tested tSVD h. against a straight dimensionality reduction to 57 dimensions using t-SVD and the commonly used PCA method (Figure 1, Table 1). Our tSVD h. approach gave the strongest results, with a strong correlation coefficient of 0.81 and r2 of 0.661 exceeding any other from this dataset, and the lowest weighted RMSE (1.528). All methods outperform a GP with no dimensionality reduction.
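The hierarchical t-SVD step can be sketched as follows: each sub-domain is reduced separately until roughly 95% of its variance is preserved, and the reduced blocks are then recombined. The `reduce_95` helper and all array shapes are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def reduce_95(block, max_components):
    """Reduce one feature sub-domain, keeping ~95% of its variance.

    Hypothetical helper: fits a TruncatedSVD with an upper bound on
    components, then truncates to the smallest number of leading
    components whose cumulative explained variance ratio reaches 0.95.
    """
    svd = TruncatedSVD(n_components=max_components, random_state=0)
    reduced = svd.fit_transform(block)
    cum = np.cumsum(svd.explained_variance_ratio_)
    k = int(np.searchsorted(cum, 0.95)) + 1
    k = min(k, reduced.shape[1])
    return reduced[:, :k], svd

rng = np.random.default_rng(2)
expr = rng.normal(size=(100, 60))   # stand-in for the L1000 sub-domain
chem = rng.normal(size=(100, 20))   # stand-in for the chemical sub-domain

# Reduce each sub-domain independently (the "hierarchical" part) ...
expr_red, _ = reduce_95(expr, 50)
chem_red, _ = reduce_95(chem, 15)

# ... then recombine into a single low-dimensional training matrix,
# analogous to the paper's 40 + 17 = 57 retained dimensions.
combined = np.hstack([expr_red, chem_red])
```

Reducing each sub-domain separately stops the larger gene expression block from dominating the decomposition and washing out structure in the much smaller chemical fingerprint block.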
In healthcare, one of the barriers to the uptake of ML is a perceived lack of trust. We believe that a lack of trust originates mainly from poor prediction on out-of-set problems and a lack of transparency/interpretability. To address poor out-of-set modelling, we propose building a rich description of model uncertainty and returning both a prediction and a confidence on new data. Thus, an out-of-set prediction will come with a ‘warning’ about possible performance issues. To demonstrate the importance of predictive uncertainties, we fitted commonly used deterministic methods to the data – Linear Regression, Random Forest, Support Vector Regressor, XGBoost, Gradient Boosting, K Nearest Neighbors and LightGBM. 80% of the data was used for training and the remaining 20%, with no overlap with the training set, was held out for testing. 10-fold cross validation was performed on the training data (ShuffleSplit) and hyperparameters were tuned using a grid search. We compared results to our best GP model, using the best model for each method. The GP performed favourably to the other algorithms, with its closest rival model generated with LightGBM (Table 1). However, the inclusion of uncertainty measurements with the GP allows calculation of a weighted RMSE. The low uncertainty rate (0.470) from the GP results in the lowest observed RMSE (1.528) across all of the models after weighting, and the highest r2 score of 0.661 (Table 1). Finally, we highlight a specific sample in the top right corner of Figure 1f with a BUN level of 37.89 mg/dl, consistent with potentially severe kidney dysfunction. The LightGBM model predicts the BUN level of this sample as 34.61 mg/dl, decreasing it by 3.3 mg/dl. The GP predicts 37.23 mg/dl, representing a 5-fold prediction improvement. Furthermore, the GP's low uncertainty estimate of 0.661 confirms the close proximity of the prediction to the true value.
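The paper reports a weighted RMSE but does not give its exact formulation. One plausible choice, shown here purely to illustrate how per-point GP uncertainties can enter the metric, is to weight each squared error by the inverse predictive variance so that confident predictions count for more. All values below are synthetic.

```python
import numpy as np

def weighted_rmse(y_true, y_pred, std):
    """RMSE with each squared error weighted by inverse predictive variance.

    Assumed formulation - the paper does not specify its weighting scheme.
    Points the GP is confident about (small std) dominate the score;
    uncertain points are down-weighted rather than silently trusted.
    """
    weights = 1.0 / np.square(std)
    return float(np.sqrt(np.average(np.square(y_true - y_pred), weights=weights)))

# Synthetic BUN-like values (mg/dl) with per-point GP standard deviations.
y_true = np.array([30.0, 12.0, 37.9])
y_pred = np.array([29.0, 13.5, 37.2])
std = np.array([0.5, 3.0, 0.7])

score = weighted_rmse(y_true, y_pred, std)
```

Under this scheme a model that is both accurate and well calibrated (confident where it is right, uncertain where it is wrong) scores better than one with the same raw errors but no usable uncertainty, which matches the paper's argument for preferring the GP over LightGBM.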
Here, the absence of this information for LightGBM could result in severe kidney dysfunction, which may need immediate intervention, being mis-diagnosed as a more moderate condition.

We can improve trust in a model by generating insight into the contributing factors to a prediction. We exploit the fact that the length scales of our GP are fitted per feature and can give insight into the impact specific features have on our model – the shorter the length scale, the higher the impact. However, since we built our features using t-SVD, we use an inverse projection to relate the reduced feature space to the original features, focusing on the dimension with the shortest length scale in our best GP model based on gene expression. The top three genes that represent this reduced feature space have relevant roles and include: CCL2, which is linked to renal damage [17]; FOSL1, which regulates cell survival; and GLRX, which is expressed in the kidney and contributes to the defense system. This analysis highlights important genes that are closely associated with the biological purpose of the model. Using length scale in this way therefore improves the transparency and trustworthiness of the GP model. This analysis outperformed the ExtraTreesRegressor approach that we trialed for the same purpose; that method does not consider the model itself, and therefore it defined genes that were less biologically relevant, e.g. TUBB6 (tubulin beta 6 class V) and DAXX (Death Domain Associated Protein).
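The inverse projection step can be sketched as follows: the ARD dimension with the shortest length scale is selected, and the loadings of the corresponding SVD component are used to rank the original features. The length scale values and matrix sizes below are hypothetical stand-ins, not the fitted values from the paper.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 30))      # stand-in for the 964 L1000 gene features
svd = TruncatedSVD(n_components=5, random_state=0)
svd.fit(X)

# Hypothetical ARD length scales for the 5 reduced dimensions:
# a shorter length scale means a higher impact on the GP prediction.
length_scales = np.array([0.4, 2.0, 5.0, 1.2, 8.0])
most_relevant_dim = int(np.argmin(length_scales))

# Inverse projection: each SVD component is a linear combination of the
# original features, so large |loading| flags an influential gene.
loadings = np.abs(svd.components_[most_relevant_dim])
top_genes = np.argsort(loadings)[::-1][:3]
```

In the paper's analysis, ranking the original genes this way for the most relevant reduced dimension surfaced CCL2, FOSL1 and GLRX.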
For ML in healthcare, there are challenges emanating from the data (sparsity, low volume, lack of structure) or the user (requirements for transparency and trust). We use a GP to confidently predict BUN level in rats after compound treatment using gene expression profiling of human cell lines. We demonstrate transferability of information between human and animal subjects in the effort to reduce the need for animal testing and eliminate toxic compounds earlier in the evaluation process to reduce development costs. We use t-SVD to reduce the high dimensional input data, which displays superior predictive power when used with a GP regressor. This reduced dimensionality allows us to improve the quality of the predictive uncertainties, through the addition of feature-wise length scales to the kernel. We also fit a noise kernel to the data to account for some non-correlated noise in the measurements of toxicity. The predictive uncertainties recovered from the GP improve the transparency and trustworthiness of the model, allowing the end users to understand when it can and cannot be used. Finally, we further improve transparency by providing insight into the feature-wise contribution to the prediction – a starting point for explainable AI for medical applications.

Acknowledgments
This work was supported by the STFC Hartree Centre’s Innovation Return on Research programme,funded by the Department for Business, Energy & Industrial Strategy.
References

[1] Bailey, J., Thew, M. & Balls, M. (2014) An Analysis of the Use of Animal Models in Predicting Human Toxicology and Drug Safety. Alternatives to Laboratory Animals (3):181–199.
[2] Bailey, J., Thew, M. & Balls, M. (2013) An Analysis of the Use of Dogs in Predicting Human Toxicology and Drug Safety. Alternatives to Laboratory Animals (5):335–350.
[3] Poussin, C., Mathis, C., Alexopoulos, L., Messinis, D., Dulize, R. et al. (2014) The species translation challenge - A systems biology perspective on human and rat bronchial epithelial cells. Scientific Data, Article number: 140009.
[4] Subramanian, A., Narayan, R., Corsello, S.M., Peck, D.D., Natoli, T.E. et al. (2017) A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell (6):1437–1452.
[5] Wang, Z., Clark, N.R. & Ma’ayan, A. (2016) Drug-induced adverse events prediction with the LINCS L1000 data. Bioinformatics (15):2338–2345.
[6] O’Boyle, N.M., Banck, M., James, C.A., Morley, C., Vandermeersch, T. & Hutchison, G.R. (2011) Open Babel: An open chemical toolbox. J. Cheminformatics (3):33.
[7] Stender, R.N., Engler, W.J., Braun, T.M. & Hankenson, F.C. (2007) Establishment of blood analyte intervals for laboratory mice and rats by use of a portable clinical analyzer. J Am Assoc Lab Anim Sci (3):47–52.
[8] Yan, S.L., Lin, P. & Hsiao, M. (1999) Separation of Urea, Uric Acid, Creatine, and Creatinine by Micellar Electrokinetic Capillary Chromatography with Sodium Cholate. Journal of Chromatographic Science (2):45–50.
[9] Parker, R.M. & Browne, W.J. (2014) The Place of Experimental Design and Statistics in the 3Rs. ILAR Journal (3):477–485.
[10] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B. et al. (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research:2825–2830.
[11] Rasmussen, C.E. & Williams, C. (2006) Gaussian Processes for Machine Learning. MIT Press.
[12] Trapnell, C. (2015) Defining cell types and states with single-cell genomics. Genome Research:1491–1498.
[13] Luecken, M.D. & Theis, F.J. (2019) Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Systems Biology (6):e8746.
[14] Rowe, W.P.M., Carrieri, A.P., Alcon-Giner, C., Caim, S., Shaw, K. et al. (2019) Streaming histogram sketching for rapid microbiome analytics. Microbiome (1):40.
[15] Hansen, P.C. (1987) The truncated SVD as a method for regularization. BIT (4):534–553.
[16] Eckart, C. & Young, G. (1936) The approximation of one matrix by another of lower rank. Psychometrika (3):211–218.
[17] Kashyap, S., Osman, M., Ferguson, C., Nath, M., Roy, B. et al. (2018) Ccl2 deficiency protects against chronic renal injury in murine renovascular hypertension. Scientific Reports.