Large-scale ligand-based virtual screening for SARS-CoV-2 inhibitors using deep neural networks
Markus Hofmarcher, Andreas Mayr, Elisabeth Rumetshofer, Peter Ruch, Philipp Renz, Johannes Schimunek, Philipp Seidl, Andreu Vall, Michael Widrich, Sepp Hochreiter, Günter Klambauer
Abstract
Due to the current severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, there is an urgent need for novel therapies and drugs. We conducted a large-scale virtual screening for small molecules that are potential CoV-2 inhibitors. To this end, we utilized “ChemAI,” a deep neural network trained on more than 220M data points across 3.6M molecules from three public drug-discovery databases. With ChemAI, we screened and ranked one billion molecules from the ZINC database for favourable effects against CoV-2. We then reduced the result to the 30,000 top-ranked compounds, which are readily accessible and purchasable via the ZINC database. Additionally, we screened the DrugBank using ChemAI to allow for drug repurposing, which would be a fast way towards a therapy. We provide these top-ranked compounds of ZINC and DrugBank as a library for further screening with bioassays at https://github.com/ml-jku/sars-cov-inhibitors-chemai.
Introduction.
Due to the current world-wide crisis of SARS-CoV-2 virus infections, there is a strong need for new therapies. While many efforts are focused on repurposing existing drugs (Zhou et al., 2020; Wang et al., 2020; Ton et al., 2020), we suggest testing new molecules with potentially higher efficacy. Therefore, we performed a large-scale ligand-based virtual screening run, which resulted in 30,000 potential SARS-CoV-2 inhibitors with favorable properties. We actively reach out to the scientific community to test these molecules and to consider them as a custom-designed chemical library.
* Equal contribution. ELLIS Unit at the LIT AI Lab, Johannes Kepler University Linz, Austria; Institute for Machine Learning, Johannes Kepler University Linz, Austria. Correspondence to: Günter Klambauer <[email protected]>. Technical Report.
Most current virtual screens are structure-based and use docking methods (Chen et al., 2020; Huang et al., 2020; Haider et al., 2020; Wang et al., 2020; Fischer et al., 2020; Ton et al., 2020; Senathilake et al., 2020; Ruan et al., 2020; Jin et al., 2020; Zhang et al., 2020; Gorgulla et al., 2020), while only one screen is ligand-based and uses a similarity-based approach (Zhu et al., 2020). The largest docking studies screen databases with sizes ranging from roughly 700 million (Fischer et al., 2020) to 1.3 billion (Ton et al., 2020) molecules. Our study also operates on databases of this size: concretely, we perform a ligand-based virtual screening of a collection of one billion molecules from the ZINC database.
Deep ligand-based virtual screening. “ChemAI” is a deep neural network trained to simultaneously predict a large number of biological effects (Mayr et al., 2018; Preuer et al., 2019). In more detail, the network is of the type SmilesLSTM (Mayr et al., 2018; Hochreiter & Schmidhuber, 1997) and was trained on a data set composed of ChEMBL (Gaulton et al., 2017), ZINC (Sterling & Irwin, 2015) and PubChem (Kim et al., 2016), which is similar to the data set used by Preuer et al. (2018). ChemAI predicts 6,269 biological outcomes, such as binding to targets, inhibitory or toxic effects. The network was trained in a multi-task setting, in which data from other bioassays was used to enhance the predictive power for SARS-CoV inhibitory effects. Each modelled biological effect is represented by an output neuron of the neural network. To rank compounds, we utilized a small set of output neurons associated with SARS-CoV inhibition and a set of output neurons associated with toxic effects.
We screened the ZINC database because it contains a large set of diverse molecules and additionally provides links to vendors from which those molecules can be purchased and physically obtained. We downloaded 898,196,375 molecules from ZINC and converted them to canonical SMILES (Weininger, 1988) using RDKit (Landrum, 2006). We then performed inference with ChemAI to obtain predictions for each of those roughly one billion molecules.
Assay ID    Source
Table 1. Overview of the main biological effects considered for ranking the molecules of the virtual screen.
Selecting bioassays for multiple targets of SARS-CoV.
SARS-CoV-2 has two main proteases that are critical for its replication, namely 3CLpro (3C-like protease) and PLpro (papain-like protease), encoded in an open reading frame (Macchiagodena et al., 2020). A compound that inhibits both proteases could be a promising drug candidate (Ledford, 2009; Collison, 2019). The virus proteases are also strikingly similar to those of SARS-CoV-1 (Macchiagodena et al., 2020), a similarity that docking-based approaches also implicitly assume. We therefore selected two groups of assays, one of which measures the inhibition of 3CLpro and the other the inhibition of PLpro (see Table 1). For each of those four assays, ChemAI possesses an output unit, which models the ability of small molecules to exhibit the effect measured by the assay. Thus, using the predictions yielded by ChemAI, it is possible to rank compounds by their predicted ability to inhibit the two main proteases of SARS-CoV-1, which can serve as a proxy for the inhibitory potential against SARS-CoV-2.
Consensus ranking.
We developed a library of compounds that is enriched for molecules with the ability to inhibit both proteases of SARS-CoV-2. To score the multi-target effect, we calculated a consensus score for each molecule as the average rank of the predictions over the four selected assays (see Table 1). We then ranked all compounds by this consensus score. For each of the top-ranked compounds, we also calculated the minimal distance to actives in the training set in order to identify novel chemical structures. Furthermore, for each compound we report its number of potential toxic effects (Mayr et al., 2016).
For the distance metric, we used the Jaccard distance based on binary ECFP4 fingerprints folded to a length of 1024, which yields values in the interval [0, 1]. For potential toxic effects, we used 75 output units of ChemAI with high predictive quality, concretely an area under the ROC curve (AUC) larger than . , and counted how many of those output units indicated a toxic effect. This value is reported in Table 2 (column “tox”). Furthermore, we report the clinical toxicity probability predicted by an independent multi-task neural network fitted on the ClinTox dataset (Wu et al., 2018). These probability values are calibrated by Platt scaling (Platt et al., 1999) and reported in Table 2 (column “ct”). The additional information contained in these values can be used to obtain a refined ranking for testing the molecules.
We implemented the overall process as a two-step approach. In the first step, we reduced the ZINC database of one billion molecules to a smaller set by keeping all molecules that exhibited some predicted activity on any of the four assays (precisely, at least one of the predictions had to lie in the top-1% quantile). In this way, we obtained an intermediate dataset of 5,672,501 molecules. For those molecules, the consensus score, the toxicity flags and the distance to known actives were calculated.
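The distance to known actives computed in this first step can be sketched as follows. This is a minimal illustration, not the authors' implementation: fingerprints are represented as sets of on-bit indices, the function names and toy bit sets are hypothetical, and computing real ECFP4 fingerprints (which requires a cheminformatics toolkit) is not shown.

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard distance between two binary fingerprints.

    Each fingerprint is a set of on-bit indices (for a fingerprint
    folded to length 1024, indices lie in range(1024)).
    The result lies in the interval [0, 1].
    """
    union = fp_a | fp_b
    if not union:
        # Assumption of this sketch: two all-zero fingerprints count as identical.
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(union)


def min_distance_to_actives(candidate, actives):
    """Smallest Jaccard distance from a candidate to any known active."""
    return min(jaccard_distance(candidate, active) for active in actives)


# Toy data: hypothetical bit indices, not real ECFP4 fingerprints.
known_actives = [{1, 5, 9, 100}, {7, 8}]
candidate = {1, 5, 9, 42}

nearest = min_distance_to_actives(candidate, known_actives)
print(round(nearest, 4))
```

A value close to 0 means the candidate closely resembles some known active, while a value close to 1 indicates a structure that is dissimilar to all of them and therefore potentially novel.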
In the second step, we reduced the dataset to the 30,000 top-ranked molecules according to the consensus score.
Results.
With the above-mentioned approach, we assembled a library of potential inhibitors of SARS-CoV-2. We report three metrics for each compound: a) the predicted inhibitory effect on SARS-CoV proteases, b) potential toxicities and c) the distance to known actives. This led to a ranked list of compounds, of which we provide the top 30,000 as a screening library. The top-ranked molecules are given in Table 2 and Figure 1.
We also checked whether molecules suggested by other publications can be confirmed by ChemAI. Overall, some suggested molecules show at least mild predicted activity against SARS-CoV (see Table 3).
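The two-step selection described under "Consensus ranking" (a top-1% pre-filter on any assay, followed by an average-rank consensus and a top-k cut) can be sketched in a few lines of Python. This is a toy-scale illustration under stated assumptions: the function names, the dictionary layout and the example predictions are hypothetical, plain ordinal ranks are used (1 = highest prediction, so a lower average is better), and ties are broken by input order. How the ranks are normalized for the "score" column of Table 2 is not specified here and is not reproduced.

```python
import math


def top_fraction_thresholds(preds, fraction):
    """Per-assay cut-off such that about `fraction` of molecules score at or above it."""
    n_assays = len(next(iter(preds.values())))
    k = max(1, math.ceil(fraction * len(preds)))
    thresholds = []
    for a in range(n_assays):
        column = sorted((p[a] for p in preds.values()), reverse=True)
        thresholds.append(column[k - 1])
    return thresholds


def prefilter(preds, fraction=0.01):
    """Step 1: keep molecules whose prediction reaches the top fraction on any assay."""
    thresholds = top_fraction_thresholds(preds, fraction)
    return {
        mol: p
        for mol, p in preds.items()
        if any(value >= t for value, t in zip(p, thresholds))
    }


def consensus_ranking(preds):
    """Average per-assay rank for each molecule, sorted best (lowest) first."""
    mols = list(preds)
    n_assays = len(preds[mols[0]])
    rank_sum = {mol: 0 for mol in mols}
    for a in range(n_assays):
        ordered = sorted(mols, key=lambda m: preds[m][a], reverse=True)
        for rank, mol in enumerate(ordered, start=1):
            rank_sum[mol] += rank
    averages = {mol: rank_sum[mol] / n_assays for mol in mols}
    return sorted(averages.items(), key=lambda item: item[1])


def screen(preds, fraction=0.01, top_k=30000):
    """Step 1 (pre-filter) followed by step 2 (consensus ranking and top-k cut)."""
    return consensus_ranking(prefilter(preds, fraction))[:top_k]


# Toy predictions for four assays per molecule (made-up values).
toy_preds = {
    "mol_a": [0.90, 0.80, 0.70, 0.90],
    "mol_b": [0.10, 0.20, 0.10, 0.30],
    "mol_c": [0.80, 0.90, 0.90, 0.95],
    "mol_d": [0.20, 0.10, 0.20, 0.10],
}

# A large fraction is used only because the toy set has four molecules.
result = screen(toy_preds, fraction=0.5, top_k=2)
print(result)
```

At the scale of one billion molecules, an in-memory dictionary like this would not be practical; the sketch is only meant to make the order of operations explicit.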
Drug repurposing of DrugBank molecules.
With the same procedure employed for ZINC, we also screened DrugBank (Wishart et al., 2018), a database of ≈
Discussion.
In this work, we presented the construction of a screening library of small molecules that are potential inhibitors of SARS-CoV-2. Our ligand-based approach uses a neural network trained to predict the outcomes of bioassays. From this multi-task model, four tasks were selected to predict the inhibitory potential against SARS-CoV-1. A consensus between these predictions was used to rank compounds from the ZINC database, of which the 30,000 top-ranked are reported.
The approach is limited by the predictive quality of the underlying machine learning method, evaluated via AUC and leading to values in the range of . to . . While these results are very promising, improved data quality, larger amounts of data, or better machine-learning approaches could lead to increased predictive performance and quality of the library. A promising direction is also to enrich the representation of molecules via already available biological modalities (Simm et al., 2018; Hofmarcher et al., 2019).
We expect that the data for SARS-CoV-1 already has high predictive power for inhibitory effects of compounds on SARS-CoV-2. However, the current predictions can be further adjusted toward SARS-CoV-2 via transfer learning and the incorporation of new data from SARS-CoV-2. In particular, few-shot learning may be applied to the first measurements for SARS-CoV-2, thus adjusting the multi-task model toward SARS-CoV-2.
ZINC ID           Canonical SMILES                                      dist    score   tox  ct
ZINC000254565785  CNC(=S)NN=Cc1c2ccccc2c(Cl)c2ccccc12                   0.5455  0.8244  8    0.06
ZINC000726422572  C=C(Cl)COc1ccc(C(C)=NNC(=S)NCCc2ccccn2)cc1            0.5333  0.8232  7    0.05
ZINC000916265995  CNC(=S)NN=Cc1cc2cccc(C)c2nc1Cl                        0.6111  0.8230  5    0.08
ZINC000916356873  N \ NC(=S)NNC(=S)N(C)c1ccccc1)c1nccc2ccccc12          0.6721  0.8191  11   0.06
ZINC000005486767  C/C(=N/NC(=S)NNC(=S)N(C)c1ccccc1)c1nccc2ccccc12       0.6721  0.8191  11   0.06
ZINC000005527649  CSC(=S)N/N=C/c1ccc2ccccc2n1                           0.6327  0.8187  6    0.05
ZINC000755497029  C=CCNC(=S)NN=Cc1nc2cc(Cl)ccc2n1C                      0.6719  0.8186  5    0.06
ZINC000746495682  FC(F)(F)CNC(=S)NN=Cc1cn(Cc2ccccc2)c2ccccc12           0.5690  0.8186  15   0.07
ZINC000005719506  CN(/N=C/c1ccc(Cl)cc1)C(=S)c1ccccc1                    0.6818  0.8178  4    0.05
ZINC000002149503  S=C(NCc1ccccc1)N/N=C/c1cn(CCOc2ccc(Br)cc2)c2ccccc12   0.5625  0.8175  13   0.21
Table 2. Top-ranked molecules by ChemAI. All compounds have a high predicted activity on all four assays (column “score”) and are relatively distant (column “dist”) from currently known inhibitors. The distance measure is the Jaccard distance based on binary ECFP4 fingerprints and lies in the interval [0, 1]. Some of the presented molecules might exhibit a number of toxic effects (column “tox”). Here the number of models indicating a toxic effect is reported, where the total number of toxicity models was 75. We also report the estimated probability of clinical toxicity (column “ct”).
Figure 1. Graphical representation of the top-ranked molecules by ChemAI. Structures shown: ZINC000254565785, ZINC000726422572, ZINC000916265995, ZINC000916356873, ZINC000806591744, ZINC000178971373, ZINC000000155607, ZINC000016317677, ZINC000193073749, ZINC000769846795, ZINC000755523869, ZINC000763345954, ZINC000001448699, ZINC000016940508, ZINC000005486767, ZINC000005527649, ZINC000755497029, ZINC000746495682, ZINC000005719506, ZINC000002149503.
Availability
The library of molecules is available at https://github.com/ml-jku/sars-cov-inhibitors-chemai.
Acknowledgements
Funding by the Institute for Machine Learning (JKU). All authors contributed equally to this work.
References
Ashburn, T. T. and Thor, K. B. Drug repositioning: identifying and developing new uses for existing drugs. Nature reviews Drug discovery, 3(8):673–683, 2004.
Chen, Y. W., Yiu, C.-P. B., and Wong, K.-Y. Prediction of the sars-cov-2 (2019-ncov) 3c-like protease (3cl pro) structure: virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates. F1000Research, 9, 2020.
Collison, J. Two targets are better than one. Nature Reviews Rheumatology, 15(7):386–386, 2019.
Fischer, A., Sellner, M., Neranjan, S., Lill, M. A., and Smieško, M. Inhibitors for novel coronavirus protease identified by virtual screening of 687 million compounds. 2020.
Gaulton, A., Hersey, A., Nowotka, M., Bento, A. P., Chambers, J., Mendez, D., Mutowo, P., Atkinson, F., Bellis, L. J., Cibrián-Uhalte, E., et al. The chembl database in 2017. Nucleic acids research, 45(D1):D945–D954, 2017.
Glantz-Gashai, Y., Meirson, T., Reuveni, E., and Samson, A. O. Virtual screening for potential inhibitors of Mcl-1 conformations sampled by normal modes, molecular dynamics, and nuclear magnetic resonance. Drug Des Devel Ther, 11:1803–1813, 2017.
Gorgulla, C., Boeszoermenyi, A., Wang, Z.-F., Fischer, P. D., Coote, P., Das, K. M. P., Malets, Y. S., Radchenko, D. S., Moroz, Y. S., Scott, D. A., et al. An open-source drug discovery platform enables ultra-large virtual screens. Nature, pp. 1–8, 2020.
Haider, Z., Subhani, M. M., Farooq, M. A., Ishaq, M., Khalid, M., Khan, R. S. A., and Niazi, A. K. In silico discovery of novel inhibitors against main protease (mpro) of sars-cov-2 using pharmacophore and molecular docking based virtual screening from zinc database. 2020.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Hofmarcher, M., Rumetshofer, E., Clevert, D.-A., Hochreiter, S., and Klambauer, G. Accurate prediction of biological assays with high-throughput microscopy images and convolutional networks. Journal of chemical information and modeling, 59(3):1163–1171, 2019.
Huang, A., Tang, X., Wu, H., Zhang, J., Wang, W., Wang, Z., Song, L., Zhai, M.-a., Zhao, L., Yang, H., et al. Virtual screening and molecular dynamics on blockage of key drug targets as treatment for covid-19 caused by sars-cov-2. 2020.
Jin, Z., Du, X., Xu, Y., Deng, Y., Liu, M., Zhao, Y., Zhang, B., Li, X., Zhang, L., Duan, Y., et al. Structure-based drug design, virtual screening and high-throughput screening rapidly identify antiviral leads targeting covid-19. bioRxiv, 2020.
Kim, S., Thiessen, P. A., Bolton, E. E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Shoemaker, B. A., et al. Pubchem substance and compound databases. Nucleic acids research, 44(D1):D1202–D1213, 2016.
Landrum, G. RDKit: Open-source cheminformatics, 2006.
Ledford, H. One drug, two targets, 2009.
Lim, L., Roy, A., and Song, J. Identification of a zika ns2b-ns3pro pocket susceptible to allosteric inhibition by small molecules including qucertin rich in edible plants. bioRxiv, 2016. doi: 10.1101/078543.
Macchiagodena, M., Pagliai, M., and Procacci, P. Inhibition of the main protease 3cl-pro of the coronavirus disease 19 via structure-based ligand design and molecular modeling. arXiv preprint arXiv:2002.09937, 2020.
Mayr, A., Klambauer, G., Unterthiner, T., and Hochreiter, S. Deeptox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.
Mayr, A., Klambauer, G., Unterthiner, T., Steijaert, M., Wegner, J. K., Ceulemans, H., Clevert, D.-A., and Hochreiter, S. Large-scale comparison of machine learning methods for drug target prediction on chembl. Chemical science, 9(24):5441–5451, 2018.
Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G. Fréchet chemnet distance: a metric for generative models for molecules in drug discovery. Journal of chemical information and modeling, 58(9):1736–1741, 2018.
Preuer, K., Klambauer, G., Rippmann, F., Hochreiter, S., and Unterthiner, T. Interpretable deep learning in drug discovery. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 331–345. Springer, 2019.
Ruan, Z., Liu, C., Guo, Y., He, Z., Huang, X., Jia, X., and Yang, T. Potential inhibitors targeting rna-dependent rna polymerase activity (nsp12) of sars-cov-2. 2020.
Senathilake, K., Samarakoon, S., and Tennekoon, K. Virtual screening of inhibitors against spike glycoprotein of 2019 novel corona virus: a drug repurposing approach. 2020.
Simm, J., Klambauer, G., Arany, A., Steijaert, M., Wegner, J. K., Gustin, E., Chupakhin, V., Chong, Y. T., Vialard, J., Buijnsters, P., et al. Repurposing high-throughput image assays enables biological activity prediction for drug discovery. Cell chemical biology, 25(5):611–618, 2018.
Sterling, T. and Irwin, J. J. Zinc 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11):2324–2337, 2015.
Ton, A.-T., Gentile, F., Hsing, M., Ban, F., and Cherkasov, A. Rapid identification of potential inhibitors of sars-cov-2 main protease by deep docking of 1.3 billion compounds. Molecular Informatics, 2020.
Wang, Q., Zhao, Y., Chen, X., and Hong, A. Virtual screening of approved clinic drugs with main protease (3clpro) reveals potential inhibitory effects on sars-cov-2. 2020.
Weininger, D. SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research, 46(D1):D1074–D1082, 2018.
Wu, C., Liu, Y., Yang, Y., Zhang, P., Zhong, W., Wang, Y., Wang, Q., Xu, Y., Li, M., Li, X., Zheng, M., Chen, L., and Li, H. Analysis of therapeutic targets for sars-cov-2 and discovery of potential drugs by computational methods. Acta Pharmaceutica Sinica B, 2020. ISSN 2211-3835. doi: https://doi.org/10.1016/j.apsb.2020.02.008.
Wu, Z., Ramsundar, B., Feinberg, E., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018. ISSN 2041-6520, 2041-6539. doi: 10.1039/C7SC02664A. URL http://xlink.rsc.org/?DOI=C7SC02664A.
Zhang, J.-J., Shen, X., Yan, Y.-M., Yan, W., and Cheng, Y.-X. Discovery of anti-sars-cov-2 agents from commercially available flavor via docking screening. 2020.
Zhou, Y., Hou, Y., Shen, J., Huang, Y., Martin, W., and Cheng, F. Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2. Cell Discovery, 6(1):1–18, 2020.
Zhu, Z., Wang, X., Yang, Y., Zhang, X., Mu, K., Shi, Y., Peng, C., Xu, Z., et al. D3similarity: A ligand-based approach for predicting drug targets and for virtual screening of active compounds against covid-19. 2020.
Table 3.
ZINC ID       Trivial name(s)     Canonical SMILES  Publications
ZINC00057060  Melatonin  COc1ccc2[nH]cc(CCNC(C)=O)c2c1  Zhou et al. (2020)
ZINC03869685  Meletin, Quercetin  O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12  Lim et al. (2016)
ZINC85537142  Aclarubicin  CC[C@@]1(O)C[C@H](O[C@H]2C[C@H](N(C)C)[C@H](O[C@H]3C[C@H](O)[C@H](O[C@H]4CCC(=O)[C@H](C)O4)[C@H](C)O3)[C@H](C)O2)c2c(cc3c(c2O)C(=O)c2c(O)cccc2C3=O)[C@H]1C(=O)OC  Senathilake et al. (2020)
ZINC03794794  Mitoxantrone  C1=CC(=C2C(=C1NCCNCCO)C(=O)C3=C(C=CC(=C3C2=O)O)O)NCCNCCO  Wang et al. (2020)
ZINC01668172  -  O=C(C[n+]1ccc2ccccc2c1)c1ccc2ccc3ccccc3c2c1  Glantz-Gashai et al. (2017)
ZINC03830332  E155  C1=CC=C2C(=C1)C(=CC=C2S(=O)(=O)O)NN=C3C=C(C(=O)C(=NNC4=CC=C(C5=CC=CC=C54)S(=O)(=O)O)C3=O)CO  Senathilake et al. (2020)
ZINC14879972  Gar-936  CN(C)c1cc(NC(=O)CNC(C)(C)C)c(O)c2c1C[C@H]1C[C@H]3[C@H](N(C)C)C(O)=C(C(N)=O)C(=O)[C@@]3(O)C(O)=C1C2=O  Wu et al. (2020)
ZINC00001645  Magnolol  C=CCc1ccc(O)c(-c2cc(CC=C)ccc2O)c1  Wu et al. (2020)
ZINC00014036  Piceatannol  Oc1cc(O)cc(/C=C/c2ccc(O)c(O)c2)c1  Wu et al. (2020)
ZINC16052277  Doxycycline  C[C@H]1c2cccc(O)c2C(=O)C2=C(O)[C@]3(O)C(=O)C(C(=N)O)=C(O)[C@@H](N(C)C)[C@@H]3[C@@H](O)[C@@H]21  Wu et al. (2020)
ZINC3920266  Idarubicin  CC(=O)[C@]1(O)Cc2c(O)c3c(c(O)c2[C@@H](O[C@H]2C[C@H](N)[C@H](O)[C@H](C)O2)C1)C(=O)c1ccccc1C3=O  Wu et al. (2020)