Accelerating high-throughput virtual screening through molecular pool-based active learning
David E. Graff,† Eugene I. Shakhnovich,† and Connor W. Coley∗,‡
†Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA
‡Department of Chemical Engineering, MIT, Cambridge, MA
E-mail: [email protected]
Abstract
Structure-based virtual screening is an important tool in early stage drug discovery that scores the interactions between a target protein and candidate ligands. As virtual libraries continue to grow (in excess of $10^8$ molecules), so too do the resources necessary to conduct exhaustive virtual screening campaigns on these libraries. However, Bayesian optimization techniques can aid in their exploration: a surrogate structure-property relationship model trained on the predicted affinities of a subset of the library can be applied to the remaining library members, allowing the least promising compounds to be excluded from evaluation. In this study, we assess various surrogate model architectures, acquisition functions, and acquisition batch sizes as applied to several protein-ligand docking datasets and observe significant reductions in computational costs, even when using a greedy acquisition strategy; for example, 87.9% of the top-50000 ligands can be found after testing only 2.4% of a 100M member library. Such model-guided searches mitigate the increasing computational costs of screening increasingly large virtual libraries and can accelerate high-throughput virtual screening campaigns with applications beyond docking.

Introduction

Computer-aided drug design techniques are widely used in early stage discovery to identify small molecule ligands with affinity to a protein of interest.
Broadly speaking, these techniques fall into one of two domains: ligand-based or structure-based. Ligand-based techniques often rely on either a quantitative structure-activity relationship (QSAR) or similarity model to screen possible ligands. Both techniques require one or more previously labeled active/inactive compounds that are typically acquired through physical experimentation. In contrast to ligand-based techniques, structure-based techniques, such as computational docking and molecular dynamics, try to simulate the physical process of a ligand binding to a protein active site and assign a quantitative score intended to correlate with the free energy of binding. These techniques require a three-dimensional structure of the target protein but do not require target-specific bioactivity data. Scoring functions used in structure-based techniques are typically parameterized functions describing the intra- and intermolecular interactions at play in protein-ligand binding. As a result, structure-based methods are in principle more able to generalize to unseen protein and ligand structures compared to ligand-based methods. This advantage has been leveraged to discover novel ligand scaffolds in a number of recent virtual screening campaigns.

A typical virtual screening workflow will exhaustively predict the performance of ligands in a virtual chemical library. However, over the past decade, these libraries have grown so large that the computational cost of screening cannot be ignored. For example, ZINC, a popular database of commercially available compounds for virtual screening, grew from 700k to 120M structures between 2005 and 2015 and, at the time of writing, now contains roughly 1 billion molecules.
ZINC is not unique in its gargantuan size; other enumerated virtual libraries exist that number well over one billion compounds. Non-enumerated libraries contain an implicit definition of accessible molecules and can be much larger, containing anywhere from $10^{12}$ to $10^{20}$ possible compounds. Despite some debate around whether "bigger is better" in virtual screening, such large virtual libraries are now being used for screening in structure-based drug design workflows. These studies required computational resources that are inaccessible to many academic researchers (e.g., 475 CPU-years in the case of Gorgulla et al.). Moreover, this high computational cost makes such a strategy impractical to apply to many distinct protein targets. As virtual libraries grow ever larger, new strategies must be developed to mitigate the computational cost of these exhaustive screening campaigns.

The goal in any virtual screening approach is to find a set of high-performing compounds (herein, computational "hits" with the most negative docking scores) within a significantly larger search space. To restate this formally, we are attempting to solve for the set of top-$k$ molecules $\{x_i\}_{i=1}^{k*}$ from a virtual library $\mathcal{X}$ that maximize some black-box function of molecular identity $f: x \in \mathcal{X} \to \mathbb{R}$, i.e.,

$$\{x_i\}_{i=1}^{k*} = \operatorname*{argmax}_{\{x_i\}_{i=1}^{k} \subset \mathcal{X}} \sum_{i=1}^{k} f(x_i). \qquad (1)$$

In this study, the black-box function $f(x)$ is the negative docking score of a candidate ligand, but other possibilities include the HOMO-LUMO gap of a candidate organic semiconductor, the extinction coefficient of a candidate dye, the turnover number of a candidate catalyst, etc. Brute-force searching (screening every molecule in a library indiscriminately) is a straightforward and common strategy employed to solve this type of problem, but it necessarily spends a significant fraction of its time evaluating relatively low-performing compounds (Figure 1A). However, algorithmic frameworks exist that aim to solve Equation 1 with the fewest number of required evaluations. Bayesian optimization is one such framework that uses a surrogate model trained on previously acquired data to guide the selection of subsequent experiments. We describe Bayesian optimization in more detail in the Methods section below, but we refer a reader to ref. 16 for an in-depth review on the subject. The application of Bayesian optimization to physical problems is well-precedented, e.g., with applications to materials design, bioactivity screening, and molecular simulations, but few examples exist of its application in virtual screening for drug discovery. Furthermore, the design space is both large and discrete but not necessarily defined combinatorially, which is a relatively unexplored setting for Bayesian optimization.

Two recent examples of work that used active learning for computational drug discovery can be found in Deep Docking and a study from Pyzer-Knapp. Deep Docking uses a greedy, batched optimization approach and treats docking scores as a binary classification problem during the QSAR modelling step with a fingerprint-based feed-forward neural network surrogate model. Pyzer-Knapp also utilized a batched optimization approach with a Gaussian process (GP) surrogate model using a parallel and distributed Thompson sampling acquisition strategy; this approach is well-suited to small design spaces (e.g., thousands of compounds), but GP training does not scale well to millions of acquired data points due to its $\mathcal{O}(N^3)$ complexity.
In contrast to Deep Docking, our work treats docking score as a continuous variable, so our surrogate model follows a regression formulation. Lyu et al. observed correlations between the success rates of experimental validation and computed docking scores, suggesting that preserving this information during model training will improve our ability to prioritize molecules that are more likely to be validated as active.

In this work, we leverage Bayesian optimization algorithms for docking simulations in a manner that preserves the fidelity of these structure-based methods while decreasing the computational cost needed to explore the structure-activity landscape of virtual libraries by over an order of magnitude (Figure 1B). We demonstrate that surrogate machine learning models can prioritize the screening of molecules that are associated with better docking scores, which are presumed to be more likely to validate experimentally. We analyze a variety of different model architectures and acquisition functions that one can use to accelerate high-throughput virtual screening using Bayesian optimization. Specifically, we test random forest (RF), feed-forward neural network (NN), and directed-message passing neural network (MPN) surrogate models along with greedy, upper confidence bound (UCB), Thompson sampling (TS), expected improvement (EI), and probability of improvement (PI) acquisition strategies, in addition to various acquisition batch sizes. We study these optimization parameters on multiple datasets of protein-ligand docking scores for compound libraries containing roughly ten thousand, fifty thousand, two million, and one hundred million molecules. We perform these studies using
MolPAL, an open-source software package which we have developed and made freely available.
Figure 1: Overview of a computational docking screen using (A) brute-force (exhaustive) virtual screening and (B) molecular pool-based active learning (MolPAL). Grey circles depict molecules that have not been evaluated.
Results
Small virtual libraries
As an initial evaluation of the experimental efficiency Bayesian optimization can provide, we turned to a dataset containing scores from 10,540 compounds (Enamine's smaller Discovery Diversity Collection, "Enamine 10k") docked against thymidylate kinase (PDB ID: 4UNN) using AutoDock Vina. Data acquisition was simulated as the iterative selection of 1% of the library (ca. 100 molecules) repeated five times after initialization with a random 1% subset, for a total acquisition of 6% of the library. We first looked at a random forest (RF) operating on molecular fingerprints as our surrogate model along with a greedy acquisition strategy. This combination yields clear improvement over the random baseline, representative of a brute-force search, when looking at the percentage of top-100 (ca. top-1%) scores in the full dataset found by
MolPAL (Figure 2, left panel). The greedy strategy finds, on average, 51.6% of the top-100 scores after exploring 6% of the library. We define the enrichment factor (EF) as the ratio of the percentage of top-k scores found by the model-guided search to the percentage of top-k scores found by a random search for the same number of objective function calculations. The random baseline finds only 5.6% of the top-100 scores in the 10k dataset, thus constituting an EF of 9.2 for the greedy random forest combination. A UCB acquisition metric yields similar, albeit slightly lower, performance with an EF of 7.7. Surprisingly, the other optimistic acquisition metric we tested, Thompson sampling (TS), does show an improvement over the random baseline but is considerably lower than all other metrics (EF = 4.9). We attribute this lower performance to the large uncertainties in the RF model and the relatively compact distribution of predicted means, which cause the Thompson sampling strategy to behave somewhat closer to random sampling than other metrics.

We next assessed the effect of architecture for the surrogate model. Using a fully connected neural network (NN) operating on molecular fingerprints, we observed an increase in performance for all acquisition metrics (Figure 2, middle panel). With the NN model, the least performant acquisition strategy (56.0% with EI) was more powerful than the best performing strategy with the RF model (51.6% with greedy acquisition). The greedy acquisition metric remains the best performing for the NN model, achieving 66.8% of top-100 scores found for an EF of 11.9. The third and final architecture examined is a message-passing neural network model (MPN), using the D-MPNN implementation by Yang et al. The MPN model resulted in comparable performance to the NN model (Figure 2, right panel), with slight improvement for some metrics. However, the highest performance observed, 67.0% (EF = 12.0) with greedy acquisition, is statistically identical to the best NN result.

Figure 2: Bayesian optimization performance on Enamine 10k screening data as measured by the percentage of top-100 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Each experiment began with a random 1% acquisition (ca. 100 samples) and acquired 1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs. RF, random forest; NN, neural network; MPN, message-passing neural network; UCB, upper confidence bound; TS, Thompson sampling; EI, expected improvement; PI, probability of improvement.

These analyses were repeated for Enamine's larger Discovery Diversity Collection of 50,240 molecules ("Enamine 50k"), also against 4UNN with the same docking parameters (Figure 3). We again took 1% of the library as our initialization with five subsequent exploration batches of 1% each. All of the trends remained largely the same across acquisition metrics and model architectures; we observed comparable quantitative performance for every surrogate model/acquisition metric combination as compared to the smaller library. For example, the RF model with a greedy acquisition strategy now finds 59.1%
of the top-500 scores (ca. top-1%) using the Enamine 50k library vs. the 51.6% of the top-100 scores (ca. top-1%) when using the Enamine 10k library. There was little difference between the results of the NN and MPN models on the Enamine 50k results, which find 74.8% and 72.9% of the top-500 scores after exploring just 6% of the library, respectively. These values represent enrichment factors of 11.3 and 11.0, respectively, over the random baseline.
Enamine HTS Collection
Encouraged by the significant enrichment observed with the 10k and 50k libraries, we next tested Enamine's 2.1 million member HTS Collection ("Enamine HTS"), a size more typical of a high-throughput virtual screen. We again use 4UNN and AutoDock Vina to define the objective function values. With this larger design space, acquisitions of 1% represent a significant number of molecules (ca. 21,000); therefore, we also sought to reduce exploration size. Given the strong performance of the greedy acquisition metric and its simplicity (i.e., lack of a requirement of predicted variances), we focus our analysis on this metric alone.

We tested three different batch sizes for initialization and exploration, with five exploration batches, as in our above experiments. Using a batch size of 0.4% for a total of 2.4% of the pool, we observed strong improvement over random exploration for all three models
using the greedy metric in terms of the fraction of the top-1000 (top-0.05%) scores found (Figure 4, left panel). With a 0.4% batch size, the random baseline finds only 2.6% of the top-1000 scores, whereas the RF, NN, and MPN models find 84.3% (EF = 32.4), 95.7% (EF = 36.8), and 97.7% (EF = 37.6) of the top-1000 scores, respectively. Lowering the total exploration size by half so that 0.2% of the library is acquired at each iteration (a total of 1.2% of the library) reduces the overall performance of each model, but the drop in performance is not commensurate with the decrease in performance of the random baseline (Figure 4, middle panel). The MPN model is shown to be the most robust to the decrease in batch size, identifying 93.7% of the top-1000 scores after exploring just 1.2% of the design space for an enrichment factor of 72.0. Similarly, reducing the batch size further to 0.1% affects the random baseline to a greater extent than any active learning strategy (Figure 4, right panel). Here, the random baseline finds only 0.6% of the top-1000 scores but the MPN model with greedy acquisition finds 82.2% of the top-1000 scores, representing an enrichment factor of 137. This growth in enrichment factor as exploration fraction decreases holds for other, non-greedy acquisition metrics (Tables S3-S5).

Figure 3: Bayesian optimization performance on Enamine 50k screening data as measured by the percentage of top-500 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Each experiment began with a random 1% acquisition (ca. 500 samples) and acquired 1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.

Ultra-large library
One goal of the Bayesian optimization framework in our software,
MolPAL, is to scale to even larger chemical spaces. A two million member library is indeed a large collection of molecules, but screens of this size are compatible with standard high-performance computing clusters. Our final evaluation sought to demonstrate that
MolPAL can make virtual screens of $\geq 10^8$-member libraries accessible to researchers using modest, shared clusters. We turned to a recent study by Lyu et al. that screened 99 million readily accessible molecules against AmpC β-lactamase (PDB ID: 12LS) using DOCK3.7. The full dataset of 99.5 million molecules that were successfully docked against 12LS ("AmpC") is used as the ground truth. We measure our algorithm's performance as a function of the top-50000 (top-0.05%) scores found for all three models using acquisition sizes of 0.4%, 0.2%, or 0.1%.

Figure 4: Bayesian optimization performance on Enamine HTS screening data as measured by the percentage of top-1000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given model with a greedy acquisition strategy. Chart labels represent the fraction of the library acquired in the random initial batch and the five subsequent exploration batches. Error bars reflect ± one standard deviation across five runs.

We see a similar qualitative trend for the AmpC dataset as for all previous experiments: namely, that the use of Bayesian optimization enables us to identify many of the top-performing compounds after exploring a small fraction of the virtual library, even when using a greedy acquisition metric (Figure 5). For the 0.4% batch size experiments, the MPN model finds 87.9% of the top-50000 (ca. top-0.05%) scores after exploring 2.4% of the total pool (EF = 36.6). Decreasing the batch size to 0.2% led to a recovery of 67.1% (EF = 55.9), and further decreasing the batch size to 0.1% resulted in 52.0% recovery (EF = 86.7). The NN and RF surrogate models were not as effective as the MPN, but still led to significant enrichment above the baseline. Results for additional acquisition functions can be found in Tables S6-S8. The UCB acquisition metric led to notable increases in the performance of the MPN model for both a 0.4% and 0.2% batch size, finding 94.8% (EF = 39.5) and 75.5% (EF = 62.9) of the top-50000 scores, respectively; however, UCB was not consistently superior to greedy acquisition for other model architectures or datasets.

Figure 5: Bayesian optimization performance on AmpC screening data as measured by the percentage of top-50000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given model with a greedy acquisition strategy. Chart labels represent the fraction of the library taken in both the initialization batch and the five exploration batches. Error bars reflect ± one standard deviation across three runs.

It is worth commenting on the differences in quantitative enrichment factors reported for the AmpC data and the Enamine HTS data. There are at least two differences that preclude a direct comparison: (1) the top-k docking scores in the AmpC data were generated by DOCK and span a range of -118.83 to -73.99, whereas the docking scores from the Enamine HTS collection dataset were calculated with AutoDock Vina, where the top-k docking scores range from -12.7 to -11.0. The lower precision of AutoDock Vina scores makes the top-k score metric more forgiving (discussed later in the Limitations of evaluation metrics subsection). (2) The chemical spaces canvassed by both libraries are different. This will necessarily impact model performance and, by extension, optimization performance.
Single-iteration active learning
A critical question with these experiments is the importance of the active learning strategy. From the Enamine HTS data (Figure 4), we observe a sharp increase in the percentage of top-1000 scores found after the first exploration batch (e.g., from 0.4% to 70.3% for the MPN 0.4% acquisition), suggesting that the randomly selected initial batch is quite informative. We look at "single-iteration" experiments, where the first batch is selected randomly and the second (and final) batch is selected according to our acquisition function (Figure 6). Selecting all 42,000 molecules at once after training on the initial 8,400 molecules results in finding 94.9% of the top-1000 scores.

Figure 6: Single-iteration results on Enamine HTS with greedy acquisition. Solid traces: initialization batch size of 0.4% of pool and exploration batch size of 2% of pool. Dashed traces: initialization batch size of 2% of pool and exploration batch size of 0.4% of pool. Error bars reflect ± one standard deviation across three runs. Faded traces: reproduced active learning data from 0.4% experiments (error bars omitted).

Dynamic convergence criterion
Our evaluations so far have demonstrated reliable performance of
MolPAL using a fixed exploration strategy (i.e., number of iterations). However, we will typically not know a priori what an appropriate total exploration size is. We therefore define a convergence criterion for the Enamine HTS dataset that is satisfied when the fractional difference between the current average of the top-1000 scores (corresponding to the top 0.05% of compounds) and the rolling average of the top-1000 scores from the previous three epochs is less than 0.01. Figure 7 illustrates the use of this convergence criterion using a 0.1% batch size (ca. 2,100 molecules) with a greedy acquisition metric. We find that not only do the higher capacity models converge sooner (MPN > NN > RF), but they also converge to a higher percentage of top-1000 scores found (88.7%, 85.4%, and 75.8%, respectively).
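As a concrete illustration, the criterion can be expressed in a few lines of Python. This is a minimal sketch with our own function and variable names, not code from the MolPAL package:

import numpy as np

def converged(top_k_avgs, window=3, threshold=0.01):
    # top_k_avgs: average of the top-1000 scores after each completed
    # iteration, most recent last
    if len(top_k_avgs) < window + 1:
        return False
    current = top_k_avgs[-1]
    rolling = np.mean(top_k_avgs[-(window + 1):-1])  # previous three epochs
    return abs((current - rolling) / rolling) < threshold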
Chemical space visualization
To visualize the set of molecules acquired during the Bayesian optimization routine, we projected the 2048-bit atom-pair fingerprints of the Enamine HTS library into two dimensions using UMAP (Figures 8 and S16). The embedding of the library was trained on a random 10% subset of the full library and then applied to the full library. Comparing the location of the top-1000 molecules (Figure 8A) to the density of molecules in the entire 2M library (Figure 8B) reveals several clusters (black ellipses) of high-performing compounds located in sparse regions of chemical space. To observe how the three separate surrogate models cope with this mismatch, we plot the embedded fingerprints of the molecules acquired in the zeroth (random), first, third, and fifth iterations of a 0.1% batch size greedy search (Figure 8C). While all surrogate models select candidates in these areas, there are differences in both the speed and thoroughness with which they search them. The RF model does not begin sampling densely from this region until the third iteration and refocuses its attention elsewhere by the fifth iteration. Both the NN and MPN models acquire a larger number of points from these areas in earlier iterations. While this analysis is qualitative by nature, it illustrates how the performance differences between the three surrogate model architectures relate to the regions of chemical space they choose to explore.

Figure 7: Convergence results on Enamine HTS collection with greedy acquisition and 0.1% batch size for both initialization and exploration.
Discussion
Effect of acquisition strategy on performance
An interesting result from these experiments was the consistently strong performance of the greedy acquisition metric. This is surprising given the fact that the greedy metric is, in principle, purely exploitative and vulnerable to limiting its search to a single area of the given library's chemical space. Prior work in batched Bayesian optimization has focused on
Figure 8: Visualization of the chemical space in the Enamine HTS library using UMAP embeddings of 2048-bit atom-pair fingerprints trained on a random 10% subset of the full library. A. Embedded fingerprints of the top-1000 scoring molecules. B. Density of molecules in the entire library. C. Embedded fingerprints of the molecules acquired in the given iteration using a greedy acquisition metric, 0.1% batch size, and specified surrogate model architecture. Circled regions indicate clusters of high-scoring compounds in sparse regions of chemical space. Color scale corresponds to the negative docking score (higher is better). X- and y-axes are the first and second components of the 2D UMAP embedding and range from -7.5 to 17.5.
Despite this potentiallimitation of the greedy acquisition metric, it still leads to adequate exploration of theselibraries and superior performance to metrics that combine some notion of exploration alongwith exploitation (i.e., UCB, TS, EI, PI). One confounding factor in this analysis is thatmethods used for uncertainty quantification in regression models are often unreliable, whichmay explain the poorer empirical results when acquisition depends on their predictions. Effect of library size
The principal takeaway from our results on different library sizes is that Bayesian optimization is not simply a viable technique but an effective one in all of these cases. Though it is difficult to quantitatively compare algorithm performance on each dataset due to their differing chemical spaces, the impact of library size on the optimization is still worth commenting on. We observe the general trend in our data that, as library size increases, so too does top-k performance given a constant fractional value for k, even when decreasing the relative exploration size. We anticipate that this is due in part to the greater amount of training data that the surrogate model is exposed to over the course of the optimization. Despite the relative batch size decreasing, the absolute number of molecules taken in each batch, and thus the number of data points to train on, increases from roughly 500 to nearly 8,400 when moving from the Enamine 50k dataset (1% batch size) to the Enamine HTS dataset (0.4% batch size). We analyzed the mean-squared error (MSE) of MPN model predictions on the entire 10k, 50k, and HTS libraries after initialization with random 1%, 1%, and 0.4% batches, respectively; the MSE decreased monotonically with increasing library size: 0.3504, 0.2617, and 0.1520.

Consistency across repeated trials
Bayesian optimization can be prone to large deviations across repeated trials, but our results showed consistent performance across all datasets and configurations (Tables S1-S8). To investigate whether the consistency in performance is a result of consistency in the exact molecular species selected, we calculate the total number of unique SMILES strings acquired across all repeated experiments as a function of optimization iteration (Figures S6 and S7). Comparing these results to both the theoretical maximum (each trial acquiring a completely unique subset of molecules at each iteration) and the theoretical minimum (each trial acquiring an identical subset of molecules at each iteration after the initialization) approximates the degree to which each repeat explores the same or different regions of chemical space. Traces closer to the maximum would indicate that each experiment is exploring a unique subset of highly performing molecules, and traces closer to the minimum signal the opposite: that each experiment is converging towards the same regions of chemical space. Our data are most consistent with the latter, suggesting that each trial is converging towards the same optimal subset of the library regardless of its initialization. We hypothesize that this is due to relatively smooth structure-property landscapes present in these datasets and lack of statistical uncertainty.
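The comparison against these bounds amounts to simple set arithmetic; a sketch with hypothetical argument names (not those of our analysis scripts) is:

def uniqueness_bounds(acquired_per_trial, n_init):
    # acquired_per_trial: one set of acquired SMILES per repeated trial
    n_trials = len(acquired_per_trial)
    n_per_trial = len(acquired_per_trial[0])
    observed = len(set().union(*acquired_per_trial))
    # minimum: identical selections in every trial after the random initializations
    minimum = n_per_trial + (n_trials - 1) * n_init
    # maximum: completely disjoint selections in every trial
    maximum = n_trials * n_per_trial
    return observed, minimum, maximum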
Limitations of evaluation metrics
In this study, three separate evaluation criteria were used to assess performance: the average top-k docking score identified, the fraction of top-k SMILES identified, and the fraction of top-k scores identified throughout the optimization campaign (calculation details are provided in the Methods section below). The average metric is sensitive to the scale and distribution of scores, making direct comparison between datasets challenging. The top-k SMILES metric asks whether a specific set of molecules is selected by the algorithm, which can be overly strict if multiple molecules have identical performance (i.e., the same score) but are not within the top-k due to arbitrary data sorting (Figures S2-S4). The top-k score metric overcomes this limitation by assigning equal weight to each molecule with the same score regardless of its specific position in the sorted dataset. As a result, this makes the scores metric more forgiving for datasets with smaller ranges and lower precision (e.g., those calculated using AutoDock Vina with a precision of 0.1) than those with larger ranges and higher precision (e.g., those calculated using DOCK with a precision of 0.01). In contrast to the average top-k score, however, this metric does not reward "near-misses", for example, identifying the k+1 ranked molecule with a nearly identical score to the k-th molecule.

Optimal batch size for active learning
The number of molecules selected at each iteration represents an additional hyperparameter for Bayesian optimization. In one limit, Bayesian optimization can be conducted in a purely sequential fashion, acquiring the performance of a single molecule each iteration. Fully sequential learning would offer the most up-to-date surrogate model for the acquisition of each new point, but it would also be extremely costly to continually retrain the model and perform inference on the entire candidate pool. In the other limit, all molecules would be selected in a single iteration, which can lead to suboptimal performance depending on the acquisition size (Figure 6). Finding a principled balance between these two without resorting to empirical hyperparameter optimization is an ongoing challenge. In each of our experiments, the relative batch size was held constant at one sixth of the total exploration size. Future performance engineering work will seek to examine the effects of dynamic batch sizes in batched optimization. Note that overall batch diversity is another consideration in batched Bayesian optimization. While selected batches in this study did not appear to suffer from homogeneity, previous approaches to improve batch diversity could be explored as well.

Cost of surrogate model (re)training

The question of optimal batch size cannot be decoupled from the computational cost of model retraining and inference. Throughout these studies, we have focused only on the number of objective function calculations necessary to achieve a given level of performance. While objective function calculation is significantly more expensive than the cost of model training and inference, inference costs scale linearly with the size of the dataset and contribute to the overall cost of our algorithm.

The MPN model was shown to be superior in the largest datasets (Enamine HTS and AmpC), but its costs are markedly higher than those of the fingerprint-based NN model. The tradeoff between sample efficiency (number of objective function calculations) and surrogate model costs (training and inference) should be balanced when selecting a model architecture. In our evaluation, the costs of the MPN and NN cannot be directly compared due to differences in their implementation and extent of both parallelization and precalculation. For more details, see the software design subsection in the supplementary text. An additional choice when seeking to limit surrogate model costs is whether to train the model online with newly acquired data or fully retrain the model at each iteration with all acquired data. We examined this possibility, but online learning led to consistently lower performance in our experiments (Tables S1-S8).
Conclusion
In this work, we have demonstrated the application of Bayesian optimization to the prioritization of compounds for structure-based virtual screening using chemical libraries ranging in size from 10k to 100M ligands. A thorough evaluation of different acquisition metrics and surrogate model architectures illustrates the surprisingly strong performance of a greedy acquisition strategy and the superiority of a message-passing neural network over fingerprint-based feed-forward neural network or random forest models. In the largest library tested, the 100M member library screened against 12LS by Lyu et al., we identify 87.9% of the top-50000 scoring ligands after docking only 2.4% of the library, suggesting that an exhaustive screen of all 100M compounds is not needed. Moreover, this approach is also relevant to experimental high-throughput screening, an expensive and important tool for challenging drug discovery problems. Future work will seek to extend the open source
MolPAL software package and leverage it in a prospective manner to greatly accelerate a structure-based virtual screen of the Enamine REAL database. We also hope to expand
MolPAL beyond the initial software detailed in this report with the addition of new surrogate model architectures, the inclusion of improved uncertainty estimation techniques, and the expansion to other forms of virtual discovery, i.e., other objective functions. Finally, we envision that in addition to accelerating the practice of virtual screening,
MolPAL will increase its accessibility through the pipelining of the entire virtual screening process behind a common interface.
Methods
Batched Bayesian optimization
Bayesian optimization is an active learning strategy that iteratively selects experiments to perform according to a surrogate model's predictions, often using machine learning (ML) models. In the context of this work, the Bayesian optimization was performed on a discrete set of candidate molecules, herein referred to as a "pool" or virtual library, and points were acquired in batches rather than one point at a time. Batched Bayesian optimization begins by first calculating the objective function $f(x)$ for a set of $n$ random points $\{x_i\}_{i=1}^{n}$ within a pool of points $\mathcal{X}$. The objective function values for these points are calculated, in this study as docking scores from AutoDock Vina, and the corresponding tuples $\{(x_i, f(x_i))\}_{i=1}^{n}$ initialize the dataset $\mathcal{D}$. A surrogate model $\hat{f}(x)$ is then trained on these data and, along with the current maximum objective function value $f^*$, is passed to an acquisition function $\alpha(x; \hat{f}, f^*)$, which calculates the utility of evaluating the objective function at the point $x$. Utility may be measured in a number of ways: the predicted objective function value, the amount of information this new point will provide the surrogate model, the likelihood this point will improve upon the current maximum, etc. See ref. 41 for a detailed discussion of various acquisition functions. The set of $m$ points with the largest sum of utilities, $\mathcal{S} = \operatorname*{argmax}_{\{x_i\}_{i=1}^{m} \subset \mathcal{X}} \sum_{i=1}^{m} \alpha(x_i; \hat{f}, f^*)$, is then selected, or "acquired". The set of objective function values corresponding to these points is calculated, $\{(x, f(x)) : x \in \mathcal{S}\}$, and used to update the dataset $\mathcal{D}$. This process is repeated iteratively until a stopping criterion is met (e.g., a fixed number of iterations or a lack of sufficient improvement).
Algorithm 1: Batched Bayesian Optimization

Input: objective function $f(x)$, acquisition function $\alpha$, surrogate model $\hat{f}(x)$, candidate set $\mathcal{X}$
Select random batch $\mathcal{S} \subset \mathcal{X}$
Initialize $\mathcal{D} \leftarrow \{(x, f(x)) : x \in \mathcal{S}\}$
for $t \leftarrow 1, \ldots, T$ do
    Train surrogate model $\hat{f}(x)$ using $\mathcal{D}$
    Select batch $\mathcal{S} \leftarrow \operatorname*{argmax}_{\mathcal{B} \subset \mathcal{X},\, |\mathcal{B}| = m} \sum_{x \in \mathcal{B}} \alpha(x; \hat{f}, f^*)$
    Update $\mathcal{D} \leftarrow \mathcal{D} \cup \{(x, f(x)) : x \in \mathcal{S}\}$
end
Result: $\{x_i\}_{i=1}^{k*} = \operatorname*{argmax}_{\{x_i\}_{i=1}^{k} \subset \mathcal{D}} \sum_{i=1}^{k} f(x_i)$
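Algorithm 1 maps directly onto a short Python loop. The sketch below uses hypothetical surrogate and acquisition interfaces (fit/predict methods and a utility function) rather than MolPAL's actual classes, which are described in the Software design subsection:

import numpy as np

def batched_bayes_opt(pool, objective, surrogate, acquisition,
                      n_init=100, batch_size=100, n_iters=5, seed=0):
    rng = np.random.default_rng(seed)
    acquired = {int(i): objective(pool[i])
                for i in rng.choice(len(pool), n_init, replace=False)}
    for _ in range(n_iters):
        surrogate.fit([pool[i] for i in acquired], list(acquired.values()))
        mu, sigma = surrogate.predict(pool)          # inference over the full pool
        utility = acquisition(mu, sigma, max(acquired.values()))
        utility[list(acquired)] = -np.inf            # only acquire untested molecules
        for i in np.argsort(utility)[-batch_size:]:  # top-m by utility
            acquired[int(i)] = objective(pool[i])
    return acquired                                  # contains the top-k found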
Acquisition metrics

The following acquisition functions were tested in this study:

$\text{Random}(x) \sim \mathcal{U}(0, 1)$

$\text{Greedy}(x) = \hat{\mu}(x)$

$\text{UCB}(x) = \hat{\mu}(x) + \beta \hat{\sigma}(x)$

$\text{TS}(x) \sim \mathcal{N}(\hat{\mu}(x), \hat{\sigma}^2(x))$

$\text{EI}(x) = \begin{cases} \gamma(x)\Phi(z) + \hat{\sigma}(x)\phi(z), & \hat{\sigma}(x) > 0 \\ \gamma(x), & \hat{\sigma}(x) = 0 \end{cases}$

$\text{PI}(x) = \begin{cases} \Phi(z), & \hat{\sigma}(x) > 0 \\ 1, & \hat{\sigma}(x) = 0 \text{ and } \gamma(x) > 0 \\ 0, & \hat{\sigma}(x) = 0 \text{ and } \gamma(x) \leq 0 \end{cases}$

where $\gamma(x) := \hat{\mu}(x) - f^* + \xi$; $z(x) := \gamma(x)/\hat{\sigma}(x)$; $\hat{\mu}(x)$ and $\hat{\sigma}(x)$ are the surrogate model predicted mean and uncertainty at point $x$, respectively; $\Phi$ and $\phi$ are the CDF and PDF of the standard normal distribution, respectively; and $f^*$ is the current maximum objective function value. For the experiments reported in the paper, we used $\beta = 2$ and $\xi = 0.01$.
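These metrics translate directly into NumPy/SciPy. A minimal sketch following the piecewise definitions above (greedy is simply the identity on $\hat{\mu}(x)$, so it is omitted; the $\hat{\sigma}(x) = 0$ branches are handled via np.where):

import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, f_star, beta=2.0):
    return mu + beta * sigma

def thompson(mu, sigma, f_star, rng=None):
    rng = rng or np.random.default_rng()
    return rng.normal(mu, sigma)

def ei(mu, sigma, f_star, xi=0.01):
    gamma = mu - f_star + xi
    z = np.divide(gamma, sigma, out=np.zeros_like(gamma), where=sigma > 0)
    return np.where(sigma > 0, gamma * norm.cdf(z) + sigma * norm.pdf(z), gamma)

def pi(mu, sigma, f_star, xi=0.01):
    gamma = mu - f_star + xi
    z = np.divide(gamma, sigma, out=np.zeros_like(gamma), where=sigma > 0)
    return np.where(sigma > 0, norm.cdf(z), (gamma > 0).astype(float))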
Surrogate models

In the context of our study, the surrogate modelling step involved training a machine learning (ML) model and using this trained model to predict the objective function value of molecules for which the objective function has not yet been calculated. Three surrogate model architectures were investigated in these studies: random forest (RF), feed-forward neural network (NN), and directed message passing neural network (MPN) models.
Random forest models
Random forest (RF) models operate by ensembling decision tree models, each trained on a random subset of the training data. Broadly, a decision tree is trained on input data of the form $(\mathbf{x}, y)$, where $\mathbf{x} = [x_1, \ldots, x_N]$ is a vector composed of input features $x_1, \ldots, x_N$ and $y$ is a "target" output value. During training, a decision tree is built by progressively partitioning the input space at decision nodes into two subspaces corresponding to the value of a given feature. This process is repeated recursively on each of the resulting subspaces until some maximum amount of partitioning is achieved and a leaf node is constructed that contains all possible values $\{y_i\}_{i=1}^{M}$ that correspond to the partitioning of the parent decision nodes. A more in-depth discussion of RF models for QSAR may be found in ref. 42. The RF surrogate model in our study was implemented using the RandomForestRegressor class from Scikit-Learn with an n_estimators value of 100 and a max_depth value of 8.
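In code, instantiating the RF surrogate looks like the sketch below (placeholder data stand in for fingerprints and docking scores); the per-tree variance shown is one common way to obtain the uncertainty required by the non-greedy metrics, though the text does not specify the exact estimator used:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.random.randint(0, 2, (500, 2048))  # placeholder fingerprint matrix
y_train = np.random.randn(500)                  # placeholder docking scores

rf = RandomForestRegressor(n_estimators=100, max_depth=8, n_jobs=-1)
rf.fit(X_train, y_train)

# mean over the trees as the prediction, spread across trees as the uncertainty
X_pool = np.random.randint(0, 2, (100, 2048))
per_tree = np.stack([t.predict(X_pool) for t in rf.estimators_])
mu_hat, sigma_hat = per_tree.mean(axis=0), per_tree.std(axis=0)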
Feed-forward neural networks

Feed-forward neural networks (FFNNs) comprise an input layer, an output layer, and zero or more hidden layers between the two. At each layer, the input or hidden vector is linearly transformed by a learned weight matrix and passed through an elementwise nonlinear activation function. NNs are generally trained through stochastic gradient descent to minimize a loss function, which for regression tasks is generally the mean squared error.

The NN models in our study were implemented in TensorFlow using two fully connected hidden layers of 100 nodes each and an output size of one. Each hidden layer utilized a rectified linear unit (ReLU) activation function. The network was trained over 50 epochs with early stopping using the Adam optimizer with an initial learning rate of 0.01, a mean-squared error loss function with L2 regularization (0.01), and a batch size of 4096. Prediction uncertainties, as needed for non-greedy acquisition metrics, were estimated using Monte-Carlo dropout via dropout layers ($p = 0.2$) after each hidden layer using 10 forward passes during inference.
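A minimal TensorFlow sketch of this architecture and the MC dropout inference, using the hyperparameters stated above (the exact MolPAL implementation may differ in its details):

import numpy as np
import tensorflow as tf

reg = tf.keras.regularizers.l2(0.01)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu", kernel_regularizer=reg,
                          input_shape=(2048,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_regularizer=reg),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mse")
# model.fit(X_train, y_train, epochs=50, batch_size=4096, ...)  # with early stopping

# MC dropout: keep dropout active at inference (training=True) and average
X_pool = np.random.rand(64, 2048).astype("float32")
samples = np.stack([model(X_pool, training=True).numpy().ravel() for _ in range(10)])
mu_hat, sigma_hat = samples.mean(axis=0), samples.std(axis=0)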
Molecular fingerprints
Given that molecules are not naturally represented as feature vectors, inputs to both RF and NN models were generated by calculating molecular fingerprints. Fingerprint calculation algorithms vary in their implementation but can broadly be understood as encoding information about the presence or absence of substructures of the given molecule into a vector of fixed length. The input to both the RF and NN models used in this study is a 2048-bit atom-pair fingerprint with a minimum radius of one and a maximum radius of three. A more detailed overview of molecular fingerprints is provided in ref. 46.
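With RDKit, this featurization can be written as below; note that we map the "radius" bounds in the text onto the atom-pair minLength/maxLength path-length parameters, which is an assumption on our part:

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors

def featurize(smiles):
    # 2048-bit atom-pair fingerprint with path lengths 1-3
    mol = Chem.MolFromSmiles(smiles)
    fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(
        mol, nBits=2048, minLength=1, maxLength=3)
    arr = np.zeros((2048,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.stack([featurize(smi) for smi in ["CCO", "c1ccccc1"]])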
Message passing neural networks

The third and final model architecture we tested was a directed message passing neural network (D-MPNN) model, a variant of a message passing neural network (MPNN) model. In contrast to FFNNs, MPNNs operate directly on the molecular graph rather than a fixed feature vector calculated from the graph. MPNNs function in two stages, an initial message passing phase followed by a readout phase. In the message passing phase, "messages" are passed between atoms and/or bonds and their direct neighbors, and incoming messages are used to update the "hidden state" of each atom and/or bond. The message passing phase is repeated over multiple (e.g., 3) iterations, at which point the hidden states of each atom are aggregated (e.g., summed) to produce a molecule-level feature vector. By training this model at the same time as a FFNN operating on the feature vector, MPNNs are able to learn a task-specific representation of an input molecular graph. For more details on the D-MPNN model, we refer a reader to ref. 33.

Message-passing neural network models were implemented using PyTorch with the MoleculeModel class from the
Chemprop library using standard settings: messages passed on directed bonds, messages subjected to ReLU activation, a learned encoded representation of dimension 300, and the output of the message-passing phase fully connected to an output layer of size 1. The model was trained using the Adam optimization algorithm, a Noam learning rate scheduler (initial, maximum, and final learning rates of $10^{-4}$, $10^{-3}$, and $10^{-4}$, respectively), and a root mean-squared error loss function over 50 epochs with a batch size of 50. For more details on the Noam learning rate scheduler, see ref. 48. The model was trained without early stopping, but the model state was reloaded from the epoch with the lowest validation score after the final training epoch. When uncertainty values were needed for metric function calculation, an MVE model based off of the work done by Hirschfeld et al. was used. This model featured an output size of two and was trained using the loss function defined by Nix and Weigend:

$$\mathcal{L}(y, \hat{y}, \hat{\sigma}^2) = \frac{\log(2\pi\hat{\sigma}^2)}{2} + \frac{(y - \hat{y})^2}{2\hat{\sigma}^2} \qquad (2)$$

All of the surrogate models were used exactly as described above without additional hyperparameter optimization. The models were fully retrained from scratch with all acquired data at the beginning of each iteration.
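Equation 2 can be implemented directly in PyTorch. Predicting $\log \hat{\sigma}^2$ rather than $\hat{\sigma}^2$ for numerical stability is our implementation choice here, not necessarily that of the original code:

import math
import torch

def mve_loss(y, y_hat, log_var):
    # negative log likelihood of Nix and Weigend (Equation 2)
    var = torch.exp(log_var)
    return (0.5 * torch.log(2 * math.pi * var)
            + (y - y_hat) ** 2 / (2 * var)).mean()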
Datasets

The datasets generated for these studies were produced by docking the compounds contained in both Discovery Diversity sets and the HTS collection from Enamine against thymidylate kinase (PDB ID: 4UNN). The docking was performed using AutoDock Vina with the following command line arguments: --receptor=4UNN.pdbqt --ligand=
All other default arguments were left as-is. The ligands were prepared from SDFs available from Enamine, parsed into SMILES strings using RDKit, and processed into PDBQT files using OpenBabel with the --gen-3d flag. The receptor was prepared with PDBFixer using the PDB ID 4UNN, selecting only chain A, deleting all heterogens, adding all suggested missing residues and heavy atoms, and adding hydrogens for pH 7.0. The AmpC screening dataset is the publicly available dataset published by Lyu et al. and was used as-is. The score distribution of each dataset may be found in Figures S2-S5.
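The receptor preparation can be scripted with the PDBFixer Python API. The sketch below omits the chain-A selection step and uses the OpenMM PDB writer; module paths vary between OpenMM versions:

from pdbfixer import PDBFixer
from openmm.app import PDBFile  # `simtk.openmm.app` in older OpenMM versions

fixer = PDBFixer(pdbid="4UNN")
fixer.removeHeterogens(keepWater=False)  # delete all heterogens
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()                  # suggested missing residues/heavy atoms
fixer.addMissingHydrogens(7.0)           # hydrogens for pH 7.0
with open("4UNN_prepared.pdb", "w") as f:
    PDBFile.writeFile(fixer.topology, fixer.positions, f)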
Evaluation metrics
MolPAL performance was judged through three evaluation metrics: (1) the average top-k docking score identified ("Average"), (2) the fraction of top-k SMILES identified ("SMILES"), and (3) the fraction of top-k scores identified ("Scores"). For (1), the average of the top-k scores of molecules explored by MolPAL was taken and divided by the average of the true top-k molecules' scores based on the full dataset. (2) was calculated by taking the intersection of the set of SMILES strings in the true top-k molecules and the found top-k molecules and dividing its size by k. (3) was calculated as the size of the intersection of the list of true top-k scores and the list of observed top-k scores divided by k.
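Concretely (a sketch assuming inputs sorted best-first, i.e., most negative docking score first; a multiset intersection implements the Scores metric so that duplicate score values are counted):

from collections import Counter
import numpy as np

def evaluate(found_smis, found_scores, true_smis, true_scores, k):
    average = np.mean(found_scores[:k]) / np.mean(true_scores[:k])  # (1)
    smiles = len(set(found_smis[:k]) & set(true_smis[:k])) / k      # (2)
    overlap = Counter(found_scores[:k]) & Counter(true_scores[:k])  # (3)
    scores = sum(overlap.values()) / k
    return average, smiles, scores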
Hyperparameter optimization

The experiments shown in this study represent only a small fraction of the configurations in which
MolPAL may be run. A sample of settings that are supported include various fingerprint types (e.g., RDKit, Morgan, and MACCS), input preclustering for cluster-based acquisition, different confidence estimation methods for deep learning models, etc. Given the wealth of options, an exhaustive hyperparameter optimization was outside the scope of these investigations. We looked at broad trends in both the Enamine 10k and 50k datasets and found only minor variations in performance, supporting our choice not to pursue a rigorous screening of all possible configurations.
Software design
MolPAL is built around the Explorer class, which performs the Bayesian optimization routine shown in Algorithm 1. The Explorer is designed with abstraction at its core and thus relies on four helper classes that each handle an isolated element of Bayesian optimization: MoleculePool, Model, Acquirer, and Objective. In each iteration of the Explorer's main optimization loop, a Model is first retrained on all observed data and then applied to all molecules in the MoleculePool to generate a predicted mean and, depending on the Model, a predicted uncertainty for each molecule in the pool. These predictions are then passed to an Acquirer, which calculates the acquisition utility of each molecule and acquires the top-m untested molecules from the MoleculePool based on their acquisition utilities. Next, this set of candidate molecules is handed to an Objective, and the objective function value for each of these candidate molecules is calculated. Lastly, the stopping condition for the Explorer is checked and, if satisfied, the Explorer terminates and outputs the top-k evaluated molecules. A schematic of both the design and workflow of MolPAL may be seen in Figure S1. The experiments performed in this study were all performed retrospectively using the LookupObjective subclass of the Objective with a fully generated dataset as a lookup table for objective function calculation.
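For instance, a lookup-style objective needs only to wrap a dictionary. The sketch below is illustrative, and its constructor arguments are hypothetical rather than MolPAL's exact signature:

import csv

class LookupObjective:
    def __init__(self, scores_csv, smiles_col=0, score_col=1, minimize=True):
        # docking scores are minimized, so negate them to cast the problem
        # as maximization (Equation 1)
        self.c = -1.0 if minimize else 1.0
        with open(scores_csv) as f:
            rows = csv.reader(f)
            next(rows)  # skip header
            self.data = {r[smiles_col]: float(r[score_col]) for r in rows}

    def calc(self, smis):
        # molecules absent from the lookup table (e.g., failed docking) -> None
        return {s: self.c * self.data[s] if s in self.data else None
                for s in smis}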
Acknowledgement
We thank Samuel Goldman and Itai Levin for providing feedback on the manuscript and
MolPAL code. The computations in this paper were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University. This work was funded by NIGMS R01 068670.
Supporting Information Available
Additional methods and results can be found in the supporting information. All code and data needed to reproduce this study and the figures herein can be found at https://github.com/coleygroup/molpal
References

(1) Yu, W.; MacKerell, A. D. In Antibiotics: Methods and Protocols; Sass, P., Ed.; Methods in Molecular Biology; Springer: New York, NY, 2017; pp 85–106.
(2) Macalino, S. J. Y.; Gosu, V.; Hong, S.; Choi, S. Role of computer-aided drug design in modern drug discovery. Archives of Pharmacal Research, 1686–1701.
(3) Li, J.; Fu, A.; Zhang, L. An Overview of Scoring Functions Used for Protein–Ligand Interactions in Molecular Docking. Interdisciplinary Sciences: Computational Life Sciences, 320–328.
(4) Irwin, J. J.; Shoichet, B. K. Docking Screens for Novel Ligands Conferring New Biology. Journal of Medicinal Chemistry, 4103–4120.
(5) Irwin, J. J.; Shoichet, B. K. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. Journal of Chemical Information and Modeling, 177–182.
(6) Sterling, T.; Irwin, J. J. ZINC 15 – Ligand Discovery for Everyone. Journal of Chemical Information and Modeling, 2324–2337.
(7) REAL Database - Enamine. https://enamine.net/library-synthesis/real-compounds/real-database, accessed 09/15/2020.
(8) Knehans, T.; Klingler, F.-M.; Kraut, H.; Saller, H.; Herrmann, A.; Rippmann, F.; Eiblmaier, J.; Lemmen, C.; Krier, M. Merck AcceSSible InVentory (MASSIV): In silico synthesis guided by chemical transforms obtained through bootstrapping reaction databases. Abstracts of Papers of the American Chemical Society, 2017.
(9) Nicolaou, C. A.; Watson, I. A.; Hu, H.; Wang, J. The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space. Journal of Chemical Information and Modeling, 1253–1266.
(10) Hu, Q.; Peng, Z.; Sutton, S. C.; Na, J.; Kostrowicki, J.; Yang, B.; Thacher, T.; Kong, X.; Mattaparti, S.; Zhou, J. Z.; Gonzalez, J.; Ramirez-Weinhouse, M.; Kuki, A. Pfizer Global Virtual Library (PGVL): A Chemistry Design Tool Powered by Experimentally Validated Parallel Synthesis Information. ACS Combinatorial Science, 579–589.
(11) Clark, D. E. Virtual Screening: Is Bigger Always Better? Or Can Small Be Beautiful? Journal of Chemical Information and Modeling, 4120–4123.
(12) Gorgulla, C.; Boeszoermenyi, A.; Wang, Z.-F.; Fischer, P. D.; Coote, P. W.; Padmanabha Das, K. M.; Malets, Y. S.; Radchenko, D. S.; Moroz, Y. S.; Scott, D. A.; Fackeldey, K.; Hoffmann, M.; Iavniuk, I.; Wagner, G.; Arthanari, H. An open-source drug discovery platform enables ultra-large virtual screens. Nature, 663–668.
(13) Lyu, J.; Wang, S.; Balius, T. E.; Singh, I.; Levit, A.; Moroz, Y. S.; O'Meara, M. J.; Che, T.; Algaa, E.; Tolmachova, K.; Tolmachev, A. A.; Shoichet, B. K.; Roth, B. L.; Irwin, J. J. Ultra-large library docking for discovering new chemotypes. Nature, 224–229.
(14) Acharya, A. et al. Supercomputer-Based Ensemble Docking Drug Discovery Pipeline with Application to Covid-19. ChemRxiv.
(15) McGann, M.; OpenEye Scientific. GigaDocking™ - Structure Based Virtual Screening of Over 1 Billion Molecules Webinar. 2019; accessed 09/01/2020.
(16) Frazier, P. I. A Tutorial on Bayesian Optimization. arXiv:1807.02811 [cs, math, stat].
(17) Balachandran, P. V.; Xue, D.; Theiler, J.; Hogden, J.; Lookman, T. Adaptive Strategies for Materials Design using Uncertainties. Scientific Reports, 19660.
(18) Gubaev, K.; Podryabinkin, E. V.; Hart, G. L. W.; Shapeev, A. V. Accelerating high-throughput searches for new alloys with active learning of interatomic potentials. Computational Materials Science, 148–156.
(19) Xue, D.; Balachandran, P. V.; Hogden, J.; Theiler, J.; Xue, D.; Lookman, T. Accelerated search for materials with targeted properties by adaptive design. Nature Communications, 11241.
(20) Montoya, J. H.; Winther, K. T.; Flores, R. A.; Bligaard, T.; Hummelshøj, J. S.; Aykol, M. Autonomous intelligent agents for accelerated materials discovery. Chemical Science, 8517–8532.
(21) Bilsland, E.; Sparkes, A.; Williams, K.; Moss, H. J.; de Clare, M.; Pir, P.; Rowland, J.; Aubrey, W.; Pateman, R.; Young, M.; Carrington, M.; King, R. D.; Oliver, S. G. Yeast-based automated high-throughput screens to identify anti-parasitic lead compounds. Open Biology 3, 120158.
(22) Czechtizky, W.; Dedio, J.; Desai, B.; Dixon, K.; Farrant, E.; Feng, Q.; Morgan, T.; Parry, D. M.; Ramjee, M. K.; Selway, C. N.; Schmidt, T.; Tarver, G. J.; Wright, A. G. Integrated Synthesis and Testing of Substituted Xanthine Based DPP4 Inhibitors: Application to Drug Discovery. ACS Medicinal Chemistry Letters, 768–772.
(23) Williams, K.; Bilsland, E.; Sparkes, A.; Aubrey, W.; Young, M.; Soldatova, L. N.; De Grave, K.; Ramon, J.; de Clare, M.; Sirawaraporn, W.; Oliver, S. G.; King, R. D. Cheaper faster drug development validated by the repositioning of drugs against neglected tropical diseases. Journal of The Royal Society Interface, 20141289.
(24) Janet, J. P.; Ramesh, S.; Duan, C.; Kulik, H. J. Accurate Multiobjective Design in a Space of Millions of Transition Metal Complexes with Neural-Network-Driven Efficient Global Optimization. ACS Central Science, acscentsci.0c00026.
(25) Ghanakota, P.; Bos, P.; Konze, K.; Staker, J.; Marques, G.; Marshall, K.; Leswing, K.; Abel, R.; Bhat, S. Combining Cloud-Based Free Energy Calculations, Synthetically Aware Enumerations and Goal-Directed Generative Machine Learning for Rapid Large-Scale Chemical Exploration and Optimization. ChemRxiv.
(26) Konze, K. D.; Bos, P. H.; Dahlgren, M. K.; Leswing, K.; Tubert-Brohman, I.; Bortolato, A.; Robbason, B.; Abel, R.; Bhat, S. Reaction-Based Enumeration, Active Learning, and Free Energy Calculations To Rapidly Explore Synthetically Tractable Chemical Space and Optimize Potency of Cyclin-Dependent Kinase 2 Inhibitors. Journal of Chemical Information and Modeling, 3782–3793.
(27) Gentile, F.; Agrawal, V.; Hsing, M.; Ban, F.; Norinder, U.; Gleave, M. E.; Cherkasov, A. Deep Docking - a Deep Learning Approach for Virtual Screening of Big Chemical Datasets. bioRxiv, 2019.12.15.877316.
(28) Pyzer-Knapp, E. O. Using Bayesian Optimization to Accelerate Virtual Screening for the Discovery of Therapeutics Appropriate for Repurposing for COVID-19. arXiv:2005.07121 [cs, q-bio].
(29) Hernández-Lobato, J. M.; Requeima, J.; Pyzer-Knapp, E. O.; Aspuru-Guzik, A. Parallel and Distributed Thompson Sampling for Large-scale Accelerated Exploration of Chemical Space. arXiv:1706.01825 [stat].
(30) Gibbs, M.; MacKay, D. J. C. Efficient Implementation of Gaussian Processes; 1997.
(31) Naik, M. et al. Structure Guided Lead Generation for M. tuberculosis Thymidylate Kinase (Mtb TMK): Discovery of 3-Cyanopyridone and 1,6-Naphthyridin-2-one as Potent Inhibitors. Journal of Medicinal Chemistry, 753–766.
(32) Trott, O.; Olson, A. J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 455–461.
(33) Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; Palmer, A.; Settels, V.; Jaakkola, T.; Jensen, K.; Barzilay, R. Analyzing Learned Molecular Representations for Property Prediction. Journal of Chemical Information and Modeling, 3370–3388.
(34) Balius, T.; Lyu, J.; Shoichet, B. K.; Irwin, J. J. AmpC screen table.csv.gz. 2018; https://figshare.com/articles/AmpC_screen_table_csv_gz/7359626, accessed 06/01/2020.
(35) McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 [cs, stat].
(36) Desautels, T.; Krause, A.; Burdick, J. W. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. Journal of Machine Learning Research, 4053–4103.
(37) Tsymbalov, E.; Makarychev, S.; Shapeev, A.; Panov, M. Deeper Connections between Neural Networks and Gaussian Processes Speed-up Active Learning. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 3599–3605; arXiv:1902.10350.
(38) Hirschfeld, L.; Swanson, K.; Yang, K.; Barzilay, R.; Coley, C. W. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. Journal of Chemical Information and Modeling, 3770–3780.
(39) Wang, Z.; Gehring, C.; Kohli, P.; Jegelka, S. Batched Large-scale Bayesian Optimization in High-dimensional Spaces. arXiv:1706.01445 [cs, math, stat].
(40) Schuffenhauer, A. et al. Evolution of Novartis' Small Molecule Screening Deck Design. Journal of Medicinal Chemistry.
(41) Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 148–175.
(42) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 1947–1958.
(43) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V., et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 2825–2830.
(44) Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467.
(45) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 64–73.
(46) Bajusz, D.; Rácz, A.; Héberger, K. Reference Module in Chemistry, Molecular Sciences and Chemical Engineering; 2017.
(47) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019; pp 8026–8037.
(48) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv:1706.03762 [cs].
(49) Nix, D. A.; Weigend, A. S. Estimating the mean and variance of the target probability distribution. Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), 1994; pp 55–60, vol. 1.
(50) Diversity Libraries - Enamine. https://enamine.net/hit-finding/diversity-libraries, accessed 04/01/2020.
(51) HTS Collection - Enamine. https://enamine.net/hit-finding/compound-collections/screening-collection/hts-collection, accessed 04/01/2020.
(52) RDKit. http://rdkit.org/, accessed 10/20/2020.
(53) O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 33.
(54) Eastman, P.; Swails, J.; Chodera, J. D.; McGibbon, R. T.; Zhao, Y.; Beauchamp, K. A.; Wang, L.-P.; Simmonett, A. C.; Harrigan, M. P.; Stern, C. D.; Wiewiora, R. P.; Brooks, B. R.; Pande, V. S. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLOS Computational Biology, 1–17.

Supporting Information

Accelerating High-Throughput Virtual Screening Through Molecular Pool-Based Active Learning
David E. Graff,† Eugene I. Shakhnovich,† and Connor W. Coley∗,‡
†Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA
‡Department of Chemical Engineering, MIT, Cambridge, MA
E-mail: [email protected]
Additional Methods
Elaboration on software design
The design choices detailed in the Software design paragraph of the Methods section are critical to both the testing and extension of the MolPAL software. Namely, the decision to rely on the MoleculePool, Model, Acquirer, and Objective helper classes enables rapid and facile testing of different combinations of model architectures and acquisition strategies for a given optimization objective. This choice also enables the straightforward extension of MolPAL with new surrogate model architectures, acquisition metrics, and objective functions. The Model and Objective are both defined as minimal abstract base classes built around an adapter design pattern, which allows popular machine learning libraries (e.g., Scikit-Learn, PyTorch, and TensorFlow) to be interfaced with the Explorer class via the Model, and virtual screening software via the Objective. The MoleculePool is primarily an abstraction of a list of molecules: it stores each molecule's SMILES string and, if necessary, its precalculated fingerprint. The fingerprint is used for clustering, if desired, and as input to models expecting vector inputs (e.g., RF and NN models) during the model inference step. All data held by the MoleculePool are stored on disk to enable the seamless application of MolPAL to virtual libraries of all sizes. Molecular graphs, the inputs to the MPN model, cannot be stored either in memory or on disk due to their large memory footprint in the current implementation; as such, they are recalculated as necessary.

Figure S1: Overview of the MolPAL software structure and workflow.
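To make the adapter pattern concrete, the following is a minimal sketch of what such interfaces might look like. It is illustrative only; the class and method names are assumptions, not the actual MolPAL source.

from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Tuple

import numpy as np


class Model(ABC):
    """Common interface behind which any ML library can be adapted."""

    @abstractmethod
    def train(self, smis: Iterable[str], scores: Iterable[float]) -> None:
        """(Re)train the surrogate on acquired (SMILES, score) pairs."""

    @abstractmethod
    def predict(self, smis: Iterable[str]) -> List[Tuple[float, float]]:
        """Return a (mean, variance) prediction for each input molecule."""


class Objective(ABC):
    """Common interface behind which any scoring program can be adapted."""

    @abstractmethod
    def calc(self, smis: Iterable[str]) -> Dict[str, float]:
        """Map each SMILES string to its objective value (e.g., a docking score)."""


class RFModel(Model):
    """Example adapter: a scikit-learn random forest behind the Model interface."""

    def __init__(self, featurizer: Callable[[str], np.ndarray]):
        from sklearn.ensemble import RandomForestRegressor

        self.featurizer = featurizer  # SMILES -> fingerprint vector
        self.model = RandomForestRegressor(n_estimators=100)

    def train(self, smis, scores):
        X = np.array([self.featurizer(smi) for smi in smis])
        self.model.fit(X, np.array(list(scores)))

    def predict(self, smis):
        X = np.array([self.featurizer(smi) for smi in smis])
        # the spread of the per-tree predictions serves as a simple uncertainty proxy
        preds = np.stack([tree.predict(X) for tree in self.model.estimators_])
        return list(zip(preds.mean(axis=0), preds.var(axis=0)))

With this structure, swapping surrogate architectures or docking engines amounts to writing a new adapter subclass rather than modifying the Explorer's optimization loop.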
Alternative surrogate models
Feed-forward neural network models
Two alternative NN models were defined for confidence estimation purposes: an ensemble model and a mean-variance estimation (MVE) model. The ensemble model was identical to the base model except that an ensemble of five models was trained; each trained model was used for inference, and the mean and variance of the five separate predictions were taken as the predicted value and uncertainty estimate, respectively. The MVE model used an output layer of size two, a learning rate increased to 0.05 from 0.01, and the same loss function as the MPN-MVE model (Equation 2). Neither of these alternative models was used for the main experiments due to their lower performance compared to the dropout model.
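For reference, the MVE loss of Nix and Weigend treats the two output units as the mean and (log-)variance of a Gaussian and minimizes its negative log likelihood. A minimal PyTorch sketch, assuming Equation 2 of the main text takes this standard form (an assumption, not a reproduction of the MolPAL code):

import torch


def mve_loss(mean: torch.Tensor, log_var: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Negative log likelihood of y under N(mean, exp(log_var)); predicting the
    # log-variance keeps the predicted variance positive without a constraint.
    var = torch.exp(log_var)
    return 0.5 * (torch.log(2 * torch.pi * var) + (y - mean) ** 2 / var).mean()


# usage: a two-unit output layer supplies the mean and log-variance
out = torch.randn(8, 2, requires_grad=True)  # stand-in for the network output
y = torch.randn(8)
loss = mve_loss(out[:, 0], out[:, 1], y)
loss.backward()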
Directed-message passing neural network models
An MPN dropout model was also defined for confidence estimation purposes. This model was built similarly to the NN dropout model, the key difference being that the dropout layer was prepended to the hidden layer. Again, a dropout probability of 0.2 was used, and dropout was performed during model inference. The mean predicted value was calculated by averaging 10 forward passes through the model, and the variance of these predictions was used as the predicted uncertainty. This alternative model was not used in experiments due to its significantly higher inference cost.
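A minimal sketch of the Monte Carlo dropout inference procedure described above; the small feed-forward network here is a stand-in for illustration, not the actual MPN architecture:

import torch
import torch.nn as nn

# stand-in network: dropout prepended to the hidden layer, as described above
model = nn.Sequential(
    nn.Linear(2048, 300),
    nn.Dropout(p=0.2),
    nn.Linear(300, 300),
    nn.ReLU(),
    nn.Linear(300, 1),
)


def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 10):
    # keep dropout active at inference time so each forward pass is stochastic
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x).squeeze(-1) for _ in range(passes)])
    return preds.mean(dim=0), preds.var(dim=0)


mean, var = mc_dropout_predict(model, torch.randn(16, 2048))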
Retraining strategy
In addition to fully retraining the surrogate model from scratch on all acquired data, we tested an online training strategy, in which the trained surrogate model from the previous iteration was trained further on only the newly acquired data. Note that online training applies only to the NN and MPN models, as the RF model is reinitialized each time it is fit.
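The difference between the two strategies amounts to a single branch in the per-iteration model update. A toy sketch using a stand-in PyTorch regressor (the helpers below are hypothetical, not the MolPAL training loop):

import torch
import torch.nn as nn


def fit(model: nn.Module, X: torch.Tensor, y: torch.Tensor, epochs: int = 50) -> None:
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()


def update_surrogate(model, all_X, all_y, new_X, new_y, online: bool):
    if online:
        # continue from the previous iteration's weights, using new data only
        fit(model, new_X, new_y)
        return model
    # full retraining: reinitialize and fit on all data acquired so far
    model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    fit(model, all_X, all_y)
    return model


model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
X, y = torch.randn(64, 8), torch.randn(64)
model = update_surrogate(model, X, y, X[-16:], y[-16:], online=True)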
Additional Results

Dataset score distributions
Figure S2: Distribution of docking scores in the Enamine 10k dataset with a bin size of 0.1. The red, dashed line corresponds to the kth best score (k = 100).

Figure S3: Distribution of docking scores in the Enamine 50k dataset with a bin size of 0.1. The red, dashed line corresponds to the kth best score (k = 500).

Figure S4: Distribution of docking scores in the Enamine HTS dataset with a bin size of 0.1. The red, dashed line corresponds to the kth best score (k = 1000).

Figure S5: Distribution of docking scores in the AmpC dataset with a bin size of 0.1. The red, dashed line corresponds to the kth best score (k = 50000).

Library exploration across separate experiments

Figure S6: The total number of unique SMILES strings acquired across 5 greedy optimizations on the 10k and 50k libraries. The top black line is the theoretical maximum (i.e., repeated trials select distinct subsets of molecules to evaluate); the bottom black line is the theoretical minimum (i.e., repeated trials select identical subsets of molecules to evaluate).

Figure S7: The total number of unique SMILES strings acquired across 5 greedy optimizations on the Enamine HTS library. The top black line is the theoretical maximum (i.e., repeated trials select distinct subsets of molecules to evaluate); the bottom black line is the theoretical minimum (i.e., repeated trials select identical subsets of molecules to evaluate).

Online training strategy
Figure S8: Bayesian optimization performance on Enamine 10k screening data as measured by the percentage of top-100 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 1% acquisition (ca. 100 samples) and acquired 1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.

Figure S9: Bayesian optimization performance on Enamine 50k screening data as measured by the percentage of top-500 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 1% acquisition (ca. 500 samples) and acquired 1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.

Figure S10: Bayesian optimization performance on Enamine HTS screening data as measured by the percentage of top-1000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.4% acquisition (ca. 8,400 samples) and acquired 0.4% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.

Figure S11: Bayesian optimization performance on Enamine HTS screening data as measured by the percentage of top-1000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.2% acquisition (ca. 4,200 samples) and acquired 0.2% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.

Figure S12: Bayesian optimization performance on Enamine HTS screening data as measured by the percentage of top-1000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.1% acquisition (ca. 2,100 samples) and acquired 0.1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.
Figure S13: Bayesian optimization performance on AmpC screening data as measured by the percentage of top-50000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.4% acquisition (ca. 40,000 samples) and acquired 0.4% more each iteration for five iterations. Error bars reflect ± one standard deviation across three runs.

Figure S14: Bayesian optimization performance on AmpC screening data as measured by the percentage of top-50000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.2% acquisition (ca. 20,000 samples) and acquired 0.2% more each iteration for five iterations. Error bars reflect ± one standard deviation across three runs.

Figure S15: Bayesian optimization performance on AmpC screening data as measured by the percentage of top-50000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.1% acquisition (ca. 10,000 samples) and acquired 0.1% more each iteration for five iterations. Error bars reflect ± one standard deviation across three runs.

Bayesian Optimization Performance

Table S1: Final Bayesian optimization performance on Enamine 10k screening data with a 1.0% batch size as measured by the given metric using the top-100 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over five runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 51.6 (5.9) 44.8 (5.8) 98.21 (0.31)
retrain RF UCB 43.2 (3.4) 37.2 (3.1) 97.58 (0.25)
retrain RF TS 27.6 (1.9) 22.6 (2.7) 95.97 (0.34)
retrain RF EI 39.4 (9.5) 33.8 (9.1) 97.16 (0.76)
retrain RF PI 47.6 (4.2) 41.4 (3.3) 97.82 (0.24)
retrain NN Greedy 66.8 (5.4) 59.2 (6.1) 98.97 (0.20)
retrain NN UCB 58.0 (3.5) 51.2 (3.4) 98.59 (0.16)
retrain NN TS 61.4 (3.9) 54.6 (3.4) 98.73 (0.19)
retrain NN EI 56.0 (7.5) 49.8 (6.9) 98.42 (0.42)
retrain NN PI 57.8 (2.4) 51.6 (2.3) 98.55 (0.15)
retrain MPN Greedy 67.0 (3.0) 59.6 (2.3) 98.94 (0.12)
retrain MPN UCB 64.4 (4.1) 57.0 (3.2) 98.80 (0.17)
retrain MPN TS 60.0 (4.1) 52.2 (4.6) 98.58 (0.20)
retrain MPN EI 64.4 (3.8) 57.0 (3.6) 98.80 (0.19)
retrain MPN PI 63.0 (3.2) 55.2 (3.3) 98.76 (0.16)
online RF Greedy 46.2 (2.1) 40.8 (3.7) 97.86 (0.19)
online RF UCB 30.2 (5.8) 26.8 (5.3) 96.46 (0.51)
online RF TS 17.4 (3.2) 15.2 (3.2) 94.58 (0.31)
online RF EI 32.2 (7.2) 27.0 (5.8) 96.53 (0.49)
online RF PI 36.8 (6.3) 31.4 (5.4) 97.06 (0.58)
online NN Greedy 55.4 (6.2) 49.8 (6.4) 98.49 (0.30)
online NN UCB 43.4 (11.6) 38.4 (10.1) 97.59 (0.82)
online NN TS 51.2 (3.9) 45.8 (3.5) 98.16 (0.15)
online NN EI 28.0 (6.7) 24.6 (6.2) 96.09 (1.12)
online NN PI 37.6 (6.7) 33.0 (5.4) 96.93 (0.92)
online MPN Greedy 44.6 (6.9) 37.2 (6.0) 97.68 (0.50)
online MPN UCB 45.8 (7.1) 38.8 (6.3) 97.83 (0.47)
online MPN TS 41.4 (4.7) 37.0 (5.2) 97.44 (0.33)
online MPN EI 45.4 (5.0) 38.8 (3.4) 97.78 (0.27)
online MPN PI 39.4 (3.3) 33.2 (2.9) 97.40 (0.29)
Random 5.6 (0.8) 5.0 (0.9) 91.41 (0.34)

Table S2: Final Bayesian optimization performance on Enamine 50k screening data with a 1.0% batch size as measured by the given metric using the top-500 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over five runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 59.1 (2.9) 55.1 (3.0) 98.74 (0.15)
retrain RF UCB 49.0 (1.4) 46.9 (1.3) 98.16 (0.11)
retrain RF TS 39.8 (2.9) 37.6 (2.9) 97.49 (0.23)
retrain RF EI 41.9 (2.7) 40.1 (2.7) 97.62 (0.19)
retrain RF PI 45.5 (2.4) 43.4 (2.2) 97.92 (0.15)
retrain NN Greedy 74.8 (1.1) 70.1 (1.1) 99.39 (0.05)
retrain NN UCB 74.4 (1.4) 70.0 (1.2) 99.39 (0.04)
retrain NN TS 73.4 (2.3) 68.9 (2.3) 99.35 (0.07)
retrain NN EI 66.1 (3.0) 62.2 (2.9) 99.08 (0.12)
retrain NN PI 67.2 (4.0) 63.1 (3.5) 99.08 (0.16)
retrain MPN Greedy 72.9 (1.3) 68.8 (1.0) 99.34 (0.03)
retrain MPN UCB 72.7 (0.5) 68.4 (0.4) 99.33 (0.02)
retrain MPN TS 70.4 (0.7) 66.0 (0.7) 99.25 (0.05)
retrain MPN EI 67.2 (0.9) 62.9 (0.9) 99.12 (0.03)
retrain MPN PI 69.2 (1.7) 64.9 (1.5) 99.17 (0.07)
online RF Greedy 60.4 (1.2) 56.4 (0.9) 98.75 (0.04)
online RF UCB 43.5 (1.8) 41.4 (1.9) 97.72 (0.15)
online RF TS 28.0 (1.2) 26.7 (1.3) 96.30 (0.11)
online RF EI 38.6 (2.7) 37.1 (2.7) 97.33 (0.25)
online RF PI 46.4 (1.8) 44.6 (1.4) 98.00 (0.13)
online NN Greedy 66.5 (2.2) 62.8 (2.0) 99.09 (0.10)
online NN UCB 52.7 (10.1) 49.9 (9.3) 98.29 (0.54)
online NN TS 55.9 (3.5) 52.6 (3.3) 98.56 (0.17)
online NN EI 39.5 (9.2) 37.1 (8.6) 97.10 (1.05)
online NN PI 38.8 (7.2) 36.7 (6.8) 97.25 (0.70)
online MPN Greedy 58.4 (4.9) 54.8 (4.6) 98.70 (0.32)
online MPN UCB 64.6 (3.5) 60.6 (3.5) 99.00 (0.15)
online MPN TS 62.2 (2.7) 58.4 (2.3) 98.94 (0.09)
online MPN EI 63.3 (2.6) 59.2 (2.5) 98.95 (0.11)
online MPN PI 57.7 (8.1) 54.3 (7.6) 98.64 (0.44)
Random 6.6 (1.0) 6.1 (1.2) 91.36 (0.19)

Table S3: Final Bayesian optimization performance on Enamine HTS screening data with a 0.4% batch size as measured by the given metric using the top-1000 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over five runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 84.3 (1.1) 79.8 (0.9) 99.53 (0.02)
retrain RF UCB 68.2 (2.7) 65.2 (2.6) 99.03 (0.10)
retrain RF TS 74.1 (1.0) 70.3 (1.2) 99.26 (0.04)
retrain RF EI 44.8 (4.0) 43.2 (3.9) 97.91 (0.26)
retrain RF PI 43.5 (2.6) 42.2 (2.7) 97.80 (0.18)
retrain NN Greedy 95.7 (0.0) 93.2 (0.1) 99.89 (0.00)
retrain NN UCB 94.4 (0.5) 91.5 (0.8) 99.84 (0.02)
retrain NN TS 94.5 (0.2) 91.6 (0.3) 99.85 (0.01)
retrain NN EI 76.7 (2.3) 72.9 (2.2) 99.33 (0.05)
retrain NN PI 71.9 (1.9) 68.5 (1.8) 99.17 (0.07)
retrain MPN Greedy 97.7 (0.2) 94.8 (0.1) 99.94 (0.01)
retrain MPN UCB 98.1 (0.4) 94.9 (0.2) 99.94 (0.01)
retrain MPN TS 96.7 (0.3) 94.5 (0.1) 99.92 (0.01)
retrain MPN EI 95.8 (0.6) 93.7 (0.7) 99.89 (0.02)
retrain MPN PI 96.2 (0.6) 93.9 (0.7) 99.90 (0.02)
online RF Greedy 80.6 (2.3) 76.5 (2.1) 99.45 (0.05)
online RF UCB 56.4 (2.6) 54.0 (2.4) 98.59 (0.14)
online RF TS 60.0 (2.0) 57.1 (1.8) 98.69 (0.07)
online RF EI 45.0 (1.8) 43.5 (1.9) 97.93 (0.10)
online RF PI 43.2 (5.6) 41.8 (5.3) 97.80 (0.41)
online NN Greedy 93.0 (0.8) 89.8 (1.0) 99.79 (0.03)
online NN UCB 90.1 (1.1) 86.1 (1.6) 99.69 (0.03)
online NN TS 77.5 (11.3) 73.4 (10.5) 99.28 (0.40)
online NN EI 64.9 (5.6) 61.5 (5.5) 98.87 (0.24)
online NN PI 60.3 (14.5) 57.0 (13.8) 98.58 (0.55)
online MPN Greedy 95.2 (0.4) 93.2 (0.5) 99.87 (0.02)
online MPN UCB 96.7 (0.4) 94.5 (0.4) 99.91 (0.01)
online MPN TS 95.8 (0.3) 93.7 (0.5) 99.88 (0.02)
online MPN EI 90.2 (1.7) 85.7 (1.7) 99.71 (0.05)
online MPN PI 90.5 (1.1) 86.6 (1.3) 99.70 (0.04)
Random 2.6 (0.1) 2.4 (0.1) 90.09 (0.15)

Table S4: Final Bayesian optimization performance on Enamine HTS screening data with a 0.2% batch size as measured by the given metric using the top-1000 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over five runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 72.3 (1.9) 69.0 (1.9) 99.23 (0.07)
retrain RF UCB 51.0 (2.9) 48.9 (2.9) 98.25 (0.15)
retrain RF TS 57.5 (1.4) 54.8 (1.5) 98.60 (0.05)
retrain RF EI 32.6 (3.1) 31.3 (3.0) 96.86 (0.28)
retrain RF PI 29.3 (5.2) 28.3 (5.0) 96.54 (0.52)
retrain NN Greedy 88.8 (0.8) 83.9 (0.8) 99.63 (0.03)
retrain NN UCB 86.7 (0.5) 82.1 (0.6) 99.59 (0.01)
retrain NN TS 85.0 (0.9) 80.4 (0.9) 99.53 (0.03)
retrain NN EI 56.6 (4.3) 54.0 (4.1) 98.54 (0.23)
retrain NN PI 59.1 (3.1) 56.6 (3.0) 98.67 (0.13)
retrain MPN Greedy 93.7 (0.9) 90.7 (0.9) 99.81 (0.04)
retrain MPN UCB 94.3 (0.4) 91.6 (0.5) 99.84 (0.01)
retrain MPN TS 90.8 (0.4) 86.7 (0.6) 99.71 (0.01)
retrain MPN EI 91.7 (0.7) 87.4 (0.7) 99.74 (0.01)
retrain MPN PI 92.1 (0.7) 88.0 (0.9) 99.76 (0.03)
online RF Greedy 66.9 (2.4) 64.0 (2.4) 99.01 (0.11)
online RF UCB 45.8 (1.6) 44.1 (1.4) 97.94 (0.09)
online RF TS 38.5 (4.3) 36.9 (4.0) 97.39 (0.31)
online RF EI 30.0 (7.1) 29.0 (7.1) 96.46 (0.75)
online RF PI 32.3 (5.5) 31.1 (5.2) 96.79 (0.60)
online NN Greedy 82.2 (0.8) 78.1 (0.6) 99.47 (0.03)
online NN UCB 70.7 (8.6) 67.9 (8.3) 99.13 (0.38)
online NN TS 72.5 (4.2) 68.9 (3.9) 99.19 (0.14)
online NN EI 40.9 (11.7) 38.9 (11.1) 97.49 (0.80)
online NN PI 39.3 (7.1) 37.4 (6.9) 97.46 (0.48)
online MPN Greedy 88.7 (1.3) 84.1 (1.5) 99.62 (0.04)
online MPN UCB 89.5 (0.3) 85.1 (0.6) 99.67 (0.02)
online MPN TS 89.3 (0.3) 84.9 (0.4) 99.65 (0.01)
online MPN EI 80.5 (3.1) 76.8 (2.9) 99.48 (0.07)
online MPN PI 77.8 (2.8) 74.3 (2.7) 99.41 (0.08)
Random 1.3 (0.4) 1.3 (0.3) 87.75 (0.14)

Table S5: Final Bayesian optimization performance on Enamine HTS screening data with a 0.1% batch size as measured by the given metric using the top-1000 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over five runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 55.8 (4.9) 53.4 (4.6) 98.54 (0.24)
retrain RF UCB 36.2 (4.2) 34.9 (4.2) 97.16 (0.42)
retrain RF TS 38.8 (2.5) 37.1 (2.6) 97.43 (0.18)
retrain RF EI 22.6 (2.7) 21.9 (2.6) 95.62 (0.37)
retrain RF PI 27.9 (2.7) 27.1 (2.7) 96.27 (0.37)
retrain NN Greedy 70.5 (1.8) 66.9 (1.6) 99.08 (0.09)
retrain NN UCB 68.0 (0.9) 64.8 (1.2) 99.04 (0.05)
retrain NN TS 68.0 (0.8) 64.8 (0.8) 99.05 (0.03)
retrain NN EI 43.3 (3.8) 41.5 (3.5) 97.74 (0.29)
retrain NN PI 46.3 (2.3) 44.2 (2.2) 97.94 (0.17)
retrain MPN Greedy 82.2 (0.9) 78.1 (0.8) 99.47 (0.03)
retrain MPN UCB 84.0 (0.8) 79.9 (0.7) 99.54 (0.03)
retrain MPN TS 74.8 (1.9) 71.2 (1.8) 99.27 (0.04)
retrain MPN EI 77.0 (1.2) 73.9 (1.2) 99.42 (0.03)
retrain MPN PI 76.1 (0.9) 73.0 (0.7) 99.41 (0.02)
online RF Greedy 41.0 (6.5) 39.5 (6.2) 97.60 (0.49)
online RF UCB 26.2 (6.8) 25.4 (6.6) 95.84 (0.91)
online RF TS 24.1 (1.6) 23.1 (1.5) 95.63 (0.33)
online RF EI 20.2 (5.5) 19.6 (5.2) 94.90 (1.29)
online RF PI 27.1 (5.7) 26.4 (5.6) 96.05 (0.85)
online NN Greedy 65.8 (2.2) 63.2 (2.0) 98.96 (0.09)
online NN UCB 45.8 (8.9) 43.9 (8.6) 97.90 (0.62)
online NN TS 46.1 (9.1) 44.3 (8.8) 97.90 (0.62)
online NN EI 27.4 (9.6) 26.2 (9.4) 96.22 (0.98)
online NN PI 38.8 (3.6) 37.3 (3.3) 97.39 (0.30)
online MPN Greedy 71.9 (3.3) 68.4 (3.5) 99.17 (0.11)
online MPN UCB 75.8 (2.1) 72.4 (1.8) 99.33 (0.05)
online MPN TS 73.3 (1.0) 70.0 (1.2) 99.24 (0.03)
online MPN EI 63.8 (1.4) 61.3 (1.5) 98.95 (0.06)
online MPN PI 64.3 (1.8) 61.9 (1.8) 98.97 (0.09)
Random 0.6 (0.2) 0.5 (0.2) 85.36 (0.11)

Table S6: Final Bayesian optimization performance on AmpC screening data with a 0.4% batch size as measured by the given metric using the top-50000 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over three runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 71.4 (2.1) 71.3 (2.1) 98.79 (0.13)
retrain RF UCB 49.2 (7.7) 49.1 (7.7) 97.47 (0.58)
retrain RF TS 71.7 (1.9) 71.6 (1.9) 98.78 (0.10)
retrain RF EI 29.1 (4.4) 29.1 (4.4) 95.47 (0.59)
retrain RF PI 26.4 (4.7) 26.4 (4.7) 95.03 (0.71)
retrain NN Greedy 74.7 (1.4) 74.6 (1.4) 98.94 (0.08)
retrain NN UCB 68.4 (1.4) 68.3 (1.4) 98.65 (0.07)
retrain NN TS 73.8 (1.2) 73.7 (1.2) 98.92 (0.06)
retrain NN EI 41.8 (1.8) 41.8 (1.8) 96.90 (0.16)
retrain NN PI 43.9 (2.1) 43.9 (2.1) 97.08 (0.17)
retrain MPN Greedy 87.9 (2.3) 87.9 (2.3) 99.56 (0.09)
retrain MPN UCB 94.8 (0.1) 94.7 (0.1) 99.83 (0.00)
retrain MPN TS 87.9 (0.2) 87.9 (0.2) 99.56 (0.01)
retrain MPN EI 75.4 (5.4) 75.4 (5.4) 99.17 (0.24)
retrain MPN PI 78.3 (0.9) 78.2 (0.9) 99.31 (0.04)
online RF Greedy 52.3 (3.6) 52.3 (3.6) 97.70 (0.26)
online RF UCB 35.8 (3.1) 35.7 (3.1) 96.29 (0.32)
online RF TS 50.7 (1.5) 50.7 (1.5) 97.59 (0.11)
online RF EI 27.1 (8.0) 27.0 (8.0) 95.05 (1.12)
online RF PI 24.8 (2.9) 24.8 (2.9) 94.81 (0.51)
online NN Greedy 50.7 (5.2) 50.7 (5.2) 97.59 (0.38)
online NN UCB 30.0 (9.7) 30.0 (9.7) 95.41 (1.16)
online NN TS 31.6 (8.6) 31.6 (8.6) 95.64 (1.11)
online NN EI 36.3 (12.6) 36.2 (12.6) 96.06 (1.38)
online NN PI 21.9 (1.6) 21.9 (1.6) 94.34 (0.30)
online MPN Greedy 73.7 (5.7) 73.7 (5.7) 98.95 (0.29)
online MPN UCB 77.8 (5.8) 77.7 (5.8) 99.26 (0.25)
online MPN TS 74.0 (5.2) 73.9 (5.2) 98.99 (0.25)
online MPN EI 60.9 (2.2) 60.9 (2.2) 98.41 (0.12)
online MPN PI 60.2 (7.8) 60.2 (7.8) 98.32 (0.51)
Random 2.4 (0.1) 2.4 (0.1) 81.03 (0.04)

Table S7: Final Bayesian optimization performance on AmpC screening data with a 0.2% batch size as measured by the given metric using the top-50000 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over three runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 45.5 (1.8) 45.5 (1.8) 97.19 (0.14)
retrain RF UCB 24.4 (2.0) 24.4 (1.9) 94.81 (0.40)
retrain RF TS 40.8 (1.9) 40.8 (1.9) 96.80 (0.17)
retrain RF EI 14.6 (2.7) 14.6 (2.7) 92.44 (0.85)
retrain RF PI 16.0 (1.6) 16.0 (1.6) 92.83 (0.44)
retrain NN Greedy 52.8 (0.5) 52.8 (0.5) 97.72 (0.03)
retrain NN UCB 49.8 (0.5) 49.8 (0.5) 97.52 (0.04)
retrain NN TS 50.1 (1.0) 50.1 (1.0) 97.53 (0.07)
retrain NN EI 24.2 (1.0) 24.2 (1.0) 94.75 (0.17)
retrain NN PI 22.3 (1.1) 22.3 (1.1) 94.42 (0.19)
retrain MPN Greedy 67.1 (2.1) 67.0 (2.1) 98.61 (0.09)
retrain MPN UCB 75.5 (0.7) 75.5 (0.7) 99.16 (0.03)
retrain MPN TS 67.4 (3.3) 67.4 (3.3) 98.68 (0.17)
retrain MPN EI 50.8 (0.3) 50.8 (0.3) 97.72 (0.02)
retrain MPN PI 52.3 (0.6) 52.3 (0.6) 97.84 (0.04)
online RF Greedy 30.4 (2.0) 30.3 (2.0) 95.67 (0.24)
online RF UCB 16.8 (1.8) 16.8 (1.8) 93.23 (0.46)
online RF TS 24.2 (3.3) 24.2 (3.3) 94.75 (0.54)
online RF EI 17.3 (2.3) 17.3 (2.3) 93.03 (0.66)
online RF PI 16.7 (2.3) 16.7 (2.3) 92.96 (0.63)
online NN Greedy 24.9 (6.0) 24.8 (6.0) 94.69 (1.09)
online NN UCB 17.6 (4.2) 17.6 (4.2) 93.29 (1.14)
online NN TS 18.0 (9.8) 18.0 (9.8) 92.83 (2.12)
online NN EI 11.7 (0.4) 11.7 (0.4) 91.48 (0.22)
online NN PI 14.2 (3.0) 14.2 (3.0) 92.26 (0.88)
online MPN Greedy 48.3 (3.0) 48.3 (2.9) 97.41 (0.24)
online MPN UCB 41.7 (6.7) 41.7 (6.7) 96.78 (0.71)
online MPN TS 63.1 (3.1) 63.1 (3.1) 98.46 (0.19)
online MPN EI 42.7 (2.4) 42.7 (2.4) 96.98 (0.23)
online MPN PI 41.9 (0.7) 41.9 (0.7) 96.91 (0.08)
Random 1.2 (<0.1) 1.2 (<0.1) 72.23 (0.10)

Table S8: Final Bayesian optimization performance on AmpC screening data with a 0.1% batch size as measured by the given metric using the top-50000 compounds found. Results are expressed as percentages and reflect the average (standard deviation) over three runs where higher is better.

Training Model Metric Scores (± s.d.) SMILES (± s.d.) Average (± s.d.)
retrain RF Greedy 24.0 (2.2) 24.0 (2.2) 94.76 (0.35)
retrain RF UCB 13.8 (1.0) 13.8 (1.0) 92.01 (0.42)
retrain RF TS 20.1 (2.0) 20.1 (2.0) 94.08 (0.38)
retrain RF EI 9.8 (2.3) 9.8 (2.3) 90.03 (1.03)
retrain RF PI 9.8 (1.3) 9.7 (1.3) 90.44 (0.54)
retrain NN Greedy 33.3 (0.3) 33.2 (0.3) 96.00 (0.04)
retrain NN UCB 31.5 (0.6) 31.5 (0.6) 95.80 (0.07)
retrain NN TS 31.0 (0.8) 31.0 (0.8) 95.73 (0.10)
retrain NN EI 14.5 (0.8) 14.5 (0.8) 92.41 (0.30)
retrain NN PI 15.2 (1.4) 15.2 (1.4) 92.72 (0.43)
retrain MPN Greedy 52.0 (0.5) 52.0 (0.5) 97.68 (0.04)
retrain MPN UCB 47.1 (0.5) 47.1 (0.5) 97.36 (0.05)
retrain MPN TS 49.4 (0.3) 49.4 (0.3) 97.53 (0.03)
retrain MPN EI 30.5 (1.8) 30.5 (1.8) 95.39 (0.29)
retrain MPN PI 31.1 (1.3) 31.1 (1.3) 95.49 (0.17)
online RF Greedy 15.9 (0.4) 15.9 (0.4) 93.01 (0.09)
online RF UCB 10.7 (1.6) 10.7 (1.6) 90.49 (0.96)
online RF TS 11.1 (0.1) 11.0 (0.1) 91.25 (0.10)
online RF EI 9.4 (1.8) 9.4 (1.8) 89.75 (1.22)
online RF PI 8.2 (1.1) 8.2 (1.1) 89.24 (0.59)
online NN Greedy 17.9 (2.7) 17.9 (2.7) 93.28 (0.67)
online NN UCB 10.0 (2.7) 10.0 (2.7) 90.16 (1.78)
online NN TS 16.9 (0.9) 16.9 (0.9) 93.16 (0.24)
online NN EI 7.9 (2.2) 7.8 (2.2) 89.01 (1.39)
online NN PI 8.6 (3.2) 8.6 (3.2) 89.25 (1.87)
online MPN Greedy 36.5 (2.8) 36.5 (2.8) 96.27 (0.33)
online MPN UCB 29.1 (4.3) 29.1 (4.3) 95.02 (0.77)
online MPN TS 45.9 (2.9) 45.8 (2.9) 97.21 (0.26)
online MPN EI 23.3 (1.7) 23.3 (1.7) 94.15 (0.32)
online MPN PI 25.2 (0.8) 25.2 (0.8) 94.50 (0.13)
Random 0.6 (≪0.1) 0.6 (≪0.1) 64.44 (0.05)
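The Scores, SMILES, and Average columns above correspond to the top-k recovery metrics of the main text. A sketch of how such metrics can be computed, assuming (our reading of the definitions, not the MolPAL evaluation code) that Scores matches score values, SMILES matches molecular identities, and Average compares the mean top-k scores:

from collections import Counter
from typing import Dict, Tuple


def topk_metrics(found: Dict[str, float], true: Dict[str, float], k: int) -> Tuple[float, float, float]:
    # found/true map SMILES -> docking score; more negative is better
    top_true = sorted(true.items(), key=lambda kv: kv[1])[:k]
    top_found = sorted(found.items(), key=lambda kv: kv[1])[:k]

    # Scores: fraction of the true top-k score values recovered (multiset match)
    matched = Counter(s for _, s in top_found) & Counter(s for _, s in top_true)
    scores = sum(matched.values()) / k

    # SMILES: fraction of the true top-k molecular identities recovered
    smiles = len({smi for smi, _ in top_found} & {smi for smi, _ in top_true}) / k

    # Average: mean of the k best scores found relative to the true top-k mean
    average = (sum(s for _, s in top_found) / k) / (sum(s for _, s in top_true) / k)

    return scores, smiles, average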
Chemical Space Visualization

Figure S16: Visualization of the chemical space searched in the Enamine HTS library at the given iteration using a greedy acquisition metric, 0.1% batch size, and the specified surrogate model architecture (panels: Random, RF, NN, and MPN). Points represent the 2-D UMAP embedding of the given molecule's 2048-bit atom-pair fingerprint. The embedding was trained on a random 10% subset of the full library. Circled regions indicate clusters of high-scoring compounds in sparse regions of chemical space. X- and y-axes are the first and second components of the 2-D UMAP embedding and range from -7.5 to 17.5. The color scale corresponds to the negative docking score (higher is better).
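A sketch of how such a chemical space map can be generated with RDKit and umap-learn; the toy inputs and UMAP parameters below are illustrative assumptions, not the exact settings used for Figure S16:

import numpy as np
import umap  # umap-learn
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors


def atom_pair_fp(smi: str, n_bits: int = 2048) -> np.ndarray:
    # 2048-bit hashed atom-pair fingerprint, as used for the embedding
    mol = Chem.MolFromSmiles(smi)
    fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


smis = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1", "c1ccncc1"]  # toy stand-ins
X = np.stack([atom_pair_fp(s) for s in smis])
reducer = umap.UMAP(n_components=2, n_neighbors=2)  # small n_neighbors for the toy set
coords = reducer.fit_transform(X)  # 2-D coordinates, one row per molecule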