Computational catalyst discovery: Active classification through myopic multiscale sampling
Kevin Tran,a) Willie Neiswanger,a) Kirby Broderick, Eric Xing, Jeff Schneider, and Zachary W. Ulissib)
Chemical Engineering Department, Carnegie Mellon University, Pittsburgh, PA 15217
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15217
(Dated: 3 February 2021)
The recent boom in computational chemistry has enabled several projects aimed at discovering useful materials or catalysts. We acknowledge and address two recurring issues in the field of computational catalyst discovery. First, calculating macro-scale catalyst properties is not straightforward when using ensembles of atomic-scale calculations (e.g., density functional theory). We attempt to address this issue by creating a multiscale model that estimates bulk catalyst activity using adsorption energy predictions from both density functional theory and machine learning models. The second issue is that many catalyst discovery efforts seek to optimize catalyst properties, but optimization is an inherently exploitative objective that is in tension with the explorative nature of early-stage discovery projects. In other words: why invest so much time finding a "best" catalyst when it is likely to fail for some other, unforeseen problem? We address this issue by relaxing the catalyst discovery goal into a classification problem: "What is the set of catalysts that is worth testing experimentally?" Here we present a catalyst discovery method called Myopic Multiscale Sampling (MMS), which combines multiscale modeling with automated selection of density functional theory calculations. It is an active classification strategy that seeks to classify catalysts as "worth investigating" or "not worth investigating" experimentally. Our results show a roughly sevenfold speedup in catalyst classification when using MMS instead of random sampling.

I. INTRODUCTION
Recent advances in computing hardware and software have led to substantial growth in the field of computational materials science. In particular, databases of high-throughput calculations have increased the amount of information available to researchers. These databases facilitate the development of models that supplement human understanding of physical trends in materials. These models can then be used in experimental discovery efforts by identifying promising subsets of the search space, resulting in increased experimental efficiency.
However, many materials design efforts use material properties and calculation archetypes that are too problem-specific to be tabulated in generalized databases. When such efforts coincide with design spaces too large to search in a feasible amount of time, we need a way to search through the design space efficiently. Sequential learning, sometimes referred to as optimal design of experiments or active learning, can fill this role. Sequential learning is the process of using the currently available data to decide which new data would be most valuable for achieving a particular goal.
In practice, this usually involves fitting a surrogate model to the available data and then pairing the model with an acquisition function that calculates the value of new, potential data points. Then we query the most valuable data points, add them to the data set, and repeat this process. These sequential learning methods have been estimated to accelerate materials discovery efforts by up to a factor of 20.

Sequential learning has numerous sub-types of methods that can and have been used for different goals. One such sub-type is active learning. With many active learning algorithms, the goal is to replace a relatively slow data-querying process with a faster-running surrogate model. Since the surrogate model may be used to query any point, the acquisition functions focus on ensuring that the entire search space is explored. Another sub-type of sequential learning is active optimization. With this sub-type, the goal is to maximize or minimize some objective function. Thus the acquisition functions generally focus on parts of the search space where maxima or minima are more likely to occur. One of the most common types of active optimization is Bayesian optimization. Yet another sub-type of sequential learning is online or on-the-fly learning. The goal for these methods is to accelerate the predictions of streams of data. In the field of computational materials science, this is often applied to predicting trajectories for Density Functional Theory (DFT) or molecular dynamics calculations.

a) These authors contributed equally to this work
b) Electronic mail: [email protected]
In computational materials discovery, we often have the following task: we have a set of available materials X = {x_i}_{i=1}^n, where each material x_i has an associated quantity y_i denoting its value for some application. Examples of common properties for y_i include, but are not limited to, formation energies of materials, catalyst activity, tensile strength, and conductivity. The value y_i is unknown and must be calculated, which can be costly in time, money, or other resources. Further, theoretical calculations of material properties may be inconsistent with experimental results. As per a common aphorism among statisticians: "All models are wrong, but some are useful."

Due to these potential model errors and due to the exploratory nature of materials discovery, we propose re-framing the materials discovery question. Instead of trying to discover materials with optimal y_i values, what if we instead classify materials as having promising or unpromising y_i values? In other words, what if we frame materials discovery efforts as classification problems rather than optimization problems? The estimated classes could then be used to design physical experiments. Mathematically, this is akin to assuming that material i has a binary value y_i ∈ {0, 1}, where 0 denotes "not of interest" and 1 denotes "of interest". The goal is then to determine the values y_i for each x_i ∈ X as cheaply as possible. One can view this as the task of most efficiently learning a classifier that, for each x_i, correctly predicts its value y_i. In this way, materials discovery problems can be framed as problems of active classification. Active classification is the task of choosing an ordering of x_i ∈ X, over which we will iterate and sequentially measure their values y_i, in order to most efficiently (using the fewest measurements) learn a classifier that predicts the correct label for all materials x_i ∈ X.
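To make this framing concrete, the loop below sketches a generic active-classification procedure. The function and argument names (`measure`, `acquire`, `budget`) and the toy thresholding task are our own illustrative choices, not code from this work.

```python
def active_classification(candidates, measure, acquire, budget):
    """Generic active-classification loop: repeatedly query the candidate
    whose label is most valuable to learn, then refit with the new label.
    `measure` is the expensive oracle (e.g., a DFT calculation) and
    `acquire` scores how informative querying a candidate would be."""
    labeled = {}                     # candidate -> measured value y_i
    for _ in range(budget):
        pool = [x for x in candidates if x not in labeled]
        if not pool:
            break
        x = max(pool, key=lambda c: acquire(c, labeled))  # most valuable query
        labeled[x] = measure(x)      # expensive measurement
    return labeled

# toy usage: classify integers as "of interest" (1) if above a threshold;
# the acquisition peaks near the decision boundary at 6.5
truth = {i: int(i > 6) for i in range(10)}
result = active_classification(
    candidates=list(range(10)),
    measure=lambda x: truth[x],
    acquire=lambda c, labeled: -abs(c - 6.5),
    budget=4,
)
```

With this acquisition, the four queries cluster around the decision boundary instead of being spread uniformly, which is the behavior active classification is after.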
Another aspect of computational materials discovery is the ability to turn calculations into recommendations, e.g., how can we convert DFT results into actionable experiments? This conversion is relatively straightforward when properties are directly calculable, which is the case for properties such as the enthalpy of formation. If we perform a single DFT calculation that suggests a single material may be stable, then we can suggest that single material for experimentation. But for many applications, the properties of interest may not be calculable directly. For example, let us say we are interested in finding active catalysts. One way to do this is to use DFT to calculate the adsorption energy between the catalyst and particular reaction intermediates, and then couple the resulting adsorption energy with a Sabatier relationship. But in situ, a catalyst comprises numerous adsorption sites and surfaces. Thus the true activity of a catalyst may be governed by an ensemble of adsorption energies, and therefore may need multiple DFT calculations. How do we address the fact that we need multiple DFT queries to resolve the properties of a single material?

Here we attempt to address both outlined issues: (1) we need an ensemble of DFT queries to calculate a single experimental property of interest, and (2) we need a sequential learning method designed for high-throughput discovery/classification. We overcome both issues by creating the Myopic Multiscale Sampling (MMS) method (Figure 1). MMS addresses the first aforementioned issue by using a multiscale modeling framework for estimating the activity of a catalyst using an ensemble of both DFT and Machine Learning (ML) predicted adsorption energies. MMS then addresses the second issue by combining this multiscale modeling framework with a number of sequential learning methods, including active classification. Note that MMS, as we describe it in this paper, is tailored to discovering active catalysts. Although this method may not be directly transferable to other applications, we hope that others may be able to adapt the principles of the method to their own applications.
II. METHODS

A. Multiscale Modeling
In this paper, we use the discovery of active catalysts as a case study. Catalyst activity is often correlated with the adsorption energy of particular reaction intermediates, as per the volcano relationships stemming from the Sabatier principle. These adsorption energies can be calculated using DFT. Each DFT-calculated adsorption energy is specific to a particular binding site of a particular surface of a particular catalyst. Thus the relationship between DFT-calculated adsorption energies and a catalyst's activity is not simple. For example: in cases of lower adsorbate coverage on the catalyst surface, adsorbates tend to adsorb to stronger-binding sites before weaker-binding sites. In cases of higher adsorbate coverage, adsorption energies are difficult to calculate, so it is not uncommon to assume low adsorbate coverage.
It follows that the activity of a surface could be estimated by using the Sabatier-calculated activity of the strongest-binding site on that surface. Given the activities of the surfaces of a catalyst, the next step is to estimate the activity of the entire catalyst. One way to do this would be to perform a weighted average of the surface activities, where higher weights are given to surfaces that are more stable. For simplicity's sake, we instead propose a uniform average and recognize that future work may involve investigating more sophisticated averaging methods.

Concretely, suppose we have n catalyst candidates {x_i}_{i=1}^n, where each candidate x_i has m surfaces {u_{i,j}}_{j=1}^m, and surface u_{i,j} has ℓ sites {s_{i,j,k}}_{k=1}^ℓ. For a given site s_{i,j,k}, denote its adsorption energy by ΔG(s_{i,j,k}), and for a given surface u_{i,j}, denote its catalytic activity by α(u_{i,j}). Likewise, for a given catalyst material candidate x_i, denote the average catalytic activity for the candidate by α(x_i) = (1/m) Σ_{j=1}^m α(u_{i,j}). Suppose we have a predictive uncertainty estimate for the adsorption energy ΔG(s_{i,j,k}) of a site, represented by a Normal distribution with mean µ_{i,j,k} and variance σ²_{i,j,k}. We can then perform simulation-based uncertainty quantification of catalyst activity by using the multiscale modeling process we described above to propagate uncertainties from sites' adsorption energies.
FIG. 1. Illustration of Myopic Multiscale Sampling (MMS). Given a database of DFT-calculated adsorption energies (a), we train a ML model to predict adsorption energies (b). Then we use those adsorption energies to estimate activities of catalyst surfaces (c), which we then use to estimate the activities of the bulk catalysts (d). Then we choose which catalyst to sample next (e); then we choose which surface on the catalyst to sample (f); then we choose which site on the surface to sample (g); then we perform DFT on that site to add to the database (h). This procedure is repeated continuously with the goal of classifying all catalysts as either "relatively active" or "relatively inactive".

Specifically, for each material candidate x_i, we generate H samples of its catalytic activity, {α̃_i^h}_{h=1}^H, by simulating from the following generative process:

    For j = 1, ..., m and k = 1, ..., ℓ:
        {ΔG̃_{i,j,k}^h}_{h=1}^H ~ iid N(µ_{i,j,k}, σ²_{i,j,k})                                    (1)
    For h = 1, ..., H and j = 1, ..., m:
        α̃_{i,j}^h = exp(M₁ ΔG̃_{i,j,ℓ}^h + B₁)  if ΔG̃_{i,j,ℓ}^h ≥ t*,
        α̃_{i,j}^h = exp(M₂ ΔG̃_{i,j,ℓ}^h + B₂)  otherwise
    For h = 1, ..., H:
        α̃_i^h = (1/m) Σ_{j=1}^m α̃_{i,j}^h

where ΔG̃_{i,j,ℓ}^h denotes the minimum (strongest-binding) sampled adsorption energy among the sites of surface u_{i,j}, t* is the optimal adsorption energy for a given volcano relationship, and M₁, M₂, B₁, and B₂ are the linear coefficients associated with the two sides of the log-scaled volcano relationship of a given chemistry. Figure 2 illustrates how we use our multiscale modeling method to estimate catalyst activity from DFT-calculated adsorption energies, including uncertainty quantification.

Each catalyst material candidate x ∈ X has some true catalytic activity level α(x). Our goal will be to determine the top p-% of catalyst material candidates in terms of their activity levels, which we denote X_p = {x ∈ X : r(α(x)) ≥ ⌊pn⌋}, where r : R⁺ → {1, ..., n} is a function mapping the activity level α(x) to an index denoting its rank (from highest to lowest activity). Given a specified p, if a candidate material is in this set, i.e. x_i ∈ X_p, then we say that its associated binary value y_i = 1, and say y_i = 0 otherwise. In simpler terms: we want to find the top p-% most active catalysts. For this paper, we choose p = 10% arbitrarily. Any catalyst that falls within the top 10% in terms of activity will be labeled as active, and anything below the top 10% will be labeled as inactive.

We can therefore frame our goal as determining the associated binary value y_i for each catalyst material candidate x_i ∈ X = {x_i}_{i=1}^n. Suppose we have formed point estimates for each of the binary values, written {ŷ_i}_{i=1}^n. To assess the quality of this set of estimates with respect to the set of true candidate values, we focus on the F₁ score, a popular metric for classification accuracy, defined as

    F₁ = 2 × (precision × recall) / (precision + recall)                                          (2)
       = 2 Σ_{i=1}^n y_i ŷ_i / (2 Σ_{i=1}^n y_i ŷ_i + Σ_{i=1}^n (1 − y_i) ŷ_i + Σ_{i=1}^n y_i (1 − ŷ_i)).

Given a set of ground-truth values {y_i}_{i=1}^n, we are able to compute the F₁ score for a chosen set of value estimates {ŷ_i}_{i=1}^n. However, in practice, we will typically not have access to these ground-truth values, and thus cannot compute this score in an online procedure.

FIG. 2. Multiscale modeling strategy for estimating the activity of a catalyst. For each adsorption site, we obtain a machine-learned estimate of its adsorption energy along with uncertainty. Then we aggregate the energy distributions for all sites within each surface through a linear function of sites. Next we transform the energy distributions for all surfaces into activities using a Sabatier relationship. Finally we average all the surface activities to obtain an estimate of overall catalyst activity.
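The generative process of Eq. (1) can be sketched with a small Monte Carlo routine. The volcano coefficients and the surface/site layout below are illustrative placeholders, not the paper's fitted values; only the optimum t* = -0.67 eV is taken from the text.

```python
import math
import random

def simulate_catalyst_activity(surfaces, M1=-1.0, B1=0.0, M2=1.0, B2=0.0,
                               t_star=-0.67, H=1000, seed=0):
    """Monte Carlo sketch of Eq. (1): sample every site's adsorption energy
    from its predictive Normal, reduce each surface to its strongest-binding
    (minimum) sampled energy, map that energy through a log-linear volcano,
    and average uniformly over surfaces. `surfaces` is a list of surfaces,
    each a list of (mu, sigma) site predictions."""
    rng = random.Random(seed)
    activity_samples = []
    for _ in range(H):
        surface_activities = []
        for sites in surfaces:
            g = min(rng.gauss(mu, sigma) for mu, sigma in sites)
            if g >= t_star:   # weak-binding leg of the volcano (placeholder slope)
                act = math.exp(M1 * g + B1)
            else:             # strong-binding leg (placeholder slope)
                act = math.exp(M2 * g + B2)
            surface_activities.append(act)
        activity_samples.append(sum(surface_activities) / len(surface_activities))
    return activity_samples

# toy catalyst: one surface with two predicted sites, one with a single site
samples = simulate_catalyst_activity([[(-0.5, 0.1), (-0.3, 0.1)], [(-0.9, 0.2)]])
```

The returned list of H activity samples is exactly what the acquisition functions in the next section consume.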
For use in online ex-periments, we will take advantage of a metric that yieldsan estimate of the change in F score. This metric iscomputable using only our model of the activity of eachcatalyst, without requiring access to ground-truth values { y i } ni =1 , and can be used to assess and compare the con-vergence of our methods. Furthermore, it can be used toprovide an early stopping method for our active proce-dures. We will show experimentally in Section III thatthis metric shows a strong correlation to the F score. B. Sampling Strategy
The goal of MMS is to discover catalysts that are likely to be experimentally active. Optimization of catalytic activity is not the main priority, because we assume that unforeseen experimental issues are likely to obsolete most candidate catalysts. Instead, a greater focus is given to identifying a large number of candidates rather than finding "the most active" candidate. That is why the core sequential learning algorithm we use in MMS is active classification.
To be specific, we use Level Set Estimation (LSE) to identify catalysts for DFT sampling. After identifying catalysts for DFT sampling, we then need to choose which surface of the catalyst to sample; here we use techniques from active regression. Once a surface is chosen, we then attempt to find the strongest-binding site on that surface by using active optimization of the adsorption energies. Thus we combine three different sequential learning strategies across three different length scales to decide which site-based DFT calculation will help us classify active vs. inactive catalysts (Figure 3).

We first describe the initial step of our sampling strategy, which consists of selecting a catalyst material candidate from our candidate set X = {x_i}_{i=1}^n. Note that our high-level goal is binary classification, in that we want to efficiently produce accurate estimates {ŷ_i}_{i=1}^n of the binary value for each material candidate. Based on our definition of y_i = [x_i ∈ X_p], this problem can be equivalently viewed as the task of LSE, in which we aim to efficiently produce an accurate estimate of the superlevel set X_p = {x ∈ X : r(α(x)) ≥ ⌊pn⌋}. There has been a body of work on developing acquisition functions for choosing candidates to query in the task of LSE. In particular, we focus on the probability of incorrect classification acquisition function, defined for an x_i ∈ X as

    φ(x_i) = min(p̂_i, 1 − p̂_i),                                                                  (3)

where p̂_i = Pr(r(α(x_i)) ≥ ⌊pn⌋) ≈ (1/H) Σ_{h=1}^H 1[r(α̃_i^h) ≥ ⌊pn⌋] is the empirical probability that α(x_i) is in the top p-%. Thus to select a subsequent catalyst candidate, we compute φ(x_i) for each x_i ∈ X and return the maximizer x* = argmax_{x_i ∈ X} φ(x_i). In simpler terms: we choose the catalyst that we are most likely to classify incorrectly.
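A minimal sketch of this acquisition rule, assuming we already hold Monte Carlo activity samples for every candidate; the data layout (`activity_samples[i][h]`) is our own convention, not from the paper's code.

```python
def incorrect_classification_acq(activity_samples, p=0.1):
    """Probability-of-incorrect-classification acquisition (Eq. 3 sketch).
    `activity_samples[i][h]` is the h-th Monte Carlo activity sample for
    candidate i. Returns phi_i = min(p_hat_i, 1 - p_hat_i), where p_hat_i
    is the empirical probability that candidate i falls in the top p-%."""
    n = len(activity_samples)
    H = len(activity_samples[0])
    cutoff = max(1, int(p * n))          # how many candidates make the top p-%
    in_top = [0] * n
    for h in range(H):
        draw = [activity_samples[i][h] for i in range(n)]
        top = sorted(range(n), key=lambda i: draw[i], reverse=True)[:cutoff]
        for i in top:
            in_top[i] += 1
    return [min(c / H, 1 - c / H) for c in in_top]

# toy usage: ten candidates with four activity samples each;
# candidate 8 is borderline, candidate 9 is usually (but not always) on top
samples = [[float(i)] * 4 for i in range(10)]
samples[8] = [7.5, 9.5, 7.5, 9.5]
scores = incorrect_classification_acq(samples, p=0.1)
```

Candidates that are clearly in (or clearly out of) the top p-% score near zero, so the maximizer is always a borderline candidate.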
Note how this implies that we do not query catalysts that we are confident are active, which is different from active optimization methods. This provides a more exploratory method rather than an exploitative one, which is appropriate in early-stage computational discoveries and screenings.

The selection of a catalyst candidate x_i depends on its estimated catalytic activity, which we model as an average of the catalytic activities across the surfaces of the candidate, i.e. α(x_i) = (1/m) Σ_{j=1}^m α(u_{i,j}). Though we select a candidate based on its ability to help improve our estimate of the superlevel set X_p, once selected, we then wish to most efficiently improve our estimate of this candidate's catalytic activity. Our goal at this stage is therefore to most efficiently learn the catalytic activities for each surface of that candidate. This can be viewed as an active regression task, where we aim to sample a surface that will most reduce the uncertainty of our surface activity estimates.

FIG. 3. Myopic Multiscale Sampling (MMS) overview. At the highest level, we choose a catalyst to query using level-set estimation; specifically, we use the probability of incorrect classification as our acquisition function. At the middle level, we choose a surface of the catalyst using uncertainty sampling. At the lowest level, we choose a site on the surface using Bayesian optimization to find the lowest-energy site.

To select a surface, we use an
uncertainty sampling for regression acquisition function from the active learning literature, defined as

    φ(u_{i,j}) = Var[α(u_{i,j})]                                                                  (4)
               ≈ (1/(H − 1)) Σ_{h=1}^H (α̃_{i,j}^h − (1/H) Σ_{h'=1}^H α̃_{i,j}^{h'})²,

which selects the surface u* of material candidate x_i that has the greatest variance. In simpler terms: we choose the surface of a catalyst that has the most uncertainty, because we suspect that this choice is most likely to reduce our uncertainty estimate of catalyst activity.

The catalytic activity of a given surface α(u_{i,j}) is a function of the adsorption energies of the sites on this surface, according to the relationship α(u_{i,j}) = exp(−|M ΔG̃_{i,j,ℓ} + B|) from Equation (1), where ΔG̃_{i,j,ℓ} is the adsorption energy of the strongest-binding site among all sites on the surface. Therefore, given a selected surface u_{i,j}, we wish to determine efficiently the site on this surface with minimum adsorption energy. This can be viewed as an optimization task. We therefore use the expected improvement acquisition function from Bayesian optimization, defined as

    φ(s_{i,j,k}) = E[1(ΔG(s_{i,j,k}) ≤ ΔG*) (ΔG* − ΔG(s_{i,j,k}))]                                (5)
                 ≈ (ΔG* − µ̃_{i,j,k}) Φ((ΔG* − µ̃_{i,j,k}) / σ̃_{i,j,k}) + σ̃_{i,j,k} φ((ΔG* − µ̃_{i,j,k}) / σ̃_{i,j,k}),

where µ̃_{i,j,k} = (1/H) Σ_{h=1}^H ΔG̃_{i,j,k}^h is the expected adsorption energy, σ̃_{i,j,k} = sqrt((1/(H − 1)) Σ_{h=1}^H (ΔG̃_{i,j,k}^h − µ̃_{i,j,k})²) is its standard deviation, Φ is the cumulative distribution function (CDF) of a standard normal distribution, φ is the probability density function (PDF) of a standard normal distribution, and ΔG* is the minimum observed adsorption energy. This selects a site s* which is expected to most reduce the site adsorption energy relative to the current minimum observed energy, and allows for efficient estimation of the minimum-energy site on surface u_{i,j}.
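Both lower-level acquisition functions fit in a few lines of standard-library Python; the sketch below implements the closed-form expected improvement for minimization (Eq. 5) and the sample variance of Eq. (4).

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, best):
    """EI for *minimizing* adsorption energy (Eq. 5 sketch):
    EI = (best - mu) * Phi(z) + sigma * phi(z), z = (best - mu) / sigma."""
    if sigma <= 0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)

def surface_variance(activity_samples):
    """Uncertainty-sampling score (Eq. 4 sketch): sample variance of one
    surface's Monte Carlo activity samples."""
    H = len(activity_samples)
    mean = sum(activity_samples) / H
    return sum((a - mean) ** 2 for a in activity_samples) / (H - 1)
```

At each step, the surface with the largest `surface_variance` is chosen, and within it the site with the largest `expected_improvement` is sent to DFT.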
In simpler terms: we choose the site on a surface that is most likely to help us identify the strongest (lowest-energy) binding site on the surface.

C. Active Learning Stopping Criteria
Assessing convergence of an active algorithm is useful for enabling early stopping, which can save resources. Measures of convergence can also provide diagnostics in online use settings. To quantify convergence, we use the predicted change in F₁ score (ΔF̂₁). Intuitively speaking, this rule says to stop an active learning procedure when ΔF̂₁ drops below a predefined threshold ε for k consecutive windows, i.e.,

    Stop if ΔF̂₁ < ε over k windows; continue otherwise.

In our setting, ΔF̂₁ is defined to be

    ΔF̂₁ = 1 − a / (a + b + c),                                                                    (6)

where a is the number of bulks for which the model at iterations i and i + 1 both yield a positive label, b is the number of bulks for which the model at iteration i yields a positive label while at iteration i + 1 yields a negative label, and c is the number of bulks for which the model at iteration i yields a negative label while at iteration i + 1 yields a positive label. Each of a, b, and c is computed over the previous k iterations. This measure provides an estimate of the change in accuracy at each iteration, and it allows us to control how conservatively (or aggressively) we stop early via an interpretable parameter ε. We show results of this measure alongside our F₁ scores in Section III. Prior work recommends using a stop set of unlabeled points over which to calculate ΔF̂₁. Here we use the entire search space of catalysts in lieu of a stop set, because it was non-trivial for us to define a stop set that was representative of the search space.

D. Management of Data Queries
Implementation of MMS also involves the definition of several hyper-parameters. For example, most surrogate models require training data before making predictions to feed the sampling method. This means that we needed to seed MMS with initial training data. We chose to create the initial training data by randomly sampling 1,000 adsorption energies from the search space. We used random sampling for simplicity, and we sampled 1,000 adsorption energies because that was the minimum amount of data on which the Convolution-Fed Gaussian Process (CFGP) (described below in further detail) could train while maintaining numerical stability.

Another consideration for MMS is the batch size and how to handle queries in tandem. Normal sequential learning assumes that we can make one query at a time. But in applications such as ours, it may be possible to make multiple queries in parallel, i.e., we can perform multiple DFT calculations at a time. There are several methods for handling queries in parallel; we chose to use a type of look-ahead sampling. With look-ahead sampling, we began by choosing the first point to sample using the standard acquisition strategy. Then, while that point was still "being queried", we assumed that the first point was queried successfully and set the "observed" value equal to our predicted value. In other words, we pretend that we sampled the first data point and that our prediction of it was perfect. This allowed us to then re-calculate our acquisition values to choose a second point. This process of "looking ahead" one point at a time was continued until a predetermined number of points were selected for querying, i.e., the batch size. Here we chose a batch size of 200 points, because that was roughly the number of DFT calculations that we could perform in a day during our previous high-throughput DFT studies. Note that we did not re-train the surrogate models within each batch of 200 points; we only re-calculated acquisition values between each sample within each batch. We skipped re-training of surrogate models within each batch to reduce the amount of model training time required to perform this study. Although this may have reduced the effectiveness of the look-ahead method, we found the increased algorithm speed to be worthwhile.
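The look-ahead batching described above can be sketched as follows. The function and argument names are our own illustrative choices, and the toy acquisition (distance to already-selected points) is only meant to show that fantasized observations steer later picks.

```python
def select_batch(pool, predict, acquire, batch_size):
    """Look-ahead batch selection sketch: pick the best point, pretend its
    query returned exactly the model's prediction ("fantasize" it), then
    re-score the remaining pool before picking the next point.
    `predict(x)` is the surrogate's point prediction; `acquire(x, fantasies)`
    scores x given the fantasy observations accumulated so far."""
    fantasies = {}
    batch = []
    remaining = list(pool)
    for _ in range(min(batch_size, len(remaining))):
        x = max(remaining, key=lambda c: acquire(c, fantasies))
        fantasies[x] = predict(x)   # hallucinate a perfect observation
        batch.append(x)
        remaining.remove(x)
    return batch

# toy usage: the acquisition rewards distance from fantasy points,
# so the batch spreads out instead of clustering on one region
batch = select_batch(
    pool=list(range(10)),
    predict=lambda x: x,
    acquire=lambda c, f: min((abs(c - k) for k in f), default=c),
    batch_size=3,
)
```

Without the fantasy step, all three picks would land on the same maximizer; with it, the batch diversifies, which is the point of look-ahead sampling.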
E. Estimating Performance through Simulation
We aim to experimentally assess the performance of MMS and compare it with a variety of baseline methods without incurring the high cost of repeated DFT calculations. To do this, we simulate each procedure using a database of pre-determined adsorption energies. Specifically, suppose we have chosen a set of n catalyst material candidates {x_i}_{i=1}^n of interest. For each candidate x_i, we already have all the adsorption energies ΔG(s_{i,j,k}) for the full set of sites across the full set of surfaces on x_i. We can then run our procedures in a relatively fast manner, where we can quickly query the database at each iteration of a given method rather than running DFT. Similar offline-data discovery procedures have been pursued by previous work in optimization and active learning, where expensive evaluations have been collected offline and used for rapid online evaluation.

One notable baseline method is random search, which at each iteration samples sites to carry out DFT calculations uniformly at random from the full set of sites over all catalyst material candidates. We provide simulation results using random search as a benchmark to compare MMS against.
1. Surrogate Models Used
Our objective in this paper is to assess the performance of MMS. The performance of MMS is likely to depend on the surrogate model used to predict adsorption energies from atomic structures. We assume that surrogate models with high predictive accuracy and calibrated uncertainty estimates will outperform models with low accuracy and uncalibrated uncertainty estimates, but we are unsure of the magnitude of this difference. We therefore propose to pair at least two different models with MMS: a "perfect" model and an "ignorant" model.

We define the "perfect" model, hereby referred to as the "prime" model, as a model that returns the true adsorption energy of whatever data point is queried. This perfect prediction ensures a high model accuracy. When asked for a standard deviation in the prediction, the prime model will return a sample from a χ distribution whose mean is 0.1 electron volts (eV). This uncertainty ensures a sharp and calibrated measure of uncertainty. We do not use a standard deviation of zero because (1) it causes numerical issues during multiscale modeling and (2) any model in practice should not be returning standard deviations of zero.

We define the "ignorant" model, hereby referred to as the "null" model, as a model that returns the optimal adsorption energy no matter what is queried. This constant prediction ensures a relatively low model accuracy. When asked for a standard deviation in the prediction, the null model will return 1 eV. This uncertainty ensures a relatively dull and uncalibrated measure of uncertainty.

Lastly, we also choose to use a third, most practical model: CFGP. CFGP is a Gaussian process regressor whose features are the output of the final convolutional layer in a trained graph convolutional neural network. This model is our best current estimate of both an accurate and calibrated model that could be used in practice. Thus we have three models: null, CFGP, and prime, which are intended to give quantitative estimates of the minimal, medial, and maximal performance of MMS, respectively.
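The prime and null models are simple enough to sketch directly. Class and method names are illustrative, not from the paper's code; the only facts taken from the text are the 0.1 eV mean χ-distributed uncertainty, the fixed 1 eV uncertainty, and the -0.67 eV optimum.

```python
import math
import random

class PrimeModel:
    """Sketch of the 'prime' model: returns the true adsorption energy plus
    a chi-distributed standard deviation whose mean is ~0.1 eV."""
    def __init__(self, true_energies, seed=0):
        self.true = true_energies
        self.rng = random.Random(seed)

    def predict(self, site):
        # |N(0, s)| follows a chi distribution (1 dof) with mean s*sqrt(2/pi),
        # so s = 0.1*sqrt(pi/2) gives a mean standard deviation of 0.1 eV
        sigma = abs(self.rng.gauss(0.0, 0.1 * math.sqrt(math.pi / 2)))
        return self.true[site], sigma

class NullModel:
    """Sketch of the 'null' model: always predicts the optimal energy
    (-0.67 eV, per the GASpy data set below) with a fixed 1 eV uncertainty."""
    def predict(self, site):
        return -0.67, 1.0

prime = PrimeModel({'site_a': -0.42})
null = NullModel()
```

Either object exposes the same `predict` interface, so the sampling loop does not need to know which surrogate it is driving.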
2. Search Spaces Used
Previous studies have shown that different materials discovery problems have varying difficulties. Searching for a needle in a haystack is generally more difficult than searching for a leaf on a branch. Thus any simulation we do depends on the search space we use. To obtain a range of potential MMS performances, we perform simulations using two different data sets. Both data sets comprise thousands of atomic structures that represent CO adsorbing onto various catalyst surfaces, as well as corresponding adsorption energies. We then use Sabatier relationships from the literature to transform the adsorption energies into estimates of activity.

We defined our first search space by synthesizing it randomly. We did so by retrieving a database of enumerated adsorption sites from the Generalized Adsorption Simulator for Python (GASpy). These sites composed all the unique sites on all surfaces with Miller indices between -2 and 2 across over 10,000 different bulk crystal structures. We then randomly selected 200 of the bulk crystals along with all of the resulting surfaces and sites, yielding over 390,000 adsorption sites. Then for each bulk crystal, we randomly sampled its "bulk mean adsorption energy" from a unit normal distribution. Then for each surface within each crystal, we randomly sampled its "surface mean adsorption energy" from a normal distribution whose mean was centered at the corresponding bulk mean and whose standard deviation was set to 0.3 eV. Then for each site within each surface, we randomly sampled its adsorption energy from a normal distribution whose mean was centered at the corresponding surface mean and whose standard deviation was set to 0.1 eV. Thus the adsorption energies were correlated within each bulk, and they were also correlated within each surface.

We defined our second search space by retrieving our existing GASpy database of DFT-calculated adsorption energies. These data were collected during our previous studies, and they therefore have bias in the locations at which they were sampled. Specifically, the sites in this database were chosen based on the likelihood that their adsorption energies were close to the optimal value of -0.67 eV.

There are several advantages of using the synthesized data set over the real GASpy data set, and vice versa. The synthesized data set contains pseudo-random adsorption energies that are difficult for CFGP to predict, thereby hindering its performance unfairly. Therefore, we should not and did not use CFGP with the synthesized data set; we used it with the GASpy data set only. On the other hand, the number of surfaces per bulk and the number of sites per surface in the GASpy data set was relatively sparse compared to the synthesized data set. This can result in catalysts that require relatively few site queries to sample fully, which reduces the number of queries necessary to classify a catalyst. This reduction in the number of required queries per catalyst could artificially improve the observed performance of MMS.
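The hierarchical synthetic search space described above can be generated in a few lines. The per-bulk surface and site counts below are illustrative; in the paper they come from the GASpy site enumeration. The three standard deviations (1, 0.3, and 0.1 eV) are taken from the text.

```python
import random

def synthesize_search_space(n_bulks=200, n_surfaces=5, n_sites=10, seed=0):
    """Sketch of the synthesized data set: bulk means ~ N(0, 1);
    surface means ~ N(bulk_mean, 0.3 eV); site energies ~ N(surface_mean,
    0.1 eV). Energies are therefore correlated within each bulk and,
    more tightly, within each surface."""
    rng = random.Random(seed)
    space = {}
    for b in range(n_bulks):
        bulk_mean = rng.gauss(0.0, 1.0)
        surfaces = []
        for _ in range(n_surfaces):
            surf_mean = rng.gauss(bulk_mean, 0.3)
            surfaces.append([rng.gauss(surf_mean, 0.1) for _ in range(n_sites)])
        space[b] = surfaces
    return space

# small example: 3 bulks, each with 5 surfaces of 10 sites
space = synthesize_search_space(n_bulks=3, seed=42)
```

The nesting (bulk → surfaces → sites) mirrors the {x_i}, {u_{i,j}}, {s_{i,j,k}} hierarchy used throughout the multiscale model.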
III. RESULTS
At the beginning of the simulations, the multiscale models made their catalyst class predictions (i.e., active or inactive) using the adsorption energy predictions and uncertainties of the models. As the simulations progressed and adsorption energies were queried, the models' predictions of each queried energy were replaced with the "true" value of the query, and the corresponding uncertainty was collapsed to 0 eV. This was done to mimic a realistic use case, where we would not use model predictions when we had the "real" DFT data instead. It follows that, as the simulations progressed and nearly all points were queried, most models performed similarly because they all had comparable amounts of "true" data to use in the multiscale model.
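The simulation bookkeeping just described can be sketched as follows. This is our own minimal illustration, not the authors' implementation: queried energies override the surrogate's predictions, their uncertainties collapse to 0 eV, and classes are re-predicted from the updated energies. The classification window of 0.2 eV is a hypothetical choice for illustration; only the -0.67 eV optimum comes from the text.

```python
def apply_queries(pred_means, pred_stds, queried_truths):
    """queried_truths maps a site index to its "true" (DFT) adsorption energy."""
    means, stds = list(pred_means), list(pred_stds)
    for site, energy in queried_truths.items():
        means[site] = energy  # trust the DFT result over the model prediction
        stds[site] = 0.0      # no residual uncertainty once a site is queried
    return means, stds

def classify(means, optimal=-0.67, window=0.2):
    """Toy activity rule (window is our assumption): a site counts as "active"
    when its adsorption energy lies near the optimal value."""
    return [abs(m - optimal) <= window for m in means]

means, stds = apply_queries([-0.50, -0.90, -0.30], [0.2, 0.2, 0.2], {1: -0.67})
labels = classify(means)
# means[1] == -0.67, stds[1] == 0.0, labels == [True, True, False]
```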
A. Performance on Synthesized Data
This behavior is seen in Figure 4a, which shows how the F1 score of the multiscale model changes at each point in the simulation of the synthesized data set. Figure 4b shows the corresponding ∆F̂ values. The simulations using random search generally yielded higher ∆F̂ values. This indicates slower convergence, which is consistent with the slower growth of their F1 scores in Figure 4a. The ∆F̂ values for the MMS-prime simulation decreased at around 500 batches, which is the number of batches it took for its F1 score to plateau. Note also that the ∆F̂ values for the MMS-null simulation were often zero. This is because the null model was a "stiff" learner that did not result in any multiscale modeling changes unless a low-coverage adsorption site was found. This shows that slow-learning models may result in relatively low ∆F̂ values, which may necessitate higher κ values to offset this behavior. In other words: worse models may need longer horizons before stopping the discovery to mitigate the chances of missing important information.

These simulations provided us with an estimate of the improvement in active classification that we may get from using MMS. With the synthesized data set, we saw that the MMS-with-null case reached its plateau F1 score within ca. 250 batches (or 50,000 queries). This was over seven times faster than the random-sample-with-null case, which took far longer to achieve a comparable F1 score.

B. Performance on DFT Data
Figure 5 shows the F1 score and ∆F̂ of the multiscale model at each point in the simulation of the GASpy data set. Interestingly, the system performance when using CFGP was similar to the performance when using the null model, both of which were overshadowed by the relatively good performance when using the prime model. This suggests that there is large room for improvement for the CFGP model. Note also how the MMS strategy outperforms random sampling for this data set as well.

These simulations provided us with a second estimate of the improvement in active classification that we may get from using MMS. With the GASpy data set, we saw that the MMS-with-null case reached its plateau F1 score within ca. 80 batches (or 16,000 queries). When using the prime model, both MMS and random search were able to achieve high F1 scores.

C. Recommended diagnostics
FIG. 4. Performance and convergence results for the simulations on the synthesized dataset. a. F1 score and b. ∆F̂ of the multiscale model during simulation of the synthesized data. For clarity of visualization, we plotted the rolling average of ∆F̂ using a window of 40 batches (excluding the MMS-null line, where no averaging was done). RS represents "random search" while MMS represents Myopic Multiscale Sampling.

We note that the F1 scores reported here could only be calculated because we defined the true classes, which is not possible to know during a real discovery process. We therefore need metrics to monitor the behavior of both our discovery algorithm and our surrogate model. We recommend monitoring ∆F̂ as well as the accuracy, calibration, and sharpness (i.e., the magnitude of the predicted uncertainties) of the surrogate model over time. Figure 6 shows an example of such diagnostic metrics over the course of our simulation that used MMS and CFGP on the GASpy dataset.

∆F̂ estimates the amount of overall improvement in the discovery process. Sustained low values of ∆F̂ are a necessary but not sufficient indicator of convergence. To improve our confidence in the predictive strength of ∆F̂, we can test one of its underlying assumptions: that the multiscale model becomes progressively more accurate as it receives more data. This assumption is true when we replace surrogate model predictions with incoming DFT results, but it is not necessarily true for unqueried points. We can estimate the accuracy on unqueried points by calculating the residuals between the surrogate model and the incoming DFT results (Figure 6b). As each "batch" of queries is received, we compare the queried, true adsorption energies with the energies predicted by the surrogate model just before retraining (i.e., the predictions used to choose that batch). Any improvement in accuracy on these points shows that the overall multiscale model is improving over time and that the ∆F̂ metric is an honest indicator of convergence.
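The monitoring scheme described above can be sketched as follows, assuming a Gaussian surrogate model. The helper names and the exact form of the κ-horizon stopping rule are our own; the paper specifies only that sustained low ∆F̂ over a horizon indicates convergence and that diagnostics should use the predictions that chose each batch.

```python
import math

def should_stop(delta_f_history, kappa, tolerance):
    """Stop only after kappa consecutive batches with |predicted change in F1|
    below tolerance; a single low value is not sufficient evidence."""
    recent = delta_f_history[-kappa:]
    return len(recent) == kappa and all(abs(d) < tolerance for d in recent)

def batch_diagnostics(pred_means, pred_stds, truths):
    """Per-batch diagnostics, computed with the predictions used to choose
    this batch (i.e., the surrogate model just before retraining)."""
    n = len(truths)
    residuals = [t - m for t, m in zip(truths, pred_means)]
    mae = sum(abs(r) for r in residuals) / n               # accuracy
    sharpness = sum(pred_stds) / n                         # mean predicted sigma
    # Coarse calibration check: a calibrated Gaussian model should place
    # roughly 68% of residuals within one predicted standard deviation.
    coverage_1sigma = sum(abs(r) <= s for r, s in zip(residuals, pred_stds)) / n
    # Mean Gaussian negative-log-likelihood combines accuracy, calibration,
    # and sharpness into a single metric (lower is better).
    nll = sum(0.5 * math.log(2 * math.pi * s ** 2) + r ** 2 / (2 * s ** 2)
              for r, s in zip(residuals, pred_stds)) / n
    return {"mae": mae, "sharpness": sharpness,
            "coverage_1sigma": coverage_1sigma, "nll": nll}
```

In practice, these values would be logged once per batch of queries and plotted over time, as in Figure 6.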
Figure 6b shows that model accuracy improves within the first ca. 10 batches (or 2,000 adsorption energy queries), but plateaus afterwards. This indicates that, after 10 batches, improvements in overall classification accuracy came from the receipt of additional DFT data rather than from improvements in surrogate model predictions.

Prediction accuracy of adsorption energies is not the only indicator of improved model performance. If a surrogate model's accuracy does not change but its uncertainty predictions decrease/improve, then our confidence in the overall material classification may still improve. Of course, improvements in uncertainty must not be obtained at the expense of worse calibration. In other words, reductions in predicted uncertainties may also indicate improved model performance and better confidence in ∆F̂, but only if the expected calibration error does not increase. In our illustrative example, Figure 6c shows the predicted uncertainty while Figure 6d shows the calibration. Unfortunately, the uncertainty predictions do not decrease over the course of the discovery process.

FIG. 5. Performance and convergence results for the simulations on the GASpy dataset. a. F1 score and b. ∆F̂ of the multiscale model during simulation of the GASpy data. Batch size was 200 queries. RS represents "random search" while MMS represents Myopic Multiscale Sampling.

Note that all uncertainty and calibration estimates for each batch should be calculated using the surrogate model predictions used to choose that batch, just as was done for the residuals. Lastly, we also recommend monitoring the negative-log-likelihood of the surrogate model for each incoming batch. This metric incorporates model accuracy, calibration, and sharpness into a single value; lower values indicate better model performance. Figure 6e shows that this metric improves during the early portion of the discovery process.

IV. CONCLUSIONS
Here we created a multiscale modeling method for combining atomic-scale DFT results with surrogate/ML models to create actionable plans for experimentalists, i.e., a classification of catalysts as "worthy of experimental study" or "not worthy". We then coupled this modeling method with a Myopic Multiscale Sampling (MMS) strategy to perform automated catalyst discovery via active classification. We tested this strategy on two hypothetical datasets using three different surrogate models, giving us an estimate of the range of performance we might see in the future. In some cases, the results show up to a 16-fold reduction in the number of DFT queries compared to random sampling. The degree of speed-up depends on the quality of the ML model used, the homogeneity of the search space, and the hyperparameters used to define convergence of the active classification. Speed-up estimates on more realistic use cases show a more conservative 7-fold reduction in the number of DFT queries. Lastly, we provide a set of recommended diagnostic metrics to use during active classification (Figure 6): ∆F̂ and the ML model's residuals, uncertainty estimates, and calibration.

Our results elucidated a number of qualitative behaviors of active classification. First, we observed that higher-quality ML models yielded better initial performance of the classification process. Conversely, we observed that higher-quality sampling strategies yielded better rates of improvement over time. We also observed that our latest ML model (CFGP) yielded performance closer to a naive, ignorant model than to a perfect, omniscient model. This suggests that there is a relatively large amount of potential improvement left in the ML models.

FIG. 6.
Example of diagnostic plots that we recommend monitoring during an active discovery campaign: a. predicted change in F1 score (∆F̂); b. residuals between the real data and the surrogate model's predictions; c. the predicted uncertainties of the surrogate model in the form of the predicted standard deviation (σ); d. the expected calibration error of the surrogate model; and e. the negative-log-likelihood of the surrogate model. These results were simulated by using the Myopic Multiscale Sampling (MMS) method with the Convolution-Fed Gaussian Process (CFGP) model on the GASpy dataset. For clarity of visualization, we plotted rolling averages of all values in this figure using a window of 100 queries (excluding the ∆F̂ values, where no averaging was done).

We also observed that better sampling strategies yielded lower ∆F̂ values, suggesting that ∆F̂ may be an indicator of sampling strategy performance. Conversely, we observed that slow-learning ML models may also reduce ∆F̂. This phenomenon could be counteracted by using more conservative convergence criteria. All these details were observed in specific and synthetic use cases, though. The behaviors seen here may not be observed in situations where search spaces and/or ML models differ.

We encourage readers to focus on the main goals of this work: (1) converting atomic-scale simulations and ML models into actionable decisions for experimentalists, and (2) relaxing the active discovery process from an optimization/regression problem to a classification problem. The ability to convert computational results into experimental recommendations helps us serve the research community better. Simultaneously, relaxing the discovery process to a classification problem helps us prioritize exploration rather than exploitation, which is more appropriate for early-stage discovery projects.

We also recognize several future directions that may stem from this research.
Future work might include incorporation of DFT-calculated surface stability by performing weighted averaging of surface activities when calculating bulk activities. Future work may also include cost-weighted sampling such that less computationally intensive calculations are chosen more frequently than more intensive ones, which may improve discovery rates in real time. Perhaps most importantly, future work should incorporate some ability to feed experimental data and information back to computational sampling strategies, e.g., via multi-fidelity modeling.

DATA AVAILABILITY
The data that support the findings of this study are openly available at https://github.com/ulissigroup/catalyst-acquisitions/.
ACKNOWLEDGMENTS
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. KGB acknowledges the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Data Science for Knowledge Discovery for Chemical and Materials Research program, under Award DESC0020392. WN was supported by the U.S. Department of Energy Office of Science under Contract No. DE-AC02-76SF00515.