Application of variational policy gradient to atomic-scale materials synthesis
Siyan Liu, Nikolay Borodinov, Lukas Vlcek, Dan Lu, Nouamane Laanait, Rama K. Vasudevan
Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains, a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Siyan Liu
University of Kansas
Nikolay Borodinov
Siemens Corporation
Lukas Vlcek
Joint Institute for Computational Sciences
University of Tennessee, Knoxville
Dan Lu
Computational Sciences and Engineering Division
Oak Ridge National Laboratory
Nouamane Laanait
Anthem, Inc.
Rama K. Vasudevan*
Center for Nanophase Materials Sciences
Oak Ridge National Laboratory
[email protected]

*Send correspondence to this author. Code available at github.com/ramav87/KMC-SVPG. Preprint. Under review.
Abstract
Atomic-scale materials synthesis via layer deposition techniques presents a unique opportunity to control material structures and yield systems that display unique functional properties that cannot be stabilized using traditional bulk synthetic routes. However, the deposition process itself presents a large, multidimensional space that is traditionally optimized via intuition and trial and error, slowing down progress. Here, we present an application of deep reinforcement learning to a simulated materials synthesis problem, utilizing the Stein variational policy gradient (SVPG) approach to train multiple agents to optimize a stochastic policy to yield desired functional properties. Our contributions are (1) a fully open-source simulation environment for layered materials synthesis problems, utilizing a kinetic Monte-Carlo engine and implemented in the OpenAI Gym framework, (2) extension of the Stein variational policy gradient approach to deal with both image and tabular input, and (3) a parallel (synchronous) implementation of SVPG using Horovod, distributing multiple agents across GPUs and individual simulation environments on CPUs. We demonstrate the utility of this approach in optimizing for a material surface characteristic, surface roughness, and explore the strategies used by the agents as compared with a traditional actor-critic (A2C) baseline. Further, we find that SVPG stabilizes the training process over traditional A2C. Such trained agents can be useful to a variety of atomic-scale deposition techniques, including pulsed laser deposition and molecular beam epitaxy, if the implementation challenges are addressed.
Reinforcement learning (RL) in recent years has achieved impressive results in an array of problems in continuous and discrete action spaces, including in games such as Chess, Go, and Atari (Silver et al., 2018; Mnih et al., 2015), as well as in robotics (Devin et al., 2017), such as a recent demonstration of using RL to solve a Rubik's cube puzzle (McAleer et al., 2018). However, despite these considerable successes, applications outside of these 'traditional' domains remain limited, due to prohibitive sample inefficiency as well as a lack of available simulated environments on which agents can be trained for deployment. This is particularly true in the domain of the physical sciences, where, although very good simulations exist for predicting static and dynamic systems ranging from simple molecules to complex proteins, solid-state matter, and polymers, RL has made few inroads. Some notable exceptions include the use of deep RL for molecular design and optimizing chemical reactions (Neil et al., 2018), and a recent report on the use of RL for automating tip conditioning in scanning tunneling microscopy (Krull et al., 2020). As such, there is substantial potential to apply RL in such domains.

In this paper, we show the first application of RL to the case of atomic-level materials synthesis in a simulated environment. We developed a fully Python-based kinetic Monte-Carlo model incorporating both atomic deposition and diffusion elements, and incorporated the environment into the OpenAI Gym framework. Given that the simulations are necessarily expensive and highly stochastic, we utilized a recently developed variational policy gradient approach, the Stein variational policy gradient, to train agents to optimize specific materials descriptors. We extended the existing algorithm by incorporating mixed image and tabular data during the training process, and developed a synchronous parallel implementation via Horovod that can be scaled to thousands of agents, with training occurring on GPUs. We then discuss the performance of the algorithm on the environment, and draw conclusions relevant for domain experts on the strategies employed by the agents. The paper is organized as follows. We begin with an overview of the specific problem from the domain side, including a description of the underlying environment as well as the state and action spaces. Next, we introduce the Stein variational method as presented by Liu et al. (2017), along with a description of the modifications and parameters of the models. We then introduce the results, comparing the SVPG approach to a traditional actor-critic algorithm, and explore the robustness of the learned policies. Finally, we conclude with a discussion on the outlook of RL for materials synthesis, and core physical sciences more generally, in light of these findings.
The challenge we explore here is one of materials synthesis, specifically that of thin films grown using atomic layer deposition approaches such as molecular beam epitaxy or pulsed laser deposition. These approaches have been pivotal in the past three decades in advancing our understanding of materials' structure and function, given that thin films can be engineered to possess a variety of properties unavailable through traditional bulk routes, and can be generated essentially free of extended defects and with exquisite control over levels of strain, defect density, and so forth (Christen and Eres, 2008). Despite the proliferation of these deposition systems, control over the deposition process during the deposition itself remains limited for the most part. This is because of a limited ability to interpret the available signals (typically, surface diffraction images), as well as the stochastic nature of film growth. As such, any interventions during a film deposition are typically confined to stopping the flux of incoming atomic species for some time (to enable annealing) (Koster et al., 1999), or more basic operations such as changing the targets to enable growth of films with different compositional layers. Current methods are thus limited to quasi-static deposition conditions, which are not varied through the deposition process, and are time and labor intensive. Enabling automated synthesis with trained artificial agents would lead not only to accelerated materials discovery, but also, potentially, to new states of matter that could be stabilized through unique policies that would be difficult, if not impossible, to discover through human trial and error.

To explore this, we first created a simulated environment in which agents could be trained to optimize for a particular materials descriptor. The simulation utilizes a kinetic Monte-Carlo (kMC) engine and is loosely based on the simulation described in Tan et al. (2005). The kMC simulation is a discrete lattice-based simulation that takes an input (starting) atomic configuration as well as rate parameters for distinct events that can occur, and then proceeds to sample from the events based on their probabilities, incrementing the simulation time in the process. More specifically, we outline five distinct events that can occur: (1) deposition of atomic species of type A, (2) deposition of atomic species of type B, (3) diffusion of an atom into a neighborhood of similar atom types, (4) diffusion of an atom into a neighborhood of different atom types, and (5) diffusion of an atom into a neighborhood with mixed atom types.

Figure 1: Simulation environment for materials synthesis. (a) Example of film growth progression, with atoms of two elements colored in red and blue. (b) Surface projections at different time steps from the simulation in (a). (c) Action space for agents: increase or decrease deposition rates, and increase or decrease temperature. (d) Film growth simulation results for fixed deposition rates, with diffusion rates given by the labels on the x-axis. The results show a distribution of roughness values corresponding to ten simulations run under identical conditions.

We consider that atoms can be deposited on vacant sites on the surface, and further, that diffusion of atomic species also requires vacancies. A list is constructed of the possible events that can occur, and the number of possible events of each type is multiplied by the corresponding rate constant.
A random number is chosen on (0,1), which determines the type of event from the constructed list, and a random atom (or atom site) is chosen to undergo the chosen event. The simulation clock is incremented as

\Delta t = \Big[\sum_{i=1}^{n} v_i\Big]^{-1} \big(-\ln(R)\big) \qquad (1)

where v_i is the rate of event i, and the ln(R) term, with R a random number on the interval (0,1), is added for mathematical completeness. There are two main points to take from this treatment. The first is that kMC, because its parameters are rates, is a 'real-time' predictive simulation as opposed to a 'time-step' type simulation; i.e., the actual times output by the simulation should be comparable to experiment. The second is that the probability of an event occurring is due not only to the rate constant itself, but also to the number of possible events of that type that can actually occur. For example, if there are no vacant sites for diffusion, then regardless of the input diffusion rate, no diffusion will be possible. Conversely, events with low rates, but with large numbers of possible sites which can undergo them, can be extremely frequent.

We implemented the simulation in pure Python, assuming a face-centered cubic lattice crystal structure and a simulation box of size (16, …), where the dimensions correspond to (x, y, z) in real space. An example of the output of the simulation is shown in Figure 1(a), with the state of the film at 2 seconds and 10 seconds shown. Red and blue spheres correspond to distinct atomic species, called here for simplicity types A and B, but these could be Ni, Co, Cu, etc. Note that the simulation begins with a single layer of atoms of a single type, as can be seen in red, and subsequently the film growth process is initiated, enabling both atomic deposition and diffusion events to occur. This results in film growth and roughening of the surface.

In this case, 'same' refers to the neighborhood (nearest neighbors) being more than a set fraction of the same type, 'different' refers to nearest neighbors being less than that fraction of the same type, and 'mixed' refers to situations in between. The diffusion rates depend linearly on temperature, D_i = m_i T + c_i, with the coefficients given in Table 1. At high temperatures, the diffusion rate is higher for atoms moving to different neighborhoods, whereas at lower temperatures the diffusion rate into similar environments is highest and the rate into different environments is lowest. At 560 K there is a crossover where all three diffusion curves meet, i.e., the rates are the same regardless of the environment the atom is hopping into.

Table 1: Diffusion rate dependence on temperature (D_i = m_i T + c_i)

  i    m_i    c_i
  1    …      …
  2    …      …
  3    …      …

In general, a full 3D picture of the growth process will not be available in any real experiment; rather, the information will be limited to a surface projection if intermittent microscopy is performed (Voigtländer and Zinner, 1993), while dynamic information available during the growth process is limited to surface diffraction, typically in reflection geometry. Whilst this can include some information on sub-surface layers, the inversion of surface diffraction from reflection high-energy electron diffraction imaging is highly non-trivial, and even the forward simulations for arbitrary surface structures remain a vexing computational problem (Peng et al., 1996) that will not be explored here.
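To make the event-selection and clock-update procedure concrete, below is a minimal Python/NumPy sketch. The flat list of event classes with pre-counted possible sites, and the function name kmc_step, are our own simplifications; the actual environment tracks per-site event lists for the five event classes.

```python
import numpy as np

rng = np.random.default_rng()

def kmc_step(event_counts, rates, t):
    """One kinetic Monte-Carlo step: pick an event class with probability
    proportional to (number of possible events * rate constant), then
    advance the simulation clock according to Eq. (1).

    event_counts: possible events of each type (deposit A, deposit B,
                  and the three diffusion classes)
    rates:        rate constant for each event type
    t:            current simulation time
    """
    propensities = np.asarray(event_counts, dtype=float) * np.asarray(rates)
    total = propensities.sum()
    if total == 0.0:
        # No event is possible (e.g., no vacant sites for diffusion).
        return None, t
    # Select the event class from the cumulative propensity list.
    cum = np.cumsum(propensities) / total
    event = min(int(np.searchsorted(cum, rng.random())), len(rates) - 1)
    # Advance the clock: dt = (sum_i v_i)^(-1) * (-ln R).
    dt = -np.log(rng.random()) / total
    return event, t + dt
```

Note that the waiting time dt depends only on the total propensity, which is why events with low rates but many available sites can still dominate the dynamics.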
Rather, we assume that the surface projection (a 2D image) is the state available to the agent, in addition to tabular data on the deposition rates, temperature, and surface fraction of atomic species of type B on the surface. The latter can be derived from, e.g., x-ray photoelectron spectroscopy measurements, Auger electron spectroscopy, or other forms of imaging. Thus, the state variable is S = [S_s, M_t], where S_s is the 2D image of size (16, …) and M_t is a vector of values of length 4. An example of the progression of states during the simulation in Figure 1(a) is seen in Figure 1(b).

The action space A (Figure 1(c)) of the simulation is continuous and consists of three distinct actions corresponding to the controls available: (1) the rate of deposition of atomic species A, (2) the rate of deposition of atomic species B, and (3) the temperature T. For the case of deposition, the rate can be changed by up to ±0.25 at every intervention step, whilst the temperature can be altered by up to ±500 K. The deposition rates are clipped to a fixed positive interval, and the temperature is limited to a window beginning at 300 K. The simulation is initialized at t = 0 s and then run to a fixed final time t_f, with six sequential actions occurring per episode of the simulation. In terms of computational time, one complete simulation episode takes on the order of 20 s on a typical CPU. This varies substantially, though, because the length of the simulation depends on the particular simulation parameters chosen, and thus could be anywhere from 10-40 s per episode.

A variety of reward functions can be considered, corresponding to specific materials descriptors that are desired for targeted functional properties. These include specific structural features such as the presence of 3D mounding, step-type growth, surface roughness, and so forth, as well as chemical features such as the level of surface segregation of specific atomic species, the tendency towards formation of atomic clusters, etc. Here, we consider only the surface roughness as the material descriptor, but replacing it with other, more complex descriptors is straightforward. The surface roughness R_film is measured as a root-mean-squared (RMS) value, effectively the difference between the individual height at an (x, y) position and the average height, normalized to the size of the surface image. We consider a reward function R(·) that takes as input the surface roughness and a desired target roughness, and outputs a scalar value. In this work, we consider a simple Gaussian centered around the target roughness, given at every intervention step in the simulation, but multiplied by 5 for the final state to force additional importance on the final state of the film:

R(R_\mathrm{film}, \mathrm{target}) = \begin{cases} e^{-(R_\mathrm{film} - \mathrm{target})^2/\sigma^2} - 1 & t < t_f \\ 5\big(e^{-(R_\mathrm{film} - \mathrm{target})^2/\sigma^2} - 1\big) & t = t_f \end{cases}

where target is the target roughness, σ is a fixed width, and the simulation runs from t = 0 to t = t_f. This reward function has the effect of providing a -1 reward for any roughness substantially distant from the target roughness (-5 for the last step), on the order of 10% or greater from the target value.

The simulation is highly stochastic. To illustrate this, we show in Figure 1(d) the roughness of films grown under identical deposition rates but with the three diffusion rates set equal to a single value, ranging from 0.1 to 0.9, plotting the resulting distribution of roughness values for 10 individual simulation runs at each diffusion rate.
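The reward above takes only a few lines of Python. In this sketch the width sigma = 0.1 is an illustrative assumption (the paper's exact value is not recoverable from this copy), and the function name is ours:

```python
import numpy as np

def roughness_reward(r_film, target, sigma=0.1, final_step=False):
    """Gaussian-shaped reward centered on the target roughness.

    Returns a value near 0 when r_film is close to target and near -1
    when it is far away; the final intervention step is weighted by a
    factor of 5. sigma here is illustrative, not the paper's value.
    """
    r = np.exp(-((r_film - target) ** 2) / sigma ** 2) - 1.0
    return 5.0 * r if final_step else r
```

For example, with target = 0.75 and sigma = 0.1, a film at roughness 0.75 earns reward 0, while one at roughness 1.0 earns approximately -1.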
This stochastic nature presents an inherent challenge for any agent in understanding and perturbing the system dynamics to realize desired material states. With the above state and action spaces and reward signal, the problem of material synthesis optimization can now be formulated in standard reinforcement learning terminology.

Reinforcement learning offers a framework in which the aforementioned problem can be tackled. Briefly, RL concerns learning a policy π to maximize cumulative rewards in a dynamic environment through repeated interactions. This can be expressed as an optimization which seeks to maximize a utility function, the expected return of a policy π:

J(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big] \qquad (2)

where γ is a discount factor, actions a_t are drawn from the policy π(a_t|s_t), and the states s_{t+1} are drawn from the dynamic environment, conditional on the action a_t and the previous state s_t. We will assume here that the environment dynamics are unknown (model-free RL). By the policy gradient theorem, we may approximate the gradient of the utility function with respect to the parameters θ that parameterize the policy via

\nabla_\theta J(\pi) \approx \sum_{t=0}^{\infty} \nabla_\theta \log \pi(a_t|s_t; \theta)\, R_t \qquad (3)

where R_t = \sum_{i=0}^{\infty} \gamma^i r(s_{t+i}, a_{t+i}) is the cumulative return from time t; this comprises the REINFORCE algorithm (Williams, 1992). However, this method suffers from high variance, and convergence can be accelerated by subtracting a suitable baseline. The advantage actor-critic (A2C) algorithm (Sutton et al., 1998) utilizes the value function baseline,

\nabla_\theta J(\pi) \approx \sum_{t=0}^{\infty} \nabla_\theta \log \pi(a_t|s_t; \theta)\, \big(R_t - V^\pi(s_t)\big) \qquad (4)

where the value function V^\pi(s_t) = \mathbb{E}\big[\sum_{i=0}^{\infty} \gamma^i r(s_{t+i}, a_{t+i})\big] gives the expected return for the agent from the state s_t under the current policy π. Typically, both the policy and the baseline (value function or action-value function) are approximated using neural networks.

We seek a reinforcement learning algorithm that is both robust to different initializations and sample efficient. Since each episode takes between 10 and 40 seconds, parallel methods of obtaining information about the environment are required. Recently, Liu et al. (2017) proposed a variational inference method termed the Stein variational policy gradient (SVPG), in which a set of policy particles {θ_i} is perturbed to achieve a balance between exploration (repulsion between policies) and exploitation (ascending the utility function). Specifically, the update rule they derived is

\Delta\theta_i \leftarrow \frac{1}{n} \sum_{j=1}^{n} \Big[\nabla_{\theta_j}\Big(\frac{1}{\alpha} J(\theta_j) + \log q(\theta_j)\Big) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i)\Big] \qquad (5)

where α is a temperature parameter, q is a prior over the policy parameters, and k(·,·) is a positive-definite kernel.

Figure 2: (a) Results, with average returns from the last 50 episodes of training for the SVPG and A2C agents. Mean rewards for the top four agents (b) and the next four agents (c) for both methods as a function of training. Note that smoothing has been used with a convolutional filter of size 60, and shaded regions are half a standard deviation. (d-f) Test results from 50 runs of the best-performing agents, compared with results from a random agent. The reward function is shown on the right.

It was further shown that, for simple continuous control problems, SVPG outperformed A2C in terms of time to solution (sample efficiency), and provided more state exploration than independent A2C actors.
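As a concrete illustration of Eq. (5), the following NumPy sketch computes the particle update for a set of flattened policy parameters, using an RBF kernel with the median-heuristic bandwidth as in Liu et al. (2017). The function names and array layout are our own; in practice the gradients come from the per-agent A2C actor losses.

```python
import numpy as np

def rbf_kernel(theta):
    """Pairwise RBF kernel k(theta_j, theta_i) and its gradient w.r.t. theta_j."""
    n = theta.shape[0]
    diffs = theta[:, None, :] - theta[None, :, :]        # (n, n, d): theta_j - theta_i
    sq_dists = np.sum(diffs ** 2, axis=-1)               # (n, n)
    h = np.median(sq_dists) / np.log(n + 1) + 1e-8       # median-heuristic bandwidth
    K = np.exp(-sq_dists / h)                            # (n, n)
    grad_K = -2.0 / h * diffs * K[:, :, None]            # (n, n, d): d k / d theta_j
    return K, grad_K

def svpg_update(theta, grad_logp):
    """One Stein variational step, Eq. (5).

    theta:     (n, d) flattened actor parameters, one row per policy particle
    grad_logp: (n, d) per-agent gradients of (1/alpha) * J(theta_j) + log q(theta_j)
    Returns the (n, d) update direction for each particle.
    """
    n = theta.shape[0]
    K, grad_K = rbf_kernel(theta)
    # Kernel-weighted exploitation term plus the repulsive kernel-gradient term.
    return (K.T @ grad_logp + grad_K.sum(axis=0)) / n
```

In our synchronous implementation, the theta and grad_logp matrices would be assembled from the per-agent weights and policy-gradient losses gathered across workers, as described below.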
Given that this method satisfies the requirements of sample efficiency and (empirically observed) robustness, we chose this particular algorithm for our purpose.

To apply SVPG to our problem, we first had to extend the existing SVPG implementation to allow for more complex network architectures. Our architecture consists of two separate inputs corresponding to the distinct data types present: the 2D image is fed through convolutional layers, while the vector of tabular data on rates, temperature, and atomic fraction is fed into fully connected layers. These two branches are concatenated and then fed through two more fully connected layers. Since the model is an actor-critic model, the agent consists of both an actor model and a separate critic model. The structure of both is identical in our setup, with the exception of the output layer; the full network architecture is included in the supplemental material. For the actor model, the output layer is of size 6, corresponding to predictions of the mean μ and standard deviation σ of a diagonal Gaussian policy from which the three actions are sampled. Linear activations are used on the last layer for both models. For the critic model, the output is a single scalar value, corresponding to the estimate of the value function for that state. Note that for the actor model, the standard deviation is ensured to be positive by passing the output through a softplus function, i.e., if the raw output is σ′, then σ = log(exp(σ′) + 1). Actions sampled from the diagonal Gaussian policy are clipped to lie in the interval [-10, 10] and linearly scaled to reflect changes to the deposition rates (which lie between [-0.25, 0.25]) and temperature ([-500 K, +500 K]). The new deposition rates and diffusion rates are calculated based on the sampled action, clipped according to the minimum and maximum allowable values, and then the simulation is run to the next intervention step.

As in the original paper by Liu et al. (2017), we did not apply the Stein update to the critic, only to the actor models. We compared the policies trained with pure A2C and SVPG, and benchmarked against a random agent. Finally, we developed a synchronous parallel version of SVPG that can scale to multiple nodes on high-performance computing systems, via the Horovod package (Sergeev and Del Balso, 2018). We situated multiple agents on GPUs and all simulations on CPU processes. After every 20 episodes, an allgather is performed to obtain the policy gradient loss and actor network weights from each agent. The Stein update is then calculated based on the gathered gradients and weights, and the individual policies are updated in the direction of the Stein gradient. We trained for 400 episodes in total for both the standard A2C and SVPG methods.

Figure 3: (a) An example of a simulation run, with the surface projections of the film at different time steps shown above, and the roughness, rewards, deposition rates, and temperatures chosen by the SVPG agent below. (b) Policy outputs for randomly selected states passed through the actor networks for both A2C (dashed lines) and SVPG (solid lines) agents. Plotted are normal distributions with parameters μ and σ given by the respective networks. Overlap of the distributions suggests that the same action is preferred by both agents; larger variance indicates somewhat more exploration by the agent. The output for temperature actions is shown in (c) in the same manner.
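A minimal sketch of the two-branch actor described above is given below, written with tf.keras for concreteness. The framework choice, layer widths, and filter counts are illustrative assumptions (the exact architecture is in the supplemental material), and the image size assumes a (16, 16) surface projection:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_actor(image_shape=(16, 16, 1), n_tabular=4, n_actions=3):
    # Branch 1: the 2D surface projection through convolutional layers.
    img_in = layers.Input(shape=image_shape)
    x = layers.Conv2D(16, 3, activation="relu")(img_in)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.Flatten()(x)

    # Branch 2: tabular data (deposition rates, temperature, fraction of B).
    tab_in = layers.Input(shape=(n_tabular,))
    y = layers.Dense(32, activation="relu")(tab_in)

    # Concatenate the branches and pass through two fully connected layers.
    z = layers.Concatenate()([x, y])
    z = layers.Dense(64, activation="relu")(z)
    z = layers.Dense(64, activation="relu")(z)

    # Linear output of size 6: means and raw (pre-softplus) std deviations.
    out = layers.Dense(2 * n_actions, activation="linear")(z)
    return Model(inputs=[img_in, tab_in], outputs=out)

def sample_action(model, image, tabular):
    """Sample the three actions from the diagonal Gaussian policy."""
    raw = model([image[None, ..., None], tabular[None, :]]).numpy()[0]
    mu, sigma_raw = raw[:3], raw[3:]
    sigma = np.log(np.exp(sigma_raw) + 1.0)  # softplus keeps sigma positive
    # Clip to [-10, 10]; the environment then linearly rescales to the
    # allowed per-step changes in deposition rates and temperature.
    return np.clip(np.random.normal(mu, sigma), -10.0, 10.0)
```

The critic follows the same two-branch structure with a single scalar output in place of the six-unit head.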
We utilized the Adam optimizer for the Stein updates, and the temperature parameter was α = 5.0. For both methods, the Adam optimizer was also used for the actor and critic updates, with a learning rate of 0.005 for the actor and 0.01 for the critic. Finally, the same discount factor γ was used in all cases. All training was done on a single NVIDIA DGX server with 4 Tesla V100 GPUs. A single run of training took approximately 2 hours. Note that this can be reduced by an order of magnitude if we parallelize the simulations run for each batch; this work is ongoing.

We compare the returns of all agents trained with both methods in Figure 2(a), sorted in descending order. Here, the return is taken as the average of the last 50 episodes during training. There are two important points to note: first, the best A2C agent outperforms the best SVPG agent. Second, the SVPG-trained agents show much less variance in their returns than the A2C agents. Even though the average return is lower, the SVPG returns are uniformly higher than the corresponding A2C returns from the third-best agent onward. This is also well reflected in the training curves, seen in Figure 2(b,c) for the top four agents and the next four agents, respectively. The SVPG training curve in both plots increases steadily, whereas the opposite is true for the A2C agents in Figure 2(c). The mean of all agent returns is shown as dashed lines in both plots. Note that smoothing has been applied with a convolutional filter of size 60.

Next, we tested the agents and plotted the corresponding film roughness as a function of time for 50 different simulations. These results are shown in Figure 2(d-f), with the reward function seen on the right. Each agent aims to keep the film roughness as close as possible to the dashed blue line drawn. The best-performing A2C agent was compared with the best-performing SVPG agent, and finally with the performance of a random actor. The best-performing A2C actor effectively squashes the distribution to be tight around the target roughness, although more so on the lower side, even though the reward function is symmetric. On the other hand, the SVPG agent appears slightly more varied, with a larger density on the upper side, as well as several 'runaways' where the roughness increased well away from the target. This still compares favorably to the actions of the random agent. The mean roughness value for the A2C agent was 0.78 over all steps for all runs, compared with 0.89 for the SVPG agent and 0.95 for the random agent. These results clearly highlight that the agents, whether trained through SVPG or A2C, are developing strategies to control the material characteristic specified.

Of importance are the specific strategies being used by the agents to reduce the surface roughness. In general, film growth will lead to increased roughness over time due to the random nature of the surface deposition and diffusion events, combined with limited time for diffusion in the presence of energy barriers (e.g., consider mounding instabilities (Pierre-Louis et al., 1999; Stroscio et al., 1995; Lengel et al., 1999)). One option is to enable the surface to reconstruct after deposition for some time, in a technique called pulsed laser interval deposition (Koster et al., 1999). Although this particular action is not available to the agent, it would be equivalent to dramatically lowering the deposition rates and increasing the temperature to encourage more surface diffusion.
Observing an example in Figure 3(a), the SVPG agent does indeed reduce the deposition rate for atomic species A to the minimum, but surprisingly maintains or slightly increases the other deposition rate throughout the process. Meanwhile, the temperature is steadily increased. This is a somewhat unusual strategy from the point of view of domain expertise, but apparently led to a large reward. At higher temperatures, the diffusion of atoms into mixed neighborhoods is enhanced compared to diffusion into similar atom neighborhoods; thus, it could be beneficial to increase the deposition rate slightly to 'correct' for forming clusters that can increase roughness.

Since the policies themselves are diagonal multivariate Gaussian policies, one method of inspecting the degree of exploration is simply to observe the mean and variance outputs for different states. Shown in Figure 3(b,c) are outputs for both SVPG and A2C agents for randomly selected states taken from the test runs. Solid lines are for the action output of the best-performing SVPG agent, while dashed lines are for the best-performing A2C agent. Interestingly, both policies appear to have similar actions for the first deposition rate, shown in red. However, while the SVPG agent has a slightly broader distribution of actions for the second deposition rate, the A2C agent has a sharp delta function to reduce the deposition rate in all the states considered. The variances for the temperature action are similar for both policies, but again the SVPG agent is more conservative than the A2C agent, likely reflecting the combined experience from multiple agents that tempers the gradient update.
The results presented in this work suggest that reinforcement learning agents could be utilized, in principle, to discover novel strategies for atomic-level materials synthesis beyond the existing routines. The ability to deploy agents that are robust, sample efficient, and adaptable to real-world settings remains a challenge as reinforcement learning shifts to these new domains, where data is inherently expensive to acquire. Numerous challenges must be addressed to make this a reality. First, the relatively simple environment simulated here must be extended to allow for enumerating many more types of events, and their rates should be fit based on first-principles simulations, experiments, or a combination of the two, so that the environment can faithfully represent the true film deposition situation. This will present scalability problems as simulation times run longer and system sizes increase. Next, the state function will need to be modified via inversion of the surface diffraction images, or at least a rapid forward model for arbitrary surface structures, which remains difficult (Peng et al., 1996). Future work should therefore bring the environment closer to the realistic setting, as well as incorporate model-based reinforcement learning within the SVPG approach, such that agents are able to learn rapidly during individual synthesis runs. As reinforcement learning steadily develops, and lab automation in materials science gains prominence, these methods can be expected to lead to new ways of synthesizing matter.
Acknowledgments and Disclosure of Funding
This research was funded by the AI Initiative, as part of the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy (DOE). The core of the simulation was supported by the U.S. Department of Energy (DOE), Office of Science, Basic Energy Sciences (BES), Materials Sciences and Engineering Division (LV). A portion of this work was performed at the Oak Ridge National Laboratory's Center for Nanophase Materials Sciences (CNMS), a U.S. DOE Office of Science User Facility.

References
Christen, H. M. and Eres, G. (2008). Recent advances in pulsed-laser deposition of complex oxides. Journal of Physics: Condensed Matter, 20(26):264005.

Devin, C., Gupta, A., Darrell, T., Abbeel, P., and Levine, S. (2017). Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2169-2176. IEEE.

Koster, G., Rijnders, G. J., Blank, D. H., and Rogalla, H. (1999). Imposed layer-by-layer growth by pulsed laser interval deposition. Applied Physics Letters, 74(24):3729-3731.

Krull, A., Hirsch, P., Rother, C., Schiffrin, A., and Krull, C. (2020). Artificial-intelligence-driven scanning probe microscopy. Communications Physics, 3(1):1-8.

Lengel, G., Phaneuf, R., Williams, E., Sarma, S. D., Beard, W., and Johnson, F. (1999). Nonuniversality in mound formation during semiconductor growth. Physical Review B, 60(12):R8469.

Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. (2017). Stein variational policy gradient. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).

McAleer, S., Agostinelli, F., Shmakov, A., and Baldi, P. (2018). Solving the Rubik's cube without human knowledge. arXiv preprint arXiv:1805.07470.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

Neil, D., Segler, M., Guasch, L., Ahmed, M., Plumbley, D., Sellwood, M., and Brown, N. (2018). Exploring deep recurrent models with reinforcement learning for molecule design. ICLR 2018.

Peng, L.-M., Dudarev, S., and Whelan, M. (1996). Approximate methods in dynamical RHEED calculations. Acta Crystallographica Section A: Foundations of Crystallography, 52(6):909-922.

Pierre-Louis, O., d'Orsogna, M., and Einstein, T. (1999). Edge diffusion during growth: The kink Ehrlich-Schwoebel effect and resulting instabilities. Physical Review Letters, 82(18):3661.

Sergeev, A. and Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140-1144.

Stroscio, J. A., Pierce, D. T., Stiles, M. D., Zangwill, A., and Sander, L. (1995). Coarsening of unstable surface features during Fe(001) homoepitaxy. Physical Review Letters, 75(23):4246.

Sutton, R. S., Barto, A. G., et al. (1998). Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge.

Tan, X., Zhou, Y., and Zheng, X. (2005). Pulsed-laser deposition of polycrystalline Ni films: A three-dimensional kinetic Monte Carlo simulation. Surface Science, 588(1-3):175-183.

Voigtländer, B. and Zinner, A. (1993). Simultaneous molecular beam epitaxy growth and scanning tunneling microscopy imaging during Ge/Si epitaxy. Applied Physics Letters, 63(22):3055-3057.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256.